Pan Zhang
Papers: 27
Authored papers
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
CVPR 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
arXiv 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
arXiv 2025
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
arXiv 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
CVPR 2025
RelightVid: Temporal-Consistent Diffusion Model for Video Relighting
arXiv 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
arXiv 2025
MM-IFEngine: Towards Multimodal Instruction Following
arXiv 2025
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
arXiv 2025
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
arXiv 2025
Long-CLIP: Unlocking the Long-Text Capability of CLIP
arXiv 2024
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
ICCV 2025
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
arXiv 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
arXiv 2024
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
arXiv 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
arXiv 2024
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
arXiv 2024
SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition
arXiv 2024
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
arXiv 2024
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
arXiv 2024
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
arXiv 2023
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
arXiv 2023
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
CVPR 2024
FreeDrag: Feature Dragging for Reliable Point-based Image Editing
CVPR 2024
VIGC: Visual Instruction Generation and Correction
arXiv 2023
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
CVPR 2024
Old Photo Restoration via Deep Latent Space Translation
arXiv 2020