Chaoyou Fu
- Papers
- 24
Authored papers
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
arXiv 2026
PersonaVLM: Long-Term Personalized Multimodal LLMs
arXiv 2026
Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
arXiv 2026
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
arXiv 2026
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
arXiv 2026
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
arXiv 2025
Aligning Multimodal LLM with Human Preference: A Survey
arXiv 2025
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
arXiv 2025
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension
arXiv 2025
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
arXiv 2025
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
arXiv 2025
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
arXiv 2025
Thyme: Think Beyond Images
arXiv 2025
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
arXiv 2025
RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark
arXiv 2025
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
arXiv 2025
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
arXiv 2025
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
arXiv 2025
A Survey on Benchmarks of Multimodal Large Language Models
arXiv 2024
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs
arXiv 2024
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
CVPR 2025
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
arXiv 2024
Aligning and Prompting Everything All at Once for Universal Visual Perception
arXiv 2023
Woodpecker: Hallucination Correction for Multimodal Large Language Models
arXiv 2023