Chaoyou Fu

Papers
24

Authored papers

24

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

arXiv 2026

2026

PersonaVLM: Long-Term Personalized Multimodal LLMs

arXiv 2026

2026

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

arXiv 2026

2026

VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

arXiv 2026

2026

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

arXiv 2026

2026

VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

arXiv 2025

2025

Aligning Multimodal LLM with Human Preference: A Survey

arXiv 2025

2025

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

arXiv 2025

2025

QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

arXiv 2025

2025

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

arXiv 2025

2025

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

arXiv 2025

2025

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

arXiv 2025

2025

Thyme: Think Beyond Images

arXiv 2025

2025

OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

arXiv 2025

2025

RealUnify: Do Unified Models Truly Benefit from Unification? A Comprehensive Benchmark

arXiv 2025

2025

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

arXiv 2025

2025

Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

arXiv 2025

2025

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

arXiv 2025

2025

A Survey on Benchmarks of Multimodal Large Language Models

arXiv 2024

2024

T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

arXiv 2024

2024

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

CVPR 2025 1

2024

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

arXiv 2024

2024

Aligning and Prompting Everything All at Once for Universal Visual Perception

arXiv 2023

2023

Woodpecker: Hallucination Correction for Multimodal Large Language Models

arXiv 2023

2023

Affiliations

No known affiliations.

Frequent co-authors

10

from 24 papers