Haoyu Cao
- Papers: 10
Authored papers
RISE-Video: Can Video Generators Decode Implicit World Rules?
arXiv 2026
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
arXiv 2026
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
arXiv 2026
Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding
arXiv 2026
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
arXiv 2026
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
arXiv 2025
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
arXiv 2025
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
arXiv 2025
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
arXiv 2025
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
arXiv 2024