Yunhang Shen
- Papers: 15
Authored papers (15)
Toward Native Multimodal Modeling: A Roadmap
arXiv 2026
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
arXiv 2026
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
arXiv 2026
Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
arXiv 2026
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
arXiv 2025
Aligning Multimodal LLM with Human Preference: A Survey
arXiv 2025
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
arXiv 2025
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
arXiv 2025
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
arXiv 2025
Solving the Catastrophic Forgetting Problem in Generalized Category Discovery
FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression
arXiv 2024
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs
arXiv 2024
Aligning and Prompting Everything All at Once for Universal Visual Perception
arXiv 2023
Woodpecker: Hallucination Correction for Multimodal Large Language Models
arXiv 2023
FoPro: Few-Shot Guided Robust Webly-Supervised Prototypical Learning
arXiv 2022
Affiliations
Frequent co-authors
10 from 15 papers