Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by treating visual signals merely as passive conditional inputs rather than as supervisory targets. To mitigate this, we introduce Youtu-VL, a framework built on the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to vision-centric tasks, enabling a standard VLM to perform them without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
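The shift from "vision-as-input" to "vision-as-target" can be illustrated with a toy next-token loss. This is only a minimal sketch and not the authors' implementation: the abstract does not say how visual tokens are discretized or how the two loss terms are weighted, so the shared discrete vocabulary, the uniform averaging, and the names `token_nll`, `autoregressive_loss`, and `supervise_vision` are all assumptions made for illustration.

```python
import math


def token_nll(logits, target):
    """Negative log-likelihood of `target` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def autoregressive_loss(logits_per_step, targets, is_visual, supervise_vision):
    """Mean next-token loss over the supervised positions (hypothetical helper).

    supervise_vision=False: visual positions are masked out of the loss,
    so vision only conditions the text predictions (vision as input).
    supervise_vision=True: visual and text positions both contribute
    (vision as target). Uniform averaging is an assumption.
    """
    losses = [
        token_nll(logits, target)
        for logits, target, visual in zip(logits_per_step, targets, is_visual)
        if supervise_vision or not visual
    ]
    if not losses:
        raise ValueError("no supervised positions in the sequence")
    return sum(losses) / len(losses)


# Toy sequence: two visual tokens followed by two text tokens, vocabulary size 3.
logits_per_step = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
]
targets = [1, 0, 1, 2]
is_visual = [True, True, False, False]

text_only_loss = autoregressive_loss(
    logits_per_step, targets, is_visual, supervise_vision=False
)
unified_loss = autoregressive_loss(
    logits_per_step, targets, is_visual, supervise_vision=True
)
print(f"text-only: {text_only_loss:.4f}  unified: {unified_loss:.4f}")
```

In the text-only setting, errors on the visual positions produce no gradient signal; in the unified setting they do, which is the bias the abstract describes.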
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
- Year: 2026
- Venue: arXiv 2026
- Authors: 41
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2601.19798
- TL;DR: Semantic Scholar
Authors
Yunhang Shen, Deqiang Jiang, Haoyu Cao, Xing Sun, Wei Liu, Junru Lu, Bo Ke, Di Yin, Ruizhi Qiao, Zhixiang Wei, Ke Li, Yunsheng Wu, Xin Li, Zuwei Long, Peixian Chen, Mengdan Zhang, Yangning Li, Haojia Lin, Yi Li, Jianfeng He, Lexiang Tang, Bing Liu, Xiaotian Li, Xiaoyu Tan, Zhehan Kan, Xinghua Jiang, Shifeng Liu, Hongze Shen, Yubo Zhu, Qianyu Li, Weibo Gu, Yinsong Liu, Mingkong Tang, Shuangyin Liu, Haodong Lin, Jiarui Qin, Lingfeng Qiao, Kun Yin, Yunfei Wu, Huang Chen, Zhongpeng Cai