Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
- Year: 2026
- Venue: arXiv 2026
- Stars: 2.6k
- Authors: 58
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2605.12500
- TL;DR: Semantic Scholar
Authors (58)
Bo Liu, Yubo Wang, Yan Li, Yang Gao, Dahua Lin, Ziwei Liu, Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, PengFei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang