The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, which leads to fragmented architectures and suboptimal integration. To overcome this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach that reconciles the conflict between understanding and generation. As a step toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
- Year: 2026
- Venue: arXiv 2026
- Authors: 89
Attribution
- Abstract & full text: arxiv.org/abs/2603.27538
- TL;DR: Semantic Scholar
Authors
Chi Zhang, Jing Li, Haozhe Wang, Yulei Qian, Yuchen Xie, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Peng Pei, Jiawei Wang, Wei Wang, Hao Yang, Jie Yang, Xiaoyang Li, Yifan Lu, Hang Yu, Quan Chen, Haozhe Zhao, Manyuan Zhang, Yan Bai, Xiaoyu Li, Bin Xiao, Xing Hu, Xiao Liu, Haoze Sun, Qi Li, Chen Chen, Xu Huang, Xuanyu Zhu, Yitian Chen, Xinyang Lin, Jiale Hong, Yufei Gao, Chao Wang, Zijian Zhang, Hongyu Li, Lin Qiu, Qian Wang, Jiaxing Liu, Jun Kuang, Xi Chen, Hong Liu, Ge Yang, Kunming Luo, Hui Su, Dian Zheng, Zhihang Yu, Yizhen Jiang, Yuqi Peng, YanJie Li, Yan Feng, Zhenlong Yuan, Meituan LongCat Team, Haonan Yan, Kefeng Zhang, Rumei Li, Yaoming Zhu, Yerui Sun, Chengjiang Li, JiaQi Zhang, Minhao Jing, Tongxin Pan, Xiaotong Li, Xiaoyu Zhao, Yao Qiu, Ying Luo, Yipeng Mei, Yufang Liu, Yufei Chen, Zhixiong Han, Changran Wang, Haowei Guo, Huicheng Jiang, Jialv Zou, Jianping Lin, Jing Jin, Juncheng She, Kuofeng Gao, Wenlong He, Yifei Cao, Yimeng Jia, Zeyang Hu