This paper presents LLaDA2.0, a pair of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models, establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 follows three design principles (knowledge inheritance, progressive adaptation, and efficiency awareness) and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase, block-level WSD (warm-up, stable, decay) training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact block diffusion (decay). Together with post-training alignment via SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
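The three-phase block-size schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the power-of-two growth rule, the step counts, and every numeric value below are assumptions chosen only to show the warm-up, stable, and decay shape.

```python
def warmup_sizes(start_block, seq_len):
    """Hypothetical warm-up ladder: double the block size until it would reach seq_len."""
    sizes = []
    block = start_block
    while block < seq_len:
        sizes.append(block)
        block *= 2
    return sizes


def block_size_at(step, warmup_steps, stable_steps, decay_steps,
                  seq_len, start_block, final_block):
    """Return the diffusion block size used at a given training step.

    Assumed schedule (illustrative only):
      warm-up: block size grows through the ladder from warmup_sizes()
      stable:  block size equals seq_len (full-sequence diffusion)
      decay:   block size drops back to a compact final_block
    """
    total_steps = warmup_steps + stable_steps + decay_steps
    if step < 0 or step >= total_steps:
        raise ValueError("step is outside the schedule")

    if step < warmup_steps:
        sizes = warmup_sizes(start_block, seq_len)
        if not sizes:  # start_block already covers the full sequence
            return seq_len
        # Split the warm-up phase evenly across the ladder entries.
        return sizes[step * len(sizes) // warmup_steps]

    if step < warmup_steps + stable_steps:
        return seq_len

    return final_block


if __name__ == "__main__":
    # Toy configuration, not taken from the paper.
    config = dict(warmup_steps=4, stable_steps=2, decay_steps=2,
                  seq_len=64, start_block=4, final_block=8)
    print([block_size_at(s, **config) for s in range(8)])
```

In this toy setting the block size climbs during warm-up, stays at the full sequence length during the stable phase, and falls back to a small block for the decay phase. A compact final block is what allows block-wise parallel decoding at inference time.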
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
LLaDA2.0 converts auto-regressive models into discrete diffusion large language models with a novel training scheme, achieving superior performance and efficiency at scale.
- Year: 2025
- Venue: arXiv 2025
- Authors: 31
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2512.15745
- TL;DR: Semantic Scholar
Authors
Ji-Rong Wen, Junbo Zhao, Zhenzhong Lan, Jun Zhou, Jianguo Li, Chengxi Li, Chongxuan Li, Xiaolu Zhang, Lun Du, Da Zheng, Lanning Wei, Ling Liu, Maosong Cao, Tiwei Bie, Kun Chen, Mingliang Gong, Yanmei Gu, Zenan Huang, Zehuan Li, Huabin Liu, Guoshan Lu, Yuxin Ma, Jianfeng Tan, Yipeng Xing, Junlin Zhou, Liwang Zhu, Yihong Zhuang, Zhuochen Gong, Jiaqi Hu, Xiaocheng Lu, Zhanchao Zhou