We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and Mixture-of-Experts (MoE; 30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL is built on three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and MoE architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
Qwen3-VL Technical Report
- Year: 2025
- Venue: arXiv 2025
- Authors: 64
Attribution
- Abstract & full text: arxiv.org/abs/2511.21631
- TL;DR: Semantic Scholar
Authors
Junyang Lin, Tianbao Xie, Yiheng Xu, Yang Liu, Jingren Zhou, Zhifang Guo, Jin Xu, Yuxuan Wang, Binyuan Hui, Shuai Bai, An Yang, Bowen Yu, Dayiheng Liu, Hang Zhang, Peng Wang, Yuxuan Cai, Keqin Chen, Xuejing Liu, Mingkun Yang, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Jianxin Yang, Wenbin Ge, Xuancheng Ren, Sibo Song, Jun Tang, Humen Zhong, Yuanzhi Zhu, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Xi Zhang, Zesen Cheng, Zhibo Yang, Haiyang Xu, Jiawei Liu, Fei Huang, Ruizhe Chen, Xionghui Chen, Lianghao Deng, Chang Gao, Chunjiang Ge, Qidong Huang, Jie Huang, Shutong Jiang, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Chenglong Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Lingchen Meng, Yuchong Sun, Jianhong Tu, Qiuyue Wang, Fei Zhang, Jing Zhou, Ke Zhu