We present GLM-4.1V-Thinking, a vision-language model (VLM) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document understanding. We open-source GLM-4.1V-9B-Thinking, which achieves state-of-the-art performance among models of comparable size. In a comprehensive evaluation across 28 public benchmarks, our model outperforms Qwen2.5-VL-7B on nearly all tasks and achieves comparable or even superior performance on 18 benchmarks relative to the significantly larger Qwen2.5-VL-72B. Notably, GLM-4.1V-9B-Thinking also demonstrates competitive or superior performance compared to closed-source models such as GPT-4o on challenging tasks including long document understanding and STEM reasoning, further underscoring its strong capabilities. Code, models and more information are released at https://github.com/THUDM/GLM-4.1V-Thinking.
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.1V-Thinking is a vision-language model (VLM) developed with a reasoning-centric training framework. The open-source GLM-4.1V-9B-Thinking achieves state-of-the-art performance among models of comparable size on tasks including STEM problem solving, video understanding, and long document understanding, and it matches or exceeds the significantly larger Qwen2.5-VL-72B on 18 of 28 public benchmarks.
- Year: 2025
- Venue: arXiv 2025
- Authors: 78
Attribution
- Abstract & full text: arxiv.org/abs/2507.01006v2
- TL;DR: Semantic Scholar
Authors
Jie Tang, Aohan Zeng, Qinkai Zheng, Zhenyu Hou, Da Yin, Xuancheng Huang, Yean Cheng, Yifan An, Yilin Niu, Zhengxiao Du, Zihan Wang, Chenhui Zhang, Guo Wang, Weihan Wang, Xiaotao Gu, Yuxuan Zhang, Zhen Yang, Bin Xu, Minlie Huang, Juanzi Li, Yuxiao Dong, Fan Yang, Peng Zhang, Yanfeng Wang, Yutao Zhang, Yan Wang, Yifan Du, Junjie Chen, Wei Jia, Wenkai Li, Jiale Cheng, Leqi Lei, Tianxiang Hao, Jing Chen, Shiyu Huang, Wenyi Hong, Ziyang Pan, Yuanchang Yue, Yanling Wang, Wenmeng Yu, Shuaiqi Duan, Sheng Yang, Lihang Pan, Junhui Ji, Jinjiang Wang, Jiazheng Xu, Ji Qi, Guobing Gan, Zhao Xue, Zehai He, Yuchen Li, Yuan Wang, Yadong Xue, Tianshu Zhang, Jiali Chen, Boyan Shi, Baoxu Wang, Debing Liu, Jinhao Chen, GLM-V Team, Haomiao Tang, Zhe Su, Changyu Pang, Guoqing Chen, Jinghao Lin, Letian Gong, Leyi Pan, Mingzhi Zhang, Shi Zhong, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianwei Luo, Xin Lyu, Yiming Shi, Yiheng Huang, Zhanxiao Du