We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities, all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, it excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. It also exhibits strong capabilities across diverse vision-language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it competes effectively with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances long-context processing and clear visual perception. With a 128K extended context window, it can process diverse long inputs, scoring 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while keeping the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
Kimi-VL Technical Report
- Year: 2025
- Venue: arXiv 2025
- Authors: 93
Attribution
- Abstract & full text: arxiv.org/abs/2504.07491v2
- TL;DR: Semantic Scholar
Authors
Hao Zhang, Longhui Yu, Tao Yu, Xinyuan Wang, Bowen Wang, Yibo Miao, Yang Li, Hao Yang, Weiran He, Xinyu Zhou, HaoNing Wu, Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Han Zhu, Yuzhi Wang, Jianzhou Wang, Mengnan Dong, Zheng Zhang, Yongsheng Kang, Yuxin Wu, Zhilin Yang, Liang Chen, Wei Song, Chu Wei, Y. Charles, Zaida Zhou, Haoyu Lu, Zongyu Lin, Heng Wang, Enming Yuan, Zhiqi Huang, Jiezhong Qiu, Fang Li, Kimi Team, Angang Du, Bowei Xing, Bowen Qu, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Flood Sung, Guangda Wei, Hao Ding, Hao Hu, Haotian Yao, Hongcheng Gao, Jiaming Li, Jiaqi Deng, Jin Xie, Jinhong Wang, Kun Ouyang, Lin Sui, Mengfan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Sihan Cao, Tianhui Song, Tongtong Bai, Weixiao Huang, Xiaokun Yuan, Xingzhe Wu, Xinxing Zu, Yan Zhong, Yangyang Hu, Yejie Wang, Yimin Chen, Yiping Bao, Yiqin Wang, Yuanxin Liu, Yuzi Yan, Zhaowei Li, Zihao Huang, Zijia Zhao, Ziwei Chen