
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-V2, a Mixture-of-Experts language model with 236B parameters, uses Multi-head Latent Attention and DeepSeekMoE to achieve high performance with reduced costs and efficient inference.

Year
2024
Venue
arXiv 2024
Authors
157


Attribution

Abstract & full text
arxiv.org/abs/2405.04434v5
TL;DR
Semantic Scholar

Abstract

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
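The headline figures in the abstract can be turned into a few derived ratios. The sketch below is only illustrative arithmetic on the numbers quoted above; the function name and dictionary keys are hypothetical and come from neither the paper nor any library. It works out what fraction of parameters is active per token, how much of the KV cache remains after a 93.3% reduction, and the training cost relative to DeepSeek 67B.

```python
# Figures quoted in the abstract; nothing else is assumed.
TOTAL_PARAMS_B = 236.0         # total parameters, in billions
ACTIVE_PARAMS_B = 21.0         # parameters activated per token, in billions
TRAINING_COST_SAVED = 0.425    # fraction of training cost saved vs. DeepSeek 67B
KV_CACHE_REDUCTION = 0.933     # fraction by which the KV cache is reduced
THROUGHPUT_MULTIPLIER = 5.76   # maximum generation throughput vs. DeepSeek 67B


def summarize():
    """Derive simple ratios from the abstract's headline numbers (hypothetical helper)."""
    kv_cache_remaining = 1.0 - KV_CACHE_REDUCTION
    return {
        # Share of the model that is active for any single token.
        "active_fraction": ACTIVE_PARAMS_B / TOTAL_PARAMS_B,
        # Share of the baseline KV cache that is still needed.
        "kv_cache_remaining": kv_cache_remaining,
        # How many times smaller the KV cache is than the baseline.
        "kv_cache_shrink_factor": 1.0 / kv_cache_remaining,
        # Training cost as a fraction of the DeepSeek 67B cost.
        "relative_training_cost": 1.0 - TRAINING_COST_SAVED,
        # Passed through unchanged from the abstract.
        "throughput_multiplier": THROUGHPUT_MULTIPLIER,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

Read this way, roughly 8.9% of the parameters are active per token, the KV cache shrinks to about 6.7% of the baseline (close to 15 times smaller), and training costs about 57.5% of what DeepSeek 67B does.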

Authors

Bo Liu, Xin Liu, Bin Wang, Wentao Zhang, Zihan Wang, Hui Li, Jin Chen, Xinyu Yang, DeepSeek-AI, Aixin Liu, Bei Feng, Bingxuan Wang, Chenggang Zhao, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, J. L. Cai, Jian Liang, JianZhong Guo, Jiaqi Ni, Jiashi Li, Jingyang Yuan, Junjie Qiu, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Lean Wang, Lecong Zhang, Lei Xu, Leyi Xia, Liang Zhao, Liyue Zhang, Meng Li, Miaojun Wang, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingming Li, Ning Tian, Panpan Huang, Peiyi Wang, Peng Zhang, Qihao Zhu, Qinyu Chen, Qiushi Du, R. J. Chen, R. L. Jin, Ruiqi Ge, Ruizhe Pan, Runxin Xu, Ruyi Chen, S. S. Li, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaoqing Wu, Shengfeng Ye, Shirong Ma, Shiyu Wang, Shuang Zhou, Shuiping Yu, Shunfeng Zhou, T. Wang, Tian Pei, Tianyu Sun, W. L. Xiao, Wangding Zeng, Wei An, Wen Liu, Wenfeng Liang, Wenjun Gao, X. Q. Li, Xiangyue Jin, Xianzu Wang, Xiao Bi, Xiaodong Liu, Xiaohan Wang, Xiaojin Shen, Xiaokang Chen, Xiaosha Chen, Xiaotao Nie, Xiaowen Sun, Xiaoxiang Wang, Xin Xie, Xingkai Yu, Xinnan Song, Xinyi Zhou, Xuecheng Su, Y. K. Li, Y. X. Wei, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Li, Yaohui Wang, Yi Zheng, Yichao Zhang, Yiliang Xiong, Ying He, Ying Tang, Yishi Piao, Yixuan Tan, Yiyuan Liu, Yongqiang Guo, Yuchen Zhu, Yuduan Wang, Yuheng Zou, Yukun Zha, Yunxian Ma, Yuting Yan, Yuxiang You, Yuxuan Liu, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhen Huang, Zhen Zhang, Zhenda Xie, Zhewen Hao, Zhihong Shao, Zhipeng Xu, Zhongyu Zhang, Zhuoshu Li, Zihui Gu, Zilin Li, Ziwei Xie, Hao Yang, Y. Wu, Yixin Dong, Yilong Zhao, Xuan Lu, Size Zheng, Yongji Wang, Chengqi Dengr, Tian Yuan, Zhiniu Wen