Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Step-Audio is a production-ready, open-source solution featuring a multi-modal model and data engine that enhances real-time speech interaction with dynamic control and improved performance.
- Year: 2025
- Venue: arXiv 2025
- Authors: 145
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2502.11946v2
- TL;DR: Semantic Scholar
Authors
Daxin Jiang, Li Zhou, Bin Wang, Jing Li, Jie Wu, Peng Liu, Mingrui Chen, Tianyu Wang, Na Wang, Xin Huang, Liang Zhao, Yang Zhang, Yuting Yan, Zheng Ge, Ranchen Ming, Lei Xia, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Bo Li, Jie Yang, Yangguang Li, Yang Cao, Ming Li, Xu Zhao, Zhewei Huang, Shuchang Zhou, HongYu Zhou, Heung-Yeung Shum, Heng Wang, Jingyang Zhang, Chen Hu, Yanming Xu, Jianjian Sun, Zheng Sun, Zhe Xie, Yu Luo, Yuxiang Zhang, Ailin Huang, Hongyuan Wang, Yu Zhou, Zheng Gong, Xi Chen, Wei Ji, Jie Feng, Chen Xu, Zili Zhang, Zixin Zhang, Wen Sun, Jiansheng Chen, Xinhao Zhang, Xin Han, Brian Li, Buyun Ma, Changxin Miao, Changyi Wan, Chengting Feng, Dapeng Shi, Deshan Sun, Enle Liu, Guanzhe Huang, Gulin Yan, Haonan Jia, Jiaoren Wu, Junzhe Lin, Kaixiang Li, Kang An, Mingliang Li, Ruihang Miao, Shaoliang Pang, Shiliang Yang, SiQi Liu, Song Yuan, Tiancheng Cao, Wang You, Wenjin Deng, Wuxun Xie, Xiangwen Kong, Xiaojia Liu, Xin Wu, Yanbo Yu, Yaoyu Wang, Yuanhao Ding, Yuchu Luo, Yuxiang Yang, Zhichao Chang, Zidong Yang, Yuanwei Liang, Fei Tian, Haoyang Zhang, Yechang Huang, Xuerui Yang, Mingxiao Li, Shilei Jiang, Hanpeng Hu, Mei Chen, Xuelin Zhang, Boyong Wu, Chao Yan, Chengli Feng, Feiyu Shen, Jingbei Li, Jiangjie Zhen, Bingxin Li, Shuli Gao, Wen Li, Xuan Wen, Yuankai Ma, Dingyuan Hu, Longlong Gu, Qinyuan Tan, Wenqing He, Yilei Wang, Yuanwei Lu, Yuhe Yin, Bruce Wang, Jianchang Wu, Yun Mou, Bahtiyar Ahmidi, Chenrun Wang, Dula Sai, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jinguo Wang, Menglin Wu, Mingyao Liang, Nie Hao, Qiling Wu, Ran Sun, Shuai Shuai, Shanshan Yuan, Shihong Deng, Sitong Liu, Weipeng Ming, Xiaomin Deng, Yanan Wei, Yangzhen Ma, Yaqiang Shi, Yizhuang Zhou, Yinmin Zhong, Yaoben Wei, Yaqi Dai, Zhisheng Guan