Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
MOVA: Towards Scalable and Synchronized Video-Audio Generation
- Year: 2026
- Venue: arXiv 2026
- Stars: 1.0k
- Authors: 40
- Hosting: Abstract only
Attribution
- Abstract & full text: arxiv.org/abs/2602.08794
- TL;DR: Semantic Scholar
Authors
Xie Chen, Cheng Chang, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Zhaoye Fei, Qinyuan Cheng, ShiMin Li, Xipeng Qiu, Xiangyu Peng, Qi Luo, Zhiyuan Ning, Qi Chen, Jingqi Tong, Qianyi Wu, Chenchen Yang, Zhe Xu, Wei Jiang, Yuerong Song, Tianyi Liang, Ziwei He, SII-OpenMOSS Team, Donghua Yu, Wenbo Zhang, Wenming Tu, Yanru Huo, Ying Zhu, Yinze Luo, ZhiYu Zhang, Chushu Zhou, Hongnan Ma, Jiaxi Li, Junxi Liu, Chunguo Li, Chenhui Li, Zengfeng Huang