Video Foundation Models (VFMs) have recently been used to simulate the real world, both to train physical AI systems and to develop creative visual experiences. However, training large-scale VFMs that generate high-quality videos poses significant challenges. We present a scalable, open-source VFM training pipeline built on NVIDIA NeMo, which provides accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
Training Video Foundation Models with NVIDIA NeMo
A scalable open-source pipeline using NVIDIA NeMo for training and inference of Video Foundation Models addresses challenges in generating high-quality videos.
- Year
- 2025
- Venue
- arXiv 2025
- Authors
- 29
Attribution
- Abstract & full text
- arxiv.org/abs/2503.12964
- TL;DR
- Semantic Scholar
Authors
Hao Wang, Joseph Jennings, Sahil Jain, Ashwath Aithal, Nima Tajbakhsh, Shanmugam Ramasamy, Xiaowei Ren, Zeeshan Patel, Ethan He, Parth Mannan, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page