Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Vchitect-2.0 is a parallel transformer architecture for scaling up video diffusion models in text-to-video generation. It aligns text and video with a Multimodal Diffusion Block, addresses memory and computational bottlenecks with a Memory-efficient Training framework, and relies on a high-quality million-scale training dataset.

Year: 2025
Venue: arXiv 2025
Authors: 19
Hosting: Abstract only

Abstract & full text: arxiv.org/abs/2501.08453

Abstract

We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training of long video sequences on distributed systems. (3) Additionally, our enhanced data processing pipeline ensures the creation of Vchitect T2V DataVerse, a high-quality million-scale training dataset through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.
