Yoav HaCohen, Benny Brazowski, Nisan Chiprut et al. · 6 Jan 2026
Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides.
Hujun Bao, Yipeng Chen, Xinyu Chen et al. · 8 Apr 2026
Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision.
Bo Zheng, Kaipeng Zhang, Zhen Li et al. · 24 Mar 2026
Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state.
1 Jun 2026
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture.
Hao Yang, Pheng-Ann Heng, Chi-Wing Fu et al. · 13 Apr 2026
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose.
Yifan Yang, Yuqing Yang, Zhiyuan He et al. · 27 Apr 2026
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies.
Pieter Abbeel, Marc Pollefeys, Jitendra Malik et al. · 30 Apr 2026
World models, which are predictive representations of how environments evolve under actions, have become a central component of robot learning.
Zongjian Li, Li Yuan, Shenghai Yuan et al. · 4 Mar 2026
We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline.
14 Jun 2026
Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene.
24 Jun 2026
Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in…