Papers

Trending research and the full catalog - each paper linked to the benchmarks, methods, and models it introduces.

Filtered by domain: Video generation

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut et al. · 6 Jan 2026

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides.

Audio Generation · Video generation

7.3k

INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

Hujun Bao, Yipeng Chen, Xinyu Chen et al. · 8 Apr 2026

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision.

Video generation · World Models

925 · 0.2/h

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Bo Zheng, Kaipeng Zhang, Zhen Li et al. · 24 Mar 2026

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state.

Reinforcement Learning · Video generation · World Models

393 · 0.2/h

Cosmos 3: Omnimodal World Models for Physical AI

1 Jun 2026

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture.

Image Understanding · Language Modeling · Omni models · Video generation

11k · 1.4/h

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Hao Yang, Pheng-Ann Heng, Chi-Wing Fu et al. · 13 Apr 2026

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose.

Video generation

411 · 0.1/h

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Yifan Yang, Yuqing Yang, Zhiyuan He et al. · 27 Apr 2026

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies.

Image Understanding · Language Modeling · Reinforcement Learning · Video generation

398 · 0.1/h

World Model for Robot Learning: A Comprehensive Survey

Pieter Abbeel, Marc Pollefeys, Jitendra Malik et al. · 30 Apr 2026

World models, which are predictive representations of how environments evolve under actions, have become a central component of robot learning.

Reinforcement Learning · Video generation · World Models

646 · 0.1/h

Helios: Real Real-Time Long Video Generation Model

Zongjian Li, Li Yuan, Shenghai Yuan et al. · 4 Mar 2026

We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline.

Video generation

1.9k

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

14 Jun 2026

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene.

Robotics · Video generation · World Models

41 · 0.6/h

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

24 Jun 2026

Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in…

Image Understanding · Language Modeling · Video generation