Papers

Trending research and the full catalog - each paper linked to the benchmarks, methods, and models it introduces.

Filtered by domain: Robotics

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Bo Liu, Yubo Wang, Yan Li et al. · 12 May 2026

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces.

Image Understanding · Language Modeling · Robotics · World Models

3.4k · 0.7/h

A Pragmatic VLA Foundation Model

Xing Zhu, Yujun Shen, Wei Wu et al. · 26 Jan 2026

Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation).

Image Understanding · Robotics

1.5k

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Yuan Zhang, Yijun Yang, Hang Xu et al. · 5 May 2026

JoyAI-Image integrates a spatially enhanced MLLM with MMDiT to achieve unified visual understanding, text-to-image generation, and instruction-guided image editing with enhanced spatial intelligence.

Image editing · Image generation · Image Understanding · Language Modeling

2.2k · 0.2/h

Utonia: Toward One Encoder for All Point Clouds

Yujia Zhang, Xiaoyang Wu, Naiyan Wang et al. · 3 Mar 2026

We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all.

Image Understanding · Language Modeling · Remote Sensing · Robotics

692

Causal World Modeling for Robot Control

Nan Xue, Xing Zhu, Yujun Shen et al. · 29 Jan 2026

Video world modeling enables robot learning through a unified framework that predicts frames and executes policies simultaneously using a shared latent space and closed-loop feedback mechanisms.

Robotics · World Models

1.3k

World Action Models: The Next Frontier in Embodied AI

Zhaoye Fei, Xipeng Qiu, Yu-Gang Jiang et al. · 12 May 2026

Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention.

Robotics · World Models

960 · 0.4/h

PhysBrain 1.0 Technical Report

Kai Chen, Bin Yu, Shijie Lian et al. · 14 May 2026

PhysBrain 1.0 leverages human egocentric video to generate physical commonsense supervision for vision-language-action models, achieving state-of-the-art performance in embodied control tasks through capability-preserving adaptation.

Image Understanding · Question Answering · Robotics

35 · 0.0/h

InSight: Self-Guided Skill Acquisition via Steerable VLAs

23 Jun 2026

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the…

Image Understanding · Robotics

15 · 0.0/h

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

14 Jun 2026

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene.

Robotics · Video generation · World Models

41 · 0.6/h

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

4 Jun 2026

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface…

Robotics

98 · 0.1/h

Geometric Action Model for Robot Policy Learning

15 Jun 2026

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors…

Robotics

96 · 0.0/h