Papers

Trending research and the full catalog - each paper linked to the benchmarks, methods, and models it introduces.

Filtered by domain: Image Understanding

Cosmos 3: Omnimodal World Models for Physical AI

1 Jun 2026

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture.

Image Understanding, Language Modeling, Omni models, Video generation

11k 1.4/h

InSight: Self-Guided Skill Acquisition via Steerable VLAs

23 Jun 2026

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the…

Image Understanding, Robotics

15 0.0/h

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

6 Jun 2026

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions.

Image Understanding, Language Modeling, Reinforcement Learning

273 0.4/h

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

11 Jun 2026

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs).

Image Understanding, Language Modeling

291 0.1/h

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

24 Jun 2026

Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in…

Image Understanding, Language Modeling, Video generation

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

23 Jun 2026

Multimodal misinformation detection is increasingly important because viral posts now combine long multilingual narratives, several images, mixed provenance, and subtle text-image framing errors.

Image Understanding