Shuhuai Ren

MiMo-VL Technical Report

arXiv 2025

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

arXiv 2025

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

arXiv 2025

Next Block Prediction: Video Generation via Semi-Auto-Regressive Modeling

arXiv 2025

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

ICCV 2025

TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment

arXiv 2025

MiMo-Embodied: X-Embodied Foundation Model Technical Report

arXiv 2025

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

arXiv 2025

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

arXiv 2025

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

arXiv 2024

Parallelized Autoregressive Visual Generation

CVPR 2025

TempCompass: Do Video LLMs Really Understand Videos?

arXiv 2024

PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain

arXiv 2024

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

arXiv 2024

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

CVPR 2024

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

arXiv 2023

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

arXiv 2023