Shuhuai Ren
- Papers: 21
Authored papers (21)
MiMo-V2-Flash Technical Report
arXiv 2026
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
arXiv 2025
MiMo-VL Technical Report
arXiv 2025
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos
arXiv 2025
RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
arXiv 2025
Next Block Prediction: Video Generation via Semi-Auto-Regressive Modeling
arXiv 2025
MiMo-Embodied: X-Embodied Foundation Model Technical Report
arXiv 2025
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
arXiv 2025
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
arXiv 2025
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
ICCV 2025
TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment
arXiv 2025
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
arXiv 2024
Parallelized Autoregressive Visual Generation
CVPR 2025
TempCompass: Do Video LLMs Really Understand Videos?
arXiv 2024
PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain
arXiv 2024
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
arXiv 2024
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
CVPR 2024
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
arXiv 2023
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
arXiv 2023
Delving into the Openness of CLIP
arXiv 2022
Frequent co-authors
10 from 21 papers