Xu Sun

SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

arXiv 2025

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

arXiv 2025

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

arXiv 2025

Next Block Prediction: Video Generation via Semi-Auto-Regressive Modeling

arXiv 2025

Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

arXiv 2025

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

arXiv 2025

TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment

arXiv 2025

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

arXiv 2024

VidTwin: Video VAE with Decoupled Structure and Dynamics

CVPR 2025

TempCompass: Do Video LLMs Really Understand Videos?

arXiv 2024

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

arXiv 2024

Temporal Reasoning Transfer from Text to Video

arXiv 2024

PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension

arXiv 2024

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

arXiv 2024

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

CVPR 2024

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning

arXiv 2023

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

arXiv 2023

Towards Codable Watermarking for Injecting Multi-bits Information to LLMs

arXiv 2023

MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning

arXiv 2023

Can Language Models Understand Physical Concepts?

arXiv 2023

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models

arXiv 2023