Meng Cao

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

arXiv 2025

"See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models

arXiv 2025

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

arXiv 2025

Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps

arXiv 2024

How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?

arXiv 2024

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

arXiv 2024

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

Efficient ConvBN Blocks for Transfer Learning and Beyond

arXiv 2023

Systematic Rectification of Language Models via Dead-end Analysis

arXiv 2023

VeCLIP: Improving CLIP Training via Visual-enriched Captions

arXiv 2023

Improving Retrieval-Augmented Large Language Models via Data Importance Learning

arXiv 2023