Qi Qian

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

arXiv 2026

Small Vision-Language Models are Smart Compressors for Long Video Understanding

arXiv 2026

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

arXiv 2026

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

arXiv 2025

Searching for Best Practices in Retrieval-Augmented Generation

arXiv 2024

Efficient Personalized Text-to-image Generation by Leveraging Textual Subspace

arXiv 2024

SeA: Semantic Adversarial Augmentation for Last Layer Features from Unsupervised Representation Learning

arXiv 2024

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

arXiv 2024

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

arXiv 2023

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

arXiv 2023

Improved Visual Fine-tuning with Natural Language Supervision

ICCV 2023

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

CVPR 2024

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

arXiv 2023