Qin Jin

DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models

arXiv 2025

TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

arXiv 2025

A Survey of Deep Learning for Geometry Problem Solving

arXiv 2025

Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

arXiv 2025

TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM

arXiv 2025

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

arXiv 2024

ESCoT: Towards Interpretable Emotional Support Dialogue Systems

arXiv 2024

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

arXiv 2024

SPAFormer: Sequential 3D Part Assembly with Transformers

arXiv 2024

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

arXiv 2024

Movie101: A New Movie Understanding Benchmark

arXiv 2023

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

arXiv 2023

Rethinking Benchmarks for Cross-modal Image-text Retrieval

arXiv 2023

MPMQA: Multimodal Question Answering on Product Manuals

arXiv 2023