Yu Qiao
- Papers: 195
Authored papers (195)
ASI-Evolve: AI Accelerates AI
arXiv 2026
InternVLA-A1: Unifying Understanding, Generation and Action for Robotic Manipulation
arXiv 2026
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
arXiv 2026
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
arXiv 2026
daVinci-LLM: Towards the Science of Pretraining
arXiv 2026
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
arXiv 2026
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
arXiv 2026
MARBLE: Multi-Aspect Reward Balance for Diffusion RL
arXiv 2026
CauScale: Neural Causal Discovery at Scale
arXiv 2026
OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
arXiv 2026
P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads
arXiv 2026
Accelerating Masked Image Generation by Learning Latent Controlled Dynamics
arXiv 2026
SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature
arXiv 2026
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
arXiv 2025
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
arXiv 2025
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
arXiv 2025
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
arXiv 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv 2025
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
arXiv 2025
Sekai: A Video Dataset towards World Exploration
arXiv 2025
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
ICCV 2025
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
arXiv 2025
Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
ICCV 2025
Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback
arXiv 2025
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
arXiv 2025
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
arXiv 2025
MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
arXiv 2025
Yume-1.5: A Text-Controlled Interactive World Generation Model
arXiv 2025
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
arXiv 2025
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv 2025
ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
arXiv 2025
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
arXiv 2025
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
arXiv 2025
PICABench: How Far Are We from Physically Realistic Image Editing?
arXiv 2025
Cut2Next: Generating Next Shot via In-Context Tuning
arXiv 2025
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
arXiv 2025
GenExam: A Multidisciplinary Text-to-Image Exam
arXiv 2025
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
ICCV 2025
Sequential Diffusion Language Models
arXiv 2025
RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
arXiv 2025
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
arXiv 2025
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
arXiv 2025
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
arXiv 2025
WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
arXiv 2025
TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
arXiv 2025
VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
arXiv 2025
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
arXiv 2025
P1: Mastering Physics Olympiads with Reinforcement Learning
arXiv 2025
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
arXiv 2025
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
arXiv 2025
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
arXiv 2025
Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny
arXiv 2025
O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering
arXiv 2025
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
arXiv 2025
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
arXiv 2025
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
arXiv 2025
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
ICCV 2025
OASIS: Open Agent Social Interaction Simulations with One Million Agents
arXiv 2024
GRUtopia: Dream General Robots in a City at Scale
arXiv 2024
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
arXiv 2024
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
arXiv 2024
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
arXiv 2024
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
arXiv 2024
VideoMamba: State Space Model for Efficient Video Understanding
arXiv 2024
GenAD: Generalized Predictive Model for Autonomous Driving
CVPR 2024 1
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
arXiv 2024
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
arXiv 2024
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
arXiv 2024
Needle In A Multimodal Haystack
arXiv 2024
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024 1
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
arXiv 2024
DynamicCity: Large-Scale 4D Occupancy Generation from Dynamic Scenes
arXiv 2024
Learning Manipulation by Predicting Interaction
arXiv 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
arXiv 2024
Real-time Holistic Robot Pose Estimation with Unknown States
arXiv 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
arXiv 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
arXiv 2024
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
arXiv 2024
MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation
arXiv 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
arXiv 2024
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
CVPR 2025 1
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
CVPR 2024 1
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
arXiv 2024
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
arXiv 2024
MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control
arXiv 2024
Cinemo: Consistent and Controllable Image Animation with Motion Diffusion Models
arXiv 2024
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning
arXiv 2024
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
arXiv 2024
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
arXiv 2024
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
arXiv 2024
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
arXiv 2024
Linear Attention Sequence Parallelism
arXiv 2024
Causal Evaluation of Language Models
arXiv 2024
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
arXiv 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
arXiv 2024
Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
arXiv 2024
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
CVPR 2025 1
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
arXiv 2024
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
arXiv 2024
MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
arXiv 2024
ToMiE: Towards Modular Growth in Enhanced SMPL Skeleton for 3D Human with Animatable Garments
ICCV 2025
FLoRA: Low-Rank Core Space for N-dimension
arXiv 2024
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
arXiv 2024
Latte: Latent Diffusion Transformer for Video Generation
arXiv 2024
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
arXiv 2024
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
CVPR 2024 1
CO2: Efficient Distributed Training with Full Communication-Computation Overlap
arXiv 2024
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
arXiv 2024
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
arXiv 2024
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
arXiv 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
arXiv 2024
FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality
arXiv 2024
Embodied Understanding of Driving Scenarios
arXiv 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
arXiv 2024
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
arXiv 2024
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI
arXiv 2024
REEF: Representation Encoding Fingerprints for Large Language Models
arXiv 2024
LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
arXiv 2024
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
arXiv 2024
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
arXiv 2024
Vlogger: Make Your Dream A Vlog
CVPR 2024 1
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
arXiv 2024
Diffusion Transformer Policy
arXiv 2024
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
arXiv 2023
DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds
ICCV 2023 1
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
CVPR 2023 1
ImageBind-LLM: Multi-modality Instruction Tuning
arXiv 2023
Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision
CVPR 2024 1
ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation
arXiv 2023
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
arXiv 2023
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
NeurIPS 2023 11
MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
arXiv 2023
SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks
arXiv 2023
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
arXiv 2023
UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase
ICCV 2023 1
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024 1
HAT: Hybrid Attention Transformer for Image Restoration
arXiv 2023
Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
ICCV 2023 1
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
arXiv 2023
ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models
arXiv 2023
DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models
arXiv 2023
ControlLLM: Augment Language Models with Tools by Searching on Graphs
arXiv 2023
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
arXiv 2023
A Comparative Study of Image Restoration Networks for General Backbone Network Design
arXiv 2023
Fine-grained Audible Video Description
CVPR 2023 1
Long-Term Rhythmic Video Soundtracker
arXiv 2023
SEAL: A Framework for Systematic Evaluation of Real-World Super-Resolution
arXiv 2023
MLLMs-Augmented Visual-Language Representation Learning
arXiv 2023
Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning
ICCV 2023 1
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
CVPR 2023 1
Fake Alignment: Are LLMs Really Aligned Well?
arXiv 2023
LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global Cross-Modal Fusion
CVPR 2023 1
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
CVPR 2024 1
DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
arXiv 2023
Meta-Transformer: A Unified Framework for Multimodal Learning
arXiv 2023
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
arXiv 2023
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
arXiv 2023
SAM-Med3D: Towards General-purpose Segmentation Models for Volumetric Medical Images
arXiv 2023
Drive Like a Human: Rethinking Autonomous Driving with Large Language Models
arXiv 2023
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models
arXiv 2023
Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation
arXiv 2023
On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
arXiv 2023
MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification
arXiv 2023
Scaling Data Generation in Vision-and-Language Navigation
ICCV 2023 1
CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
CVPR 2023 1
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
arXiv 2023
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
arXiv 2023
DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
ICCV 2023 1
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model
arXiv 2023
MGMAE: Motion Guided Masking for Video Masked Autoencoding
ICCV 2023 1
OneLLM: One Framework to Align All Modalities with Language
CVPR 2024 1
Planning-oriented Autonomous Driving
CVPR 2023 1
Activating More Pixels in Image Super-Resolution Transformer
CVPR 2023 1
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
CVPR 2023 1
You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction
arXiv 2022
Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy
arXiv 2022
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
arXiv 2022
PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023 1
Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
CVPR 2023 1
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
CVPR 2023 1
Diff-Font: Diffusion Model for Robust One-Shot Font Generation
arXiv 2022
Demystify Transformers & Convolutions in Modern Image Deep Networks
arXiv 2022
ResFormer: Scaling ViTs with Multi-Resolution Training
CVPR 2023 1
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
arXiv 2022
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
ICCV 2023 1
Lego-MT: Learning Detachable Models for Massively Multilingual Machine Translation
arXiv 2022
Self-slimmed Vision Transformer
arXiv 2021
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
arXiv 2021
Efficient Image Super-Resolution Using Pixel Attention
arXiv 2020
Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition
ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks
arXiv 2018
Knowledge Guided Disambiguation for Large-Scale Scene Classification with Multi-Resolution CNNs
arXiv 2016