Hongsheng Li
- Papers: 117
Authored papers
DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
arXiv 2026
PromptRL: Prompt Matters in RL for Flow-Based Image Generation
arXiv 2026
MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning
arXiv 2026
AURA: Always-On Understanding and Real-Time Assistance via Video Streams
arXiv 2026
From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors
arXiv 2026
DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving
arXiv 2026
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
arXiv 2026
FullStack-Agent: Enhancing Agentic Full-Stack Web Coding via Development-Oriented Testing and Repository Back-Translation
arXiv 2026
SlidesGen-Bench: Evaluating Slides Generation via Computational and Quantitative Metrics
arXiv 2026
Rethinking VLM Representation for VLA Initialization
arXiv 2026
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
arXiv 2026
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
arXiv 2025
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
ICCV 2025
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
arXiv 2025
MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
arXiv 2025
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
ICCV 2025
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
arXiv 2025
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
arXiv 2025
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
arXiv 2025
Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models
arXiv 2025
UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
arXiv 2025
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
arXiv 2025
Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
arXiv 2025
EditThinker: Unlocking Iterative Reasoning for Any Image Editor
arXiv 2025
Architecture Decoupling Is Not All You Need For Unified Multimodal Model
arXiv 2025
HPSv3: Towards Wide-Spectrum Human Preference Score
ICCV 2025
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints
arXiv 2025
CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
arXiv 2025
MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning
arXiv 2025
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
arXiv 2025
VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing
arXiv 2025
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
CVPR 2025 1
WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning
arXiv 2025
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
arXiv 2025
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
arXiv 2025
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
arXiv 2025
Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation
arXiv 2025
IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
arXiv 2025
SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
arXiv 2025
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
arXiv 2025
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
arXiv 2025
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
arXiv 2025
PICABench: How Far Are We from Physically Realistic Image Editing?
arXiv 2025
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
arXiv 2025
Factuality Matters: When Image Generation and Editing Meet Structured Visuals
arXiv 2025
One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning
arXiv 2025
M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving
arXiv 2025
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
arXiv 2024
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
arXiv 2024
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
arXiv 2024
Phased Consistency Models
arXiv 2024
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
arXiv 2024
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024 1
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
arXiv 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
arXiv 2024
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
arXiv 2024
AnimateLCM: Computation-Efficient Personalized Style Video Generation without Personalized Video Data
arXiv 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
arXiv 2024
Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation
arXiv 2024
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
arXiv 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
arXiv 2024
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
arXiv 2024
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
arXiv 2024
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
ICCV 2025
Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior
arXiv 2024
ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
arXiv 2024
MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
arXiv 2024
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
arXiv 2024
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation
arXiv 2024
Empowering Character-level Text Infilling by Eliminating Sub-Tokens
arXiv 2024
Stable Consistency Tuning: Understanding and Improving Consistency Models
arXiv 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
arXiv 2024
Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow
arXiv 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
arXiv 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
arXiv 2024
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
arXiv 2024
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
arXiv 2024
A3VLM: Actionable Articulation-Aware Vision Language Model
arXiv 2024
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
arXiv 2024
TerDiT: Ternary Diffusion Models with Transformers
arXiv 2024
I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow
arXiv 2024
MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More
arXiv 2024
DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition
arXiv 2024
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
arXiv 2024
Flowmind2Digital: The First Comprehensive Flowmind Recognition and Conversion Approach
arXiv 2024
Enhancing Vision-Language Model with Unmasked Token Alignment
arXiv 2024
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
arXiv 2024
DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds
ICCV 2023 1
ImageBind-LLM: Multi-modality Instruction Tuning
arXiv 2023
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
CVPR 2024 1
Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
arXiv 2023
Meta-Transformer: A Unified Framework for Multimodal Learning
arXiv 2023
Personalize Segment Anything Model with One Shot
arXiv 2023
VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation
ICCV 2023 1
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
arXiv 2023
GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding
ICCV 2023 1
NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space
ICCV 2023 1
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
arXiv 2023
SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification
arXiv 2023
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
arXiv 2023
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
arXiv 2023
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
arXiv 2023
Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising
arXiv 2023
Human Preference Score: Better Aligning Text-to-Image Models with Human Preference
ICCV 2023 1
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction
ICCV 2023 1
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
CVPR 2023 1
Unmasking Bias in Diffusion Model Training
arXiv 2023
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
CVPR 2023 1
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
ICCV 2023 1
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
CVPR 2023 1
Simulating Fluids in Real-World Still Images
ICCV 2023 1
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
arXiv 2022
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023 1
RBGNet: Ray-based Grouping for 3D Object Detection
CVPR 2022 1
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
arXiv 2021
Efficient Attention: Attention with Linear Complexities
arXiv 2018
StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks
arXiv 2017
Frequent co-authors
10 from 117 papers