Lei Zhang
- Papers
- 130
Authored papers
Qwen3-Coder-Next Technical Report
arXiv 2026
Multimodal OCR: Parse Anything from Documents
arXiv 2026
AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context
arXiv 2026
ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
arXiv 2026
Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
arXiv 2026
ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas
arXiv 2026
Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis
arXiv 2026
D$^3$R-DETR: DETR with Dual-Domain Density Refinement for Tiny Object Detection in Aerial Images
arXiv 2026
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
arXiv 2026
AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation
arXiv 2026
Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training
arXiv 2026
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
arXiv 2025
Detect Anything via Next Point Prediction
arXiv 2025
MiniCPM4: Ultra-Efficient LLMs on End Devices
arXiv 2025
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
arXiv 2025
Personalized Image Generation with Deep Generative Models: A Decade Survey
arXiv 2025
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
arXiv 2025
Referring to Any Person
ICCV 2025
No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves
arXiv 2025
Rex-Thinker: Grounded Object Referring via Chain-of-Thought Reasoning
arXiv 2025
RORem: Training a Robust Object Remover with Human-in-the-Loop
CVPR 2025
Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution
ICCV 2025
Toward Generalized Image Quality Assessment: Relaxing the Perfect Reference Quality Assumption
CVPR 2025
TIIF-Bench: How Does Your T2I Model Follow Your Instructions?
arXiv 2025
IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property
arXiv 2025
MediAug: Exploring Visual Augmentation in Medical Imaging
arXiv 2025
D$^2$iT: Dynamic Diffusion Transformer for Accurate Image Generation
arXiv 2025
HCMA: Hierarchical Cross-model Alignment for Grounded Text-to-Image Generation
arXiv 2025
SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
arXiv 2025
PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models
arXiv 2025
SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
arXiv 2025
SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation
arXiv 2025
Baichuan-Omni-1.5 Technical Report
arXiv 2025
Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models
ICCV 2025
HumanMM: Global Human Motion Recovery from Multi-shot Videos
CVPR 2025
Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data
CVPR 2025
Prompt-Free Conditional Diffusion for Multi-object Image Augmentation
arXiv 2025
ProjectedEx: Enhancing Generation in Explainable AI for Prostate Cancer
arXiv 2025
MedConv: Convolutions Beat Transformers on Long-Tailed Bone Density Prediction
arXiv 2025
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
arXiv 2025
LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls
arXiv 2025
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
arXiv 2025
InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction
arXiv 2025
Evaluating and Aligning CodeLLMs on Human Preference
arXiv 2024
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
arXiv 2024
DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
arXiv 2024
SkillMimic: Learning Basketball Interaction Skills from Demonstrations
CVPR 2025
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
arXiv 2024
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
arXiv 2024
Adversarial Diffusion Compression for Real-World Image Super-Resolution
CVPR 2025
T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
arXiv 2024
ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data
arXiv 2024
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
arXiv 2024
TokenPacker: Efficient Visual Projector for Multimodal LLM
arXiv 2024
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
arXiv 2024
Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models
arXiv 2024
YuLan: An Open-source Large Language Model
arXiv 2024
PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation
arXiv 2024
CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility
arXiv 2024
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
arXiv 2024
Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion
arXiv 2024
A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
arXiv 2024
DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion
arXiv 2024
MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation
arXiv 2024
Autoregressive Pretraining with Mamba in Vision
arXiv 2024
ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention
arXiv 2024
LAPT: Label-driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
arXiv 2024
An Open-World, Diverse, Cross-Spatial-Temporal Benchmark for Dynamic Wild Person Re-Identification
arXiv 2024
Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data
arXiv 2024
Symbol as Points: Panoptic Symbol Spotting via Point-based Representation
arXiv 2024
ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation
arXiv 2024
Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models
arXiv 2024
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception
arXiv 2024
FreCaS: Efficient Higher-Resolution Image Generation via Frequency-aware Cascaded Sampling
arXiv 2024
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models
arXiv 2024
Multi-fingered Robotic Hand Grasping in Cluttered Environments through Hand-object Contact Semantic Mapping
arXiv 2024
SymPoint Revolutionized: Boosting Panoptic Symbol Spotting with Layer Feature Enhancement
arXiv 2024
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
arXiv 2024
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
arXiv 2023
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution
CVPR 2024
Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset
NeurIPS 2023
detrex: Benchmarking Detection Transformers
arXiv 2023
Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution
arXiv 2023
Improving the Stability and Efficiency of Diffusion Models for Content Consistent Super-Resolution
arXiv 2023
SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation
A Benchmark for Chinese-English Scene Text Image Super-resolution
ICCV 2023
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
arXiv 2023
X-Pose: Detecting Any Keypoints
arXiv 2023
A Simple Framework for Open-Vocabulary Segmentation and Detection
ICCV 2023
Detection Transformer with Stable Matching
ICCV 2023
ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
ICCV 2023
Open-Set Image Tagging with Multi-Grained Text Supervision
arXiv 2023
Recognize Anything: A Strong Image Tagging Model
arXiv 2023
Semantic-SAM: Segment and Recognize Anything at Any Granularity
arXiv 2023
Pixel-Aware Stable Diffusion for Realistic Image Super-resolution and Personalized Stylization
arXiv 2023
Osprey: Pixel Understanding with Visual Instruction Tuning
CVPR 2024
One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer
CVPR 2023
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
arXiv 2023
HumanTOMATO: Text-aligned Whole-body Motion Generation
arXiv 2023
Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes
CVPR 2023
MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks
arXiv 2023
One-Shot Learning as Instruction Data Prospector for Large Language Models
arXiv 2023
Neural Interactive Keypoint Detection
ICCV 2023
MSF: Motion-guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences
CVPR 2023
Marathon: A Race Through the Realm of Long Context with Large Language Models
arXiv 2023
HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation
ICCV 2023
Visual In-Context Prompting
CVPR 2024
CORE: Cooperative Reconstruction for Multi-Agent Perception
ICCV 2023
Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation
ICCV 2023
DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation
Generative Action Description Prompts for Skeleton-based Action Recognition
ICCV 2023
DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR
Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization
CVPR 2022
Dense Learning based Semi-Supervised Object Detection
CVPR 2022
Mind the Gap: Polishing Pseudo labels for Accurate Semi-supervised Object Detection
arXiv 2022
CvT: Introducing Convolutions to Vision Transformers
ICCV 2021
GAN Prior Embedded Network for Blind Face Restoration in the Wild
CVPR 2021
Lite-HRNet: A Lightweight High-Resolution Network
CVPR 2021
Dynamic Head: Unifying Object Detection Heads with Attentions
CVPR 2021
Image Scene Graph Generation (SGG) Benchmark
arXiv 2021
VinVL: Revisiting Visual Representations in Vision-Language Models
CVPR 2021
Variational Attention: Propagating Domain-Specific Knowledge for Multi-Domain Learning in Crowd Counting
ICCV 2021
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
ECCV 2020
Blind Face Restoration via Deep Multi-scale Component Dictionaries
ECCV 2020
Unified Vision-Language Pre-Training for Image Captioning and VQA
arXiv 2019
Toward Convolutional Blind Denoising of Real Photographs
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising
arXiv 2016
MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition
arXiv 2016