Peng Gao
- Papers: 55
Authored papers: 55
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
arXiv 2026
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
arXiv 2026
PRBench: End-to-end Paper Reproduction in Physics Research
arXiv 2026
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
ICCV 2025
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
ICCV 2025
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
ICCV 2025
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
arXiv 2025
OmniCaptioner: One Captioner to Rule Them All
arXiv 2025
Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield
arXiv 2025
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
arXiv 2025
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
arXiv 2025
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
arXiv 2025
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
CVPR 2025 1
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
arXiv 2025
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
arXiv 2025
TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
arXiv 2025
IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
arXiv 2025
Distribution Matching Distillation Meets Reinforcement Learning
arXiv 2025
Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
arXiv 2024
Phased Consistency Models
arXiv 2024
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
arXiv 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
arXiv 2024
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
arXiv 2024
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
arXiv 2024
SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners
arXiv 2024
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
arXiv 2024
ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models
arXiv 2024
TerDiT: Ternary Diffusion Models with Transformers
arXiv 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
arXiv 2024
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
arXiv 2024
UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
arXiv 2024
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
arXiv 2024
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
arXiv 2024
Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding
arXiv 2024
A3VLM: Actionable Articulation-Aware Vision Language Model
arXiv 2024
I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow
arXiv 2024
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
arXiv 2024
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models
arXiv 2024
ImageBind-LLM: Multi-modality Instruction Tuning
arXiv 2023
Personalize Segment Anything Model with One Shot
arXiv 2023
Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
arXiv 2023
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
arXiv 2023
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
arXiv 2023
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
ICCV 2023 1
Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
CVPR 2023 1
SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification
arXiv 2023
Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking
arXiv 2023
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
arXiv 2023
OneLLM: One Framework to Align All Modalities with Language
CVPR 2024 1
You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction
arXiv 2022
UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
arXiv 2022
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
ICCV 2023 1
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
ICCV 2023 1
Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
CVPR 2023 1
CLIP-Adapter: Better Vision-Language Models with Feature Adapters
arXiv 2021
Frequent co-authors: 10 (from 55 papers)