Peng Gao

Papers
55

Authored papers

55

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

arXiv 2026

2026

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

arXiv 2026

2026

PRBench: End-to-end Paper Reproduction in Physics Research

arXiv 2026

2026

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

ICCV 2025

2025

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

ICCV 2025

2025

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

ICCV 2025

2025

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

arXiv 2025

2025

OmniCaptioner: One Captioner to Rule Them All

arXiv 2025

2025

Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

arXiv 2025

2025

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

arXiv 2025

2025

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis

arXiv 2025

2025

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

arXiv 2025

2025

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

CVPR 2025

2025

Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

arXiv 2025

2025

Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step

arXiv 2025

2025

TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving

arXiv 2025

2025

IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models

arXiv 2025

2025

Distribution Matching Distillation Meets Reinforcement Learning

arXiv 2025

2025

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

arXiv 2024

2024

Phased Consistency Models

arXiv 2024

2024

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

arXiv 2024

2024

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

arXiv 2024

2024

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

arXiv 2024

2024

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

arXiv 2024

2024

SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

arXiv 2024

2024

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

arXiv 2024

2024

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

arXiv 2024

2024

TerDiT: Ternary Diffusion Models with Transformers

arXiv 2024

2024

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

arXiv 2024

2024

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

arXiv 2024

2024

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

arXiv 2024

2024

ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

arXiv 2024

2024

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

arXiv 2024

2024

Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

arXiv 2024

2024

A3VLM: Actionable Articulation-Aware Vision Language Model

arXiv 2024

2024

I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow

arXiv 2024

2024

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

arXiv 2024

2024

Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models

arXiv 2024

2024

ImageBind-LLM: Multi-modality Instruction Tuning

arXiv 2023

2023

Personalize Segment Anything Model with One Shot

arXiv 2023

2023

Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

arXiv 2023

2023

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

arXiv 2023

2023

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

arXiv 2023

2023

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement

ICCV 2023

2023

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

CVPR 2023

2023

SUG: Single-dataset Unified Generalization for 3D Point Cloud Classification

arXiv 2023

2023

Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking

arXiv 2023

2023

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

arXiv 2023

2023

OneLLM: One Framework to Align All Modalities with Language

CVPR 2024

2023

You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction

arXiv 2022

2022

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

arXiv 2022

2022

PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

ICCV 2023

2022

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

ICCV 2023

2022

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

CVPR 2023

2022

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

arXiv 2021

2021

Affiliations

No known affiliations.

Frequent co-authors

10

from 55 papers