Ping Luo
- Papers: 97
Authored papers
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
arXiv 2026
DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search
arXiv 2026
χ₀: Resource-Aware Robust Manipulation via Taming Distributional Inconsistencies
arXiv 2026
Is Diversity All You Need for Scalable Robotic Manipulation?
arXiv 2025
OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
arXiv 2025
DanceGRPO: Unleashing GRPO on Visual Generation
arXiv 2025
RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
arXiv 2025
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
arXiv 2025
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
arXiv 2025
Aligning Latent Spaces with Flow Priors
arXiv 2025
PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
arXiv 2025
MM-ACT: Learn from Multimodal Parallel Generation to Act
arXiv 2025
Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback
arXiv 2025
Fast-dLLM v2: Efficient Block-Diffusion LLM
arXiv 2025
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
arXiv 2025
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
arXiv 2025
INT v.s. FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
arXiv 2025
Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition
arXiv 2025
OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
arXiv 2025
PixelFlow: Pixel-Space Generative Models with Flow
arXiv 2025
TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
arXiv 2025
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
arXiv 2025
Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training
arXiv 2025
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models
arXiv 2024
Autoregressive Models in Vision: A Survey
arXiv 2024
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
arXiv 2024
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
arXiv 2024
End-to-End Autonomous Driving through V2X Cooperation
arXiv 2024
GenAD: Generalized Predictive Model for Autonomous Driving
CVPR 2024 1
Needle In A Multimodal Haystack
arXiv 2024
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
arXiv 2024
Learning Manipulation by Predicting Interaction
arXiv 2024
HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model
arXiv 2024
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
CVPR 2024 1
LLaMA Pro: Progressive LLaMA with Block Expansion
arXiv 2024
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
arXiv 2024
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
arXiv 2024
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model
arXiv 2024
PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization
arXiv 2024
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
arXiv 2024
Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation
arXiv 2024
Adapting LLaMA Decoder to Vision Transformer
arXiv 2024
IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model
arXiv 2024
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
arXiv 2024
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
arXiv 2024
Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking
arXiv 2024
ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning
arXiv 2024
AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks
arXiv 2024
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
arXiv 2024
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
CVPR 2024 1
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
arXiv 2024
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
arXiv 2024
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
arXiv 2024
HRVMamba: High-Resolution Visual State Space Model for Dense Prediction
arXiv 2024
Scene as Occupancy
ICCV 2023 1
OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping
Graph-based Topology Reasoning for Driving Scenes
arXiv 2023
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
NeurIPS 2023 11
A Survey of Reasoning with Foundation Models
arXiv 2023
VDT: General-purpose Video Diffusion Transformers via Mask Modeling
arXiv 2023
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024 1
Going Denser with Open-Vocabulary Part Segmentation
ICCV 2023 1
GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
arXiv 2023
Video Understanding with Large Language Models: A Survey
arXiv 2023
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
arXiv 2023
MotionCtrl: A Unified and Flexible Motion Controller for Video Generation
arXiv 2023
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
arXiv 2023
MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation
arXiv 2023
RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs
arXiv 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
arXiv 2023
DDP: Diffusion Model for Dense Visual Prediction
ICCV 2023 1
V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting
CVPR 2023 1
DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
ICCV 2023 1
MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer Vision
arXiv 2023
AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
arXiv 2023
MLLMs-Augmented Visual-Language Representation Learning
arXiv 2023
EGC: Image Generation and Classification via a Diffusion Energy-Based Model
ICCV 2023 1
Beyond One-to-One: Rethinking the Referring Image Segmentation
ICCV 2023 1
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
CVPR 2023 1
You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
arXiv 2023
Context Autoencoder for Self-Supervised Representation Learning
arXiv 2022
DiffusionDet: Diffusion Model for Object Detection
ICCV 2023 1
AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition
arXiv 2022
DaViT: Dual Attention Vision Transformers
arXiv 2022
Language as Queries for Referring Video Object Segmentation
CVPR 2022 1
Large-batch Optimization for Dense Visual Predictions
arXiv 2022
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
CVPR 2023 1
ByteTrack: Multi-Object Tracking by Associating Every Detection Box
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
NeurIPS 2021 12
DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion
CVPR 2022 1
PVT v2: Improved Baselines with Pyramid Vision Transformer
arXiv 2021
Panoptic SegFormer: Delving Deeper into Panoptic Segmentation with Transformers
CVPR 2022 1
FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation
arXiv 2021
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals
CVPR 2021 1
MaskGAN: Towards Diverse and Interactive Facial Image Manipulation
Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net
Spatial As Deep: Spatial CNN for Traffic Scene Understanding
arXiv 2017