Hengshuang Zhao
- Papers: 40
Authored papers (40)
Utonia: Toward One Encoder for All Point Clouds
arXiv 2026
Orient Anything V2: Unifying Orientation and Rotation Understanding
arXiv 2026
FASTER: Rethinking Real-Time Flow VLAs
arXiv 2026
HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
arXiv 2026
AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward
arXiv 2026
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
arXiv 2025
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
arXiv 2025
Depth Anything with Any Prior
arXiv 2025
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
ICCV 2025
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
arXiv 2025
In Pursuit of Pixel Supervision for Visual Pre-training
arXiv 2025
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
arXiv 2025
PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning
arXiv 2025
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
arXiv 2025
Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
arXiv 2025
Visual Spatial Tuning
arXiv 2025
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
arXiv 2025
ROSE: Remove Objects with Side Effects in Videos
arXiv 2025
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
arXiv 2025
TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization
arXiv 2025
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
arXiv 2025
Depth Anything V2
arXiv 2024
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
CVPR 2024
LION: Linear Group RNN for 3D Object Detection in Point Clouds
arXiv 2024
Liquid: Language Models are Scalable Multi-modal Generators
arXiv 2024
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
CVPR 2025
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
arXiv 2024
SAM3D: Segment Anything in 3D Scenes
arXiv 2023
PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
arXiv 2023
Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
CVPR 2024
Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning
ICCV 2023
Influencer Backdoor Attack on Semantic Segmentation
arXiv 2023
Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases
arXiv 2023
OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
arXiv 2023
UniPAD: A Universal Pre-training Paradigm for Autonomous Driving
CVPR 2024
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
arXiv 2023
DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
CVPR 2024
$BT^2$: Backward-compatible Training with Basis Transformation
arXiv 2022
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
CVPR 2021
GridMask Data Augmentation
arXiv 2020