
Hengshuang Zhao


Attribution: affiliations and profile data from Semantic Scholar.

Authored papers

40

Utonia: Toward One Encoder for All Point Clouds

arXiv 2026

2026

Orient Anything V2: Unifying Orientation and Rotation Understanding

arXiv 2026

2026

FASTER: Rethinking Real-Time Flow VLAs

arXiv 2026

2026

HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation

arXiv 2026

2026

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

arXiv 2026

2026

Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

arXiv 2025

2025

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

arXiv 2025

2025

Depth Anything with Any Prior

arXiv 2025

2025

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

ICCV 2025

2025

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

arXiv 2025

2025

In Pursuit of Pixel Supervision for Visual Pre-training

arXiv 2025

2025

Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance

arXiv 2025

2025

PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning

arXiv 2025

2025

MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

arXiv 2025

2025

Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection

arXiv 2025

2025

Visual Spatial Tuning

arXiv 2025

2025

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

arXiv 2025

2025

ROSE: Remove Objects with Side Effects in Videos

arXiv 2025

2025

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

arXiv 2025

2025

TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization

arXiv 2025

2025

ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

arXiv 2025

2025

Depth Anything V2

arXiv 2024

2024

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

CVPR 2024

2024

LION: Linear Group RNN for 3D Object Detection in Point Clouds

arXiv 2024

2024

Liquid: Language Models are Scalable Multi-modal Generators

arXiv 2024

2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

CVPR 2025

2024

Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models

arXiv 2024

2024

SAM3D: Segment Anything in 3D Scenes

arXiv 2023

2023

PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm

arXiv 2023

2023

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

CVPR 2024

2023

Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

ICCV 2023

2023

Influencer Backdoor Attack on Semantic Segmentation

arXiv 2023

2023

Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

arXiv 2023

2023

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

arXiv 2023

2023

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

CVPR 2024

2023

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

arXiv 2023

2023

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

CVPR 2024

2023

$BT^2$: Backward-compatible Training with Basis Transformation

arXiv 2022

2022

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

CVPR 2021

2020

GridMask Data Augmentation

arXiv 2020

2020

Affiliations

No known affiliations.

Frequent co-authors

10

from 40 papers