Hao Dong

To Trust Or Not To Trust Your Vision-Language Model's Prediction

arXiv 2025

Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation

arXiv 2025

Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface

arXiv 2025

TwinAligner: Visual-Dynamic Alignment Empowers Physics-aware Real2Sim2Real for Robotic Manipulation

arXiv 2025

Adapting Vision-Language Models Without Labels: A Comprehensive Survey

arXiv 2025

SpatialBot: Precise Spatial Understanding with Vision Language Models

arXiv 2024

Learning Manipulation by Predicting Interaction

arXiv 2024

A3VLM: Actionable Articulation-Aware Vision Language Model

arXiv 2024

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

arXiv 2024

MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities

arXiv 2024

A Survey of Reasoning with Foundation Models

arXiv 2023

Personalize Segment Anything Model with One Shot

arXiv 2023

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

arXiv 2023

SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization

Leveraging SE(3) Equivariance for Learning 3D Geometric Shape Assembly

ICCV 2023