Hang Xu

DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning

arXiv 2025

SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

arXiv 2025

FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors

ICCV 2025

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

arXiv 2025

ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

arXiv 2025

ACE: Anti-Editing Concept Erasure in Text-to-Image Models

CVPR 2025

Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

arXiv 2025

Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?

arXiv 2025

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

arXiv 2024

Collaborative Novel Object Discovery and Box-Guided Cross-Modal Alignment for Open-Vocabulary 3D Object Detection

arXiv 2024

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

CVPR 2025

Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion

arXiv 2024

HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

arXiv 2024

OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping

Graph-based Topology Reasoning for Driving Scenes

arXiv 2023

A Survey on Video Diffusion Models

arXiv 2023

Baichuan 2: Open Large-scale Language Models

arXiv 2023

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

arXiv 2023

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

arXiv 2023

PARTNER: Level up the Polar Representation for LiDAR 3D Object Detection

ICCV 2023

Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models

arXiv 2023