Tai Wang

Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

arXiv 2025

LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry

arXiv 2025

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

arXiv 2025

InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

arXiv 2025

G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

arXiv 2025

VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs

arXiv 2025

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

arXiv 2025

StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

arXiv 2025

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

arXiv 2025

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

arXiv 2025

Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection

arXiv 2025

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

arXiv 2025

GRUtopia: Dream General Robots in a City at Scale

arXiv 2024

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

arXiv 2024

Grounded 3D-LLM with Referent Tokens

arXiv 2024

GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction

CVPR 2024

Omni6D: Large-Vocabulary 3D Object Dataset for Category-Level 6D Object Pose Estimation

arXiv 2024

Scene as Occupancy

ICCV 2023

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

arXiv 2023

PointLLM: Empowering Large Language Models to Understand Point Clouds

arXiv 2023

GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

ICCV 2023

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

CVPR 2023

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

arXiv 2023