Yilun Chen

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

arXiv 2025

StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

arXiv 2025

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

arXiv 2025

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

arXiv 2025

A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

CVPR 2025

Evolving Symbolic 3D Visual Grounder with Weakly Supervised Reflection

arXiv 2025

GRUtopia: Dream General Robots in a City at Scale

arXiv 2024

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

arXiv 2024

Grounded 3D-LLM with Referent Tokens

arXiv 2024

TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

arXiv 2024

What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

arXiv 2024

PointLLM: Empowering Large Language Models to Understand Point Clouds

arXiv 2023

FocalFormer3D: Focusing on Hard Instance for 3D Object Detection

arXiv 2023

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

arXiv 2023

VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection

arXiv 2023