Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation.
To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 km²). ABot-N0 achieves new state-of-the-art (SOTA) performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
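The Flow Matching-based Action Expert mentioned in the abstract can be pictured with a generic sampling loop: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 to obtain continuous waypoints. The sketch below is only an illustration of that general technique, not ABot-N0's implementation. The abstract does not describe the network, its conditioning, or the action parameterization, so `toy_velocity`, `sample_trajectory`, the three waypoints, and the step count are all hypothetical. In particular, `toy_velocity` is a closed-form stand-in for a learned network and uses the velocity of a straight-line path toward a fixed target.

```python
import numpy as np

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line probability path x_t = (1 - t) * x_0 + t * x_1,
    the velocity that carries x_t to x_1 is (x_1 - x_t) / (1 - t).
    """
    return (target - x) / (1.0 - t)

def sample_trajectory(target, num_steps=10, seed=0):
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # initial sample is pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, target)
    return x

# Three made-up (x, y) waypoints, used only to exercise the loop.
target = np.array([[0.5, 0.0], [1.0, 0.1], [1.5, 0.3]])
waypoints = sample_trajectory(target)
```

In a real model the velocity would come from a trained network conditioned on observations and the reasoning module's output, and the target would not be known at sampling time; the fixed target here exists only so that the integration can be checked.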
ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation
TL;DR: Embodied navigation has long been fragmented by task-specific architectures.
- Year: 2026
- Venue: arXiv 2026
- Authors: 44
Attribution
- Abstract & full text: arxiv.org/abs/2602.11598
- TL;DR: Semantic Scholar
Authors
Xin Liu, Ye He, Jiawei Han, Liu Liu, Fei Liu, Di Yang, Kai Yang, Xu Chen, Zhengbo Wang, Fan Jiang, Mu Xu, Zedong Chu, Wei Guo, Tianlin Zhang, Guoqing Liu, Menglin Yang, Shichao Xie, Xiaolong Wu, Yanfen Shen, Minghua Luo, Xiaoxu Leng, Junjun Hu, Mingyang Yin, Jia Lu, Yingnan Guo, Yanqing Zhu, Yuxiang Zhao, Yirong Yang, Jiahang Wang, Yang Cai, Li Gao, Mingchao Sun, Chiyu Wang, Zhicheng Liu, Hongyu Pan, Honglin Han, Zhining Gu, Kuan Yang, Jianfang Zhang, Di Jing, Zihao Guan, Xiangpo Yang, Hongguang Xing, Weiguo Li