We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that enables a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
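One way to picture the unified, embodiment-aware action interface mentioned in the abstract is a shared action vector in which each robot fills only the slots it owns, with a mask marking which slots are valid. The sketch below is illustrative only: the names `Embodiment`, `to_unified`, `from_unified`, the slot assignments, and the value of `UNIFIED_DIM` are all assumptions and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical layout: every embodiment writes into one shared action vector.
UNIFIED_DIM = 8  # assumed size, not taken from the paper


@dataclass(frozen=True)
class Embodiment:
    name: str
    slots: tuple  # indices of the unified vector this robot actually uses


def to_unified(emb, native_action):
    """Scatter a robot's native action into the shared vector and build a mask."""
    if len(native_action) != len(emb.slots):
        raise ValueError("native action length does not match embodiment slots")
    action = [0.0] * UNIFIED_DIM
    mask = [0] * UNIFIED_DIM
    for slot, value in zip(emb.slots, native_action):
        action[slot] = float(value)
        mask[slot] = 1  # 1 marks a slot that is meaningful for this robot
    return action, mask


def from_unified(emb, unified_action):
    """Gather only the slots this embodiment owns, ignoring the padding."""
    return [unified_action[slot] for slot in emb.slots]


# Two made-up embodiments sharing the same interface.
arm = Embodiment("fixed_base_arm", slots=(0, 1, 2, 3))
mobile = Embodiment("mobile_manipulator", slots=(0, 1, 2, 3, 6, 7))

unified, mask = to_unified(arm, [0.1, -0.2, 0.3, 1.0])
```

Under this assumed design, a single policy head can always emit `UNIFIED_DIM` values, while the mask lets a training loss or a controller ignore slots that a given robot does not have.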
Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
- Year: 2026
- Venue: arXiv 2026
- Authors: 28
Attribution
- Abstract & full text: arxiv.org/abs/2602.00919
- TL;DR: Semantic Scholar
Authors
I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev, I. Zorin, A. Letkin, E. Rusakov, A. Silchenko, V. Vorobyov, S. Sobolnikov, A. Postnikov