RoboBERT: An End-to-end Multimodal Robotic Manipulation Model

Embodied intelligence seamlessly integrates vision, language, and action. However, most multimodal robotic models rely on massive fine-tuning, incurring high time and hardware costs. To address this, we introduce RoboBERT, an end-to-end multimodal manipulation model built around a novel two-stage training paradigm. In the first stage, we freeze most of the vision encoder and train with a single "standard" instruction phrasing, allowing the model to focus on stable policy learning via a CNN-based diffusion policy. In the second stage, we unfreeze all modules and inject diverse natural language variants, rapidly aligning varied instructions to the already-learned policy without destabilizing performance. We further employ systematic data augmentations to enhance robustness against visual perturbations. Without relying on auxiliary datasets, RoboBERT achieves new state-of-the-art (SOTA) mean episode lengths of 4.52 on the CALVIN ABCD-D benchmark and 3.79 on the ABC-D benchmark using only language-labeled expert demonstrations and a comparatively lightweight architecture. Real-robot trials on a 6-DOF manipulator confirm higher success rates than comparable methods trained on identical data. These results demonstrate that our data-augmentation-enhanced two-stage training paradigm delivers efficient, scalable, and broadly applicable performance for multimodal robotic systems.
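The two-stage schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the module names, the `INSTRUCTIONS` table, and the helpers `stage_config` and `pick_instruction` are hypothetical, and the sketch assumes that "most of the vision encoder" means a frozen backbone and that stage two samples from the standard phrasing together with its paraphrases.

```python
import random
from dataclasses import dataclass

# Hypothetical module names; the abstract does not list the architecture's parts.
MODULES = ("vision_backbone", "vision_head", "language_encoder", "diffusion_policy")

# Hypothetical instruction data: one "standard" phrasing plus paraphrases per task.
INSTRUCTIONS = {
    "open_drawer": {
        "standard": "open the drawer",
        "variants": ["pull the drawer open", "slide the drawer out"],
    },
}


@dataclass(frozen=True)
class StageConfig:
    trainable: frozenset   # names of modules that receive gradient updates
    use_variants: bool     # whether paraphrased instructions are sampled


def stage_config(stage: int) -> StageConfig:
    """Return which modules train and which phrasings are used in a stage."""
    if stage == 1:
        # Assumption: "most of the vision encoder" is modeled as its backbone.
        return StageConfig(frozenset(MODULES) - {"vision_backbone"}, False)
    if stage == 2:
        # Stage two unfreezes every module and enables language variants.
        return StageConfig(frozenset(MODULES), True)
    raise ValueError(f"unknown stage: {stage}")


def pick_instruction(task: str, stage: int, rng: random.Random) -> str:
    """Choose the instruction text paired with a demonstration of `task`."""
    entry = INSTRUCTIONS[task]
    if not stage_config(stage).use_variants:
        return entry["standard"]
    # Assumption: the standard phrasing stays in the stage-two sampling pool.
    return rng.choice([entry["standard"], *entry["variants"]])
```

In a real training loop, `stage_config(stage).trainable` would decide which parameters are handed to the optimizer, and `pick_instruction` would be called once per sampled demonstration.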

