Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

Latent action models (LAMs) offer a promising path to pre-training embodied agents on large amounts of action-free video. They infer latent actions between consecutive observations that can later be decoded to ground-truth actions using a small number of labels. However, recent work has shown that this recipe fails in the presence of action-correlated visual distractors common in real-world video, such as dynamic backgrounds, camera shake, or other moving objects. In these scenarios, the standard reconstruction objective drives latent actions to encode exogenous motion instead of agent-controlled dynamics, resulting in policies that underperform when fine-tuned. We observe, however, that endogenous and exogenous factors are typically spatially separated in pixel space: control-relevant change is concentrated on the agent, while distractor motion occurs elsewhere. We exploit this observation by restricting the reconstruction objective to agent pixels, forcing latent actions to explain agent-controlled dynamics rather than exogenous ones. We call this method MaskLAM; it obtains the agent mask zero-shot from off-the-shelf segmentation foundation models (e.g., SAM) and requires no architectural changes, auxiliary losses, or action labels during pre-training. Across two continuous-control benchmarks (Distracting Control Suite, Distracting Meta-World), MaskLAM reduces normalized linear-probe MSE by up to $3.51\times$ and improves normalized return by up to $4.97\times$ over LAPO, while narrowing the gap to LAOM-Labels, which relies on ground-truth action supervision.