We envision a proactive multi-modal assistant that gives users real-time, step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into \textbf{Pro\textsuperscript{2}Bench} under a unified proactive-guidance schema; (3) we propose a decoupled planner--interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama 4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and an open-weight baseline (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on OOP recovery.
Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance
- Year: 2026
- arxiv.org/abs/2606.04970 (CC-BY-NC-SA-4.0)