OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
25 Jun 2026
Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards offer little guidance on which intermediate decisions should be reinforced or suppressed.