Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Year: 2025
Abstract and full text: arxiv.org/abs/2512.23075

Abstract

Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position importance-ratio decomposition. Taking the minimum over all bounds yields the tightest known guarantee across all divergence regimes. Crucially, all bounds depend on the maximum token-level divergence $D_{\mathrm{KL}}^{\mathrm{tok,max}}$ (or $D_{\mathrm{TV}}^{\mathrm{tok,max}}$), a sequence-level quantity that cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences violating the trust region, enabling the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
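
The masking idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: that the full next-token distributions of both policies are available at every sampled position, that the divergence is taken as KL(π_roll || π_θ), that the maximum over sampled positions stands in for the maximum token-level divergence, and that `delta` is a hypothetical trust-region threshold. The function names are illustrative only.

```python
import math
from typing import List, Sequence

Dist = Sequence[float]  # one next-token distribution over a shared vocabulary


def token_kl(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) at a single position; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def max_token_kl(roll: Sequence[Dist], train: Sequence[Dist]) -> float:
    """Largest per-position KL(pi_roll || pi_theta) along one sampled sequence."""
    return max(token_kl(p, q) for p, q in zip(roll, train))


def trust_region_mask(
    batch_roll: Sequence[Sequence[Dist]],
    batch_train: Sequence[Sequence[Dist]],
    delta: float,
) -> List[float]:
    """Return 1.0 to keep a sequence, 0.0 to drop the whole sequence."""
    return [
        1.0 if max_token_kl(roll, train) <= delta else 0.0
        for roll, train in zip(batch_roll, batch_train)
    ]


def masked_mean(per_sequence_loss: Sequence[float], mask: Sequence[float]) -> float:
    """Average the surrogate loss over kept sequences only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0  # every sequence violated the trust region
    return sum(l * m for l, m in zip(per_sequence_loss, mask)) / kept


# Two toy sequences of length 2 over a 2-token vocabulary.
in_region = [[0.5, 0.5], [0.9, 0.1]]
drifted = [[0.5, 0.5], [0.1, 0.9]]  # second position has moved far from rollout
mask = trust_region_mask([in_region, in_region], [in_region, drifted], delta=0.05)
# mask == [1.0, 0.0]: one divergent token removes the entire second sequence
```

The decision is made once per sequence from the worst position, which is what distinguishes this from a per-token rule such as PPO clipping: a single token outside the trust region excludes every token of that sequence from the update.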