OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may introduce reflection-induced biases and response templates that miscalibrate token-level supervision, ultimately harming the student's reasoning ability. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to calibrate privileged teacher logits. Specifically, OGLS-SD contrasts teacher logits induced by successful and failed on-policy trajectories, constructing an outcome-discriminative steering direction for token-level guidance. Experiments on mathematical reasoning benchmarks show that OGLS-SD stabilizes self-distillation and improves performance over standard OPSD and other variants.
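
As a rough illustration of the contrast step described above, the sketch below forms a steering direction from teacher logits conditioned on a successful and on a failed on-policy trajectory, then uses the steered logits as the distillation target. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names (steer_teacher_logits, token_kl), the scaling factor alpha, the choice of the success-conditioned logits as the base, and the forward KL loss are hypothetical and may differ from the actual OGLS-SD method.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the vocabulary axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def steer_teacher_logits(z_success, z_failure, alpha=1.0):
    # ASSUMPTION: the steering direction is the plain difference between
    # teacher logits induced by a successful and by a failed trajectory,
    # added to the success-conditioned logits with a hypothetical weight alpha.
    direction = z_success - z_failure
    return z_success + alpha * direction

def token_kl(teacher_logits, student_logits):
    # Per-token forward KL(teacher || student); ASSUMED loss, not confirmed by the source.
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

# Toy data: T token positions, vocabulary of size V (random stand-ins for real logits).
rng = np.random.default_rng(0)
T, V = 4, 6
z_success = rng.normal(size=(T, V))
z_failure = rng.normal(size=(T, V))
student = rng.normal(size=(T, V))

steered = steer_teacher_logits(z_success, z_failure, alpha=0.5)
loss = token_kl(steered, student).mean()
print(f"mean token-level KL: {loss:.4f}")
```

Under these assumptions, setting alpha to 0 recovers distillation toward the success-conditioned teacher alone, while larger values push the target further from tokens that the failure-conditioned teacher favors.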