CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

Reasoning models often generate far more tokens than a task requires, which raises inference cost and can compound errors. We introduce CRISP (Compressed Reasoning via Iterative Self-Policy Distillation), an on-policy self-distillation method that teaches a model to reason more concisely by distilling its own concise behavior back into itself. The method uses a single idea: condition the same model on a "be concise" instruction to obtain teacher logits, then minimize the per-token reverse KL divergence between the student and this teacher on the student's own rollouts. It requires no ground-truth answers, no token budgets, and no difficulty estimators. The reverse-KL objective is naturally difficulty-adaptive: it compresses easy problems aggressively while preserving the reasoning steps that hard problems require. On Qwen3-14B, CRISP cuts reasoning length by up to 56% on MATH-500 and 38% on the harder AIME 2024, while improving MATH-500 accuracy by up to 3.3 points over the base model and holding AIME 2024 accuracy within about one point. This behavior generalizes across model sizes and families: Qwen3-8B shows the same compression with accuracy preserved, and DeepSeek-R1-Distill-Llama-8B improves accuracy on all five benchmarks while shortening its responses. General capabilities are preserved across all three models. Code is available at https://github.com/HJSang/OPSD_Reasoning_Compression.
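The objective described above can be illustrated with a minimal sketch of the per-token reverse KL term, KL(student || teacher), computed along a student rollout. This is not the authors' implementation. It is a pure-Python toy that assumes the teacher logits were obtained by scoring the same tokens with the same model under an added conciseness instruction, and that the teacher is treated as a constant. The function names, the toy logits, and the choice to average over tokens are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl_per_token(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> List[float]:
    """KL(student || teacher) at each position of a student rollout.

    Each argument holds one logit vector per generated token. The teacher
    logits are assumed (hypothetically) to come from the same model with a
    conciseness instruction added to the prompt.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student distribution
        log_q = log_softmax(t_row)  # teacher distribution, treated as constant
        # Expectation under the student of (log p - log q): the reverse KL.
        losses.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return losses


def sequence_loss(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> float:
    """Mean per-token reverse KL over the rollout (averaging is an assumption)."""
    per_token = reverse_kl_per_token(student_logits, teacher_logits)
    return sum(per_token) / len(per_token)


# Toy rollout: 2 generated tokens, vocabulary of 3 (made-up numbers).
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher = [[2.0, 0.5, 0.1], [1.5, 0.0, 0.0]]
per_token = reverse_kl_per_token(student, teacher)
loss = sequence_loss(student, teacher)
```

In the toy rollout, the first position contributes essentially zero loss because student and teacher agree there, while the second contributes a positive penalty because they disagree. This is one way to see how such an objective can leave some tokens untouched while pushing on others. In a real training loop the same quantity would be computed on framework tensors with gradients flowing only through the student.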
