Test-time scaling, primarily manifested through multi-step Chain-of-Thought (CoT) reasoning via Reinforcement Learning (RL), has emerged as a pivotal paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists: traditional token-level analysis fails to capture the macroscopic dynamics of reasoning-level scaling. To address this, we introduce CoT-Space, a novel theoretical framework that recasts the reasoning process from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. By modeling the reasoning trajectory from both noise and risk perspectives and revisiting foundational principles from classical learning theory, we demonstrate that the observed convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. We further use RL as a tool to elicit and verify these results in our experiments. Our findings provide a mechanistic explanation for internal test-time scaling via RL, offering a principled theoretical foundation for optimizing reasoning trajectories in modern LLMs.
CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning
- Year: 2025
- Hosting: full text hosted, CC-BY-4.0
Attribution
- Abstract & full text: arxiv.org/abs/2509.04027 (CC-BY-4.0)
- TL;DR: Semantic Scholar