Reconsidering Overthinking: Penalizing Internal and External Redundancy in CoT Reasoning

Large reasoning models (LRMs) often exhibit overthinking, producing verbose Chain-of-Thought (CoT) traces that increase inference cost and obscure the underlying reasoning process. Existing CoT compression methods mainly rely on global length rewards, which conflate necessary intermediate reasoning with redundant text and may therefore compromise reasoning fidelity. This paper revisits overthinking from a semantic-efficiency perspective and decomposes CoT redundancy into two distinct forms: internal redundancy, defined as informational stagnation before the first correct answer, and external redundancy, defined as superfluous continuation after the first correct answer. Based on this decomposition, we propose a dual-penalty reinforcement learning framework that separately optimizes reasoning progress and termination behavior. Specifically, a sliding-window semantic similarity metric penalizes low-progress reasoning segments, while a normalized external-redundancy metric discourages post-answer continuation. Experiments on GSM8K, MATH500, and AIME24 across different model scales show that our method reduces average reasoning length by 41.3% on the 1.5B model and 40.1% on the 7B model, while preserving competitive accuracy and achieving the best overall accuracy-efficiency score among evaluated baselines. The learned compression behavior further transfers to out-of-domain reasoning tasks, including GPQA and LiveCodeBench. More importantly, our analysis reveals a clear asymmetry between the two redundancy types: external redundancy can be largely removed with little performance loss, whereas internal redundancy compression follows a sensitive accuracy-efficiency trade-off. These results suggest that effective CoT compression should optimize semantic efficiency rather than sequence length alone, offering a principled route toward more concise, efficient, and interpretable LRMs.
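The two penalties can be illustrated with a minimal sketch. The abstract does not give the exact metric definitions, so everything below is an assumption made for illustration: the bag-of-words cosine stands in for a learned semantic embedding, the window size and the weights `alpha` and `beta` are hypothetical, and the function names are not taken from the paper. Internal redundancy is approximated as the mean, over reasoning steps up to the first correct answer, of each step's maximum similarity to the steps in a preceding sliding window. External redundancy is approximated as the fraction of tokens generated after the first correct answer.

```python
import math
from collections import Counter


def bow(text):
    """Toy stand-in for a sentence embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def internal_redundancy(steps, first_correct, window=2):
    """Mean of each pre-answer step's max similarity to the previous `window` steps.

    `first_correct` is the index of the step containing the first correct answer.
    High values indicate steps that add little beyond what was just said.
    """
    prefix = [bow(s) for s in steps[: first_correct + 1]]
    scores = []
    for i in range(1, len(prefix)):
        previous = prefix[max(0, i - window): i]
        scores.append(max(cosine(prefix[i], p) for p in previous))
    return sum(scores) / len(scores) if scores else 0.0


def external_redundancy(steps, first_correct):
    """Fraction of whitespace tokens that come after the first correct answer."""
    total = sum(len(s.split()) for s in steps)
    after = sum(len(s.split()) for s in steps[first_correct + 1:])
    return after / total if total else 0.0


def shaped_reward(is_correct, ir, er, alpha=0.5, beta=0.5):
    """Hypothetical dual-penalty reward: correctness minus weighted redundancy terms."""
    return float(is_correct) - alpha * ir - beta * er


if __name__ == "__main__":
    steps = [
        "2 apples plus 3 apples",
        "2 apples plus 3 apples",               # repeated step: internal redundancy
        "so the answer is 5",                   # first correct answer (index 2)
        "let me double check the answer is 5",  # post-answer continuation
    ]
    ir = internal_redundancy(steps, first_correct=2)
    er = external_redundancy(steps, first_correct=2)
    print(ir, er, shaped_reward(True, ir, er))
```

Keeping the two terms separate is what allows them to be weighted differently, which matters given the reported asymmetry: the external term can be penalized aggressively, while the internal term likely needs a more careful weight.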