Capacity, Not Format: Rethinking Structured Reasoning Failures

Year: 2026
Abstract & full text: arxiv.org/abs/2606.09410

Abstract

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks, with 0% parse failures on successfully generated responses. We find that the cost of structured formats is capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: 88.7 ± 4.0% JSON vs. 89.3 ± 1.7% CoT on MATH-Hard). In contrast, structured formats severely degrade models operating near their limits, through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp (p < 0.0001), largely due to truncation. Second, even with extended budgets that eliminate truncation, GPT-4o-mini drops 28.0pp (p < 0.001), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar p < 0.0001) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON (-5.3pp; the displayed percentages are independently rounded, and the exact difference is 7/133 = 5.26pp ≈ 5.3pp). A delayed-structure ablation, in which the model reasons freely before formatting, recovers most of the lost accuracy (3-run mean: 80-87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
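
The rounding note in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only. The denominator 133 comes from the abstract's "7/133"; the correct-answer counts 128 and 121 are not stated in the abstract and are inferred here as the only integers out of 133 that round to 96.2% and 91.0%. The `mcnemar_exact` helper is a generic exact (binomial) McNemar test on paired outcomes; whether the paper uses the exact or the chi-squared variant is not stated, and the discordant counts passed to it in the usage lines are hypothetical.

```python
from math import comb


def pct(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items correct under condition A only; c: items correct under B only.
    Under the null, min(b, c) ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


TOTAL = 133          # from the abstract's "7/133"
COT_CORRECT = 128    # inferred (assumption): 128/133 rounds to 96.2%
JSON_CORRECT = 121   # inferred (assumption): 121/133 rounds to 91.0%

cot = pct(COT_CORRECT, TOTAL)
js = pct(JSON_CORRECT, TOTAL)
gap = cot - js       # exact gap is 7/133 of 100pp

print(f"CoT {cot:.1f}%  JSON {js:.1f}%  gap {gap:.2f}pp")
# prints: CoT 96.2%  JSON 91.0%  gap 5.26pp

# Hypothetical discordant counts, for illustration only: if all 7 lost
# problems were solved under CoT but not JSON, and none the other way.
print(mcnemar_exact(0, 7))
```

Subtracting the displayed percentages gives 5.2pp, while the unrounded gap is 5.26pp, which rounds to 5.3pp; this is the discrepancy the parenthetical in the abstract explains.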