Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint families, with minimal pairs and explicitness gradients. We pair HOB with a falsifiable behavioral characterization following a diagnose-measure-bridge-treat arc. Causal-behavioral analysis of the car wash problem across six models reveals context-independent sigmoid heuristics: the distance cue has 8.7 to 38 times more influence than the goal, and attribution better matches keyword association than compositional inference. Across 14 models, strict 10/10 evaluation shows that no model exceeds 75%, and presence constraints are hardest at 44%. A minimal hint improves performance by 15 pp, suggesting a constraint-inference failure rather than missing knowledge. However, 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, revealing conservative bias. A thinking-mode ablation on Gemini 3.1 Pro drops performance from 74.6% with thinking on to 58.4% with thinking off, while explicit goal decomposition recovers it to 71.2%. Thus, internal deliberation does useful work, and explicit prompting can partially substitute for it. Reasoning models do not categorically outperform non-reasoning peers: after controlling for capability rank, the residual reasoning-mode effect is 1.8 pp and is not significant. Parametric probes show that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics. Goal-decomposition prompting improves performance by 5.0 pp, compared with 3.1 pp for generic chain-of-thought, isolating constraint enumeration as the active ingredient. Overall, heuristic override is a systematic reasoning vulnerability with a quantified locus in inference order, not knowledge, and a tested intervention.
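The thinking-mode ablation figures above can be turned into percentage-point differences with simple arithmetic. The sketch below is illustrative only: it uses the three accuracies quoted in the abstract, the function and variable names are hypothetical, and it assumes the 71.2% figure was measured with thinking off plus goal-decomposition prompting, which the abstract implies but does not state outright.

```python
# Accuracies quoted in the abstract for the Gemini 3.1 Pro ablation, in percent.
THINKING_ON = 74.6
THINKING_OFF = 58.4
# Assumption: measured with thinking off plus explicit goal decomposition.
GOAL_DECOMPOSITION = 71.2


def ablation_summary(on: float, off: float, prompted: float) -> dict:
    """Return the drop, the recovery, and the share of the drop recovered."""
    drop = on - off              # loss from disabling thinking, in pp
    recovered = prompted - off   # gain from goal-decomposition prompting, in pp
    return {
        "drop_pp": round(drop, 1),
        "recovered_pp": round(recovered, 1),
        "share_recovered": round(recovered / drop, 2),
    }


summary = ablation_summary(THINKING_ON, THINKING_OFF, GOAL_DECOMPOSITION)
print(summary)
```

Under that assumption, disabling thinking costs 16.2 pp and goal decomposition wins back 12.8 pp, roughly four fifths of the loss, which is consistent with the abstract's claim that explicit prompting can partially substitute for internal deliberation.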
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
- Year
- 2026
- Hosting
- Full text hosted (CC-BY-4.0)
Attribution
- Abstract & full text
- arxiv.org/abs/2603.29025 (CC-BY-4.0)
- TL;DR
- Semantic Scholar