Large language models (LLMs) deployed for logical reasoning in knowledge-intensive domains exhibit a subtle but critical failure: coherence can be achieved vacuously through systematic abstention. A model that withholds commitment to both entailment and refutation satisfies negation consistency while providing no utility. We introduce Coherence Under Commitment (CUC), a dual-query evaluation paradigm that jointly measures consistency and decisiveness. CUC contributes three innovations: (1) a commitment score c(φ) = p(φ) + p(¬φ) that quantifies the probability mass allocated to decisive outcomes; (2) a deterministic elicitation protocol based on normalized YES/NO log probabilities, which eliminates sampling variance; and (3) a 3-way decision framework (True/False/Uncertain) that turns the coherence-commitment trade-off into concrete metrics. Experiments on four open-weight LLMs (1B-3B parameters) across 204 FOLIO examples expose a sharp frontier. Qwen2.5-3B achieves near-zero contradiction (E[v_neg] = 0.025) but only 7.4% coverage, while TinyLlama-1.1B reaches 79.4% coverage with violations on every example. Coherence-only evaluation would rank the abstaining model first; CUC exposes this ranking as vacuous, and the frontier generalizes to LogiQA v2 (ρ = 0.97). We argue that evaluation must report both coherence and non-vacuous commitment, and we release a toolkit for standardized assessment.
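As a rough illustration of the quantities named in the abstract, the sketch below computes a normalized YES probability from a pair of log probabilities, the commitment score c(φ) = p(φ) + p(¬φ), and a 3-way label. It is not the released toolkit: the function names, the threshold `tau`, and the exact decision rule are assumptions made for illustration, and the abstract does not specify how the paper maps probabilities to True/False/Uncertain.

```python
import math


def yes_probability(logp_yes: float, logp_no: float) -> float:
    """Normalize a YES/NO log-probability pair into p(YES).

    This is a two-way softmax; subtracting the max keeps exp() stable.
    """
    m = max(logp_yes, logp_no)
    e_yes = math.exp(logp_yes - m)
    e_no = math.exp(logp_no - m)
    return e_yes / (e_yes + e_no)


def commitment(p_phi: float, p_not_phi: float) -> float:
    """Commitment score c(phi) = p(phi) + p(not phi), as in the abstract."""
    return p_phi + p_not_phi


def decide(p_phi: float, p_not_phi: float, tau: float = 0.5) -> str:
    """Hypothetical 3-way rule (an assumption, not the paper's definition).

    Commit to a label only when exactly one of the two queries clears tau;
    both abstention and simultaneous endorsement map to "Uncertain".
    """
    if p_phi > tau and p_not_phi <= tau:
        return "True"
    if p_not_phi > tau and p_phi <= tau:
        return "False"
    return "Uncertain"


# Example: the model is asked the same question about phi and about not-phi.
p_phi = yes_probability(math.log(0.6), math.log(0.2))      # query on phi
p_not_phi = yes_probability(math.log(0.1), math.log(0.7))  # query on not-phi
label = decide(p_phi, p_not_phi)
score = commitment(p_phi, p_not_phi)
```

Under this rule, a model that assigns low probability to both φ and ¬φ is never contradictory but always returns "Uncertain", which is the vacuous coherence the abstract describes; reporting `commitment` alongside a consistency measure makes that case visible.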
Coherence Under Commitment: Probing Generalization and Vacuous Memorization in LLM Logical Reasoning
- Year: 2026
- arxiv.org/abs/2606.21083 (CC-BY-4.0)