The Consistency Dilemma in LLMs: Generator-Evaluator Agreement and Vulnerability to Mistakes

Large language models are increasingly deployed in agentic pipelines in which the model evaluates its own outputs without external verification. The reliability of these pipelines rests on an implicit assumption: that the model applies relevant concepts the same way when it generates an output and when it later evaluates that output. We propose a new measure, generator-evaluator self-consistency, to test this assumption directly, and apply it to 10 frontier models across 491 concepts. First, we find substantial variation in self-consistency. Second, in a clinical setting with physician-validated mistakes (Proniakin et al., 2025), we find that, across models, higher self-consistency is associated with greater vulnerability to mistakes. Thus, even models that apply concepts consistently may not be safe to deploy. This is evidence of a consistency dilemma in LLMs: self-consistency is operationally useful, but more consistent models are also more vulnerable to mistakes.