Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

Year: 2026
Hosting: full text hosted, CC-BY-4.0

Abstract & full text: arxiv.org/abs/2606.02628 (CC-BY-4.0)

Abstract

We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three 7B–8B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in 4-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves 0.904–1.000 AUROC on held-out splits, while sampling-based detectors do not exceed 0.541 AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than 0.01 AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks: blocks 13–18 of 32 for Llama and Mistral, and blocks 19–25 of 28 for Qwen. First-block attention entropy provides a complementary signal in knowledge-grounded settings (0.866–0.941 AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single 8 GB GPU.
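
To make the probing setup concrete, the following is a minimal sketch of a linear probe scored by held-out AUROC. It is not the released code: the "hidden states" are synthetic Gaussian vectors whose two classes differ along one fixed direction, standing in for activations extracted from a single mid-network layer, and all function names (`make_synthetic_states`, `train_linear_probe`, `probe_score`, `auroc`) are hypothetical. The probe is plain logistic regression trained by batch gradient descent, written in pure Python so it runs without dependencies.

```python
import math
import random


def make_synthetic_states(n, dim, shift=2.0, seed=0):
    """Synthetic stand-in for one layer's hidden states (an assumption,
    not real model activations): unit Gaussian noise plus a class-dependent
    offset of +/- shift along a fixed unit direction."""
    rng = random.Random(seed)
    direction = [1.0 / math.sqrt(dim)] * dim
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so classes are balanced
        sign = 1.0 if label else -1.0
        X.append([rng.gauss(0.0, 1.0) + sign * shift * u for u in direction])
        y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Linear probe logit: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit logistic regression by full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(probe_score(w, b, x)) - t
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) identity, averaging tied ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank over the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Train on one synthetic split, evaluate on a separately seeded held-out split.
X_train, y_train = make_synthetic_states(200, 16, seed=0)
X_test, y_test = make_synthetic_states(200, 16, seed=1)
w, b = train_linear_probe(X_train, y_train)
scores = [probe_score(w, b, x) for x in X_test]
print(f"held-out AUROC: {auroc(scores, y_test):.3f}")
```

In a real replication, `X_train` and `X_test` would hold the per-example hidden-state vectors from the chosen layer, and the same loop would be repeated per layer to locate the peak probing depth.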