Benchmarks like MMLU suggest flagship language models approach factuality saturation above 90%. LLMpedia shows this picture is incomplete. We materialize ~1.3M encyclopedia articles entirely from parametric memory across three model families, then audit every claim against Wikipedia and curated web evidence. For gpt-5-mini, the verifiable true rate is 68.4% on Wikipedia-covered subjects, more than 21 pp below MMLU, and the gap is driven by unverifiability (30.5%), not refutation (1.2%). Beyond Wikipedia, frontier articles audited against curated web evidence reach 57.6%; Wikipedia covers only 56.7% of model-surfaced subjects, and three model families overlap in just 7.3% of subject choices. In a retrieval-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia is more factual at roughly half the textual similarity to Wikipedia. Every prompt, article, and verdict is released. Data, code, interface: https://llmpedia.net.
LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale
- Year
- 2026
- Hosting
- Full text hosted (CC-BY-4.0)
Attribution
- Abstract & full text
- arxiv.org/abs/2603.24080 (CC-BY-4.0)
- TL;DR
- Semantic Scholar