Factual Recall

Category: language
Slug: factual-recall
Evals: 19
Tools: 48
Models: 524
Papers: 11

Evals testing this capability

AA-Omniscience

Artificial Analysis

Artificial Analysis's broad-knowledge benchmark - thousands of curated factual questions spanning specialized domains - designed to test hallucination calibration.

Active, Factual Recall, Hallucination

APEX

Mercor

Mercor's expert-graded eval - domain experts (doctors, lawyers, engineers) grade model responses on long-form professional tasks they would actually be paid to do.

Active, Instruction Following, Factual Recall, LLM Judging

ARC (AI2 Reasoning Challenge)

Allen Institute for AI (Ai2)

Grade-school science multiple-choice questions (Easy and Challenge sets) drawn from US standardized tests - an early language-understanding benchmark.

Saturated, Scientific Reasoning, Factual Recall, Science

BIG-Bench

Google DeepMind

204 diverse tasks contributed by 450 researchers at 132 institutions - the original "test everything" LLM benchmark.

Saturated, Factual Recall, Planning, Math

GDPval

OpenAI

OpenAI's economic-impact eval - 220 expert-curated tasks weighted by US-GDP contribution across 44 occupations, evaluating whether models can do real white-collar work.

Active, Factual Recall, Instruction Following, Planning

GPQA Diamond

New York University

Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.

Active, Scientific Reasoning, Factual Recall, Science

GPQA (Full Set)

New York University

448 graduate-level physics/chemistry/biology MCQs written by PhDs - the full set; the harder "Diamond" subset is reported more often.

Active, Scientific Reasoning, Factual Recall, Science

HELM (Holistic Evaluation of Language Models)

Stanford Center for Research on Foundation Models (CRFM)

Stanford CRFM's wide-coverage evaluation framework - dozens of scenarios scored on accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.

Active, Factual Recall, Safety, Instruction Following

Humanity's Last Exam (HLE)

Center for AI Safety (CAIS)

2,500 expert-authored questions across math, sciences, and humanities designed to be the hardest closed-ended benchmark for frontier models.

Active, Scientific Reasoning, Math, Factual Recall

LegalBench

Stanford University

162 collaboratively curated legal-reasoning tasks across rule-recall, issue-spotting, application, and interpretation - the standard legal LLM benchmark.

Active, Legal Reasoning, Factual Recall, Legal

LiveBench

Abacus.AI

Rolling contamination-free benchmark that updates questions monthly across math, coding, reasoning, language, instruction-following, and data analysis.

Active, Math, Code Generation, Instruction Following

MMLU-Pro

TIGER-Lab

Harder, reasoning-focused successor to MMLU with 10 answer choices and curated questions resistant to lucky guessing.

Active, Factual Recall, Scientific Reasoning
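
Why more answer choices blunt lucky guessing can be shown with a little arithmetic. The sketch below assumes every question has exactly the stated number of options (4 for original MMLU, 10 for MMLU-Pro) and that a guesser picks uniformly at random; the function names are illustrative and not part of either benchmark's tooling.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on k-way multiple choice."""
    if num_choices < 2:
        raise ValueError("need at least two answer choices")
    return 1.0 / num_choices


def chance_adjusted(accuracy: float, num_choices: int) -> float:
    """Rescale raw accuracy so random guessing maps to 0 and a perfect score to 1."""
    floor = chance_accuracy(num_choices)
    return (accuracy - floor) / (1.0 - floor)


if __name__ == "__main__":
    for name, k in [("4-choice (MMLU-style)", 4), ("10-choice (MMLU-Pro-style)", 10)]:
        print(f"{name}: guessing floor = {chance_accuracy(k):.0%}")
```

Under these assumptions the guessing floor drops from 25% to 10%, so the same raw score carries more signal on the ten-option format.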

Massive Multitask Language Understanding (MMLU)

University of California, Berkeley

57-subject multiple-choice exam testing broad world knowledge and reasoning across academic and professional domains.

Saturated, Factual Recall, Scientific Reasoning

Multilingual MMLU

OpenAI

MMLU translated into 14-26 languages (community variants exist); measures world knowledge and reasoning across non-English languages.

Active, Factual Recall, Multilingual

MuSR

University of Texas at Austin

756 multi-step soft-reasoning problems - murder mysteries, object placement, team allocation - generated to require chained commonsense inference.

Active, Planning, Factual Recall

Needle in a Haystack (NIAH)

Greg Kamradt

Long-context retrieval pressure test - insert a random fact ("needle") at a random depth inside a long document ("haystack") and ask the model to retrieve it verbatim.

Saturated, Factual Recall, Long Context
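
The needle-in-a-haystack procedure described above can be sketched in a few lines. This is a minimal illustration, not the original test harness: the function names, the "magic number" needle wording, and the substring-based scoring are all assumptions made for the example, and the call to an actual model is left out.

```python
import random


def build_haystack(filler_sentences, needle, depth_fraction):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0 and 1")
    position = round(depth_fraction * len(filler_sentences))
    return filler_sentences[:position] + [needle] + filler_sentences[position:]


def make_trial(filler_sentences, rng):
    """Create one trial: a random fact placed at a random depth in the filler text."""
    secret = rng.randint(100000, 999999)  # hypothetical needle content
    needle = f"The magic number is {secret}."
    depth = rng.random()
    document = " ".join(build_haystack(filler_sentences, needle, depth))
    return {
        "document": document,
        "question": "What is the magic number mentioned in the document?",
        "answer": str(secret),
        "depth": depth,
    }


def is_retrieved(model_answer, expected):
    """Lenient scoring assumption: the expected string appears in the answer."""
    return expected in model_answer


if __name__ == "__main__":
    filler = ["This sentence is unrelated filler text."] * 200
    trial = make_trial(filler, random.Random(0))
    print(f"needle depth: {trial['depth']:.2f}, document length: {len(trial['document'])} chars")
```

Sweeping `depth_fraction` and the amount of filler text gives the depth-by-length grid that results for this kind of test are usually reported on.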

SimpleQA

OpenAI

4,326 short factual questions with a single unambiguous answer, designed to measure short-form factuality and calibrate hallucination rate.

Active, Factual Recall, Hallucination

TruthfulQA

Future of Humanity Institute (Oxford)

817 questions targeting common human misconceptions, measuring whether a model gives factually true answers or repeats popular falsehoods.

Saturated, Hallucination, Factual Recall

Vals.ai Legal Evals

Vals AI

Vals.ai's proprietary suite of legal-domain benchmarks (contract review, hallucination tests, LegalBench Pro) used by law firms to procure LLMs.

Active, Legal Reasoning, Hallucination, Factual Recall, Legal

Tools lifting evals here

VF Openbench RL Env (Community)

Environment for single-turn tasks in OpenBench

RL Env

lifts 4 evals here

GPQA Diamond RL Env (Community)

GPQA Diamond: A Graduate-Level Google-Proof Q&A Benchmark

RL Env, GPQA, Science, Expert Level, Medical

lifts 2 evals here

GPQA RL Env (Community)

GPQA evaluation environment

RL Env

lifts 2 evals here

GPQA RL Env (Prime Intellect)

Prime Intellect

GPQA evaluation environment

RL Env

lifts 2 evals here

OpenOrca

OpenOrca Team

An open reproduction of Microsoft's Orca recipe - FLAN prompts with GPT-4 chain-of-thought completions that taught reasoning by imitation.

SFT Dataset, Instruction Following, Math, Scientific Reasoning

lifts 2 evals here

SlimOrca

OpenOrca Team

A heavily deduplicated, GPT-4-only slice of OpenOrca that delivers similar downstream quality at one-third the size.

SFT Dataset, Instruction Following, Math, Scientific Reasoning

lifts 2 evals here

Abercrombie RL Env (Community)

Teach a small model to classify trademarks on the Abercrombie distinctiveness spectrum

Tools lifting evals here

VF Openbench RL Env (Community)

GPQA Diamond RL Env (Community)

GPQA RL Env (Community)

GPQA RL Env (Prime Intellect)

OpenOrca

SlimOrca

Abercrombie RL Env (Community)

AI 2 ARC RL Env (Community)

Anthropic HH-RLHF

APEX Agents RL Env (Community)

Aya Dataset

Bigbench BBH RL Env (Prime Community)

Capybara

Context Needle RL Env (Community)

Deepconf RL Env (Community)

DeepDive (Serper-powered web QA)

Defend Concede RL Env (Community)

Gdpval RL Env (Community)

Haystack RLM RL Env (Prime Intellect)

HLE RL Env (Prime Intellect)

Legalbench RL Env (Community)

Legalbench RL Env (Community)

Legalbench RL Env (Community)

Legalbench RL Env (Prime Community)

Legalbench RL Env (Prime Intellect)

M ARC RL Env (Community)

M ARC RL Env (Medarc)

MMLU PRO RL Env (Community)

MMLU PRO RL Env (Prime Intellect)

MMLU RL Env (Community)

MMLU RL Env (Prime Community)

Mmmlu RL Env (Community)

OpenHermes 2.5

Openmed Medknowledge RL Env (Community)

OpenThoughts

PRO Health RL Env (Medarc)

s1K

ShareGPT

Simpleqa RL Env (Community)

Simpleqa RL Env (Prime Intellect)

Simpleqa Verified RL Env (Community)

Simpleqa Verified RL Env (Prime Intellect)

Tomagpt RL Env (Community)

Tülu 3 SFT Mixture

WEB PY RL Env (Community)

WEB PY RL Env (Prime Community)

WEB PY RL Env (Prime Intellect)

YOU SURE RL Env (Community)
