SimpleQA

Active

4,326 short factual questions with a single unambiguous answer, designed to measure short-form factuality and calibrate hallucination rate.

Open

Publisher: OpenAI
Capabilities: Factual Recall, Hallucination
Format: Custom
Size: 4,326 tasks
License: MIT
Published: Oct 2024
Notable for: Benchmark for evaluating factual recall and hallucination.
Canonical: openai.com/index/introducing-simpleqa
Also on: github.com/openai/simple-evals

Attribution

Leaderboard scores: prime-hub

Top score: 33.3% by GPT-4.1 Mini (1 model reporting, 1 from a frontier lab)
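To make the headline number concrete, the sketch below converts the reported percentage into an approximate count of correctly answered questions. It assumes the leaderboard score is plain accuracy over all 4,326 questions, which this page does not confirm; the function name `implied_correct` is hypothetical.

```python
# Assumption: the reported score is plain accuracy over the full question set.
TOTAL_QUESTIONS = 4326
TOP_SCORE = 0.333  # 33.3% reported for GPT-4.1 Mini


def implied_correct(score: float, total: int = TOTAL_QUESTIONS) -> int:
    """Return the approximate number of correct answers implied by an accuracy score."""
    return round(score * total)


approx_correct = implied_correct(TOP_SCORE)
print(f"About {approx_correct} of {TOTAL_QUESTIONS} questions answered correctly")
```

Under that assumption, the top score corresponds to roughly one question in three being answered correctly.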

Top models

[Bar chart: SimpleQA scores, 1 model. Highest value: GPT-4.1 Mini at 33.3.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Simpleqa RL Env (Prime Intellect)

Prime Intellect

SimpleQA evaluation environment

Implementation, RL Env, Simpleqa, Knowledge

Simpleqa RL Env (Community)

SimpleQA eval

Implementation, RL Env, Simpleqa, Knowledge

DeepDive (Serper-powered web QA)

Prime Intellect

Multi-turn open-web research QA environment where the model uses a Serper search tool to answer hard, hop-heavy questions of the BrowseComp / SimpleQA style.

Trains toward, RL Env, Retrieval, Tool Calling, Planning

VF Openbench RL Env (Community)

Environment for single-turn tasks in OpenBench

Trains toward, RL Env

Simpleqa Verified RL Env (Prime Intellect)

Prime Intellect

SimpleQA-Verified evaluation environment with accuracy scoring.

Trains toward, RL Env, Simpleqa, Simpleqa Verified, Knowledge

Simpleqa Verified RL Env (Community)

SimpleQA-Verified evaluation environment with accuracy scoring.

Trains toward, RL Env, Simpleqa, Simpleqa Verified, Knowledge
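Several of the environments above mention accuracy scoring. As a rough illustration only, the sketch below scores short-form answers by normalized exact match. This is an assumption, not the grading used by any listed environment: the page does not describe their graders, and real graders may rely on a model-based judge instead. The helper names and sample data are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical sample data, not drawn from SimpleQA.
preds = ["Paris", "  canberra. ", "1970"]
refs = ["paris", "Canberra", "1969"]
print(f"accuracy = {accuracy(preds, refs):.3f}")
```

Exact match is deliberately strict: it is cheap and reproducible, but it would mark a correct answer phrased differently from the reference as wrong.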

FAQ

What is SimpleQA?

SimpleQA is a set of 4,326 short factual questions, each with a single unambiguous answer, designed to measure short-form factuality and calibrate hallucination rate.

What capabilities does SimpleQA test?

SimpleQA evaluates factual recall and hallucination.

What is the current top score on SimpleQA?

The top reported score is 33.3% by GPT-4.1 Mini, with 1 model reporting (1 from a frontier lab).

How can a model improve its SimpleQA score?

Tools linked to SimpleQA on Sophon include Simpleqa RL Env (Prime Intellect), Simpleqa RL Env (Community), DeepDive (Serper-powered web QA), and VF Openbench RL Env (Community). These are RL environments, datasets, and scaffolds that target this eval.

What license is SimpleQA under?

SimpleQA is available under the MIT license.