GPQA (Full Set)

Frontier

448 graduate-level multiple-choice questions in physics, chemistry, and biology, written by PhDs. This is the full set; the harder "Diamond" subset is reported more often.

Domain: science
Format: HF Dataset
Size: 448 tasks
License: CC-BY-4.0
Published: Nov 2023
Notable for: evaluating scientific reasoning and factual recall.

Attribution

Leaderboard scores: prime-hub

Top score 93.2% by Gemini 3.1 Pro Preview - 10 models reporting (6 frontier)
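
The page does not say how leaderboard scores are computed. As a minimal sketch, assuming a score is plain accuracy (the share of questions where the chosen option letter matches the answer key), it could be calculated as below. The function name `accuracy` and the toy data are hypothetical and are not taken from GPQA or from any leaderboard harness.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty answer key")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data, not real GPQA items or model outputs.
answer_key = ["A", "C", "B", "D"]
model_choices = ["A", "C", "D", "D"]
print(f"{accuracy(model_choices, answer_key):.1%}")  # 3 of 4 correct -> 75.0%
```

Real evaluation harnesses may differ, for example by averaging over several runs or by extracting the chosen letter from free-form model output first.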

Score history

[Chart: score history from Apr 25 to Jan 26, y-axis from 55% to 100%. Labeled models: GPT-4.1, Claude 4 Sonnet, GPT-5, Claude Sonnet 4.5, Claude Opus 4.5, Gemini 3.1 Pro Preview.]

Top models

[Bar chart: GPQA (Full Set), 10 models. Highest value: Gemini 3.1 Pro Preview at 93.2.]

Related tools

3 tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers

2 papers

FAQ

What is GPQA (Full Set)?
GPQA (Full Set) is a set of 448 graduate-level multiple-choice questions in physics, chemistry, and biology, written by PhDs. It is the full set; the harder "Diamond" subset is reported more often.
What capabilities does GPQA (Full Set) test?
GPQA (Full Set) evaluates scientific reasoning and factual recall.
What is the current top score on GPQA (Full Set)?
The top reported score is 93.2% by Gemini 3.1 Pro Preview, across 10 models reporting (6 from frontier labs).
How can a model improve its GPQA (Full Set) score?
Tools linked to GPQA (Full Set) on Sophon include GPQA RL Env (Community), GPQA RL Env (Prime Intellect), and GPQA Diamond RL Env (Community). The linked tools are RL environments, datasets, and scaffolds that target this eval.
What license is GPQA (Full Set) under?
GPQA (Full Set) is available under CC-BY-4.0.