What capabilities does GPQA (Full Set) test?

GPQA (Full Set) evaluates scientific reasoning and factual recall.

What is the current top score on GPQA (Full Set)?

The top reported score is 93.2% by Gemini 3.1 Pro Preview, across 10 models reporting (6 from frontier labs).

How can a model improve its GPQA (Full Set) score?

Tools linked to GPQA (Full Set) on Sophon include GPQA RL Env (Community), GPQA RL Env (Prime Intellect), and GPQA Diamond RL Env (Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.

What license is GPQA (Full Set) under?

GPQA (Full Set) is available under CC-BY-4.0.

GPQA (Full Set)

Frontier

The full set of 448 graduate-level multiple-choice questions in physics, chemistry, and biology, written by PhDs. The harder "Diamond" subset is reported more often.

Open

Publisher: New York University
Capabilities: Scientific Reasoning, Factual Recall
Domain: science
Format: HF Dataset
Size: 448 tasks
License: CC-BY-4.0
Published: Nov 2023
Notable for: Benchmark for evaluating scientific reasoning and factual recall in the science domain.
Canonical: github.com/idavidrein/gpqa
Also on: huggingface.co/datasets/Idavidrein/gpqa
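
Since the benchmark is distributed as a multiple-choice dataset, a harness typically has to turn each record into a lettered prompt and remember which letter is correct. The sketch below shows one way to do that. The column names (`Question`, `Correct Answer`, `Incorrect Answer 1` to `Incorrect Answer 3`) are assumptions about the Hugging Face release and are not stated on this page, and the example record is hypothetical.

```python
import random

# Column names below are assumptions about the Hugging Face release,
# not something this page states; adjust them to the actual schema.
QUESTION_KEY = "Question"
CORRECT_KEY = "Correct Answer"
INCORRECT_KEYS = ("Incorrect Answer 1", "Incorrect Answer 2", "Incorrect Answer 3")
LETTERS = "ABCD"


def build_prompt(record, rng):
    """Return (prompt_text, gold_letter) for one multiple-choice record."""
    options = [record[CORRECT_KEY]] + [record[k] for k in INCORRECT_KEYS]
    order = list(range(len(options)))
    rng.shuffle(order)  # shuffle so the correct answer is not always "A"
    lines = [record[QUESTION_KEY].strip(), ""]
    gold = None
    for letter, idx in zip(LETTERS, order):
        lines.append(f"{letter}) {options[idx].strip()}")
        if idx == 0:  # index 0 is the correct answer before shuffling
            gold = letter
    lines.append("")
    lines.append("Answer with a single letter.")
    return "\n".join(lines), gold


# Hypothetical record for illustration only; not taken from the dataset.
example = {
    "Question": "Which particle mediates the electromagnetic force?",
    "Correct Answer": "Photon",
    "Incorrect Answer 1": "Gluon",
    "Incorrect Answer 2": "W boson",
    "Incorrect Answer 3": "Graviton",
}

prompt, gold = build_prompt(example, random.Random(0))
print(prompt)
print("gold:", gold)
```

With the `datasets` library installed, records could be fetched with something like `load_dataset("Idavidrein/gpqa", "gpqa_main")`. The config name here is an assumption, and the repository may require accepting access terms first. Passing a seeded `random.Random` keeps the option order reproducible across runs.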

Attribution

Leaderboard scores: prime-hub

Top score 93.2% by Gemini 3.1 Pro Preview - 10 models reporting (6 frontier)
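
To put a reported score in context, it can help to compare it with random guessing and to estimate how much sampling noise a 448-question set carries. The following is a minimal sketch, assuming the score is plain accuracy, that each question has four answer options, and that questions can be treated as independent trials. None of these assumptions is confirmed by this page.

```python
import math


def accuracy(predictions, gold):
    """Fraction of positions where the predicted letter matches the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def binomial_standard_error(acc, n):
    """Standard error of an accuracy estimate over n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


N_QUESTIONS = 448      # size of the full set, from this page
TOP_SCORE = 0.932      # top reported score, from this page
CHANCE = 1 / 4         # assumes four answer options per question

se = binomial_standard_error(TOP_SCORE, N_QUESTIONS)
print(f"standard error: {se * 100:.1f} points")
print(f"margin over chance: {(TOP_SCORE - CHANCE) * 100:.1f} points")
```

Under these assumptions the standard error at 93.2% is roughly 1.2 percentage points, so differences of about a point between leaderboard entries may not be meaningful on their own.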

Score history and top models

[Bar chart: GPQA (Full Set), 10 models. Highest value: Gemini 3.1 Pro Preview at 93.2.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

GPQA RL Env (Community)

GPQA evaluation environment

Implementation · RL Env

GPQA RL Env (Prime Intellect)

Prime Intellect

GPQA evaluation environment

Implementation · RL Env

GPQA Diamond RL Env (Community)

GPQA Diamond: A Graduate-Level Google-Proof Q&A Benchmark

Trains toward · RL Env · GPQA · Science · Expert Level

Papers

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

COLM · 2023

Introduces GPQA, 448 PhD-written multiple-choice questions in biology, physics, and chemistry. Skilled non-experts reach only about 34% accuracy on them, even with unrestricted web access.
