GPQA Diamond

Frontier

Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.

Open
Domain: science
Format: HF Dataset
Size: 198 tasks
License: CC-BY-4.0
Published: Nov 2023
Notable for: Benchmark for evaluating scientific reasoning and factual recall in the science domain.

Attribution

Leaderboard scores
Vault, AA, OpenLLM

Top score: 94.1% by Gemini 3.1 Pro Preview. 449 models reporting (83 frontier).

Score history

413
[Chart: score over time, 0% to 100%, Nov 2022 to Nov 2025. Labeled models: GPT-3.5 Turbo, Claude Instant, Claude 2.0, Mistral Medium, GPT-4o (2024-05-13), o1 Mini, o3 Mini, Gemini 3 Pro, Gemini 3.1 Pro Preview.]

Top models

449
[Bar chart: GPQA Diamond scores for 21 models. Highest value: Gemini 3.1 Pro Preview at 94.1.]

Where it's ranked

2

Related tools

9

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers

2

Contributors

3

FAQ

What is GPQA Diamond?
Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.
What capabilities does GPQA Diamond test?
GPQA Diamond evaluates scientific reasoning and factual recall.
What is the current top score on GPQA Diamond?
The top reported score is 94.1% by Gemini 3.1 Pro Preview, across 449 models reporting (83 from frontier labs).
How can a model improve its GPQA Diamond score?
Tools linked to GPQA Diamond on Sophon include GPQA Diamond RL Env (Community), VF Openbench RL Env (Community), GPQA RL Env (Community), and GPQA RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
What license is GPQA Diamond under?
GPQA Diamond is available under CC-BY-4.0.
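
As a rough illustration of what a percentage score on a 198-question multiple-choice set means, here is a minimal Python sketch of exact-match accuracy. It is not the scoring code of any leaderboard listed on this page. The function name `accuracy`, the letter-style answer labels, and the toy data are all assumptions made for the example.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions where the predicted choice label matches the key.

    Hypothetical helper: labels are compared case-insensitively after
    stripping whitespace. Real harnesses may parse model output differently.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Toy data (made up for illustration, not taken from the benchmark).
toy_answers = ["A", "C", "B", "D"]
toy_predictions = ["a", "C", "D", "D"]
toy_score = accuracy(toy_predictions, toy_answers)  # 3 of 4 match

# With 198 questions, one question moves the score by 100 / 198 points.
points_per_question = 100 / 198

print(f"toy accuracy: {toy_score:.2%}")
print(f"one question is worth about {points_per_question:.2f} points")
```

Because the set is small, a single question shifts the score by roughly half a percentage point, which is worth keeping in mind when comparing models that sit close together.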