GPQA Diamond
Frontier
Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.
- Publisher
- New York University
- Capabilities
- Scientific Reasoning, Factual Recall
- Domain
- science
- Format
- HF Dataset
- Size
- 198 tasks
- License
- CC-BY-4.0
- Published
- Nov 2023
- Notable for
- Benchmark for evaluating scientific reasoning and factual recall in the science domain.
- Canonical
- github.com/idavidrein/gpqa
Top score: 94.1% by Gemini 3.1 Pro Preview; 449 models reporting (83 from frontier labs)
Score history: 413
Top models: 449
Where it's ranked: 2
Related tools: 9 (implementations, trainers, datasets and scaffolds linked to this eval)
Papers: 2
Contributors: 3
FAQ
- What is GPQA Diamond?
- Graduate-level physics, chemistry, and biology multiple-choice questions written by PhDs and verified to be Google-proof.
- What capabilities does GPQA Diamond test?
- GPQA Diamond evaluates scientific reasoning and factual recall.
- What is the current top score on GPQA Diamond?
- The top reported score is 94.1% by Gemini 3.1 Pro Preview, across 449 models reporting (83 from frontier labs).
- How can a model improve its GPQA Diamond score?
- Tools linked to GPQA Diamond on Sophon include GPQA Diamond RL Env (Community), VF Openbench RL Env (Community), GPQA RL Env (Community), and GPQA RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
- What license is GPQA Diamond under?
- GPQA Diamond is available under CC-BY-4.0.
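
The scores quoted above are accuracies on multiple-choice questions, so a rough sketch of how such a number could be computed may help when comparing results. The `Item` class, the `accuracy` function, and the four-option toy data below are illustrative assumptions, not the benchmark's official harness or data schema.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical container for one multiple-choice question."""
    question: str
    choices: list[str]   # answer options shown to the model
    answer_index: int    # index of the correct option in `choices`


def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Return the fraction of items whose predicted option index matches the key."""
    if not items:
        raise ValueError("need at least one item")
    if len(items) != len(predictions):
        raise ValueError("one prediction per item is required")
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


# Toy data only: four made-up questions, not taken from the benchmark.
items = [
    Item("Q1", ["a", "b", "c", "d"], 2),
    Item("Q2", ["a", "b", "c", "d"], 0),
    Item("Q3", ["a", "b", "c", "d"], 3),
    Item("Q4", ["a", "b", "c", "d"], 1),
]
predictions = [2, 0, 1, 1]  # the third prediction is wrong

score = accuracy(items, predictions)
print(f"{score:.1%}")  # 75.0%

# With 198 tasks, a single question moves the score by about half a point.
per_question = 1 / 198
print(f"{per_question:.2%}")  # 0.51%
```

Because one question out of 198 is worth roughly 0.5 percentage points, small gaps between reported scores can correspond to a difference of only one or two questions.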
