GPQA (Full Set)
Frontier
448 graduate-level physics/chemistry/biology MCQs written by PhDs - the full set; the harder "Diamond" subset is reported more often.
- Publisher
- New York University
- Capabilities
- Scientific Reasoning, Factual Recall
- Domain
- science
- Format
- HF Dataset
- Size
- 448 tasks
- License
- CC-BY-4.0
- Published
- Nov 2023
- Notable for
- Benchmark for evaluating scientific reasoning and factual recall in the science domain.
- Canonical
- github.com/idavidrein/gpqa
Top score 93.2% by Gemini 3.1 Pro Preview - 10 models reporting (6 frontier)
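For readers unfamiliar with how a percentage score on a multiple-choice benchmark is typically obtained, here is a minimal sketch. It assumes the score is plain accuracy (correct answers divided by total questions); the page does not state how the reported figures were computed, so treat this as an illustration only. The function name `accuracy_percent` and the toy answer letters are hypothetical and are not real GPQA items.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the percentage of questions whose predicted letter matches the key.

    Assumption: one answer letter per question, compared case-insensitively.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Toy example with hypothetical data (not real GPQA questions).
preds = ["A", "c", "B", "D"]
keys = ["A", "C", "D", "D"]
print(f"{accuracy_percent(preds, keys):.1f}%")  # 3 of 4 correct -> 75.0%
```

On the full set, `answers` would hold 448 entries, one per task.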
Score history (10)
Top models (10)
Related tools (3)
Implementations, trainers, datasets and scaffolds linked to this eval.
Papers (2)
FAQ
- What is GPQA (Full Set)?
- 448 graduate-level physics/chemistry/biology MCQs written by PhDs - the full set; the harder "Diamond" subset is reported more often.
- What capabilities does GPQA (Full Set) test?
- GPQA (Full Set) evaluates scientific reasoning and factual recall.
- What is the current top score on GPQA (Full Set)?
- The top reported score is 93.2% by Gemini 3.1 Pro Preview, across 10 models reporting (6 from frontier labs).
- How can a model improve its GPQA (Full Set) score?
- Tools linked to GPQA (Full Set) on Sophon include GPQA RL Env (Community), GPQA RL Env (Prime Intellect), and GPQA Diamond RL Env (Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.
- What license is GPQA (Full Set) under?
- GPQA (Full Set) is available under CC-BY-4.0.