GPQA

Description

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is an environment for evaluating expert-level question answering. It contains challenging multiple-choice questions in Biology, Physics, and Chemistry, written by domain experts and designed to be difficult even for skilled non-experts with unrestricted internet access. Answering them requires genuine expertise rather than simple information retrieval.

Capabilities

Graduate-level scientific reasoning
Expert knowledge in Biology, Physics, and Chemistry
Multiple-choice question answering
Distinguishing between plausible-sounding incorrect answers

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

CC BY 4.0.

Tasks

There are three splits in this environment:

main: 448 tasks
diamond: 198 tasks (highest quality subset)
extended: 546 tasks

Questions span Biology, Physics, and Chemistry subdomains. Each task presents a question with four answer choices (A, B, C, D).
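As an illustration of the task format, the sketch below renders a question and its four choices as a lettered prompt. The function name `format_prompt` and the exact layout are assumptions made for this example; the environment's actual prompt template is not specified here.

```python
from typing import Sequence

LETTERS = ("A", "B", "C", "D")


def format_prompt(question: str, choices: Sequence[str]) -> str:
    """Render a question and exactly four choices as a lettered prompt.

    Hypothetical helper: the layout is illustrative, not the environment's
    actual template.
    """
    if len(choices) != len(LETTERS):
        raise ValueError(f"expected {len(LETTERS)} choices, got {len(choices)}")
    lines = [question.strip(), ""]
    # Pair each choice with its letter, in the order given.
    lines += [f"{letter}. {choice.strip()}" for letter, choice in zip(LETTERS, choices)]
    return "\n".join(lines)
```

Rejecting anything other than four choices keeps the prompt aligned with the A to D answer letters the task expects.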

Reward Structure

This is a single-turn environment. The agent submits an answer letter (A, B, C, or D) via the `submit_answer` tool, and the submission is checked against the correct letter by deterministic exact match. Reward is binary: 1.0 if correct, 0.0 if incorrect.
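The reward rule can be sketched in a few lines. The function name `score_answer` is hypothetical, and the sketch compares letters strictly: whether the environment normalises case or whitespace before matching is not stated, so no normalisation is assumed.

```python
VALID_LETTERS = frozenset("ABCD")


def score_answer(submitted: str, correct: str) -> float:
    """Return 1.0 on an exact letter match and 0.0 otherwise.

    Hypothetical sketch of the binary reward; strict comparison with no
    case or whitespace normalisation is an assumption.
    """
    if correct not in VALID_LETTERS:
        raise ValueError(f"correct answer must be one of A-D, got {correct!r}")
    return 1.0 if submitted == correct else 0.0
```

Under this strict reading, any submission other than the exact correct letter, including a lowercase letter or an out-of-range value, scores 0.0.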

Data

Data consists of one CSV file per split, sourced from the Hugging Face dataset Idavidrein/gpqa. Each row contains a question, the correct answer, three incorrect answers, and a subdomain. Data is stored on the OpenReward platform.
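A minimal sketch of how such a row could become a four-choice task is shown below. The column names, the helper names `row_to_task` and `load_tasks`, and the seeded shuffle used to assign letters are all assumptions for illustration; check the column names against the actual CSV header, and note that how the environment orders the choices is not documented here.

```python
import csv
import random
from typing import Dict, Iterable, List

LETTERS = "ABCD"

# Assumed column names; verify against the real CSV header before use.
COL_QUESTION = "Question"
COL_CORRECT = "Correct Answer"
COL_INCORRECT = ("Incorrect Answer 1", "Incorrect Answer 2", "Incorrect Answer 3")
COL_SUBDOMAIN = "Subdomain"


def row_to_task(row: Dict[str, str], rng: random.Random) -> dict:
    """Turn one CSV row into a four-choice task with a known answer letter."""
    options = [(row[COL_CORRECT], True)]
    options += [(row[col], False) for col in COL_INCORRECT]
    rng.shuffle(options)  # assumption: choice order is randomised per task
    choices = {letter: text for letter, (text, _) in zip(LETTERS, options)}
    answer = next(
        letter for letter, (_, is_correct) in zip(LETTERS, options) if is_correct
    )
    return {
        "question": row[COL_QUESTION],
        "choices": choices,
        "answer": answer,
        "subdomain": row[COL_SUBDOMAIN],
    }


def load_tasks(lines: Iterable[str], seed: int = 0) -> List[dict]:
    """Parse CSV lines (for example an open file) into tasks, reproducibly."""
    rng = random.Random(seed)
    return [row_to_task(row, rng) for row in csv.DictReader(lines)]
```

Seeding the generator keeps the letter assignment reproducible across runs, which matters when the reward depends on an exact letter match.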

Tools

Tool	Description
`submit_answer`	Submit your answer choice (A, B, C, or D). Ends the episode.

Time Horizon

Single-turn. The agent reads the question and options, then submits one answer.

Environment Difficulty

GPQA evaluates expert-level scientific reasoning. Accuracy on the diamond split:

Model	Diamond Accuracy
Claude Opus 4.5	77.4%
GPT-5	72.7%
Gemini 2 Flash	67.7%
Claude Sonnet 4	65.5%
Human Experts	69.7%
Human Non-Experts	34.1%
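Because each task yields a reward of 1.0 or 0.0, accuracy on a split is the mean reward. The sketch below computes it together with a binomial standard error, which helps when comparing scores on a split as small as diamond. The function name `accuracy_with_stderr` is hypothetical, and the standard error treats tasks as independent trials, which is an assumption.

```python
import math
from typing import Sequence, Tuple


def accuracy_with_stderr(rewards: Sequence[float]) -> Tuple[float, float]:
    """Return (accuracy, binomial standard error) for binary rewards.

    Hypothetical helper; assumes each reward is 0.0 or 1.0 and that tasks
    are independent.
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one reward")
    accuracy = sum(rewards) / n
    stderr = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy, stderr
```

For a sense of scale, at around 70% accuracy on the 198 diamond tasks this standard error is roughly 3 percentage points, so small gaps between entries should be read with caution.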

Other Environment Requirements

There are no further environment requirements; GPQA works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in GPQA answer graduate-level science questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{rein2023gpqa,
  title={GPQA: A Graduate-Level Google-Proof Q\&A Benchmark},
  author={Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R.},
  journal={arXiv preprint arXiv:2311.12022},
  year={2023}
}