GPQA

An implementation of GPQA

Type: RL Env
Runtime: ORS
License: unknown
Size: 1192 tasks
Published: Jan 2026

Description

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is an environment for evaluating expert-level question answering. It contains challenging multiple-choice questions in Biology, Physics, and Chemistry, written by domain experts. The questions are designed to be "Google-proof": difficult even for skilled non-experts with unrestricted internet access, so that answering them requires genuine expertise rather than simple information retrieval.

Capabilities

  • Graduate-level scientific reasoning
  • Expert knowledge in Biology, Physics, and Chemistry
  • Multiple-choice question answering
  • Distinguishing between plausible-sounding incorrect answers

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

CC BY 4.0.

Tasks

There are three splits in this environment:

  • main: 448 tasks
  • diamond: 198 tasks (highest quality subset)
  • extended: 546 tasks

Questions span Biology, Physics, and Chemistry subdomains. Each task presents a question with four answer choices (A, B, C, D).

Reward Structure

This is a single-turn environment. The agent submits an answer letter (A, B, C, or D) via the submit_answer tool. Validation is deterministic exact match. Reward is binary: 1.0 if correct, 0.0 if incorrect.
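
The scoring rule above amounts to a single comparison. The following minimal Python sketch illustrates it. The names `score_answer` and `accuracy` are hypothetical and are not part of the environment's API. Whether the environment normalizes case or whitespace before comparing is not stated here, so the sketch assumes a strict comparison.

```python
from typing import Sequence

VALID_LETTERS = ("A", "B", "C", "D")


def score_answer(submitted: str, correct: str) -> float:
    """Binary reward: 1.0 on an exact letter match, 0.0 otherwise.

    Assumption: no normalization is applied, so "b" does not match "B".
    """
    if correct not in VALID_LETTERS:
        raise ValueError(f"correct answer must be one of {VALID_LETTERS}")
    return 1.0 if submitted == correct else 0.0


def accuracy(rewards: Sequence[float]) -> float:
    """With binary rewards, accuracy is simply the mean reward."""
    return sum(rewards) / len(rewards) if rewards else 0.0
```

Because the reward is binary and each episode is a single turn, the mean reward over a split can be read directly as accuracy on that split.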

Data

The data consists of one CSV file per split, sourced from the Hugging Face dataset Idavidrein/gpqa. Each row contains a question, the correct answer, three incorrect answers, and a subdomain. The data is stored on the OpenReward platform.
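
The row layout described above can be turned into a four-choice task roughly as follows. This is a sketch under stated assumptions, not the environment's actual loader. The column names (`Question`, `Correct Answer`, `Incorrect Answer 1` to `Incorrect Answer 3`, `Subdomain`), the function name `build_task`, and the seeded shuffle of options are all assumptions made for illustration.

```python
import random

LETTERS = ("A", "B", "C", "D")


def build_task(row: dict, seed: int) -> dict:
    """Map one CSV row to a four-choice task with a lettered answer key.

    Column names are assumed, not confirmed by the environment card.
    """
    options = [
        row["Correct Answer"],      # index 0 is always the correct option
        row["Incorrect Answer 1"],
        row["Incorrect Answer 2"],
        row["Incorrect Answer 3"],
    ]
    # Shuffle positions with a per-task seed so the layout is reproducible.
    order = [0, 1, 2, 3]
    random.Random(seed).shuffle(order)
    choices = {letter: options[i] for letter, i in zip(LETTERS, order)}
    # The correct letter is wherever original index 0 ended up.
    answer = LETTERS[order.index(0)]
    return {
        "question": row["Question"],
        "choices": choices,
        "answer": answer,
        "subdomain": row["Subdomain"],
    }
```

Tracking the shuffled index rather than searching for the answer text keeps the answer key correct even if two options happen to share the same string.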

Tools

Tool           Description
submit_answer  Submit your answer choice (A, B, C, or D). Ends the episode.

Time Horizon

Single-turn. The agent reads the question and options, then submits one answer.

Environment Difficulty

GPQA evaluates expert-level scientific reasoning:

Model              Diamond Accuracy
Claude Opus 4.5    77.4%
GPT-5              72.7%
Gemini 2 Flash     67.7%
Claude Sonnet 4    65.5%
Human Experts      69.7%
Human Non-Experts  34.1%

Other Environment Requirements

There are no further environment requirements; GPQA works out of the box with the OpenReward endpoint without any external API keys.

Safety

Agents in GPQA answer graduate-level science questions in a standard environment. The environment does not present direct safety risks.

Citation

@article{rein2023gpqa,
  title={GPQA: A Graduate-Level Google-Proof Q&A Benchmark},
  author={Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R.},
  journal={arXiv preprint arXiv:2311.12022},
  year={2023}
}