
FrontierScience


An implementation of the FrontierScience evaluation.

Type: RL Env
Runtime: ORS
License: unknown
Size: 320 tasks
Published: Feb 2026



Description

FrontierScience is an environment for evaluating expert-level scientific reasoning. It contains 160 problems in physics, chemistry, and biology, designed to assess frontier model capabilities through two evaluation tracks: Olympiad (short-answer problems) and Research (open-ended, PhD-level problems).

Capabilities

  • Expert-level scientific reasoning across physics, chemistry, and biology
  • Short-answer problem solving (Olympiad track)
  • Open-ended research subtask completion (Research track)
  • Multi-criterion rubric-based evaluation

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

Apache 2.0.

Tasks

There are three splits in this environment:

  • test: 160 tasks (all problems)
  • olympic: ~92 tasks (short-answer Olympiad-style problems)
  • research: ~68 tasks (open-ended PhD-level research problems)

Problems span Physics (70), Chemistry (60), and Biology (30).

Reward Structure

This is a single-turn environment with two grading methodologies:

Olympiad Track: An LLM grader (gpt-5.2 with high reasoning effort) checks equivalence with the reference answer, considering algebraic equivalence, numeric tolerance, chemical equivalents, and unit conversions. Reward is binary: 1.0 if correct, 0.0 if incorrect.

Research Track: An LLM grader (gpt-5.2 with high reasoning effort) evaluates the response against a multi-criterion rubric parsed from the answer field. Each criterion is graded independently, the scores are summed, and the reward is normalized (total earned / total possible). A response counts as successful at 7 or more points out of 10 (a reward of at least 0.7).
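The two reward rules above can be sketched in a few lines. This is a minimal illustration, not the environment's actual grading code: the names `olympiad_reward`, `research_reward`, `CriterionScore`, and `is_success` are hypothetical, and the LLM grader's per-criterion verdicts are assumed to be available already as numbers.

```python
from dataclasses import dataclass

# Threshold taken from the text: 7+ points out of 10, i.e. reward >= 0.7.
SUCCESS_THRESHOLD = 0.7


def olympiad_reward(is_equivalent: bool) -> float:
    """Binary reward: 1.0 if the grader judged the answer equivalent, else 0.0."""
    return 1.0 if is_equivalent else 0.0


@dataclass
class CriterionScore:
    """Hypothetical container for one graded rubric criterion."""
    earned: float    # points the grader awarded for this criterion
    possible: float  # maximum points available for this criterion


def research_reward(scores: list[CriterionScore]) -> float:
    """Normalized reward: total earned / total possible."""
    total_possible = sum(s.possible for s in scores)
    if total_possible <= 0:
        return 0.0  # assumption: an empty rubric earns no reward
    return sum(s.earned for s in scores) / total_possible


def is_success(reward: float) -> bool:
    """True when the normalized reward meets the 0.7 threshold."""
    return reward >= SUCCESS_THRESHOLD


# Illustrative rubric worth 10 points in total, with 7 earned.
example = [CriterionScore(3, 4), CriterionScore(2, 3), CriterionScore(2, 3)]
print(research_reward(example), is_success(research_reward(example)))
```

Treating an empty rubric as zero reward is a choice made for this sketch; the source does not say how that case is handled.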

Data

Data consists of a single Parquet file (frontierscience.parquet) sourced from the Hugging Face dataset openai/frontierscience. Each row contains a problem, an answer (a short answer for Olympiad tasks or a rubric for Research tasks), a subject, and a task group ID. Data is stored on the OpenReward platform.
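As a rough picture of the row layout described above, the sketch below models one row as a small record and tallies tasks by subject. The field names (`problem`, `answer`, `subject`, `task_group_id`) are assumptions based on the prose, not confirmed column names, and the sample rows are placeholders rather than real dataset content.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One row of the dataset; field names are assumed, not confirmed."""
    problem: str        # the problem statement shown to the agent
    answer: str         # short reference answer, or rubric text for Research tasks
    subject: str        # e.g. "physics", "chemistry", "biology"
    task_group_id: str  # identifier grouping related tasks


def count_by_subject(tasks: list[Task]) -> Counter:
    """Tally how many tasks belong to each subject."""
    return Counter(task.subject for task in tasks)


# Placeholder rows for illustration only.
rows = [
    Task("placeholder problem A", "placeholder answer", "physics", "g1"),
    Task("placeholder problem B", "placeholder rubric", "chemistry", "g2"),
    Task("placeholder problem C", "placeholder answer", "physics", "g3"),
]
print(count_by_subject(rows))
```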

Tools

| Tool | Description |
|---|---|
| submit_answer | Submit your final answer for grading. Ends the episode. |

Time Horizon

Single-turn. The agent reads the scientific problem and submits one answer.

Environment Difficulty

FrontierScience is designed to challenge frontier AI systems on expert-level scientific reasoning. Reported pass rates by track:

| Track | Model | Pass Rate |
|---|---|---|
| Olympiad | GPT-5.2 | 77% |
| Olympiad | Gemini 3 Pro | 76% |
| Research | GPT-5.2 | 25% |
| Research | GPT-5 | 25% |

Other Environment Requirements

An OpenAI API key is required for LLM-based grading. Pass it via secrets={"openai_api_key": "..."}.
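One way to avoid hard-coding the key is to build the secrets dictionary from an environment variable. The sketch below assumes the key lives in `OPENAI_API_KEY` (the variable name is an assumption, as is the helper name `build_secrets`); how the resulting dictionary is handed to the environment depends on the client in use and is not shown here.

```python
import os
from collections.abc import Mapping


def build_secrets(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the secrets dict in the shape the text describes.

    Reads the key from OPENAI_API_KEY (an assumed variable name) and
    fails early if it is missing or empty.
    """
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before starting the environment")
    return {"openai_api_key": key}
```

Failing early keeps a missing key from surfacing only at grading time, after the agent has already produced an answer.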

Safety

Agents in FrontierScience solve expert-level scientific problems in a standard environment. The environment does not present direct safety risks.

Citation

@article{frontierscience2025,
  title={FrontierScience: Measuring Expert-Level Scientific Reasoning in AI},
  author={OpenAI},
  journal={arXiv preprint arXiv:2601.21165},
  year={2025}
}