PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress)

Active

Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must

Publisher
OpenAI
Domain
Coding
License
MIT
Published
Dec 2025
Notable for
Benchmark for evaluating Coding.

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress)?
Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must
How can a model improve its PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress) score?
Tools linked to PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress) on Sophon include Paperbench ENV RL Env (Community) - RL environments, datasets, and scaffolds that target this eval.
What license is PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress) under?
PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress) is available under the MIT License.