PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress)
Active
Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must build a codebase that reproduces the paper's results.
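The rubric is what turns a replication attempt into a number. As we understand the PaperBench paper, rubrics are trees: leaf requirements are graded pass/fail by a judge, and each parent's score is the weighted average of its children's scores, so the root score is the replication score. The following is a minimal sketch of that aggregation under those assumptions. `RubricNode`, `score`, the node names, and the weights are all hypothetical and are not PaperBench's actual code or rubric format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One requirement in a hypothetical tree-shaped rubric (not PaperBench's real format)."""
    name: str
    weight: float = 1.0
    # Leaves carry a pass/fail grade from the judge; internal nodes leave this as None.
    passed: Optional[bool] = None
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1]: 1.0 or 0.0 for a graded leaf,
    otherwise the weighted average of the children's scores."""
    if not node.children:
        if node.passed is None:
            raise ValueError(f"leaf {node.name!r} has not been graded")
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight <= 0:
        raise ValueError(f"node {node.name!r} has non-positive total weight")
    return sum(child.weight * score(child) for child in node.children) / total_weight


# Hypothetical rubric: code development counts twice as much as matching results.
rubric = RubricNode("paper", children=[
    RubricNode("code development", weight=2, children=[
        RubricNode("model implemented", passed=True),
        RubricNode("training loop implemented", passed=False),
    ]),
    RubricNode("results match", weight=1, passed=True),
])

print(score(rubric))  # → 0.6666666666666666
```

Weighting lets a rubric author make core contributions count for more than peripheral details. With these example weights, the half-finished "code development" branch (score 0.5) pulls the overall score down to about 0.67 even though the results leaf passed.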
- Publisher: OpenAI
- Domain: Coding
- License: MIT
- Published: Dec 2025
- Notable for: Benchmark for evaluating coding.
Related tools
Implementations, trainers, datasets, and scaffolds linked to this eval (1 linked).
FAQ
- What is PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress)?
- Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. Given a research paper PDF, an addendum with clarifications, and a rubric defining evaluation criteria, the agent must build a codebase that reproduces the paper's results.
- How can a model improve its PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress) score?
- Sophon links RL environments, datasets, and scaffolds that target this eval. For PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress), these include Paperbench ENV RL Env (Community).
- What license is PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress) under?
- PaperBench: Evaluating AI's Ability to Replicate AI Research (Work In Progress) is available under the MIT License.