E2E-Bench

Description

E2E-Bench evaluates agents on end-to-end scientific discovery tasks. Agents must carry out full research pipelines (analyzing data, running experiments, writing code) and produce a scientific paper with supporting code and artifacts. A Claude Sonnet LLM judge evaluates the results against task-specific rubric criteria.

Two variants are served from this directory:

E2E-Bench: May 2025 dataset (50 tasks: 10 train, 40 test)
E2E-Bench-Hard: June 2025 HARPA dataset (50 tasks: 10 train, 40 test)

Capabilities

Scientific experiment design and execution
Research paper writing
Code development for data analysis and ML
Artifact generation (results files, logs, visualizations)

Compute Requirements

Agents are given a sandboxed Docker environment. The default sandbox has 2 CPUs and 4 GB RAM, with network access enabled and no GPU.

Tasks

E2E-Bench: 10 train tasks, 40 test tasks
E2E-Bench-Hard: 10 train tasks, 40 test tasks
Each task is a full research pipeline with specific rubric criteria

Reward Structure

Continuous reward based on rubric evaluation:

Each rubric criterion is evaluated independently by claude-sonnet-4-6 (matching the original AstaBench)
Two-pass evaluation: (1) the judge evaluates the paper, code, and artifacts separately, then (2) reflects on those findings and gives an overall verdict for the criterion
Score = (required criteria passed) / (total required criteria)
Range: 0.0 to 1.0
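
The scoring rule above can be illustrated with a minimal Python sketch. It is not the environment's actual implementation: the `CriterionResult` structure, the `rubric_score` helper, the example criteria, and the choice to return 0.0 when a task has no required criteria are all assumptions made for illustration. Only the formula itself (required criteria passed divided by total required criteria) comes from this README.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """Judge verdict for one rubric criterion (hypothetical structure)."""
    name: str
    required: bool
    passed: bool


def rubric_score(results: list[CriterionResult]) -> float:
    """Return (required criteria passed) / (total required criteria)."""
    required = [r for r in results if r.required]
    if not required:
        # Assumption: a task with no required criteria scores 0.0.
        return 0.0
    # Booleans sum as 0/1, so this counts the passed required criteria.
    return sum(r.passed for r in required) / len(required)


# Made-up verdicts: 2 of the 3 required criteria pass.
example = [
    CriterionResult("reports baseline accuracy", required=True, passed=True),
    CriterionResult("includes ablation study", required=True, passed=False),
    CriterionResult("plots learning curve", required=False, passed=True),
    CriterionResult("provides runnable code", required=True, passed=True),
]
print(f"score = {rubric_score(example):.3f}")
```

In this sketch, criteria that are not required are ignored, so the optional criterion does not change the result.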

Data

Source: allenai/asta-bench on HuggingFace (gated dataset)
Attribution: Data provided by The Allen Institute for Artificial Intelligence. Test portions must not be used for training.

Tools

bash: Execute shell commands in the sandbox
submit: Submit research results (paper + code + artifacts) for rubric evaluation (terminal action, one attempt)

Time Horizon

Multi-turn. Agents perform extensive research pipelines. Expected: 20–100+ tool calls.

Environment Difficulty

Very hard. Requires end-to-end scientific research: experiment design, code implementation, analysis, and paper writing.

Safety

Code is executed in an isolated sandbox. Rubric evaluation makes Claude Sonnet API calls, which require an Anthropic API key.

Citations

@article{bragg2025astabench,
  title={AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite},
  author={Bragg, Jonathan and D'Arcy, Mike and Balepur, Nishant and Bareket, Dan and Dalvi, Bhavana and Feldman, Sergey and Haddad, Dany and Hwang, Jena D. and Jansen, Peter and Kishore, Varsha and Majumder, Bodhisattwa Prasad and Naik, Aakanksha and Rahamimov, Sigal and Richardson, Kyle and Singh, Amanpreet and Surana, Harshit and Tiktinsky, Aryeh and Vasu, Rosni and Wiener, Guy and Anastasiades, Chloe and Candra, Stefan and Dunkelberger, Jason and Emery, Dan and Evans, Rob and Hamada, Malachi and Huff, Regan and Kinney, Rodney and Latzke, Matt and Lochner, Jaron and Lozano-Aguilera, Ruben and Nguyen, Cecile and Rao, Smita and Tanaka, Amber and Vlahos, Brooke and Clark, Peter and Downey, Doug and Goldberg, Yoav and Sabharwal, Ashish and Weld, Daniel S.},
  journal={arXiv preprint arXiv:2510.21652},
  year={2025}
}