E2EBench

End-to-end evaluation of AI systems in complete workflows from input to final output

Type: RL Env
Runtime: ORS
License: unknown
Size: 100 tasks
Published: Mar 2026

E2E-Bench

OpenReward Environment

Description

E2E-Bench evaluates agents on end-to-end scientific discovery tasks. Agents must carry out a full research pipeline (analyzing data, running experiments, and writing code) and produce a scientific paper with supporting code and artifacts. A Claude Sonnet LLM judge evaluates the results against task-specific rubric criteria.

Two variants are served from this directory:

  • E2E-Bench: May 2025 dataset (50 tasks: 10 train, 40 test)
  • E2E-Bench-Hard: June 2025 HARPA dataset (50 tasks: 10 train, 40 test)

Capabilities

  • Scientific experiment design and execution
  • Research paper writing
  • Code development for data analysis and ML
  • Artifact generation (results files, logs, visualizations)

Compute Requirements

Agents run in a sandboxed Docker environment. The default sandbox has 2 CPUs and 4 GB of RAM, with network access enabled and no GPU.

Tasks

  • E2E-Bench train: 10 tasks, test: 40 tasks
  • E2E-Bench-Hard train: 10 tasks, test: 40 tasks
  • Each task is a full research pipeline with specific rubric criteria

Reward Structure

Continuous reward based on rubric evaluation:

  • Each rubric criterion is evaluated independently by claude-sonnet-4-6 (matching original AstaBench)
  • Two-pass evaluation: (1) evaluate paper/code/artifacts separately, (2) reflect and give overall verdict
  • Score = (required criteria passed) / (total required criteria)
  • Range: 0.0 to 1.0
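
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual implementation: the CriterionResult structure, the rubric_score function, the example criterion names, and the choice to return 0.0 when a task has no required criteria are all assumptions made for the example. Only the formula (required criteria passed divided by total required criteria) comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    """Judge verdict for one rubric criterion (hypothetical structure)."""
    name: str
    required: bool
    passed: bool


def rubric_score(results: list[CriterionResult]) -> float:
    """Return the fraction of required criteria that passed, in [0.0, 1.0]."""
    required = [r for r in results if r.required]
    if not required:
        # Assumption: a task with no required criteria scores 0.0.
        return 0.0
    # Optional criteria are ignored; only required ones count toward the score.
    return sum(r.passed for r in required) / len(required)


# Hypothetical verdicts: three required criteria (two pass) and one optional.
example = [
    CriterionResult("reports baseline accuracy", required=True, passed=True),
    CriterionResult("includes runnable code", required=True, passed=True),
    CriterionResult("ablation study present", required=True, passed=False),
    CriterionResult("adds extra visualization", required=False, passed=True),
]
print(rubric_score(example))
```

In this example two of the three required criteria pass, so the score is two thirds; the optional criterion does not affect it.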

Data

  • Source: allenai/asta-bench on HuggingFace (gated dataset)
  • Attribution: Data provided by The Allen Institute for Artificial Intelligence. Test portions must not be used for training.

Tools

  • bash: Execute shell commands in the sandbox
  • submit: Submit research results (paper + code + artifacts) for rubric evaluation (terminal action, one attempt)

Time Horizon

Multi-turn. Agents run extensive research pipelines. Expected: 20 to 100 or more tool calls.

Environment Difficulty

Very hard. Requires end-to-end scientific research: experiment design, code implementation, analysis, and paper writing.

Safety

Code is executed in an isolated sandbox. Rubric evaluation uses Claude Sonnet API calls (requires Anthropic API key).

Citations

@article{bragg2025astabench,
  title={AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite},
  author={Bragg, Jonathan and D'Arcy, Mike and Balepur, Nishant and Bareket, Dan and Dalvi, Bhavana and Feldman, Sergey and Haddad, Dany and Hwang, Jena D. and Jansen, Peter and Kishore, Varsha and Majumder, Bodhisattwa Prasad and Naik, Aakanksha and Rahamimov, Sigal and Richardson, Kyle and Singh, Amanpreet and Surana, Harshit and Tiktinsky, Aryeh and Vasu, Rosni and Wiener, Guy and Anastasiades, Chloe and Candra, Stefan and Dunkelberger, Jason and Emery, Dan and Evans, Rob and Hamada, Malachi and Huff, Regan and Kinney, Rodney and Latzke, Matt and Lochner, Jaron and Lozano-Aguilera, Ruben and Nguyen, Cecile and Rao, Smita and Tanaka, Amber and Vlahos, Brooke and Clark, Peter and Downey, Doug and Goldberg, Yoav and Sabharwal, Ashish and Weld, Daniel S.},
  journal={arXiv preprint arXiv:2510.21652},
  year={2025}
}