BountyBench

Description

BountyBench is a cybersecurity benchmark environment for evaluating AI agents on real-world bug bounties. Agents operate in containers with full codebase access and running target services, tasked with detecting vulnerabilities, writing exploits, or patching security flaws across 30+ real-world software systems.

Based on the BountyBench benchmark with 46 bug bounties across 31 systems and 3 task phases.

Capabilities

Vulnerability detection in real-world codebases
Exploit development for known CVEs
Security patch authoring that preserves existing functionality
Interaction with live services (web apps, APIs, databases)

Compute Requirements

Each task runs in a dedicated sandbox container with:

1 vCPU, 2 GB RAM
Network access (for service-based systems)
Per-system Docker images with pre-configured environments

Tasks

138 tasks total: 46 bounties x 3 phases (detect, exploit, patch).

Each task targets a specific (system, bounty, phase) triple:

detect: Find an unknown vulnerability and write exploit.sh
exploit: Exploit a known vulnerability (CVE/CWE provided) and write exploit.sh
patch: Fix a known vulnerability without breaking existing functionality

Systems include: lunary, yaml, node, django, fastapi, gradio, mlflow, curl, gunicorn, and 20+ more.

Reward Structure

Binary rewards (0.0 or 1.0), verified by shell scripts:

Detect/Exploit: Agent's exploit.sh is run, then a hidden verify.sh checks if the vulnerability was demonstrated. Reward = 1.0 if verify.sh exits 0.
Patch: A reference exploit is run against the patched codebase, then verify.sh checks if the vulnerability is still present. Reward = 1.0 if the exploit is blocked AND invariant tests pass.

Data

Source: bountybench/bountytasks
Format: task_index.json generated by prepare_data.py
Per-system Docker images built by build_images.sh

Tools

Tool	Description
`bash`	Execute arbitrary bash commands in the sandbox
`list_files`	List directory contents
`read_file`	Read file contents
`write_file`	Write content to a file
`submit`	Submit work for automated grading

Time Horizon

Multi-turn. Agents typically need 10-50+ tool calls depending on the phase:

Detect: extensive code analysis + exploit development
Exploit: targeted exploit development
Patch: code analysis + targeted fix

Environment Difficulty

Varies by system and vulnerability type:

Severity ranges from 3.0 to 10.0 (CVSS)
CWE types include injection, auth bypass, SSRF, path traversal, and more

Safety

This environment involves real vulnerability exploitation techniques. All activities are sandboxed and isolated. The environment is designed for security research and AI capability evaluation only.

Citations

@article{zhang2025bountybench,
      title={BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems},
      author={Andy K. Zhang and Joey Ji and Celeste Menders and Riya Dulepet and Thomas Qin and Ron Y. Wang and Junrong Wu and Kyleen Liao and Jiliang Li and Jinghan Hu and Sara Hong and Nardos Demilew and Shivatmica Murgai and Jason Tran and Nishka Kacheria and Ethan Ho and Denis Liu and Lauren McLane and Olivia Bruvik and Dai-Rong Han and Seungwoo Kim and Akhil Vyas and Cuiyuanxiu Chen and Ryan Li and Weiran Xu and Jonathan Z. Ye and Prerit Choudhary and Siddharth M. Bhatia and Vikram Sivashankar and Yuxuan Bao and Dawn Song and Dan Boneh and Daniel E. Ho and Percy Liang},
      year={2025},
      eprint={2505.15216},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
}