0

BountyBench

Fresh

A framework to capture offensive & defensive cyber-capabilities in evolving real-world systems, BountyBench is a benchmark with 25 systems with complex, real-world codebases, and include 40 bug bounties that cover 9 of the OWASP Top 10 Risks.

BountyBench

OpenReward Environment

Description

BountyBench is a cybersecurity benchmark environment for evaluating AI agents on real-world bug bounties. Agents operate in containers with full codebase access and running target services, tasked with detecting vulnerabilities, writing exploits, or patching security flaws across 30+ real-world software systems.

Based on the BountyBench benchmark with 46 bug bounties across 31 systems and 3 task phases.

Capabilities

  • Vulnerability detection in real-world codebases
  • Exploit development for known CVEs
  • Security patch authoring that preserves existing functionality
  • Interaction with live services (web apps, APIs, databases)

Compute Requirements

Each task runs in a dedicated sandbox container with:

  • 1 vCPU, 2 GB RAM
  • Network access (for service-based systems)
  • Per-system Docker images with pre-configured environments

Tasks

138 tasks total: 46 bounties x 3 phases (detect, exploit, patch).

Each task targets a specific (system, bounty, phase) triple:

  • detect: Find an unknown vulnerability and write exploit.sh
  • exploit: Exploit a known vulnerability (CVE/CWE provided) and write exploit.sh
  • patch: Fix a known vulnerability without breaking existing functionality

Systems include: lunary, yaml, node, django, fastapi, gradio, mlflow, curl, gunicorn, and 20+ more.

Reward Structure

Binary rewards (0.0 or 1.0), verified by shell scripts:

  • Detect/Exploit: Agent's exploit.sh is run, then a hidden verify.sh checks if the vulnerability was demonstrated. Reward = 1.0 if verify.sh exits 0.
  • Patch: A reference exploit is run against the patched codebase, then verify.sh checks if the vulnerability is still present. Reward = 1.0 if the exploit is blocked AND invariant tests pass.

Data

  • Source: bountybench/bountytasks
  • Format: task_index.json generated by prepare_data.py
  • Per-system Docker images built by build_images.sh

Tools

ToolDescription
bashExecute arbitrary bash commands in the sandbox
list_filesList directory contents
read_fileRead file contents
write_fileWrite content to a file
submitSubmit work for automated grading

Time Horizon

Multi-turn. Agents typically need 10-50+ tool calls depending on the phase:

  • Detect: extensive code analysis + exploit development
  • Exploit: targeted exploit development
  • Patch: code analysis + targeted fix

Environment Difficulty

Varies by system and vulnerability type:

  • Severity ranges from 3.0 to 10.0 (CVSS)
  • CWE types include injection, auth bypass, SSRF, path traversal, and more

Safety

This environment involves real vulnerability exploitation techniques. All activities are sandboxed and isolated. The environment is designed for security research and AI capability evaluation only.

Citations

@article{zhang2025bountybench,
      title={BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems},
      author={Andy K. Zhang and Joey Ji and Celeste Menders and Riya Dulepet and Thomas Qin and Ron Y. Wang and Junrong Wu and Kyleen Liao and Jiliang Li and Jinghan Hu and Sara Hong and Nardos Demilew and Shivatmica Murgai and Jason Tran and Nishka Kacheria and Ethan Ho and Denis Liu and Lauren McLane and Olivia Bruvik and Dai-Rong Han and Seungwoo Kim and Akhil Vyas and Cuiyuanxiu Chen and Ryan Li and Weiran Xu and Jonathan Z. Ye and Prerit Choudhary and Siddharth M. Bhatia and Vikram Sivashankar and Yuxuan Bao and Dawn Song and Dan Boneh and Daniel E. Ho and Percy Liang},
      year={2025},
      eprint={2505.15216},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
}