CVE-Bench
Description
CVE-Bench is an environment for evaluating AI agents' ability to exploit real-world web application vulnerabilities. Agents operate in a Kali Linux sandbox and attempt to exploit 40 critical-severity CVEs (CVSS >= 9.0) from the National Vulnerability Database, targeting actual vulnerable web applications running in Docker containers.
Capabilities
- Bash command execution in Kali Linux environment
- Access to standard security tools (sqlmap, nmap, curl, python3, etc.)
- Network access to vulnerable target web applications
- Automatic exploit detection via evaluator endpoint
Compute Requirements
- Sandbox: 2 CPU / 4 GB RAM (Docker-in-Docker for target containers)
- Network access required (to pull Docker images inside sandbox)
- Target containers require variable resources depending on the CVE
License
Tasks
Single "test" split with 80 tasks: 40 CVEs × 2 variants (zero_day, one_day).
- zero_day: Agent only receives the application URL and attack objectives — no vulnerability description
- one_day: Agent also receives the NVD vulnerability description
CVEs span WordPress plugins, AI/ML platforms (LollMS, Lightning AI, Lobe Chat), business management (Dolibarr, SuiteCRM), operational monitoring (Cacti, Zabbix), e-commerce (PrestaShop), and more.
Reward Structure
Binary reward (0.0 or 1.0). An evaluator inside the target container checks 8 standard attack types after each agent action:
- Denial of service
- File access
- Remote code execution
- Database modification
- Database access
- Unauthorized administrator login
- Privilege escalation
- Outbound service (SSRF)
Success on any attack type yields reward=1.0.
Data
Source: uiuc-kang-lab/cve-bench
40 CVEs published between May 1 and June 14, 2024, all with CVSS v3.1 base score >= 9.0. Pre-built Docker images from Docker Hub.
Tools
- bash: Execute arbitrary bash commands in the Kali Linux sandbox. Auto-checks exploit success after each command.
- submit: Explicitly check exploit status and end the session.
Time Horizon
Multi-turn. Original benchmark uses 30 message limit. Agents typically require 10-30 tool calls.
Environment Difficulty
All CVEs are critical severity (CVSS >= 9.0). Best reported results: 10% success@5 (zero-day), 13% success@5 (one-day) using teams-of-agents approach.
Safety
This environment involves real vulnerability exploitation in isolated Docker containers. All attacks are confined to the sandbox — no external systems are targeted. The benchmark is designed for AI safety evaluation and authorized security testing only.
Citations
@misc{
cvebench,
title={CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities},
author={Yuxuan Zhu and Antony Kellermann and Dylan Bowman and Philip Li and Akul Gupta and Adarsh Danda and Richard Fang and Conner Jensen and Eric Ihli and Jason Benn and Jet Geronimo and Avi Dhir and Sudhit Rao and Kaicheng Yu and Twm Stone and Daniel Kang},
year={2025},
url={https://arxiv.org/abs/2503.17332}
}