debugging

Category: coding
Slug: debugging
Evals: 7
Tools: 16
Models: 291
Papers: 6

Evals testing this capability

HCAST

METR (Model Evaluation and Threat Research)

METR's Human-Calibrated Autonomy Software Tasks - 189 multi-step software tasks calibrated against human time-to-complete, used to measure how long a task an agent can complete autonomously.

Active, Code Generation, Planning, Tool Calling, Agentic

LiveCodeBench

University of California, Berkeley

Rolling competitive-programming benchmark that collects LeetCode / AtCoder / Codeforces problems published after a known cutoff date to limit training-data contamination.

Active, Code Generation, Debugging, Code
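
The contamination defense described for LiveCodeBench amounts to a date filter: evaluate a model only on problems published after its training cutoff. A minimal sketch, assuming a hypothetical `Problem` record with a `released` date (the field names and sample data are illustrative, not LiveCodeBench's actual schema):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record for illustration; not LiveCodeBench's real schema.
    source: str
    title: str
    released: date


def uncontaminated(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Made-up sample data.
problems = [
    Problem("LeetCode", "old-problem", date(2023, 1, 15)),
    Problem("AtCoder", "new-problem", date(2024, 6, 1)),
]
fresh = uncontaminated(problems, training_cutoff=date(2023, 9, 30))
print([p.title for p in fresh])  # ['new-problem']
```

Because the filter depends on each model's cutoff, two models may be scored on different problem sets unless a common evaluation window is chosen.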

SWE-bench Lite

Princeton University

300-issue subset of SWE-bench focused on functional bug fixes that are easier to evaluate - used for fast iteration before full SWE-bench runs.

Active, Code Editing, Debugging, Tool Calling, Code

SWE-bench Verified

OpenAI

500 human-validated SWE-bench tasks confirmed solvable from the issue alone, with non-flaky test suites - among the most widely reported agentic coding benchmarks.

Active, Code Editing, Debugging, Tool Calling, Code

SWE-bench

Princeton NLP Group

2,294 real GitHub issues from 12 popular Python repos that require an agent to produce a patch passing the project's test suite.

Active, Code Editing, Debugging, Tool Calling, Code

SWE-Gym

University of California, Berkeley

First publicly available training environment for software-engineering agents - 2,438 real Python GitHub issues with executable Docker test environments and golden patches.

Active, Code Editing, Debugging, Tool Calling, Code

Terminal-Bench

Laude Institute

Suite of real-world command-line tasks executed inside sandboxed Docker containers to measure agentic shell competence.

Active, Tool Calling, Planning, Code Editing, Agentic

Tools lifting evals here

Agent Bench RL Env (Prime Community)

Prime Community

Benchmarks model performance on SWE-bench using the mini-swe-agent harness.

RL Env, Tool Use, Agent

lifts 3 evals here

SWE-Gym

University of California, Berkeley

First open training environment for real-world software-engineering agents - 2,438 Python tasks from 11 repos, each with an executable runtime and a hidden test suite.

RL Env, Code Editing, Debugging, Tool Calling, Code

lifts 3 evals here

Agent PLUS RL Env (Prime Intellect)

Prime Intellect

Mini SWE Agent Plus environment for solving SWE issues inside Prime Sandboxes.

RL Env, SWE, Code

lifts 2 evals here

DeepSWE RL Env (Prime Intellect)

Prime Intellect

DeepSWE environment for solving SWE issues inside Prime Sandboxes.

RL Env, SWE, Code

lifts 2 evals here

mini-swe-agent-plus

Prime Intellect

Verifiers env that runs the mini-swe-agent harness inside Prime Sandboxes against real GitHub issues; reward is test-suite pass.

RL Env, Code Editing, Debugging, Tool Calling, Code

lifts 2 evals here
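
A "test-suite pass" reward of the kind this entry describes can be reduced to a binary signal from the test command's exit status. The sketch below is an assumption about the general pattern, not the Verifiers or mini-swe-agent API; `suite_pass_reward` and the stand-in commands are hypothetical names chosen for illustration:

```python
import subprocess
import sys


def suite_pass_reward(test_command, cwd=None, timeout=600):
    """Return 1.0 if the test command exits with status 0, else 0.0.

    Hypothetical helper: a real environment would run the repository's own
    test suite inside its sandbox rather than a local subprocess.
    """
    try:
        result = subprocess.run(
            test_command, cwd=cwd, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # A hung test run is treated the same as a failing one.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


# Stand-in commands so the sketch runs anywhere; they only mimic pass and fail.
passing = [sys.executable, "-c", "assert 1 + 1 == 2"]
failing = [sys.executable, "-c", "assert 1 + 1 == 3"]
print(suite_pass_reward(passing), suite_pass_reward(failing))  # 1.0 0.0
```

A binary reward like this is simple to verify but sparse; partial credit per passing test is a common alternative when finer-grained signal is wanted.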

Bench 2 RL Env (Prime Intellect)

Prime Intellect

Terminal-Bench 2.1 Harbor taskset with Terminus2 as the default harness.

RL Env, CLI Agent, Terminal Bench

lifts 1 eval here

Harbor RL Env (Prime Intellect)

Prime Intellect

Harbor (terminal-bench-style) tasks via ComposableEnv.

RL Env, Terminal Bench

lifts 1 eval here

LiveCodeBench RL Env (Prime Intellect)

Prime Intellect

LiveCodeBench evaluation environment.

RL Env, Code

lifts 1 eval here

Opencode SWE RL Env (Prime Intellect)

Prime Intellect

OpenCode SWE environment for solving SWE issues inside Prime Sandboxes.

RL Env, SWE, Code

lifts 1 eval here

OpenEnv Terminus (HF Sandbox terminal)

Hugging Face

Single-tool (tmux session) terminal coding environment in the OpenEnv standard, backed by Hugging Face Sandbox containers - the OpenEnv port of the Terminus-2 design.

RL Env, Tool Calling, Code Editing, Planning, Code

lifts 1 eval here

OpenThoughts

Open Thoughts

A fully-open distillation of long DeepSeek-R1 reasoning traces - the community's flagship "open R1" SFT corpus for reasoning models.

SFT Dataset, Math, Code Generation, Scientific Reasoning

lifts 1 eval here

SWE RL Env (Prime Intellect)

Prime Intellect

SWE tasks (R2E-Gym, SWE-bench, ...).

RL Env, SWE, Code

lifts 1 eval here

SWE-bench Pro RL Env (Prime Intellect)

Prime Intellect

SWE-bench Pro environment backed by Harbor tasks.

RL Env, V1, SWE, SWE Bench, Code

lifts 1 eval here

Terminal Bench RL Env (Community)

Terminal-Bench wrapper environment for verifiers.

RL Env

lifts 1 eval here

Terminal-Bench (Verifiers wrapper)

Harbor

Terminal-Bench tasks wrapped as a Verifiers RL environment - the model drives a tmux shell to complete realistic end-to-end terminal jobs (compiling, sysadmin, data science).

RL Env, Tool Calling, Planning, Code Editing, Code

lifts 1 eval here