Terminal-Bench

Active

Suite of real-world command-line tasks executed inside sandboxed Docker containers to measure agentic shell competence.

Open
Domain: agentic
Format: Custom
Size: 80 tasks
License: Apache-2.0
Published: Jan 2025
Updates: Monthly
Notable for: The reference leaderboard for terminal / DevOps / sysadmin agent capability, complementary to SWE-bench's IDE-style coding focus.
Canonical: tbench.ai
Official leaderboard: tbench.ai/leaderboard

Attribution

Leaderboard scores: prime-hub

Top score: 26.7% by DeepSeek V3.1 (1 model reporting, 1 from a frontier lab)

Top models (1)

[Bar chart: Terminal-Bench scores, 1 bar. Highest value: DeepSeek V3.1 at 26.7.]

Where it's ranked (1)

Related tools (6)

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers (3)

Contributors (2)

FAQ

What is Terminal-Bench?
Terminal-Bench is a suite of real-world command-line tasks executed inside sandboxed Docker containers to measure agentic shell competence.
What capabilities does Terminal-Bench test?
Terminal-Bench evaluates tool calling, planning, code editing, and debugging.
What is the current top score on Terminal-Bench?
The top reported score is 26.7% by DeepSeek V3.1, with 1 model reporting (1 from a frontier lab).
How can a model improve its Terminal-Bench score?
Tools linked to Terminal-Bench on Sophon include Terminal Bench RL Env (Community), Terminalbench RL Env (Community), OpenEnv Terminus (HF Sandbox terminal), and Terminal-Bench (Verifiers wrapper). These are RL environments, datasets, and scaffolds that target this eval.
What license is Terminal-Bench under?
Terminal-Bench is available under Apache-2.0.