Terminal-Bench
Active
Suite of real-world command-line tasks executed inside sandboxed Docker containers to measure agentic shell competence.
- Publisher: Laude Institute
- Capabilities: Tool Calling, Planning, Code Editing, Debugging
- Domain: agentic
- Format: Custom
- Size: 80 tasks
- License: Apache-2.0
- Published: Jan 2025
- Updates: Monthly
- Notable for: The reference leaderboard for terminal / DevOps / sysadmin agent capability, complementary to SWE-bench's IDE-style coding focus.
- Canonical: tbench.ai
- Official leaderboard: tbench.ai/leaderboard
Top score: 26.7% by DeepSeek V3.1 (1 model reporting, 1 frontier)
Top models: 1
Where it's ranked: 1
Related tools: 6 (implementations, trainers, datasets and scaffolds linked to this eval)
Papers: 3
Contributors: 2
FAQ
- What is Terminal-Bench?
- Terminal-Bench is a suite of real-world command-line tasks executed inside sandboxed Docker containers to measure agentic shell competence.
- What capabilities does Terminal-Bench test?
- Terminal-Bench evaluates tool calling, planning, code editing, and debugging.
- What is the current top score on Terminal-Bench?
- The top reported score is 26.7%, achieved by DeepSeek V3.1. Only 1 model has reported results so far (1 from a frontier lab).
- How can a model improve its Terminal-Bench score?
- Sophon links several tools to Terminal-Bench, including Terminal Bench RL Env (Community), Terminalbench RL Env (Community), OpenEnv Terminus (HF Sandbox terminal), and Terminal-Bench (Verifiers wrapper). These are RL environments, datasets, and scaffolds that target this eval.
- What license is Terminal-Bench under?
- Terminal-Bench is available under Apache-2.0.
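
As a rough illustration of how a percentage score like the top score quoted above can be derived, the sketch below computes a task pass rate from per-task outcomes. It assumes the score is simply the share of tasks resolved; the page does not state the scoring protocol, and `TaskResult`, `pass_rate`, and the task names are hypothetical, not part of Terminal-Bench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one task run (hypothetical structure, not the official format)."""
    task_id: str
    resolved: bool


def pass_rate(results):
    """Return the percentage of tasks marked as resolved.

    Assumption: the score is resolved tasks divided by total tasks.
    """
    if not results:
        raise ValueError("no task results to score")
    resolved = sum(1 for r in results if r.resolved)
    return 100.0 * resolved / len(results)


# Made-up outcomes for four illustrative tasks.
results = [
    TaskResult("task-a", True),
    TaskResult("task-b", False),
    TaskResult("task-c", False),
    TaskResult("task-d", True),
]

print(f"{pass_rate(results):.1f}%")
```

Real evaluations may average over several runs or weight tasks differently, so a published percentage need not correspond to a whole number of tasks.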