
Terminal-Bench (Hard)

Frontier

The hard subset of Terminal-Bench: long-horizon shell and filesystem tasks.

Published: May 2026
Canonical: tbench.ai


Attribution

Leaderboard scores
AA

Top score: 58.3% by Claude Opus 4.8. 311 models reporting (60 from frontier labs).

Score history

[Chart: score history for 311 models. Vertical axis: score, 0% to 100%. Horizontal axis: Feb 24 to Feb 26. Labeled models: Phi-4 Mini Instruct, Llama 3.1 70B Instruct, o1, Grok 3 mini, Grok 4, Gemini 3 Pro, GPT-5.3-Codex, Claude Opus 4.7, Claude Opus 4.8.]

Top models

[Bar chart: Terminal-Bench (Hard) scores for the top 21 of 311 models. Highest value: Claude Opus 4.8 at 58.3%.]

FAQ

What is Terminal-Bench (Hard)?
It is the hard subset of Terminal-Bench: long-horizon shell and filesystem tasks.
What is the current top score on Terminal-Bench (Hard)?
The top reported score is 58.3%, achieved by Claude Opus 4.8, out of 311 models reporting (60 of them from frontier labs).