AutomationBench

Frontier

Evaluates AI agents on realistic, multi-step business workflows across 47 simulated SaaS tools.

Domain: rl-env
License: unknown
Published: Apr 2026

Attribution

Leaderboard scores: prime-hub

Top score: 56.1% by Claude Opus 4.6; 12 models reporting (5 frontier)

Score history

[Chart: score (0% to 100%) over time, Dec 25 to Apr 26. Labeled models: Gemini 3 Flash Preview, Claude Opus 4.6.]

Top models

[Bar chart: AutomationBench scores for 12 models. Highest value: Claude Opus 4.6 at 56.1.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is AutomationBench?
AutomationBench evaluates AI agents on realistic, multi-step business workflows across 47 simulated SaaS tools.
What is the current top score on AutomationBench?
The top reported score is 56.1% by Claude Opus 4.6, across 12 models reporting (5 from frontier labs).
How can a model improve its AutomationBench score?
Sophon links tools to this eval, covering RL environments, datasets, and scaffolds that target it. For AutomationBench, these include Automationbench RL Env (Zapier).
What license is AutomationBench under?
The license for AutomationBench is listed as unknown.