Agent Diff Bench

A benchmark for evaluating agents on Slack, Linear, Box, and Calendar tasks via Bash and Python.

Domain: rl-env
License: unknown
Published: Feb 2026

Attribution

Leaderboard scores: prime-hub

Top score: 74.7% by DeepSeek V3.2, with 6 models reporting (2 from frontier labs).

Score history

[Chart: score history, y-axis 25% to 100%, x-axis Nov 25 to Dec 25; labeled models: Grok 4.1 Fast, DeepSeek V3.2]

Top models

[Bar chart: 6 models; highest value: DeepSeek V3.2 at 74.7]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is Agent Diff Bench?
Agent Diff Bench is a benchmark for evaluating agents on Slack, Linear, Box, and Calendar tasks via Bash and Python.
What is the current top score on Agent Diff Bench?
The top reported score is 74.7% by DeepSeek V3.2, across 6 models reporting (2 from frontier labs).
How can a model improve its Agent Diff Bench score?
Tools linked to Agent Diff Bench on Sophon include DIFF Bench RL Env (Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.
What license is Agent Diff Bench under?
The license for Agent Diff Bench is not specified; it is listed as unknown.