Question 1

What is Agent Diff Bench?

Accepted Answer

Agent Diff Bench is a benchmark for evaluating agents on Slack, Linear, Box, and Calendar via Bash and Python.

Question 2

What is the current top score on Agent Diff Bench?

Accepted Answer

The top reported score is 74.7%, achieved by DeepSeek V3.2. Six models have reported scores, two of them from frontier labs.

Question 3

How can a model improve its Agent Diff Bench score?

Accepted Answer

Tools linked to Agent Diff Bench on Sophon include DIFF Bench RL Env (Community). Linked tools are RL environments, datasets, and scaffolds that target this eval.

Question 4

What license is Agent Diff Bench under?

Accepted Answer

The license for Agent Diff Bench is listed as unknown.

Agent Diff Bench
