τ-bench (tau-bench)

Frontier

Multi-turn customer-service simulation testing whether agents follow domain policies while interacting with a tool-using user simulator.

Publisher
Sierra
Domain
agentic
Format
Custom
Size
165 tasks
License
MIT
Published
Jun 2024
Updates
Live
Notable for
The reference benchmark for production-grade conversational customer-service agents. Published by Sierra (Bret Taylor's company), it covers the retail, airline, and telecom domains, plus the newer τ-Knowledge and τ-Voice extensions.
Official leaderboard
taubench.com

Leaderboard scores
tau-bench

Top score 32.5% by GPT-4o - 3 models reporting (3 frontier)

Score history

[Chart: score (%) over time, Jul 2024 to Nov 2024; series: GPT-4o-mini, GPT-4o]

Top models

[Bar chart: 3 models. Highest value: Claude Sonnet 3.5 at 62.6.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is τ-bench (tau-bench)?
Multi-turn customer-service simulation testing whether agents follow domain policies while interacting with a tool-using user simulator.
What capabilities does τ-bench (tau-bench) test?
τ-bench (tau-bench) evaluates tool calling, multi-turn dialog, instruction following, and planning.
What is the current top score on τ-bench (tau-bench)?
The top reported score is 32.5% by GPT-4o, across 3 models reporting (3 from frontier labs).
How can a model improve its τ-bench (tau-bench) score?
Tools linked to τ-bench (tau-bench) on Sophon include Bench ENV RL Env (Prime Community) and TAU 3 Bench RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
What license is τ-bench (tau-bench) under?
τ-bench (tau-bench) is available under the MIT license.