τ-bench (tau-bench)
Frontier
Multi-turn customer-service simulation testing whether agents follow domain policies while interacting with a tool-using user simulator.
- Publisher: Sierra
- Domain: agentic
- Format: Custom
- Size: 165 tasks
- License: MIT
- Published: Jun 2024
- Updates: Live
- Notable for: The reference benchmark for production-grade conversational customer-service agents. Published by Sierra (Bret Taylor's company), it covers the retail, airline, and telecom domains, plus the newer τ-Knowledge and τ-Voice extensions.
- Canonical: github.com/sierra-research/tau-bench
- Official leaderboard: taubench.com
Top score: 32.5% by GPT-4o, with 3 models reporting (3 from frontier labs)
FAQ
- What is τ-bench (tau-bench)?
- Multi-turn customer-service simulation testing whether agents follow domain policies while interacting with a tool-using user simulator.
- What capabilities does τ-bench (tau-bench) test?
- τ-bench (tau-bench) evaluates tool calling, multi-turn dialog, instruction following, and planning.
- What is the current top score on τ-bench (tau-bench)?
- The top reported score is 32.5% by GPT-4o, across 3 models reporting (3 from frontier labs).
- How can a model improve its τ-bench (tau-bench) score?
- Tools linked to τ-bench (tau-bench) on Sophon include Bench ENV RL Env (Prime Community) and TAU 3 Bench RL Env (Prime Intellect). These are RL environments, datasets, and scaffolds that target this eval.
- What license is τ-bench (tau-bench) under?
- τ-bench (tau-bench) is available under the MIT license.
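
The setup described above (an agent that must follow a domain policy while handling a simulated user's requests over several turns) can be illustrated with a toy sketch. Everything here is hypothetical: `Env`, `ScriptedUser`, `run_episode`, the two agents, and the cancellation policy are invented for illustration and are not part of the tau-bench codebase. The sketch also assumes an episode is scored by comparing the final database state with an expected state, and it replaces the LLM-driven user with a fixed script.

```python
from dataclasses import dataclass, field

# Hypothetical toy domain; none of these names come from the tau-bench codebase.
POLICY = "Cancel an order only if its status is 'pending'."


@dataclass
class Env:
    """Holds the mutable database that the agent's tools act on."""
    orders: dict = field(default_factory=lambda: {"A1": "pending", "B2": "shipped"})

    def get_status(self, order_id: str) -> str:
        # Read-only tool.
        return self.orders.get(order_id, "unknown")

    def cancel_order(self, order_id: str) -> None:
        # Write tool: it does not enforce the policy itself.
        self.orders[order_id] = "cancelled"


class ScriptedUser:
    """Stand-in for an LLM user simulator: replays fixed cancellation requests."""

    def __init__(self, requests):
        self._requests = list(requests)

    def next_message(self):
        return self._requests.pop(0) if self._requests else None


def policy_following_agent(env: Env, order_id: str) -> str:
    """Checks the policy with the read tool before calling the write tool."""
    if env.get_status(order_id) == "pending":
        env.cancel_order(order_id)
        return f"Order {order_id} cancelled."
    return f"Order {order_id} cannot be cancelled under the policy."


def careless_agent(env: Env, order_id: str) -> str:
    """Does whatever the user asks, ignoring the policy."""
    env.cancel_order(order_id)
    return f"Order {order_id} cancelled."


def run_episode(agent, requests, expected_orders) -> bool:
    """Runs one multi-turn episode and scores it on the final database state."""
    env = Env()
    user = ScriptedUser(requests)
    while (order_id := user.next_message()) is not None:
        agent(env, order_id)  # one turn: user request, agent tool calls, reply
    return env.orders == expected_orders


# The user asks to cancel both orders; only the pending one may be cancelled.
expected = {"A1": "cancelled", "B2": "shipped"}
passed = run_episode(policy_following_agent, ["A1", "B2"], expected)
failed = run_episode(careless_agent, ["A1", "B2"], expected)
print("policy-following agent passed:", passed)
print("careless agent passed:", failed)
```

In this sketch the careless agent satisfies every user request yet fails the task, because cancelling the shipped order leaves the database in a state the policy does not allow. That is the kind of behavior a policy-adherence benchmark is meant to surface.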