τ-bench (tau-bench)

Frontier

Multi-turn customer-service simulation testing whether agents follow domain policies while interacting with a tool-using user simulator.

Open

Publisher: Sierra
Capabilities: Tool Calling, Multi-Turn Dialog, Instruction Following, Planning
Domain: agentic
Format: Custom
Size: 165 tasks
License: MIT
Published: Jun 2024
Updates: Live
Notable for: The reference benchmark for production-grade conversational customer-service agents; published by Sierra (Bret Taylor's company), covering retail, airline, telecom, plus newer τ-Knowledge and τ-Voice extensions.
Canonical: github.com/sierra-research/tau-bench
Official leaderboard: taubench.com

Attribution

Leaderboard scores: tau-bench

Top score 62.6% by Claude 3.5 Sonnet - 3 models reporting (3 frontier)

Score history · Top models

[Bar chart: τ-bench (tau-bench) scores for 3 models. Highest value: Claude 3.5 Sonnet at 62.6.]

Where it's ranked

Official leaderboard: taubench.com (single benchmark, live)

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Bench ENV RL Env (Prime Community)

Prime Community

τ-bench: Tool-Agent-User benchmark for conversational agents in customer service domains with user simulation

Trains toward · RL Env · Tool Use · Airline · Conversation

TAU 3 Bench RL Env (Prime Intellect)

Prime Intellect

τ²-bench evaluation environment. Focus on tau-knowledge.

Trains toward · RL Env · Tool Use · Tool Agent User · User Sim

Papers

τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

preprint · 2024 · introduces this benchmark

Sierra benchmark that puts an agent between a simulated user and a customer-service API (retail / airline), measuring multi-turn tool-using policy adherence.
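
The setup described above (an agent sitting between a simulated user and a customer-service API, judged on policy adherence over several turns) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the benchmark's actual harness: `ToyRetailAPI`, `scripted_user`, `toy_agent`, `run_episode`, the "only pending orders may be cancelled" policy, and the success check (no violations plus a matching final order state) are all hypothetical names and rules invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRetailAPI:
    """Hypothetical customer-service API with one read tool and one write tool."""
    orders: dict
    violations: list = field(default_factory=list)

    def get_order(self, order_id):
        return dict(self.orders[order_id])

    def cancel_order(self, order_id):
        # Assumed toy policy: only "pending" orders may be cancelled.
        # The API does not block the call; it only records the violation.
        if self.orders[order_id]["status"] != "pending":
            self.violations.append(order_id)
        self.orders[order_id]["status"] = "cancelled"
        return dict(self.orders[order_id])


def scripted_user(turn):
    """Stand-in for the simulated user: a fixed script, None when finished."""
    script = ["Please cancel order A1.", "Also cancel order B2."]
    return script[turn] if turn < len(script) else None


def toy_agent(message, api):
    """Stand-in for the agent under test: reads the order before acting."""
    order_id = message.rstrip(".").split()[-1]
    order = api.get_order(order_id)
    if order["status"] != "pending":
        return f"Sorry, order {order_id} cannot be cancelled."
    api.cancel_order(order_id)
    return f"Order {order_id} has been cancelled."


def run_episode(agent, user, api, expected_orders, max_turns=10):
    """Alternate user and agent turns, then check policy and final state."""
    transcript = []
    for turn in range(max_turns):
        message = user(turn)
        if message is None:
            break
        transcript.append((message, agent(message, api)))
    success = not api.violations and api.orders == expected_orders
    return success, transcript


api = ToyRetailAPI({"A1": {"status": "pending"}, "B2": {"status": "shipped"}})
expected = {"A1": {"status": "cancelled"}, "B2": {"status": "shipped"}}
success, transcript = run_episode(toy_agent, scripted_user, api, expected)
print(success)  # True
```

In this sketch an agent that cancels the shipped order B2 on request would fail the episode even though it did what the user asked, which is the kind of behavior a policy-adherence benchmark is meant to catch.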

Contributors

Shunyu Yao, Noah Shinn, Karthik Narasimhan

FAQ

What is τ-bench (tau-bench)? A multi-turn customer-service simulation that tests whether agents follow domain policies while interacting with a tool-using user simulator.
What capabilities does τ-bench (tau-bench) test? τ-bench (tau-bench) evaluates tool calling, multi-turn dialog, instruction following, and planning.
What is the current top score on τ-bench (tau-bench)? The top reported score is 62.6%, by Claude 3.5 Sonnet, out of 3 models reporting (all 3 from frontier labs).
How can a model improve its τ-bench (tau-bench) score? Tools linked to τ-bench (tau-bench) on Sophon (RL environments, datasets, and scaffolds that target this eval) include Bench ENV RL Env (Prime Community) and TAU 3 Bench RL Env (Prime Intellect).
What license is τ-bench (tau-bench) under? τ-bench (tau-bench) is available under the MIT license.