τ²-bench (Tau²-bench)

Saturated

Sierra's dual-control extension of τ-bench: the user is also an LLM-driven agent, and both the agent and the user act on the same tool-driven environment.
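
To make the dual-control idea concrete, here is a minimal toy sketch in Python. It is not the tau2-bench API: `SharedEnv`, `enable_line`, `toggle_airplane_mode`, `has_service`, and the telecom-style state keys are hypothetical names chosen for illustration, and scripted turns stand in for the two LLMs. It only shows that the agent and the user each hold their own tools but change the same environment state, so the task succeeds only when both act.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]


def enable_line(state: State) -> None:
    """Hypothetical agent-side tool: activate the customer's line."""
    state["line_enabled"] = True


def toggle_airplane_mode(state: State) -> None:
    """Hypothetical user-side tool: flip airplane mode on the user's phone."""
    state["airplane_mode"] = not state["airplane_mode"]


# Each party gets its own tool set; neither can call the other's tools.
TOOLS: Dict[str, Dict[str, Callable[[State], None]]] = {
    "agent": {"enable_line": enable_line},
    "user": {"toggle_airplane_mode": toggle_airplane_mode},
}


@dataclass
class SharedEnv:
    """Toy world state that both parties read and modify."""
    state: State = field(
        default_factory=lambda: {"line_enabled": False, "airplane_mode": True}
    )
    log: List[Tuple[str, str]] = field(default_factory=list)

    def call(self, actor: str, tool: str) -> None:
        # Raises KeyError if the actor does not own the requested tool.
        TOOLS[actor][tool](self.state)
        self.log.append((actor, tool))


def has_service(state: State) -> bool:
    """Success needs one action from each side of the conversation."""
    return state["line_enabled"] and not state["airplane_mode"]


def run_episode() -> SharedEnv:
    env = SharedEnv()
    env.call("agent", "enable_line")          # agent fixes its side
    env.call("user", "toggle_airplane_mode")  # user acts on the agent's advice
    return env


if __name__ == "__main__":
    env = run_episode()
    print(env.log, has_service(env.state))
```

In this sketch, neither party can reach the goal alone, which is the property the "dual-control" label refers to; a real run would replace the scripted calls with model-generated turns.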

Open

Publisher: Sierra
Capabilities: Tool Calling, Multi-Turn Dialog, Planning
Domain: agentic
Format: Custom
Size: 220 tasks
License: MIT
Published: Jun 2025
Notable for: Benchmark for evaluating tool calling, multi-turn dialog, and planning in the agentic domain.
Canonical: github.com/sierra-research/tau2-bench
Also on: sierra.ai/blog/tau-bench-evaluating-conversational-agents

Attribution

Leaderboard scores: AA, prime-hub

Top score 99.1% by JT-35B-Flash - 343 models reporting (71 frontier)

Score history (342)

Top models (343)

[Bar chart: τ²-bench (Tau²-bench) top models, 21 models shown; highest value is JT-35B-Flash at 99.1]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

TAU 2 Bench RL Env (Community)

Verifiers implementation of tau2-bench

Implementation, RL Env, Tool Agent User, Tool Use, User Sim

TAU 2 Bench RL Env (Prime Intellect)

Prime Intellect

τ²-bench evaluation environment

Implementation, RL Env, Tool Agent User, Tool Use, User Sim

TAU 2 Bench RL Env (Community)

Verifiers implementation of tau2-bench

Implementation, RL Env, Tool Agent User, Tool Use, User Sim

TAU 2 Synth RL Env (Prime)

Prime

τ²-bench with custom synthetic domains (library, fitness_gym, tech_support, telecom, cloud_incident_response, daily_planner, ev_charging_support)

Trains toward, RL Env, Tool Agent User, Tool Use, User Sim

TAU 3 Bench RL Env (Prime Intellect)

Prime Intellect

τ²-bench evaluation environment. Focus on tau-knowledge.

Trains toward, RL Env, Tool Agent User, Tool Use, User Sim

FAQ

What is τ²-bench (Tau²-bench)?

Sierra's dual-control extension of τ-bench, in which the user is also an LLM-driven agent and both the agent and the user act on the same tool-driven environment.

What capabilities does τ²-bench (Tau²-bench) test?

τ²-bench (Tau²-bench) evaluates tool calling, multi-turn dialog, and planning.

What is the current top score on τ²-bench (Tau²-bench)?

The top reported score is 99.1% by JT-35B-Flash, across 343 models reporting (71 from frontier labs).

How can a model improve its τ²-bench (Tau²-bench) score?

Sophon links RL environments, datasets, and scaffolds that target this eval. These include two entries named TAU 2 Bench RL Env (Community), TAU 2 Bench RL Env (Prime Intellect), and TAU 2 Synth RL Env (Prime).

What license is τ²-bench (Tau²-bench) under?

τ²-bench (Tau²-bench) is available under the MIT license.