
τ²-bench (Tau²-bench)

Saturated

Sierra's dual-control extension of τ-bench: the LLM-simulated user, like the agent, can now call tools that act on the same shared environment.

Publisher: Sierra
Domain: agentic
Format: Custom
Size: 220 tasks
License: MIT
Published: Jun 2025
Notable for: Benchmark for evaluating tool calling, multi-turn dialog, and planning in agentic settings.


Attribution

Leaderboard scores
AAprime-hub

Top score: 99.1% by JT-35B-Flash; 320 models reporting (60 frontier)

Score history

[Chart: score over time. Y-axis: 0% to 100%. X-axis: Sep 23 to Jan 26. Labeled models, in order: Mistral 7B Instruct, Solar Mini, Mistral Large 2407, Pixtral Large 2411, Grok 3 mini, Kimi K2 Thinking, Step 3.5 Flash, JT-35B-Flash.]

Top models

[Bar chart: τ²-bench (Tau²-bench), 21 models shown. Highest value: JT-35B-Flash at 99.1.]

Related tools


Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is τ²-bench (Tau²-bench)?
Sierra's dual-control extension of τ-bench: the LLM-simulated user, like the agent, can now call tools that act on the same shared environment.
What capabilities does τ²-bench (Tau²-bench) test?
τ²-bench (Tau²-bench) evaluates tool calling, multi-turn dialog, and planning.
What is the current top score on τ²-bench (Tau²-bench)?
The top reported score is 99.1% by JT-35B-Flash, across 320 models reporting (60 from frontier labs).
How can a model improve its τ²-bench (Tau²-bench) score?
Tools linked to τ²-bench (Tau²-bench) on Sophon include TAU 2 Bench RL Env (Community), TAU 2 Bench RL Env (Prime Intellect), TAU 2 Bench RL Env (Community), and TAU 2 Synth RL Env (Prime). These are RL environments, datasets, and scaffolds that target this eval.
What license is τ²-bench (Tau²-bench) under?
τ²-bench (Tau²-bench) is available under the MIT license.
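
The dual-control idea described in the first FAQ answer can be illustrated with a small toy sketch. It is not the benchmark's actual API: the `SharedEnv` class, the tool names, and the telecom-style state keys are all hypothetical, chosen only to show two parties with separate tool sets acting on one shared state, with success judged on the final state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

State = Dict[str, bool]
Tool = Callable[[State], None]


# Hypothetical tools; the names are illustrative and not taken from the benchmark.
def enable_roaming(state: State) -> None:
    state["roaming_enabled"] = True


def disable_airplane_mode(state: State) -> None:
    state["airplane_mode"] = False


@dataclass
class SharedEnv:
    """Toy world state that both the agent and the user modify via tools."""
    state: State = field(
        default_factory=lambda: {"airplane_mode": True, "roaming_enabled": False}
    )
    agent_tools: Dict[str, Tool] = field(default_factory=dict)
    user_tools: Dict[str, Tool] = field(default_factory=dict)

    def call(self, role: str, tool: str) -> None:
        # Each role may only call tools from its own tool set.
        if role not in ("agent", "user"):
            raise ValueError(f"unknown role: {role}")
        tools = self.agent_tools if role == "agent" else self.user_tools
        if tool not in tools:
            raise PermissionError(f"{role} cannot call {tool}")
        tools[tool](self.state)


def make_env() -> SharedEnv:
    return SharedEnv(
        agent_tools={"enable_roaming": enable_roaming},
        user_tools={"disable_airplane_mode": disable_airplane_mode},
    )


def goal_reached(state: State) -> bool:
    # The task counts as solved only if both parties did their part.
    return state["roaming_enabled"] and not state["airplane_mode"]


def run_episode() -> bool:
    env = make_env()
    env.call("agent", "enable_roaming")        # agent-side action
    env.call("user", "disable_airplane_mode")  # user-side action the agent must request
    return goal_reached(env.state)
```

In this sketch the agent cannot finish the task alone, because the change to `airplane_mode` is reachable only through a user tool. That is the coordination problem a dual-control setup is meant to expose.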