Taubench
tau-bench challenges agents to coordinate, guide, and assist users in achieving shared objectives across complex enterprise domains.
- Type: RL Env
- Publisher: General Reasoning
- Runtime: ORS
- License: unknown
- Size: 165 tasks
- Published: Jan 2026
Public scores on this env
1240 vf-eval reports across 12 models
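| Rank | Model | Publisher | Status | Score |
|---|---|---|---|---|
| 1 | GPT-4o | OpenAI | disputed | 61.2 |
| 2 | GPT-4 Turbo | OpenAI | disputed | 57.7 |
| 3 | GPT 4 32k | OpenAI | disputed | 56.5 |
| 4 | Claude Opus 3 | Anthropic | disputed | 44.2 |
| 5 | Mistral Large | Mistral AI | disputed | 30.7 |
| 6 | Claude Sonnet 3 | Anthropic | disputed | 26.3 |
| 7 | Gemini 1.5 Pro (Sep '24) | Google (Alphabet Inc.) | disputed | 21.7 |
| 8 | GPT-3.5 Turbo | OpenAI | disputed | 20 |
| 9 | Claude 3 Haiku | Anthropic | disputed | 19 |
| 10 | Mixtral 8X22B | Mistral AI | disputed | 17.7 |
| 11 | Gemini 1.5 Flash (Sep '24) | Google (Alphabet Inc.) | disputed | 17.4 |
| 12 | meta-llama-3-70B | Meta Platforms | disputed | 14.8 |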