τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

A benchmark from Sierra that places an agent between a simulated user and a customer-service API (retail and airline domains) and measures how well the agent adheres to domain policy across multi-turn, tool-using conversations.

Publisher: Sierra
Year: 2024
Venue: preprint
Authors: 4
Hosting: External source (license unknown)

Attribution

Abstract & full text: arxiv.org/abs/2406.12045
TL;DR: Semantic Scholar

Introduces 1 artifact (1 eval)
