AssistantBench
Active
214 realistic, time-consuming web-research tasks ("how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.
- Publisher: Allen Institute for AI (Ai2)
- Capabilities: Browser Use, Planning, Retrieval, Tool Calling
- Domain: agentic
- Format: Custom
- Size: 214 tasks
- License: MIT
- Published: Jul 2024
- Notable for: Benchmark for evaluating browser use, planning and retrieval in the agentic domain.
- Canonical: assistantbench.github.io
Related tools (1)
Implementations, trainers, datasets and scaffolds linked to this eval.
FAQ
- What is AssistantBench?
- AssistantBench is a benchmark of 214 realistic, time-consuming web-research tasks (for example, "how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.
- What capabilities does AssistantBench test?
- AssistantBench evaluates browser use, planning, retrieval, and tool calling.
- How can a model improve its AssistantBench score?
- Sophon links tools that target this eval (RL environments, datasets, and scaffolds), including BrowserGym.
- What license is AssistantBench under?
- AssistantBench is available under the MIT license.