AssistantBench

Active

214 realistic, time-consuming web-research tasks ("how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.

Domain: agentic
Format: Custom
Size: 214 tasks
License: MIT
Published: Jul 2024
Notable for: Benchmark for evaluating browser use, planning and retrieval in the agentic domain.

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is AssistantBench?
AssistantBench is a benchmark of 214 realistic, time-consuming web-research tasks (for example, "how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.
What capabilities does AssistantBench test?
AssistantBench evaluates browser use, planning, retrieval, and tool calling.
How can a model improve its AssistantBench score?
Sophon links tools that target this eval, such as RL environments, datasets, and scaffolds. For AssistantBench, the linked tools include BrowserGym.
What license is AssistantBench under?
AssistantBench is available under the MIT license.