AssistantBench

Active

214 realistic, time-consuming web-research tasks ("how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.

Open

Publisher: Allen Institute for AI (Ai2)
Capabilities: Browser Use, Planning, Retrieval, Tool Calling
Domain: agentic
Format: Custom
Size: 214 tasks
License: MIT
Published: Jul 2024
Notable for: Benchmark for evaluating browser use, planning and retrieval in the agentic domain.
Canonical: assistantbench.github.io
Also on: github.com/oriyor/assistantbench

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

BrowserGym

ServiceNow Research

ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.

Trains toward, RL Env, Browser Use, Tool Calling, Planning
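
As a rough illustration of what a "Gym-style" interface means for a web-research task, here is a minimal sketch of the reset/step loop. It does not use BrowserGym itself: `StubWebTaskEnv`, `run_episode`, and `toy_policy` are hypothetical names invented for this example, the action strings are made up, and exact-match scoring is a simplifying assumption rather than AssistantBench's actual metric. Only the shape of the loop follows the Gymnasium convention (`reset()` returns an observation and info; `step()` returns observation, reward, terminated, truncated, info).

```python
from dataclasses import dataclass, field


@dataclass
class StubWebTaskEnv:
    """Hypothetical stand-in for a Gym-style web-task environment.

    Not BrowserGym's real API. It only mimics the Gymnasium convention:
    reset() -> (obs, info), step() -> (obs, reward, terminated, truncated, info).
    """

    goal: str
    expected_answer: str
    max_steps: int = 5
    steps_taken: int = field(default=0, init=False)

    def reset(self):
        self.steps_taken = 0
        return {"goal": self.goal, "page_text": "start page"}, {}

    def step(self, action: str):
        self.steps_taken += 1
        # An action of the form "answer: <text>" ends the episode.
        if action.startswith("answer:"):
            answer = action[len("answer:"):].strip()
            # Exact match is an assumption made for this sketch only.
            reward = 1.0 if answer == self.expected_answer else 0.0
            return {"goal": self.goal, "page_text": ""}, reward, True, False, {}
        # Any other action is treated as a browsing step with no reward.
        truncated = self.steps_taken >= self.max_steps
        obs = {"goal": self.goal, "page_text": f"result of {action}"}
        return obs, 0.0, False, truncated, {}


def run_episode(env, policy):
    """Run one episode with the given policy and return the total reward."""
    obs, _info = env.reset()
    total = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total += reward
        done = terminated or truncated
    return total


def toy_policy(obs):
    # Browse once, then answer; a real agent would call a model here.
    if obs["page_text"] == "start page":
        return "goto: search"
    return "answer: 42 minutes"
```

Because every wrapped benchmark exposes the same loop, an agent written against `run_episode` would in principle not need to change when the underlying task suite does; that is the design idea the sketch is meant to convey, not a description of BrowserGym internals.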

FAQ

What is AssistantBench?: 214 realistic, time-consuming web-research tasks ("how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.
What capabilities does AssistantBench test?: AssistantBench evaluates browser use, planning, retrieval, and tool calling.
How can a model improve its AssistantBench score?: Sophon links tools that target this eval, which can be RL environments, datasets, or scaffolds. The tool linked to AssistantBench is BrowserGym.
What license is AssistantBench under?: AssistantBench is available under the MIT license.