MiniWoB++

Over 100 small synthetic web-page tasks (click a button, fill in a form, drag a slider). It is the original web-agent benchmark and is still used as a unit test for web agents.

Publisher: Stanford University
Capabilities: Browser Use, Tool Calling
Domain: agentic
Format: Custom
Size: 125 tasks
License: MIT
Published: Feb 2018
Notable for: Benchmark for evaluating browser use and tool calling in the agentic domain.
Canonical: miniwob.farama.org
Also on: github.com/Farama-Foundation/miniwob-plusplus

Attribution

Leaderboard scores: prime-hub

Top score 80.0% by gpt-oss-120b - 1 model reporting (1 frontier)

Top models

[Bar chart: MiniWoB++ top models. 1 bar; highest value: gpt-oss-120b at 80.]

1 model

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Browser Miniwob RL Env (Community)

BrowserGym environment for MiniWoB dataset

Implementation, RL Env, Browser Automation, Web Interaction, Miniwob

BrowserGym

ServiceNow Research

ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.

Trains toward, RL Env, Browser Use, Tool Calling, Planning
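
To make "Gym-style" in the BrowserGym entry above concrete, here is a minimal sketch of the reset/step loop that such frameworks expose, applied to a toy "click the button" task in the spirit of MiniWoB++. It uses only the Python standard library. `ToyClickEnv`, `goal_following_policy`, and `run_episode` are hypothetical names invented for this illustration; this is not BrowserGym's or MiniWoB++'s actual API, and the five-value return from `step` is assumed from the Gymnasium convention.

```python
import random


class ToyClickEnv:
    """Hypothetical stand-in for a MiniWoB-style 'click the button' task.

    Not BrowserGym's API: it only mirrors the Gym-style reset/step contract.
    """

    def __init__(self, buttons=("ok", "cancel", "submit"), max_steps=3, seed=0):
        self.buttons = tuple(buttons)
        self.max_steps = max_steps
        self.seed = seed
        self._target = None
        self._steps = 0

    def _observation(self):
        # A real browser environment would return a DOM tree or screenshot here.
        return {
            "goal": f"Click the '{self._target}' button",
            "buttons": self.buttons,
        }

    def reset(self):
        # Choose which button the agent is asked to click.
        self._target = random.Random(self.seed).choice(self.buttons)
        self._steps = 0
        return self._observation(), {}

    def step(self, action):
        # `action` is the label of the button the agent clicks.
        self._steps += 1
        success = action == self._target
        reward = 1.0 if success else 0.0
        terminated = success
        truncated = (not success) and self._steps >= self.max_steps
        return self._observation(), reward, terminated, truncated, {}


def goal_following_policy(obs):
    """Pick the button whose label is quoted in the goal text."""
    for label in obs["buttons"]:
        if f"'{label}'" in obs["goal"]:
            return label
    return obs["buttons"][0]


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    obs, _info = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total


if __name__ == "__main__":
    print(run_episode(ToyClickEnv(seed=1), goal_following_policy))
```

The point of the shared interface is that the same `run_episode` loop would work unchanged across tasks and benchmarks, with only the environment object swapped out.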

Openenv Browsergym RL Env (Hugging Face)

Hugging Face

OpenEnv port of ServiceNow's BrowserGym - a Playwright-backed browser environment exposing WebArena, MiniWoB, WorkArena, etc. through the standard OpenEnv API.

Trains toward, RL Env, Browser Use, Tool Calling, Planning

Androidworld RL Env (Prime Community)

Prime Community

AndroidWorld benchmark for evaluating autonomous agents on real Android apps with 116 tasks across 20 apps

Trains toward, RL Env, Mobile, Android, Tool Use

Androidworld RL Env (Prime Intellect)

Prime Intellect

AndroidWorld benchmark for evaluating autonomous agents on real Android apps with 116 tasks across 20 apps

Trains toward, RL Env, Mobile, Android, Tool Use

FAQ

What is MiniWoB++? Over 100 small synthetic web-page tasks (click a button, fill in a form, drag a slider). It is the original web-agent benchmark and is still used as a unit test for web agents.
What capabilities does MiniWoB++ test? MiniWoB++ evaluates browser use and tool calling.
What is the current top score on MiniWoB++? The top reported score is 80.0%, by gpt-oss-120b, with 1 model reporting (1 from a frontier lab).
How can a model improve its MiniWoB++ score? Tools linked to MiniWoB++ on Sophon include Browser Miniwob RL Env (Community), BrowserGym, and Openenv Browsergym RL Env (Hugging Face). These are RL environments, datasets, and scaffolds that target this eval.
What license is MiniWoB++ under? MiniWoB++ is available under the MIT license.