browser use

Category: agents
Slug: browser-use
Evals: 7
Tools: 14
Models: 6
Papers: 5

Evals testing this capability

AssistantBench

Allen Institute for AI (Ai2)

214 realistic, time-consuming web-research tasks ("how long is the train from X to Y on Tuesday?") that require live browsing and multi-page synthesis.

Active, Browser Use, Planning, Retrieval, Agentic

BrowseComp

OpenAI

1,266 hard fact-finding questions on the open web requiring persistent browsing and reasoning over scattered, obscure sources.

Active, Browser Use, Retrieval, Planning, Agentic

GAIA (General AI Assistants)

Meta FAIR (Fundamental AI Research)

466 real-world questions requiring tool use, multi-step reasoning, and web browsing - easy for humans (~92%) but hard for AI assistants.

Active, Tool Calling, Browser Use, Planning, Agentic

MiniWoB++

Stanford University

100+ small synthetic web-page tasks (click button, fill form, drag slider) - the original web-agent benchmark, still used as a unit test.

Saturated, Browser Use, Tool Calling, Agentic

VisualWebArena

Carnegie Mellon University

910 visually grounded web tasks across three self-hosted sites (Classifieds, Shopping, Reddit) requiring image understanding to complete.

Active, Browser Use, Image Understanding, Planning, Agentic

WebArena

Carnegie Mellon University

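812 long-horizon web tasks across self-hosted sites: a Reddit-style forum (built on Postmill), GitLab, an e-commerce store, and its content-management admin interface, with a map and a Wikipedia mirror as supporting resources.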

Active, Browser Use, Planning, Tool Calling, Agentic

WorkArena

ServiceNow Research

Browser-based enterprise web tasks on a live ServiceNow instance - list filtering, form filling, knowledge search - covering daily knowledge-worker workflows.

Active, Browser Use, Tool Calling, Planning, Agentic

Tools lifting evals here

BrowserGym

ServiceNow Research

ServiceNow's unified Gym-style framework for web agents - wraps WebArena, MiniWoB, VisualWebArena, WorkArena, AssistantBench, WebLINX, and more under one Playwright-backed interface.

RL Env, Browser Use, Tool Calling, Planning, Agentic

lifts 5 evals here

Openenv Browsergym RL Env (Hugging Face)

Hugging Face

OpenEnv port of ServiceNow's BrowserGym - a Playwright-backed browser environment exposing WebArena, MiniWoB, WorkArena, etc. through the standard OpenEnv API.

RL Env, Browser Use, Tool Calling, Planning, Agentic

lifts 3 evals here

Androidworld RL Env (Prime Community)

Prime Community

AndroidWorld benchmark for evaluating autonomous agents on real Android apps: 116 tasks across 20 apps.

RL Env, Mobile, Android, Tool Use, Multimodal

lifts 1 eval here

Androidworld RL Env (Prime Intellect)

Prime Intellect

AndroidWorld benchmark for evaluating autonomous agents on real Android apps: 116 tasks across 20 apps.

RL Env, Mobile, Android, Tool Use, Multimodal

lifts 1 eval here

BB Browsecomp RL Env (Prime Intellect)

Prime Intellect

BrowserEnv[DOM] environment on BrowseComp

RL Env

lifts 1 eval here

Browsecomp Openai RL Env (Community)

Tool-use environment for the model to browse the web and locate hard-to-find information; scored using an LLM-as-judge rubric

RL Env, Web Search, Tool Use, Web

lifts 1 eval here

Browsecomp Openai RL Env (Prime Intellect)

Prime Intellect

Tool-use environment for the model to browse the web and locate hard-to-find information; scored using an LLM-as-judge rubric

RL Env, Web Search, Tool Use, Web

lifts 1 eval here

Browsecomp RL Env (Prime Intellect)

Prime Intellect

BrowseComp evaluation environment

RL Env, Web Search, Tool Use, Web

lifts 1 eval here

Browser Miniwob RL Env (Community)

BrowserGym environment for the MiniWoB dataset

RL Env, Browser Automation, Web Interaction, Miniwob

lifts 1 eval here

Browser Miniwob RL Env (Community)

BrowserGym environment for the MiniWoB dataset

RL Env, Browser Automation, Web Interaction, Miniwob

lifts 1 eval here

DDBC RL Env (Prime Intellect)

Prime Intellect

BrowseComp with DeepDive tools

RL Env, Search, QA

lifts 1 eval here

DDBC RLM RL Env (Prime Intellect)

Prime Intellect

BrowseComp with DeepDive tools using RLM

RL Env, Search, QA

lifts 1 eval here

DeepDive (Serper-powered web QA)

Prime Intellect

Multi-turn open-web research QA environment where the model uses a Serper search tool to answer hard, hop-heavy questions of the BrowseComp / SimpleQA style.

RL Env, Retrieval, Tool Calling, Planning, Agentic

lifts 1 eval here

GAIA RL Env (Browserbase)

Browserbase

GAIA web browser benchmark for multi-hop question answering with web navigation

RL Env, Browser, Browserbase, Gaia, Web

lifts 1 eval here

Top models on this capability

by avg parsed score across evals here

[Bar chart: browser use, 6 bars. Highest value: gpt-oss-120b at 80.]

6 models

Papers in this area

introduces: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

introduces: GAIA: A Benchmark for General AI Assistants

introduces: VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

introduces: WebArena: A Realistic Web Environment for Building Autonomous Agents

introduces: WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Related in agents

computer-use embodied multi-turn-dialog tool-calling