WebVoyager Browser Benchmark (No Anti-Bot)

A browser benchmark environment for evaluating LLM agents on WebVoyager web navigation tasks using Browserbase.

This version uses a filtered dataset that excludes websites with anti-bot protection for more reliable evaluation.

WebVoyager contains tasks across real-world websites; this filtered version covers 14 of them. Tasks are scored on successful completion rather than against explicit ground-truth answers. The judge reads the recorded browser interaction transcript, not just the agent's final free-text claim.

Dataset

Total tasks: 600 (93.3% of the original 643)
Websites: Allrecipes, Amazon, Apple, ArXiv, BBC News, Booking, Coursera, ESPN, GitHub, Google Flights, Google Map, Google Search, Hugging Face, Wolfram Alpha
Excluded sites: dictionary.cambridge.org (Cloudflare protection)
Removed tasks: 43, all from the one excluded site with anti-bot detection
Task format: Web navigation tasks
Evaluation: An LLM judge decides task completion from the browser interaction transcript
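
The task counts above can be cross-checked with a few lines of Python. This is only a sanity check of the arithmetic; the function name `retained_share` is made up for illustration and is not part of the environment.

```python
ORIGINAL_TASKS = 643  # size of the original WebVoyager task set, per this README
REMOVED_TASKS = 43    # tasks dropped with the excluded site


def retained_share(original: int, removed: int) -> tuple[int, float]:
    """Return the number of tasks kept and their share of the original, in percent."""
    kept = original - removed
    return kept, round(100 * kept / original, 1)


kept, pct = retained_share(ORIGINAL_TASKS, REMOVED_TASKS)
print(f"{kept} tasks kept ({pct}% of {ORIGINAL_TASKS})")
```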

Installation

First, install the browser extras for verifiers:

uv pip install -e ".[browser]"

Then install the webvoyager-no-anti-bot environment locally:

uv pip install -e ./environments/webvoyager_no_anti_bot

Or install from Prime hub:

prime env install browserbase/webvoyager-no-anti-bot

Usage

Quick Start

# Run WebVoyager benchmark with OpenAI (clean dataset)
prime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY

Configuration

Set your Browserbase credentials:

export BROWSERBASE_API_KEY="your-api-key"
# Optional: export BROWSERBASE_PROJECT_ID="your-project-id"

For DOM mode (default), you'll also need:

export OPENAI_API_KEY="your-openai-key"  # For agent model and judge
export MODEL_API_KEY="your-openai-key"   # For Stagehand browser operations
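
Before starting a long run, it can help to confirm that the variables above are actually set. The following is a minimal sketch, not part of the environment: the helper `missing_env_vars` is hypothetical, and the list of required names is taken from this section (DOM mode).

```python
import os

# Variables this README asks for in DOM mode.
REQUIRED_DOM = ("BROWSERBASE_API_KEY", "OPENAI_API_KEY", "MODEL_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_DOM)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All DOM-mode credentials are set.")
```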

Website Filtering

WebVoyager includes tasks across many websites. Use the `web_filter` argument to restrict a run to a single website:

# Run all tasks (clean dataset)
prime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY

# Run only Amazon tasks
prime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"web_filter": "Amazon"}'

# Run only Allrecipes tasks
prime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"web_filter": "Allrecipes"}'
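
As a rough mental model of what `web_filter` and `num_examples` select, consider the sketch below. It is an illustration under stated assumptions, not the environment's loader: the task records, the `web_name` and `id` fields, exact-match comparison, and the filter-then-cap order are all assumed here.

```python
from typing import Optional

# Hypothetical task records; the field names are assumptions for illustration.
TASKS = [
    {"web_name": "Amazon", "id": "Amazon--0"},
    {"web_name": "Allrecipes", "id": "Allrecipes--0"},
    {"web_name": "Amazon", "id": "Amazon--1"},
]


def select_tasks(tasks, web_filter: Optional[str] = None, num_examples: int = -1):
    """Keep tasks for one website (if given), then cap the count (-1 keeps all)."""
    if web_filter is not None:
        tasks = [t for t in tasks if t["web_name"] == web_filter]
    if num_examples >= 0:
        tasks = tasks[:num_examples]
    return list(tasks)


amazon = select_tasks(TASKS, web_filter="Amazon")
print([t["id"] for t in amazon])
```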

Browser Modes

DOM Mode (default): Uses Stagehand SDK for natural language browser control.

prime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY

CUA Mode: Uses vision-based primitives via a CUA server, whose address is passed as `server_url`.

prime eval run webvoyager-no-anti-bot -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"mode": "cua", "server_url": "http://localhost:3000"}'
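
Hand-writing the JSON passed to `-a` is easy to get wrong once quoting is involved. One option is to build the command line from a Python dict, as in this sketch. The helper `build_eval_command` is hypothetical; it only assembles the same flags used in this README and does not run anything.

```python
import json
import shlex


def build_eval_command(model, env_args=None, base_url="https://api.openai.com/v1"):
    """Assemble a shell-safe `prime eval run` command line as a single string."""
    cmd = [
        "prime", "eval", "run", "webvoyager-no-anti-bot",
        "-m", model,
        "-b", base_url,
        "-k", "OPENAI_API_KEY",
    ]
    if env_args:
        # json.dumps produces the payload; shlex.join adds the shell quoting.
        cmd += ["-a", json.dumps(env_args)]
    return shlex.join(cmd)


print(build_eval_command(
    "gpt-4.1-mini",
    {"mode": "cua", "server_url": "http://localhost:3000"},
))
```

The resulting string can be pasted into a shell, or the unjoined list can be handed to `subprocess.run` to skip shell quoting entirely.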

Environment Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `mode` | `"dom"` | Browser control mode (`"dom"` or `"cua"`) |
| `max_turns` | `15` | Maximum conversation turns (recommended: 50 for complex tasks) |
| `judge_model` | `"gpt-4o-mini"` | Model for task completion judging |
| `num_examples` | `-1` | Number of examples (-1 for all) |
| `web_filter` | `None` | Filter by website name |
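
To see how overrides passed with `-a` combine with these defaults, here is a small sketch that merges a dict of overrides onto the documented values. It models the table only: `resolve_args` is a hypothetical helper, and extra keys such as `server_url` are passed through unchanged because their defaults are not documented here.

```python
# Defaults copied from the Environment Arguments table.
DEFAULTS = {
    "mode": "dom",
    "max_turns": 15,
    "judge_model": "gpt-4o-mini",
    "num_examples": -1,
    "web_filter": None,
}
VALID_MODES = ("dom", "cua")


def resolve_args(overrides=None):
    """Merge user overrides onto the documented defaults and check `mode`."""
    merged = {**DEFAULTS, **(overrides or {})}
    if merged["mode"] not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {merged['mode']!r}")
    return merged


# Raise the turn limit to the value recommended for complex tasks.
print(resolve_args({"max_turns": 50, "web_filter": "Amazon"}))
```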

Requirements

Python >= 3.10
Browserbase account with API credentials
OpenAI API key (for agent model, judge, and Stagehand)