WebVoyager Browser Benchmark
A browser benchmark environment for evaluating LLM agents on WebVoyager web navigation tasks using Browserbase.
WebVoyager contains tasks across multiple real-world websites. Tasks are evaluated based on successful completion rather than explicit ground-truth answers.
Installation
First, install the browser extras for verifiers:
uv pip install -e ".[browser]"
Then install the webvoyager environment locally:
uv pip install -e ./environments/webvoyager
Usage
Quick Start
# Run WebVoyager benchmark with OpenAI
prime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
Configuration
Set your Browserbase credentials:
export BROWSERBASE_API_KEY="your-api-key"
export BROWSERBASE_PROJECT_ID="your-project-id"
For DOM mode (default), you'll also need:
export OPENAI_API_KEY="your-openai-key" # For agent model and judge
export MODEL_API_KEY="your-openai-key" # For Stagehand browser operations
Website Filtering
WebVoyager includes tasks across many websites. You can filter by website:
# Run all tasks
prime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
# Run only Amazon tasks
prime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"web_filter": "Amazon"}'
# Run only Allrecipes tasks
prime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"web_filter": "Allrecipes"}'
Browser Modes
DOM Mode (default): Uses Stagehand SDK for natural language browser control.
prime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY
CUA Mode: Uses vision-based primitives via a CUA server.
prime eval run webvoyager -m gpt-4.1-mini -b https://api.openai.com/v1 -k OPENAI_API_KEY -a '{"mode": "cua", "server_url": "http://localhost:3000"}'
Environment Arguments
| Argument | Default | Description |
|---|---|---|
mode | "dom" | Browser control mode ("dom" or "cua") |
max_turns | 15 | Maximum conversation turns (recommended: 50 for complex tasks) |
judge_model | "gpt-4o-mini" | Model for task completion judging |
num_examples | -1 | Number of examples (-1 for all) |
web_filter | None | Filter by website name |
Dataset
- Total tasks: 642 tasks across multiple websites
- Websites: Allrecipes, Amazon, Apple, ArXiv, BBC News, Booking, Cambridge Dictionary, Coursera, ESPN, GitHub, Google Flights, Google Map, Google Search, Hugging Face, Wolfram Alpha
- Task format: Web navigation tasks
- Evaluation: Task completion judging via LLM
Requirements
- Python >= 3.10
- Browserbase account with API credentials
- OpenAI API key (for agent model, judge, and Stagehand)