WebArena

Active

812 long-horizon web tasks across self-hosted sites: an e-commerce store, a Reddit-style forum built on Postmill, GitLab, and a content-management system.

Domain: agentic
Format: Web Arena
Size: 812 tasks
License: Apache-2.0
Published: Jul 2023
Updates: Monthly
Notable for: The first widely adopted realistic web-agent benchmark, offering a fully reproducible self-hosted environment of e-commerce, Reddit, GitLab, and CMS sites, with WebArena-Verified added in Feb 2026.
Canonical: webarena.dev
Official leaderboard: webarena.dev

FAQ

What is WebArena?
WebArena is a benchmark of 812 long-horizon web tasks across self-hosted sites: an e-commerce store, a Reddit-style forum built on Postmill, GitLab, and a content-management system.
What capabilities does WebArena test?
WebArena evaluates browser use, planning, and tool calling.
How can a model improve its WebArena score?
Tools linked to WebArena on Sophon include BrowserGym and Openenv Browsergym RL Env (Hugging Face). These are RL environments, datasets, and scaffolds that target this eval.
What license is WebArena under?
WebArena is available under Apache-2.0.
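
As a rough illustration of the "score" the FAQ mentions, the sketch below assumes it is the plain fraction of tasks an agent completes successfully. This entry does not state the scoring rule, so treat that as an assumption. The function names, the toy outcomes, and the site labels are all hypothetical and are not taken from the WebArena codebase.

```python
from collections import defaultdict

TOTAL_TASKS = 812  # task count stated in this entry


def success_rate(outcomes):
    """Return the fraction of tasks marked successful (0.0 if there are none)."""
    if not outcomes:
        return 0.0
    return sum(1 for ok in outcomes.values() if ok) / len(outcomes)


def success_rate_by_site(outcomes, task_sites):
    """Group pass/fail flags by site and return a success rate per site."""
    per_site = defaultdict(list)
    for task_id, ok in outcomes.items():
        per_site[task_sites[task_id]].append(ok)
    return {site: sum(flags) / len(flags) for site, flags in per_site.items()}


# Hypothetical toy data: four tasks instead of the full 812.
outcomes = {0: True, 1: False, 2: True, 3: False}
task_sites = {0: "gitlab", 1: "gitlab", 2: "reddit", 3: "shopping"}

overall = success_rate(outcomes)
by_site = success_rate_by_site(outcomes, task_sites)
print(f"overall: {overall:.2f}")
for site, rate in sorted(by_site.items()):
    print(f"{site}: {rate:.2f}")
```

A per-site breakdown like this can show whether an agent's failures cluster on one site, which a single aggregate number hides.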