WebArena
Active
812 long-horizon web tasks across self-hosted sites: an e-commerce store, a Reddit-style forum built on Postmill, GitLab, and a content-management system.
- Publisher
- Carnegie Mellon University
- Capabilities
- Browser Use, Planning, Tool Calling
- Domain
- agentic
- Format
- Web Arena
- Size
- 812 tasks
- License
- Apache-2.0
- Published
- Jul 2023
- Updates
- Monthly
- Notable for
- The first widely adopted realistic web-agent benchmark: a fully reproducible self-hosted environment spanning e-commerce, Reddit, GitLab, and CMS sites, with WebArena-Verified added in Feb 2026.
- Canonical
- webarena.dev
- Official leaderboard
- webarena.dev
Where it's ranked (1)
Related tools (2)
Implementations, trainers, datasets and scaffolds linked to this eval.
Papers (2)
Contributors (2)
FAQ
- What is WebArena?
- WebArena is a benchmark of 812 long-horizon web tasks across self-hosted sites: an e-commerce store, a Reddit-style forum built on Postmill, GitLab, and a content-management system.
- What capabilities does WebArena test?
- WebArena evaluates browser use, planning, and tool calling.
- How can a model improve its WebArena score?
- Tools linked to WebArena on Sophon include BrowserGym and Openenv Browsergym RL Env (Hugging Face). These are RL environments, datasets, and scaffolds that target this eval.
- What license is WebArena under?
- WebArena is available under Apache-2.0.