VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
CMU benchmark extending WebArena with 910 visually grounded tasks across Classifieds, Shopping, and Reddit, evaluating multimodal browsing agents.
- Publisher: CMU NLP
- Year: 2024
- Venue: ACL
- Authors: 11
- Hosting: External source, license unknown
Introduces 1 artifact - 1 eval
TL;DR
Semantic Scholar
An extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models, is conducted, identifying several limitations of text-only LLM agents and revealing gaps in the capabilities of state-of-the-art multimodal language agents.
Artifacts
1 Eval