VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

CMU benchmark extending WebArena with 910 visually grounded tasks across Classifieds, Shopping, and Reddit, evaluating multimodal browsing agents.

Publisher: CMU NLP
Year: 2024
Venue: ACL
ArXiv: arxiv.org/abs/2401.13649
Code: github.com/web-arena-x/visualwebarena
Authors: 11
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2401.13649
TL;DR: semanticscholar.org/paper/29940eb40e36ba67f0d2ceb5571ebbc299eee11b
Code: github.com/web-arena-x/visualwebarena

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

An extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models, is conducted, identifying several limitations of text-only LLM agents and revealing gaps in the capabilities of state-of-the-art multimodal language agents.

Artifacts

Evals

VisualWebArena

Authors

Daniel Fried, Graham Neubig, Jing Yu Koh, Logan Jang, Ming Chong Lim, Po-Yu Huang, Robert Lo, Ruslan Salakhutdinov, Shuyan Zhou, Vikram Duvvur, Lawrence Jang