VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

CMU benchmark extending WebArena with 910 visually grounded tasks across Classifieds, Shopping, and Reddit, evaluating multimodal browsing agents.

Publisher: CMU NLP
Year: 2024
Venue: ACL
Authors: 11
Hosting: External source, license unknown

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

An extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models, is conducted, identifying several limitations of text-only LLM agents and revealing gaps in the capabilities of state-of-the-art multimodal language agents.
