OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Introduces OSWorld, a benchmark of 369 real computer tasks involving web and desktop apps (Chrome, VS Code, GIMP, LibreOffice, Thunderbird, etc.) with execution-based scoring, built on an environment that supports Ubuntu, Windows, and macOS.

Publisher
XLANG Lab
Year
2024
Venue
NeurIPS
Authors
17
Hosting
External source (license unknown)

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

This work introduces OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents. It supports task setup, execution-based evaluation, and interactive learning across operating systems such as Ubuntu, Windows, and macOS.
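
To illustrate what "task setup" and "execution-based evaluation" mean in practice, here is a minimal sketch. It is not OSWorld's API: `Task`, `run_task`, `toy_agent`, `idle_agent`, and the `settings.json` file are hypothetical names invented for this example, and a temporary directory stands in for a real desktop environment. The idea shown is that a task is scored by inspecting the final state of the environment after the agent acts, rather than by comparing the agent's actions to a reference trajectory.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical types for illustration only; not part of OSWorld.
Agent = Callable[[str, Path], None]


@dataclass
class Task:
    instruction: str
    setup: Callable[[Path], None]   # prepares the initial state
    check: Callable[[Path], bool]   # inspects the final state


def setup_settings(workdir: Path) -> None:
    # Task setup: create the initial state the agent starts from.
    (workdir / "settings.json").write_text(json.dumps({"theme": "light"}))


def check_dark_theme(workdir: Path) -> bool:
    # Execution-based check: look only at the resulting state.
    try:
        settings = json.loads((workdir / "settings.json").read_text())
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(settings, dict) and settings.get("theme") == "dark"


def run_task(task: Task, agent: Agent) -> float:
    # A fresh temporary directory stands in for a fresh environment.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        task.setup(workdir)
        agent(task.instruction, workdir)
        return 1.0 if task.check(workdir) else 0.0


def toy_agent(instruction: str, workdir: Path) -> None:
    # Stand-in for a real agent: edits the file directly.
    path = workdir / "settings.json"
    settings = json.loads(path.read_text())
    settings["theme"] = "dark"
    path.write_text(json.dumps(settings))


def idle_agent(instruction: str, workdir: Path) -> None:
    # An agent that does nothing, so the task should fail.
    return None


TASK = Task("Switch the editor theme to dark.", setup_settings, check_dark_theme)
```

Because the check reads only the final state, any sequence of actions that leaves `settings.json` with a dark theme would pass, which is the property that makes this style of scoring suitable for open-ended tasks.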
