OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Introduces OSWorld, a benchmark of 369 real tasks across Ubuntu, Windows, and macOS apps (Chrome, VS Code, GIMP, LibreOffice, Thunderbird, etc.) with execution-based scoring.
- Publisher: XLANG Lab
- Year: 2024
- Venue: NeurIPS
- Authors: 17
- Hosting: External source, license unknown
Introduces 1 artifact - 1 eval
TL;DR (Semantic Scholar)
The paper introduces OSWorld, a first-of-its-kind scalable, real computer environment for multimodal agents that supports task setup, execution-based evaluation, and interactive learning across operating systems such as Ubuntu, Windows, and macOS.