computer-use
XLANG Lab
Cleaned, human-validated subset of OSWorld tasks designed for stable cross-lab comparison of computer-use agents.
369 computer-use tasks across Ubuntu, Windows, and macOS environments that test whether agents can operate a real desktop via screenshots and mouse/keyboard input.
by avg parsed score across evals here