OSWorld-Verified

Frontier

Cleaned, human-validated subset of OSWorld tasks designed for stable cross-lab comparison of computer-use agents.

Publisher: XLANG Lab
Domain: agentic
Format: Custom
Size: 361 tasks
License: Apache-2.0
Published: Oct 2023
Notable for: Benchmark for evaluating computer use, planning, and tool calling in the agentic domain.

Attribution

Leaderboard scores: osworld-org

Top score: 73.1% by Kimi K2.6. 9 models reporting (5 from frontier labs).

Score history

[Chart: score over time, Feb 2025 to Feb 2026; y-axis from 15% to 100%; labeled models: Claude Sonnet 3.7, Claude 4 Sonnet, Claude Sonnet 4.5, Kimi K2.5, Kimi K2.6]

Top models

[Bar chart: OSWorld-Verified scores for 9 models; highest value: Kimi K2.6 at 73.1]

FAQ

What is OSWorld-Verified?
OSWorld-Verified is a cleaned, human-validated subset of OSWorld tasks designed for stable cross-lab comparison of computer-use agents.
What capabilities does OSWorld-Verified test?
OSWorld-Verified evaluates computer use, planning, and tool calling.
What is the current top score on OSWorld-Verified?
The top reported score is 73.1% by Kimi K2.6, across 9 models reporting (5 from frontier labs).
What license is OSWorld-Verified under?
OSWorld-Verified is released under the Apache-2.0 license.