OSWorld-Verified
Frontier
Cleaned, human-validated subset of OSWorld tasks designed for stable cross-lab comparison of computer-use agents.
- Publisher: XLANG Lab
- Capabilities: Computer Use, Planning, Tool Calling
- Domain: agentic
- Format: Custom
- Size: 361 tasks
- License: Apache-2.0
- Published: Oct 2023
- Notable for: Benchmark for evaluating computer use, planning, and tool calling in the agentic domain.
- Canonical: os-world.github.io
Top score: 73.1% by Kimi K2.6, with 9 models reporting (5 from frontier labs).
FAQ
- What is OSWorld-Verified?
- Cleaned, human-validated subset of OSWorld tasks designed for stable cross-lab comparison of computer-use agents.
- What capabilities does OSWorld-Verified test?
- OSWorld-Verified evaluates computer use, planning, and tool calling.
- What is the current top score on OSWorld-Verified?
- The top reported score is 73.1% by Kimi K2.6, across 9 models reporting (5 from frontier labs).
- What license is OSWorld-Verified under?
- OSWorld-Verified is available under Apache-2.0.

