OSWorld-Verified Leaderboard
HUD AI's curated leaderboard for the OSWorld computer-use benchmark, reporting verified runs on the corrected and de-flaked task subset.
- Operator: HUD
- Kind: Aggregated
- Updates: monthly · updated 3d ago
- Notable for: The canonical reporting venue for computer-use agents (Claude Computer Use, OpenAI Operator, Gemini computer-use, Manus, etc.).
- Tracks: 2 evals · aggregated
Per-eval breakdown
9 models
| Model | | | |
|---|---|---|---|
| Kimi K2.6 Moonshot AI | - | 73.1% | 73.1% |
| Claude Sonnet 4.6 Anthropic | - | 72.1% | 72.1% |
| Kimi K2.5 Kimi | - | 63.3% | 63.3% |
| Claude Sonnet 4.5 Anthropic | - | 62.9% | 62.9% |
| Claude 4 Sonnet Anthropic | - | 43.9% | 43.9% |
| Claude Sonnet 3.7 Anthropic | - | 35.8% | 35.8% |
| o3 OpenAI | - | 23.0% | 23.0% |
| Qwen2.5 VL 72B Instruct Alibaba | - | 5.0% | 5.0% |
| Qwen2.5 VL 32B Instruct Alibaba | - | 3.9% | 3.9% |