OSWorld-Verified Leaderboard

HUD AI's curated leaderboard for the OSWorld computer-use benchmark, reporting verified runs on the corrected and de-flaked task subset.

Operator: HUD
Kind: Aggregated
Updates: monthly · updated 3d ago
Notable for: The canonical reporting venue for computer-use agents (Claude Computer Use, OpenAI Operator, Gemini computer-use, Manus, etc.).
Tracks: 2 evals · aggregated

Per-eval breakdown (9 models)

Model                      Organization    Score
Kimi K2.6                  Moonshot AI     73.1%
Claude Sonnet 4.6          Anthropic       72.1%
Kimi K2.5                  Kimi            63.3%
Claude Sonnet 4.5          Anthropic       62.9%
Claude 4 Sonnet            Anthropic       43.9%
Claude Sonnet 3.7          Anthropic       35.8%
o3                         OpenAI          23.0%
Qwen2.5 VL 72B Instruct    Alibaba          5.0%
Qwen2.5 VL 32B Instruct    Alibaba          3.9%

Evals tracked: 2