Arena-Hard

Frontier

500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.
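As a rough illustration of what a "pairwise win rate" means, the sketch below turns a list of per-prompt judge verdicts into a single percentage. The verdict labels, the `win_rate` helper, and the choice to count a tie as half a win are all assumptions made for this example. The benchmark's actual scoring procedure may differ, for example in how it handles ties or aggregates judgments.

```python
from collections import Counter

# Hypothetical verdict labels: one per prompt, describing how the judge
# rated the candidate model's answer against a baseline model's answer.
VALID_VERDICTS = {"win", "tie", "loss"}


def win_rate(verdicts):
    """Return the candidate's win rate, counting each tie as half a win.

    Treating a tie as half a win is an assumption of this sketch.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return (counts["win"] + 0.5 * counts["tie"]) / len(verdicts)


# Made-up verdicts for four prompts, purely for illustration.
example = ["win", "win", "tie", "loss"]
print(f"{win_rate(example):.1%}")  # prints 62.5%
```

With 500 prompts, the same calculation would run over 500 verdicts instead of four.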

Publisher: LMArena
Format: Custom
Size: 500 tasks
License: Apache-2.0
Published: Jun 2024
Notable for: Benchmark for evaluating instruction following and LLM judging.

Attribution

Leaderboard scores: arena-hard-auto

Top score: 88.8% by o3, with 17 models reporting (9 from frontier labs).

Score history (13)

Chart: scores from 0% to 100% plotted from Sep 24 to May 25; labeled models are Qwen2.5 72B Instruct, o1, R1, and o3.

Top models (17)

Bar chart of Arena-Hard scores with 17 bars. Highest value: o3 at 88.8.

Where it's ranked (1)

Related tools (7)

Implementations, trainers, datasets, and scaffolds linked to this eval.

Papers (2)

Contributors (2)

FAQ

What is Arena-Hard?
500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.
What capabilities does Arena-Hard test?
Arena-Hard evaluates instruction following and LLM judging.
What is the current top score on Arena-Hard?
The top reported score is 88.8% by o3, across 17 models reporting (9 from frontier labs).
How can a model improve its Arena-Hard score?
Tools linked to Arena-Hard on Sophon include Argilla distilabel Capybara-DPO, HelpSteer2, Magpie, and Nectar. These are RL environments, datasets, and scaffolds that target this eval.
What license is Arena-Hard under?
Arena-Hard is available under Apache-2.0.