Arena-Hard
Frontier
500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.
- Publisher
- LMArena
- Capabilities
- Instruction Following, LLM Judging
- Format
- Custom
- Size
- 500 tasks
- License
- Apache-2.0
- Published
- Jun 2024
- Notable for
- Benchmark for evaluating instruction following and LLM judging.
- Canonical
- github.com/lmarena/arena-hard-auto
Top score: 88.8% by o3, with 17 models reporting (9 frontier)
Score history: 13
Top models: 17
Where it's ranked: 1
Related tools: 7 (implementations, trainers, datasets and scaffolds linked to this eval)
Papers: 2
Contributors: 2
FAQ
- What is Arena-Hard?
- Arena-Hard is a benchmark of 500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.
- What capabilities does Arena-Hard test?
- Arena-Hard evaluates instruction following and LLM judging.
- What is the current top score on Arena-Hard?
- The top reported score is 88.8% by o3, across 17 models reporting (9 from frontier labs).
- How can a model improve its Arena-Hard score?
- Tools linked to Arena-Hard on Sophon include Argilla distilabel Capybara-DPO, HelpSteer2, Magpie, and Nectar. These are RL environments, datasets, and scaffolds that target this eval.
- What license is Arena-Hard under?
- Arena-Hard is available under Apache-2.0.
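
The description above says responses are graded by an LLM judge for pairwise win rate. As a rough illustration of what that kind of score means, the sketch below turns per-prompt judge verdicts into a percentage. It is not the benchmark's official scoring code: the function name, the verdict labels ("win", "loss", "tie"), and the choice to count a tie as half a win are all assumptions made for this example.

```python
from collections import Counter


def pairwise_win_rate(verdicts):
    """Return a candidate model's win rate against a baseline, in percent.

    verdicts: iterable of "win", "loss", or "tie", one judge verdict per prompt.
    Assumption (not taken from the benchmark): a tie counts as half a win.
    """
    counts = Counter(verdicts)

    # Reject labels this sketch does not know how to score.
    unknown = set(counts) - {"win", "loss", "tie"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")

    return 100.0 * (counts["win"] + 0.5 * counts["tie"]) / total


# Hypothetical verdicts for four prompts: two wins, one tie, one loss.
example = ["win", "win", "tie", "loss"]
print(pairwise_win_rate(example))  # 62.5
```

A plain average like this ignores judge confidence and statistical uncertainty, so treat it only as an intuition for the reported percentages.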