What capabilities does Arena-Hard test?

Arena-Hard evaluates instruction following and LLM judging.

What is the current top score on Arena-Hard?

The top reported score is 88.8% by o3, across 18 models reporting (9 from frontier labs).

How can a model improve its Arena-Hard score?

Tools linked to Arena-Hard on Sophon include Argilla distilabel Capybara-DPO, HelpSteer2, Magpie, and Nectar. The linked tools are RL environments, datasets, and scaffolds that target this eval.

What license is Arena-Hard under?

Arena-Hard is available under Apache-2.0.

Arena-Hard

Frontier

500 challenging real-user prompts mined from Chatbot Arena and graded by a strong LLM judge for pairwise win rate.

Open

Publisher: LMArena
Capabilities: Instruction Following, LLM Judging
Format: Custom
Size: 500 tasks
License: Apache-2.0
Published: Jun 2024
Notable for: Benchmark for evaluating instruction following and LLM judging.
Canonical: github.com/lmarena/arena-hard-auto
Also on: lmarena.ai
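
The description above says responses are graded by an LLM judge for pairwise win rate. As a rough illustration only, the sketch below turns a list of per-prompt verdicts into a win-rate percentage. The verdict labels, the function name `pairwise_win_rate`, and the choice to count a tie as half a win are assumptions made for this example; the official arena-hard-auto pipeline may aggregate judge verdicts differently.

```python
from collections import Counter

# Hypothetical verdict labels for this sketch; the real judge output
# format is not described on this page.
WIN, TIE, LOSS = "win", "tie", "loss"


def pairwise_win_rate(verdicts):
    """Return the candidate's win rate in percent.

    Assumption for illustration: a tie counts as half a win.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - {WIN, TIE, LOSS}
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    score = counts[WIN] + 0.5 * counts[TIE]
    return 100.0 * score / len(verdicts)


# Made-up verdicts for ten prompts: 6 wins, 2 ties, 2 losses.
example = [WIN] * 6 + [TIE] * 2 + [LOSS] * 2
print(pairwise_win_rate(example))  # 70.0
```

Under these assumptions, a model that ties the comparison model on every prompt would score 50.0.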

Attribution

Leaderboard scores: arena-hard-auto

Top score: 88.8% by o3 (18 models reporting, 9 frontier)

Top models: bar chart of 18 models; highest value is o3 at 88.8.

Where it's ranked

Arena Overall

LMArena

Human preference

preference voting · live

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Argilla distilabel Capybara-DPO

Argilla

A high-quality DPO derivative of LDJnr's Capybara, with chosen/rejected pairs synthesized and rated using Argilla's distilabel pipeline.

Training data · DPO Dataset · Multi Turn Dialog · Instruction Following · Scientific Reasoning

HelpSteer2

NVIDIA

NVIDIA's permissively licensed, human-annotated preference dataset with 5-axis Likert ratings, engineered to train high-quality reward models.

Training data · Preference · Instruction Following · Safety · Hallucination

Magpie

Magpie Align

A self-synthesis method (and family of datasets) that elicits high-quality instructions directly from an aligned LLM using only its chat template - no seed prompts required.

Training data · SFT Dataset · Instruction Following · Multi Turn Dialog

Nectar

Berkeley NEST

Berkeley NEST's seven-way ranked preference dataset built from GPT-4 rankings over responses from a diverse model pool, used to train Starling.

Training data · Preference · Instruction Following · Safety · Multi Turn Dialog

UltraFeedback

OpenBMB

OpenBMB's 64k-prompt preference dataset built with GPT-4 critiques across instruction-following, truthfulness, honesty, and helpfulness - the de facto open DPO baseline.

Training data · Preference · Instruction Following · Hallucination · Safety

WildChat

Allen Institute for AI (Ai2)

1M real user-chatbot conversations collected by Allen AI / UW via a free GPT proxy - a window into how real users actually prompt LLMs.

Training data · SFT Dataset · Multi Turn Dialog · Instruction Following · Multilingual

BenchBuilder

LMArena

LMSYS's automated pipeline for distilling high-quality LLM benchmarks from crowdsourced chat data (e.g. Chatbot Arena, WildChat), producing the Arena-Hard-Auto benchmark.

Other · Framework · Benchmark Creation

Papers

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

preprint · 2024

Introduces Arena-Hard, a 500-prompt benchmark auto-curated from Chatbot Arena traffic that correlates ~0.9 with Arena Elo using LLM-as-a-Judge.
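
The summary above reports a correlation of about 0.9 between benchmark results and Arena Elo. One common way to measure that kind of ranking agreement is Spearman rank correlation; this page does not say which statistic the paper uses, so the choice is an assumption. The sketch below is a dependency-free illustration, and the scores and Elo values in it are made up, not real leaderboard data.

```python
import math


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(x), ranks(y))


# Made-up values for five hypothetical models.
benchmark_scores = [80.0, 65.0, 50.0, 35.0, 20.0]
arena_elo = [1300, 1250, 1260, 1180, 1100]
print(round(spearman(benchmark_scores, arena_elo), 3))  # 0.9
```

In this toy data only the second and third models swap places between the two rankings, which is enough to pull the correlation from 1.0 down to 0.9.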

Contributors

Wei-Lin Chiang, Anastasios N. Angelopoulos
