AlpacaEval

Saturated

Stanford's automatic instruction-following benchmark that compares a model's outputs to text-davinci-003 via a strong LLM judge and reports win rate.
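As a rough illustration of the win-rate metric described above, the sketch below averages per-prompt judge verdicts into a percentage. The verdict encoding (1.0 when the model's output is preferred, 0.0 when the baseline's is, 0.5 for a tie) and the name `win_rate` are assumptions made for this example, not the benchmark's actual implementation.

```python
from typing import Sequence


def win_rate(preferences: Sequence[float]) -> float:
    """Return the percentage of pairwise comparisons won by the model.

    Each entry is a hypothetical judge verdict for one prompt:
    1.0 = model output preferred, 0.0 = baseline preferred, 0.5 = tie.
    """
    if not preferences:
        raise ValueError("need at least one comparison")
    return 100.0 * sum(preferences) / len(preferences)


# Made-up verdicts for five prompts (illustrative data only).
verdicts = [1.0, 1.0, 0.0, 0.5, 1.0]
print(f"win rate: {win_rate(verdicts):.1f}%")  # prints "win rate: 70.0%"
```

In a full run the list would hold one verdict per task (805 here), each produced by the LLM judge comparing the model's output with the baseline's.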

Open

Publisher: Stanford University
Capabilities: Instruction Following, LLM Judging
Format: Custom
Size: 805 tasks
License: Apache-2.0
Published: Apr 2024
Notable for: Benchmark for evaluating instruction following and LLM judging.
Canonical: github.com/tatsu-lab/alpaca_eval
Also on: tatsu-lab.github.io/alpaca_eval

Attribution

Leaderboard scores: alpaca-eval

Top score 96.8% by Mistral Medium - 21 models reporting (3 frontier)

Top models

[Bar chart: AlpacaEval scores for 21 models; highest value is Mistral Medium at 96.8.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Magpie

Magpie Align

A self-synthesis method (and family of datasets) that elicits high-quality instructions directly from an aligned LLM using only its chat template - no seed prompts required.

Training data · SFT Dataset · Instruction Following · Multi Turn Dialog

Nectar

Berkeley NEST

Berkeley NEST's seven-way ranked preference dataset built from GPT-4 rankings over responses from a diverse model pool, used to train Starling.

Training data · Preference · Instruction Following · Safety · Multi Turn Dialog

OpenHermes 2.5

Teknium

Teknium's million-row aggregation of high-quality GPT-4-style synthetic instructions that became the de facto open SFT baseline of 2023-2024.

Training data · SFT Dataset · Instruction Following · Multi Turn Dialog · Code Generation

Tülu 3 SFT Mixture

Allen Institute for AI (Ai2)

Allen AI's flagship open SFT mixture combining new persona-driven prompts with curated public data for post-training a frontier-quality instruct model.

Training data · SFT Dataset · Instruction Following · Math · Code Generation

UltraChat

OpenBMB

Tsinghua / OpenBMB's large-scale multi-turn dialog dataset generated by two LLMs talking to each other across structured topic taxonomies.

Training data · SFT Dataset · Multi Turn Dialog · Instruction Following

UltraFeedback

OpenBMB

OpenBMB's 64k-prompt preference dataset built with GPT-4 critiques across instruction-following, truthfulness, honesty, and helpfulness - the de facto open DPO baseline.

Training data · Preference · Instruction Following · Hallucination · Safety

WizardLM Evol-Instruct

Microsoft

Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.

Training data · SFT Dataset · Instruction Following · Math · Code Generation

Papers

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

COLM · 2024

Introduces AlpacaEval 2 with length control, a fast LLM-as-a-Judge benchmark whose rankings reach a Spearman correlation of 0.98 with Chatbot Arena once verbosity (length) bias is controlled for.

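Agreement between two leaderboards, such as the correlation with Chatbot Arena reported for this paper, is commonly measured with Spearman's rank correlation. The sketch below is a minimal version that assumes no tied scores; the function names and all scores are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values from 1 (largest) to n (smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up scores for four models on a benchmark and on a human-preference arena.
benchmark_scores = [90.0, 80.0, 70.0, 60.0]
arena_ratings = [1200.0, 1150.0, 1180.0, 1100.0]
print(round(spearman(benchmark_scores, arena_ratings), 2))  # prints 0.8
```

A value near 1.0 means the two leaderboards order the models almost identically.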
FAQ

What is AlpacaEval?: Stanford's automatic instruction-following benchmark that compares a model's outputs to text-davinci-003 via a strong LLM judge and reports win rate.
What capabilities does AlpacaEval test?: AlpacaEval evaluates instruction following and LLM judging.
What is the current top score on AlpacaEval?: The top reported score is 96.8% by Mistral Medium, across 21 models reporting (3 from frontier labs).
How can a model improve its AlpacaEval score?: Tools linked to AlpacaEval on Sophon (RL environments, datasets, and scaffolds that target this eval) include Magpie, Nectar, OpenHermes 2.5, and Tülu 3 SFT Mixture.
What license is AlpacaEval under?: AlpacaEval is available under Apache-2.0.