AlpacaEval

Saturated

Stanford's automatic instruction-following benchmark that compares a model's outputs to text-davinci-003 via a strong LLM judge and reports win rate.
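
As a rough illustration of the win-rate metric described above, here is a minimal sketch. It assumes the judge's per-task decisions are already available as labels; the label names (`"model"`, `"baseline"`, `"tie"`), the `win_rate` helper, and the choice to count a tie as half a win are assumptions made for this example, not the benchmark's actual code.

```python
from typing import Sequence

# Assumed scoring scheme: a win for the evaluated model counts 1,
# a win for the baseline counts 0, and a tie counts 0.5.
POINTS = {"model": 1.0, "baseline": 0.0, "tie": 0.5}


def win_rate(verdicts: Sequence[str]) -> float:
    """Return the win rate in percent over a list of per-task judge verdicts."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(POINTS[v] for v in verdicts) / len(verdicts)


# Four hypothetical tasks: two wins, one loss, one tie.
example = ["model", "baseline", "model", "tie"]
print(f"{win_rate(example):.1f}%")  # prints "62.5%"
```

On the full benchmark the list would hold one verdict per task, so 805 entries.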

Format: Custom
Size: 805 tasks
License: Apache-2.0
Published: Apr 2024
Notable for: Benchmark for evaluating instruction following and LLM judging.


Leaderboard scores
alpaca-eval

Top score: 96.8% by Mistral Medium, with 21 models reporting (3 from frontier labs).

Score history

[Chart: top score over time, Nov 22 to Nov 23. Y-axis: 65% to 100%. Labeled models: GPT-3.5 Turbo, GPT-4, Mistral Medium.]

Top models

[Bar chart: AlpacaEval scores for 21 models. Highest value: Mistral Medium at 96.8.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is AlpacaEval?
Stanford's automatic instruction-following benchmark that compares a model's outputs to text-davinci-003 via a strong LLM judge and reports win rate.
What capabilities does AlpacaEval test?
AlpacaEval evaluates instruction following and LLM judging.
What is the current top score on AlpacaEval?
The top reported score is 96.8% by Mistral Medium, across 21 models reporting (3 from frontier labs).
How can a model improve its AlpacaEval score?
Tools linked to AlpacaEval on Sophon include Magpie, Nectar, OpenHermes 2.5, and Tülu 3 SFT Mixture. The linked tools span RL environments, datasets, and scaffolds that target this eval.
What license is AlpacaEval under?
AlpacaEval is available under Apache-2.0.