AlpacaEval 2.0 (Length-Controlled)

Frontier

AlpacaEval with a length-controlled winrate estimator that neutralizes the judge's preference for longer answers.
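
To make the idea concrete, here is a minimal sketch of one way a length-controlled win rate can be estimated. It is an illustration under stated assumptions, not the AlpacaEval implementation. It fits a two-parameter logistic model to the judge's preferences (an intercept plus a length-difference term) and then reports the predicted win rate with the length term set to zero. The function name `length_controlled_winrate`, the tanh-squashed length feature, and the toy data are all hypothetical; the published estimator may include further terms.

```python
import math
from statistics import pstdev


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def length_controlled_winrate(prefs, len_model, len_base, steps=5000, lr=0.1):
    """Return (raw_winrate, length_controlled_winrate).

    prefs[i] is 1 if the judge preferred the model's answer on task i,
    0 if it preferred the baseline's. Lengths are in any consistent unit.
    Simplified model (an assumption for illustration):
        P(win) = sigmoid(theta + phi * tanh(length_diff / std(length_diff)))
    """
    diffs = [m - b for m, b in zip(len_model, len_base)]
    scale = pstdev(diffs) or 1.0          # avoid dividing by zero
    x = [math.tanh(d / scale) for d in diffs]
    n = len(prefs)

    theta, phi = 0.0, 0.0                 # intercept, length coefficient
    for _ in range(steps):                # plain gradient descent on log loss
        g_theta = g_phi = 0.0
        for xi, yi in zip(x, prefs):
            err = sigmoid(theta + phi * xi) - yi
            g_theta += err
            g_phi += err * xi
        theta -= lr * g_theta / n
        phi -= lr * g_phi / n

    raw = sum(prefs) / n
    # "Control" for length: predict as if every answer pair had equal length.
    controlled = sigmoid(theta)
    return raw, controlled


# Toy data: the model mostly wins when its answer is much longer.
prefs = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
len_model = [900, 850, 800, 700, 750, 300, 280, 350, 320, 400]
len_base = [400] * 10
raw, controlled = length_controlled_winrate(prefs, len_model, len_base)
print(f"raw win rate: {raw:.3f}, length-controlled: {controlled:.3f}")
```

On data like this, where wins concentrate on the longer answers, the length-controlled estimate comes out below the raw win rate, which is the correction the benchmark description refers to.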

Format: Custom
Size: 805 tasks
License: Apache-2.0
Published: Apr 2024
Notable for: Benchmark for evaluating instruction following and LLM judging.

Attribution

Leaderboard scores: alpaca-eval

Top score: 57.5% by GPT-4o (2024-05-13); 49 models reporting (11 from frontier labs).

Score history

[Chart: score history. Y-axis 0% to 100%; x-axis Nov 22 to Jul 24. Labeled models: GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, GPT-4o (2024-05-13).]

Top models

[Bar chart: AlpacaEval 2.0 (Length-Controlled), 21 models shown. Highest value: Gemma 2 9B It Simpo at 72.4.]

Papers

2

FAQ

What is AlpacaEval 2.0 (Length-Controlled)?
AlpacaEval with a length-controlled winrate estimator that neutralizes the judge's preference for longer answers.
What capabilities does AlpacaEval 2.0 (Length-Controlled) test?
AlpacaEval 2.0 (Length-Controlled) evaluates instruction following and LLM judging.
What is the current top score on AlpacaEval 2.0 (Length-Controlled)?
The top reported score is 57.5% by GPT-4o (2024-05-13), across 49 models reporting (11 from frontier labs).
What license is AlpacaEval 2.0 (Length-Controlled) under?
AlpacaEval 2.0 (Length-Controlled) is available under the Apache-2.0 license.