AlpacaEval 2.0 (Length-Controlled)

Frontier

AlpacaEval with a length-controlled win-rate estimator that neutralizes the judge's preference for longer answers.
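To make the idea of length control concrete, here is a minimal sketch rather than the benchmark's actual implementation. It assumes judge preferences can be modeled by a logistic regression with one model-quality term and one length-difference term, and it reports the predicted win rate with the length term set to zero. The published estimator is a richer regression. The function name, the synthetic data, and the coefficients below are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def length_controlled_win_rate(prefs, len_model, len_baseline, steps=50):
    """Hypothetical helper: fit P(win) = sigmoid(theta + phi * x), where x is a
    squashed, normalized length difference, then return the raw win rate and
    the win rate predicted when the length difference is zero."""
    prefs = np.asarray(prefs, dtype=float)
    diff = np.asarray(len_model, dtype=float) - np.asarray(len_baseline, dtype=float)
    x = np.tanh(diff / (diff.std() + 1e-8))
    X = np.column_stack([np.ones_like(x), x])
    w = np.zeros(2)  # w[0] = theta (quality term), w[1] = phi (length term)
    for _ in range(steps):
        # Newton's method on the logistic loss with a tiny ridge penalty.
        p = sigmoid(X @ w)
        grad = X.T @ (p - prefs) + 1e-3 * w
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + 1e-3 * np.eye(2)
        w -= np.linalg.solve(hess, grad)
    raw = float(prefs.mean())
    controlled = float(sigmoid(w[0]))  # length term zeroed out
    return raw, controlled

# Synthetic data, invented for illustration: a model that is exactly as good
# as the baseline (theta = 0) but tends to write longer answers, judged by an
# evaluator that favors length (phi = 2).
rng = np.random.default_rng(0)
n = 5000
len_baseline = rng.integers(1000, 2000, size=n)
diff = rng.normal(100, 200, size=n)
len_model = len_baseline + diff
prefs = rng.random(n) < sigmoid(2.0 * np.tanh(diff / diff.std()))

raw, lc = length_controlled_win_rate(prefs, len_model, len_baseline)
print(f"raw win rate: {raw:.3f}, length-controlled win rate: {lc:.3f}")
```

On this synthetic data the raw win rate is inflated by the length preference, while the length-controlled value falls back toward 50%, which is the behavior the description above refers to.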

Open

Publisher: Stanford University
Capabilities: Instruction Following, LLM Judging
Format: Custom
Size: 805 tasks
License: Apache-2.0
Published: Apr 2024
Notable for: Benchmark for evaluating instruction following and LLM judging.
Canonical: github.com/tatsu-lab/alpaca_eval
Also on: tatsu-lab.github.io/alpaca_eval

Attribution

Leaderboard scores: alpaca-eval

Top score: 57.5% by GPT-4o (2024-05-13); 49 models reporting (11 from frontier labs)

Score history

Top models

[Figure: bar chart of AlpacaEval 2.0 (Length-Controlled) scores for 21 models. Highest value: Gemma 2 9B It SimPO at 72.4.]

21 models

Papers

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

COLM · 2024

Introduces AlpacaEval 2 with length control, a fast LLM-as-a-Judge benchmark whose model ranking reaches a Spearman correlation of 0.98 with Chatbot Arena after removing verbosity bias.
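The 0.98 figure is a rank correlation between two leaderboards, which the paper reports as a Spearman correlation. As a rough illustration of how such a number is computed, the sketch below ranks the same models under two scoring systems and correlates the ranks. The five score pairs are invented for illustration and do not correspond to any real model; the helper name is hypothetical, and it assumes there are no tied scores.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for two score lists without ties:
    the Pearson correlation of the ranks."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical scores for five unnamed models (not real leaderboard data).
benchmark_win_rates = [48.0, 41.5, 33.0, 27.2, 19.9]
arena_ratings = [1210, 1225, 1160, 1120, 1095]

rho = spearman(benchmark_win_rates, arena_ratings)
print(f"Spearman correlation: {rho:.2f}")
```

Here the two leaderboards disagree only on the order of the top two models, so the correlation is high but below 1. A value near 1 means the benchmark orders models almost exactly as the reference leaderboard does.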

FAQ

What is AlpacaEval 2.0 (Length-Controlled)?

AlpacaEval with a length-controlled win-rate estimator that neutralizes the judge's preference for longer answers.

What capabilities does AlpacaEval 2.0 (Length-Controlled) test?

AlpacaEval 2.0 (Length-Controlled) evaluates instruction following and LLM judging.

What is the current top score on AlpacaEval 2.0 (Length-Controlled)?

The top reported score is 57.5% by GPT-4o (2024-05-13), across 49 models reporting (11 from frontier labs).

What license is AlpacaEval 2.0 (Length-Controlled) under?

AlpacaEval 2.0 (Length-Controlled) is available under Apache-2.0.