AlpacaEval 2.0 (Length-Controlled)
Frontier
AlpacaEval with a length-controlled win-rate estimator that neutralizes the judge's preference for longer answers.
- Publisher
- Stanford University
- Capabilities
- Instruction Following, LLM Judging
- Format
- Custom
- Size
- 805 tasks
- License
- Apache-2.0
- Published
- Apr 2024
- Notable for
- Benchmark for evaluating instruction following and LLM judging.
- Canonical
- github.com/tatsu-lab/alpaca_eval
Top score: 57.5% by GPT-4o (2024-05-13); 49 models reporting (11 from frontier labs)
FAQ
- What is AlpacaEval 2.0 (Length-Controlled)?
- AlpacaEval with a length-controlled win-rate estimator that neutralizes the judge's preference for longer answers.
- What capabilities does AlpacaEval 2.0 (Length-Controlled) test?
- AlpacaEval 2.0 (Length-Controlled) evaluates instruction following and LLM judging.
- What is the current top score on AlpacaEval 2.0 (Length-Controlled)?
- The top reported score is 57.5% by GPT-4o (2024-05-13), across 49 models reporting (11 from frontier labs).
- What license is AlpacaEval 2.0 (Length-Controlled) under?
- AlpacaEval 2.0 (Length-Controlled) is available under Apache-2.0.
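To make the idea of a length-controlled win rate more concrete, here is a minimal sketch in Python. It is not the implementation from the alpaca_eval repository. It assumes a simplified model in which the judge's preference depends only on a model-quality term (`theta`) and a length term (`phi`), and reports the predicted win rate when the length difference is set to zero. The function name, the sample data, and the plain gradient-descent fit are all illustrative assumptions, and the real estimator may include further terms, such as per-instruction difficulty.

```python
import math
from statistics import pstdev


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def length_controlled_win_rate(prefs, len_diffs, steps=5000, lr=0.5):
    """Hypothetical, simplified length-controlled win rate.

    Fits P(model preferred) = sigmoid(theta + phi * tanh(d / sd)), where d is
    the length difference (model minus baseline) and sd its standard deviation,
    then returns sigmoid(theta): the predicted win rate at d = 0.
    """
    sd = pstdev(len_diffs) or 1.0  # fall back to 1.0 if all lengths match
    feats = [math.tanh(d / sd) for d in len_diffs]
    theta, phi = 0.0, 0.0
    n = len(prefs)
    for _ in range(steps):
        g_theta = g_phi = 0.0
        for y, x in zip(prefs, feats):
            err = sigmoid(theta + phi * x) - y  # gradient of the log loss
            g_theta += err
            g_phi += err * x
        theta -= lr * g_theta / n
        phi -= lr * g_phi / n
    return sigmoid(theta)


# Made-up judge outcomes: 1 means the model's answer was preferred.
# The model tends to win on the instructions where its answer is longer.
len_diffs = [400, 350, 300, 250, 200, 150, -50, -100, -150, -200]
prefs = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]

raw = sum(prefs) / len(prefs)
lc = length_controlled_win_rate(prefs, len_diffs)
print(f"raw win rate: {raw:.3f}, length-controlled win rate: {lc:.3f}")
```

In this toy data the wins cluster on the longer answers, so the length-controlled figure comes out below the raw win rate. When all length differences are zero, the two coincide.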
