AlpacaEval
Saturated
Stanford's automatic instruction-following benchmark that compares a model's outputs to text-davinci-003 via a strong LLM judge and reports win rate.
- Publisher
- Stanford University
- Capabilities
- Instruction Following, LLM Judging
- Format
- Custom
- Size
- 805 tasks
- License
- Apache-2.0
- Published
- Apr 2024
- Notable for
- Benchmark for evaluating instruction following and LLM judging.
- Canonical
- github.com/tatsu-lab/alpaca_eval
Top score 96.8% by Mistral Medium - 21 models reporting (3 frontier)
Score history (11)
Top models (21)
Related tools (7)
Implementations, trainers, datasets and scaffolds linked to this eval.
Papers (1)
FAQ
- What is AlpacaEval?
- Stanford's automatic instruction-following benchmark that compares a model's outputs to text-davinci-003 via a strong LLM judge and reports win rate.
- What capabilities does AlpacaEval test?
- AlpacaEval evaluates instruction following and LLM judging.
- What is the current top score on AlpacaEval?
- The top reported score is 96.8% by Mistral Medium, across 21 models reporting (3 from frontier labs).
- How can a model improve its AlpacaEval score?
- Tools linked to AlpacaEval on Sophon include Magpie, Nectar, OpenHermes 2.5, and Tülu 3 SFT Mixture: RL environments, datasets, and scaffolds that target this eval.
- What license is AlpacaEval under?
- AlpacaEval is available under Apache-2.0.
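
The win rate mentioned in the description can be illustrated with a minimal sketch. This is not the alpaca_eval package's implementation; the function name `win_rate`, the encoding of judge verdicts as 1.0 (model preferred), 0.0 (baseline preferred) and 0.5 (tie), and the sample verdicts are all assumptions made for illustration.

```python
from typing import Sequence


def win_rate(preferences: Sequence[float]) -> float:
    """Return the win rate as a percentage.

    Each entry is a hypothetical judge verdict for one prompt:
    1.0 if the model's output was preferred over the baseline's,
    0.0 if the baseline's was preferred, 0.5 for a tie (assumed encoding).
    """
    if not preferences:
        raise ValueError("preferences must not be empty")
    return 100.0 * sum(preferences) / len(preferences)


# Made-up verdicts for five prompts: three wins, one loss, one tie.
example_verdicts = [1.0, 1.0, 0.0, 0.5, 1.0]
print(f"{win_rate(example_verdicts):.1f}%")  # prints "70.0%"
```

Under these assumptions, a reported score such as 96.8% would mean the judge preferred the model's output over the baseline's on nearly all of the 805 tasks.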

