Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Introduces AlpacaEval 2 with length control, a fast LLM-as-a-judge benchmark whose model ranking reaches a Spearman correlation of 0.98 with Chatbot Arena once verbosity bias is controlled for.
- Year: 2024
- Venue: COLM
- Authors: 5
- Hosting: External source, license unknown
Introduces 2 artifacts - 2 evals
TL;DR
Semantic Scholar
Introduces length-controlled AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses an LLM judge to estimate response quality. The length control aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?"
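One way to approximate that counterfactual is to fit a logistic regression that predicts the judge's preference from the length difference between the two outputs, then read off the prediction with the length difference set to zero. The sketch below is a simplified illustration under stated assumptions, not the authors' exact model: it keeps only an intercept `theta` and a single length term `phi * tanh(diff / std)`, the function names are hypothetical, and the preference data is made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_length_glm(prefs, len_diffs, steps=10000, lr=0.5):
    """Fit p(model preferred) = sigmoid(theta + phi * tanh(diff / std)).

    prefs:     judge preferences for the model over the baseline, in [0, 1]
    len_diffs: len(model output) - len(baseline output), one per instruction
    Plain gradient descent on the mean logistic loss (illustrative only).
    """
    n = len(prefs)
    mean = sum(len_diffs) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in len_diffs) / n) or 1.0
    feats = [math.tanh(d / std) for d in len_diffs]

    theta, phi = 0.0, 0.0
    for _ in range(steps):
        g_theta, g_phi = 0.0, 0.0
        for y, f in zip(prefs, feats):
            err = sigmoid(theta + phi * f) - y
            g_theta += err
            g_phi += err * f
        theta -= lr * g_theta / n
        phi -= lr * g_phi / n
    return theta, phi


def length_controlled_win_rate(theta):
    # Counterfactual prediction: with equal lengths the length term is
    # phi * tanh(0) = 0, so only the intercept remains.
    return sigmoid(theta)


# Hypothetical toy data: a verbose model that tends to win when it is longer.
len_diffs = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450]
prefs = [0, 1, 0, 0, 1, 0, 1, 1, 1, 1]

theta, phi = fit_length_glm(prefs, len_diffs)
raw_win_rate = sum(prefs) / len(prefs)
lc_win_rate = length_controlled_win_rate(theta)

print(f"raw win rate:               {raw_win_rate:.2f}")
print(f"length-controlled win rate: {lc_win_rate:.2f}")
```

On this toy data the fitted length coefficient is positive, so part of the raw win rate is attributed to verbosity and the length-controlled estimate comes out lower. A real evaluation would fit the model on judge preferences over the full instruction set and could include further terms beyond the two used here.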