Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Introduces AlpacaEval 2 with length control, a fast LLM-as-a-Judge benchmark whose ranking correlates 0.98 with Chatbot Arena after removing verbosity bias.

Year
2024
Venue
COLM
Authors
5
Hosting
External source (license unknown)

Introduces 2 artifacts - 2 evals

TL;DR

Semantic Scholar

Introduces length-controlled AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. It aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?"
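One way to picture the counterfactual question is to regress the judge's preference on the length difference, then read off the prediction with that difference set to zero. The following is a minimal sketch under stated assumptions, not the paper's implementation: the logistic form, the tanh length feature, the `scale` parameter, the function names, and the toy data are all hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_length_glm(length_diffs, prefs, scale=100.0, lr=0.5, steps=5000):
    """Fit P(judge prefers model) = sigmoid(theta + phi * tanh(diff / scale)).

    Hypothetical simplification: one intercept (model quality) and one
    length coefficient, fitted by plain gradient descent on the log-loss.
    """
    feats = [math.tanh(d / scale) for d in length_diffs]
    theta, phi = 0.0, 0.0
    n = len(prefs)
    for _ in range(steps):
        g_theta = g_phi = 0.0
        for f, y in zip(feats, prefs):
            err = sigmoid(theta + phi * f) - y  # gradient of log-loss w.r.t. the logit
            g_theta += err
            g_phi += err * f
        theta -= lr * g_theta / n
        phi -= lr * g_phi / n
    return theta, phi

def length_controlled_win_rate(length_diffs, prefs, scale=100.0):
    """Predicted preference when the length difference is forced to zero."""
    theta, _phi = fit_length_glm(length_diffs, prefs, scale)
    return sigmoid(theta)  # tanh(0) = 0, so the length term drops out

# Toy data (made up): length difference in characters (model minus baseline)
# and the judge's binary preference (1 = model preferred).
length_diffs = [300] * 6 + [-300] * 4
prefs = [1, 1, 1, 1, 1, 0] + [1, 0, 0, 0]

raw = sum(prefs) / len(prefs)
lc = length_controlled_win_rate(length_diffs, prefs)
print(f"raw win rate: {raw:.3f}, length-controlled: {lc:.3f}")
```

In this toy data the model is usually longer and wins more often when it is longer, so the length-controlled estimate comes out below the raw win rate: part of the raw advantage is attributed to verbosity rather than quality.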
