Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Introduces AlpacaEval 2 with length control, a fast LLM-as-a-Judge benchmark whose model ranking reaches a Spearman correlation of 0.98 with Chatbot Arena once verbosity bias is controlled for.

Publisher: Stanford Center for Research on Foundation Models (CRFM)
Year: 2024
Venue: COLM
ArXiv: arxiv.org/abs/2404.04475
Code: github.com/tatsu-lab/alpaca_eval
Authors: 5
Hosting: External source; license unknown

Attribution

Abstract & full text: arxiv.org/abs/2404.04475
TL;DR: semanticscholar.org/paper/eb375712bd37250c350ecd3f559e1879e87eb3e5
Code: github.com/tatsu-lab/alpaca_eval

Introduces 2 artifacts - 2 evals

TL;DR

Semantic Scholar

Introduces length-controlled AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses an LLM judge to estimate response quality. The length control aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?"
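
The counterfactual question above can be illustrated with a small regression sketch. This is not the code from the alpaca_eval repository; it is a minimal illustration under stated assumptions: the judge's preference is modeled as a logistic function of a model-quality term plus a length term built from tanh of the normalized length difference, and the instruction-difficulty term that a fuller model would include is left out. The function name `length_controlled_win_rate`, the synthetic data, and the coefficients 0.2 and 1.5 are all hypothetical. After fitting, the length difference is set to zero, which leaves only the quality term.

```python
import numpy as np


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


def length_controlled_win_rate(prefs, len_model, len_base, lr=1.0, steps=5000):
    """Fit prefs ~ sigmoid(theta + phi * tanh(dlen / std(dlen))) by gradient
    descent, then predict with the length term set to zero.

    Hypothetical helper for illustration; assumes the length differences are
    not all identical (std > 0).
    """
    prefs = np.asarray(prefs, dtype=float)
    dlen = np.asarray(len_model, dtype=float) - np.asarray(len_base, dtype=float)
    x = np.tanh(dlen / dlen.std())  # bounded length feature
    theta, phi = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the cross-entropy loss with respect to the logit.
        err = sigmoid(theta + phi * x) - prefs
        theta -= lr * err.mean()
        phi -= lr * (err * x).mean()
    # Counterfactual: equal lengths give tanh(0) = 0, so only theta remains.
    return float(sigmoid(theta)), theta, phi


# Synthetic data (assumption): the model is usually longer than the baseline,
# and the judge's soft preference rewards length.
len_base = np.full(50, 1000.0)
len_model = len_base + np.linspace(-200.0, 800.0, 50)
dlen = len_model - len_base
prefs = sigmoid(0.2 + 1.5 * np.tanh(dlen / dlen.std()))

raw_win_rate = float(prefs.mean())
lc_win_rate, theta, phi = length_controlled_win_rate(prefs, len_model, len_base)
print(f"raw win rate: {raw_win_rate:.3f}")
print(f"length-controlled win rate: {lc_win_rate:.3f}")
```

In this synthetic setup the length-controlled win rate comes out below the raw win rate, because part of the raw preference is explained by the model's longer outputs rather than by the quality term.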

Artifacts

Evals

AlpacaEval
AlpacaEval 2.0 (Length-Controlled)

Authors

Balazs Galambosi, Percy Liang, Tatsunori Hashimoto, Yann Dubois, Tatsunori B. Hashimoto