Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Introduces MT-Bench and validates LLM-as-a-Judge by showing that GPT-4 judgments agree with human preferences more than 80% of the time, comparable to the level of human-to-human agreement.

Publisher: LMSYS Org
Year: 2023
Venue: NeurIPS
ArXiv: arxiv.org/abs/2306.05685
Code: github.com/lm-sys/FastChat
Authors: 13
Hosting: External source; license unknown

Attribution

Abstract & full text: arxiv.org/abs/2306.05685
TL;DR: semanticscholar.org/paper/a0a79dad89857a96f8f71b14238e5237cbfc4787
Code: github.com/lm-sys/FastChat

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

The results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement as between humans. LLM-as-a-judge is therefore a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.

Artifacts

Evals

MT-Bench

Authors

Dacheng Li, Eric Xing, Hao Zhang, Ion Stoica, Joseph E. Gonzalez, Lianmin Zheng, Siyuan Zhuang, Wei-Lin Chiang, Ying Sheng, Yonghao Zhuang, Zhanghao Wu, Zhuohan Li, Zi Lin