Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Introduces MT-Bench and validates LLM-as-a-Judge by showing that GPT-4's judgments agree with human preferences about 85% of the time, comparable to the level of human-to-human agreement.

Publisher: LMSYS Org
Year: 2023
Venue: NeurIPS
Authors: 13
Hosting: External source (license unknown)

Introduces 1 artifact - 1 eval

TL;DR

Semantic Scholar

Strong LLM judges such as GPT-4 match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement as between humans. LLM-as-a-judge is therefore a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.
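
As a rough illustration of what an agreement figure like "over 80%" measures, the sketch below computes the fraction of pairwise comparisons on which a judge and a human give the same verdict. This is a simplified reading, not the paper's exact protocol: the function name `agreement_rate`, the verdict labels, and the toy votes are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of comparisons where both raters give the same verdict.

    Assumption: each vote is one of "A", "B", or "tie" for a pairwise
    comparison of two model answers. The paper's own setup may treat
    ties and multiple raters differently.
    """
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Hypothetical toy verdicts, not data from the paper.
judge = ["A", "B", "A", "tie", "B"]
human = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge, human)
print(f"{rate:.0%}")  # 4 of 5 verdicts match -> 80%
```

The same function can be applied to two human raters to get a human-to-human baseline, which is the comparison the summary above refers to.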
