Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Introduces MT-Bench and validates LLM-as-a-Judge by showing GPT-4 judgments agree with humans at ~85%, the level of human-to-human agreement.
- Publisher: LMSYS Org
- Year: 2023
- Venue: NeurIPS
- Authors: 13
- Hosting: External source, license unknown
Introduces 1 artifact - 1 eval
TL;DR
Semantic Scholar
The results show that strong LLM judges such as GPT-4 match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement as between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.
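To make the agreement figure concrete, here is a minimal sketch of one way to compute an agreement rate between two judges: the fraction of pairwise comparisons on which they pick the same outcome. The function name `agreement_rate`, the vote labels, and the sample votes are hypothetical illustrations, not data or code from the paper, and the paper's own protocol may differ in detail (for example in how ties are handled).

```python
from typing import Sequence


def agreement_rate(votes_a: Sequence[str], votes_b: Sequence[str]) -> float:
    """Return the fraction of comparisons on which two judges give the same vote."""
    if len(votes_a) != len(votes_b):
        raise ValueError("both judges must vote on the same comparisons")
    if not votes_a:
        raise ValueError("need at least one comparison")
    # Count positions where the two judges chose the same outcome.
    matches = sum(a == b for a, b in zip(votes_a, votes_b))
    return matches / len(votes_a)


# Hypothetical votes on five pairwise comparisons ("A", "B", or "tie").
llm_votes = ["A", "B", "A", "tie", "B"]
human_votes = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(llm_votes, human_votes)
print(f"{rate:.0%}")  # 4 of the 5 votes match
```

Counting a tie as a match only when both judges vote "tie" is one design choice among several; excluding tied votes altogether would generally give a different rate.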
Artifacts
1 eval