What capabilities does MT-Bench test?

MT-Bench evaluates multi-turn dialog, instruction following, and LLM judging.

How can a model improve its MT-Bench score?

Sophon links several training resources to MT-Bench, including Argilla distilabel Capybara-DPO, Capybara, HelpSteer2, and Magpie. These are RL environments, datasets, and scaffolds that target this eval.

What license is MT-Bench under?

MT-Bench is available under Apache-2.0.

MT-Bench

80 two-turn open-ended questions across 8 categories, graded by GPT-4 as judge to score multi-turn dialogue quality.

Open

Publisher: LMArena
Capabilities: Multi Turn Dialog, Instruction Following, LLM Judging
Format: Custom
Size: 80 tasks
License: Apache-2.0
Published: Jun 2023
Notable for: Benchmark for evaluating multi-turn dialog, instruction following, and LLM judging.
Canonical: github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
Also on: huggingface.co/spaces/lmsys/mt-bench
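
The card describes 80 two-turn questions across 8 categories, each graded by a judge model. As a rough illustration of how such judgments might be rolled up into a score, here is a minimal sketch. It assumes one numeric judge score per question and turn on a 1 to 10 scale; the record layout, the question IDs, and the `summarize` helper are hypothetical and are not taken from the FastChat code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judgment records: one per (question, turn).
# Scores are assumed to be on a 1-10 scale; IDs are illustrative only.
judgments = [
    {"question_id": 1, "category": "writing", "turn": 1, "score": 9},
    {"question_id": 1, "category": "writing", "turn": 2, "score": 7},
    {"question_id": 2, "category": "math", "turn": 1, "score": 4},
    {"question_id": 2, "category": "math", "turn": 2, "score": 2},
]


def summarize(records):
    """Average judge scores overall, per turn, and per category."""
    by_turn = defaultdict(list)
    by_category = defaultdict(list)
    for record in records:
        by_turn[record["turn"]].append(record["score"])
        by_category[record["category"]].append(record["score"])
    return {
        "overall": mean(record["score"] for record in records),
        "per_turn": {turn: mean(s) for turn, s in sorted(by_turn.items())},
        "per_category": {cat: mean(s) for cat, s in sorted(by_category.items())},
    }


summary = summarize(judgments)
print(summary)
```

Reporting the per-turn averages separately can be useful because a model may answer the first turn well and still lose quality on the follow-up.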

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Argilla distilabel Capybara-DPO

Argilla

A high-quality DPO derivative of LDJnr's Capybara, with chosen/rejected pairs synthesized and rated using Argilla's distilabel pipeline.

Training data · DPO Dataset · Multi Turn Dialog · Instruction Following · Scientific Reasoning

Capybara

LDJnr

LDJnr's multi-turn reasoning dataset built with the Amplify-Instruct synthesis method - short but deep conversations on a single topic.

Training data · SFT Dataset · Multi Turn Dialog · Instruction Following · Scientific Reasoning

HelpSteer2

NVIDIA

NVIDIA's permissively-licensed human-annotated preference dataset with 5-axis Likert ratings - engineered to train high-quality reward models.

Training data · Preference · Instruction Following · Safety · Hallucination

Magpie

Magpie Align

A self-synthesis method (and family of datasets) that elicits high-quality instructions directly from an aligned LLM using only its chat template - no seed prompts required.

Training data · SFT Dataset · Instruction Following · Multi Turn Dialog

Nectar

Berkeley NEST

Berkeley NEST's seven-way ranked preference dataset built from GPT-4 rankings over responses from a diverse model pool, used to train Starling.

Training data · Preference · Instruction Following · Safety · Multi Turn Dialog

OpenHermes 2.5

Teknium

Teknium's million-row aggregation of high-quality GPT-4-style synthetic instructions that became the de facto open SFT baseline of 2023-2024.

Training data · SFT Dataset · Instruction Following · Multi Turn Dialog · Code Generation

ShareGPT

Anonymous Community

The original community-scraped corpus of ChatGPT conversations that bootstrapped Vicuna and the entire open-instruction-tuning era.

Training data · SFT Dataset · Multi Turn Dialog · Instruction Following

UltraChat

OpenBMB

Tsinghua / OpenBMB's large-scale multi-turn dialog dataset generated by two LLMs talking to each other across structured topic taxonomies.

Training data · SFT Dataset · Multi Turn Dialog · Instruction Following

UltraFeedback

OpenBMB

OpenBMB's 64k-prompt preference dataset built with GPT-4 critiques across instruction-following, truthfulness, honesty, and helpfulness - the de facto open DPO baseline.

Training data · Preference · Instruction Following · Hallucination · Safety

WildChat

Allen Institute for AI (Ai2)

1M real user-chatbot conversations collected by Allen AI / UW via a free GPT proxy - a window into how real users actually prompt LLMs.

Training data · SFT Dataset · Multi Turn Dialog · Instruction Following · Multilingual

WizardLM Evol-Instruct

Microsoft

Microsoft's "Evol-Instruct" recipe - automatically rewriting simple instructions into harder, more diverse ones using an LLM evolver.

Training data · SFT Dataset · Instruction Following · Math · Code Generation

Papers

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

NeurIPS · 2023

Introduces MT-Bench and validates LLM-as-a-Judge by showing GPT-4 judgments agree with humans at ~85%, the level of human-to-human agreement.
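
To make the agreement figure concrete, the sketch below computes a simple agreement rate between two sets of pairwise votes. It is a simplification under stated assumptions: the paper's own definition may treat ties and multiple human annotators differently, and the `agreement_rate` helper and the vote lists are hypothetical illustration data, not results from the paper.

```python
def agreement_rate(votes_a, votes_b):
    """Fraction of comparisons on which two judges cast the same vote."""
    if len(votes_a) != len(votes_b):
        raise ValueError("both judges must vote on the same comparisons")
    if not votes_a:
        raise ValueError("at least one comparison is required")
    matches = sum(a == b for a, b in zip(votes_a, votes_b))
    return matches / len(votes_a)


# Hypothetical votes on five pairwise comparisons ("A", "B", or "tie").
judge_votes = ["A", "B", "A", "tie", "B"]
human_votes = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge_votes, human_votes)
print(f"agreement: {rate:.0%}")
```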

Contributors

Wei-Lin Chiang, Ion Stoica
