What capabilities does WildBench test?

WildBench evaluates instruction following, LLM judging, and multi-turn dialog.

How can a model improve its WildBench score?

Tools linked to WildBench on Sophon, such as RL environments, datasets, and scaffolds that target this eval, include WildChat.

What license is WildBench under?

WildBench is available under ODC-BY-1.0.

Active

1,024 real, long, challenging user prompts mined from WildChat conversations and judged by GPT-4 with rubric-based pairwise scoring.
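
As a rough illustration of how pairwise judgments could be rolled up into a single score, the sketch below averages per-prompt verdicts against one baseline. The verdict labels, the reward values, and the function name `pairwise_score` are assumptions made for this example and are not taken from this page; the benchmark's actual rubric and aggregation may differ.

```python
from statistics import mean

# Hypothetical verdict labels and reward values (illustrative assumption only).
REWARD = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def pairwise_score(verdicts):
    """Average the per-prompt rewards of a model compared against one baseline."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    # An unknown label raises KeyError, which surfaces malformed judge output early.
    return mean(REWARD[v] for v in verdicts)


# Four judged prompts: one clear win, one tie, one narrow loss, one narrow win.
example = ["much_better", "tie", "slightly_worse", "slightly_better"]
score = pairwise_score(example)
print(score)
```

A positive score means the judge preferred the model over the baseline on average, and a negative score means the opposite.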

Publisher: Allen Institute for AI (Ai2)
Capabilities: Instruction Following, LLM Judging, Multi-Turn Dialog
Format: HF Dataset
Size: 1,024 tasks
License: ODC-BY-1.0
Published: Mar 2024
Notable for: Benchmark for evaluating instruction following, LLM judging, and multi-turn dialog.
Canonical: huggingface.co/spaces/allenai/WildBench
Also on: github.com/allenai/WildBench

Implementations, trainers, datasets and scaffolds linked to this eval.

Allen Institute for AI (Ai2)

WildChat: 1M real user-chatbot conversations collected by Allen AI / UW via a free GPT proxy, offering a window into how real users actually prompt LLMs.
