WildBench

Active

A benchmark of 1,024 real, long, challenging user prompts mined from WildChat conversations. Model responses to these prompts are judged by GPT-4 with rubric-based pairwise scoring.

Format: HF Dataset
Size: 1,024 tasks
License: ODC-BY-1.0
Published: Mar 2024
Notable for: Benchmark for evaluating instruction following, LLM judging, and multi-turn dialog.

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is WildBench?
WildBench is a benchmark of 1,024 real, long, challenging user prompts mined from WildChat conversations. Model responses are judged by GPT-4 with rubric-based pairwise scoring.
What capabilities does WildBench test?
WildBench evaluates instruction following, LLM judging, and multi-turn dialog.
How can a model improve its WildBench score?
Tools linked to WildBench on Sophon include WildChat. Linked tools are RL environments, datasets, and scaffolds that target this eval.
What license is WildBench under?
WildBench is available under ODC-BY-1.0.
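
For readers unfamiliar with pairwise scoring, here is a minimal sketch of how per-task judge verdicts against a baseline model might be averaged into one number. The verdict labels, their weights, and the function name `pairwise_score` are assumptions made for illustration; this page does not specify WildBench's actual scoring scheme.

```python
# Hypothetical verdict labels and weights, assumed for illustration only.
# They are not taken from this page or from the WildBench release.
VERDICT_WEIGHTS = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def pairwise_score(verdicts):
    """Average the weights of a model's per-task verdicts versus a baseline.

    `verdicts` holds one label per task, each a key of VERDICT_WEIGHTS.
    The result lies in [-1, 1]; positive means the model was preferred
    over the baseline on balance.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(VERDICT_WEIGHTS[v] for v in verdicts) / len(verdicts)


if __name__ == "__main__":
    sample = ["much_better", "tie", "slightly_worse", "slightly_better"]
    print(pairwise_score(sample))
```

Averaging signed weights keeps the score symmetric: swapping the model and the baseline flips the sign without changing the magnitude.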