WildBench
Active
1,024 real, long, challenging user prompts mined from WildChat conversations and judged by GPT-4 with rubric-based pairwise scoring.
- Publisher
- Allen Institute for AI (Ai2)
- Format
- HF Dataset
- Size
- 1,024 tasks
- License
- ODC-BY-1.0
- Published
- Mar 2024
- Notable for
- Benchmark for evaluating instruction following, LLM judging, and multi-turn dialog.
Related tools
Implementations, trainers, datasets, and scaffolds linked to this eval.
FAQ
- What is WildBench?
- WildBench is a benchmark of 1,024 real, long, challenging user prompts mined from WildChat conversations. Responses are judged by GPT-4 with rubric-based pairwise scoring.
- What capabilities does WildBench test?
- WildBench evaluates instruction following, LLM judging, and multi-turn dialog.
- How can a model improve its WildBench score?
- Sophon links tools that target this eval: RL environments, datasets, and scaffolds. The tool currently linked to WildBench is WildChat.
- What license is WildBench under?
- WildBench is available under ODC-BY-1.0.
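The description above mentions pairwise scoring by a judge model. As a rough illustration of how per-prompt pairwise verdicts might be rolled up into a single number, here is a minimal sketch. The five verdict labels, their symmetric weights, and the function name `mean_pairwise_reward` are assumptions made for this example; this card does not specify the scale or the aggregation WildBench uses.

```python
# Hypothetical mapping from a judge's pairwise verdict to a numeric reward.
# The labels and weights are assumptions for illustration, not taken from the card.
LABEL_SCORES = {
    "much_better": 1.0,
    "slightly_better": 0.5,
    "tie": 0.0,
    "slightly_worse": -0.5,
    "much_worse": -1.0,
}


def mean_pairwise_reward(judgments):
    """Average per-prompt verdicts (model vs. a baseline) into a score in [-1, 1]."""
    if not judgments:
        raise ValueError("need at least one judgment")
    # An unknown label raises KeyError rather than being silently skipped.
    return sum(LABEL_SCORES[label] for label in judgments) / len(judgments)


if __name__ == "__main__":
    verdicts = ["much_better", "tie", "slightly_worse", "much_better"]
    print(mean_pairwise_reward(verdicts))
```

A positive mean indicates the model was preferred over the baseline more often, or more strongly, than the reverse. Under this assumed scale, a mean of 0 means the two were judged even overall.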