From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Introduces Arena-Hard, a 500-prompt benchmark auto-curated from Chatbot Arena traffic that correlates ~0.9 with Arena Elo using LLM-as-a-Judge.
- Publisher: LMArena
- Year: 2024
- Venue: preprint
- Authors: 9
- Hosting: External source, license unknown
Introduces 2 artifacts - 1 eval, 1 tool
TL;DR
Semantic Scholar
This work introduces BenchBuilder, an automated pipeline that uses LLMs to curate high-quality, open-ended prompts from large crowdsourced datasets. The pipeline enables continuous benchmark updates without a human in the loop and establishes a framework for scalable, automated benchmark curation from extensive data.