From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Introduces Arena-Hard, a 500-prompt benchmark automatically curated from Chatbot Arena traffic; its LLM-as-a-Judge scores correlate at about 0.9 with Arena Elo.

Publisher: LMArena
Year: 2024
Venue: preprint
ArXiv: arxiv.org/abs/2406.11939
Code: github.com/lm-sys/arena-hard-auto
Authors: 8
Hosting: External source; license unknown

Attribution

Abstract & full text: arxiv.org/abs/2406.11939
TL;DR: semanticscholar.org/paper/05f02b4ed43d01f3efbbdcb454cc17b333f74817
Code: github.com/lm-sys/arena-hard-auto

Introduces 2 artifacts: 1 eval, 1 tool

TL;DR

Semantic Scholar

This work introduces BenchBuilder, an automated pipeline that uses LLMs to curate high-quality, open-ended prompts from large crowdsourced datasets, enabling continuous benchmark updates without a human in the loop, and provides a framework for the scalable curation of automated benchmarks from extensive data.
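
The curation idea in the TL;DR can be illustrated with a minimal sketch: score each crowdsourced prompt against a set of quality criteria and keep only the high scorers. This is not the BenchBuilder implementation. The names `curate`, `toy_score`, `ScoredPrompt`, and `CRITERIA`, the three criteria, and the threshold are all hypothetical, and `toy_score` is a keyword heuristic standing in for the LLM annotator that a real pipeline would call.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ScoredPrompt:
    text: str
    score: int  # number of quality criteria the prompt satisfies


# Hypothetical criteria; a real pipeline would ask an LLM to judge each one.
CRITERIA: List[Callable[[str], bool]] = [
    lambda p: len(p.split()) >= 8,                      # enough detail
    lambda p: any(k in p.lower() for k in ("implement", "prove", "debug")),
    lambda p: "explain" in p.lower(),                   # asks for reasoning
]


def toy_score(prompt: str) -> int:
    """Stand-in for an LLM annotator: count how many criteria hold."""
    return sum(1 for check in CRITERIA if check(prompt))


def curate(
    prompts: Iterable[str],
    score_fn: Callable[[str], int],
    min_score: int,
) -> List[ScoredPrompt]:
    """Keep prompts whose score meets the threshold, best first."""
    kept = [ScoredPrompt(p, score_fn(p)) for p in prompts]
    kept = [sp for sp in kept if sp.score >= min_score]
    kept.sort(key=lambda sp: sp.score, reverse=True)
    return kept


raw_prompts = [
    "hi",
    "Implement a thread-safe LRU cache in Go and explain the locking strategy.",
    "What is 2+2?",
]
for sp in curate(raw_prompts, toy_score, min_score=2):
    print(sp.score, sp.text)
```

Swapping `toy_score` for a function that queries a judge model would leave `curate` unchanged, which is the point of passing the scorer in as a parameter.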

Artifacts

Evals

Arena-Hard

Tools

BenchBuilder

Authors

Banghua Zhu, Evan Frick, Ion Stoica, Joseph E. Gonzalez, Lisa Dunlap, Tianhao Wu, Tianle Li, Wei-Lin Chiang