From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Introduces Arena-Hard, a 500-prompt benchmark automatically curated from Chatbot Arena traffic. Scored with an LLM-as-a-Judge, its model rankings correlate at roughly 0.9 with Arena Elo.

Publisher
LMArena
Year
2024
Venue
preprint
Authors
9
Hosting
External source, license unknown

Introduces 2 artifacts: 1 eval, 1 tool

TL;DR

Semantic Scholar

This work introduces BenchBuilder, an automated pipeline that uses LLMs to curate high-quality, open-ended prompts from large crowdsourced datasets, enabling continuous benchmark updates without a human in the loop. It establishes a framework for the scalable curation of automated benchmarks from extensive data.
