Reliable, Reproducible, and Really Fast Leaderboards with Evalica

Evalica is an open-source toolkit designed to create reliable and reproducible leaderboards for instruction-tuned large language models with support for human and machine feedback.

Year: 2024
Venue: arXiv 2024
ArXiv: arxiv.org/abs/2412.11314
Authors: 1

Abstract

The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), calls for modern evaluation protocols that incorporate human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its web interface, command-line interface, and Python API.

Authors

Dmitry Ustalov