Holistic Evaluation of Language Models

Introduces HELM, a framework that evaluates language models across 16 core scenarios and 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) rather than reporting a single aggregate score.
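
To make the "many scenarios, many metrics" idea concrete, here is a minimal sketch of a scenario-by-metric grid. It does not use the HELM codebase or its API: the `Score` class, the `build_grid` and `coverage` helpers, the scenario names, and all numeric values are hypothetical placeholders chosen for illustration. Only the seven metric names come from the summary above.

```python
from dataclasses import dataclass

# The seven metric categories named in the summary above.
METRICS = (
    "accuracy", "calibration", "robustness",
    "fairness", "bias", "toxicity", "efficiency",
)


@dataclass(frozen=True)
class Score:
    """One measurement: a metric value for a scenario (hypothetical type)."""
    scenario: str
    metric: str
    value: float


def build_grid(scores, scenarios):
    """Arrange scores into a scenario x metric grid; unmeasured cells stay None."""
    grid = {s: {m: None for m in METRICS} for s in scenarios}
    for score in scores:
        if score.scenario not in grid:
            raise ValueError(f"unknown scenario: {score.scenario}")
        if score.metric not in METRICS:
            raise ValueError(f"unknown metric: {score.metric}")
        grid[score.scenario][score.metric] = score.value
    return grid


def coverage(grid):
    """Fraction of scenario/metric cells that have a measurement."""
    total = sum(len(row) for row in grid.values())
    filled = sum(v is not None for row in grid.values() for v in row.values())
    return filled / total if total else 0.0


# Placeholder scenario names and values, not results from the paper.
scenarios = ["question_answering", "summarization"]
scores = [
    Score("question_answering", "accuracy", 0.5),
    Score("question_answering", "toxicity", 0.25),
    Score("summarization", "robustness", 0.75),
]
grid = build_grid(scores, scenarios)
print(f"coverage: {coverage(grid):.2f}")
```

Keeping the grid explicit, including its empty cells, is the point of the sketch: a single averaged number would hide both the trade-offs between metrics and the scenario/metric pairs that were never measured.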

Publisher: Stanford Center for Research on Foundation Models (CRFM)
Year: 2022
Venue: TMLR
ArXiv: arxiv.org/abs/2211.09110
Code: github.com/stanford-crfm/helm
Authors: 5
Hosting: External source (license unknown)

Attribution

Abstract & full text: arxiv.org/abs/2211.09110
TL;DR: semanticscholar.org/paper/29abcf865613287c661385c39401424f709a3fda
Code: github.com/stanford-crfm/helm

Introduces 2 artifacts: 1 eval, 1 tool

TL;DR

Semantic Scholar

The paper presents Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. The authors intend HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

Artifacts

Evals

HELM (Holistic Evaluation of Language Models)

Tools

HELM (Holistic Evaluation of Language Models)

Authors

Dilara Soylu, Dimitris Tsipras, Percy Liang, Rishi Bommasani, Tony Lee