Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation

The use of multiple references in N-gram matching-based evaluation metrics improves their correlation with human evaluations, surpassing single-reference metrics and even neural-based ones, and mitigates data leakage in large language models.

Year: 2023
Venue: arXiv 2023
Authors: 4

Abstract & full text: arxiv.org/abs/2308.03131v4

Abstract

N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks. However, recent studies have revealed a weak correlation between these matching-based metrics and human evaluations, especially when compared with neural-based metrics like BLEURT. In this paper, we conjecture that the performance bottleneck in matching-based metrics may be caused by the limited diversity of references. To address this issue, we propose to utilize multiple references to enhance the consistency between these metrics and human evaluations. Within the WMT Metrics benchmarks, we observe that the multi-references F200spBLEU surpasses the conventional single-reference one by an accuracy improvement of 7.2%. Remarkably, it also exceeds the neural-based BERTscore by an accuracy enhancement of 3.9%. Moreover, we observe that the data leakage issue in large language models (LLMs) can be mitigated to a large extent by our multi-reference metric. We release the code and data at https://github.com/SefaZeng/LLM-Ref
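
To illustrate why additional references can help a matching-based metric, the following is a minimal sketch of BLEU-style clipped n-gram precision computed against several references. It is not the authors' implementation: the function names (`ngrams`, `clipped_precision`) and the example sentences are hypothetical, whitespace tokenization stands in for the SentencePiece tokenization used by spBLEU, and the brevity penalty and geometric mean over n-gram orders are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n=1):
    """BLEU-style modified n-gram precision against one or more references.

    Each hypothesis n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    Assumption: plain whitespace tokenization (illustrative only).
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0

    # For every n-gram, keep its maximum count over all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical example sentences, not taken from the paper's data.
hypothesis = "the cat sat on the rug"
single_ref = ["the cat is on the mat"]
multi_ref = single_ref + ["a cat sat on a rug"]

print(clipped_precision(hypothesis, single_ref))
print(clipped_precision(hypothesis, multi_ref))
```

In this toy case the second reference covers the words "sat" and "rug" that the first one misses, so the unigram precision rises from 4/6 to 1.0 even though the hypothesis is unchanged. This is the intuition behind the abstract's conjecture that limited reference diversity penalizes valid outputs.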
