
Unsupervised Evaluation of Interactive Dialog with DialoGPT

FED, an automatic evaluation metric for open-domain dialogue, uses DialoGPT to assess dialog quality without ground truth or training data, correlating moderately to strongly with human judgments.

Year
2020
Venue
SIGDIAL (ACL) 2020 7
Authors
2

Attribution

Abstract & full text
arxiv.org/abs/2006.12719
TL;DR
Semantic Scholar

Abstract

It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
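
As a rough illustration of how a metric can score dialog without a ground-truth response or training data, the sketch below compares how likely a pretrained dialog model finds approving versus disapproving follow-up utterances after a given context. This is an assumption about the general approach, not the paper's implementation: the abstract does not state the scoring rule, and `follow_up_score`, `log_likelihood`, and the mean-difference formula are hypothetical names and choices made for this example.

```python
from typing import Callable, Sequence

# Hypothetical interface: returns log P(follow_up | context) under a
# pretrained dialog language model such as DialoGPT. The model call itself
# is deliberately left out of this sketch.
LogLikelihood = Callable[[str, str], float]


def follow_up_score(
    context: str,
    positive: Sequence[str],
    negative: Sequence[str],
    log_likelihood: LogLikelihood,
) -> float:
    """Mean log-likelihood of positive follow-ups minus that of negative ones.

    A higher value means the model finds approving follow-ups more plausible
    than disapproving ones after `context`. No reference response is needed.
    """
    if not positive or not negative:
        raise ValueError("need at least one positive and one negative follow-up")
    pos = sum(log_likelihood(context, u) for u in positive) / len(positive)
    neg = sum(log_likelihood(context, u) for u in negative) / len(negative)
    return pos - neg
```

In this sketch, one dialog quality would correspond to one pair of follow-up sets (for example, "That's interesting!" against "That's boring."), and `context` can be either a single exchange or the whole conversation, which would mirror the turn-level and dialog-level scoring the abstract mentions. The example utterances are illustrative and not taken from the paper.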
