The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To address this, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR trains unsupervised models to measure several desirable qualities of dialog without relying on reference responses. USR is shown to correlate strongly with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). It additionally produces interpretable measures for these properties of dialog.
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
USR, an unsupervised and reference-free metric, effectively evaluates dialog models by measuring desirable qualities and correlates well with human judgment.
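The abstract reports correlation with human judgment at two granularities. As a rough illustration of the difference, the sketch below computes both on made-up data: turn-level correlation compares metric scores and human ratings response by response, while system-level correlation first averages both per system. The sample triples, the function names, and the choice of Spearman rank correlation are assumptions for illustration only, not the paper's data or code.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


def turn_level(turns):
    """Correlate metric scores with human ratings across individual responses."""
    return spearman([m for _, m, _ in turns], [h for _, _, h in turns])


def system_level(turns):
    """Average both scores per system, then correlate the per-system means."""
    by_system = defaultdict(list)
    for system, metric, human in turns:
        by_system[system].append((metric, human))
    metric_means = [mean(m for m, _ in rows) for rows in by_system.values()]
    human_means = [mean(h for _, h in rows) for rows in by_system.values()]
    return spearman(metric_means, human_means)


# Hypothetical (system, metric score, human rating) triples, one per response.
turns = [
    ("A", 0.2, 2), ("A", 0.5, 1), ("A", 0.4, 3),
    ("B", 0.6, 3), ("B", 0.3, 4), ("B", 0.7, 3),
    ("C", 0.9, 5), ("C", 0.6, 4), ("C", 0.8, 4),
]

print("turn-level:  ", round(turn_level(turns), 2))
print("system-level:", round(system_level(turns), 2))
```

On this toy data the metric ranks the three systems in the same order as the human ratings, so the system-level correlation is 1.0 even though the turn-level correlation is well below 1. This is the same pattern as the figures quoted in the abstract: averaging per system smooths out per-response noise.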
- Year: 2020
- Venue: usr-an-unsupervised-and-reference-free-1
- Authors: 2
Attribution
- Abstract & full text: arxiv.org/abs/2005.00456
- TL;DR: Semantic Scholar