Impact of Corpora Quality on Neural Machine Translation

Data corruption in large parallel corpora can degrade the quality of neural machine translation systems; the paper outlines methods for identifying and removing problematic sentences.

Year
2018
Venue
arXiv 2018
Authors
1
Hosting
Abstract only (ARXIV-DEFAULT)

Attribution

Abstract & full text
arxiv.org/abs/1810.08392 (ARXIV-DEFAULT)
TL;DR
Semantic Scholar

Abstract

Large parallel corpora that are automatically obtained from the web, documents or elsewhere often contain many corrupted parts, which are bound to negatively affect the quality of the systems and models that learn from these corpora. This paper describes frequent problems found in such data and how they affect neural machine translation systems, as well as how to identify and deal with them. The solutions are summarised in a set of scripts that remove problematic sentences from input corpora.
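
To make the idea of removing problematic sentences concrete, here is a minimal Python sketch of a parallel-corpus filter. It is not taken from the paper's scripts: the function names `is_clean` and `filter_pairs`, the chosen checks (empty side, untranslated copy, overlong sentence, extreme length ratio) and the thresholds `max_ratio` and `max_tokens` are all illustrative assumptions about what such a filter might look like.

```python
from typing import Iterable, Iterator, Tuple


def is_clean(src: str, tgt: str, max_ratio: float = 3.0, max_tokens: int = 100) -> bool:
    """Return True if a sentence pair passes a few generic sanity checks.

    The checks and thresholds are illustrative assumptions, not the
    filters described in the paper.
    """
    src, tgt = src.strip(), tgt.strip()
    # Drop pairs where either side is empty.
    if not src or not tgt:
        return False
    # Drop pairs where the target is an untranslated copy of the source.
    if src == tgt:
        return False
    # Whitespace tokenisation is a crude but dependency-free length estimate.
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Drop overly long sentences.
    if n_src > max_tokens or n_tgt > max_tokens:
        return False
    # Drop pairs whose lengths differ too much to be plausible translations.
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
        return False
    return True


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs that pass is_clean, with whitespace stripped."""
    for src, tgt in pairs:
        if is_clean(src, tgt):
            yield src.strip(), tgt.strip()
```

In practice such a filter would be applied line by line to the aligned source and target files of a corpus, and the thresholds would need tuning per language pair, since plausible length ratios differ between languages.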
