Document-aligned Japanese-English Conversation Parallel Corpus

A document-aligned Japanese-English conversation corpus is developed to improve and evaluate document-level machine translation, focusing on context-aware translations and overcoming data scarcity.

Year
2020
Venue
WMT (EMNLP) 2020 11
Authors
4

Abstract & full text
arxiv.org/abs/2012.06143

Abstract

Sentence-level (SL) machine translation (MT) has reached acceptable quality for many high-resource languages, but document-level (DL) MT has not: it is difficult to 1) train, because little DL data is available; and 2) evaluate, because the main methods and data sets focus on SL evaluation. To address the first issue, we present a document-aligned Japanese-English conversation corpus, including balanced, high-quality business conversation data for tuning and testing. To address the second, we manually identify the main areas where SL MT fails to produce adequate translations for lack of context. We then create an evaluation set in which these phenomena are annotated to facilitate automatic evaluation of DL systems. We train MT models on our corpus to demonstrate how using context leads to improvements.
