We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently introduced text-to-text Transformer model AraT5, endowing it with a powerful ability to decode into Arabic. The toolkit supports a number of diverse decoding methods, which also makes it suited to generating paraphrases of its MSA translations as an added value. To train TURJUMAN, we sample from publicly available parallel data, employing a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).
TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation
TURJUMAN, a neural translation toolkit, uses the AraT5 Transformer model to translate 20 languages into Arabic, offering diverse decoding methods and paraphrasing capabilities.
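The abstract mentions sampling parallel data with "a simple semantic similarity method" but does not say how it works. The sketch below shows one common way such a filter could look, assuming each sentence is mapped to a vector by some multilingual encoder and a pair is kept when the cosine similarity of the two vectors reaches a threshold. The function names, the `embed` callable, and the 0.7 threshold are all hypothetical and are not taken from the paper or the toolkit.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_parallel_pairs(
    pairs: Iterable[Pair],
    embed: Callable[[str], Vector],
    threshold: float = 0.7,  # assumed cut-off, not a value from the paper
) -> List[Pair]:
    """Keep (source, target) pairs whose embeddings are at least `threshold` similar."""
    kept: List[Pair] = []
    for source, target in pairs:
        if cosine_similarity(embed(source), embed(target)) >= threshold:
            kept.append((source, target))
    return kept
```

In practice `embed` would wrap a multilingual sentence encoder. For a quick check it can be any callable that returns a vector, such as a lookup in a small dictionary of hand-written embeddings.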
- Year
- 2022
- Venue
- OSACT (LREC) 2022
- Authors
- 3
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/2206.03933
- TL;DR
- Semantic Scholar