This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings (words from one language that are introduced into another without orthographic adaptation) and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, richer in out-of-vocabulary (OOV) words, and more topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with subword embeddings, together with either Transformer-based embeddings pretrained on code-switched data or a combination of contextualized word embeddings, outperforms a multilingual BERT-based model.
Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling
A new annotated Spanish corpus is used to evaluate the performance of various sequence labeling models, including CRF, BiLSTM-CRF, and Transformer-based models, for identifying lexical borrowings.
- Year
- 2022
- Venue
- ACL 2022
- Authors
- 2
Attribution
- Abstract & full text
- arxiv.org/abs/2203.16169
- TL;DR
- Semantic Scholar