Training large language models requires vast amounts of data, posing a challenge for less widely spoken languages like Norwegian and even more so for truly low-resource languages like Northern Sámi. To address this issue, we present a novel three-stage continual training approach that substantially improves the downstream performance together with the inference efficiency for the target languages. Based on our findings, we train, evaluate, and openly release a new generative language model for Norwegian Bokmål, Nynorsk, and Northern Sámi with 11.4 billion parameters: NorMistral-11B.
Small Languages, Big Models: A Study of Continual Training on Languages of Norway
A new three-stage continual training approach improves the downstream performance and inference efficiency of an 11.4-billion-parameter generative language model for Norwegian and Northern Sámi.
- Year
- 2024
- Venue
- arXiv 2024
- Authors
- 7
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/2412.06484v2
- TL;DR
- Semantic Scholar