Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies
- Year
- 2025
- Venue
- arXiv 2025
- Authors
- 35
Attribution
- Abstract & full text
- arxiv.org/abs/2503.10267
- TL;DR
- Semantic Scholar
Authors
Laurie Burchell, Mikko Aulamo, Vladislav Mikhailov, David Samuel, Jan Hajič, Barry Haddow, Nikita Moghe, Stephan Oepen, Marta Bañón, Jaume Zaragoza-Bernabeu, Gema Ramírez-Sánchez, Ville Komulainen, Ona de Gibert, Liane Guillou, Mariia Fedorova, Nikolay Arefyev, Pinzhen Chen, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Amanda Myntti, Dayyán O'Brien, Proyag Pal, Jousia Piha, Sampo Pyysalo, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová