Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation.

Year
2025
Venue
arXiv 2025
Authors
101

Abstract & full text
arxiv.org/abs/2509.14233

Abstract

We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
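
As an illustration of what retroactively respecting robots.txt exclusions can look like, the following is a minimal sketch that re-checks already-crawled URLs against a site's robots.txt and keeps only the ones a crawler may fetch. It is not the Apertus data pipeline: the function name `allowed_urls`, the user agent `ExampleBot`, and the sample robots.txt and URLs are all hypothetical, and only Python's standard `urllib.robotparser` module is used.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, urls: list[str], user_agent: str = "ExampleBot") -> list[str]:
    """Return the URLs that the given robots.txt permits user_agent to fetch.

    Hypothetical helper for illustration; not taken from the Apertus code.
    """
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt as it might be published today by the site owner.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical documents that were crawled earlier from the same site.
CRAWLED = [
    "https://example.org/articles/1",
    "https://example.org/private/notes",
]

kept = allowed_urls(ROBOTS, CRAWLED)
print(kept)
```

In this sketch the exclusion is applied after the fact: the robots.txt rules are evaluated against documents that are already in the corpus, so a page under `/private/` is dropped even though it had been collected earlier.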

Authors

101
Nathan Ranchin, Florian Tramer, Kumar Shridhar, Auguste Poiroux, Antoine Bosselut, Torsten Hoefler, Alejandro Hernández Cano, Angelika Romanou, Kyle Matoba, Matteo Pagliardini, Simin Fan, Syrielle Montariol, Martin Jaggi, Andrei Panferov, YiXuan Xu, Alexander Hoyle, Raghav Singhal, Kaustubh Ponkshe, Ido Hakimi, Andreas Krause, Frederike Lübeck, Barna Pásztor, Dongyang Fan, Xiaozhe Yao, Ana Klimovic, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Alexander Hägele, Allen Hao Huang, Antoni-Joan Solergibert, Dhia Garbaya, Eduard Frank Ďurech, Juan García Giraldo, Mete Ismayilzada, Skander Moalla, Tiancheng Chen, Michael Aerni, Badr AlKhamissi, Ines Altemir Marinas, Mohammad Hossein Amani, Matin Ansaripour, Ilia Badanin, Harold Benoit, Emanuela Boros, Nicholas Browning, Fabian Bösch, Maximilian Böther, Niklas Canova, Camille Challier, Clement Charmillot, Jonathan Coles, Jan Deriu, Arnout Devos, Lukas Drescher, Daniil Dzenhaliou, Maud Ehrmann, Silin Gao, Miguel Gila, María Grandury, Diba Hashemi, Jiaming Jiang, Mark Klein, Andrei Kucharavy, Anastasiia Kucherenko, Roman Machacek, Theofilos Manitaras, Andreas Marfurt, Simon Matrenok, Henrique Mendonça, Fawzi Roberto Mohamed, Luca Mouchel, Sven Najem-Meyer, Jingwei Ni, Gennaro Oliva, Elia Palme, Léo Paoletti, Marco Passerini, Ivan Pavlov, Javi Rando, Mathieu Sauser, Jakhongir Saydaliev, Muhammad Ali Sayfiddinov, Marian Schneider, Stefano Schuppli, Marco Scialanga, Andrei Semenov, Anna Sotnikova, Alexander Sternfeld, Ayush Kumar Tarun, Paul Teiletche, Jannis Vamvas, Hao Zhao, Alexander Ilic, Caglar Gulcehre, David Rosenthal, Elliott Ash, Joost VandeVondele, Livio Veraldi, Martin Rajman, Thomas Schulthess, Imanol Schlag