We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
OLMoE: Open Mixture-of-Experts Language Models
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE).
- Year
- 2024
- Venue
- arXiv 2024
- Authors
- 24
- Hosting
- Abstract only (ARXIV-DEFAULT)
Attribution
- Abstract & full text
- arxiv.org/abs/2409.02060 (ARXIV-DEFAULT)
- TL;DR
- Semantic Scholar
Authors
Alexander Wettig, Ali Farhadi, Hannaneh Hajishirzi, Jacob Morrison, Nathan Lambert, Niklas Muennighoff, Weijia Shi, Binyuan Hui, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Dustin Schwenk, Oyvind Tafjord, Pang Wei Koh, David Wadden, Noah A. Smith, Tim Dettmers, Douwe Kiela, Sewon Min, Amanpreet Singh