While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
How to Train BERT with an Academic Budget
A low-cost method allows pretraining of masked language models competitive with BERT-base using a single inexpensive server in 24 hours.
- Year
- 2021
- Venue
- EMNLP 2021 11
- Authors
- 3
Attribution
- Abstract & full text
- arxiv.org/abs/2104.07705v2
- TL;DR
- Semantic Scholar