Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billion parameters. The base model of YuLan is pre-trained on approximately $1.7$T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training was completed in January 2024, and the model achieves performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and code are available at https://github.com/RUC-GSAI/YuLan-Chat.
YuLan: An Open-source Large Language Model
YuLan, a 12-billion-parameter open-source LLM pre-trained on approximately 1.7T tokens of English, Chinese, and multilingual text, combines a three-stage pre-training method with a curriculum-learning framework and achieves performance on par with state-of-the-art LLMs on various English and Chinese benchmarks.
- Year
- 2024
- Venue
- arXiv 2024
- Authors
- 38
Attribution
- Abstract & full text
- arxiv.org/abs/2406.19853
- TL;DR
- Semantic Scholar
Authors
Lei Zhang, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou, Lei Wang, Yushuo Chen, Zhewei Wei, Yankai Lin, Beichen Zhang, Jun Xu, Kun Zhou, Rui Yan, Wentong Chen, Feng Wang, Wayne Xin Zhao, Junyi Li, Xiaolei Wang, Yupeng Hou, Zican Dong, Zhipeng Chen, Xinyu Tang, Ze-Feng Gao, Kelong Mao, Xiaoxue Cheng, Xu Chen, Qian Cao, Wenbing Huang, Ruihua Song, Di Hu, Shufang Xie, Jiaxin Mao, Yuhan Chen, Yiding Sun, Yihan Wu, Qiangqiang Ren, Xincheng Pang, Yueguo Chen, Weizheng Lu