0

Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

Distillation of an ensemble of GPT-2 and small LLaMA models exceeds the performance of its individual components and a similar model trained directly on the BabyLM dataset.

Year
2023
Venue
arXiv 2023
Authors
2
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2308.02019v2ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without distillation. This suggests that distillation can not only retain the full performance of the teacher model when the latter is trained on a sufficiently small dataset; it can exceed it, and lead to significantly better performance than direct training.

Authors

2