0

Towards the Law of Capacity Gap in Distilling Language Models

The capacity gap between teacher and student language models has an optimal point that remains consistent across different student sizes and architectures, enabling efficient distillation and competitive performance.

Year
2023
Venue
arXiv 2023
Authors
4
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2311.07052v3ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Language model (LM) distillation is a trending area that aims to distil the knowledge residing in a large teacher LM to a small student one. While various methods have been proposed to maximize the effectiveness of the distillation, significant challenges persist, particularly when there is a substantial capacity gap between the teacher and student LMs. This issue, often referred to as the \textit{curse} of capacity gap, suggests that a larger teacher does not necessarily result in a superior student compared to one distilled from a smaller teacher. In other words, there is likely an optimal teacher yielding the best student along the scaling course of the teacher. However, the curse of capacity gap can not be tackled without notable compute overhead, as indicated in previous studies. In the context of large LMs (LLMs), previously viable approaches become much less meaningful, as it is an impossible triangle to distill an expected student from an optimal teacher student with small compute overhead. Fortunately, the impossible triangle can fortunately be possible provided an inducted \textit{law} of capacity gap. In this paper, we take the spirits of scaling law and reveal that the optimal teacher scale almost consistently follows a linear scaling with the student scale across different model architectures and data scales. The law later guides us to distil a 3B student LM (termed \textsc{MiniMA}) from LLaMA2-7B. \textsc{MiniMA} is demonstrated to outperform a wide range of 3B competitors and could even compete with several 7B models.

Authors

4