Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

The paper develops a taxonomy that categorizes memorization in language models by sequence characteristics (recitation, reconstruction, and recollection) and uses it to build a predictive model for memorization.

Year: 2024
Venue: arXiv 2024
Authors: 12

Abstract & full text: arxiv.org/abs/2406.17746v2

Abstract

Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
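
The three categories in the abstract can be read as a simple decision rule. The sketch below is only an illustration of that reading, not the authors' implementation: the `Sample` fields, the `DUPLICATE_THRESHOLD` value, and the choice to check duplication before predictability are all assumptions, since the abstract says only "highly duplicated" and "inherently predictable".

```python
from dataclasses import dataclass

# Hypothetical cutoff: the abstract does not say what counts as "highly duplicated".
DUPLICATE_THRESHOLD = 5


@dataclass
class Sample:
    text: str
    duplicate_count: int  # assumed: copies of this sequence in the training corpus
    is_predictable: bool  # assumed: output of some heuristic for predictable text
    is_memorized: bool    # assumed: result of a separate memorization test


def categorize(sample: Sample) -> str:
    """Assign a memorized sample to one of the three taxonomy categories."""
    if not sample.is_memorized:
        return "not memorized"
    # Assumption: duplication is checked first, so a sequence that is both
    # highly duplicated and predictable counts as recitation.
    if sample.duplicate_count > DUPLICATE_THRESHOLD:
        return "recitation"
    if sample.is_predictable:
        return "reconstruction"
    # Neither highly duplicated nor inherently predictable.
    return "recollection"
```

For example, under these assumptions a memorized sequence with `duplicate_count=1` and `is_predictable=False` falls into recollection, the category the abstract defines as "neither".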
