Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al., 2022), that gives the model access to a bank of pre-computed memories. We show that it is possible to fix many of the shortcomings of the original method, such as the need for fine-tuning, by critically assessing how positional encodings should be updated for the keys and values retrieved. This intuitive method uses the model's own key/query system to select and attend to the most relevant memories at each generation step, rather than using external embeddings. We demonstrate the importance of external information being retrieved in a majority of decoder layers, contrary to previous work. We open source a new counterfactual long-range retrieval benchmark, and show that Extended Mind Transformers outperform today's state of the art by 6% on average.
Extended Mind Transformers
Memorizing Transformers enhance pre-trained language models by accessing a bank of pre-computed memories, using the model's key/query system for selective memory retrieval, and outperform current state-of-the-art methods on long-range retrieval tasks.
- Year
- 2024
- Venue
- arXiv 2024
- Authors
- 2
- Hosting
- Abstract onlyARXIV-DEFAULT
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2406.02332ARXIV-DEFAULT
- TL;DR
- Semantic Scholar