A Length-Extrapolatable Transformer

A novel approach using relative position embeddings and blockwise causal attention enhances Transformer models for effective length extrapolation in language modeling tasks.

Year
2022
Venue
arXiv 2022
Authors
9


Abstract & full text
arxiv.org/abs/2212.10554

Abstract

Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating on longer sequences. We define attention resolution as an indicator of extrapolation. We then propose two designs to improve this metric in Transformers. Specifically, we introduce a relative position embedding that explicitly maximizes attention resolution. Moreover, we use blockwise causal attention during inference for better resolution. We evaluate different Transformer variants on language modeling. Experimental results show that our model achieves strong performance in both interpolation and extrapolation settings. The code will be available at https://aka.ms/LeX-Transformer.
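
The abstract names blockwise causal attention but does not describe its layout. As a rough illustration only, the sketch below builds one plausible attention mask under an assumption that is not taken from the abstract: each query position may attend to keys in its own block and in the immediately preceding block, and never to future positions. The function name `blockwise_causal_mask` and the parameter `block_size` are hypothetical and do not come from the authors' code.

```python
def blockwise_causal_mask(seq_len, block_size):
    """Return mask[q][k] == True where query q may attend to key k.

    Assumption (not stated in the abstract): a query sees keys in its own
    block and in the immediately preceding block, restricted to k <= q.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")

    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        row = []
        for k in range(seq_len):
            k_block = k // block_size
            causal = k <= q                    # no attention to future tokens
            nearby = q_block - k_block <= 1    # own block or the previous one
            row.append(causal and nearby)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print a small mask: 'x' marks an allowed query/key pair.
    for row in blockwise_causal_mask(seq_len=6, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, the number of keys visible to any query is bounded by twice the block size regardless of sequence length, which is the property that would keep inference-time attention spans close to those seen in training.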
