Query-Key Normalization for Transformers

QKNorm modifies the Transformer attention mechanism so that the softmax is less prone to saturation without sacrificing expressivity, improving low-resource translation by an average of 0.928 BLEU across five language pairs.

Year
2020
Venue
Findings of the Association for Computational Linguistics 2020
Authors
4
Hosting
Abstract only

Attribution

Abstract & full text
arxiv.org/abs/2010.04245
TL;DR
Semantic Scholar

Abstract

Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer's normalization to this setting, we propose QKNorm, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply $\ell_2$ normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead of dividing by the square root of the embedding dimension. We show improvements averaging 0.928 BLEU over state-of-the-art bilingual benchmarks for 5 low-resource translation pairs from the TED Talks corpus and IWSLT'15.
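
The mechanism described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the `eps` guard against division by zero, and the treatment of the learnable parameter as a single fixed scalar `g` are all assumptions made for the example. In a real model `g` would be trained, and its initialization is not specified here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-6):
    # Divide each vector along `axis` by its L2 norm (eps is an assumed guard).
    norm = np.sqrt(np.sum(x * x, axis=axis, keepdims=True))
    return x / np.maximum(norm, eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def qknorm_attention(q, k, v, g):
    # q: (heads, len_q, d_head), k: (heads, len_k, d_head), v: (heads, len_k, d_v)
    # g: scale factor standing in for the learnable parameter (assumed scalar).
    q_hat = l2_normalize(q, axis=-1)  # normalize along the head dimension
    k_hat = l2_normalize(k, axis=-1)
    # Entries of q_hat @ k_hat^T are cosine similarities in [-1, 1],
    # so the softmax logits are bounded by [-g, g].
    scores = g * (q_hat @ np.swapaxes(k_hat, -1, -2))
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def scaled_dot_product_attention(q, k, v):
    # Standard baseline for comparison: divide by sqrt(d_head) instead.
    d_head = q.shape[-1]
    scores = (q @ np.swapaxes(k, -1, -2)) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

Because the queries and keys are normalized before the product, rescaling either one leaves the attention weights unchanged in this sketch, whereas in the baseline large-magnitude queries or keys can push the softmax toward a one-hot distribution.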
