Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
GLU Variants Improve Transformer
Replacing the ReLU or GELU activation in the Transformer's feed-forward sublayers with GLU or one of its variants improves quality for some of the variants.
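The gated feed-forward sublayer described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch, not the author's code. The function name `gated_ffn`, the `GATES` table, and the tensor shapes are hypothetical choices made here. The variant labels (Bilinear, ReGLU, GEGLU, SwiGLU) come from the full paper rather than the abstract, and biases are omitted on the assumption that the layer is bias-free.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gelu(x):
    # tanh approximation of GELU (an assumption; an exact erf-based form also exists)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def swish(x):
    # Swish with beta = 1
    return x * sigmoid(x)


# Function applied to the first projection before the component-wise product.
GATES = {
    "glu": sigmoid,                          # original GLU: sigmoid gate
    "bilinear": lambda x: x,                 # linear "gate"
    "reglu": lambda x: np.maximum(x, 0.0),   # ReLU gate
    "geglu": gelu,                           # GELU gate
    "swiglu": swish,                         # Swish gate
}


def gated_ffn(x, W, V, W2, gate="glu"):
    """Bias-free gated feed-forward sublayer: (f(xW) * xV) W2.

    x: (tokens, d_model); W, V: (d_model, d_ff); W2: (d_ff, d_model).
    """
    hidden = GATES[gate](x @ W) * (x @ V)  # component-wise product of two projections
    return hidden @ W2


# Usage: random weights, output keeps the input's shape.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
W, V = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 8))
y = gated_ffn(x, W, V, W2, gate="swiglu")
```

A gated sublayer has three weight matrices where a plain ReLU or GELU feed-forward sublayer has two, so `d_ff` would need to be reduced if the two are to be compared at an equal parameter count.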
- Year: 2020
- Venue: arXiv 2020
- Authors: 1
Attribution
- Abstract & full text: arxiv.org/abs/2002.05202
- TL;DR: Semantic Scholar