Deep Network Approximation: Beyond ReLU to Diverse Activation Functions

Year
2023
Abstract & full text: arxiv.org/abs/2307.06555
Abstract

This paper explores the expressive power of deep neural networks for a diverse range of activation functions. An activation function set $\mathscr{A}$ is defined to encompass the majority of commonly used activation functions, such as $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$. We demonstrate that for any activation function $\varrho\in \mathscr{A}$, a $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be approximated to arbitrary precision by a $\varrho$-activated network of width $3N$ and depth $2L$ on any bounded set. This finding enables the extension of most approximation results achieved with $\mathtt{ReLU}$ networks to a wide variety of other activation functions, albeit with slightly increased constants. Significantly, we establish that the $(\text{width},\,\text{depth})$ scaling factors can be further reduced from $(3,2)$ to $(1,1)$ if $\varrho$ falls within a specific subset of $\mathscr{A}$. This subset includes activation functions such as $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, and $\mathtt{Mish}$.
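
To give some intuition for why the scaling factors might drop to $(1,1)$ for an activation such as $\mathtt{Softplus}$, the sketch below checks numerically that the rescaled function $\mathtt{Softplus}(kx)/k$ approaches $\mathtt{ReLU}(x)$ as $k$ grows, with a gap of at most $\ln(2)/k$. This is a standard elementary fact and is offered only as an illustration; it is not claimed to be the construction used in the paper. The names `scaled_softplus`, `max_gap`, and the parameter `k` are assumptions introduced here.

```python
import math


def relu(x: float) -> float:
    return max(x, 0.0)


def softplus(x: float) -> float:
    # Numerically stable form of log(1 + exp(x)):
    # log(1 + exp(x)) = max(x, 0) + log(1 + exp(-|x|)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def scaled_softplus(x: float, k: float) -> float:
    # Hypothetical helper: Softplus with input scaled by k and output by 1/k.
    return softplus(k * x) / k


def max_gap(k: float, lo: float = -5.0, hi: float = 5.0, steps: int = 1001) -> float:
    # Largest absolute difference to ReLU on a uniform grid over [lo, hi].
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(scaled_softplus(x, k) - relu(x)) for x in xs)


if __name__ == "__main__":
    for k in (1.0, 10.0, 100.0):
        # The second column should never exceed the third (the ln(2)/k bound).
        print(k, max_gap(k), math.log(2) / k)
```

For a single hidden layer, the factor $k$ can be absorbed into the incoming weights and biases and the factor $1/k$ into the outgoing weights, so the approximating $\mathtt{Softplus}$ neuron needs no extra width or depth. How such errors are controlled across many layers is the subject of the paper itself and is not addressed by this sketch.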