Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and the gradients. However, such a result does not carry over to practice, since it assumes real-valued parameters and exact internal operations, whereas real implementations use only a finite subset of the reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that, given a floating-point function $\phi$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(\phi\circ f)$, respectively. We then extend this result: given $\phi_1,\dots,\phi_n$, each $D^\mathtt{AD}(\phi_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., ReLU, ELU, GELU, Swish, sigmoid, and tanh.
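To make the setting concrete, the following minimal sketch (an illustration only, not the construction used in this work) shows that under floating-point arithmetic the output of automatic differentiation is determined by the sequence of operations rather than by the slope of the computed function. The `Dual` class, the one-unit network `f`, the bias `1e16`, and the loss `phi` are all hypothetical choices made for this example; it uses forward-mode differentiation in binary64 with the common convention that the ReLU derivative is 0 for non-positive inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A floating-point value paired with its AD derivative w.r.t. the input."""
    val: float
    dot: float

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.dot - other.dot)

    def __mul__(self, other):
        other = _lift(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)


def _lift(x):
    # Constants carry a zero derivative.
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def relu(x):
    # Assumed AD convention: derivative 0 whenever the input is <= 0.
    return x if x.val > 0.0 else Dual(0.0, 0.0)


def f(x):
    # Hypothetical one-unit "network": ReLU(x + c) - c with a large bias c.
    # In binary64, 1.0 + 1e16 rounds back to 1e16, so f(1.0) evaluates to 0.0.
    c = 1e16
    return relu(x + c) - c


def phi(y):
    # Hypothetical loss: squared distance to the target 2.0.
    d = y - 2.0
    return d * d


def value_and_ad_grad(fn, x):
    # Seed the input derivative with 1.0 and read off (value, AD gradient).
    out = fn(Dual(float(x), 1.0))
    return out.val, out.dot


print(value_and_ad_grad(f, 1.0))                       # (0.0, 1.0)
print(value_and_ad_grad(lambda x: phi(f(x)), 1.0))     # (4.0, -4.0)
```

Near `x = 1.0` the floating-point function `f` is locally constant (for instance, `f` returns `0.0` at both `0.5` and `1.0`), yet the automatic differentiation output is `1.0` for `f` and `-4.0` for the composition. This gap between computed values and computed gradients is why the two are treated as separate objects to be represented.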