The curvature exponent $α$ in $h_k \propto σ_k^α$ -- governing how Hessian eigenvalues scale with gradient singular values -- varies systematically across layer types ($α \approx 2$ for convolutions, $\approx 1$ for transformer attention, $< 1$ for MLP up-projections). Why? We prove the Spectral Alignment Decomposition: $α = 2 + d\log Φ_k / d\log σ_k$, where $Φ_k$ measures alignment between Kronecker factor eigenbases and gradient singular directions. This reduces "why does $α$ vary?" to a geometric question we answer for LayerNorm, residual connections, and softmax heads. The decomposition implies a spectral transfer identity $s = αγ$ linking curvature exponent $α$, effective gradient rank-decay $γ$, and Hessian decay exponent $s$. The identity is algebraic; its empirical content is that $α$ and $γ$, fit on independent data (HVPs vs. SVD), recover $s$ to 2% median error across 93 layers, five architectures, and three datasets -- with no free parameters. A zeta-function bound on participation ratio shows curvature concentrates onto effectively one direction per layer. As a proof of concept, we derive the architecture-adaptive preconditioner $T(σ;α)$ and show that Spectral Newton -- implementing $T$ in the gradient singular basis -- outperforms AdamW on vision benchmarks where $α \approx 2$.
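The transfer identity can be illustrated with a minimal sketch on synthetic spectra. It is not the paper's fitting procedure, which uses HVPs and SVDs of real layers. The sketch assumes that $γ$ and $s$ are power-law decay exponents in the index $k$, i.e. $σ_k \propto k^{-γ}$ and $h_k \propto k^{-s}$, so that substituting into $h_k \propto σ_k^α$ gives $s = αγ$. The helper names (`fit_loglog_slope`, `participation_ratio`), the noise level, and the chosen exponents are all hypothetical. The participation ratio is taken here as $(\sum_k h_k)^2 / \sum_k h_k^2$, a common definition that may differ from the one used in the paper.

```python
import numpy as np


def fit_loglog_slope(x, y):
    """Least-squares slope of log(y) against log(x)."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope


def participation_ratio(h):
    """(sum h)^2 / sum h^2: an effective count of dominant directions."""
    h = np.asarray(h, dtype=float)
    return h.sum() ** 2 / np.square(h).sum()


# Synthetic spectra (assumed values, not taken from the paper).
rng = np.random.default_rng(0)
k = np.arange(1, 201, dtype=float)
gamma_true, alpha_true = 1.5, 2.0

# sigma_k ~ k^(-gamma) and h_k ~ sigma_k^alpha, each with small log-normal noise.
sigma = k ** (-gamma_true) * np.exp(0.02 * rng.standard_normal(k.size))
h = sigma ** alpha_true * np.exp(0.02 * rng.standard_normal(k.size))

# Three separate log-log fits.
gamma_hat = -fit_loglog_slope(k, sigma)   # decay of singular values in k
alpha_hat = fit_loglog_slope(sigma, h)    # curvature exponent
s_hat = -fit_loglog_slope(k, h)           # decay of Hessian eigenvalues in k

rel_err = abs(s_hat - alpha_hat * gamma_hat) / s_hat
print(f"alpha={alpha_hat:.3f} gamma={gamma_hat:.3f} s={s_hat:.3f} rel_err={rel_err:.4f}")

# For a pure power law h_k = k^(-s), the participation ratio tends to
# zeta(s)^2 / zeta(2s) as the number of terms grows, which approaches 1 for large s.
pr = participation_ratio(k ** (-3.0))
print(f"participation ratio for s=3: {pr:.3f}")
```

On this noiseless-in-the-limit toy problem the product of the fitted $α$ and $γ$ should track the directly fitted $s$ closely. The empirical claim in the abstract is that the same agreement holds when the two exponents are measured from different data sources on real networks.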
Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent
- Year
- 2026
- Hosting
- Full text hosted (CC-BY-4.0)
Attribution
- Abstract & full text
- arxiv.org/abs/2606.02596 (CC-BY-4.0)
- TL;DR
- Semantic Scholar