Fast computation of a matrix product $W^\top X$ is a workhorse of modern LLMs. To make their deployment more efficient, a popular approach is to use a low-precision approximation $\widehat W$ in place of the true $W$ (``weight-only quantization''). Information theory demonstrates that an optimal algorithm for reducing the precision of $W$ depends on the (second-order) statistics of $X$ and requires a careful alignment of the vector quantization codebook with the PCA directions of $X$ (a process known as ``waterfilling allocation''). Dependence of the codebook on the statistics of $X$, however, is highly impractical. This paper proves that there exists a universal codebook that is simultaneously near-optimal for all possible statistics of $X$, in the sense of being at least as good as an $X$-adapted waterfilling codebook with rate reduced by 0.11 bit per dimension in the case when $W$ is Gaussian. Such a universal codebook would be an ideal candidate for a low-precision storage format, a topic of active modern research, but alas the existence proof is non-constructive.
Equivalently, our result shows the existence of a net in $\mathbb{R}^n$ that is a nearly optimal covering of a sphere simultaneously with respect to all Hilbert norms.
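For readers unfamiliar with the waterfilling allocation mentioned in the abstract, the following is a minimal sketch of the textbook reverse water-filling rule for independent Gaussian components under squared error. It is not the paper's construction. It assumes $W$ has i.i.d. Gaussian entries of variance `sigma_w2` and that the covariance of $X$ has eigenvalues `lam` along its PCA directions, so that the error in $W^\top X$ along direction $i$ is weighted by `lam[i]`. The function name `reverse_waterfill` and all parameter names are hypothetical.

```python
import math

def reverse_waterfill(variances, total_rate_bits, iters=100):
    """Reverse water-filling over independent Gaussian components (MSE).

    variances: effective per-component variances (assumed: lam[i] * sigma_w2).
    total_rate_bits: bit budget summed over all components.
    Returns (rates, distortions, water_level).
    """
    v = [float(x) for x in variances]
    lo, hi = 0.0, max(v)

    def spent(theta):
        # Bits used at water level theta; components below the level get 0 bits.
        return sum(0.5 * math.log2(x / theta) for x in v if x > theta)

    # Bisection on the water level: spent() is non-increasing in theta.
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if spent(theta) > total_rate_bits:
            lo = theta  # over budget, so raise the water level
        else:
            hi = theta
    theta = hi

    rates = [0.5 * math.log2(x / theta) if x > theta else 0.0 for x in v]
    dists = [min(x, theta) for x in v]
    return rates, dists, theta

# Illustrative numbers only: four PCA directions, 2 bits per dimension.
lam = [4.0, 1.0, 0.25, 0.01]
sigma_w2 = 1.0
rates, dists, theta = reverse_waterfill([l * sigma_w2 for l in lam], 2.0 * len(lam))
```

The allocation depends on `lam`, which is exactly the dependence on the statistics of $X$ that the abstract calls impractical. The stated result is that a single codebook exists which does at least as well as such an $X$-adapted allocation run with 0.11 bit per dimension less.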
Price of metric universality in vector quantization is at most 0.11 bit
- Year: 2026
- Hosting: full text hosted (CC-BY-4.0)
- Abstract & full text: arxiv.org/abs/2602.05790 (CC-BY-4.0)