We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections, subtracting $\mu$ directly (R1) or projecting each embedding off the mean direction (R2), and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB) \citep{MMTEB}, R2 yields consistent classification gains (paired $\bar t = 3.31$, 29 of 38 models with $t > 2$, zero losses), and the per-model mean norm $\Vert\mu\Vert$ correlates with which models benefit most. A nine-method dose-response ablation on five models further reveals that mild single-direction removal helps but full principal component analysis (PCA) whitening hurts every model we test, and that R2 and All-but-the-Top with depth one agree within 0.18 pp downstream despite weak geometric alignment between $\hat\mu$ and the centered top principal component.
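The two corrections named in the abstract can be illustrated with a minimal NumPy sketch. This is a reading of the abstract's one-line descriptions, not the paper's implementation: the function names, the synthetic data, and the choice of a 10% rescaling as the "parallel" estimation error are all assumptions made for illustration, and any final renormalization step the paper may apply is omitted.

```python
import numpy as np


def r1_subtract_mean(E, mu_hat):
    """R1 (as described in the abstract): subtract the estimated mean from each row of E."""
    return E - mu_hat


def r2_project_off_mean(E, mu_hat):
    """R2 (as described in the abstract): remove each row's component along the mean direction."""
    u = mu_hat / np.linalg.norm(mu_hat)  # unit vector along the estimated mean
    return E - np.outer(E @ u, u)        # e - (e . u) u for every row


# Synthetic embeddings e = e_tilde + mu (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
mu = 3.0 * rng.normal(size=16)
E = rng.normal(size=(100, 16)) + mu

mu_hat = E.mean(axis=0)
mu_hat_scaled = 1.1 * mu_hat  # estimation error purely parallel to mu_hat (assumed form)

# How much does each correction change when the mean estimate is off along its own direction?
r1_gap = np.abs(r1_subtract_mean(E, mu_hat) - r1_subtract_mean(E, mu_hat_scaled)).max()
r2_gap = np.abs(r2_project_off_mean(E, mu_hat) - r2_project_off_mean(E, mu_hat_scaled)).max()

# After R2, every row should be orthogonal to the estimated mean direction.
r2_residual = np.abs(r2_project_off_mean(E, mu_hat) @ mu_hat).max()
```

In this sketch R2 depends only on the direction of the estimated mean, so rescaling the estimate leaves its output unchanged up to floating-point error, while R1's output shifts by the full rescaling. That is one way to read the abstract's statement that R2 cancels the parallel component of mean-estimation error that R1 retains.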
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
- Year
- 2025
- Abstract & full text
- arxiv.org/abs/2511.11041