0

$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

This paper studies the Minimal Embeddable Dimension (MED): the least dimension in which there exists a configuration of $m$ object vectors so that every subset of size at most $k$ is exactly retrieved by score comparison.

Preview
Year
2026
Hosting
Excerpt onlyCC-BY-NC-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2601.20844CC-BY-NC-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

This paper studies the Minimal Embeddable Dimension (MED): the least dimension in which there exists a configuration of m object vectors so that every subset of size at most k is exactly retrieved by score comparison. Our result shows MED is Θ(k), independent of m, for inner product, Euclidean distance, and cosine similarity. We then consider Robust MED (RMED), where all vectors are unit normed and an ε gap of scores is required. We derive the m-dependent feasibility ceiling ε_\star(m,k)=m/\sqrt{k(m-1)(m-k)}, which approaches 1/\sqrt{k} when m\gg k, and a Gaussian centroid construction gives a robust witness upper bound in the feasible margin regime. Numerical simulation on synthetic top-2 retrieval with cyclic polytope and centroid query optimization confirmed our theoretical claims. Experiments on LIMIT and LIMIT-small datasets also show that simple embedding-based retrieval baselines can overfit and outperform the reported single-vector LLM embedding baseline. Both theoretical and empirical findings rule out the lack of exact geometric capacity as the obstruction.