Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

Year: 2026
Hosting: full text hosted (CC-BY-4.0)

Abstract & full text: arxiv.org/abs/2602.01518 (CC-BY-4.0)

Abstract

Despite their importance in model sampling, Top-k and Top-p remain difficult to implement efficiently for large vocabularies. Existing approaches often rely on sorting, which incurs significant computation and memory overhead on GPUs, or on stochastic approaches that alter the algorithm's output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on pivot-based truncation and selection. Qrita applies pivot-based search to both Top-k and Top-p using two key techniques: (1) Gaussian-based sigma-truncation, which greatly reduces the search space of the vocabulary, and (2) quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. We implement Qrita in Triton and evaluate it against the Top-k and Top-p kernels of high-performance LLM execution engines such as SGLang and FlashInfer. Qrita improves end-to-end serving throughput by up to 1.4 times with half the memory usage, while producing the same output as sorting-based algorithms. Qrita is now the default Top-k and Top-p sampler for the GPU execution path of vLLM, and a ternary implementation of Qrita is available at https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py.
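To make the two techniques concrete, the following is a minimal CPU sketch of Top-k selection in plain Python. It is not the Triton kernel described above, and several details are assumptions made for illustration: the function names (`sigma_truncate`, `kth_largest`, `top_k_indices`), the `slack` parameter, the way the cutoff is derived from a Gaussian quantile, the reading of "quaternary" as three pivots per round that split the value range into four parts, and the lowest-index tie-breaking rule. Top-p is not shown.

```python
import statistics


def sigma_truncate(logits, k, slack=4.0):
    """Keep only logits above a Gaussian-quantile cutoff (assumed scheme).

    The cutoff is mean + z * std, with z chosen so that roughly slack * k
    values would survive if the logits were normally distributed. If fewer
    than k survive, fall back to the full list so the result stays exact.
    """
    n = len(logits)
    mean = statistics.fmean(logits)
    std = statistics.pstdev(logits)
    frac = min(0.5, slack * k / n)  # expected surviving fraction
    z = statistics.NormalDist().inv_cdf(1.0 - frac)
    cutoff = mean + z * std
    kept = [x for x in logits if x >= cutoff]
    return kept if len(kept) >= k else list(logits)


def kth_largest(values, k):
    """Return the k-th largest value using three pivots per round."""
    cand, need = list(values), k
    while True:
        lo, hi = min(cand), max(cand)
        if lo == hi:
            return lo  # only duplicates remain, so this is the threshold
        step = (hi - lo) / 4.0
        p_hi, p_mid, p_lo = lo + 3 * step, lo + 2 * step, lo + step
        # Four value buckets, ordered from largest to smallest.
        buckets = (
            [x for x in cand if x > p_hi],
            [x for x in cand if p_mid < x <= p_hi],
            [x for x in cand if p_lo < x <= p_mid],
            [x for x in cand if x <= p_lo],
        )
        for bucket in buckets:
            if need <= len(bucket):
                cand = bucket  # the k-th largest lies in this bucket
                break
            need -= len(bucket)  # skip a bucket that is entirely in the top-k


def top_k_indices(logits, k):
    """Indices of the k largest logits; ties broken by lowest index."""
    threshold = kth_largest(sigma_truncate(logits, k), k)
    above = [i for i, x in enumerate(logits) if x > threshold]
    ties = [i for i, x in enumerate(logits) if x == threshold]
    return sorted(above + ties[: k - len(above)])
```

In this sketch the truncation step never changes the result: every value at or above the cutoff is kept, so whenever at least k values survive, the k-th largest of the survivors is also the k-th largest of the full list. Handling ties explicitly at the threshold is what lets the selection agree with a stable sort when values repeat.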