We present Top-Theta (Top-θ) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-θ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
We present Top-Theta (Top-$θ$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row.
- Year
- 2025
- Hosting
- Excerpt onlyCC-BY-NC-4.0
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2502.08363CC-BY-NC-4.0
- TL;DR
- Semantic Scholar