0

MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

Block diffusion LLMs are an emerging paradigm for parallel language generation, but their KV caching makes memory access the dominant bottleneck in long-context inference.

Preview
Year
2026
Hosting
Excerpt onlyCC-BY-NC-SA-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2602.14209CC-BY-NC-SA-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Block diffusion LLMs are an emerging paradigm for parallel language generation, but their KV caching makes memory access the dominant bottleneck in long-context inference. Sparse attention, which attends only to a small KV subset per query, can reduce this latency with minimal accuracy loss. In block diffusion, however, the B tokens of each block must share a single KV subset, and we show this per-block constraint degrades existing sparse KV estimators by up to 25% in recall. We address this challenge by exploiting a property that emerges from the block-diffusion training objective: it aligns the block-average query across denoising steps, so the All-[MASK] block at the first step already reveals the per-block KV subset for the entire trajectory. We exploit this in MAGE ([MASK]-Guided Sparse Attention), a training-free method that runs one exact attention pass at the first step and reuses its top-k index sets for all remaining steps within the block. Across three block-diffusion families on LongBench, MAGE matches Exact Attention at k=512 with near-lossless accuracy, achieves up to 6.82x end-to-end speedup at 128K context, and runs up to 3.35x and 2.28x faster than Quest and SparseD, designed for AR LLMs and fully bidirectional diffusion LLMs, respectively.