0

Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence

A method using parallelized token decoding with specialized attention masks speeds up reasoning processes by over 100% without degrading answer quality.

Year
2025
Venue
arXiv 2025
Authors
1
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2503.20533ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Recent advances in reasoning models have demonstrated significant improvements in accuracy, particularly for complex tasks such as mathematical reasoning, by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask, processing them within a single sequence. Experimental results show that our method achieves over 100% speedup in decoding time while basically maintaining accuracy.

Authors

1