FlashDecoding++: Faster Large Language Model Inference on GPUs

Large Language Models (LLMs) are becoming increasingly important in various domains. However, the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update.

Year: 2023
ArXiv: arxiv.org/abs/2311.01282
URL: arxiv.org/abs/2311.01282v4
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2311.01282v4
TL;DR: Semantic Scholar