0

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

QUICK, a set of optimized CUDA kernels, improve efficiency in quantized LLM inference by interleaving quantized weight matrices to avoid shared memory bank conflicts.

Year
2024
Venue
arXiv 2024
Authors
7
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2402.10076ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

We introduce QUICK, a group of novel optimized CUDA kernels for the efficient inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory bank-conflict problem of state-of-the-art mixed precision matrix multiplication kernels. Our method interleaves the quantized weight matrices of LLMs offline to skip the shared memory write-back after the dequantization. We demonstrate up to 1.91x speedup over existing kernels of AutoAWQ on larger batches and up to 1.94x throughput gain on representative LLM models on various NVIDIA GPU devices.

Authors

7