0

LLM Serving Optimization with Variable Prefill and Decode Lengths

We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths.

Preview
Year
2025
Hosting
Full text hostedCC-BY-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2508.06133CC-BY-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV-cache usage, while each generated token further increases memory consumption, creating dynamic memory constraints during autoregressive decoding. Given a backlog of n requests arriving together, the goal is to form mixed prefill and decode batches over time to minimize total end-to-end latency. We show that heterogeneous prompt lengths fundamentally change the scheduling problem: the problem is NP-hard, and standard policies such as first-come-first-served, shortest-output-first, and total-size-based prioritization can have unbounded approximation ratios. We propose Sorted-F, a scheduling algorithm that repeatedly forms feasible batches using an F-metric that balances batch size against downstream decode cost. We prove that Sorted-F achieves a constant-factor approximation guarantee in the offline/backlogged model. We also develop practical implementations, including an exact dynamic program for small instances and scalable local-search and greedy heuristics for larger instances, as well as LP-guided and receding-horizon variants. Experiments on public workloads that combine short conversations and long-document summarization show that F-metric-based scheduling consistently reduces latency relative to standard baselines and remains close to the LP relaxation lower bound for tractable instances.