We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths. Prompt tokens determine initial KV-cache usage, while each generated token further increases memory consumption, creating dynamic memory constraints during autoregressive decoding. Given a backlog of n requests arriving together, the goal is to form mixed prefill and decode batches over time to minimize total end-to-end latency. We show that heterogeneous prompt lengths fundamentally change the scheduling problem: the problem is NP-hard, and standard policies such as first-come-first-served, shortest-output-first, and total-size-based prioritization can have unbounded approximation ratios. We propose Sorted-F, a scheduling algorithm that repeatedly forms feasible batches using an F-metric that balances batch size against downstream decode cost. We prove that Sorted-F achieves a constant-factor approximation guarantee in the offline/backlogged model. We also develop practical implementations, including an exact dynamic program for small instances and scalable local-search and greedy heuristics for larger instances, as well as LP-guided and receding-horizon variants. Experiments on public workloads that combine short conversations and long-document summarization show that F-metric-based scheduling consistently reduces latency relative to standard baselines and remains close to the LP relaxation lower bound for tractable instances.
LLM Serving Optimization with Variable Prefill and Decode Lengths
We study offline scheduling for large language model (LLM) serving under a fixed KV-cache memory budget, where requests have heterogeneous prompt (prefill) and response (decode) lengths.
- Preview

- Year
- 2025
- Hosting
- Full text hostedCC-BY-4.0
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2508.06133CC-BY-4.0
- TL;DR
- Semantic Scholar