Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side processing. We present Harmonia, an end-to-end RAG serving framework that addresses these bottlenecks through (i) a flexible pipeline specification interface for composing custom workflows, (ii) heterogeneity-aware deployment that provisions and configures components as a distributed inference system, and (iii) a closed-loop runtime controller that monitors load and execution progress and reduces SLO violations through request prioritization and auto-scaling. Across four RAG applications, Harmonia outperforms commercial alternatives, improving throughput by more than 2.04x while reducing SLO violations by up to 78.4 percent.
Harmonia: End-to-End RAG Serving Optimization
Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG pipelines efficiently is challenging because requests traverse heterogeneous components spanning LLM inference, databases, and CPU-side…
- Year
- 2025
- Hosting
- Full text hostedCC-BY-4.0
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2505.07833CC-BY-4.0
- TL;DR
- Semantic Scholar