Learning to Route and Schedule LLMs from User Retrials via Contextual Queueing Bandits

Explosive demand for LLMs often causes user queries to accumulate in server queues, requiring efficient routing (query-LLM matching) and scheduling (query prioritization) mechanisms. Several online algorithms are being deployed, but they overlook two key challenges inherent to conversational LLM services: (1) unsatisfied users may retry queries, increasing the server backlog, and (2) requests for "explicit" feedback, such as ratings, degrade the user experience. In this paper, we develop a joint routing and scheduling algorithm that leverages "implicit" feedback inferred from user retrial behavior. To this end, we propose and study the framework of contextual queueing bandits with multinomial logit feedback (CQB-MNL). CQB-MNL models query retrials as well as context-based learning of user preferences over LLMs. Our algorithm, anytime CQB (ACQB), achieves efficient learning while maintaining queue stability by combining Thompson sampling with forced exploration at a decaying rate. We show that ACQB simultaneously achieves a cumulative regret of $\widetilde{O}(\sqrt{t})$ for routing and a queue length regret of $\widetilde{O}(t^{-1/4})$ for any sufficiently large $t$. For experiments, we refine query embeddings via contrastive learning while adopting a disjoint parameter model to learn LLM-specific parameters. Experiments on synthetic data, offline routing datasets (SPROUT, EmbedLLM, and RouterBench), and real user conversation logs (WildChat-1M) confirm that our methods improve routing, scheduling, and queue stability against strong online and offline-trained baselines.
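
To make the combination of Thompson sampling with decaying forced exploration concrete, the following is a minimal illustrative sketch, not the ACQB algorithm itself. Everything specific in it is an assumption made for illustration: the function names, the exploration schedule t^(-1/2), the Gaussian perturbation standing in for a posterior sample, and the plain softmax form of the multinomial logit probabilities.

```python
import math
import random


def mnl_probabilities(utilities):
    """Multinomial logit (softmax) probabilities over a list of utilities."""
    shift = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(u - shift) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def exploration_probability(t, exponent=0.5):
    """Hypothetical decaying forced-exploration rate: min(1, t^(-exponent))."""
    return min(1.0, t ** (-exponent))


def choose_llm(t, mean_utilities, std, rng):
    """Pick an LLM index at round t (t >= 1).

    With probability exploration_probability(t), pick uniformly at random
    (forced exploration). Otherwise draw one Gaussian sample per LLM around
    its estimated utility (an assumed stand-in for a posterior sample) and
    pick the LLM with the largest sampled utility.
    """
    if rng.random() < exploration_probability(t):
        return rng.randrange(len(mean_utilities))
    sampled = [rng.gauss(mu, std) for mu in mean_utilities]
    return max(range(len(sampled)), key=lambda k: sampled[k])


if __name__ == "__main__":
    rng = random.Random(0)
    estimates = [0.1, 0.9, 0.3]  # hypothetical utility estimates for 3 LLMs
    print(mnl_probabilities(estimates))
    print([choose_llm(t, estimates, 0.2, rng) for t in range(1, 11)])
```

Under this assumed schedule, the first round always explores, and the share of forced exploration shrinks as t grows, so later rounds are increasingly driven by the sampled utility estimates.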