Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens and a larger target model to verify them in parallel. In distributed edge-cloud inference, however, draft length must be controlled online: longer drafts amortize communication delay but reduce token acceptance, whereas shorter drafts preserve acceptance but trigger more communication rounds. We formulate this tradeoff as a ratio-type optimal stopping problem and prove that the optimal draft length is a finite delay-monotone threshold. The analysis identifies a critical delay below which single-token speculation is optimal and shows that the optimal length grows only logarithmically with communication delay. For time-varying networks, we extend the model to Markov-modulated channels and establish, under a bounded horizon and monotone stopping-region conditions, a state-dependent threshold policy. For unknown environments, we propose UCB-SpecStop, an online control algorithm with gap-free and gap-dependent expected regret bounds of $O(L_{\max}\sqrt{K_{\max}T\log(K_{\max}T)})$ and $O(\sum_{k:\Delta_k>0}L_{\max}^2\log(K_{\max}T)/\Delta_k)$. We implement the method on a real edge-cloud testbed with a Jetson Orin Nano Super edge node and an RTX 3090 Ti cloud node, using Qwen and Llama draft-target pairs. Experiments validate the predicted phase transition, with transition points near 83 ms and 111 ms. Qwen matches the geometric prediction, while Llama requires empirical-prefix calibration due to heavy-head acceptance. Across the tested delay grid, UCB-SpecStop reduces per-token latency over SpecDec++ by up to 22.4%, approaches an offline oracle within 0.2–2.4% in communication-dominated regimes, improves over naive UCB by up to 7.5%, removes the 14.0–18.7% gap caused by static tuning under delay drift, and gains 3.0–6.8% with contextual channel-state information.
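The draft-length tradeoff described in the abstract can be illustrated with a small numerical sketch. It is not the paper's model: it assumes each drafted token is accepted independently with probability `alpha` (a geometric acceptance pattern), that one round costs the communication delay plus a fixed per-token draft time plus a fixed verification time, and that the objective is the ratio of round cost to expected tokens emitted. All function names and the values `alpha=0.7`, `draft_ms=20.0`, `verify_ms=30.0`, and `k_max=16` are hypothetical and chosen only for illustration.

```python
def expected_tokens(k: int, alpha: float) -> float:
    """Expected tokens emitted per round for draft length k.

    Assumes i.i.d. acceptance with probability alpha; the target model
    always contributes one token, giving 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def per_token_latency(k: int, delay_ms: float, alpha: float,
                      draft_ms: float, verify_ms: float) -> float:
    """Ratio objective: cost of one round divided by expected tokens per round."""
    # Assumed cost model: one communication round, k draft steps, one verification.
    round_cost = delay_ms + k * draft_ms + verify_ms
    return round_cost / expected_tokens(k, alpha)


def best_draft_length(delay_ms: float, alpha: float = 0.7,
                      draft_ms: float = 20.0, verify_ms: float = 30.0,
                      k_max: int = 16) -> int:
    """Brute-force search for the draft length that minimizes per-token latency."""
    return min(
        range(1, k_max + 1),
        key=lambda k: per_token_latency(k, delay_ms, alpha, draft_ms, verify_ms),
    )


if __name__ == "__main__":
    # Print the minimizing draft length for a grid of communication delays.
    for delay in (0, 25, 50, 100, 200, 400, 800):
        print(f"delay={delay:4d} ms -> best draft length {best_draft_length(delay)}")
```

Under these toy assumptions the minimizing draft length stays at one token for small delays and never decreases as the delay grows, which is the qualitative behavior the abstract describes. The specific transition delays depend entirely on the assumed parameters and should not be compared with the measured values reported above.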
Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference
- Year
- 2026
Attribution
- Abstract & full text
- arxiv.org/abs/2606.20591