0

Learning What to Recommend: Minimax Optimal Simple Regret in Logistic Bandits

We study stochastic logistic bandits with $d$-dimensional action features under the simple-regret objective, where a learner uses $T$ rounds of exploration to output a single final action.

Year
2026
Hosting
Full text hostedCC-BY-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2601.21167CC-BY-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

We study stochastic logistic bandits with $d$-dimensional action features under the simple-regret objective, where a learner uses $T$ rounds of exploration to output a single final action. The logistic structure is essential here: because the informativeness of an action depends on the local curvature of the sigmoid, actions that are best for immediate reward need not be the most useful for identifying the best final recommendation. We show that the first-order minimax difficulty is governed by $κ_$, the inverse slope of the sigmoid at the optimal action. The lower bound is realized by a shifted saturated hard family in which saturation simultaneously limits the information available about the final decision and controls the value loss from a wrong recommendation. This reveals a hard mechanism distinct from cumulative-regret constructions, even though online-to-batch reductions recover the same leading order in expectation. We then develop two curvature-aware algorithms: \MULog, a pure-exploration method whose final recommendation satisfies a high-probability upper bound of order $\tilde O(d/\sqrt{κ_ T})$, matching the lower bound up to logarithmic factors, and \THATS, a Thompson-sampling-style method that provides a computationally lighter alternative. Experiments on both hard and easy geometries support the same picture: informative low-reward actions can make instances substantially easier, and the curvature-aware methods exploit this structure especially effectively.