Multi-Agent Lipschitz Bandits

We study the decentralized multi-player stochastic bandit problem over a continuous, Lipschitz-structured action space where hard collisions yield zero reward. Our objective is to design a communication-free policy that maximizes collective reward, while separating coordination costs from learning costs. We propose a modular protocol that first solves the multi-agent coordination problem by identifying and seating players on distinct, high-value regions via a novel maxima-directed search and then decouples the problem into N independent single-player Lipschitz bandits. In the consensus regime, we obtain an end-to-end regret bound whose dominant learning term is \tilde{O}(T^{(d+1)/(d+2)}), matching the single-player Lipschitz rate; the upfront coordination cost is horizon-independent at fixed confidence and only polylogarithmic in T in the expected-regret form. Under an additional public coverage/scheduling assumption for the epochic extension, we also obtain a gap-free \tilde{O}(T^{(d+1)/(d+2)}) guarantee. We further derive a matching lower bound for the dominant learning term and extend the framework to general distance-threshold collision models.