Decentralized Best-Response-Based Learning in Two-Player Zero-Sum Stochastic Games: A Finite-Sample Analysis

We present a finite-sample analysis of decentralized learning in two-player zero-sum matrix games and stochastic games, with a focus on best-response-based learning algorithms. In matrix games, the learning algorithm is payoff-based and symmetric: each player updates its policy using only its own payoff observations, incrementally moving toward an estimated smoothed best response to the opponent's latest policy. For stochastic games, we build on this matrix-game primitive to develop a learning algorithm called value iteration with smoothed best response (VI-SBR), which combines smoothed-best-response learning in induced matrix games with a decentralized, model-free approximation of minimax value iteration. We establish finite-sample guarantees in both settings. For matrix games, our results imply a sample complexity of O(ε^{-1}) for finding an ε-Nash distribution and, with explicit exploration, \tilde{O}(ε^{-8}) for finding an ε-Nash equilibrium. For stochastic games, we prove that the exploration-enhanced VI-SBR algorithm achieves a sample complexity of \tilde{O}(ε^{-8}) for finding an ε-Nash equilibrium. Technically, our analysis develops a coupled Lyapunov-drift framework. This framework simultaneously handles stochastic iterative algorithms with multiple interacting stochastic iterates, the non-zero-sum auxiliary games generated by independently updated value functions, and the time-inhomogeneous Markovian noise induced by time-varying policies. The resulting tools may be useful more broadly for analyzing learning algorithms with coupled stochastic iterates and nonstationary sampling processes.