Fairness in two-player zero-sum games with bandit feedback

We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints requiring every action to be played with probability at least α/m. Existing instance-dependent results target pure Nash equilibria, while fairness generically produces mixed equilibria, a harder learning target. Our key technical tool is a reparametrization: every fair strategy decomposes as p = (α/m)1 + (1-α)\widetilde{p} with \widetilde{p} \in Δ_m, and substituting into the payoff form yields p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q for a fair payoff matrix \widetilde{A} := (1-α)A + α1 c^{\top}, where c_j = \tfrac{1}{m}\sum_i A(i,j) is the column-mean vector. The fair game on A is then equivalent to a standard zero-sum game on \widetilde{A}, so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to \widetilde{A}. We derive the fair minimax value, fair Nash equilibrium, fair regret, and a clean dual representation showing the price of fairness is at most α(1-1/m) and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an \widetilde{O}(T^{2/3}) regret bound for an Explore-Then-Commit algorithm, Fair-ETC-TPZSG, applicable to general mixed fair equilibria, together with a discussion of why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when \widetilde{p}^{\star} is a vertex of Δ_m, the bound sharpens to instance-dependent \widetilde{O}(1/\widetildeΔ(α)^{2}), where \widetildeΔ(α) is the LP-margin gap.