Learning in Matching Games with Bandit Feedback

We introduce a learning problem in a generalized two-sided matching market, where agents select actions to interact with their match. Specifically, we consider a setting in which matched agents engage in zero-sum games with initially unknown payoff matrices, and we investigate whether a centralized procedure can learn an equilibrium from bandit feedback. We adopt the solution concept of a matching equilibrium, where a matching \mathfrak{m} and a set of agent strategies X form an equilibrium if no agent has an incentive to deviate from (\mathfrak{m}, X) . To quantify deviations of a candidate solution (\mathfrak{m}, X) from the equilibrium (\mathfrak{m}^\star, X^\star) , we introduce the notion of matching instability, which serves as a regret measure for the learning problem. We propose a UCB-based algorithm in which agents form preferences and select actions according to optimistic estimates of the payoffs. Our analysis establishes a sublinear, instance-independent regret upper bound, further supported by empirical evidence.