A Markov Chain Approach to Preference Alignment

We propose Markov Chain from Human Feedback (MCHF), an elementary approach for aligning generative models from pairwise human preferences. Unlike Reinforcement Learning from Human Feedback (RLHF), which reduces comparisons to a scalar reward, and Nash Learning from Human Feedback (NLHF), which preserves pairwise utilities through a KL-regularized minimax optimization, MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. Given a pairwise utility U(x,y), which quantifies human preference for y over x, and a reference probability distribution μ_{ref}, we define a Markov kernel P(x, dy)\propto \exp(U(x,y))μ_{ref}(dy), and take the Markov chain starting from μ_{ref} as an iterative alignment procedure. We show that MCHF converges geometrically fast to the stationary distribution, with a convergence rate governed by the seminorm \|U\|_\oplus=\inf_{g,f\in L^\infty(μ_{ref})}\|U-g\oplus f\|_\infty, which quantifies the non-transitive structure of the pairwise utility. We further show that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee. Finally, through a perturbation analysis, we prove that when \|U\|_\oplus is small, MCHF and NLHF agree up to first order around an RLHF solution, which yields a unified view of reward-based, game-theoretic, and Markovian approaches to alignment. In particular, for two natural algorithms that converge to the MCHF/NLHF equilibria, we show that the first step of MCHF and NLHF recovers the RLHF solution based on the column-sum reward \hat{f}(y)=\int μ_{ref}(dx) U(x, y), and starting from the second iteration, both algorithms incorporate the same linear functional of the residual U-(-\hat f)\oplus \hat f, which captures the non-transitive structure of the pairwise utility U.
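
As an illustration only, the kernel and the iteration described above can be sketched on a finite output space, where U becomes a matrix and μ_{ref} a strictly positive probability vector. The function names `mchf_kernel` and `mchf_iterates`, the finite-space setting, and the reading (g\oplus f)(x,y)=g(x)+f(y) are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def mchf_kernel(U, mu_ref):
    """Row-stochastic matrix with P[x, y] proportional to exp(U[x, y]) * mu_ref[y].

    Assumes a finite output space and mu_ref > 0 entrywise (hypothetical setting).
    """
    logits = U + np.log(mu_ref)[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the exponentials
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def mchf_iterates(U, mu_ref, n_steps):
    """Laws mu_0 = mu_ref, mu_{k+1} = mu_k P of the chain started from mu_ref."""
    P = mchf_kernel(U, mu_ref)
    mu = np.asarray(mu_ref, dtype=float)
    laws = [mu]
    for _ in range(n_steps):
        mu = mu @ P
        laws.append(mu)
    return laws

mu_ref = np.full(3, 1.0 / 3.0)

# Additive utility U(x, y) = g(x) + f(y): the g(x) factor cancels in each row,
# so every row of P equals the exponential tilt of mu_ref by f.
g = np.array([0.3, -1.0, 2.0])
f = np.array([1.0, 0.0, -1.0])
U_add = g[:, None] + f[None, :]
tilt = np.exp(f) * mu_ref
tilt = tilt / tilt.sum()
laws_add = mchf_iterates(U_add, mu_ref, n_steps=2)

# Cyclic (rock-paper-scissors style) utility: purely non-transitive preferences.
U_cyc = np.array([[0.0, 1.0, -1.0],
                  [-1.0, 0.0, 1.0],
                  [1.0, -1.0, 0.0]])
laws_cyc = mchf_iterates(U_cyc, mu_ref, n_steps=5)
```

In the additive case the chain reaches its stationary law after a single step, which matches the intuition that \|U\|_\oplus=0 leaves no non-transitive structure to resolve. In the cyclic case the kernel is doubly stochastic, so the uniform reference distribution is already stationary.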