Bias-Controlled Primal-Dual Natural Actor-Critic: Optimal Rates for Constrained Multi-Objective Average-Reward RL

Many reinforcement learning (RL) problems in the infinite-horizon average-reward setting require optimizing multiple conflicting objectives while satisfying multiple safety constraints. A common approach is concave scalarization, where the agent maximizes a utility f(J^π_{r_1}, \ldots, J^π_{r_M}) subject to a scalarized constraint g(J^π_{c_1}, \ldots, J^π_{c_N}) \ge 0, where J^π_{r_m} and J^π_{c_n} denote the average reward and average cost under policy π. However, the nonlinearity of f and g introduces bias in policy-gradient and actor-critic methods: gradients must be evaluated at noisy estimates \hat{J}^π of J^π, and in general \mathbb{E}[\partial f(\hat{J}^π)] \neq \partial f(\mathbb{E}[\hat{J}^π]). This bias propagates through both the primal and dual updates. We propose a multi-level Monte Carlo (MLMC)-based primal-dual Natural Actor-Critic algorithm for average-reward MDPs that controls the bias in the scalarized objectives, the constraint evaluation, and the actor-critic estimation without requiring knowledge of the mixing time. We show that the algorithm achieves optimal global convergence and constraint-violation rates of \tilde{O}(1/\sqrt{T}). To our knowledge, this is the first result establishing optimal convergence for concave scalarized multi-objective RL in the average-reward setting, both with and without constraints, and the first to do so without mixing-time information even in the absence of scalarization.
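
The bias caused by a nonlinear scalarization can be seen in a toy numerical sketch. This is only an illustration, not the proposed algorithm: it assumes a single objective with the hypothetical utility f(J) = log J, and it replaces the Markovian samples of an MDP with i.i.d. bounded noise around a known average reward. The function name and all parameter values are assumptions made for the example.

```python
import numpy as np


def grad_f(j):
    # Derivative of the assumed utility f(J) = log(J).
    return 1.0 / j


def plug_in_gradient_bias(true_j=1.0, noise_std=0.5, n_samples=4,
                          n_trials=100_000, seed=0):
    """Monte Carlo estimate of E[f'(J_hat)] - f'(J) for a sample-mean J_hat.

    Assumption: rewards are i.i.d. with mean true_j and bounded uniform noise,
    so J_hat stays strictly positive and 1 / J_hat is well defined.
    """
    rng = np.random.default_rng(seed)
    half_width = noise_std * np.sqrt(3.0)  # uniform noise with std = noise_std
    rewards = true_j + rng.uniform(-half_width, half_width,
                                   size=(n_trials, n_samples))
    j_hat = rewards.mean(axis=1)           # unbiased estimate of J
    return grad_f(j_hat).mean() - grad_f(true_j)


if __name__ == "__main__":
    for n in (4, 32):
        print(n, plug_in_gradient_bias(n_samples=n))
```

Although the sample mean is unbiased for J, the plug-in gradient f'(\hat{J}) is not unbiased for f'(J). In this toy setting the gap is positive and shrinks as more samples are averaged, which is why naive plug-in estimates need either long rollouts or an explicit bias-control mechanism.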