Model-Free Robust Average-Reward Reinforcement Learning with Sample Complexity Analysis

Robust reinforcement learning (RL) under the average-reward criterion is essential for long-term decision-making, particularly when the deployment environment may differ from the training dynamics. However, most existing studies focus on model-based settings and provide only asymptotic guarantees, which hinders principled understanding and practical deployment, especially in data-limited scenarios. We aim to close this gap by proposing a model-free algorithm, Robust Halpern Iteration (RHI). We first design our algorithm around a black-box sampling oracle that accurately estimates the worst-case performance. We then derive the finite-sample complexity of RHI under the generative model setting, assuming access to such an oracle. To construct the oracle concretely, we propose a K-order multi-level Monte-Carlo estimator, which is shown to have lower bias than prior methods. We further instantiate our design for multiple uncertainty models, including KL and \chi^2 divergence sets, and show that RHI finds an \varepsilon-optimal robust policy with a sample complexity of \tilde{O}\left( \frac{SAH^2}{\varepsilon^{(2+o(1))}}\right), where S and A are the numbers of states and actions, and H is the robust optimal span. Our result asymptotically matches the best known sample complexity in robust average-reward RL.
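For readers unfamiliar with the name, the following is a minimal sketch of the generic Halpern (anchored) fixed-point iteration, x_{k+1} = \beta_k x_0 + (1-\beta_k) T(x_k) with \beta_k = 1/(k+2). It is not the RHI algorithm itself: the abstract does not specify the operator, step sizes, or sampling oracle, so the toy operator (a 90-degree rotation in the plane, which is nonexpansive with the origin as its unique fixed point) and all function names below are illustrative assumptions.

```python
import math


def halpern_iteration(T, x0, num_iters):
    """Generic Halpern iteration (illustrative, not the paper's RHI).

    Each step pulls the iterate back toward the anchor x0:
        x_{k+1} = beta_k * x0 + (1 - beta_k) * T(x_k),  beta_k = 1 / (k + 2).
    """
    x = list(x0)
    for k in range(num_iters):
        beta = 1.0 / (k + 2)
        tx = T(x)
        x = [beta * a + (1.0 - beta) * t for a, t in zip(x0, tx)]
    return x


def rotate(x):
    """Toy nonexpansive operator (assumption): rotate a 2-D point by 90 degrees."""
    return [-x[1], x[0]]


def fixed_point_residual(T, x):
    """Euclidean norm of x - T(x); zero exactly at a fixed point."""
    tx = T(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, tx)))


x_start = [1.0, 0.0]

# Anchored iteration: the residual shrinks toward zero.
x_halpern = halpern_iteration(rotate, x_start, 2000)
halpern_residual = fixed_point_residual(rotate, x_halpern)

# Plain iteration x_{k+1} = T(x_k): the point just circles the origin.
x_plain = list(x_start)
for _ in range(2000):
    x_plain = rotate(x_plain)
plain_residual = fixed_point_residual(rotate, x_plain)

print(f"Halpern residual: {halpern_residual:.2e}")
print(f"Plain residual:   {plain_residual:.2e}")
```

The rotation example is chosen because plain fixed-point iteration never converges on a merely nonexpansive (non-contractive) map, whereas the anchored scheme does. This is a plausible reason such a scheme suits the average-reward setting, where there is no discount factor to make the Bellman operator a contraction.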