Constrained Policy Optimization with Cantelli-Bounded Value-at-Risk

We introduce Canary, a risk-averse method for reinforcement learning (RL) problems with Value-at-Risk (VaR) constraints. We employ Cantelli's inequality to obtain a tractable, conservative, and smooth bound on the VaR constraint based on the first two moments of the cost return. This yields a constraint estimator that remains stable under tight violation thresholds in dense-cost regimes. Extending the trust-region framework of Constrained Policy Optimization (CPO), we further provide worst-case bounds on both policy improvement and constraint violation during training. Empirically, across continuous-control safety benchmarks, Canary satisfies its constraint most reliably, with the fewest violations and the earliest permanent satisfaction, while remaining reward-competitive with the baselines that also satisfy the constraint.
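To make the moment-based bound concrete, the following is a minimal sketch of the standard Cantelli surrogate, not necessarily the exact estimator used by Canary. Cantelli's inequality states that for a random cost return C with mean mu and variance sigma^2, P(C - mu >= lambda) <= sigma^2 / (sigma^2 + lambda^2) for any lambda > 0. Setting the right-hand side equal to a violation probability epsilon gives VaR at level 1 - epsilon <= mu + sigma * sqrt((1 - epsilon) / epsilon). The function names, the parameter `epsilon`, and the cost limit `threshold` are hypothetical and chosen only for illustration.

```python
import math


def cantelli_var_bound(mean: float, variance: float, epsilon: float) -> float:
    """Conservative upper bound on the (1 - epsilon)-level VaR of a cost return.

    Uses only the first two moments, via Cantelli's inequality:
        P(C >= mean + k * std) <= epsilon,  with k = sqrt((1 - epsilon) / epsilon).
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if variance < 0.0:
        raise ValueError("variance must be non-negative")
    k = math.sqrt((1.0 - epsilon) / epsilon)
    return mean + k * math.sqrt(variance)


def satisfies_surrogate(mean: float, variance: float,
                        epsilon: float, threshold: float) -> bool:
    """Check the surrogate constraint: Cantelli bound <= cost threshold.

    If this holds, then P(C >= threshold) <= epsilon for any distribution
    with the given mean and variance (assumed illustrative constraint form).
    """
    return cantelli_var_bound(mean, variance, epsilon) <= threshold
```

For instance, with mean 10, variance 4, and epsilon = 0.2, the multiplier is sqrt(0.8 / 0.2) = 2, so the bound is 10 + 2 * 2 = 14. The bound is distribution-free, which is why it is conservative, and it is a smooth function of the mean and variance, which is what makes it convenient as a differentiable constraint.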