0

Aletheia: What Makes RLVR For Code Verifiers Tick?

Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind that of execution feedback due to the prohibitive costs of the full RLVR…

Preview
Year
2026
Hosting
Full text hostedCC-BY-SA-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2601.12186CC-BY-SA-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind that of execution feedback due to the prohibitive costs of the full RLVR pipeline. In this work, we ablate three primary choices along the performance-cost trade-off in RLVR: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce Aletheia, a controlled, execution-grounded testbed to facilitate a contamination-free analysis of code verifier training recipes across disparate model sizes and covariate shifts across two common verifier application scenarios. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas the thinking budget becomes the most vital factor at larger scales. While leveraging negative samples has a consistent impact on top-1 selection accuracy across sizes, their contribution to ranking reconstruction increases monotonically with scale and plays a key role in stabilizing training at large sizes. Our Pareto optimality analysis demonstrates that eliminating on-policy training at larger model scales yields a verifier that performs comparably to the full RLVR recipe. Furthermore, we find that eschewing thinking traces serves as a compute-efficient strategy at lower budgets, offering a strong trade-off between training cost and verifier accuracy. Ultimately, our work provides the empirical foundation necessary to efficiently deploy robust code verifiers, thereby enabling their wider adoption in post-training pipelines for large code generation models.