0

Value-Free Policy Optimization via Reward Partitioning

Single-trajectory preference optimization methods learn from datasets of ((prompt, response, reward)) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback.

Year
2025
Hosting
Full text hostedCC-BY-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2506.13702CC-BY-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Single-trajectory preference optimization methods learn from datasets of ((prompt, response, reward)) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation, introducing additional variance, optimization complexity, and sensitivity to off-policy data. We introduce Reward Partition Optimization (RPO), a simple and scalable reward-driven objective that eliminates the need for value function learning. RPO normalizes rewards through a partition-based formulation estimated directly from prompt-level reward distributions, yielding a stable supervised optimization objective without auxiliary models or reinforcement learning loops. We evaluate RPO across multiple encoder-decoder and decoder-only language models using automatic metrics, LLM-as-a-judge evaluations, and optimization stability analyses. Experimental results show that RPO consistently outperforms strong baselines, including SFT, KTO, and DRO, while producing more aligned, diverse, and less toxic generations.