Open Problems in Constitutional Preference Reconstruction

Year: 2026
Abstract & full text: arxiv.org/abs/2606.30116

Abstract

Pairwise preference data is widely used for training and evaluating language models (e.g., RLHF), but each datapoint records a choice, not the rationale behind it. Methods such as Inverse Constitutional AI (ICAI) attempt to improve interpretability by compressing datasets into short "constitutions" of natural-language principles. We argue this framing is under-specified: a flat list of principles is not yet an executable decision rule because it leaves principle composition implicit. We use the pairwise setting as a testbed to empirically characterize three open problems in constitutional methods. First, principle quality is hard to measure: coverage and accuracy are useful but incomplete proxies for end-to-end reconstruction. Second, composition is ambiguous: holding principles fixed, different executors (LLM judge versus majority vote) agree only 73% of the time. Third, constitutions differ between LLMs: cross-model vote agreement is 73%, whereas intra-model agreement is 81%. Across PRISM, AlpacaEval, and Chatbot Arena, we show that principle refinement (ICAI+) may be a first step towards ameliorating these problems: inter-executor agreement rises to 78%, and transparent executors match LLM judge accuracy (66% vs. 67%). Our results highlight that constitutions should be evaluated as constitution-executor systems, with implications for LLMs-as-a-judge broadly.
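
To make the notion of inter-executor agreement concrete, the following is a minimal sketch of one possible transparent executor (a majority vote over per-principle votes) and an agreement-rate computation between two executors. It is not the paper's implementation: the vote encoding ("A", "B", or None for abstention), the tie-break rule, the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence

# Assumed encoding: each principle votes "A", "B", or None (abstains) on a pair.
Vote = Optional[str]


def majority_vote(votes: Sequence[Vote], tie_break: str = "A") -> str:
    """Hypothetical transparent executor: choose the response that most
    non-abstaining principles favor. The tie-break rule is an assumption."""
    counts = Counter(v for v in votes if v is not None)
    a, b = counts["A"], counts["B"]
    if a == b:
        return tie_break
    return "A" if a > b else "B"


def agreement_rate(decisions_x: Sequence[str], decisions_y: Sequence[str]) -> float:
    """Fraction of pairs on which two executors choose the same response."""
    if len(decisions_x) != len(decisions_y):
        raise ValueError("decision lists must have equal length")
    if not decisions_x:
        raise ValueError("decision lists must be non-empty")
    matches = sum(x == y for x, y in zip(decisions_x, decisions_y))
    return matches / len(decisions_x)


# Toy data (invented): votes of three principles on four response pairs.
votes_per_pair = [
    ["A", "A", "B"],
    ["B", None, "B"],
    ["A", "B", None],   # tie, resolved by the assumed tie-break
    [None, "B", "B"],
]
vote_decisions = [majority_vote(v) for v in votes_per_pair]

# Invented stand-in for an LLM judge's decisions on the same four pairs.
judge_decisions = ["A", "B", "B", "B"]

inter_executor_agreement = agreement_rate(vote_decisions, judge_decisions)
print(vote_decisions, inter_executor_agreement)
```

In this toy example the two executors differ only on the tied pair, which illustrates how the same fixed principles can yield different decisions depending on the composition rule.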