Pairwise Calibrated Rewards for Pluralistic Alignment

0 Citations · #1918 in NeurIPS 2025 (of 5858 papers) · 4 Authors

Abstract

Current alignment pipelines presume a single, universal notion of desirable behavior. However, human preferences often diverge across users, contexts, and cultures. As a result, disagreement collapses into the majority signal and minority perspectives are discounted. To address this, we propose reflecting diverse human preferences through a distribution over multiple reward functions, each inducing a distinct aligned policy. The distribution is learned directly from pairwise preferences without annotator identifiers or predefined groups. Instead, annotator disagreements are treated as informative soft labels. Our central criterion is pairwise calibration: for every pair of candidate responses, the proportion of reward functions preferring one response matches the fraction of annotators with that preference. We prove that even a small outlier-free ensemble can accurately represent diverse preference distributions. Empirically, we introduce and validate a practical training heuristic for learning such ensembles, and demonstrate its effectiveness through improved calibration, implying a more faithful representation of pluralistic values.
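The pairwise-calibration criterion from the abstract can be illustrated with a short sketch. The following Python snippet (not the authors' implementation; all names such as `pairwise_calibration_error` and `RewardFn` are illustrative assumptions) compares, for each response pair, the share of reward functions in an ensemble that prefer one response with the fraction of annotators who prefer it, and averages the gap.

```python
# Minimal sketch of the pairwise-calibration idea: for every pair of candidate
# responses, the fraction of ensemble reward functions preferring response A
# should match the fraction of annotators preferring A (the soft label).
from typing import Callable, Sequence, Tuple

RewardFn = Callable[[str, str], float]  # maps (prompt, response) -> scalar reward


def pairwise_calibration_error(
    ensemble: Sequence[RewardFn],
    pairs: Sequence[Tuple[str, str, str, float]],
) -> float:
    """Mean absolute gap between ensemble preference rates and annotator soft labels.

    Each element of `pairs` is (prompt, response_a, response_b, p_annotators),
    where p_annotators is the fraction of annotators preferring response_a.
    """
    total_gap = 0.0
    for prompt, resp_a, resp_b, p_annotators in pairs:
        # Fraction of reward functions that rank response_a above response_b.
        votes_for_a = sum(r(prompt, resp_a) > r(prompt, resp_b) for r in ensemble)
        p_ensemble = votes_for_a / len(ensemble)
        total_gap += abs(p_ensemble - p_annotators)
    return total_gap / len(pairs)


# Toy usage: two hand-written "reward functions" that disagree on verbosity,
# evaluated on one pair where annotators are split 50/50.
if __name__ == "__main__":
    concise = lambda prompt, resp: -len(resp)   # prefers shorter answers
    detailed = lambda prompt, resp: len(resp)   # prefers longer answers
    data = [("Explain RLHF.", "Short answer.", "A much longer, detailed answer.", 0.5)]
    print(pairwise_calibration_error([concise, detailed], data))  # 0.0: calibrated on this pair
```

A perfectly calibrated ensemble drives this error to zero on held-out pairs, which is the sense in which the ensemble "faithfully represents" the distribution of annotator preferences.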
