Beyond Expectations: Quantile-Guided Alignment for Risk-Calibrated Language Models

0 citations · ranked #1767 of 5858 papers in NeurIPS 2025 · 8 authors · 4 data points

Abstract

Large language models can generate rare but catastrophic outputs, such as harmful conversations or insecure code. Existing Reinforcement Learning from Human Feedback (RLHF) typically maximizes average reward, leaving high-risk tail events insufficiently controlled. We introduce Quantile‑Guided Alignment (QA), a framework that allows users to specify desired improvements at any quantile—individually or across multiple reward dimensions—giving finer control over shifting the output distribution toward safer, more desirable outcomes. The method extends standard RLHF via an augmented reward formulation that enforces quantile constraints. Experiments on conversation and code‐generation tasks show that quantile alignment significantly improves quality at the targeted tails while maintaining overall performance. These results position QA as a principled route to risk‑calibrated language models with tail‑focused alignment.
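The abstract does not spell out the augmented reward formulation, so the snippet below is only a minimal sketch of one way a quantile-constraint penalty could be folded into a batch of RLHF rewards. The function name `augmented_reward` and the parameters `tau`, `target`, and `penalty_weight` are hypothetical illustrations, not the paper's notation.

```python
import numpy as np

def augmented_reward(rewards, tau=0.05, target=0.0, penalty_weight=1.0):
    """Hypothetical sketch: penalize a batch of rewards when the
    empirical tau-quantile falls below a desired target value.

    rewards        : scalar rewards for a batch of model outputs
    tau            : quantile level to control (e.g. 0.05 = worst 5% tail)
    target         : desired minimum value for that quantile
    penalty_weight : strength of the quantile-constraint penalty
    """
    rewards = np.asarray(rewards, dtype=float)
    # Empirical tau-quantile of the batch (the "tail" being controlled).
    q_tau = np.quantile(rewards, tau)
    # Constraint violation: how far the tail quantile sits below the target.
    violation = max(0.0, target - q_tau)
    # Augmented reward: original reward minus a shared penalty term,
    # nudging the policy update to lift the tail quantile rather than
    # only the average reward.
    return rewards - penalty_weight * violation

if __name__ == "__main__":
    batch = np.random.normal(loc=1.0, scale=1.0, size=256)  # toy reward samples
    shaped = augmented_reward(batch, tau=0.05, target=0.5, penalty_weight=2.0)
    print("5% quantile before shaping:", np.quantile(batch, 0.05))
    print("mean augmented reward     :", shaped.mean())
```

In this toy version the penalty is shared across the batch; the paper's multi-dimensional setting would presumably apply separate quantile constraints per reward dimension.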

Citation History: 0 citations recorded from Jan 25 to Jan 28, 2026.