"language model alignment" Papers
18 papers found
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins
Jongwoo Ko, Saket Dingliwal, Bhavana Ganesh et al.
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Teng Xiao, Yige Yuan, Zhengyu Chen et al.
Variational Best-of-N Alignment
Afra Amini, Tim Vieira, Elliott Ash et al.
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.
BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback
Gaurav Pandey, Yatin Nandwani, Tahira Naseem et al.
Controlled Decoding from Language Models
Sidharth Mudgal, Jong Lee, Harish Ganapathy et al.
Customizing Language Model Responses with Contrastive In-Context Learning
Xiang Gao, Kamalika Das
Decoding-time Realignment of Language Models
Tianlin Liu, Shangmin Guo, Leonardo Martins Bianco et al.
Degeneration-free Policy Optimization: RL Fine-Tuning for Language Models without Degeneration
Youngsoo Jang, Geon-Hyeong Kim, Byoungjip Kim et al.
Human Alignment of Large Language Models through Online Preference Optimisation
Daniele Calandriello, Zhaohan Guo, Rémi Munos et al.
Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF
Banghua Zhu, Michael Jordan, Jiantao Jiao
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback
Songyang Gao, Qiming Ge, Wei Shen et al.
MaxMin-RLHF: Alignment with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan et al.
ODIN: Disentangled Reward Mitigates Hacking in RLHF
Lichang Chen, Chen Zhu, Jiuhai Chen et al.
Provably Robust DPO: Aligning Language Models with Noisy Feedback
Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Zixiang Chen, Yihe Deng, Huizhuo Yuan et al.
Towards Efficient Exact Optimization of Language Model Alignment
Haozhe Ji, Cheng Lu, Yilin Niu et al.