Poster papers for "large language model alignment"
9 papers found
Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging
Jinluan Yang, Dingnan Jin, Anke Tang et al.
NeurIPS 2025 poster · arXiv:2502.06876
13 citations
On a Connection Between Imitation Learning and RLHF
Teng Xiao, Yige Yuan, Mingxiao Li et al.
ICLR 2025 poster · arXiv:2503.05079
13 citations
Rethinking Reward Modeling in Preference-based Large Language Model Alignment
Hao Sun, Yunyi Shen, Jean-Francois Ton
ICLR 2025 poster
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren et al.
ICLR 2025 poster · arXiv:2409.13156
44 citations
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Shusheng Xu, Wei Fu, Jiaxuan Gao et al.
ICML 2024 poster
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint
Wei Xiong, Hanze Dong, Chenlu Ye et al.
ICML 2024 poster
ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
Ziniu Li, Tian Xu, Yushun Zhang et al.
ICML 2024 poster
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Harrison Lee, Samrat Phatale, Hassan Mansoor et al.
ICML 2024 poster
WARM: On the Benefits of Weight Averaged Reward Models
Alexandre Rame, Nino Vieillard, Léonard Hussenot et al.
ICML 2024 poster