ICLR Poster Papers on "human preference alignment"
8 papers found
ALLaM: Large Language Models for Arabic and English
M Saiful Bari, Yazeed Alnumay, Norah Alzahrani et al.
ICLR 2025 poster · arXiv:2407.15390
47 citations
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip Torr, Mohamed Elhoseiny et al.
ICLR 2025 poster · arXiv:2408.15313
23 citations
MetaMetrics: Calibrating Metrics for Generation Tasks Using Human Preferences
Genta Winata, David Anugraha, Lucky Susanto et al.
ICLR 2025 poster · arXiv:2410.02381
17 citations
RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization
Hanyang Zhao, Genta Winata, Anirban Das et al.
ICLR 2025 poster · arXiv:2410.04203
17 citations
Reward Guided Latent Consistency Distillation
William Wang, Jiachen Li, Weixi Feng et al.
ICLR 2025 poster · arXiv:2403.11027
24 citations
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren et al.
ICLR 2025 poster · arXiv:2409.13156
44 citations
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar et al.
ICLR 2025 poster · arXiv:2410.08847
47 citations
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.
ICLR 2025 poster · arXiv:2410.18640
14 citations