ICLR "model alignment" Papers
4 papers found
Anyprefer: An Agentic Framework for Preference Data Synthesis
Yiyang Zhou, Zhaoyang Wang, Tianle Wang et al.
ICLR 2025 (poster) · arXiv:2504.19276 · 10 citations
Jailbreaking as a Reward Misspecification Problem
Zhihui Xie, Jiahui Gao, Lei Li et al.
ICLR 2025 (poster) · arXiv:2406.14393 · 9 citations
Large Language Models Assume People are More Rational than We Really Are
Ryan Liu, Jiayi Geng, Joshua Peterson et al.
ICLR 2025 (poster) · arXiv:2406.17055 · 37 citations
Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model
Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.
ICLR 2025 (poster) · arXiv:2410.18640 · 14 citations