2025 "reinforcement learning from human feedback" Papers
12 papers found
A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning
Yuzheng Hu, Fan Wu, Haotian Ye et al.
NeurIPS 2025 oral arXiv:2505.19281
2 citations
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.
ICLR 2025 poster arXiv:2410.18252
39 citations
Avoiding exp(R) scaling in RLHF through Preference-based Exploration
Mingyu Chen, Yiding Chen, Wen Sun et al.
NeurIPS 2025 poster
3 citations
Better Estimation of the Kullback–Leibler Divergence Between Language Models
Afra Amini, Tim Vieira, Ryan Cotterell
NeurIPS 2025 poster arXiv:2504.10637
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
Zhilin Wang, Jiaqi Zeng, Olivier Delalleau et al.
NeurIPS 2025 poster arXiv:2505.11475
31 citations
How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen et al.
ICLR 2025 poster arXiv:2410.14872
51 citations
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
ICLR 2025 poster arXiv:2409.12822
73 citations
More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness
Aaron J. Li, Satyapriya Krishna, Hima Lakkaraju
ICLR 2025 poster arXiv:2404.18870
10 citations
On a Connection Between Imitation Learning and RLHF
Teng Xiao, Yige Yuan, Mingxiao Li et al.
ICLR 2025 poster arXiv:2503.05079
13 citations
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Yantao Liu, Zijun Yao, Rui Min et al.
ICLR 2025 poster arXiv:2410.16184
97 citations
Self-Evolved Reward Learning for LLMs
Chenghua Huang, Zhizhen Fan, Lu Wang et al.
ICLR 2025 poster arXiv:2411.00418
18 citations
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
Heyang Zhao, Chenlu Ye, Quanquan Gu et al.
NeurIPS 2025 poster arXiv:2411.04625
14 citations