"human preference learning" Papers
4 papers found
Interpreting Language Reward Models via Contrastive Explanations
Junqi Jiang, Tom Bewley, Saumitra Mishra et al.
ICLR 2025 (poster) · arXiv:2411.16502
5 citations
Uncertainty and Influence aware Reward Model Refinement for Reinforcement Learning from Human Feedback
Zexu Sun, Yiju Guo, Yankai Lin et al.
ICLR 2025 (poster)
3 citations
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
Qining Zhang, Lei Ying
ICLR 2025 (poster) · arXiv:2409.17401
10 citations
Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban et al.
ICML 2024 (poster) · arXiv:2403.01857