"alignment" Papers
5 papers found

A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
Yan Scholten, Stephan Günnemann, Leo Schwinn
ICLR 2025 (poster) · arXiv:2410.03523 · 17 citations

Base Models Beat Aligned Models at Randomness and Creativity
Peter West, Christopher Potts
COLM 2025 (paper) · arXiv:2505.00047 · 16 citations

Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
Johannes Ackermann, Takashi Ishida, Masashi Sugiyama
COLM 2025 (paper) · arXiv:2507.15507

One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
Jacob Dunefsky, Arman Cohan
COLM 2025 (paper) · arXiv:2502.18862 · 8 citations

Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation
Tuhina Tripathi, Manya Wadhwa, Greg Durrett et al.
COLM 2025 (paper) · arXiv:2504.14716 · 9 citations