Papers by Tomek Korbak
6 papers found
Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
Xander Davies, Eric Winsor, Alexandra Souly et al.
NeurIPS 2025 poster
Inverse Scaling: When Bigger Isn't Better
Joe Cavanagh, Andrew Gritsevskiy, Najoung Kim et al.
ICLR 2025 poster
180 citations
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Javier Rando, Tony Wang, Stewart Slocum et al.
ICLR 2025 poster
Compositional Preference Models for Aligning LMs
Dongyoung Go, Tomek Korbak, Germán Kruszewski et al.
ICLR 2024 poster
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
Lukas Berglund, Meg Tong, Maximilian Kaufmann et al.
ICLR 2024 poster
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomek Korbak et al.
ICLR 2024 poster