Bilal Piot
6 Papers · 60 Total Citations
Papers (6)
RRM: Robust Reward Model Training Mitigates Reward Hacking
ICLR 2025 · arXiv · 44 citations

Unlocking the Power of Representations in Long-term Novelty-based Exploration
ICLR 2024 · 9 citations

Learning from negative feedback, or positive feedback or both
ICLR 2025 · arXiv · 7 citations

Nash Learning from Human Feedback
ICML 2024 · 0 citations

Generalized Preference Optimization: A Unified Approach to Offline Alignment
ICML 2024 · 0 citations

Human Alignment of Large Language Models through Online Preference Optimisation
ICML 2024 · 0 citations