Goodhart's Law in Reinforcement Learning

ICLR 2024

Abstract

Implementing a reward function that perfectly captures a complex task in the real world is impractical. As a result, it is often appropriate to think of the reward function as a proxy for the true objective rather than as its definition. We study this phenomenon through the lens of Goodhart's law, which predicts that increasing optimisation of an imperfect proxy beyond some critical point decreases performance on the true objective. First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. We then provide a geometric explanation for why Goodhart's law occurs in Markov decision processes. We use these theoretical insights to propose an optimal early stopping method that provably avoids the aforementioned pitfall and derive theoretical regret bounds for this method. Moreover, we derive a training method that maximises worst-case reward, for the setting where there is uncertainty about the true reward function. Finally, we evaluate our early stopping method experimentally. Our results provide a foundation for a theoretically principled study of reinforcement learning under reward misspecification.
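To make the Goodharting dynamic concrete, here is a minimal, purely illustrative Python sketch. It is not the paper's algorithm, environment, or stopping criterion: the reward functions, the uncertainty set, and all constants are invented for the demo. It shows a toy proxy that keeps improving under optimisation while a hypothetical true reward peaks and then declines, and a pessimistic stop rule (worst case over an assumed set of plausible rewards) that selects a point before the decline.

```python
# Toy illustration (not the paper's method): optimising a misspecified proxy
# past a critical point degrades the true reward, and stopping early at the
# best pessimistic (worst-case) value over an assumed uncertainty set of
# plausible rewards avoids the drop. Everything here is made up for the demo.

def true_reward(theta):
    # Hypothetical true objective: improves at first, then degrades.
    return theta - theta ** 2

def proxy_reward(theta):
    # Hypothetical misspecified proxy: monotone in optimisation pressure.
    return theta

def plausible_rewards(theta):
    # Crude, assumed uncertainty set around the proxy; the true reward is
    # taken to lie between these two candidates (an assumption of this toy).
    return [proxy_reward(theta) - 1.2 * theta ** 2, proxy_reward(theta)]

theta, lr = 0.0, 0.05
best_pessimistic, best_theta = float("-inf"), theta

for step in range(40):
    # "Optimise" the proxy: gradient ascent on proxy_reward (gradient is 1).
    theta = min(theta + lr, 1.0)

    # Pessimistic early-stopping signal: worst case over the uncertainty set.
    pessimistic = min(plausible_rewards(theta))
    if pessimistic > best_pessimistic:
        best_pessimistic, best_theta = pessimistic, theta

    print(f"step {step:2d}  theta={theta:.2f}  proxy={proxy_reward(theta):.3f}  "
          f"true={true_reward(theta):.3f}  worst-case={pessimistic:.3f}")

print(f"early-stopped theta={best_theta:.2f}, "
      f"true reward there={true_reward(best_theta):.3f} "
      f"(vs {true_reward(1.0):.3f} under full proxy optimisation)")
```

Running the sketch, the proxy increases monotonically while the true reward peaks around theta = 0.5 and falls back to zero; the pessimistic rule stops slightly earlier than the peak, trading a little true reward for robustness to the misspecification, which is the qualitative trade-off the abstract describes.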
