Abstract
Recent works have studied state entropy maximization in reinforcement learning, in which the agent's objective is to learn a policy inducing high entropy over state visitations (Hazan et al., 2019). They typically assume full observability of the state of the system, so that the entropy of observations is maximized. In practice, the agent may only get partial observations, e.g., a robot perceiving the state of a physical space through proximity sensors and cameras. A significant mismatch between the entropy over observations and the entropy over the true states of the system can arise in those settings. In this paper, we address the problem of entropy maximization over the true states with a decision policy conditioned on partial observations only. The latter is a generalization of POMDPs, which is intractable in general. We develop a memory- and computation-efficient policy gradient method to address a first-order relaxation of the objective defined on belief states, providing various formal characterizations of the approximation gaps, the optimization landscape, and the hallucination problem. This paper aims to generalize state entropy maximization to more realistic domains that meet the challenges of applications.