Estimation and Inference in Distributional Reinforcement Learning

Citations: 4
Rank: #648 of 5858 papers in NeurIPS 2025
Authors: 5

Abstract

In this paper, we study distributional reinforcement learning from the perspective of statistical efficiency. We investigate distributional policy evaluation, which aims to estimate the complete return distribution (denoted $\eta^\pi$) attained by a given policy $\pi$. We construct our estimator $\hat\eta^\pi$ with the certainty-equivalence method, assuming access to a generative model. In this setting, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2p}(1-\gamma)^{2p+2}}\right)$ suffices to guarantee that the $p$-Wasserstein distance between $\hat\eta^\pi$ and $\eta^\pi$ is less than $\varepsilon$ with high probability. This implies that the distributional policy evaluation problem can be solved in a sample-efficient manner. We also show that, under different mild assumptions, a dataset of size $\widetilde O\left(\frac{|\mathcal{S}||\mathcal{A}|}{\varepsilon^{2}(1-\gamma)^{4}}\right)$ suffices to ensure that the Kolmogorov metric and the total variation metric between $\hat\eta^\pi$ and $\eta^\pi$ are below $\varepsilon$ with high probability. Furthermore, we investigate the asymptotic behavior of $\hat\eta^\pi$. We demonstrate that the ``empirical process'' $\sqrt{n}(\hat\eta^\pi-\eta^\pi)$ converges weakly to a Gaussian process in $\ell^\infty(\mathcal{F}_{\text{W}})$, the space of bounded functionals on the Lipschitz function class, and, under some mild conditions, also in $\ell^\infty(\mathcal{F}_{\text{KS}})$ and $\ell^\infty(\mathcal{F}_{\text{TV}})$, the spaces of bounded functionals on the indicator function class and on the bounded measurable function class, respectively. Our findings give rise to a unified approach to statistical inference for a wide class of statistical functionals of $\eta^\pi$.
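The abstract does not spell out the estimator's construction, but the certainty-equivalence idea can be illustrated as follows: build an empirical MDP from the generative-model samples, then run distributional dynamic programming on that empirical model to obtain $\hat\eta^\pi$. The sketch below is a minimal, hypothetical Python illustration, not the paper's construction: the `sampler(s, a)` interface, the `empirical_model` and `certainty_equivalent_eta` helpers, the C51-style categorical representation of return distributions, and the use of empirical mean rewards are all assumptions made for the sake of the example.

```python
import numpy as np

def empirical_model(sampler, n_states, n_actions, n_samples):
    """Build an empirical MDP from generative-model samples.

    `sampler(s, a)` is a hypothetical interface returning one (reward, next_state)
    draw; the paper only assumes generative-model access, not this exact API.
    """
    P_hat = np.zeros((n_states, n_actions, n_states))
    R_hat = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                r, s_next = sampler(s, a)
                P_hat[s, a, s_next] += 1.0
                R_hat[s, a] += r
            P_hat[s, a] /= n_samples
            R_hat[s, a] /= n_samples   # empirical mean reward (a simplification)
    return P_hat, R_hat


def certainty_equivalent_eta(P_hat, R_hat, policy, gamma, n_atoms=201, n_iters=500):
    """Approximate the return distribution under the *estimated* model by iterating
    a categorical (C51-style) distributional Bellman operator.

    Returns the atom locations and, per state, a probability mass function over returns.
    `policy[s, a]` is the probability of taking action a in state s.
    """
    n_states, n_actions, _ = P_hat.shape
    v_max = np.max(np.abs(R_hat)) / (1.0 - gamma) + 1e-8
    atoms = np.linspace(-v_max, v_max, n_atoms)
    dz = atoms[1] - atoms[0]
    eta = np.full((n_states, n_atoms), 1.0 / n_atoms)   # start from uniform masses
    for _ in range(n_iters):
        new_eta = np.zeros_like(eta)
        for s in range(n_states):
            for a in range(n_actions):
                for s_next in range(n_states):
                    w = policy[s, a] * P_hat[s, a, s_next]
                    if w == 0.0:
                        continue
                    # Push next-state return atoms through r + gamma * z, then
                    # project the shifted distribution back onto the fixed grid.
                    tz = np.clip(R_hat[s, a] + gamma * atoms, -v_max, v_max)
                    b = (tz + v_max) / dz
                    lo = np.floor(b).astype(int)
                    hi = np.minimum(np.ceil(b).astype(int), n_atoms - 1)
                    lo_w = (hi - b) + (lo == hi)   # full mass to lo when b is integral
                    hi_w = b - lo
                    np.add.at(new_eta[s], lo, w * eta[s_next] * lo_w)
                    np.add.at(new_eta[s], hi, w * eta[s_next] * hi_w)
        eta = new_eta
    return atoms, eta
```

From the returned atoms and per-state masses one can form cumulative distribution functions and compare $\hat\eta^\pi$ against a reference $\eta^\pi$ in, for example, the 1-Wasserstein distance (the integral of the absolute CDF difference) or the Kolmogorov metric (the sup of that difference), mirroring the metrics analyzed in the paper.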

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 31, 2026: 4