Part of Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020)

BibTeX download is not available in the pre-proceedings

*Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, Lin Yang*

Reinforcement learning (RL) is often applied to control problems with large state and action spaces, so it is natural to consider RL with a parametric model. In this paper we focus on finite-horizon episodic RL where the transition model admits a nonlinear parametrization $P_{\theta}$, a special case of which is the linear parameterization $P_{\theta} = \sum_{i=1}^{d} (\theta)_{i}P_{i}$. We propose an upper confidence model-based RL algorithm with value-targeted model parameter estimation. The algorithm updates the estimate of $\theta$ by solving a nonlinear regression problem that uses the latest value estimate as the target. We demonstrate the efficiency of our algorithm by proving a bound on its expected regret which, in the special case of linear parameterization, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$, and $d$ are the horizon, the total number of steps, and the dimension of $\theta$, respectively. This regret bound is independent of the total number of states or actions, and is close to a lower bound of $\Omega(\sqrt{HdT})$. In the general nonlinear case, we handle the regret analysis by using the concept of Eluder dimension proposed by \citet{RuVR14}.
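To make the value-targeted estimation step concrete, the following is a minimal sketch for the linear special case $P_{\theta} = \sum_{i=1}^{d} (\theta)_{i}P_{i}$ only. It is not the algorithm as specified in the paper: it assumes a small tabular problem, holds the value function fixed rather than updating it each episode, uses a ridge regularizer `reg`, and omits the confidence sets and optimistic planning. All function and variable names are hypothetical. Each observed transition $(s, a, s')$ yields a feature vector with entries $\sum_{s''} P_{i}(s'' \mid s, a) V(s'')$ and a regression target $V(s')$.

```python
import numpy as np


def value_targeted_features(base_models, value, state, action):
    """Return x with x[i] = sum_{s'} P_i(s' | state, action) * V(s').

    base_models: array of shape (d, S, A, S) holding the known models P_i.
    value:       array of shape (S,), the value estimate used as the target.
    """
    return base_models[:, state, action, :] @ value


def fit_theta(features, targets, reg=1.0):
    """Ridge regression: minimize sum_t (x_t . theta - y_t)^2 + reg * ||theta||^2."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(targets, dtype=float)
    gram = X.T @ X + reg * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)


def demo(n_samples=5000, seed=0):
    """Simulate transitions from a hypothetical mixture model and estimate theta."""
    rng = np.random.default_rng(seed)
    d, n_states, n_actions = 2, 4, 2
    # Hypothetical base models: each P_i[s, a, :] is a probability vector.
    base_models = rng.dirichlet(np.ones(n_states), size=(d, n_states, n_actions))
    theta_true = np.array([0.3, 0.7])
    # Assumption: a fixed value function stands in for the latest value estimate.
    value = rng.uniform(0.0, 1.0, size=n_states)

    features, targets = [], []
    for _ in range(n_samples):
        s = rng.integers(n_states)
        a = rng.integers(n_actions)
        probs = np.tensordot(theta_true, base_models[:, s, a, :], axes=1)
        s_next = rng.choice(n_states, p=probs)
        features.append(value_targeted_features(base_models, value, s, a))
        targets.append(value[s_next])  # the value at the next state is the target
    return fit_theta(features, targets), theta_true


if __name__ == "__main__":
    theta_hat, theta_true = demo()
    print("estimate:", theta_hat, "true:", theta_true)
```

The regression is consistent in this setting because the conditional expectation of the target $V(s')$ given $(s, a)$ equals the inner product of the feature vector with the true $\theta$. Note that the regression has only $d$ unknowns regardless of how many states or actions there are.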
