Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism

Part of Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020)

Bibtex »Metadata »Paper »

Bibtek download is not availble in the pre-proceeding


Wang Chi Cheung, David Simchi-Levi, Ruihao Zhu


<p>We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, \ie, both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain \emph{variation budgets}. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (\texttt{SWUCRL2-CW}) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (\texttt{BORL}) algorithm to adaptively tune the \sw~to achieve the same dynamic regret bound, but in a \emph{parameter-free} manner, \ie, without knowing the variation budgets. Notably, learning drifting MDPs via conventional optimistic exploration presents a unique challenge absent in existing (non-stationary) bandit learning settings. We overcome the challenge by a novel confidence widening technique that incorporates additional optimism.</p>