Bibtek download is not availble in the pre-proceeding
Manan Tomar, Yonathan Efroni, Mohammad Ghavamzadeh
Multi-step greedy policies have been extensively used in model-based Reinforcement Learning (RL), both when a model of the environment is available (e.g.,~in the game of Go) and when it is learned. In this paper, we explore the benefits of multi-step greedy policies in model-free RL when employed using the multi-step Dynamic Programming algorithms: $\kappa$-Policy Iteration ($\kappa$-PI) and $\kappa$-Value Iteration ($\kappa$-VI). These methods iteratively compute the next policy ($\kappa$-PI) and value function ($\kappa$-VI) by solving a surrogate decision problem with a shaped reward and a smaller discount factor. We derive model-free RL algorithms based on $\kappa$-PI and $\kappa$-VI in which the surrogate decision problem is solved by DQN and TRPO. We call the resulting algorithms $\kappa$-PI-DQN, $\kappa$-VI-DQN, $\kappa$-PI-TRPO, and $\kappa$-VI-TRPO and evaluate them on Atari and MuJoCo benchmarks. Our results indicate that for the right range of $\kappa$, our algorithms outperform DQN and TRPO. Moreover, we identify the importance of a hyper-parameter that controls the extent to which the surrogate decision problem is solved, and show how to set this parameter. Finally, we establish that $\kappa$-PI-TRPO coincides with the popular GAE algorithm.