Off-Policy Actor-Critic with Shared Experience Replay

Part of Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020)

Bibtex »Metadata »Paper »Supplemental »

Bibtek download is not availble in the pre-proceeding


Simon Schmitt, Matteo Hessel, Karen Simonyan


<p>We investigate the combination of actor-critic reinforcement learning algorithms with a uniform large-scale experience replay and propose solutions for two ensuing challenges: (a) efficient actor-critic learning with experience replay (b) the stability of off-policy learning where agents learn from other agents behaviour. </p> <p>To this end we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable.</p> <p>We provide extensive empirical validation of the proposed solutions on DMLab-30. We further show the benefits of this setup in two training regimes for Atari: (1) a single agent is trained up until 200M environment frames per game (2) a population of agents is trained up until 200M environment frames each and may share experience. While (1) is a standard regime, (2) reflects the use case of concurrently executed hyper-parameter sweeps. We demonstrate state-of-the-art data efficiency among model-free agents in both regimes.</p>