Exploration Through Bias: Revisiting Biased Maximum Likelihood Estimation in Stochastic Multi-Armed Bandits

Part of Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020)

Bibtex »Metadata »Paper »Supplemental »

Bibtek download is not availble in the pre-proceeding


Authors

Xi Liu, Ping-Chun Hsieh, Yu Heng Hung, Anirban Bhattacharya, P. Kumar

Abstract

<p>We propose a new family of bandit algorithms, that are formulated in a general way based on the Biased Maximum Likelihood Estimation (BMLE) method originally appearing in the adaptive control literature. We design the reward-bias term to tackle the exploration and exploitation tradeoff for stochastic bandit problems. We provide a general recipe for the BMLE algorithm and derive a simple explicit closed-form expression for the index of an arm for exponential family reward distributions. We prove that the derived BMLE indices achieve a logarithmic finite-time regret bound and hence attain order-optimality, for both exponential families and the cases beyond parametric distributions. Through extensive simulations, we demonstrate that the proposed algorithms achieve regret performance comparable to the best of several state-of-the-art baseline methods, while being computationally efficient in comparison to other best-performing methods. The generality of the proposed approach makes it possible to address more complex models, including general adaptive control of Markovian systems.</p>