bandits at CREST
Today, Emilie Kaufmann (Telecom ParisTech) gave a talk at BiP, the Bayes in Paris seminar at CREST, on multiarmed bandits. I have always liked those problems, which mix probability, learning, sequential statistics, decision theory, and design (exploration versus exploitation). This is a well-known problem with a large literature but, nonetheless, I still find talks on the topic entertaining and mind-challenging, especially as I had missed the recent arXiv paper by Aurélien Garivier and Olivier Cappé.
Here the goal is to define Bayesian strategies that do well from a frequentist viewpoint. (This reminded me of something I read or heard the other day about people acting more Bayesianly against people than against computers, where they fall back on random (and poor) strategies.) Emilie mentioned a computing issue with the Gittins finite horizon setting, which boils down to a dynamic programming problem. This “justifies” moving to the asymptotic version of the game of Lai and Robbins (1985). The classical algorithmic resolution uses UCB (upper confidence bounds) on the probabilities of reward and picks the bandit arm with the highest bound each time, which involves some degree of approximation.
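For concreteness, here is a minimal sketch of the UCB idea in Python, using the UCB1 bonus of Auer et al. (2002); the function name and the arms-as-callables interface are my own illustration, not necessarily the variant discussed in the talk.

```python
import numpy as np

def ucb1(reward_fns, horizon):
    """Run UCB1: reward_fns[a]() draws a reward in [0, 1] from arm a."""
    k = len(reward_fns)
    counts = np.zeros(k)   # number of pulls per arm
    sums = np.zeros(k)     # cumulative reward per arm
    history = []
    for a in range(k):     # pull each arm once to initialise the bounds
        r = reward_fns[a]()
        counts[a], sums[a] = 1, r
        history.append((a, r))
    for t in range(k, horizon):
        # upper confidence bound: empirical mean plus an exploration
        # bonus that shrinks as an arm accumulates pulls
        bounds = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        a = int(np.argmax(bounds))
        r = reward_fns[a]()
        counts[a] += 1
        sums[a] += r
        history.append((a, r))
    return history

# example: three Bernoulli arms with means 0.3, 0.5, 0.7
rng = np.random.default_rng(0)
arms = [lambda p=p: float(rng.random() < p) for p in (0.3, 0.5, 0.7)]
history = ucb1(arms, horizon=1000)
```

The bonus term guarantees every arm keeps being tried, while the best arm ends up pulled at a dominating rate, matching the Lai and Robbins logarithmic regret rate up to constants.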
In the Bayesian case, the reference is the Thompson (1933) sampling algorithm, which draws one parameter value per arm from its posterior and pulls the arm with the highest draw, i.e., picks each arm with its posterior probability of being optimal. Emilie and co-authors managed to extend the sampling to generic exponential families, showing that the Jeffreys prior leads to an asymptotically optimal solution. I am actually surprised at seeing the Jeffreys prior in this framework, due to the need to predict what is happening on arms not yet or hardly visited, which would seem to prohibit an improper prior. Maybe all this does not matter for asymptotics…
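In the Bernoulli case the Jeffreys prior is Beta(1/2, 1/2) and stays conjugate, so Thompson sampling takes only a few lines; the sketch below is my own illustration of that special case, not the authors' code.

```python
import numpy as np

def thompson_bernoulli(true_probs, horizon, rng=None):
    """Thompson sampling for Bernoulli arms under the Jeffreys prior."""
    rng = rng or np.random.default_rng()
    k = len(true_probs)
    successes = np.zeros(k)
    failures = np.zeros(k)
    rewards = np.zeros(horizon)
    for t in range(horizon):
        # one posterior draw per arm; the Jeffreys prior for the
        # Bernoulli model is Beta(1/2, 1/2), hence the 0.5 offsets
        theta = rng.beta(successes + 0.5, failures + 0.5)
        a = int(np.argmax(theta))
        r = float(rng.random() < true_probs[a])
        successes[a] += r
        failures[a] += 1 - r
        rewards[t] = r
    return rewards
```

A single posterior draw per arm per round is all that is needed, which is what makes the method so appealing next to the dynamic programming required by the finite horizon Gittins solution.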
My only voiced question was about why computational solutions to the finite horizon problem are not more developed. Maybe ABC could be used for bandits??? Although it seems from the talk that Thompson sampling already does very well for finite horizons.