School of Technology and Computer Science Seminars

Regret Analysis of Multi Armed Bandits

by Anand Deo (School of Technology and Computer Science, TIFR)

Friday, July 6, 2018 from to (Asia/Kolkata)
at A-201 Seminar Room
Description
Consider a system with K arms, arm i yielding a payoff Xi, according to the distribution Pi. Suppose now, a gambler who is unaware of Pi, samples rewardsfrom a sequence of arms, with the objective of maximizing his total expected reward. TheMulti Armed Bandit (MAB) problem consists of finding a (stochastic) sequence of armpulls that achieves a large expected reward. This is equivalent to minimizing regret, orthe difference between the expected payoff of a genie strategy, which knows the best armis beforehand, and the chosen sequence of arm pulls. In 1985, Lai and Robbins provedthat regret grows at least at the rate log(n), where n is the total number of armspulled. In this talk, we will see the construction of simple strategies, whichasymptotically achieve this bound.  This talk is based on the work of Auer, Cesa-Bianchi, Fischer (2002). I will assume only basic familiarity with probability for this talk.