
Reactive Bandits with Attitude
Pedro A. Ortega, Kee-Eung Kim and Daniel D. Lee
Motivation
Bandit Algorithms
[Figure: UCB1's empirical frequencies do not converge to the optimum (arm-play frequencies over 8e5 time steps).]
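For reference, a minimal sketch of the standard UCB1 index rule discussed in the figure above. The `pull` callback and the Bernoulli arms in the usage line are illustrative assumptions, not part of the poster:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1: play each arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_i)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(horizon):
        if t < n_arms:
            arm = t  # initialization: play each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t + 1) / counts[i]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Usage (illustrative): two Bernoulli arms with means 0.4 and 0.6;
# the better arm ends up being played far more often.
random.seed(0)
means = [0.4, 0.6]
counts = ucb1(lambda i: 1.0 if random.random() < means[i] else 0.0,
              n_arms=2, horizon=10000)
```

Against a stationary bandit this concentrates play on the best arm; the point of the figure is that against a *reactive* bandit the same empirical frequencies need not converge to the optimum.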
Reactive Bandits
1. The possible locations depend on the attitude parameter.
2. The bandit can react to the player's strategy by choosing a location.

Learning
[Figure: The true parameter is learned very quickly.]
[Figure: The mixed strategy converges to the optimal strategy (over 8e4 time steps).]
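The reaction mechanism in points 1 and 2 can be illustrated with a toy model. This is a sketch under loud assumptions, not the poster's actual model: here the bandit observes the player's empirical arm frequencies and, with a negative "attitude" parameter, lowers the reward of the arm the player favors. It shows why a mixed strategy can outperform a deterministic one against such an opponent:

```python
import random

# Toy reactive bandit (illustrative assumption only): the reward
# probability of the player's favored arm is shifted by `attitude`
# (negative = adversarial), the other arm by -attitude.
def reward_prob(arm, freqs, attitude):
    favored = max(range(len(freqs)), key=freqs.__getitem__)
    p = 0.5 + (attitude if arm == favored else -attitude)
    return min(max(p, 0.0), 1.0)

def play(strategy, attitude, horizon, seed):
    rng = random.Random(seed)
    counts = [1, 1]  # smoothed empirical play counts
    total = 0.0
    for _ in range(horizon):
        freqs = [c / sum(counts) for c in counts]
        arm = strategy(freqs, rng)
        if rng.random() < reward_prob(arm, freqs, attitude):
            total += 1.0
        counts[arm] += 1
    return total / horizon  # average reward per step

greedy = lambda freqs, rng: 0                 # always plays arm 0
mixed = lambda freqs, rng: rng.randrange(2)   # uniform mixed strategy

# Against an adversarial attitude, the deterministic player is
# exploited while the mixed strategy is not.
r_greedy = play(greedy, attitude=-0.2, horizon=20000, seed=0)
r_mixed = play(mixed, attitude=-0.2, horizon=20000, seed=0)
```

In this toy, `r_greedy` settles near 0.3 (the bandit punishes the predictable arm) while `r_mixed` stays near 0.5, mirroring the poster's observation that the optimal play is a mixed strategy.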
Optimal Strategy
[Figure: The optimum switches between arms; in the limit, the arm with the highest variance is chosen (attitude parameter from -3 to +3).]

Conclusions
GRASP Laboratory, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, U.S.A.