Reactive Bandits with Attitude
Pedro A. Ortega, Kee-Eung Kim and Daniel D. Lee
GRASP Laboratory, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, U.S.A.

Motivation

Bandit Algorithms
[Figure: empirical arm frequencies under UCB1 over 8e5 rounds.] UCB1's empirical frequencies do not converge to the optimum.

Learning Reactive Bandits
1. The possible locations depend on the attitude parameter.
2. The bandit can react to the player's strategy by choosing a location.
[Figure: parameter estimate over 8e4 rounds.] The true parameter is learned very quickly.
[Figure: arm-choice frequencies over 8e4 rounds.] The mixed strategy converges to the optimal strategy.

Optimal Strategy
[Figure: optimal strategy as a function of the attitude parameter, from -3 to +3.]

Conclusions
The optimum switches between arms. In the limit, the arm with the highest variance is chosen.
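For context on the UCB1 panel, the following is a minimal sketch of the standard UCB1 index rule on a fixed Bernoulli bandit, the non-reactive baseline against which the poster's claim is made. The function name, arm means, and horizon are illustrative assumptions, not taken from the poster, and the reactive (attitude-parameterized) bandit itself is not implemented here.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Standard UCB1 on a fixed Bernoulli bandit (non-reactive baseline).

    Returns the empirical pull frequency of each arm, the quantity the
    poster tracks when arguing that UCB1 fails on reactive bandits.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    # Pull each arm once to initialise the indices.
    for a in range(k):
        counts[a] = 1
        sums[a] = float(rng.random() < means[a])
    for t in range(k, horizon):
        # UCB1 index: empirical mean plus an exploration bonus that
        # shrinks as an arm accumulates pulls.
        a = max(range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t + 1) / counts[i]))
        counts[a] += 1
        sums[a] += float(rng.random() < means[a])
    return [c / horizon for c in counts]

# Example: two arms with success probabilities 0.3 and 0.6.
freqs = ucb1([0.3, 0.6], horizon=10000)
```

On a fixed bandit like this, the empirical frequency of the better arm approaches 1; the poster's point is that this convergence breaks down once the bandit can react to the player's strategy.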