Multi-armed Bandit Problems with Dependent Arms
Sandeep Pandey ([email protected])
Deepayan Chakrabarti ([email protected])
Deepak Agarwal ([email protected])

Slide 2: Background: Bandits
Bandit "arms" with unknown reward probabilities μ1, μ2, μ3, ...
Pull arms sequentially so as to maximize the total expected reward.
• Show ads on a webpage to maximize clicks
• Product recommendations to maximize sales

Slide 3: Dependent Arms
Reward probabilities μi are generally assumed to be independent of each other.
What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards:
• "Skiing, snowboarding"  μ1 = 0.3
• "Skiing, snowshoes"     μ2 = 0.28
• "Snowshoe rental"       μ3 = 0.31
• "Get Vonage!"           μ4 = 10^-6

Slide 4: Dependent Arms (cont.)
Reward probabilities μi are generally assumed to be independent of each other.
What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards.
A click on one ad suggests that other "similar" ads may generate clicks as well.
Can we increase total reward using this dependency?

Slide 5: Cluster Model of Dependence
Cluster 1 contains Arms 1 and 2; Cluster 2 contains Arms 3 and 4.
Successes si ~ Bin(ni, μi), where ni = # pulls of arm i.
μi ~ f(π[i]), where f is some known distribution and π[i] is the (unknown) cluster-specific parameter of arm i's cluster.

Slide 6: Cluster Model of Dependence
Arms 1 and 2: μi ~ f(π1); Arms 3 and 4: μi ~ f(π2).
Total reward:
• Discounted: ∑_{t=0..∞} α^t E[R(t)], where α is the discounting factor
• Undiscounted: ∑_{t=0..T} E[R(t)]

Slide 7: Discounted Reward
[Figure: per-cluster MDPs. Pulling Arm 1 moves the state (x1, x2) of cluster 1's MDP; pulling Arm 3 moves the state (x3, x4) of cluster 2's MDP.]
The optimal policy can be computed using per-cluster MDPs only.
Optimal Policy:
• Compute an ("index", arm) pair for each cluster
• Pick the cluster with the largest index, and pull the corresponding arm

Slide 8: Discounted Reward (cont.)
[Figure: the same per-cluster MDPs as on Slide 7.]
• Reduces the problem to smaller state spaces
• Reduces to Gittins' Theorem [1979] for independent bandits
• Approximation bounds on the index for k-step lookahead
The optimal policy can be computed using per-cluster MDPs only.
Optimal Policy:
• Compute an ("index", arm) pair for each cluster
• Pick the cluster with the largest index, and pull the corresponding arm

Slide 9: Cluster Model of Dependence (recap of Slide 6, turning to the undiscounted case)

Slide 10: Undiscounted Reward
"Cluster arm" 1 groups Arms 1 and 2; "cluster arm" 2 groups Arms 3 and 4.
All arms in a cluster are similar, so they can be grouped into one hypothetical "cluster arm".

Slide 11: Undiscounted Reward (Two-Level Policy)
Each "cluster arm" must have some estimated reward probability.
In each iteration:
• Pick a "cluster arm" using a traditional bandit policy
• Pick an arm within that cluster using a traditional bandit policy

Slide 12: Issues
• What is the reward probability of a "cluster arm"?
• How do cluster characteristics affect performance?
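The slides leave the "traditional bandit policy" on Slide 11 unspecified. As one concrete illustration (not the authors' exact method), here is a minimal Python sketch of the Two-Level Policy that assumes UCB1 at both levels and the MEAN estimate for each "cluster arm"; the class name and the toy reward probabilities at the end are hypothetical.

    import math
    import random

    class TwoLevelPolicy:
        """Two-level bandit policy sketch: UCB1 over "cluster arms",
        then UCB1 over the arms inside the chosen cluster. The "cluster
        arm" estimate is the MEAN scheme: total successes / total pulls."""

        def __init__(self, clusters):
            # clusters: list of lists of arm ids, e.g. [[0, 1], [2, 3]]
            self.clusters = clusters
            self.arms = [a for c in clusters for a in c]
            self.pulls = {a: 0 for a in self.arms}       # n_i
            self.successes = {a: 0 for a in self.arms}   # s_i
            self.t = 0

        def _ucb(self, mean, n):
            # UCB1 index; anything unpulled is tried first
            if n == 0:
                return float("inf")
            return mean + math.sqrt(2.0 * math.log(self.t) / n)

        def select_arm(self):
            self.t += 1
            # Level 1: pick a cluster via UCB1 on its MEAN estimate
            def cluster_index(cluster):
                n = sum(self.pulls[a] for a in cluster)
                s = sum(self.successes[a] for a in cluster)
                return self._ucb(s / n if n else 0.0, n)
            cluster = max(self.clusters, key=cluster_index)
            # Level 2: pick an arm inside that cluster via UCB1
            def arm_index(a):
                n = self.pulls[a]
                return self._ucb(self.successes[a] / n if n else 0.0, n)
            return max(cluster, key=arm_index)

        def update(self, arm, reward):
            self.pulls[arm] += 1
            self.successes[arm] += reward

    # Toy run: two clusters of two arms, with assumed true probabilities
    # loosely based on the ad example on Slide 3
    mu = {0: 0.30, 1: 0.28, 2: 0.31, 3: 1e-6}
    policy = TwoLevelPolicy([[0, 1], [2, 3]])
    for _ in range(10000):
        arm = policy.select_arm()
        policy.update(arm, 1 if random.random() < mu[arm] else 0)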
Slide 13: Reward probability of a "cluster arm"
What is the reward probability r of a "cluster arm"?
MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007].
• Initially, r ≈ μavg = the average μ of the arms in the cluster
• Finally, r ≈ μmax = the maximum μ among the arms in the cluster
This is a "drift" in the reward probability of the "cluster arm".

Slide 14: Reward probability drift causes problems
[Figure: Cluster 1 (the opt cluster, containing the best arm with reward probability μopt) vs. Cluster 2.]
Drift ⇒ non-optimal clusters might temporarily look better, so the optimal arm is explored only O(log T) times.

Slide 15: Reward probability of a "cluster arm"
What is the reward probability r of a "cluster arm"?
• MEAN: r = ∑si / ∑ni over all arms i in the cluster
• MAX: r = max_i E[μi] over arms in the cluster
• PMAX: r = E[ max_i μi ]
Both MAX and PMAX aim to estimate μmax and thus reduce drift.

Slide 16: Reward probability of a "cluster arm"
MEAN: r = ∑si / ∑ni;  MAX: r = max_i E[μi];  PMAX: r = E[ max_i μi ]
As estimators of μmax:
  Scheme | Bias in estimating μmax | Variance of estimator
  MAX    | High                    | Low
  PMAX   | Unbiased                | High
Both MAX and PMAX aim to estimate μmax and thus reduce drift.

Slide 17: Comparison of schemes
[Figure: reward of MEAN, MAX, and PMAX; 10 clusters, 11.3 arms/cluster.]
MAX performs best.

Slide 18: Issues (recap)
• What is the reward probability of a "cluster arm"?
• How do cluster characteristics affect performance?

Slide 19: Effects of cluster characteristics
We analytically study the effects of cluster characteristics on the "crossover-time".
Crossover-time: the time when the expected reward probability of the optimal cluster becomes the highest among all "cluster arms".

Slide 20: Effects of cluster characteristics
The crossover-time Tc for MEAN depends on:
• Cluster separation Δ = μopt - max μ outside the opt cluster: as Δ increases, Tc decreases
• Cluster size Aopt: as Aopt increases, Tc increases
• Cohesiveness of the opt cluster, 1 - avg(μopt - μi): as cohesiveness increases, Tc decreases

Slide 21: Experiments (effect of separation)
[Figure] As Δ increases, Tc decreases ⇒ higher reward.

Slide 22: Experiments (effect of size)
[Figure] As Aopt increases, Tc increases ⇒ lower reward.

Slide 23: Experiments (effect of cohesiveness)
[Figure] As cohesiveness increases, Tc decreases ⇒ higher reward.

Slide 24: Related Work
• Typical multi-armed bandit problems: do not consider dependencies; very few arms
• Bandits with side information: cannot handle dependencies among arms
• Active learning: emphasis on the number of examples required to achieve a given prediction accuracy

Slide 25: Conclusions
We analyze bandits where dependencies are encapsulated within clusters.
• Discounted reward: the optimal policy is an index scheme on the clusters
• Undiscounted reward: Two-Level Policy with MEAN, MAX, and PMAX
• Analysis of the effect of cluster characteristics on performance, for MEAN

Slide 26 (backup): Discounted Reward
[Figure: belief states over the estimated reward probabilities of arms 1-4; pulling Arm 1 changes the belief for both arms 1 and 2.]
• Create a belief-state MDP
• Each state contains the estimated reward probabilities for all arms
• Solve for the optimal policy

Slide 27 (backup): Background: Bandits
Bandit "arms" with unknown payoff probabilities p1, p2, p3, ...
Regret = optimal payoff - actual payoff.
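To make the MEAN/MAX/PMAX definitions from Slides 15-16 concrete, here is a minimal Python sketch of the three cluster-arm estimates. The uniform Beta(1,1) prior for E[μi] and the Monte Carlo approximation of E[max μi] are illustrative assumptions; the slides do not specify how these expectations are computed.

    import random

    def mean_estimate(s, n):
        """MEAN: pooled success rate, sum(s_i) / sum(n_i)."""
        return sum(s) / sum(n) if sum(n) else 0.0

    def max_estimate(s, n):
        """MAX: max_i E[mu_i]; here E[mu_i] is taken as the posterior
        mean under an assumed uniform Beta(1, 1) prior."""
        return max((si + 1.0) / (ni + 2.0) for si, ni in zip(s, n))

    def pmax_estimate(s, n, samples=2000):
        """PMAX: E[max_i mu_i], approximated by Monte Carlo sampling
        each mu_i from its assumed Beta(s_i + 1, n_i - s_i + 1) posterior."""
        total = 0.0
        for _ in range(samples):
            total += max(random.betavariate(si + 1, ni - si + 1)
                         for si, ni in zip(s, n))
        return total / samples

    # Example cluster: two arms with 30/100 and 5/10 successes
    s, n = [30, 5], [100, 10]
    print(mean_estimate(s, n), max_estimate(s, n), pmax_estimate(s, n))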
Slide 28 (backup): Reward probability of a "cluster arm"
What is the reward probability of a "cluster arm"?
Eventually, every "cluster arm" must converge to the reward probability μmax of the most rewarding arm within that cluster, since a bandit policy is used within each cluster.
However, the "drift" causes problems.

Slide 29 (backup): Experiments
Simulation based on one week's worth of data from a large-scale ad-matching application.
10 clusters, with 11.3 arms/cluster on average.

Slide 30 (backup): Comparison of schemes
10 clusters, 11.3 arms/cluster; cluster separation Δ = 0.08; cluster size Aopt = 31; cohesiveness = 0.75.
[Figure: comparison of schemes.] MAX performs best.

Slide 31 (backup): Reward probability drift causes problems
[Figure: Cluster 1 (the opt cluster, containing the best arm with reward probability μopt) vs. Cluster 2.]
Intuitively, to reduce regret, we must quickly converge to the optimal "cluster arm" and then to the best arm within that cluster.
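As a small, self-contained illustration of the drift discussed on Slides 13 and 28, the following sketch runs UCB1 inside a single hypothetical cluster and prints the pooled MEAN estimate over time. The arm probabilities (0.30 and 0.10) and the use of UCB1 are assumptions chosen so that the movement from roughly μavg ≈ 0.20 toward μmax = 0.30 is easy to see.

    import math
    import random

    def simulate_drift(mus, horizon=20000, checkpoints=(100, 1000, 5000, 20000)):
        """Run UCB1 inside a single cluster and print the pooled MEAN
        estimate sum(s_i)/sum(n_i) over time. As the policy concentrates
        its pulls on the best arm, the estimate drifts from roughly the
        average mu toward the maximum mu."""
        n = [0] * len(mus)   # pulls per arm
        s = [0] * len(mus)   # successes per arm
        for t in range(1, horizon + 1):
            def ucb(i):
                if n[i] == 0:
                    return float("inf")
                return s[i] / n[i] + math.sqrt(2.0 * math.log(t) / n[i])
            i = max(range(len(mus)), key=ucb)
            n[i] += 1
            s[i] += 1 if random.random() < mus[i] else 0
            if t in checkpoints:
                print(f"t={t:6d}  MEAN estimate = {sum(s) / sum(n):.3f}")

    # Hypothetical cluster: mu = 0.30 and 0.10 (mu_avg = 0.20, mu_max = 0.30)
    simulate_drift([0.30, 0.10])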