Lecture 11: Regret and Poker
AI For Traditional Games
Prof. Nathan Sturtevant
Winter 2011

(One possible) Formal definition of regret
• Recall that a strategy is written as σ (Lecture 3)
• Let u_t(a) be the utility (payoff) of action a at step t
• Let σ_t(a) be the probability of taking action a
• The internal regret at time T of not taking action a, R^T_a, is:

    R^T_a = \sum_{t=1}^{T} \Big( u_t(a) - \sum_{b} \sigma_t(b)\, u_t(b) \Big)

• The first part is the payoff we would have received if we could swap the action we took with action a
• The second part is the payoff actually returned

Another regret-minimizing algorithm
• Define a policy at information set I and time T+1 as in Equation 3.5 below
• |A(I)| is the number of actions at information set I
• The + notation means max(R(·), 0), i.e. only non-negative values are used
• This produces a mixed strategy
• Try it on R/P/S with 2 players
• If you "train" both players this way, the average strategy will converge to an ε-Nash equilibrium

3.3.3 Minimizing Immediate Counterfactual Regret

In this section, we will show how counterfactual regret can be used to set the action probabilities of a strategy, in order to minimize counterfactual regret on future iterations. With this accomplished, we will have completed our task: by finding a player's counterfactual regret and updating its action probabilities according to two equations, we will have specified a system for using self-play to find ε-Nash equilibria.

In Equation 3.3, we stated that our immediate counterfactual regret was a consequence of our choice of actions at that information set. In that equation, we considered our regret to be the difference in utility between the actions taken by our strategy σ_i and the single action that would have maximized our utility over all examples of that information set, weighted by the probability of reaching that information set on iteration t if we had tried to reach it. Now, we will instead measure our regret for not taking each possible action. Associated with each action in each information set, we maintain the following regret value:

    R^{T}_{i,\mathrm{imm}}(I, a) = \frac{1}{T} \sum_{t=1}^{T} \pi^{\sigma^t}_{-i}(I) \left( u_i(\sigma^t|_{I \to a}, I) - u_i(\sigma^t, I) \right)        (3.4)

These R^T_i(I, a) values tell us, over T samples, how much we regret not folding, calling, or raising instead of taking the actions specified by σ_i. As we are concerned with positive regret, we define R^{T,+}_i(I, a) = max(R^T_i(I, a), 0). Then, to determine the new strategy to use at time T+1, we set the action probabilities as follows:

    \sigma^{T+1}_i(I)(a) = \begin{cases} \dfrac{R^{T,+}_i(I, a)}{\sum_{a \in A(I)} R^{T,+}_i(I, a)} & \text{if } \sum_{a \in A(I)} R^{T,+}_i(I, a) > 0, \\ \dfrac{1}{|A(I)|} & \text{otherwise.} \end{cases}        (3.5)

The relationship between positive counterfactual regret and the new strategy's action probabilities is simple. Each action is selected in proportion to the accumulated positive counterfactual regret we have had for not selecting that action in the past. In information sets where there is no positive regret, we set the action probabilities to a default setting: in this case, equal probability for each action.
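To make Equation 3.5 concrete, here is a minimal regret-matching sketch for a single information set. It is an illustration, not code from the thesis or the lecture, and the names (regret_matching_strategy, cumulative_regret) are my own assumptions.

```python
def regret_matching_strategy(cumulative_regret):
    """Map accumulated regrets R^T(I, a) to the next strategy sigma^{T+1}(I) (Equation 3.5)."""
    # Keep only positive regret: R^{T,+}(I, a) = max(R^T(I, a), 0).
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        # Each action in proportion to its accumulated positive regret.
        return [r / total for r in positive]
    # No positive regret anywhere: fall back to the uniform default.
    return [1.0 / len(cumulative_regret)] * len(cumulative_regret)

# Example: accumulated regrets for (fold, call, raise) after T iterations.
print(regret_matching_strategy([-1.2, 0.9, 2.7]))    # [0.0, 0.25, 0.75]
print(regret_matching_strategy([-0.4, -2.0, -0.1]))  # uniform fallback: [1/3, 1/3, 1/3]
```

In the first call only calling and raising have positive regret, so folding gets probability zero; in the second, no regret is positive and the strategy falls back to uniform.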
3.3.4 Counterfactual Regret Minimization Example

The above formal description of minimizing counterfactual regret may appear complex, but the computation is actually straightforward. We will now present a short example of how counterfactual regret minimization proceeds at a choice node (see Figures 3.1 and 3.2).

Figure 3.1: The first example of counterfactual regret minimization at a choice node. See Section 3.3.4 for an explanation.

Figure 3.2: The second example of counterfactual regret minimization at a choice node. See Section 3.3.4 for an explanation.

What about large trees?
• 2-player limit Texas hold 'em
• 52 choose 2 cards for us
• 50 choose 2 cards for the opponent
• 48 choose 3 cards on the flop, 45 on the turn, 44 on the river
• Between each of these we must have a strategy for play: b / c / f (with a limit of 3 or 4 raises per betting round)
• The strategy for the whole game is too large to learn as a whole

Poker CFR
• Build an imperfect-information tree for both players
• The strategy can be decomposed into each "independent" decision in the game
• Key insight: the regret of a whole strategy is bounded by the sum of the regret at each information set
• At each information set, we can decide how to act relatively independently of the whole game
• Repeat until converged:
    • Sample a perfect-information world
    • Traverse the tree, compute the regret for each action, and update the action probabilities
• The AVERAGE strategy played during training will be a N.E.

CFR traversal
• Compute π, the probability of the other players reaching this state (incrementally updated as you move down the tree)
• Apply each action and recursively compute the expected value of each move
• Multiply by σ and sum over all moves to get the payoff of the current strategy
• The difference between the value of a move and that payoff is the regret

CFR: traversing details
• At each information set, maintain the average immediate regret of Equation 3.4
• π^{σ^t}_{-i}(I) is the probability of reaching information set I under the strategies at time t, given that player i is trying to reach it
• u is the utility of a strategy, possibly given that we switch our action only at this information set
• Computing this requires traversing the underlying tree
• It also requires knowing both players' strategies
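The bullets above describe the traversal abstractly. The sketch below is a small, self-contained chance-sampled CFR for Kuhn poker (the three-card game the abstraction slides use later), written to match the listed steps: sample a deal, walk the tree while tracking reach probabilities, and accumulate regrets and the average strategy. It is my illustration rather than the course's or the thesis's code; every name in it (Node, cfr, regret_sum, strategy_sum) is an assumption.

```python
import random

ACTIONS = ["p", "b"]   # p = pass/check/fold, b = bet/call
CARDS = [1, 2, 3]

class Node:
    def __init__(self):
        self.regret_sum = [0.0, 0.0]    # accumulated counterfactual regret R(I, a)
        self.strategy_sum = [0.0, 0.0]  # pi_i-weighted strategies, for the average strategy

    def strategy(self, reach_i):
        # Regret matching (Equation 3.5), then accumulate toward the average strategy.
        positive = [max(r, 0.0) for r in self.regret_sum]
        total = sum(positive)
        strat = [p / total for p in positive] if total > 0 else [0.5, 0.5]
        for a in range(2):
            self.strategy_sum[a] += reach_i * strat[a]
        return strat

    def average_strategy(self):
        total = sum(self.strategy_sum)
        return [s / total for s in self.strategy_sum] if total > 0 else [0.5, 0.5]

nodes = {}  # information set key (own card + betting history) -> Node

def cfr(cards, history, reach_0, reach_1):
    """Return the expected value for the player to act, updating regrets along the way."""
    player = len(history) % 2
    # Terminal histories, valued from the perspective of the player who would act next.
    if history in ("pp", "bb", "pbb"):            # showdown for 1 or 2 chips
        stake = 1 if history == "pp" else 2
        return stake if cards[player] > cards[1 - player] else -stake
    if history in ("bp", "pbp"):                  # the other player folded
        return 1

    node = nodes.setdefault(str(cards[player]) + history, Node())
    reach_i = reach_0 if player == 0 else reach_1
    reach_opp = reach_1 if player == 0 else reach_0
    strat = node.strategy(reach_i)

    # Recursively evaluate each action; children return the opponent's value, so negate.
    util = [0.0, 0.0]
    node_value = 0.0
    for a, move in enumerate(ACTIONS):
        if player == 0:
            util[a] = -cfr(cards, history + move, reach_0 * strat[a], reach_1)
        else:
            util[a] = -cfr(cards, history + move, reach_0, reach_1 * strat[a])
        node_value += strat[a] * util[a]

    # Counterfactual regret: opponent reach probability times (action value - node value).
    for a in range(2):
        node.regret_sum[a] += reach_opp * (util[a] - node_value)
    return node_value

for _ in range(20000):
    cfr(random.sample(CARDS, 2), "", 1.0, 1.0)    # sample a "perfect-information world"

for key in sorted(nodes):
    print(key, [round(p, 3) for p in nodes[key].average_strategy()])
```

With enough iterations the printed average strategies approach a member of the known one-parameter family of Kuhn poker equilibria (for example, the first player ends up betting card 3 about three times as often as card 1).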
CFR Strategy Example
• (From Michael Johanson's MSc thesis)
• The immediate regret informs how we act at each time step
• The average strategy (defined by the immediate regret) over time will be a N.E.
• The average strategy is given by Equation 3.2 below
• Note that it uses π_i, not π_{-i}: a different weighting than the immediate regret

The average overall regret of player i at time T is

    R^T_i = \frac{1}{T} \max_{\sigma^*_i \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma^*_i, \sigma^t_{-i}) - u_i(\sigma^t) \right)        (3.1)

The average overall regret is thus the average of the differences in utility between the strategies used in each game and the single strategy that would have maximized utility over all T games. We also define the average strategy used by player i during this series of games. For each information set I ∈ I_i and each action a ∈ A(I), the average strategy is:

    \bar{\sigma}^T_i(I)(a) = \frac{\sum_{t=1}^{T} \pi^{\sigma^t}_i(I)\, \sigma^t(I)(a)}{\sum_{t=1}^{T} \pi^{\sigma^t}_i(I)}        (3.2)

The average strategy defines its action probabilities at each information set to be the action probabilities under each strategy, weighted by the probability of that strategy reaching the information set. In [Theorem 2], Zinkevich states the following well-known theorem that links the concepts of average strategy, overall regret, and ε-Nash equilibria: in a zero-sum game at time T, if both players' average overall regret is less than ε, then σ̄^T is a 2ε-equilibrium.

Consider the sequence of strategies that each player selects over the T games. If both players minimize regret over the course of these games such that their overall regret approaches zero as T goes to infinity, then their average strategies approach a Nash equilibrium. What is needed, then, is a regret-minimizing algorithm for selecting σ^t_i. Such an algorithm can then be used in a self-play series of games to approach an ε-Nash equilibrium. After any number of games, we can find the value of ε by measuring the average overall regret for all players, or by calculating a best response to the average strategy profile σ̄. Additionally, if our regret-minimizing algorithm provides bounds [...]

Designing for poker
• Poker is too large to solve with CFR directly
• Build an abstract game instead
• Only two things to abstract: states and actions
• Actions:
    • Restrict the number of bets
    • If someone bets more than we can handle, just call

Poker abstractions
• Suit isomorphisms (not really an abstraction)
• Translate states/moves from the real game into the abstract game
• Make the action given by the average strategy of the CFR-solved abstract game
• Take actions from the abstract game and convert them back into the real game

Abstracting state
• Simple example from Kuhn poker: a player's card can be 1, 2 or 3
• Can "merge" information sets together to reduce the number of states
• Can do so asymmetrically if necessary
    • e.g. the first player cannot distinguish 2 from 3
    • e.g. both players cannot distinguish 1 from 3

Larger scale abstractions
• Group (bucket) hands together that are similar
• The best way to do this is somewhat an open question
• Example grouping metrics:
    • Strength of the hand (chances of winning)
    • Potential of the hand (how likely it is to improve)
• Ideally, hands with the same betting sequences should be abstracted together
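As a toy illustration of bucketing by hand strength (the first grouping metric above), the sketch below assigns hands to equal-population percentile buckets. The strength numbers are made-up placeholders and the function name is my own; a real abstraction would estimate strengths by rolling out the remaining board cards.

```python
def percentile_buckets(hand_strengths, num_buckets):
    """Assign each hand to one of num_buckets equal-population buckets by strength."""
    ranked = sorted(hand_strengths, key=hand_strengths.get)   # weakest to strongest
    return {hand: rank * num_buckets // len(ranked) for rank, hand in enumerate(ranked)}

# Toy pre-flop strengths: assumed probability of beating a random hand at showdown.
strengths = {"72o": 0.35, "T9s": 0.52, "JTs": 0.54, "AKs": 0.67, "QQ": 0.80, "AA": 0.85}
print(percentile_buckets(strengths, 3))
# {'72o': 0, 'T9s': 0, 'JTs': 1, 'AKs': 1, 'QQ': 2, 'AA': 2}
```

All hands in the same bucket share an information set in the abstract game, so CFR only has to learn one strategy per bucket per betting sequence.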
More abstraction
• The flop, turn, and river cards increase the size of the game tree
• Don't represent the actual cards that come down; instead, just represent a transition between buckets
• What problems does this create?
    • The program doesn't know what cards it actually has
    • It doesn't know whether it has the "nuts" (the best hand possible)

What is the cost of abstraction?
• Are larger abstractions strictly better? That is, if my abstract game is closer to the real game, will my program play the real game better?
• Yes and no
• Abstraction is pathological: the size of the abstraction doesn't predict the performance of a best response against the resulting strategy
• BUT bigger abstractions make fewer dominated errors
• So, in practice, they are likely to perform better