Lecture 11: Regret and Poker
AI For Traditional Games
Prof. Nathan Sturtevant
Winter 2011

(One possible) Formal definition of regret
• Let u_t(a) be the utility (payoff) of the action taken at step t
• Let σ_t(a) be the probability of taking action a
• Recall that a strategy is written as σ (Lecture 3)
• Internal regret at time T of not taking action a, R^T_a, is:
  • R^T_a = Σ_{t=1}^{T} ( u_t(a) − Σ_b σ_t(b) u_t(b) )
• The second part is the actual payoff returned
• The first part is the payoff that we would get if we could swap the action we took with the action a (see the sketch below)
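A minimal sketch of this bookkeeping in Python; the per-step utilities and strategies below are invented for illustration, and only the regret sum itself comes from the slide:

actions = ["rock", "paper", "scissors"]

def step_regret(utils, sigma):
    """utils[a]: payoff action a would earn this step; sigma[a]: probability we played a."""
    expected = sum(sigma[a] * utils[a] for a in actions)   # the actual (expected) payoff
    return {a: utils[a] - expected for a in actions}       # what we gave up by not playing a

regret = {a: 0.0 for a in actions}
history = [
    ({"rock": 0, "paper": 1, "scissors": -1}, {"rock": 1.0, "paper": 0.0, "scissors": 0.0}),
    ({"rock": -1, "paper": 0, "scissors": 1}, {"rock": 0.5, "paper": 0.5, "scissors": 0.0}),
]
for utils, sigma in history:
    for a, r in step_regret(utils, sigma).items():
        regret[a] += r
print(regret)   # accumulated R^T_a for each action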
(Excerpt from Michael Johanson's MSc thesis:)

3.3.3 Minimizing Immediate Counterfactual Regret

In this section, we will show how counterfactual regret can be used to set the action probabilities of a strategy, in order to minimize counterfactual regret on future iterations. With this accomplished, we will have completed our task: by finding a player's counterfactual regret and updating its action probabilities according to two equations, we will have specified a system for using self-play to find ε-Nash equilibria.

In Equation 3.3, we stated that our immediate counterfactual regret was a consequence of our choice of actions at that information set. In that equation, we considered our regret to be the difference in utility between the actions taken by our strategy σ_i, and the single action that would have maximized our utility for all examples of that information set, weighted by the probability of reaching that information set on iteration t, if we had tried to reach that information set. Now, we will instead measure our regret for not taking each possible action. Associated with each action in each information set, we will maintain the following regret value:

R^T_{i,imm}(I, a) = (1/T) Σ_{t=1}^{T} π^{σ^t}_{-i}(I) ( u_i(σ^t|_{I→a}, I) − u_i(σ^t, I) )    (3.4)
These R^T_i(I, a) values tell us, over T samples, how much we regret not folding, calling, or raising instead of taking the actions specified by σ_i. As we are concerned with positive regret, we define R^{T,+}_i(I, a) = max(R^T_i(I, a), 0). Then, to determine the new strategy to use at time T + 1, we set the action probabilities as follows:

σ^{T+1}_i(I)(a) = R^{T,+}_i(I, a) / Σ_{a∈A(I)} R^{T,+}_i(I, a)    if Σ_{a∈A(I)} R^{T,+}_i(I, a) > 0
σ^{T+1}_i(I)(a) = 1 / |A(I)|                                      otherwise    (3.5)

The relationship between positive counterfactual regret and the new strategy's action probabilities is simple. Each action is selected in proportion to the accumulated positive counterfactual regret we have had for not selecting that action in the past. In information states where there is no positive regret, we set the action probabilities to a default setting: in this case, equal probability to each action.
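As a quick illustration with made-up numbers (not an example from the thesis): suppose that at some information set I with A(I) = {fold, call, raise} the accumulated regrets are R^T_i(I, fold) = −1, R^T_i(I, call) = 2, and R^T_i(I, raise) = 6. The positive parts are 0, 2, and 6, which sum to 8, so Equation 3.5 gives σ^{T+1}_i(I) = (0, 2/8, 6/8) = (0, 0.25, 0.75). If no action had positive accumulated regret, the sum would be 0 and the strategy would fall back to the uniform default (1/3, 1/3, 1/3).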
3.3.4 Counterfactual Regret Minimization Example

The above formal description of minimizing counterfactual regret may appear complex, but the computation is actually straightforward. We will now present a short example of how counterfactual regret minimization works.
Another regret-minimizing algorithm
• Define a policy at information set I and time T+1 as:
  • σ^{T+1}_i(I)(a) = R^{T,+}_i(I, a) / Σ_{a∈A(I)} R^{T,+}_i(I, a)    if Σ_{a∈A(I)} R^{T,+}_i(I, a) > 0
  • σ^{T+1}_i(I)(a) = 1 / |A(I)|    otherwise
• Where |A(I)| is the number of actions at information set I
• The + notation is max(R(), 0); i.e., non-negative values
• This produces a mixed strategy
• Try it on R/P/S with 2 players (sketched below)
• If you "train" for 2 players, the average strategy will converge to ε-Nash
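A small self-contained sketch of that R/P/S experiment, using regret matching in self-play; the payoff convention, the sampling of moves, and the iteration count are my own choices rather than anything from the slides:

import random

ACTIONS = 3                               # 0 = rock, 1 = paper, 2 = scissors

def payoff(a, b):
    """+1 if action a beats action b, -1 if it loses, 0 on a tie."""
    return 0 if a == b else (1 if (a - b) % 3 == 1 else -1)

def strategy_from_regret(regret):
    """Play in proportion to positive accumulated regret, else uniformly."""
    pos = [max(r, 0.0) for r in regret]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [1.0 / ACTIONS] * ACTIONS

regret = [[0.0] * ACTIONS for _ in range(2)]
strategy_sum = [[0.0] * ACTIONS for _ in range(2)]

for _ in range(100000):
    strat = [strategy_from_regret(regret[p]) for p in (0, 1)]
    moves = [random.choices(range(ACTIONS), weights=strat[p])[0] for p in (0, 1)]
    for p in (0, 1):
        opp_move = moves[1 - p]
        played_utility = payoff(moves[p], opp_move)
        for a in range(ACTIONS):
            # regret of not having played a against the opponent's sampled move
            regret[p][a] += payoff(a, opp_move) - played_utility
            strategy_sum[p][a] += strat[p][a]

average = [s / sum(strategy_sum[0]) for s in strategy_sum[0]]
print(average)    # the average strategy drifts toward the equilibrium [1/3, 1/3, 1/3]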
Poker
• 2-player limit Texas hold 'em
• 52 choose 2 cards for us
• 50 choose 2 cards for opponent
• 48 choose 3 on flop
• 45 on turn
• 44 on river (these deal counts are totaled in the snippet below)
• Between each of these must have strategy for play
• b / c / f (limit of 3/4 raises per betting round)
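A quick back-of-the-envelope count of the card deals alone, ignoring the betting rounds in between, just to make the size concrete:

from math import comb

# Card combinations listed on the slide, from our point of view, ignoring betting.
deals = comb(52, 2) * comb(50, 2) * comb(48, 3) * 45 * 44
print(f"{deals:,}")   # 55,627,620,048,000 -- roughly 5.6e13 deals before any betting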
What about large trees?
• Strategy for the whole game is too large to learn as a whole strategy
• But, strategy can be decomposed to each "independent" decision in the game
• Key insight: The regret of a whole strategy is bounded by the sum of regret at each information set
• At each information set, can decide how to act relatively independently of the whole game

CFR
• Build an imperfect-information (ii) tree for both players
• Repeat until converged:
  • Sample a perfect-information world
  • Traverse the tree and compute the regret for each action, and update the action probabilities
• The AVERAGE strategy played during training will be a N.E. (per-node bookkeeping sketched below)
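A minimal sketch of the per-information-set bookkeeping this loop needs; the class and method names are my own, not code from the lecture:

class InfoSetNode:
    """Accumulated regret and strategy sums for one information set."""

    def __init__(self, num_actions):
        self.regret_sum = [0.0] * num_actions
        self.strategy_sum = [0.0] * num_actions

    def strategy(self):
        """Current strategy: proportional to positive regret, else uniform (Equation 3.5)."""
        pos = [max(r, 0.0) for r in self.regret_sum]
        total = sum(pos)
        n = len(pos)
        return [p / total for p in pos] if total > 0 else [1.0 / n] * n

    def average_strategy(self):
        """The strategy that converges toward equilibrium: normalized strategy sums."""
        total = sum(self.strategy_sum)
        n = len(self.strategy_sum)
        return [s / total for s in self.strategy_sum] if total > 0 else [1.0 / n] * n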
CFR traversal
• Compute π -- the probability of reaching this state by the other players (incrementally updated as you move down the tree)
• Apply each action and recursively compute the expected value of each move
• Multiply by σ and sum over all moves to get the payoff of the current strategy
• The difference between the value of the move and the payoff is the regret
• Computing this requires traversing the underlying tree
• Also requires knowing both players' strategies

CFR: Traversing details
• At each state, maintain the average immediate regret:
  • R^T_{i,imm}(I, a) = (1/T) Σ_{t=1}^{T} π^{σ^t}_{-i}(I) ( u_i(σ^t|_{I→a}, I) − u_i(σ^t, I) )    (3.4)
• π is the probability of reaching information set I given the strategies at time t and that player i is trying to reach that information set
• u is the utility of a strategy, possibly given that we switch our action only at this information set (both slides are put together in the Kuhn poker sketch below)
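Putting the two slides above together, here is a compact chance-sampled CFR for Kuhn poker. This is a sketch in the style of the usual textbook implementation; the structure, the names, and the "p" (pass/fold) / "b" (bet/call) encoding are my own choices, not code from the lecture:

import random

nodes = {}                                  # information set label -> Node

class Node:
    def __init__(self):
        self.regret_sum = [0.0, 0.0]        # one entry per action: 0 = pass, 1 = bet
        self.strategy_sum = [0.0, 0.0]

    def strategy(self, own_reach):
        """Regret matching (Eq. 3.5); also accumulate the average strategy (Eq. 3.2)."""
        pos = [max(r, 0.0) for r in self.regret_sum]
        total = sum(pos)
        strat = [x / total for x in pos] if total > 0 else [0.5, 0.5]
        for a in range(2):
            self.strategy_sum[a] += own_reach * strat[a]
        return strat

    def average_strategy(self):
        total = sum(self.strategy_sum)
        return [x / total for x in self.strategy_sum] if total > 0 else [0.5, 0.5]

def cfr(cards, history, reach0, reach1):
    """Return the expected value of `history` for the player about to act."""
    player, opponent = len(history) % 2, (len(history) + 1) % 2
    # Terminal payoffs (antes of 1, bets of 1).
    if len(history) > 1:
        if history[-1] == "p":
            if history == "pp":                        # check, check: showdown for the ante
                return 1 if cards[player] > cards[opponent] else -1
            return 1                                    # the opponent folded to a bet
        if history[-2:] == "bb":                        # bet and call: showdown for 2
            return 2 if cards[player] > cards[opponent] else -2
    node = nodes.setdefault(str(cards[player]) + history, Node())
    strat = node.strategy(reach0 if player == 0 else reach1)
    action_util = [0.0, 0.0]
    node_util = 0.0
    for a, move in enumerate("pb"):
        if player == 0:                                 # child value is for the other player, so negate
            action_util[a] = -cfr(cards, history + move, reach0 * strat[a], reach1)
        else:
            action_util[a] = -cfr(cards, history + move, reach0, reach1 * strat[a])
        node_util += strat[a] * action_util[a]
    opp_reach = reach1 if player == 0 else reach0
    for a in range(2):
        # Counterfactual regret, weighted by the opponent's reach probability (Eq. 3.4).
        node.regret_sum[a] += opp_reach * (action_util[a] - node_util)
    return node_util

cards = [1, 2, 3]
for _ in range(100000):
    random.shuffle(cards)                               # sample a perfect-information world
    cfr(cards, "", 1.0, 1.0)

for label in sorted(nodes):
    print(label, [round(p, 3) for p in nodes[label].average_strategy()])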
(Further excerpt from Michael Johanson's MSc thesis:)

R^T_i = (1/T) max_{σ*_i ∈ Σ_i} Σ_{t=1}^{T} ( u_i(σ*_i, σ^t_{-i}) − u_i(σ^t) )    (3.1)

The average overall regret is thus the average of the differences in utility between the strategy used in each game, and the single strategy that would have maximized utility over all T games.

We also define the average strategy used by player i during this series of games. For each information set I ∈ I_i, for each action a ∈ A(I), we define the average strategy as:

σ̄^T_i(I)(a) = ( Σ_{t=1}^{T} π^{σ^t}_i(I) σ^t(I)(a) ) / ( Σ_{t=1}^{T} π^{σ^t}_i(I) )    (3.2)

The average strategy defines its action probabilities for each information set to be the action probabilities of each strategy, weighted by the probability of that strategy reaching the information set.

In [Theorem 2], Zinkevich states the following well known theorem that links the concepts of average strategy, overall regret, and ε-Nash equilibria:

In a zero-sum game at time T, if both players' average overall regret is less than ε, then σ̄^T is a 2ε-Nash equilibrium.

Consider the sequence of strategies that each player selects over the T games. If both players are choosing strategies over the course of these games such that their overall regret approaches zero as T approaches infinity, then their average strategies approach a Nash equilibrium. What is needed, then, is a regret minimizing algorithm for selecting σ^t_i. Such an algorithm can then be used in self-play over a series of games to approach an ε-Nash equilibrium. After any number of games, we can find the value of ε by measuring the average overall regret for all players, or by calculating a best response to the average strategy profile σ̄. Additionally, if our regret minimizing algorithm provides bounds ...

Strategy
• The immediate regret informs how we act at each time step
• The average strategy (defined by imm. regret) over time will be a N.E.
• The average strategy is:
  • σ̄^T_i(I)(a) = ( Σ_{t=1}^{T} π^{σ^t}_i(I) σ^t(I)(a) ) / ( Σ_{t=1}^{T} π^{σ^t}_i(I) )
• Note that this uses π_i, not π_{-i}
• That is, a different weighting than the imm. regret (contrasted in the sketch below)
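A small sketch contrasting the two weightings with invented numbers: the regret update uses the opponent's reach probability (as in Equation 3.4), while the average strategy uses our own reach probability (as in Equation 3.2):

regret_sum = [0.0, 0.0]
strategy_sum = [0.0, 0.0]

def update(strat, action_utils, reach_me, reach_opp):
    """One information-set update showing the two different reach weightings."""
    node_util = sum(p * u for p, u in zip(strat, action_utils))
    for a, u in enumerate(action_utils):
        regret_sum[a] += reach_opp * (u - node_util)    # pi_{-i} weighting (Eq. 3.4)
        strategy_sum[a] += reach_me * strat[a]          # pi_i weighting (Eq. 3.2)

update(strat=[0.4, 0.6], action_utils=[1.0, -0.5], reach_me=0.3, reach_opp=0.2)
print([round(x, 3) for x in regret_sum], [round(x, 3) for x in strategy_sum])
# [0.18, -0.12] [0.12, 0.18]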
CFR Example computation
• (From Michael Johanson's MSc thesis)

Figure 3.1: The first example of counterfactual regret minimization at a choice node. See Section 3.3.4 for an explanation.

Figure 3.2: The second example of counterfactual regret minimization at a choice node. See Section 3.3.4 for an explanation.
Designing for poker
• Poker is too large to solve with CFR
• Build abstract game instead
• Translate state/moves from real game into abstract game
• Make action in abstract game (solved by average CFR)
• Take actions from abstract game and convert back into real game
• Only two things to abstract:
  • States, Actions

Poker abstractions
• Suit isomorphisms (not really an abstraction)
• Actions:
  • Restrict the number of bets
  • If someone bets more than we can handle, just call (see the translation sketch below)
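A hypothetical sketch of what that action translation could look like; the function, the raise threshold, and the names are my own illustration, not the lecture's actual system:

def translate_action(real_action, raises_so_far, max_raises=4):
    """Map an observed real-game action onto the abstract game's f / c / b actions."""
    if real_action == "fold":
        return "f"
    if real_action in ("check", "call"):
        return "c"
    # A raise beyond the abstraction's limit is treated as a call, matching
    # "if someone bets more than we can handle, just call".
    return "b" if raises_so_far < max_raises else "c"

print(translate_action("raise", raises_so_far=2))   # 'b': still within the bet limit
print(translate_action("raise", raises_so_far=4))   # 'c': beyond the limit, treat as a call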
Abstracting state
• Simple example from Kuhn poker:
• Can have card 1, 2 or 3
• Can 'merge' information sets together to reduce number of states
• Can do so asymmetrically if necessary
  • eg first player cannot distinguish 2 from 3 (sketched below)
  • eg both players cannot distinguish 1 from 3
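A made-up illustration of that kind of merge, written as a per-player bucketing function:

def bucket(player, card):
    """Player 0 is made unable to distinguish card 2 from card 3; player 1 sees the exact card."""
    if player == 0 and card in (2, 3):
        return "2or3"           # the merged information "bucket"
    return str(card)

print(bucket(0, 2), bucket(0, 3), bucket(1, 3))   # 2or3 2or3 3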
Larger scale abstractions
• Group (bucket) hands together that are similar
• The best way to do this is somewhat of an open question
• Example grouping metrics:
  • Strength of the hand (chances of winning)
  • Potential of the hand (how likely is it to improve)
• Ideally, hands with the same betting sequences should be abstracted together (see the bucketing sketch below)
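A sketch of bucketing by the first metric. The strength numbers would come from a hand evaluator or Monte-Carlo rollouts and are invented (approximate) here, and the bucket count is arbitrary:

def bucket_by_strength(hand_strengths, num_buckets=5):
    """Assign each hand an integer bucket so that the buckets are equally full."""
    order = sorted(hand_strengths, key=hand_strengths.get)
    return {hand: rank * num_buckets // len(order) for rank, hand in enumerate(order)}

strengths = {"72o": 0.31, "T9s": 0.54, "AKs": 0.67, "JJ": 0.77, "AA": 0.85}
print(bucket_by_strength(strengths, num_buckets=3))
# e.g. {'72o': 0, 'T9s': 0, 'AKs': 1, 'JJ': 1, 'AA': 2}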
More abstraction
• Flop/turn/river increase the size of the game tree
• Don't represent the cards that come down
• Instead, just represent a transition between buckets
• What problems does this create?
  • Program doesn't know what cards it has
  • Doesn't know if it has the 'nuts' (best hand possible)
What is the cost of abstraction?
• Are larger abstractions strictly better?
• eg if my abstract game is closer to the real game, will my program play the game better?
• Yes & No
• Abstraction is pathological; the size of the abstraction doesn't predict the performance of the best response
• BUT, bigger abstractions make fewer dominated errors
• So, in practice, they are likely to perform better