
Texas Hold'em Poker with Q-Learning
First Round (pre-flop)
Second Round (flop)
Third Round (turn)
Final Round (river)
End Round (we win)
[Diagrams: the player's hole cards, the community cards, and the opponent's hole cards at each round. Note how initially low hands can win later when more community cards are added.]
The Problem
• The state space is too big
• Nearly 2.6 million states would be needed just to represent the possible combinations of a single 5-card poker hand (see the quick check below)
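A quick check of that figure (plain Python; just the number of ways to choose 5 cards from a 52-card deck):

```python
from math import comb

# Number of distinct 5-card hands drawn from a 52-card deck.
print(comb(52, 5))  # 2598960
```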
Our Solution
• It doesn't matter what your exact cards are, only how they compare to the hands your opponent could hold.
• The most important piece of information the agent needs is how many possible two-card combinations could make a better hand.
Our State Representation
[Round]
[Opponent’s Last Bet]
[# Possible Better Hands]
[Best Obtainable Hand]
• Example: [4] [3] [10] [3]
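A minimal sketch of how this four-field state could be encoded, assuming a Python named tuple; the field names and the integer codes for the opponent's last bet are our own illustration, not the project's exact representation:

```python
from typing import NamedTuple

class PokerState(NamedTuple):
    """Illustrative encoding of the four-field state."""
    round: int            # 1 = pre-flop, 2 = flop, 3 = turn, 4 = river
    opp_last_bet: int     # e.g. 0 = not yet, 1 = check, 2 = call, 3 = raise
    n_better_hands: int   # how many two-card holdings currently beat us
    best_hand: int        # rank category of the best hand we can make

# The example state from the slide: [4] [3] [10] [3]
example_state = PokerState(4, 3, 10, 3)
```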
To Calculate the # Better Hands
• Evaluate our hand (hole cards plus community cards), then evaluate every other possible two-card holding combined with the same community cards, and count how many of them make a better hand (see the sketch below).
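A sketch of this counting loop, assuming a hypothetical hand evaluator `evaluate_hand` that returns a comparable score for the best five-card hand that can be made from the given cards:

```python
from itertools import combinations

def count_better_hands(my_hole, community, unseen_cards, evaluate_hand):
    """Count how many opponent two-card holdings beat our current hand.

    my_hole       -- our two hole cards
    community     -- the community cards revealed so far
    unseen_cards  -- every card not visible to us
    evaluate_hand -- hypothetical evaluator: higher score = better hand
    """
    my_score = evaluate_hand(list(my_hole) + list(community))
    better = 0
    for opp_hole in combinations(unseen_cards, 2):  # every possible opponent holding
        if evaluate_hand(list(opp_hole) + list(community)) > my_score:
            better += 1
    return better
```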
Q-lambda Implementation (I)
• The current state of the game is stored in a
variable
• Each time the community cards are updated, or
the opponent places a bet, we update our current
state.
• For all states, the Q-values of each betting action are stored in an array (a storage sketch follows the diagram below).
[Diagram: some state with its stored Q-values: Fold = -0.9954, Check = 2.014, Call = 1.745, Raise = -3.457]
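A sketch of how such a table could be kept in memory, assuming a Python dictionary keyed by the state; the zero initialisation and the action names are our own choices:

```python
from collections import defaultdict

ACTIONS = ("fold", "check", "call", "raise")

# Q[state][action] -> current value estimate; unseen states start at 0.0.
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def best_action(state, legal_actions):
    """Return the legal betting action with the highest Q-value in this state."""
    return max(legal_actions, key=lambda a: Q[state][a])
```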
Q-lambda Implementation (II)
• Eligibility trace: we keep a vector of the state-action pairs that are responsible for us being in the current state.
[Diagram: did action a1 in state s1 → did action a2 in state s2 → now we are in the current state]
• Each time we make a betting decision, we add the current state and the chosen action to the eligibility trace (a sketch follows below).
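A sketch of recording decisions into the trace. The ε-greedy exploration here is an assumption on our part (the slides do not say how actions are explored), and `best_action` is the helper sketched above:

```python
import random

EPSILON = 0.1            # exploration rate (illustrative value, an assumption)
eligibility_trace = []   # (state, action) pairs that led to the current position

def choose_and_record(state, legal_actions):
    """Pick a betting action and append the (state, action) pair to the trace."""
    if random.random() < EPSILON:
        action = random.choice(legal_actions)       # explore
    else:
        action = best_action(state, legal_actions)  # exploit the Q-table
    eligibility_trace.append((state, action))
    return action
```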
Q-lambda Implementation (III)
• At the end of each game, we use the money won or lost to reward or punish the state-actions in the eligibility trace (see the sketch below).
[Diagram: (s1, a1), (s2, a2), (s3, a3) → got reward R → update Q[sn, an] for every pair in the trace]
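One plausible reading of this end-of-game update, sketched with λ-decayed credit so that later decisions are adjusted more strongly toward the final reward; the ALPHA and LAMBDA values and the exact update form are illustrative, not the project's actual code, and `Q` and `eligibility_trace` are the objects sketched above:

```python
ALPHA = 0.1    # learning rate (illustrative)
LAMBDA = 0.8   # trace-decay factor (illustrative)

def end_of_game_update(reward):
    """Reward/punish every state-action pair stored in the eligibility trace."""
    credit = 1.0
    # Walk the trace backwards so the most recent decision receives full credit.
    for state, action in reversed(eligibility_trace):
        Q[state][action] += ALPHA * credit * (reward - Q[state][action])
        credit *= LAMBDA
    eligibility_trace.clear()
```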
Testing Our Q-lambda Player
Play Against the Random Player
Play Against the Bluffer
Play Against Itself
Play Against the Random Player
[Chart: QPlayer vs Random Player; x-axis: Games (1-96), y-axis: Money (940-1080)]
Q-lambda learns very quickly how to beat the random player.
Play Against the Random Player (II)
[Chart: QPlayer vs Random; x-axis: Games (up to 9000), y-axis: Money (0-7000)]
Same graph, extended to 9000 games.
Play Against the Bluffer
• The bluffer always raises; when raising is not possible, it calls (a sketch of this policy follows below).
• It is not trivial to defeat the bluffer, because you need to fold on weaker hands and keep raising on better hands.
• Our Q-lambda player does very poorly against the bluffer!
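A sketch of the bluffer's fixed policy as described above (the action names match the illustrative Q-table sketch earlier):

```python
def bluffer_action(legal_actions):
    """The bluffer's policy: always raise; if raising is illegal, call."""
    return "raise" if "raise" in legal_actions else "call"
```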
Play Against the Bluffer (II)
[Chart: QPlayer vs Bluffer; x-axis: Games (100 to ~66,400), y-axis: QPlayer's Money (-140,000 to 20,000)]
In our many trials with different alpha and lambda values, the Q-lambda player always lost with a linear slope.
Play Against the Bluffer (III)
• Why is Q-lambda losing to the bluffer?
• To answer this, we looked at the Q-value tables.
• With good hands, Q-lambda has learned to Raise or Call.
[Table: Q-values for Round = 3, OpptBet = Raise]
Play Against the Bluffer (IV)
• The problem is that even with a very poor hand in the second round, it still does not learn to fold and continues to raise, call, or check.
• The same problem exists with poor hands in the other rounds.
[Table: Q-values for Round = 1, OpptBet = Not_Yet, BestHandPossible = 'ok']
Play Against Itself
• We played the Q-lambda player against itself, hoping that
it would eventually converge on some strategy.
[Chart: QPlayer vs QPlayer; x-axis: Games (1-985), y-axis: Money (-18,000 to 2,000)]
Play Against Itself (II)
• We also graphed the Q-values of a few
particular states over time, to see if they
converge to meaningful values.
• The results were mixed: for some states the Q-values converge completely, while for others they remain almost random.
Play Against Itself (III)
• With a good hand in the last round, the Q-values have converged so that Calling, followed by Raising, is good, and Folding is very bad.
[Chart: Q-values of FOLD, CALL, and RAISE over Games (3,400 to ~57,800); y-axis: Q-value (-8 to 12)]
Play Against Itself (IV)
• With a medium hand in the last round, the Q-values do not clearly converge. Folding still looks very bad, but there is no preference between calling and raising.
[Chart: Q-values of FOLD, CALL, and RAISE over Games (3,200 to ~57,600); y-axis: Q-value (-15 to 15)]
Play Against Itself (V)
• With a very bad hand in the last round, the Q-values do not converge at all. This is clearly wrong, since under an optimal policy Folding would have a higher value.
[Chart: Q-values of FOLD, CALL, and RAISE over Games (3,200 to ~57,600); y-axis: Q-value (-14 to 4)]
Why Do the Q-values Not Converge?
• Poker cannot be represented with our state representation (our states are too broad or are missing some critical aspects of the game)
• The ALPHA and LAMBDA factors are incorrect
• We have not run the game for long enough
Conclusion
• Our state representation and Q-lambda implementation are able to learn some aspects of poker (for instance, the player learns to Raise or Call when it has a good hand in the last round).
• However, in our tests it does not converge to an optimal policy.
• More experimentation with the alpha and lambda parameters, and with the state representation, may result in better convergence.