Texas Hold'em Poker with Q-Learning

First Round (pre-flop) · Second Round (flop) · Third Round (turn) · Final Round (river) · End (we win)
[Card diagrams on these slides show the player's hole cards, the community cards, and the opponent's hole cards at each stage. Note how initially low hands can win later when more community cards are added.]

The Problem
• The state space is too big.
• Over 2 million states would be needed just to represent the possible combinations of a single 5-card poker hand (C(52,5) = 2,598,960).

Our Solution
• It does not matter exactly which cards you hold, only how they relate to what your opponent could be holding.
• The most important piece of information is how many possible two-card combinations could make a better hand than ours.

Our State Representation
[Round] [Opponent's Last Bet] [# Possible Better Hands] [Best Obtainable Hand]
• Example: [4] [3] [10] [3]

To Calculate the # of Better Hands
• Evaluate our own hand: our two hole cards together with the community cards.
• Evaluate every other possible pair of hole cards together with the same community cards.
• Count how many of those holdings make a better hand than ours.

Q-lambda Implementation (I)
• The current state of the game is stored in a variable.
• Each time the community cards are updated or the opponent places a bet, we update the current state.
• For every state, the Q-value of each betting action is stored in an array, e.g. for some state: Fold = -0.9954, Check = 2.014, Call = 1.745, Raise = -3.457.

Q-lambda Implementation (II)
• Eligibility trace: we keep a vector of the state-action pairs that are responsible for us being in the current state, e.g. (s1, a1) → (s2, a2) → current state.
• Each time we make a betting decision, we append the current state and the chosen action to the eligibility trace.

Q-lambda Implementation (III)
• At the end of each game, the money won or lost becomes the reward R used to reward or punish every state-action pair in the eligibility trace: for each traced (sn, an), Q[sn, an] is updated toward R.
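The slides describe this bookkeeping but not the code, so the following is a minimal Python sketch of one way to implement it. It assumes the four-field state tuple above, the four betting actions, epsilon-greedy action selection (not specified on the slides), and an end-of-hand update in which each traced state-action is moved toward the final money won or lost, discounted by lambda according to its distance from the end of the hand. The names QLambdaTable, ALPHA, LAMBDA, and EPSILON, and the parameter values, are illustrative and not taken from the original implementation.

```python
import random
from collections import defaultdict

# The four betting actions, used as indices into each state's Q-value array.
FOLD, CHECK, CALL, RAISE = range(4)
ACTIONS = (FOLD, CHECK, CALL, RAISE)

ALPHA = 0.1     # learning rate (illustrative value, not from the slides)
LAMBDA = 0.8    # eligibility-trace decay (illustrative value)
EPSILON = 0.1   # exploration rate for epsilon-greedy selection (assumed)


class QLambdaTable:
    """Q-values plus an eligibility trace, following the bookkeeping on the slides."""

    def __init__(self):
        # One array of four Q-values per state, where a state is the tuple
        # (round, opponents_last_bet, num_better_hands, best_obtainable_hand).
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))
        # Eligibility trace: the (state, action) pairs chosen during this hand.
        self.trace = []

    def choose_action(self, state, legal_actions):
        """Pick a betting action (epsilon-greedy) and record it in the trace."""
        if random.random() < EPSILON:
            action = random.choice(legal_actions)
        else:
            action = max(legal_actions, key=lambda a: self.q[state][a])
        self.trace.append((state, action))  # remembered for the end-of-hand update
        return action

    def end_of_hand(self, reward):
        """Reward or punish every traced state-action with the money won or lost.

        The most recent decision gets the full update; earlier decisions are
        discounted by an extra factor of LAMBDA per step back in the hand.
        """
        decay = 1.0
        for state, action in reversed(self.trace):
            old = self.q[state][action]
            self.q[state][action] = old + ALPHA * decay * (reward - old)
            decay *= LAMBDA
        self.trace = []  # start the next hand with an empty trace


# Usage: one imaginary hand, using the example state from the slides.
player = QLambdaTable()
state = (4, 3, 10, 3)             # [Round][Opp. bet][# better hands][best hand]
player.choose_action(state, [FOLD, CALL, RAISE])
player.end_of_hand(reward=40)     # money won on this hand
```

Updating only at the end of the hand matches the scheme on the slides: the reward (money won or lost) is only known once the hand is over, so intermediate betting decisions carry no reward of their own.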
Testing Our Q-lambda Player
• Play against the random player
• Play against the bluffer
• Play against itself
(A minimal sketch of the two fixed opponents appears after the Conclusion slide.)

Play Against the Random Player
[Chart: QPlayer vs. Random Player — money over roughly the first 100 games]
• Q-lambda learns very quickly how to beat the random player.

Play Against the Random Player (II)
[Chart: QPlayer vs. Random — the same curve, extended to 9000 games]

Play Against the Bluffer
• The bluffer always raises; when raising is not possible, it calls.
• It is not trivial to defeat the bluffer, because you need to fold weaker hands and keep raising better hands.
• Our Q-lambda player does very poorly against the bluffer!

Play Against the Bluffer (II)
[Chart: QPlayer vs. Bluffer — QPlayer's money over about 66,000 games]
• In our many trials with different alpha and lambda values, the Q-lambda player always lost at a roughly linear rate.

Play Against the Bluffer (III)
• Why is Q-lambda losing to the bluffer?
• To answer this, we looked at the Q-value tables.
• With good hands, Q-lambda has learned to Raise or Call.
[Table: Q-values from Round = 3, OpptBet = Raise]

Play Against the Bluffer (IV)
• The problem is that even with a very poor hand in the second round, it still does not learn to fold and continues to raise, call, or check.
• The same problem exists with poor hands in the other rounds.
[Table: Q-values from Round = 1, OpptBet = Not_Yet, BestHandPossible = 'ok']

Play Against Itself
• We played the Q-lambda player against itself, hoping that it would eventually converge on some strategy.
[Chart: QPlayer vs. QPlayer — money over about 1000 games]

Play Against Itself (II)
• We also graphed the Q-values of a few particular states over time, to see whether they converge to meaningful values.
• The results were mixed: for some states the Q-values converge completely, while for others they are almost random.

Play Against Itself (III)
• With a good hand in the last round, the Q-values have converged: Calling, and after that Raising, are good, and Folding is very bad.
[Chart: Fold/Call/Raise Q-values over about 58,000 games]

Play Against Itself (IV)
• With a medium hand in the last round, the Q-values do not clearly converge. Folding still looks very bad, but there is no clear preference between calling and raising.
[Chart: Fold/Call/Raise Q-values over about 58,000 games]

Play Against Itself (V)
• With a very bad hand in the last round, the Q-values do not converge at all. This is clearly wrong, since under an optimal policy folding would have a higher value.
[Chart: Fold/Call/Raise Q-values over about 58,000 games]

Why Don't the Q-values Converge?
• Poker cannot be represented with our state representation (our states are too broad, or are missing some critical aspects of the game).
• The alpha and lambda factors are incorrect.
• We have not run the game for long enough.

Conclusion
• Our state representation and Q-lambda implementation are able to learn some aspects of poker (for instance, to Raise or Call with a good hand in the last round).
• However, in our tests it does not converge to an optimal policy.
• More experimentation with the alpha and lambda parameters, and with the state representation, may result in better convergence.
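For reference, here is a similarly minimal sketch of the two fixed baseline opponents used in the experiments above. The bluffer's rule is exactly the one stated on the "Play Against the Bluffer" slide; the random player is assumed to pick uniformly among its legal actions, which the slides do not spell out.

```python
import random

FOLD, CHECK, CALL, RAISE = range(4)


def random_player(legal_actions):
    """Assumed behaviour: choose uniformly among the legal betting actions."""
    return random.choice(legal_actions)


def bluffer(legal_actions):
    """Always raise; when raising is not possible, call (as stated on the slides)."""
    return RAISE if RAISE in legal_actions else CALL
```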