Backgammon Project
Oren Salzman, Guy Levit
Instructors: Part A – Ishai Menashe; Part B – Yaki Engel

Agenda
• Project objectives
• The learning algorithm
• TD-Gammon problematic points
• The race problem
• Experimental results
• Future development

Objectives
• Develop an agent that learns to play backgammon by playing against itself, using reinforcement learning techniques
• Inspired by Tesauro's TD-Gammon, version 0.0

Learning Algorithm – General
• Positions are evaluated using a neural network
• Moves are chosen with a greedy policy
• When the game ends, the agent receives a reward according to the result (+2, +1, -1, -2)

TD-Gammon – Problematic Points
• Non-linear neural network
• The policy changes during training
• The environment changes during training
Solutions:
• A linear network
• Learning in alternations (only one of the two players learns at a time)

The Race Problem
In a race (when the two sides' checkers can no longer make contact), a more algorithmic approach is required for choosing a move. Three solutions were considered:
1. Designing a manual algorithm
2. Using a different network for races
3. Using the same network, but dedicating each feature either to race or to non-race positions

Experiments
Various parameter settings were checked:
• Learning step α: 0.1, 0.3, 0.8
• λ: 0.1, 0.3, 0.5, 0.7, 0.9
• Discount factor γ: 0.95, 0.97, 0.98, 0.999
For each setting the agent played between half a million and five million games. All versions were compared to a single golden version.

Experiments' Results
[Chart: average points vs. training time (thousands of games, 0–70) for the fastest players with α = 0.3; curves for player 1 and player 2 with λ ∈ {0.1, 0.3, 0.5, 0.7} and γ ∈ {0.97, 0.98}]

Experiments' Results
[Chart: average points vs. training time (thousands of games, 0–70) for the fastest players with α = 0.1; curves for player 1 and player 2 with λ ∈ {0.3, 0.5, 0.7} and γ ∈ {0.98, 0.999}]

Conclusions
• A learning step of 0.1 yielded the best results
• High discount factors (0.98, 0.999) were better than lower ones
• λ values of 0.1 and 0.9 were inferior to the others; among 0.3, 0.5, and 0.7, 0.5 seemed best
• None of the versions outperformed the golden version

Future Development
• More than 1-ply search
• Adding features
• Going back to a non-linear network
• Letting both agents learn simultaneously
• Connecting the player to the Internet
• A graphical user interface

END

Learning Algorithm – General (details)
• The agent plays against itself and receives a reward (-2, -1, +1, +2) when the game ends.
• The network weights are updated by
  \theta_{t+1} = \theta_t + \alpha \left[ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \right] e_t
• The eligibility trace is updated by
  e_t = \gamma \lambda \, e_{t-1} + \nabla_{\theta} V_t(s_t)

The Features

Backgammon Board Definitions
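
The TD(λ) update on the "Learning Algorithm – General (details)" slide, combined with the linear network and greedy policy described earlier, can be illustrated with a short sketch. This is a minimal Python illustration, not the project's code: the feature extractor, the toy positions, the 198-element feature vector, and the chosen α/γ/λ values are assumptions made only for the example.

import numpy as np

NUM_FEATURES = 198                      # assumed feature-vector size (TD-Gammon-style); not the project's actual count
ALPHA, GAMMA, LAM = 0.1, 0.98, 0.5      # one parameter setting from the Experiments slide


def features(position):
    """Placeholder feature extractor phi(s); `position` is a toy tuple of ints,
    a stand-in for the project's real board encoding."""
    phi = np.zeros(NUM_FEATURES)
    for i, x in enumerate(position):
        phi[i % NUM_FEATURES] += x
    return phi


def value(theta, position):
    """Linear value network: V(s) = theta . phi(s)."""
    return float(theta @ features(position))


def greedy_afterstate(theta, candidate_positions):
    """Greedy policy: among the positions reachable with the current dice roll,
    choose the one the network values highest."""
    return max(candidate_positions, key=lambda s: value(theta, s))


def td_lambda_step(theta, e, s_t, s_next, reward, terminal):
    """One TD(lambda) step, following the slide's formulas:
         e_t         = gamma * lambda * e_{t-1} + grad_theta V_t(s_t)
         theta_{t+1} = theta_t + alpha * [r_{t+1} + gamma * V_t(s_{t+1}) - V_t(s_t)] * e_t
    For a linear network, grad_theta V(s) is simply phi(s)."""
    e = GAMMA * LAM * e + features(s_t)                   # eligibility trace
    v_next = 0.0 if terminal else value(theta, s_next)    # no bootstrap past the end of the game
    delta = reward + GAMMA * v_next - value(theta, s_t)   # TD error
    return theta + ALPHA * delta * e, e


# Toy usage: a single non-terminal update on fabricated positions, then a
# terminal update carrying the +2/+1/-1/-2 game result as the reward.
theta, e = np.zeros(NUM_FEATURES), np.zeros(NUM_FEATURES)
s0, s1 = (1, 0, 2), (0, 1, 2)
theta, e = td_lambda_step(theta, e, s0, s1, reward=0.0, terminal=False)
theta, e = td_lambda_step(theta, e, s1, None, reward=+1.0, terminal=True)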