Backgammon project
Oren Salzman
Guy Levit
Instructors:
Part a: Ishai Menashe
Part b: Yaki Engel
Agenda
• Project’s Objectives
• The Learning Algorithm
• TD-Gammon Problematic Points
• The Race Problem
• Experimental Results
• Future Development
Objectives
• Developing an agent that learns to play backgammon by playing against itself, using reinforcement learning techniques
• Inspired by Tesauro’s TD-Gammon version 0.0
Learning Algorithm - general
• Evaluating positions using a neural network
• Greedy policy: the move leading to the highest-valued resulting position is chosen (see the sketch below)
• When the game ends, the agent gets a reward according to the result (+2, +1, -1, -2)
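A minimal sketch of the greedy move selection described above, written in Python under simplifying assumptions: a candidate "position" is represented directly by its feature vector and the evaluation is a linear form, as used later in the project. All names here are illustrative, not the project's actual code.

import numpy as np

def evaluate(weights, position_features):
    # Linear position evaluation: V(s) = w . phi(s)
    return float(np.dot(weights, position_features))

def greedy_move(weights, candidate_positions):
    # Greedy policy: evaluate every position reachable with the current dice
    # and pick the one with the highest estimated value.
    return max(candidate_positions, key=lambda p: evaluate(weights, p))

# Toy usage: three candidate afterstates described by four features each.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
candidates = [rng.normal(size=4) for _ in range(3)]
best = greedy_move(w, candidates)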
TD-Gammon Problematic Points
• Non-linear neural network
• Policy is changing during training
• Environment is changing during training
Solutions (sketched below):
• Linear network
• Learning in alternation (the two agents take turns learning; only one updates its weights at a time)
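A minimal sketch of the alternating-learning scheme, assuming (as the "Future development" slide suggests) that only one of the two self-play agents updates its weights in each phase while its opponent stays frozen; train_in_alternation and the dummy update function are illustrative names, not the project's code.

import numpy as np

def train_in_alternation(td_update, num_features=10, phases=4, games_per_phase=100):
    # Two linear players; in each phase only the "learner" changes, which keeps
    # the learner's environment (its opponent's policy) fixed within that phase.
    players = [np.zeros(num_features), np.zeros(num_features)]
    for phase in range(phases):
        learner = phase % 2
        frozen = 1 - learner
        for _ in range(games_per_phase):
            players[learner] = td_update(players[learner], players[frozen])
    return players

# Dummy stand-in for "play one self-play game and apply TD updates to the learner".
dummy_update = lambda w_learner, w_frozen: w_learner
trained = train_in_alternation(dummy_update)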
The Race Problem
In a race, a more algorithmic approach is required for choosing a move.
Three solutions were considered:
1. Designing a manual algorithm
2. Using a different network for races
3. Using the same network, but dedicating each feature either to race or to non-race positions (see the sketch below)
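A minimal sketch of option 3, assuming the idea is to double the feature vector so that race features and non-race features occupy disjoint slots of a single linear network; dedicated_features and the toy feature values are illustrative only.

import numpy as np

def dedicated_features(raw, is_race):
    # The first half of the feature vector is active only in non-race (contact)
    # positions, the second half only in race positions, so each weight of the
    # single network ends up specialized to one of the two regimes.
    zeros = np.zeros_like(raw)
    return np.concatenate([zeros, raw]) if is_race else np.concatenate([raw, zeros])

# Toy usage with a made-up 5-feature raw description of a position.
raw = np.arange(5, dtype=float)
phi_contact = dedicated_features(raw, is_race=False)  # [raw, 0...0]
phi_race = dedicated_features(raw, is_race=True)      # [0...0, raw]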
Experiments
Various settings of parameters were checked (the grid is written out below):
– Learning step (0.1, 0.3, 0.8)
– Lambda (0.1, 0.3, 0.5, 0.7, 0.9)
– Discount factor (0.95, 0.97, 0.98, 0.999)
For each setting the agent played between half a million and five million games.
All versions were compared to one golden version.
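For concreteness, the parameter grid above can be written out as follows; the variable names are illustrative, and the slides report results only for a subset of the full grid.

from itertools import product

LEARNING_STEPS = [0.1, 0.3, 0.8]        # alpha
LAMBDAS = [0.1, 0.3, 0.5, 0.7, 0.9]     # lambda (eligibility-trace decay)
DISCOUNTS = [0.95, 0.97, 0.98, 0.999]   # gamma (discount factor)

settings = [
    {"alpha": a, "lam": l, "gamma": g}
    for a, l, g in product(LEARNING_STEPS, LAMBDAS, DISCOUNTS)
]
# Each trained version (0.5M-5M self-play games) was then compared to the
# golden version.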
Experiments’ results
[Plot: "Alpha=0.3 fastest players"; Avg points vs. Time [thousands of games]; curves for player1 l=0.1 g=0.97, player1 l=0.1 g=0.98, player1 l=0.3 g=0.97, player2 l=0.3 g=0.98, player2 l=0.5 g=0.98, player1 l=0.5 g=0.98, player1 l=0.7 g=0.98, player2 l=0.7 g=0.98.]
Experiments’ results
[Plot: "Alpha=0.1 fastest players"; Avg points vs. Time [thousands of games]; curves for player1 and player2 with lambda in {0.3, 0.5, 0.7} and discount factor in {0.98, 0.999}.]
Conclusions
• A learning step of 0.1 yielded the best results
• High discount factors (0.98, 0.999) were better than lower ones
• Lambda values of 0.1 and 0.9 were inferior to the others; among 0.3, 0.5, and 0.7, 0.5 seemed the best
• None of the versions outperformed the golden version
Future development
• More than 1-ply search
• Adding features
• Going back to a non-linear network
• Letting both agents learn simultaneously
• Connecting the player to the internet
• Graphical User Interface
END
Learning Algorithm - general
• The agent plays against itself, and gets rewards (-2, -1, +1, +2) when the game ends.
• The network weights are updated using the following formula:
  \theta_{t+1} = \theta_t + \alpha \, [\, r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \,] \, e_t
• The eligibility trace is updated by (a code sketch follows):
  e_t = \gamma \lambda \, e_{t-1} + \nabla_\theta V_t(s_t)
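A minimal, runnable sketch of the two updates above for a linear value function V_t(s) = theta . phi(s), where the gradient of V with respect to theta is simply phi(s); the function name and toy features are illustrative, not the project's actual code.

import numpy as np

def td_lambda_step(theta, e, phi_s, phi_next, reward, alpha, gamma, lam, terminal=False):
    # One TD(lambda) step for a linear evaluator V(s) = theta . phi(s).
    v_s = float(np.dot(theta, phi_s))
    v_next = 0.0 if terminal else float(np.dot(theta, phi_next))
    delta = reward + gamma * v_next - v_s   # TD error: r_{t+1} + gamma*V(s_{t+1}) - V(s_t)
    e = gamma * lam * e + phi_s             # e_t = gamma*lambda*e_{t-1} + grad_theta V(s_t)
    theta = theta + alpha * delta * e       # theta_{t+1} = theta_t + alpha*delta*e_t
    return theta, e

# Toy usage with four made-up features and the best-performing parameter setting
# reported in the conclusions (alpha=0.1, gamma=0.98, lambda=0.5).
theta, e = np.zeros(4), np.zeros(4)
phi_s = np.array([1.0, 0.0, 0.0, 1.0])
phi_next = np.array([0.0, 1.0, 1.0, 0.0])
theta, e = td_lambda_step(theta, e, phi_s, phi_next, reward=0.0,
                          alpha=0.1, gamma=0.98, lam=0.5)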
The Features
Backgammon Board Definitions