tut_week5

ChooseMove:
For all legal moves (LegalMoves), e.g. 16→19, or 11→15, or 12→16, …
- simulate that move (SimulateMove)
- check the value of the result, e.g. V(board after 12→16) = 30 points
Pick the best move
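The ChooseMove loop above can be sketched in a few lines. This is a minimal sketch, not the slides' actual code: `legal_moves`, `simulate_move` and the scoring function `v` are hypothetical stand-ins passed in as arguments.

```python
# A minimal sketch of ChooseMove. legal_moves, simulate_move and the
# scoring function v are hypothetical stand-ins, passed in as arguments.

def choose_move(board, legal_moves, simulate_move, v):
    """Try every legal move, score the resulting board, pick the best."""
    best_move, best_score = None, float("-inf")
    for move in legal_moves(board):          # e.g. 16->19, 11->15, 12->16, ...
        result = simulate_move(board, move)  # the board after making the move
        score = v(result)                    # e.g. 30 points
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```

Everything below is really about where the scoring function `v` comes from; the move-picking loop itself never changes.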
We’ll make our computer program learn the function V
- so when it sees a board, it will assign a score to the board
- it has to learn how to assign the correct score to a board
First, we have to define how our function V is supposed to behave…
…but what about boards in the middle of a game?
[diagram: a mid-game board b, its successor(b), and the moves that follow, ending at a final board b’ with V(b’)=100]
[diagram: from mid-game board b, one line of play ends at a final board b’ with V(b’)=100, another ends at a final board with V(b’)=-100; along the winning line we set V(b)=V(b’)=100]
NB: if this board state occurs in 50% wins and 50% losses then V(b) will eventually be 0
[diagram: V(b)=V(b’)=100 on a winning line, V(b’)=-100 on a losing line, V(b)=V(b’)=0 on a drawn line]
NB: if this board state occurs in 75% wins and 25% losses then V(b) will eventually be 50
so each board score will eventually be a mix of the number of games that you can win, lose or draw from that board (assuming an ideal scoring function… in fact we often only approximate this)
That’s our definition of V – this is how our computer program’s V should learn how to behave…
But in the middle of a game, how can you calculate the path to the final board?
- generate all possibilities? feasible for tic-tac-toe, but for chess …it takes too long!
- so approximate V and call it V’ (“V-hat”): V’(b) ≈ V(b)
…so we’ll learn an approximation to V
Now we’ll choose our representation of V’
(what is a board anyway…?)
1) big table?
input: a board; output: its score (e.g. 50, 90, …) – one row per possible board
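The table idea is just a lookup keyed by the exact position. A sketch with made-up board encodings (the string keys and scores are invented):

```python
# Sketch of representation 1: a big table mapping each exact board
# position to a score. The board encodings and scores are made up.

v_table = {
    "board-A": 50,
    "board-B": 90,
}

def v_hat(board):
    return v_table.get(board, 0)  # boards we've never scored default to 0
```

The catch is scale: checkers has on the order of 10^20 reachable positions, so the table can never be filled in, or even stored.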
2) CLIPS rules?
IF my piece is near a side edge
THEN score = 80
IF my piece is near the opponent’s edge
THEN score = 90
….
3) polynomial function?
we define some variables…
X1 = number of white pieces on board
X2 = number of red pieces
X3 = number of white kings
X4 = number of red kings
X5 = number of white pieces threatened by red (can be
captured on red’s next turn)
X6 = number of red pieces threatened by white
e.g. X1 = 12, X2 = 11, X3 = 0, X4 = 0, X5 = 1, X6 = 0
arrange as a linear combination:
V’(b) = w0 + w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6
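With the six features in hand, evaluating V’ is one weighted sum. A sketch: only w0=23 and w1=0.5 appear in the slides; the remaining weight values are invented for illustration.

```python
# V'(b) = w0 + w1*x1 + ... + w6*x6 -- a linear combination of the six
# board features. Only w0=23 and w1=0.5 come from the slides; the
# remaining weights are invented for illustration.

def v_hat(weights, features):
    w0, rest = weights[0], weights[1:]
    return w0 + sum(w * x for w, x in zip(rest, features))

weights = [23, 0.5, -0.5, 2.0, -2.0, -1.0, 1.0]  # w0..w6
features = [12, 11, 0, 0, 1, 0]                  # x1..x6 (the example board)
print(v_hat(weights, features))                  # 23 + 6 - 5.5 - 1 = 22.5
```

Note the signs a learner might end up with: white material and threats on red push the score up, red material and threats on white push it down.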
The computer program will change the values of the weights – it will learn what the weights should be to give the correct score for each board.
learning process – playing and learning at the same time
computer will play against itself
initialise V’ with random weights (w0=23 etc.), so initially V’(b) = 23 + 0.5 x1 +…
start a new game – for each board:
(a) calculate V’ on all possible legal moves (LegalMoves: 12→16, or 11→15, or …; SimulateMove each one and score the result)
For example, one of the boards resulting from a legal move might be
b = <x1=12, x2=12,…,x6=1>
V’(<x1=12, x2=12,…,x6=1>) = 23 + 0.5 x1 +… = 30 points
First time round V’ is essentially random (because we set the weights as random) – as it learns, V’ should pick better successors.
(b) pick the successor with the highest score (e.g. red: 12→16)
(c) evaluate error
At this point: board 1 = b, board 2 = successor(b)
(d) modify each weight to correct error
The weight-update rule:
wi ← wi + η · (Vtrain(b) − V’(b)) · xi
where wi is the weight to update, η is the learning rate, (Vtrain(b) − V’(b)) is the error, and the factor xi modifies the weight in proportion to the size of the attribute value.
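Step (d) in code. This is a sketch of the update rule, not the slides' own implementation; the learning-rate value 0.1 is an assumption.

```python
# One weight-update step: each weight moves by eta * error * (its
# attribute value), with a constant x0 = 1 so that w0 is updated too.
# eta = 0.1 is an assumed learning rate, not a value from the slides.

def update_weights(weights, features, v_train, v_hat_b, eta=0.1):
    error = v_train - v_hat_b        # target score minus current estimate
    xs = [1] + list(features)        # x0 = 1 pairs with w0
    return [w + eta * error * x for w, x in zip(weights, xs)]
```

Note the proportionality: a feature whose value was 0 on this board leaves its weight untouched, and a large error moves every active weight further.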
At this point the change probably won’t be in a useful direction, because the weights are all still random.
board 1
learning process – playing and learning at same time
computer will play against itself
initialise V’ with random weights (w0=23 etc.)
start a new game - for each board
red:1216
board 2
(a) calculate V’ on all possible legal moves
(b) pick the successor with the highest score
(c) evaluate error
(d) modify each weight to correct error
(e) repeat until end-of-game
….
red wins
board 1
learning process – playing and learning at same time
computer will play against itself
initialise V’ with random weights (w0=23 etc.)
start a new game - for each board
red:1216
board 2
(a) calculate V’ on all possible legal moves
(b) pick the successor with the highest score
(c) evaluate error
(d) modify each weight to correct error
(e) repeat until end-of-game
….
b is a final board state
Vtrain(b) = 100
…because we won
red wins
board 1
learning process – playing and learning at same time
computer will play against itself
initialise V’ with random weights (w0=23 etc.)
start a new game - for each board
red:1216
board 2
(a) calculate V’ on all possible legal moves
(b) pick the successor with the highest score
(c) evaluate error
(d) modify each weight to correct error
(e) repeat until end-of-game
now we’re shifting our V’ towards
something useful (a little bit)
….
red wins
board 1
learning process – playing and learning at same time
computer will play against itself
initialise V’ with random weights (w0=23 etc.)
start a new game - for each board
red:1216
board 2
(a) calculate V’ on all possible legal moves
(b) pick the successor with the highest score
(c) evaluate error
(d) modify each weight to correct error
(e) repeat until end-of-game
….
1 million games later and our V’ might
now predict a useful score for any board
that we might see
red wins
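Putting steps (a)–(e) together, a toy sketch of the whole loop. The game itself is stubbed out (random feature vectors stand in for LegalMoves/SimulateMove, and a coin flip decides the winner); mid-game training targets follow the Mitchell-style rule Vtrain(b) = V’(successor(b)), with the final board trained towards the outcome. The learning rate, move counts and stub details are all assumptions.

```python
import random

ETA = 0.001  # learning rate -- an assumed value

def v_hat(w, x):
    """V'(b) = w0 + w1*x1 + ... + w6*x6."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def lms_update(w, x, target):
    """w_i <- w_i + eta * (target - V'(b)) * x_i, with x0 = 1 for w0."""
    error = target - v_hat(w, x)
    xs = [1] + list(x)
    return [wi + ETA * error * xi for wi, xi in zip(w, xs)]

def play_one_game(w, n_features=6, n_moves=20):
    """Self-play one stubbed game, then update the weights from it."""
    history = []
    for _ in range(n_moves):
        # stub for LegalMoves + SimulateMove: four random successor boards
        successors = [[random.randint(0, 12) for _ in range(n_features)]
                      for _ in range(4)]
        board = max(successors, key=lambda x: v_hat(w, x))  # step (b)
        history.append(board)
    outcome = 100 if random.random() < 0.5 else -100        # stub: win or lose
    # steps (c)+(d): mid-game boards train towards V'(successor(b)),
    # the final board trains towards the game outcome
    for b, succ in zip(history, history[1:]):
        w = lms_update(w, b, v_hat(w, succ))
    return lms_update(w, history[-1], outcome)

w = [random.uniform(-1, 1) for _ in range(7)]  # random initial weights
for _ in range(1000):                          # "...1 million games later"
    w = play_one_game(w)
```

With a real game in place of the stubs, wins and losses flow back through the weights exactly as in the slides, and each weight settles towards values that score boards usefully.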