Minimax Value Iteration Applied to Robotic Soccer

Gonçalo Neto
Institute for Systems and Robotics
Instituto Superior Técnico
Lisbon, PORTUGAL
Presentation Outline
• Framework Concepts
• Solving Two-Person Zero-Sum Stochastic Games
• Soccer as a Stochastic Game
• Results
• Conclusions and Future Work
Markov Decision Processes
• Single-agent / multiple-state Markovian environment.
• Defined as a 4-tuple (S, A, T, R) where:
  • S is a set of states.
  • A is a set of actions.
  • T: S×A×S → [0,1] is a transition function.
  • R: S×A×S → ℝ is a reward function.
• In an MDP, a policy π is:
  • π: S×A → [0,1]
  • deterministic vs. stochastic.
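As a concrete illustration (not taken from the thesis), the 4-tuple can be encoded directly as arrays. A minimal sketch; the toy two-state MDP below and all of its numbers are hypothetical:

    # Hypothetical two-state, two-action MDP encoded as the 4-tuple (S, A, T, R).
    import numpy as np

    S = [0, 1]                # states
    A = [0, 1]                # actions
    T = np.zeros((2, 2, 2))   # T[s, a, s'] in [0, 1]
    T[0, 0] = [0.8, 0.2]      # action 0 in state 0 mostly stays put
    T[0, 1] = [0.1, 0.9]      # action 1 usually reaches state 1
    T[1, :, 1] = 1.0          # state 1 is absorbing
    R = np.zeros((2, 2, 2))   # R[s, a, s'] is the reward
    R[0, 1, 1] = 1.0          # reward for reaching state 1 via action 1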
Optimality in MDPs
• Maximizing expected reward leads to optimal policies.
• Usual formulation: discounted reward over time.
• State values:

  V^{\pi}(s) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s \right]

• The Bellman Optimality Equation relates the state values for the optimal policy:

  V^{*}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{*}(s') \right]

• The optimal policy is greedy with respect to these values...
Dynamic Programming
• There are several Dynamic Programming algorithms.
  • Based on the Bellman Optimality Equation.
  • They assume full knowledge of the environment.
  • Not suitable for online learning.
• A popular algorithm is Value Iteration. Iteration expression (sketched in code below):

  V^{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{k}(s') \right]
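A minimal Value Iteration sketch over the array encoding introduced earlier; this is an illustrative rendering of the iteration expression, not the implementation used in the thesis:

    # Value Iteration: sweep the Bellman optimality backup to a fixed point.
    import numpy as np

    def value_iteration(T, R, gamma=0.9, tol=1e-6):
        n_states = T.shape[0]
        V = np.zeros(n_states)
        while True:
            # Q[s, a] = sum_s' T(s,a,s') * (R(s,a,s') + gamma * V(s'))
            Q = np.einsum('san,san->sa', T, R + gamma * V[None, None, :])
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)   # values and a greedy policy
            V = V_new

The returned policy is greedy with respect to the converged values, matching the optimality discussion above.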
Matrix Games
• Multiple-agent / single-state environment.
• Defined as a tuple (n, A1...n, R1...n) where:
  • n is the number of players.
  • Ai is the set of actions for player i; A is the joint action space.
  • Ri: A → ℝ is a reward function: the reward depends on the joint action.
• A strategy σ is a probability distribution over the actions. The joint strategy is the strategy of all the players.
Matrix Games: examples

Rock-Paper-Scissors (zero-sum):

  Player 1            Player 2
       R   P   S           R   P   S
  R    0  -1   1      R    0   1  -1
  P    1   0  -1      P   -1   0   1
  S   -1   1   0      S    1  -1   0

Prisoner's Dilemma (general-sum):

  Player 1        Player 2
       T   N           T   N
  T    2   0      T    2   5
  N    5   1      N    0   1

(Rows index Player 1's action, columns Player 2's. An encoded version follows below.)
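For later reference, here is Player 1's Rock-Paper-Scissors table written as a payoff array (an illustrative encoding, not from the slides):

    # Player 1's payoffs; rows are Player 1's action (R, P, S), columns Player 2's.
    # Player 2's matrix is -rps, since the game is zero-sum.
    import numpy as np

    rps = np.array([[ 0, -1,  1],
                    [ 1,  0, -1],
                    [-1,  1,  0]])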
Optimality in MGs
• Best-Response function: the set of optimal strategies given the other players' current strategies.
• Nash equilibrium: a joint strategy in which every player is playing a Best-Response strategy to the other players.
• Solving an MG: finding its Nash equilibrium (or equilibria, since one game can have more than one).
• All MGs have at least one Nash equilibrium.
• Types of games: zero-sum games, team games, general-sum games.
Solving Zero-sum Games
• Two-person zero-sum games (or just zero-sum games) have the following characteristics:
  • Two opponents play against each other.
  • Their rewards are symmetrical (they always sum to zero).
  • Usually only one equilibrium...
  • ...and if more exist, they are interchangeable!
• To find an equilibrium, use the Minimax Principle (solvable as a linear program; see the sketch below):

  \max_{\sigma \in PD(A)} \min_{o \in O} \sum_{a \in A} \sigma(a) R(a, o)
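The Minimax Principle is a linear program over σ. A sketch using scipy.optimize.linprog (an assumed dependency; the thesis does not name its LP solver):

    # Maximize v subject to: sigma^T R[:, o] >= v for every opponent action o,
    # sigma >= 0, and sum_a sigma(a) = 1.  Variables are x = [sigma, v].
    import numpy as np
    from scipy.optimize import linprog

    def solve_matrix_game(R):
        n_a, n_o = R.shape
        c = np.zeros(n_a + 1)
        c[-1] = -1.0                                 # minimize -v == maximize v
        A_ub = np.hstack([-R.T, np.ones((n_o, 1))])  # v - sigma^T R[:, o] <= 0
        b_ub = np.zeros(n_o)
        A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)
        b_eq = [1.0]                                 # probabilities sum to one
        bounds = [(0, 1)] * n_a + [(None, None)]     # v itself is unbounded
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:-1], res.x[-1]                 # strategy sigma, game value

    sigma, value = solve_matrix_game(rps)   # ~(1/3, 1/3, 1/3) with value ~0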
Stochastic Games
• Multiple-agent / multiple-state environment: an extension of both MDPs and MGs.
• Defined as a tuple (n, S, A1...n, T, R1...n) where:
  • n is the number of players.
  • S is a set of states.
  • Ai is the set of actions for player i; A is the joint action space.
  • T: S×A×S → [0,1] is a transition function.
  • Ri: S×A×S → ℝ is a reward function.
• Markovian from the game's point of view, but not from a single player's.
• The notion of policy can be defined as in MDPs (a tensor encoding is sketched below).
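Extending the earlier MDP encoding, a two-player zero-sum SG only needs an extra opponent-action index. A sketch with illustrative sizes:

    # Tensor encoding of a two-player zero-sum stochastic game.
    import numpy as np

    n_s, n_a, n_o = 22, 5, 3            # illustrative sizes, not the thesis setup
    T = np.zeros((n_s, n_a, n_o, n_s))  # T[s, a, o, s'] in [0, 1]
    R = np.zeros((n_s, n_a, n_o, n_s))  # our reward; the opponent receives -R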
Solving SGs...
• Several Reinforcement Learning and Dynamic Programming algorithms have been derived.
• Normally, only one type of game is solved.
  • Example: a zero-sum stochastic game is one with two players in which every state represents a zero-sum matrix game.
• A possible approach:
  • Dynamic Programming + Matrix-Game Solver
Minimax Value Iteration
• Suitable for two-person zero-sum stochastic games.
• Dynamic Programming: Value Iteration.
  • The state values represent Nash equilibrium values.
• Matrix Solver: Minimax in each state.
• Bellman Optimality Equation:

  V^{*}(s) = \max_{\sigma \in PD(A)} \min_{o \in O} \sum_{a \in A} \sigma(a) Q^{*}(s, a, o)

  Q^{*}(s, a, o) = \sum_{s'} T(s, a, o, s') \left[ R(s, a, o, s') + \gamma V^{*}(s') \right]
If not two-person...
• If the game is not a two-person zero-sum game but...
  • it is a two-team game,
  • within each team the reward is the same,
  • and the rewards of the two teams are symmetrical,
• ...we can consider team actions and apply the same algorithm (see the sketch below):
  • A = A1 × A2 × ... × An
  • O = O1 × O2 × ... × Om
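The team action set is just the Cartesian product of the members' action sets; for example (with made-up primitive actions):

    # Joint team actions as a Cartesian product. The set grows multiplicatively
    # with team size, which is why this only scales to small teams/action sets.
    from itertools import product

    A1 = ['get-ball', 'pass', 'shoot']
    A2 = ['get-ball', 'defend']
    team_actions = list(product(A1, A2))   # 3 x 2 = 6 joint actions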
Algorithm Expression
• Based on the Bellman Optimality Equation for Two-Person Zero-Sum Stochastic Games (sketched in code below):

  V^{k+1}(s) \leftarrow \max_{\sigma \in PD(A)} \min_{o \in O} \sum_{a \in A} \sigma(a) Q^{k+1}(s, a, o)

  Q^{k+1}(s, a, o) \leftarrow \sum_{s'} T(s, a, o, s') \left[ R(s, a, o, s') + \gamma V^{k}(s') \right]
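Putting the pieces together: one Bellman backup per iteration, with the matrix-game LP solver (solve_matrix_game above) replacing the plain max of Value Iteration. Again an illustrative sketch, not the thesis implementation:

    # Minimax Value Iteration on tensors T[s, a, o, s'] and R[s, a, o, s'].
    import numpy as np

    def minimax_value_iteration(T, R, gamma=0.9, tol=1e-6):
        n_s = T.shape[0]
        V = np.zeros(n_s)
        while True:
            # Q[s, a, o] = sum_s' T(s,a,o,s') * (R(s,a,o,s') + gamma * V(s'))
            Q = np.einsum('saon,saon->sao', T, R + gamma * V[None, None, None, :])
            # Solve one zero-sum matrix game per state; its value is the new V(s).
            V_new = np.array([solve_matrix_game(Q[s])[1] for s in range(n_s)])
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new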
Modelling a Player
• Each player is modelled as a non-deterministic automaton.
• The outcome of an action depends on the actions of all players...
• ...so, from a single player's perspective, the transition probabilities are not stationary.
Modelling the Game
• Symmetrical rewards for both teams.
  • Only received after a goal.
• A set of rules defines the transitions. Examples (one is sketched in code below):
  • IF k players are getting the ball AND none has it → one of them gets it with probability 1/k.
  • IF a player is changing role → the role is changed with probability 1 and the ball is lost with probability p.
  • ...
• Used in simulation:
  • 2 teams of 2 players each.
  • Different setups, with some players restricted to just one role.
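As an illustration of how such rules can be coded (a hypothetical helper, not taken from the thesis), the first rule amounts to a uniform draw among the contenders:

    # If k players execute get-ball on a free ball, each wins it with probability 1/k.
    import random

    def resolve_free_ball(contenders):
        # contenders: ids of the players currently executing get-ball
        return random.choice(contenders) if contenders else None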
Method Convergence
• Usually converges fast, but...
• ...for a setup with #S = 82 and #A = 25, one iteration took more than 30 minutes.
• The plots are for #S = 22 and #A = 15.

[Two plots of State Value vs. Number of Iterations (1-19), both flattening out well before iteration 20; in the second, the state value converges to -1.]
Simulation after Training
• Used a 10000-step simulation.
  • When a terminal state was reached, the game was put back in the initial state.
• Against another optimal opponent:
  • Only one game played.
  • Finished in a goalless draw.
• Against a random opponent:
  • A team with one pure Attacker scored 2974 goals against 326.
  • A team with one pure Defender scored 0 against 0.
Conclusions
• The Nash equilibrium convergence assures worst-case optimality.
  • If it is not possible to score more, assuming the worst case, then keep the draw.
  • Defensive teams tend to just defend.
• The method is suitable for offline learning.
  • Very time-consuming.
  • With a large action set, the linear programs slow the method → efficient LP techniques are needed.
• The team-action approach only works for small action sets and/or small teams.
Future Work and Ideas
• Observability issues.
  • Should DP assume partial observability?
  • How do we build the game model?
• The suitable learning method depends on the other players' type.
  • While learning / training locally, the learning method could depend on the agent's beliefs about another player.
• Some actions could be discarded.
  • Example: it doesn't make sense to choose get-ball while having the ball.
  • Supervisory control for enabling actions that make sense:
    • a way of incorporating knowledge;
    • can act as a complement to reinforcement learning and dynamic programming.
Q&A