INSTITUTO DE SISTEMAS E ROBÓTICA

Minimax Value Iteration Applied to Robotic Soccer

Gonçalo Neto
Institute for Systems and Robotics
Instituto Superior Técnico
Lisbon, PORTUGAL

Presentation Outline
• Framework Concepts
• Solving Two-Person Zero-Sum Stochastic Games
• Soccer as a Stochastic Game
• Results
• Conclusions and Future Work

Markov Decision Processes
• Defined as a 4-tuple (S, A, T, R) where:
  • S is a set of states.
  • A is a set of actions.
  • T: S×A×S → [0,1] is a transition function.
  • R: S×A×S → ℝ is a reward function.
• Single-agent / multiple-state Markovian environment.
• In an MDP, a policy π is a mapping π: S×A → [0,1]; it can be deterministic or stochastic.

Optimality in MDPs
• Maximizing expected reward leads to optimal policies.
• Usual formulation: discounted reward over time.
• State values:
  V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]
• The Bellman Optimality Equation relates state values under the optimal policy:
  V*(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]
• The optimal policy is greedy with respect to V*.

Dynamic Programming
• There are several Dynamic Programming algorithms.
  • They assume full knowledge of the environment.
  • Not suitable for online learning.
• A popular algorithm is Value Iteration.
  • Based on the Bellman Optimality Equation.
  • Iteration expression:
    V^{k+1}(s) ← max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V^k(s') ]

Matrix Games
• Defined as a tuple (n, A_{1..n}, R_{1..n}) where:
  • n is the number of players.
  • A_i is the set of actions for player i; A is the joint action space.
  • R_i: A → ℝ is a reward function: the reward depends on the joint action.
• Multiple-agent / single-state environment.
• A strategy σ_i is a probability distribution over player i's actions. The joint strategy is the collection of all players' strategies.
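The Value Iteration update above can be sketched in a few lines of Python. The toy MDP below (its states, transitions, and rewards) is purely illustrative and not taken from the presentation; rewards here depend only on (s, a) for brevity.

```python
# Minimal Value Iteration sketch for a hypothetical two-state MDP.
# T[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
T = {
    0: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
    1: {0: [(1, 1.0)], 1: [(1, 1.0)]},  # state 1 is absorbing
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
gamma = 0.9

V = {s: 0.0 for s in T}
for _ in range(200):  # repeatedly apply the Bellman optimality update
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                for a in T[s])
         for s in T}

print(V[0])  # value of state 0 under the optimal policy (= 1/0.55 here)
```

In this toy problem the fixed point can be checked by hand: V(1) = 0 and V(0) = 1 + 0.9·0.5·V(0), so V(0) = 1/0.55 ≈ 1.818.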
Matrix Games: examples

Rock-Paper-Scissors:

  Player 1 rewards          Player 2 rewards
        R    P    S               R    P    S
  R     0   -1    1         R     0    1   -1
  P     1    0   -1         P    -1    0    1
  S    -1    1    0         S     1   -1    0

Prisoner's Dilemma:

  Player 1 rewards          Player 2 rewards
        T    N                    T    N
  T     2    0              T     2    5
  N     5    1              N     0    1

Optimality in MGs
• Best-Response function: the set of optimal strategies given the other players' current strategies.
• Nash equilibrium: a joint strategy in which every player is playing a best-response strategy to the other players.
• Solving a MG means finding its Nash equilibrium (or equilibria, because one game can have more than one).
• All MGs have at least one Nash equilibrium.
• Types of games: zero-sum games, team games, general-sum games.

Solving Zero-Sum Games
• Two-person zero-sum games (or just zero-sum games) have the following characteristics:
  • Two opponents play against each other.
  • Their rewards are symmetric (they always sum to zero).
  • Usually there is only one equilibrium...
  • ...and if more exist, they are interchangeable!
• To find an equilibrium, use the Minimax Principle:
  max_{σ∈PD(A)} min_{o∈O} Σ_{a∈A} σ(a) R(a,o)

Stochastic Games
• Defined as a tuple (n, S, A_{1..n}, T, R_{1..n}) where:
  • n is the number of players.
  • S is a set of states.
  • A_i is the set of actions for player i; A is the joint action space.
  • T: S×A×S → [0,1] is a transition function.
  • R_i: S×A×S → ℝ is a reward function.
• Multiple-agent / multiple-state environment: an extension of both MDPs and MGs.
• Markovian from the game's point of view, but not from each player's.
• The notion of policy can be defined as in MDPs.

Solving SGs...
• Several Reinforcement Learning and Dynamic Programming algorithms have been derived.
• Normally each algorithm solves one type of game.
• Example: a zero-sum stochastic game is one with two players in which every state represents a zero-sum matrix game.
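In general the Minimax Principle above is solved with a linear program over mixed strategies. As a small illustration only, 2×2 zero-sum games admit a closed-form solution; the helper below is a hypothetical sketch, not the solver used in the presentation.

```python
# Minimax Principle for 2x2 zero-sum matrix games, solved in closed form.
# General n x m games require a linear program; this is only illustrative.

def minimax_2x2(M):
    """Return (p, value): p = probability the row player assigns to action 0."""
    (a, b), (c, d) = M
    # Pure saddle point: max-min over pure actions equals min-max.
    maxmin = max(min(a, b), min(c, d))
    minmax = min(max(a, c), max(b, d))
    if maxmin == minmax:
        return (1.0 if min(a, b) >= min(c, d) else 0.0), maxmin
    # Otherwise mix so the opponent is indifferent between its columns.
    denom = a - b - c + d
    p = (d - c) / denom
    value = (a * d - b * c) / denom
    return p, value

p, v = minimax_2x2([[1, -1], [-1, 1]])  # matching pennies
print(p, v)  # 0.5 0.0
```

For matching pennies the equilibrium strategy is the uniform mix (p = 1/2) and the game value is 0, as the symmetry of the rewards suggests.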
• A possible approach: Dynamic Programming + Matrix-Game Solver.

Presentation Outline
• Framework Concepts
• Solving Two-Person Zero-Sum Stochastic Games
• Soccer as a Stochastic Game
• Results
• Conclusions and Future Work

Minimax Value Iteration
• Suitable for two-person zero-sum stochastic games.
• Dynamic Programming:
  • Value Iteration.
  • The state values represent Nash equilibrium values.
• Matrix Solver:
  • Minimax in each state.
• Bellman Optimality Equation:
  V*(s) = max_{σ∈PD(A)} min_{o∈O} Σ_{a∈A} σ(a) Q*(s,a,o)
  Q*(s,a,o) = Σ_{s'} T(s,a,o,s') [ R(s,a,o,s') + γ V*(s') ]

If not two-person...
• If the game is not a two-person zero-sum game but...
  • it is a two-team game,
  • within each team the reward is the same,
  • and the rewards of the two teams are symmetric...
• ...we can consider team actions and apply the same algorithm:
  • A = A_1 × A_2 × ... × A_n
  • O = O_1 × O_2 × ... × O_m

Algorithm Expression
• Based on the Bellman Optimality Equation for Two-Person Zero-Sum Stochastic Games:
  Q^{k+1}(s,a,o) ← Σ_{s'} T(s,a,o,s') [ R(s,a,o,s') + γ V^k(s') ]
  V^{k+1}(s) ← max_{σ∈PD(A)} min_{o∈O} Σ_{a∈A} σ(a) Q^{k+1}(s,a,o)

Presentation Outline
• Framework Concepts
• Solving Two-Person Zero-Sum Stochastic Games
• Soccer as a Stochastic Game
• Results
• Conclusions and Future Work

Modelling a Player
• Non-deterministic automaton.
• The output of an action depends on the actions of all players...
• ...so, from one player's perspective, the transition probabilities are not stationary.

Modelling the Game
• Symmetrical rewards for both teams.
  • Only received after a goal.
• A set of rules defines the transitions. Examples:
  • IF k players are getting-ball AND none has it ⋄ one of them gets it with probability 1/k.
  • IF a player is changing role ⋄ the role is changed with probability 1 and the ball is lost with probability p.
  • ...
• Used in simulation:
  • 2 teams of 2 players each.
  • Different setups, with some players restricted to just one role.

Presentation Outline
• Framework Concepts
• Solving Two-Person Zero-Sum Stochastic Games
• Soccer as a Stochastic Game
• Results
• Conclusions and Future Work

Method Convergence
• Usually converges fast but...
• ...for a setup with #S = 82 and #A = 25, one iteration took more than 30 minutes.
• The graphics are for #S = 22 and #A = 15.
[Figure: two plots of State Value vs. Number of Iterations (1–19), showing the convergence of the state values.]

Simulation after Training
• Used a 10000-step simulation.
• When a terminal state was reached, the game was put back in the initial state.
• Against another optimal opponent:
  • Only one game played.
  • Finished in a goalless draw.
• Against a random opponent:
  • A team with one pure Attacker scored 2974 against 326.
  • A team with one pure Defender scored 0 against 0.

Presentation Outline
• Framework Concepts
• Solving Two-Person Zero-Sum Stochastic Games
• Soccer as a Stochastic Game
• Results
• Conclusions and Future Work

Conclusions
• Convergence to the Nash equilibrium assures worst-case optimality.
  • If it is not possible to score more, assuming the worst case, then keep the draw.
  • Defensive teams tend to just defend.
• Method suitable for offline learning.
  • Very time consuming.
  • With a large action set, linear programs slow the method ⋄ efficient LP techniques needed.
• The team-action approach only works for small action sets and/or small teams.
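The Minimax Value Iteration scheme evaluated above can be sketched on a toy problem. The hypothetical two-state zero-sum game below is chosen so that each per-state matrix game has a pure saddle point, which lets max-min over pure actions stand in for the minimax computation; in general, each state requires a linear program over mixed strategies.

```python
# Toy sketch of Minimax Value Iteration on a hypothetical two-state
# zero-sum stochastic game (illustrative; not the soccer game itself).
# R[s][a][o]: reward to the maximizing player; state 1 is absorbing.
R = [[[1.0, 2.0], [0.0, 3.0]],      # state 0: pure saddle point at (a=0, o=0)
     [[0.0, 0.0], [0.0, 0.0]]]
# T[s] = list of (next_state, prob); independent of actions in this toy.
T = [[(0, 0.5), (1, 0.5)],
     [(1, 1.0)]]
gamma = 0.8

V = [0.0, 0.0]
for _ in range(200):
    newV = []
    for s in range(2):
        ev = sum(p * V[s2] for s2, p in T[s])  # expected next-state value
        Q = [[R[s][a][o] + gamma * ev for o in range(2)] for a in range(2)]
        newV.append(max(min(row) for row in Q))  # minimax over pure actions
    V = newV

print(V[0])  # ≈ 5/3 for this toy game
```

The fixed point can be checked by hand: V(1) = 0 and, since the saddle value of the state-0 matrix is 1, V(0) = 1 + 0.8·0.5·V(0), giving V(0) = 5/3.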
Future Work and Ideas
• Observability issues.
  • Should DP assume partial observability?
  • How do we build the game model?
• The suitable learning method depends on the other players' types.
  • While learning / training locally, the learning method could depend on the agent's beliefs about another player.
• Some actions could be discarded.
  • Example: it doesn't make sense to choose get-ball while already having the ball.
  • Supervisory control for enabling actions that make sense.
  • A way of incorporating knowledge.
  • Can act as a complement to reinforcement learning and dynamic programming.

Q&A