
CE810 / IGGI Game Design II
Game Design with AI Agents
Diego Perez-Liebana
Contents
• The Physical Travelling Salesman Problem
– The Problem
– The Framework
– Short-term and Long-term planning
• PTSP: Automated map generation
• Simulation Based Search
– Monte Carlo Tree Search
2
The Physical Travelling Salesman Problem
Travelling Salesman Problem: turn it into a real-time game!
• Drive a ship.
• In a maze.
• With constraints:
– 10 waypoints to reach.
– 1000 steps to visit the next waypoint.
– 40 ms to decide an action.
– 1 s initialization.
3
The Physical Travelling Salesman Problem
4
The Physical Travelling Salesman Problem
• Features some aspects of modern video games
– Pathfinding
– Real-time game
• Competitions
– www.ptsp-game.net (expired)
– WCCI/CIG 2012, CIG 2013
• Winner: MCTS.
• PTSP specifications
– 6 actions
– 40 ms per game tick
– 10 waypoints
– 1000 steps to visit the next waypoint
5
The Physical Travelling Salesman Problem
• Challenge: generate levels for the PTSP. Not just any level: we want maps
where a controller that finds better routes obtains better results than
one that does not.
• We want to reward the agents that play the game better (more
intelligently).
• Manual creation of levels is laborious (56 maps for the competition).
• Manually created maps are not necessarily good either: they might fail to
provide a good distribution of waypoints.
• One step further: generate levels where there is more than one possible
(physics-based) route that is still better than the others.
6
Solving the PTSP – TSP Solvers
7
Including the route planner
• Question: Is any order better than none? No.
• Solve the TSP:
– Branch and Bound algorithm.
– Cost between waypoints: Euclidean distance.
• Better: cost between waypoints given by the A* path cost (see the sketch
below).
• Can we improve it further?
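A minimal sketch of the route planner just described: Branch and Bound over
waypoint orders with a pluggable pairwise cost, so the Euclidean distance can
be swapped for an A* path cost. Names are illustrative, not the competition
framework's API.

```python
import math

def euclidean(a, b):
    """Straight-line cost between two waypoints given as (x, y) tuples."""
    return math.dist(a, b)

def branch_and_bound_route(start, waypoints, cost=euclidean):
    """Cheapest order in which to visit all waypoints from `start`.

    `cost(a, b)` is pluggable: Euclidean distance here, or an A* path
    cost computed on the map's navigation graph for the DTSP case.
    """
    best_route, best_cost = None, float("inf")

    def visit(current, remaining, route, so_far):
        nonlocal best_route, best_cost
        if so_far >= best_cost:              # bound: prune dominated branches
            return
        if not remaining:                    # all waypoints visited: new best
            best_route, best_cost = route, so_far
            return
        for w in sorted(remaining, key=lambda w: cost(current, w)):
            visit(w, remaining - {w}, route + [w], so_far + cost(current, w))

    visit(start, frozenset(waypoints), [], 0.0)
    return best_route, best_cost
```

With 10 waypoints there are 10! possible orders, but pruning against the best
cost found so far keeps the search tractable within the initialization budget.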
8
Improving the route planner
9
Improving the route planner
10
Improving the route planner
• Interdependency between long- and short-term planning
• Using the proper MCTS driver to evaluate routes is prohibitively costly
• Add turn angles to the cost of the paths
11
Long-term vs Short-term
• Long-term vs. Short-term planning.
– Tree-search limitations and long-term planning in
real-time games.
• PTSP
– The optimal distance-based TSP solution is not necessarily the
optimal physics-based TSP solution.
12
Long-term vs Short-term
• Long-term vs. Short-term planning.
– Game Level Design
• Challenging maps for PTSP?
• Two agents (Agent-based Procedural Content Generation):
– One that uses the optimal physics-based TSP solver
– The other one uses the optimal distance-based TSP solver
• In a well-designed level, the performance of the first agent
(P1) should be better than that of the second (P2): the level
rewards better skill.
• By maximizing the difference (P1 – P2), e.g. by evolution, we
should obtain better balanced levels:
Automated Map Generation for the Physical Travelling Salesman Problem. Diego Perez, Julian
Togelius, Spyridon Samothrakis, Philipp Rohlfshagen and Simon M. Lucas, in IEEE Transactions on
Evolutionary Computation, 18:5, pp. 708–720, 2013.
13
PTSP: Map Representation and Evolutionary Algorithm
• Each map is encoded as a vector of floating-point values describing its
elements, including L lines and S rectangles.
• Problem: Feasibility of maps. All waypoints must be reachable from
the ship's initial location. Some maps (individuals) are infeasible and
must be discarded (or given a very low fitness).
• Algorithm: CMA-ES
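A sketch of how the evolutionary loop could look, assuming the `cma` Python
package and placeholder functions `decode_map`, `is_feasible` and
`evaluate_map` for the vector-to-map decoding, the reachability check and the
simulation-based evaluation of the next slides. The genome length and step
size are illustrative.

```python
import cma  # pip install cma

GENOME_LENGTH = 80        # illustrative: depends on the number of lines/rectangles
INFEASIBLE_PENALTY = 1e6  # "very low fitness" (CMA-ES minimises, so a large cost)

def map_cost(genome):
    game_map = decode_map(genome)      # placeholder: float vector -> PTSP map
    if not is_feasible(game_map):      # are all waypoints reachable from the ship?
        return INFEASIBLE_PENALTY
    return -evaluate_map(game_map)     # minimise the negated map fitness

es = cma.CMAEvolutionStrategy(GENOME_LENGTH * [0.5], 0.3)
while not es.stop():
    genomes = es.ask()                                 # sample a population
    es.tell(genomes, [map_cost(g) for g in genomes])   # update the search distribution
best_map = decode_map(es.result.xbest)
```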
14
PTSP: Map Evaluation
• Simulation-based evaluation function: we use three PTSP
controllers to evaluate the maps being evolved. Each controller
calculates and follows a different route:
1. Nearest-first TSP (NTSP): the order of waypoints is obtained by applying
the nearest-first algorithm to the TSP (see the sketch after this list).
2. Distance TSP (DTSP): the distance between two points is calculated using
A*, and the route is found with the Branch and Bound (B&B) algorithm
(regular TSP solver).
3. Physics TSP (PTSP): B&B is used to find the route, but the distance between
two points is affected by physical parameters such as the expected speed
and orientation of the ship at that point.
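A sketch of the nearest-first ordering used by the NTSP controller; the DTSP
and PTSP controllers differ in the pairwise cost they plug in (A* distance vs.
a physics-aware cost) and in using Branch and Bound instead of this greedy
loop.

```python
def nearest_first_route(start, waypoints, cost):
    """Greedy TSP route: always head to the closest unvisited waypoint."""
    route, current, remaining = [], start, set(waypoints)
    while remaining:
        nxt = min(remaining, key=lambda w: cost(current, w))
        route.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return route
```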
15
PTSP: Map Evaluation
• The evolved maps should have the following property: the time NTSP
needs to complete a map must be longer than DTSP's, which in turn
must be longer than PTSP's:
NTSP > DTSP > PTSP
• We can model a fitness function that maximizes:
f = min(NTSP - DTSP, DTSP - PTSP)
(or minimizes the negation of that).
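The same fitness written as a small function; the three arguments stand for
the completion times of the three controllers on the candidate map.

```python
def map_fitness(ntsp_time, dtsp_time, ptsp_time):
    """Higher is better: both route-planning improvements must pay off."""
    return min(ntsp_time - dtsp_time, dtsp_time - ptsp_time)
```

If the desired ordering NTSP > DTSP > PTSP is violated, at least one of the
differences is negative and the fitness drops accordingly.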
16
PTSP: Experiments
17
PTSP: Experiments
18
Simulation-based Search
Forward search algorithms select the next action to take by looking
ahead at the states reached after applying the available actions.
• They need a model of the game to simulate these
actions.
• Given a current state St and an action At applied from St, the
forward model provides the next state St+1:
(St, At) → Forward Model → St+1
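A minimal interface matching the diagram above; the `copy`, `apply_physics`
and `tick` members of the state are assumptions, not the framework's real API.

```python
class ForwardModel:
    """Given (St, At), return St+1 without touching the real game."""

    def next_state(self, state, action):
        raise NotImplementedError

class PTSPForwardModel(ForwardModel):
    def next_state(self, state, action):
        s = state.copy()           # never advance the real game state
        s.apply_physics(action)    # thrust/rotation, collisions, waypoint checks
        s.tick += 1
        return s
```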
19
Simulation-based Search
Simulate episodes of experience from the current state using the
model:
• Until reaching a terminal state (game end) or a predefined depth
(e.g. the end of a chess game may be many plies ahead!)
20
Flat Monte Carlo
• Monte Carlo (MC) methods for long-term planning:
– Unsuitable for long-term planning.
– Terminal states are not reached.
21
Flat Monte Carlo
Given a model M and a simulation policy p that picks actions
uniformly at random:
• Iteratively, run K episodes
(iterations) from each one
of the available actions.
• Select the action at each
step uniformly at random,
until reaching a terminal
state or the predefined depth.
• Compute the average
reward for each action.
• Pick the action that leads
to the highest average
reward after all episodes (K per action).
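Flat Monte Carlo as just described, written against the hypothetical
forward-model interface sketched earlier; `is_terminal()` and `reward()` on
the state are also assumptions, and the budget constants are illustrative.

```python
import random

K = 100              # episodes per action
ROLLOUT_DEPTH = 50   # predefined depth if no terminal state is reached

def flat_monte_carlo(model, state, actions):
    totals = {a: 0.0 for a in actions}
    for first in actions:                       # K episodes from each action
        for _ in range(K):
            s, depth = model.next_state(state, first), 1
            while not s.is_terminal() and depth < ROLLOUT_DEPTH:
                s = model.next_state(s, random.choice(actions))  # uniform policy
                depth += 1
            totals[first] += s.reward()         # value of the reached state
    return max(actions, key=lambda a: totals[a] / K)  # highest average reward
```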
22
Regret and Bounds
Is picking actions at random the best strategy?
Should we give to all actions the same amount of trials?
We are treating all actions as equal, although some might be
clearly better than others. But then, what is the best policy?
• Balance between exploration and exploitation.
– Exploitation: Make the best decision based on current
information.
– Exploration: Gather more information about the environment.
That is, sometimes not choosing the best action found so far.
• The objective is to gather enough information to make the
best overall decision.
• N-armed Bandit Problem
23
UCB1
UCB1 (typically found in the literature in this form):
a* = argmax_a [ Q(s,a) + C * SQRT( ln N(s) / N(s,a) ) ]
• Q(s,a): Q-value of action a from state s (average of
rewards after taking action a from state s).
• N(s): Times the state s has been visited.
• N(s,a): Times the action a has been picked from state s.
• C: balances exploitation (the Q term) and
exploration (the square-root term).
– The value of C is application-dependent.
– Example: single-player games with rewards in [0,1]: C =
SQRT(2)
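The same formula as a small helper; `q`, `n_s` and `n_sa` stand for Q(s,a),
N(s) and N(s,a).

```python
import math

def ucb1(q, n_s, n_sa, c=math.sqrt(2)):
    """Exploitation (q) plus an exploration bonus that shrinks with visits."""
    if n_sa == 0:
        return float("inf")   # untried actions are sampled first
    return q + c * math.sqrt(math.log(n_s) / n_sa)
```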
24
Flat UCB
Given a model M and a bandit-based simulation policy p:
• Iteratively, apply K episodes
(iterations).
• Select the first action from St
with a bandit-based policy
(UCB1, or any other UCB).
• Pick actions uniformly at
random until reaching
terminal state or pre-defined
depth.
• Compute the average reward
for each action.
• Pick the action that leads to
the highest average reward
after K episodes.
Note that the policy p improves at each episode!
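A sketch of Flat UCB reusing the `ucb1` helper from the previous slide: the
bandit picks only the first action, everything below it is a random rollout
(the `rollout` argument could be the inner loop of the Flat Monte Carlo
sketch). All names are illustrative.

```python
def flat_ucb(model, state, actions, episodes, rollout):
    """UCB1 at the root, uniformly random actions below it."""
    n = {a: 0 for a in actions}     # N(s,a) at the root
    q = {a: 0.0 for a in actions}   # Q(s,a) at the root
    for t in range(1, episodes + 1):
        a = max(actions, key=lambda a: ucb1(q[a], t, n[a]))  # bandit-based pick
        r = rollout(model, model.next_state(state, a))       # random episode
        n[a] += 1
        q[a] += (r - q[a]) / n[a]            # incremental mean of the rewards
    return max(actions, key=lambda a: q[a])  # highest average reward
```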
25
Building a Search Tree
Given a model M and the current simulation policy p:
• For each action a ∈ A:
– Simulate K episodes from the current state St.
– Each episode is run until reaching a terminal state ST with an
associated reward RT (or until a maximum number of moves is reached).
– Compute Mean/Expected Return for each action.
• Build a search tree containing visited states and actions
Recommendation Policy:
• Select action to apply with highest Expected Return
(greedy recommendation policy)
26
Simulation-Based Search: Building a tree
Applying a UCB policy and adding a node (state) at each iteration, the
tree grows asymmetrically, towards the most promising parts of the
search space.
This is, however, limited by how far we can look ahead into the future
(less far than with random roll-outs outside the tree – the full tree
would be too big!).
27
Monte Carlo Tree Search
Adding Monte Carlo simulations (or roll-outs) after adding a new node to the
tree: Monte Carlo Tree Search.
• 2 different policies are used in each episode:
– Tree Policy: improves at each iteration. It is used while the simulation is
inside the tree.
– Default Policy: fixed throughout all iterations. It is used while the simulation
is outside the tree. Picks actions at random.
• On each iteration:
– Q(s,a) is updated on each visited node of the tree.
– So are N(s) and N(s,a).
– The tree policy p is based on Q (e.g., UCB1): it improves at each iteration!
• Converges to the optimal search tree. What about the best action overall?
– Recommendation Policy: how to pick the best action after the search has
concluded: highest average reward, most visited action, etc.
28
Monte Carlo Tree Search
Monte Carlo Tree Search:
Builds an asymmetric tree.
Further look ahead.
29
Monte Carlo Tree Search
4 Steps: Repeated iteratively during K episodes.
1. Tree selection: following the tree policy (e.g. UCB1), navigate the tree until
reaching a node with at least one child state not in the tree (that is, not all
actions have been picked from that state in the tree).
2. Expansion: add a new node to the tree, as a child of the node reached in
the tree selection step.
30
Monte Carlo Tree Search
4 Steps: Repeated iteratively during K episodes.
3. Monte Carlo simulation: following the default policy (picking actions uniformly at
random), advance the state until a terminal state (game end) or a pre-defined
maximum number of steps is reached. The state at the end of the simulation is
evaluated (that is, RT is retrieved).
4. Back-propagation: update the values of Q(s,a), N(s) and N(s,a) of the nodes visited
in the tree during steps 1 and 2.
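The four steps assembled into one sketch, under the same assumed state and
model interface as the earlier snippets (`next_state`, `is_terminal`,
`reward`); the constants and the most-visited recommendation policy are
illustrative choices, not the competition controller.

```python
import math
import random

C = math.sqrt(2)      # exploration constant (single player, rewards in [0, 1])
ROLLOUT_DEPTH = 50    # cap on the default-policy simulation

class Node:
    def __init__(self, state, parent=None, action=None, actions=()):
        self.state, self.parent, self.action = state, parent, action
        self.untried = list(actions)   # actions not yet expanded from this node
        self.children = []
        self.n, self.q = 0, 0.0        # N(s,a) and mean reward for self.action

    def ucb1(self):
        return self.q + C * math.sqrt(math.log(self.parent.n) / self.n)

def mcts(model, root_state, actions, iterations):
    root = Node(root_state, actions=actions)
    for _ in range(iterations):
        node = root
        # 1. Tree selection: follow the tree policy while fully expanded
        while not node.untried and node.children:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add one child for an action not yet tried here
        if node.untried:
            a = node.untried.pop()
            child = Node(model.next_state(node.state, a), node, a, actions)
            node.children.append(child)
            node = child
        # 3. Monte Carlo simulation with the default (random) policy
        s, depth = node.state, 0
        while not s.is_terminal() and depth < ROLLOUT_DEPTH:
            s = model.next_state(s, random.choice(actions))
            depth += 1
        r = s.reward()
        # 4. Back-propagation: update N and Q on every node visited in 1-2
        while node is not None:
            node.n += 1
            node.q += (r - node.q) / node.n
            node = node.parent
    return max(root.children, key=lambda c: c.n).action  # most visited child
```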
31
Solving the PTSP – Value function
• Heuristic to guide search algorithm when choosing
next moves to make.
• Reward/Fitness for mid-game situations.
• Components:
– Distance to the next waypoints in the route.
– State (visited/unvisited) of the next waypoints.
– Time spent since the beginning of the game.
– Collision penalization.
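A sketch of how these four components could be combined into a single score;
the weights and the state accessors (`visited`, `distance_to`, `num_visited`,
`time`, `collisions`) are illustrative assumptions, not the winning
controller's exact function.

```python
# Illustrative weights: reaching waypoints dominates, collisions and time hurt.
W_DIST, W_WAYPOINT, W_TIME, W_COLLISION = 1.0, 1000.0, 0.1, 50.0

def state_value(state, route):
    """Higher is better: rewards mid-game progress along the planned route."""
    next_wp = next((w for w in route if not state.visited(w)), route[-1])
    value  = -W_DIST      * state.distance_to(next_wp)  # close the gap to the next waypoint
    value +=  W_WAYPOINT  * state.num_visited()         # waypoints already collected
    value += -W_TIME      * state.time                  # time spent since the game started
    value += -W_COLLISION * state.collisions            # collision penalization
    return value
```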
32
Advantages of Monte Carlo Tree Search
• Highly selective best-first search.
• Evaluates states dynamically (unlike Dynamic Programming).
• Uses samples to break the curse of dimensionality.
• Works in black-box models (only needs samples).
• It is computationally efficient (good for real-time
games).
• It is anytime: it can be stopped after any number of iterations K and
return an action from the root at any moment in time.
• It is parallelizable: run multiple iterations in parallel.
33