CE810 / IGGI Game Design II
Game Design with AI Agents
Diego Perez-Liebana

Contents
• The Physical Travelling Salesman Problem
  – The Problem
  – The Framework
  – Short-term and Long-term planning
• PTSP: Automated map generation
• Simulation-Based Search
  – Monte Carlo Tree Search

The Physical Travelling Salesman Problem
Take the Travelling Salesman Problem and turn it into a real-time game: drive a ship, in a maze, with constraints:
• 10 waypoints to reach.
• 1000 steps to visit the next waypoint.
• 40ms to decide an action.
• 1s for initialization.

The Physical Travelling Salesman Problem
• Features some aspects of modern video games:
  – Pathfinding
  – Real-time play
• Competitions:
  – www.ptsp-game.net (expired)
  – WCCI/CIG 2012, CIG 2013
  – Winner: MCTS.
• PTSP specifications:
  – 6 actions
  – 40ms per game tick
  – 10 waypoints
  – 1000 steps to visit the next waypoint

The Physical Travelling Salesman Problem
• Challenge: generate levels for the PTSP. Not just any level: we want maps where a controller that finds better routes obtains better results than one that does not.
• We want to reward the agents that play the game better (more intelligently).
• Manual creation of levels is laborious (56 maps for the competition).
• Manually created maps are not great: they may fail to provide a good distribution of waypoints.
• One step further: generate levels where there is more than one possible (physics-based) route that is still better than the others.

Solving the PTSP – TSP Solvers

Including the route planner
• Question: Is any order better than none? No.
• Solve the TSP:
  – Branch and Bound algorithm.
  – Cost between waypoints: Euclidean distance.
• Cost between waypoints: A* path cost.
• Can we improve it further?

Improving the route planner
• Interdependency between long- and short-term planning.
• Using the actual MCTS driver is prohibitively costly.
• Add turn angles to the cost of the paths.

Long-term vs Short-term
• Long-term vs. short-term planning.
  – Tree-search limitations and long-term planning in real-time games.
• PTSP:
  – Optimal distance-based TSP solution ≠ optimal physics-based TSP solution.

Long-term vs Short-term
• Long-term vs. short-term planning.
  – Game level design.
• Challenging maps for the PTSP?
• Two agents (agent-based Procedural Content Generation):
  – One uses the optimal physics-based TSP solver.
  – The other uses the optimal distance-based TSP solver.
• In a well-designed level, the performance of the first agent (P1) should be better than that of the second (P2): the level rewards better skill.
• By maximizing the difference (P1 – P2), e.g. by evolution, we should obtain more balanced levels:

Automated Map Generation for the Physical Travelling Salesman Problem. Diego Perez, Julian Togelius, Spyridon Samothrakis, Philipp Rohlfshagen and Simon M. Lucas, IEEE Transactions on Evolutionary Computation, 18(5), pp. 708–720, 2013.

PTSP: Map Representation and Evolutionary Algorithm
• Each map is encoded as a set of floating-point values, with L lines and S rectangles.
• Problem: feasibility of maps. All waypoints must be reachable from the ship's initial location. Infeasible maps (individuals) must be discarded or given a very low fitness (a feasibility check is sketched below).
• Algorithm: CMA-ES.
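As an illustration of the feasibility test above, here is a minimal Python sketch (not the competition code) that rasterizes a candidate map onto a grid and flood-fills from the ship's starting cell to verify that every waypoint is reachable. The grid representation, the blocked predicate and the penalty value are assumptions made for this example.

from collections import deque

def reachable_cells(blocked, start, width, height):
    """Breadth-first flood fill over a rasterized map.
    blocked(x, y) returns True if grid cell (x, y) is covered by an obstacle."""
    seen = {start}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height \
                    and not blocked(nx, ny) and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return seen

def map_is_feasible(blocked, ship_cell, waypoint_cells, width, height):
    """A map is feasible if every waypoint lies in the region reachable by the ship."""
    region = reachable_cells(blocked, ship_cell, width, height)
    return all(w in region for w in waypoint_cells)

# Infeasible individuals can be discarded or, as suggested above, assigned a
# very low fitness so that the evolutionary algorithm steers away from them.
INFEASIBLE_FITNESS = -1e6  # illustrative penalty value, not taken from the paper

With CMA-ES, such a penalty simply makes the infeasible regions of the search space unattractive, so the evolution concentrates on maps that can actually be completed.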
PTSP: Map Evaluation
• Simulation-based evaluation function: three PTSP controllers are used to evaluate the maps being evolved. Each controller calculates and follows a different route:
  1. Nearest-first TSP (NTSP): the order of waypoints is obtained by applying the nearest-first heuristic to the TSP.
  2. Distance TSP (DTSP): the distance between two waypoints is calculated with A*, and the route with the Branch and Bound (B&B) algorithm (a regular TSP solution).
  3. Physics TSP (PTSP): B&B is used to find the route, but the distance between two waypoints is affected by physical parameters such as the expected speed and orientation of the ship at that point.

PTSP: Map Evaluation
• The evolved maps should have the following property: the time needed to complete a map with NTSP must be longer than with DTSP, which in turn must be longer than with PTSP: NTSP > DTSP > PTSP.
• We can therefore maximize the fitness function f = min(NTSP - DTSP, DTSP - PTSP) (or, equivalently, minimize its negation).

PTSP: Experiments

Simulation-based Search
Forward search algorithms select the next action to take by looking ahead at the states found after applying the available actions.
• They need a model of the game to simulate these actions.
• Given a current state St and an action At applied from St, the forward model provides the next state St+1:
  St, At → [Forward Model] → St+1

Simulation-based Search
Simulate episodes of experience from the current state using the model:
• until reaching a terminal state (game end) or a predefined depth (the end of a chess game may be many plies ahead!).

Flat Monte Carlo
• Monte Carlo (MC) methods and long-term planning: plain MC roll-outs are unsuitable for long-term planning, as terminal states are not reached.

Flat Monte Carlo
Given a model M and a simulation policy p that picks actions uniformly at random:
• Iteratively, run K episodes (iterations) from each of the M available actions.
• Select the action at each step uniformly at random, until reaching a terminal state or the predefined depth.
• Compute the average reward for each action.
• Pick the action with the highest average reward after K×M episodes.

Regret and Bounds
Is picking actions at random the best strategy? Should we give all actions the same number of trials? We are treating all actions as equal, although some may be clearly better than others. What, then, is the best policy?
• Balance between exploration and exploitation.
  – Exploitation: make the best decision based on the current information.
  – Exploration: gather more information about the environment, i.e. choose something other than the best action found so far.
• The objective is to gather enough information to make the best overall decision.
• This is the N-armed Bandit Problem.

UCB1
UCB1 (typically found in the literature in this form): pick the action a that maximizes
  Q(s,a) + C * sqrt( ln N(s) / N(s,a) )
• Q(s,a): Q-value of action a from state s (average of the rewards obtained after taking action a from state s).
• N(s): number of times state s has been visited.
• N(s,a): number of times action a has been picked from state s.
• C: balances exploitation (the Q term) and exploration (the square-root term).
  – The value of C is application dependent.
  – Example: for single-player games with rewards in [0,1], C = SQRT(2).

Flat UCB
Given a model M and a bandit-based simulation policy p:
• Iteratively, run K episodes (iterations).
• Select the first action from St with a bandit-based policy (UCB1, or any other UCB variant).
• Pick the remaining actions uniformly at random until reaching a terminal state or the predefined depth.
• Compute the average reward for each action.
• Pick the action with the highest average reward after K episodes.
Note that the policy p improves at each episode!
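The Flat UCB procedure above can be written down in a few lines. The sketch below assumes a hypothetical forward-model interface with actions(state), next(state, action), is_terminal(state) and reward(state) methods (these names are illustrative, not the PTSP framework's API); UCB1 chooses the first action of every episode and the rest of the roll-out is uniformly random.

import math
import random

def flat_ucb(model, state, num_episodes, max_depth, c=math.sqrt(2)):
    """Flat UCB: a UCB1 bandit over the root actions, random roll-outs below them."""
    actions = model.actions(state)
    n = {a: 0 for a in actions}    # N(s, a): times action a has been tried
    q = {a: 0.0 for a in actions}  # Q(s, a): average reward obtained from action a

    for _ in range(num_episodes):
        # Bandit-based choice of the first action: try every action once,
        # then pick the one maximizing Q(s,a) + C * sqrt(ln N(s) / N(s,a)).
        untried = [a for a in actions if n[a] == 0]
        if untried:
            a = random.choice(untried)
        else:
            total = sum(n.values())  # N(s): visits to the root state
            a = max(actions,
                    key=lambda b: q[b] + c * math.sqrt(math.log(total) / n[b]))

        # Default policy: uniformly random roll-out until terminal state or depth limit.
        s, depth = model.next(state, a), 1
        while not model.is_terminal(s) and depth < max_depth:
            s = model.next(s, random.choice(model.actions(s)))
            depth += 1
        r = model.reward(s)

        # Incremental update of the running average for the chosen first action.
        n[a] += 1
        q[a] += (r - q[a]) / n[a]

    # Recommendation policy: highest average reward at the root.
    return max(actions, key=lambda a: q[a])

Compared with Flat Monte Carlo, only the choice of the first action changes; that single change is what makes the simulation policy improve over the episodes.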
Building a Search Tree
Given a model M and the current simulation policy p:
• For each action a ∈ A:
  – Simulate K episodes from the current state St.
  – Each episode runs until a terminal state ST with an associated reward RT is reached (or until a maximum number of moves is exceeded).
  – Compute the mean/expected return for each action.
• Build a search tree containing the visited states and actions.
Recommendation policy:
• Select the action with the highest expected return (greedy recommendation policy).

Simulation-Based Search: Building a tree
Applying a UCB policy and adding a node (state) at each iteration, the tree grows asymmetrically, towards the most promising parts of the search space. This is, however, limited by how far we can look ahead into the future (less far than with random roll-outs outside the tree, as the tree would become too big).

Monte Carlo Tree Search
Adding Monte Carlo simulations (or roll-outs) after a new node is added to the tree gives Monte Carlo Tree Search.
• Two different policies are used in each episode:
  – Tree Policy: improves at each iteration. It is used while the simulation is inside the tree.
  – Default Policy: fixed throughout all iterations. It is used while the simulation is outside the tree. It picks actions at random.
• On each iteration:
  – Q(s,a) is updated at each visited node of the tree.
  – So are N(s) and N(s,a).
  – The tree policy p is based on Q (e.g., UCB, UCB1), so it improves at each iteration!
• MCTS converges to the optimal search tree. What about the optimal action overall?
  – Recommendation Policy: how to pick the best action once the search has concluded: highest average reward, most visited action, etc.

Monte Carlo Tree Search
Monte Carlo Tree Search builds an asymmetric tree, allowing further look-ahead.

Monte Carlo Tree Search
Four steps, repeated iteratively during K episodes:
1. Tree selection: following the tree policy (e.g. UCB1), navigate the tree until reaching a node with at least one child state not in the tree (that is, a node from which not all actions have been picked).
2. Expansion: add a new node to the tree, as a child of the node reached in the tree-selection step.
3. Monte Carlo simulation: following the default policy (picking actions uniformly at random), advance the state until a terminal state (game end) or a pre-defined maximum number of steps is reached. The state at the end of the simulation is evaluated (that is, retrieve RT).
4. Back-propagation: update the values of Q(s,a), N(s) and N(s,a) in the nodes visited in the tree during steps 1 and 2.
A compact sketch of the full loop is given below.

Solving the PTSP – Value function
• Heuristic to guide the search algorithm when choosing the next moves to make.
• Provides a reward/fitness for mid-game situations.
• Components:
  – Distance to the next waypoints in the route.
  – State (visited/unvisited) of the next waypoints.
  – Time spent since the beginning of the game.
  – Collision penalization.

Advantages of Monte Carlo Tree Search
• Highly selective best-first search.
• Evaluates states dynamically (unlike Dynamic Programming).
• Uses sampling to break the curse of dimensionality.
• Works with black-box models (it only needs samples).
• Computationally efficient (good for real-time games).
• Anytime: it can be stopped at any value of K and return an action from the root at any moment in time.
• Parallelizable: multiple iterations can be run in parallel.
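As promised above, here is a compact, generic MCTS sketch that ties the four steps together. It reuses the hypothetical forward-model interface from the Flat UCB sketch (actions, next, is_terminal, reward) and assumes rewards roughly in [0, 1] so that C = sqrt(2) is a sensible default; it is an illustration of the algorithm described in these slides, not the PTSP competition controller.

import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action     # action that led from the parent to this node
        self.children = []
        self.untried = None      # actions not yet expanded from this node
        self.n = 0               # N(s): number of visits to this node
        self.q = 0.0             # average reward backed up through this node

def uct_child(node, c=math.sqrt(2)):
    # Tree policy: UCB1 over the children of a fully expanded node.
    return max(node.children,
               key=lambda ch: ch.q + c * math.sqrt(math.log(node.n) / ch.n))

def mcts(model, root_state, num_iterations, max_depth):
    root = Node(root_state)
    root.untried = list(model.actions(root_state))

    for _ in range(num_iterations):
        node, depth = root, 0

        # 1. Tree selection: descend with the tree policy while nodes are fully expanded.
        while not node.untried and node.children:
            node = uct_child(node)
            depth += 1

        # 2. Expansion: add one new child node for a not-yet-tried action.
        if node.untried and not model.is_terminal(node.state):
            a = node.untried.pop(random.randrange(len(node.untried)))
            child = Node(model.next(node.state, a), parent=node, action=a)
            child.untried = list(model.actions(child.state))
            node.children.append(child)
            node, depth = child, depth + 1

        # 3. Monte Carlo simulation (default policy): random roll-out outside the tree.
        state = node.state
        while not model.is_terminal(state) and depth < max_depth:
            state = model.next(state, random.choice(model.actions(state)))
            depth += 1
        reward = model.reward(state)

        # 4. Back-propagation: update N(s) and the average reward up to the root.
        while node is not None:
            node.n += 1
            node.q += (reward - node.q) / node.n
            node = node.parent

    # Recommendation policy: the most visited child of the root.
    return max(root.children, key=lambda ch: ch.n).action

The recommendation policy used here is the most visited child; returning the child with the highest average reward, as mentioned in the slides, is an equally valid choice.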