Cat and Mouse
Dmitry Vyukov
mailto:[email protected]
October 11, 2010

Problem Statement
The game of "Cat and Mouse" is played on a directed graph between two players. Alternating turns, with the Mouse making the first move, the two players travel around the nodes of the graph along its directed edges. The object of the game is for the Mouse player to reach the goal node before the Cat player can occupy the node on which the Mouse stands; the Cat player wishes to "catch" the Mouse.

Problem Description: Write a threaded program to input a directed graph and the starting nodes of the Mouse and Cat, and then compute the number of wins, draws and losses for the Mouse player for all possible games within a specified number of moves, according to the game rules given below. Input to the program will come from the single file name included on the command line. The file will hold the start nodes for the players, the maximum number of moves, and the directed graph. Output for the program will be the number of winning, losing and drawing strategies for the Mouse player over all possible paths through the graph that are less than or equal to the maximum number of moves input, and, if one exists, an example of the paths for the Mouse and Cat player that result in a win for the Mouse.

Game Description: To start, each player occupies a given node in the graph. Players alternate moves until the game is won by a player or each player has executed the given maximum number of moves. To execute a move, a player either 1) follows a directed edge of the graph originating from the currently occupied node to another node, or 2) remains on the current node. The Mouse player has the first move and attempts to reach the designated goal node; if this is done, the Mouse player has won and the game is over. The Cat player attempts to occupy the same node as the Mouse player; if this event occurs at any time during the game, the Cat player has won and the game is over.
One stipulation is that the Cat player may not enter the goal node at any time during the game. For purposes of this problem, a draw is declared if the Mouse has not reached the goal node and the Cat has not caught the Mouse after each player has executed the maximum number of moves specified in the input.

Input Description: The input to the program will come from an input text file given on the application command line. Graph nodes will be represented by three capital letters in the range ['A'..'Z']. The first line of the input file will hold the starting node of the Mouse player, the second line will be the starting node of the Cat player, the third line will hold the goal node, and the fourth line will hold a positive integer indicating the maximum number of moves to examine (MM). The remaining lines in the file will be edges of the directed graph, each represented as a string of 6 capital letters: the first three letters are the source node and the second three letters are the sink node of the edge.

Output Description: The output to be generated by the application is the winning chances for the Mouse player over all possible paths of length MM (or shorter in case of a win). The total number of winning, drawing, and losing cases will be output to stdout. If there is at least one winning case for the Mouse, a path for both players that results in a Mouse win will be output; if there is no winning path, that fact will be noted.

Single-threaded Implementation
The recursive exhaustion algorithm that naturally follows from the problem statement is easy to construct. We just need to model all possible mouse and cat moves until the move limit is reached. Once a game crosses either a win or a loss state, we need to memorize that fact and never override it in any subsequent game state. For example, if a game crosses a loss state, where the mouse position is equal to the cat position, all subsequent paths from this state result in a mouse loss, even if the mouse reaches the goal node later on.
Once we exhaust the maximum number of moves, we memorize the result: win, loss, or draw if the game has not crossed a win/loss state. Here is the algorithm:

```cpp
enum path_type { path_draw, path_win, path_loss };

struct result_t {
    uint64_t win_count;
    uint64_t loss_count;
    uint64_t draw_count;
};

void calculate(
    vector<vector<size_t>> const& matrix, // adjacency lists
    size_t mouse_pos,   // current mouse position
    size_t cat_pos,     // current cat position
    size_t goal_pos,    // goal position
    size_t move_count,  // remaining move count
    bool mouse_turn,    // whose turn?
    path_type type,     // result for current path
    result_t& result)   // total result statistics
{
    // First, we try to define the result for the current path, if it's not yet
    // defined. If the result is already defined, then we shall not override it.
    if (type == path_draw) {
        if (mouse_pos == cat_pos)
            type = path_loss;
        else if (mouse_pos == goal_pos)
            type = path_win;
    }
    // If we have exhausted all moves, memorize the result for the current path.
    if (move_count == 0) {
        if (type == path_win)
            result.win_count += 1;
        else if (type == path_draw)
            result.draw_count += 1;
        else
            result.loss_count += 1;
    }
    // Otherwise, explore all possible moves.
    else {
        if (mouse_turn) {
            if (mouse_pos == goal_pos) {
                // If the mouse has reached the goal, then she must stay there.
                calculate(matrix, mouse_pos, cat_pos, goal_pos,
                          move_count, !mouse_turn, type, result);
            }
            else {
                // Otherwise, explore possible moves from the vertex.
                for (size_t i = 0; i != matrix[mouse_pos].size(); i += 1)
                    calculate(matrix, matrix[mouse_pos][i], cat_pos, goal_pos,
                              move_count, !mouse_turn, type, result);
            }
        }
        else {
            // Explore possible moves from the vertex,
            // with the exception of the goal vertex.
            for (size_t i = 0; i != matrix[cat_pos].size(); i += 1) {
                if (matrix[cat_pos][i] != goal_pos)
                    calculate(matrix, mouse_pos, matrix[cat_pos][i], goal_pos,
                              move_count - 1, !mouse_turn, type, result);
            }
        }
    }
}
```

However, the problem with this algorithm is its exponential computational complexity of O(K^(2*MM)) (where K is the mean number of edges outgoing from a node, and MM is the maximum number of moves). For example, for a relatively small input with K=10 and MM=10, the computational complexity is 10^20, which is basically impossible to compute.

Dynamic programming to the rescue! Dynamic programming (DP) is a general method which can be used to solve problems with overlapping subtasks and optimal substructure. Overlapping in this context means that the same subtasks are encountered several times during solving, and optimal substructure means that optimal solutions of subtasks can be used to construct the optimal solution of a supertask. If both conditions are satisfied, then each subtask is solved only once, and the result is reused whenever the subtask is encountered again. The tricky part is to determine what a subtask is, and how to efficiently organize memorization and reuse of the results of subtasks. The key insight is that the overlapping subtasks with optimal substructure are solutions of games of the form game(n, m, c) (n is the remaining number of moves, m the mouse position, c the cat position). Each such subtask is encountered many times, can be solved independently, and its result can be reused.

There are 2 approaches to DP: top-down and bottom-up. In top-down DP we start solving the sought-for task, and then solve and memorize subtasks as they are encountered (then subtasks of those subtasks, etc).
In bottom-up DP we start from the primitive leaf subtasks, and then use the results for solving higher-level tasks, all the way up to the sought-for task. It's generally acknowledged that bottom-up DP is more efficient when it can be applied. And bottom-up DP can indeed be applied to our problem in the following way.

First, let's consider games of the form game(0, m, c) (that is, games with 0 moves remaining, essentially all possible final states). It's trivial to calculate results for such games: if m==c, then it's a loss; if m==g (the goal), then it's a win; and it's a draw otherwise. The results are memorized for future reuse. Now let's consider games of the form game(1, m, c) (that is, games with 1 move remaining). It's possible to compute the result for game(1, m0, c0) by summing up the results for all game(0, m, c) that can be reached from game(1, m0, c0) in 1 move. Now we can generalize this approach to: game(n, m, c) = SUM game(n-1, mi, ci) over all possible (mi, ci).

[Figure: graphical scheme of the algorithm. Processing is done bottom-up; arrows represent addition operations; arrows are shown for only 1 cell (namely, (1,1)) for clarity.]

Here is somewhat simplified pseudo-code:

```cpp
// for all moves
for (int move = 1; move != MM + 1; move += 1) {
    // for all mouse positions
    for (int m = 0; m != node_count; m += 1) {
        // for all cat positions
        for (int c = 0; c != node_count; c += 1) {
            // for all possible mouse moves from 'm'
            for (int mi2 = 0; mi2 != graph[m].size(); mi2 += 1) {
                int m2 = graph[m][mi2];
                // for all possible cat moves from 'c'
                for (int ci2 = 0; ci2 != graph[c].size(); ci2 += 1) {
                    int c2 = graph[c][ci2];
                    game[move][m][c].win_count += game[move-1][m2][c2].win_count;
                    game[move][m][c].loss_count += game[move-1][m2][c2].loss_count;
                    game[move][m][c].draw_count += game[move-1][m2][c2].draw_count;
                    // point (A) – used below
                }
            }
            // point (B) – used below
        }
    }
}
```

The only remaining thing we need to take into account to get a working algorithm is the fact that the first winning game state takes precedence over all subsequent losing game states, and, accordingly, the first losing game state takes precedence over all subsequent winning game states. To account for this, we "transfer" results to the required field when the game crosses a losing/winning state (this code must be inserted at the position marked (B) in the above code):

```cpp
if (m == c) {
    game[move][m][c].loss_count += game[move][m][c].win_count;
    game[move][m][c].win_count = 0;
    game[move][m][c].loss_count += game[move][m][c].draw_count;
    game[move][m][c].draw_count = 0;
}
else if (m == goal) {
    game[move][m][c].win_count += game[move][m][c].loss_count;
    game[move][m][c].loss_count = 0;
    game[move][m][c].win_count += game[move][m][c].draw_count;
    game[move][m][c].draw_count = 0;
}
```

Then, the sought-for result is game(MM, m_start, c_start).

The computational complexity of the algorithm is O(V^2 * MM * K^2) (V is the number of nodes, MM the maximum number of moves, K the mean number of edges outgoing from a node). The complexity can be further reduced to O(2 * V^2 * MM * K) if we split move processing into 2 parts: the cat move and the mouse move. That is, we process 2*MM "half-moves"; during each half-move either the mouse or the cat makes its K possible moves.

The space complexity of the algorithm is O(V^2 * MM). However, we do not need to memorize all results; we only need to keep 2 "move-slices": the slice for the current move and the slice for the previous move. The well-known trick for such situations is to allocate 2 arrays and then alternate the roles of the arrays (first, the first array represents the current move and the second array the previous move; then the second array represents the current move and the first array the previous move). So the resulting space complexity of the algorithm is O(2 * V^2).

It's worth noting that the algorithm computes results for all possible starting positions and move counts. So it can be used to answer questions like "What is the minimum length of a winning path for a given graph?"
(answer: the minimal n_min for which game(n_min, m_start, c_start) has win_count > 0). Or "What is the best starting position for the mouse for a given graph?" (answer: the game(MM, m, c) for which win_count/loss_count is maximal).

Winning path calculation
The above-described algorithm does not directly yield a winning path example. My first thought was to run the naïve recursive exhaustion algorithm until it yields one. However, the problem with such an approach is that the worst-case computational complexity is exponential, so we would be back where we started. The key insight is that the polynomial algorithm can be used to produce a winning path by backtracking from a win (i.e. game(n, m=goal, c)) to a starting state (i.e. game(MM, m_start, c_start)).

The general algorithm is as follows. We associate a winning path (if any) with each game state game(n, m, c). For the first "layer" (i.e. game(0, m, c)) all winning paths are empty. For the other "layers" the winning path is copied from the previous layer and appended with the current move. Here is pseudo-code with winning path calculation (this code must be inserted at the position marked (A) in the above code):

```cpp
// [this is part of the processing of game(move, m, c)]
// check that there is at least one win in the previous position
if (game[move-1][m2][c2].win_count > 0) {
    // copy the winning path prefix
    game[move][m][c].win_path = game[move-1][m2][c2].win_path;
    // append the current move
    game[move][m][c].win_path.push_back(make_pair(m2, c2));
}
```

In the end, we will be able to extract some winning path (if any) from game(MM, m_start, c_start) along with the statistics. However, this naïve algorithm for winning path backtracking significantly degrades computational and space complexity (we have to store and copy paths of length O(MM)). To overcome this, I modify the algorithm as follows. There is a shortest winning path (SWP) (possibly several of them), i.e. a winning path of minimum length.
Moreover, the SWP never passes through the same game state (mouse_position, cat_position) more than once (trivial to prove: if we cut the loop out, we get a shorter winning path). So for each game state (mouse_position, cat_position) that is on the SWP, I memorize a single move along the SWP:

```cpp
// check that there is at least one win in the previous position
if (game[move-1][m2][c2].win_count > 0 && SWP[m][c].is_set == false) {
    // now processing and memory consumption are O(1)
    SWP[m][c].is_set = true;
    SWP[m][c].move = make_pair(m2, c2);
}
```

When the computation is finished, we are able to restore the SWP from its components: start at SWP[m_start][c_start], and then follow the 'move' fields to a winning position. During output I prepend the SWP with the required number of "void" moves (mouse and cat stay on the same position) to get a path of length MM.

Parallelization
In order to parallelize the algorithm we need to analyze data and control dependencies, and find independent pieces of computation that can be executed in parallel. Calculation of game state game(n0, m0, c0) generally depends on all game(n0 - 1, m, c). No control dependencies are present. Consequently, each move-slice depends on the preceding move-slice, but all states game(n0, m, c) inside a move-slice can be processed in parallel, independently of each other.
So, I start a thread team; all threads in the team process equal pieces of work during each move-slice calculation (a parallel phase), then all threads synchronize on a barrier, and then transit to the next parallel phase:

```cpp
// OpenMP is used to start a thread team
#pragma omp parallel for schedule(static, 1)
for (int thread_index = 0; thread_index < thread_count; thread_index += 1) {
    // this cycle is executed by all threads
    for (int move = 0; move != move_count; move += 1) {
        calculate_own_piece_of_work_based_on(thread_index, thread_count);
        process_own_piece_of_work();
        // all worker threads synchronize with each other
        // at the end of each move-slice
        #pragma omp barrier
    }
}
```

Granularity problem. All worker threads synchronize with each other after each phase. If the amount of work per thread per phase is small (less than at least several thousand cycles), the periodic synchronization can negatively affect scalability. In the Single-threaded Implementation section I described 2 algorithms: one with complexity O(2 * V^2 * MM * K), and another with complexity O(V^2 * MM * K^2). The former is generally faster in single-threaded execution; however, the latter contains 2 times fewer phases, and the amount of work per phase is larger. I noticed that on small game graphs (game graph size determines the amount of work per phase) the latter algorithm is faster on 64 threads, so I added a heuristic that chooses between the algorithms at run time depending on input graph characteristics.

Theoretically, it's possible to process 2 (3, 4, etc) full moves per phase; the computational complexity then becomes O(V^2 * MM * K^4 / 2). For some input graphs on massively parallel hardware such an algorithm can yield better performance (because it contains fewer phases, and thus less synchronization between threads). However, I did not implement this modification due to limited time.

Performance
Below is a performance graph for 2 different inputs.
The first input is a random graph with 30 nodes, 57 edges, and 2000 moves (blue line); it produces results with 1801 decimal digits. The second input is a random graph with 1000 nodes, 4029 edges, and 30 moves (red line); it produces results with 29 decimal digits. Testing was conducted on an Intel MTL machine with 4 Intel Xeon X7560 processors, each with 8 cores and HT enabled (32 cores, 64 hardware threads total). The horizontal axis is the number of threads, and the vertical axis is execution time in milliseconds.

The inputs stress the program in 2 different aspects. The first input produces very large results (1801 decimal digits), and thus stresses the program's ability to handle large arbitrary-precision numbers. The second input stresses the program's ability to work with large game graphs (each move-slice contains 10^6 elements). It can be seen that the second input (red line) scales somewhat better. This can be explained by the fact that the second input contains fewer moves (parallel phases), while each phase contains a significantly larger amount of work, so threads synchronize with each other less frequently.

On both inputs the program scales sub-linearly. Partly this can be explained by Amdahl's law (the program contains some serial parts, namely input and output), and partly by limited memory bandwidth (threads work with memory very actively, doing very little work per memory location). The naïve exhaustion algorithm would scale linearly (it basically does not use memory, and threads can work largely independently). However, due to its exponential computational complexity it would not get anywhere close to the polynomial algorithm anyway.