Cat and Mouse Problem Statement

Cat and Mouse
Dmitry Vyukov
mailto:[email protected]
October 11, 2010
Problem Statement
The game of "Cat and Mouse" is played between two players on a directed graph. Alternating
turns, with the Mouse making the first move, the two players travel around the nodes of the graph along
its directed edges. The object of the game for the Mouse player is to reach the goal node before the Cat
player can occupy the node on which the Mouse stands; the Cat player's object is to "catch" the
Mouse.
Problem Description: Write a threaded program to input a directed graph and the starting nodes
of the Mouse and Cat and then compute the number of wins, draws and losses for the Mouse player for
all possible games within a specified number of moves according to the game rules given below. Input
to the program will come from the single file name included on the command line. The file will hold
the start nodes for the players, the maximum number of moves, and the directed graph. Output for the
program will be the number of winning, losing and drawing strategies for the Mouse player over all
possible paths through the graph that are less than or equal to the maximum number of moves input,
and, if one exists, an example of the paths for the Mouse and Cat player that result in a win for the
Mouse.
Game Description: To start, each player occupies a given node in the graph. Players alternate
moves until the game is won by a player or each player has executed the given maximum number of
moves. To execute a move a player either 1) follows a directed edge of the graph originating from the
currently occupied node to another node or 2) remains on the current node. The Mouse player has the
first move and attempts to reach the designated goal node; if this is done, the Mouse player has won
and the game is over. The Cat player attempts to occupy the same node as the Mouse player; if this
event occurs at any time during the game, the Cat player has won and the game is over. One stipulation
is that the Cat player may not enter the goal node at any time during the game. For purposes of this
problem, a draw is declared if the Mouse has not reached the goal node and the Cat has not caught the
Mouse after each player has executed the maximum number of moves specified from the input.
Input Description: The input to the program will come from an input text file given on the
application command line. Graph nodes will be represented by three capital letters in the range
['A'..'Z']. The first line of the input file will hold the starting node of the Mouse player, the second line
will be the starting node for the Cat player, the third line will hold the goal node, and the fourth line
will hold a positive integer indicating the maximum number of moves to examine (MM). The
remaining lines in the file will be edges of the directed graph represented as a string of 6 capital letters.
The first three letters will be the source node and the second three letters will be the sink node of the
edge.
Output Description: The output to be generated by the application is the winning chances for the
Mouse player over all possible MM-length (or shorter, in case of a win) paths. The total number of
winning cases, drawing cases, and losing cases will be output to stdout. If there is at least one winning
case for the Mouse, a path for both players that results in a Mouse win will be output; if there is no
winning path, that fact will be noted.
Single-threaded Implementation
A recursive exhaustion algorithm that naturally follows from the problem statement is easy to
construct. We just need to model all possible mouse and cat moves until the move limit is reached. Once a
game crosses either a win or a loss state, we need to memorize that fact and never override it in any
subsequent game state. For example, if a game crosses a loss state where the mouse position equals the cat
position, all subsequent paths from that state result in a mouse loss, even if the mouse reaches the goal node
later on. Once we exhaust the maximum number of moves, we memorize the result: win, loss, or draw if the
game has crossed neither a win nor a loss state.
Here is the algorithm:
enum path_type {path_draw, path_win, path_loss};

struct result_t
{
    uint64_t win_count;
    uint64_t loss_count;
    uint64_t draw_count;
};
void calculate(
    vector<vector<size_t>> const& matrix, // adjacency lists
    size_t mouse_pos,   // current mouse position
    size_t cat_pos,     // current cat position
    size_t goal_pos,    // goal position
    size_t move_count,  // remaining move count
    bool mouse_turn,    // whose turn?
    path_type type,     // result for current path
    result_t& result)   // total result statistics
{
    // first, we try to determine the result for the current path, if it's not yet determined.
    // if the result is already determined, then we shall not override it.
    if (type == path_draw) {
        if (mouse_pos == cat_pos)
            type = path_loss;
        else if (mouse_pos == goal_pos)
            type = path_win;
    }
    // if we have exhausted all moves, memorize the result for the current path.
    if (move_count == 0) {
        if (type == path_win)
            result.win_count += 1;
        else if (type == path_draw)
            result.draw_count += 1;
        else
            result.loss_count += 1;
    }
    // otherwise, explore all possible moves.
    else {
        if (mouse_turn) {
            if (mouse_pos == goal_pos) {
                // if the mouse has reached the goal, then she must stay there.
                calculate(matrix, mouse_pos, cat_pos, goal_pos,
                    move_count, !mouse_turn, type, result);
            } else {
                // otherwise, explore possible moves from the vertex.
                for (size_t i = 0; i != matrix[mouse_pos].size(); i += 1)
                    calculate(matrix, matrix[mouse_pos][i], cat_pos, goal_pos,
                        move_count, !mouse_turn, type, result);
            }
        } else {
            // explore possible moves from the vertex,
            // with the exception of the goal vertex.
            for (size_t i = 0; i != matrix[cat_pos].size(); i += 1) {
                if (matrix[cat_pos][i] != goal_pos)
                    calculate(matrix, mouse_pos, matrix[cat_pos][i], goal_pos,
                        move_count - 1, !mouse_turn, type, result);
            }
        }
    }
}
However, the problem with this algorithm is its exponential computational complexity of
O(K^(2*MM)) (where K is the mean number of edges outgoing from a node, and MM is the maximum
number of moves). For example, for a relatively small input with K=10 and MM=10, the computational
complexity is 10^20, which is practically impossible to compute.
Dynamic programming to the rescue!
Dynamic programming (DP) is a general method for solving problems with
overlapping subtasks and optimal substructure. Overlapping in this context means that the same
subtasks are encountered several times during solving, and optimal substructure means that optimal
solutions of subtasks can be used to construct an optimal solution of a supertask. If both conditions are
satisfied, then each subtask is solved only once and the result is reused whenever the subtask is
encountered again.
The tricky part is to determine what a subtask is, and how to efficiently organize memorization
and reuse of subtask results. The key insight is that the overlapping subtasks with optimal substructure
are solutions of games of the form game(n, m, c) (n - remaining number of moves, m - mouse position,
c - cat position). Each such subtask is encountered many times, can be solved independently, and its
result can be reused.
There are 2 approaches to DP: top-down and bottom-up. In top-down DP we start solving the
sought task, and then solve and memorize subtasks as they are encountered (then subtasks of those
subtasks, etc.). In bottom-up DP we start from the primitive leaf subtasks, and then use their results to
solve higher-level tasks all the way up to the sought task. It's generally acknowledged that
bottom-up DP is more efficient when it can be applied. And bottom-up DP can indeed be applied to our
problem in the following way.
First, let's consider games of the form game(0, m, c) (that is, games with 0 moves remaining,
essentially all possible final states). It's trivial to calculate results for such games: if m==c, then it's a
loss; if m==g (the goal), then it's a win; and it's a draw otherwise. The results are to be memorized for future
reuse.
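The base layer can be filled in directly. Here is a minimal sketch, assuming the result_t structure from above; the function name base_layer is illustrative, not the program's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct result_t {
    uint64_t win_count, loss_count, draw_count;
};

// Builds the game(0, *, *) layer: with no moves remaining, the outcome
// is decided purely by the final positions of the players.
std::vector<std::vector<result_t>> base_layer(size_t node_count, size_t goal) {
    std::vector<std::vector<result_t>> layer(
        node_count, std::vector<result_t>(node_count, result_t{0, 0, 0}));
    for (size_t m = 0; m != node_count; m += 1) {
        for (size_t c = 0; c != node_count; c += 1) {
            if (m == c)
                layer[m][c].loss_count = 1;  // cat has caught the mouse
            else if (m == goal)
                layer[m][c].win_count = 1;   // mouse stands on the goal
            else
                layer[m][c].draw_count = 1;  // neither condition holds
        }
    }
    return layer;
}
```

Note that the loss check comes first, matching the order of checks in the recursive algorithm (the cat may not enter the goal, so the ambiguous case m==c==goal cannot arise from valid play anyway).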
Now let's consider games of the form game(1, m, c) (that is, games with 1 move remaining). The
result for game(1, m0, c0) can be computed by summing up the results for all game(0, m, c) that can
be reached from game(1, m0, c0) in 1 move.
Now we can generalize this approach: game(n, m, c) = SUM game(n-1, mi, ci) over all
possible (mi, ci). Below is a graphical scheme of the algorithm (processing is done bottom-up, arrows
represent addition operations; arrows are shown for only 1 cell (namely, (1,1)) for clarity):
Here is slightly simplified pseudo-code:

// for all moves
for (int move = 1; move <= MM; move += 1)
{
    // for all mouse positions
    for (int m = 0; m != node_count; m += 1)
    {
        // for all cat positions
        for (int c = 0; c != node_count; c += 1)
        {
            // for all possible mouse moves from 'm'
            for (int mi2 = 0; mi2 != graph[m].size(); mi2 += 1)
            {
                int m2 = graph[m][mi2];
                // for all possible cat moves from 'c'
                for (int ci2 = 0; ci2 != graph[c].size(); ci2 += 1)
                {
                    int c2 = graph[c][ci2];
                    game[move][m][c].win_count +=
                        game[move-1][m2][c2].win_count;
                    game[move][m][c].loss_count +=
                        game[move-1][m2][c2].loss_count;
                    game[move][m][c].draw_count +=
                        game[move-1][m2][c2].draw_count;
                    // point (A) - used below
                }
            }
            // point (B) - used below
        }
    }
}
The only remaining thing we need to take into account to get a working algorithm is the fact that
the first winning game state takes precedence over all subsequent losing game states and, accordingly, the
first losing game state takes precedence over all subsequent winning game states. To account for this we
"transfer" results to the required field when the game crosses a losing/winning state (this code must be
inserted at the position marked with (B) in the above code):
if (m == c)
{
    game[move][m][c].loss_count += game[move][m][c].win_count;
    game[move][m][c].win_count = 0;
    game[move][m][c].loss_count += game[move][m][c].draw_count;
    game[move][m][c].draw_count = 0;
}
else if (m == goal)
{
    game[move][m][c].win_count += game[move][m][c].loss_count;
    game[move][m][c].loss_count = 0;
    game[move][m][c].win_count += game[move][m][c].draw_count;
    game[move][m][c].draw_count = 0;
}
Then the sought result is game(MM, m_start, c_start).
The computational complexity of the algorithm is O(V^2 * MM * K^2) (V - number of nodes, MM
- maximum number of moves, K - mean number of edges outgoing from a node). The complexity can
be further reduced to O(2 * V^2 * MM * K) if we split move processing into 2 parts - the cat move and
the mouse move. That is, we process 2*MM "half-moves", and during each half-move either the mouse
or the cat considers K moves.
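The half-move decomposition can be sketched as follows. This is a simplified illustration, not the program's actual code: it omits the cat's goal-node exclusion, the "stay" moves, and the win/loss transfer step; result_t and the adjacency lists are assumed to match the structures used above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct result_t { uint64_t win_count, loss_count, draw_count; };
using slice_t = std::vector<std::vector<result_t>>;

void add(result_t& dst, result_t const& src) {
    dst.win_count  += src.win_count;
    dst.loss_count += src.loss_count;
    dst.draw_count += src.draw_count;
}

// One full move as 2 half-moves: summing over the cat's K successors and
// the mouse's K successors separately costs O(V^2 * 2K) per move instead
// of O(V^2 * K^2). 'prev' holds game(move-1, *, *).
slice_t full_move(slice_t const& prev,
                  std::vector<std::vector<size_t>> const& graph) {
    size_t const n = prev.size();
    // cat half-move: half[m][c] = SUM prev[m][c2] over cat successors c2 of c
    slice_t half(n, std::vector<result_t>(n, result_t{0, 0, 0}));
    for (size_t m = 0; m != n; m += 1)
        for (size_t c = 0; c != n; c += 1)
            for (size_t c2 : graph[c])
                add(half[m][c], prev[m][c2]);
    // mouse half-move: cur[m][c] = SUM half[m2][c] over mouse successors m2 of m
    slice_t cur(n, std::vector<result_t>(n, result_t{0, 0, 0}));
    for (size_t m = 0; m != n; m += 1)
        for (size_t c = 0; c != n; c += 1)
            for (size_t m2 : graph[m])
                add(cur[m][c], half[m2][c]);
    return cur;
}
```

The sum over all (m2, c2) pairs factors into two independent sums because the mouse's and the cat's moves do not constrain each other within a single move.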
The space complexity of the algorithm is O(V^2 * MM). However, we do not need to memorize all
results; we only need to keep 2 "move-slices": a slice for the current move and a slice for the previous
move. The well-known trick for such a situation is to allocate 2 arrays and then alternate the roles of the
arrays (first, the first array represents the current move and the second array the previous move; then
the second array represents the current move and the first array the previous move). So the resulting space
complexity of the algorithm is O(2 * V^2).
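The two-array alternation can be sketched like this. It is a generic illustration with a placeholder per-move step (the real step is the nested loops shown earlier); note that the step must fully overwrite the current slice, since it holds stale values from two moves back:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using slice_t = std::vector<std::vector<uint64_t>>;

// Rolling two-slice buffer: only game(move-1) and game(move) are live at
// any time, so space is O(2 * V^2) instead of O(V^2 * MM).
slice_t run(slice_t base, int move_count,
            void (*step)(slice_t const& prev, slice_t& cur)) {
    slice_t prev = std::move(base);
    slice_t cur(prev.size(), std::vector<uint64_t>(prev.size(), 0));
    for (int move = 1; move <= move_count; move += 1) {
        step(prev, cur);       // fill the current slice from the previous one
        std::swap(prev, cur);  // alternate the roles of the two arrays
    }
    return prev;               // after the final swap, 'prev' is the last slice
}
```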
It's worth noting that the algorithm computes results for all possible starting positions and move
counts. So it can be used to answer questions like "What is the minimum length of a winning path for a
given graph?" (answer: the minimum n_min such that game(n_min, m_start, c_start) has win_count > 0). Or
"What is the best starting position for the mouse for a given graph?" (answer: the game(MM, m, c) where
win_count/loss_count is maximal).
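For example, the first question reduces to a scan over the move axis. This sketch assumes all slices are retained (not the two-slice variant) and is indexed game[n][m][c] as in the pseudo-code above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct result_t { uint64_t win_count, loss_count, draw_count; };

// Returns the smallest move budget with at least one mouse win for the
// given starting positions, or -1 if no winning path exists within MM moves.
int min_winning_moves(
    std::vector<std::vector<std::vector<result_t>>> const& game,
    size_t m_start, size_t c_start) {
    for (size_t n = 0; n != game.size(); n += 1)
        if (game[n][m_start][c_start].win_count > 0)
            return (int)n;
    return -1;
}
```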
Winning path calculation.
The above-described algorithm does not directly yield a winning path example. My first thought
was to run the naïve recursive exhaustion algorithm until it yields one. However, the problem with that
approach is that its worst-case computational complexity is exponential, so we would be back where we
started.
The key insight is that the polynomial algorithm can be used to produce a winning path by
backtracking from a win (i.e. game(n, m=goal, c)) to the starting state (i.e. game(MM, m_start, c_start)). The
general algorithm is as follows. We associate a winning path (if any) with each game state game(n, m,
c). For the first "layer" (i.e. game(0, m, c)) all winning paths are empty. For the other "layers" the winning
path is copied from the previous layer and appended with the current move.
Here is pseudo-code for the winning path calculation (this code must be inserted at the position
marked with (A) in the above code):

// [this is part of the processing of game(move, m, c)]
// check that there is at least one win in the previous position
if (game[move-1][m2][c2].win_count > 0)
{
    // copy the winning path prefix
    game[move][m][c].win_path = game[move-1][m2][c2].win_path;
    // append the current move
    game[move][m][c].win_path.push_back(make_pair(m2, c2));
}
In the end, we will be able to extract some winning path (if any) from game(MM, m_start,
c_start) along with the statistics. However, this naïve algorithm for winning path backtracking significantly
degrades the computational and space complexity (we have to store and copy paths of length O(MM)). To
overcome this I modified the algorithm as follows.
There is a shortest winning path (SWP) (possibly several of them), i.e. a winning path of
minimum length. Moreover, the SWP never passes through the same game state (mouse_position,
cat_position) more than once (trivial to prove: if we cut out the loop, we get a shorter winning path). So for
each game state (mouse_position, cat_position) that is on the SWP I memorize a single move along the SWP:
// check that there is at least one win in the previous position
if (game[move-1][m2][c2].win_count > 0
    && SWP[m][c].is_set == false)
{
    // now processing and memory consumption is O(1)
    SWP[m][c].is_set = true;
    SWP[m][c].move = make_pair(m2, c2);
}
When the computation is finished, we can restore the SWP from its components: start at
SWP[m_start][c_start], and then follow the 'move' fields to a winning position. During output I prepend
the SWP with the required number of "void" moves (mouse and cat stay in place) to get a path
of length MM.
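The restoration step can be sketched as follows. The swp_cell layout mirrors the fragment above; the function name restore_path is illustrative:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One remembered move per (mouse, cat) state, as in the SWP table above.
struct swp_cell {
    bool is_set = false;
    std::pair<size_t, size_t> move;  // (next mouse pos, next cat pos)
};

// Restores the shortest winning path by chasing 'move' links from the
// start state until the mouse reaches the goal (or the chain ends).
std::vector<std::pair<size_t, size_t>> restore_path(
    std::vector<std::vector<swp_cell>> const& SWP,
    size_t m, size_t c, size_t goal) {
    std::vector<std::pair<size_t, size_t>> path;
    while (m != goal && SWP[m][c].is_set) {
        std::pair<size_t, size_t> next = SWP[m][c].move;
        path.push_back(next);
        m = next.first;
        c = next.second;
    }
    return path;
}
```

Because the SWP visits each (mouse, cat) state at most once, the loop terminates after at most V^2 steps.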
Parallelization
In order to parallelize the algorithm we need to analyze data and control dependencies, and find
independent pieces of computation that can be executed in parallel. Calculation of game state game(n0,
m0, c0) generally depends on all game(n0 - 1, m, c). There are no control dependencies. Consequently,
each move-slice depends on the preceding move-slice, but all states game(n0, m, c) inside a
move-slice can be processed in parallel, independently of each other. So I start a thread team; all
threads in the team process equal pieces of work during each move-slice calculation (a parallel phase),
then all threads synchronize on a barrier, and then transit to the next parallel phase:
// OpenMP is used to start a thread team
#pragma omp parallel for schedule(static, 1)
for (int thread_index = 0; thread_index < thread_count; thread_index += 1)
{
    // this cycle is executed by all threads
    for (int move = 0; move != move_count; move += 1)
    {
        calculate_own_piece_of_work_based_on(thread_index, thread_count);
        process_own_piece_of_work();
        // all worker threads synchronize with each other
        // at the end of each move-slice
        #pragma omp barrier
    }
}
Granularity problem.
So, all worker threads synchronize with each other after each phase. If the amount of work
per-thread/per-phase is small (less than at least several thousand cycles), the periodic synchronization
can negatively affect scalability.
In the Single-threaded Implementation section I described 2 algorithms: one with complexity
O(2 * V^2 * MM * K), and another with complexity O(V^2 * MM * K^2). The former is generally
faster in single-threaded execution; however, the latter contains half as many phases, and the amount of
work per phase is larger. I noticed that on small game graphs (game graph size determines the amount of
work per phase) the latter algorithm is faster on 64 threads, so I added a heuristic that chooses
between the algorithms at run time depending on the input graph characteristics.
Theoretically, it's possible to process 2 (3, 4, etc.) full moves per phase; then the computational
complexity becomes O(V^2 * MM * K^4 / 2). For some input graphs on massively parallel hardware
such an algorithm can yield better performance (because it contains fewer phases, and thus less
synchronization between threads). However, I did not implement this modification due to limited time.
Performance
Below is a performance graph for 2 different inputs. The first input is a random graph with 30 nodes,
57 edges and 2000 moves (blue line) (it produces results with 1801 decimal digits). The second input is a
random graph with 1000 nodes, 4029 edges and 30 moves (red line) (it produces results with 29
decimal digits). Testing was conducted on an Intel MTL machine with 4 Intel Xeon X7560 processors,
each with 8 cores and HT enabled (32 cores, 64 hardware threads total). The horizontal axis is the number
of threads, and the vertical axis is execution time in milliseconds.
The inputs stress the program in 2 different respects. The first input produces very large results (1801
decimal digits), and thus stresses the program's ability to handle large arbitrary-precision numbers. The
second input stresses the program's ability to work with large game graphs (each move-slice contains 10^6
elements).
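Counts with 1801 decimal digits far exceed uint64_t, so the win/loss/draw counters must be arbitrary-precision integers. The program's actual representation is not shown here; as an illustration only, a minimal counter could store little-endian limbs in base 10^9:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal arbitrary-precision counter: little-endian limbs in base 10^9.
// Illustrative sketch; the program's actual bignum type may differ.
struct bignum {
    std::vector<uint32_t> limbs;  // least significant limb first
};

// dst += src, propagating carries between limbs.
void add_to(bignum& dst, bignum const& src) {
    uint64_t const base = 1000000000;
    uint64_t carry = 0;
    size_t n = src.limbs.size() > dst.limbs.size()
             ? src.limbs.size() : dst.limbs.size();
    dst.limbs.resize(n, 0);
    for (size_t i = 0; i != n; i += 1) {
        uint64_t sum = carry + dst.limbs[i]
                     + (i < src.limbs.size() ? src.limbs[i] : 0);
        dst.limbs[i] = (uint32_t)(sum % base);
        carry = sum / base;
    }
    if (carry)
        dst.limbs.push_back((uint32_t)carry);
}
```

Addition is the only operation the DP needs, which keeps such a type simple and cheap.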
It can be seen that the second input (red line) scales somewhat better. This can be explained by
the fact that the second input contains fewer moves (parallel phases) while each phase contains a
significantly larger amount of work, so the threads synchronize with each other less frequently.
On both inputs the program scales sub-linearly; this can be explained partly by Amdahl's law
(the program contains some serial parts, namely input and output), and partly by limited memory
bandwidth (the threads work very actively with memory, doing very little work per memory location).
The naïve exhaustion algorithm would scale linearly (it basically does not use memory, and its threads can
work largely independently). However, due to its exponential computational complexity it would not get
anywhere close to the polynomial algorithm anyway.