
INF 4130 Exercise set 6, 4th Oct. 2012
w/answers
There was no lecture on 26th September, so we instead look at some old exams and go through a few
assignments on the curriculum we have already covered. (Percentages indicate the weight of each
assignment in the original exam.)
Assignment 1
String search (21%)
Suffix trees have many applications. We shall attempt to use such trees to find substrings that
occur more than once in a longer string.
Question 1.a (7%)
List the suffixes, and draw the compressed suffix tree for the string S = abcabdc$. (The dollar
sign is a termination character and will not occur inside strings, but is otherwise to be handled
like a normal character in our alphabet.)
Solution (suggested) 1.a
Left to the student.
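(As a quick check, the suffixes themselves can be listed with a couple of lines of Python; this is only a sketch, and drawing the compressed tree is still left to the student.)

    S = "abcabdc$"
    for i in range(len(S)):
        print(S[i:])   # abcabdc$, bcabdc$, cabdc$, abdc$, bdc$, dc$, c$, $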
Question 1.b (7%)
Explain how we can use the suffix tree from 1.a to see that the string ab occurs more than
once in S, how we can see that c occurs more than once in S, and how we can find the length
of these repeated substrings. (Substrings a and b are in turn substrings of ab, and as a
consequence they also occur more than once in S, but the nature of suffix trees is such that we
can only identify b directly by looking at the tree. We shall therefore not bother with
substrings like a, but they can be identified with a minimum of extra effort.)
Solution (suggested) 1.b
Internal nodes (nodes with more than one child) correspond to substrings that occur more than once.
The string depth of such a node (the number of characters on the path from the root down to it) is
the length of the repeated substring. The termination character is needed so that substrings at the
end of S (like c) also give rise to internal nodes. A good
answer should include a short observation about this termination character.
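A small illustration in Python (not part of the original solution, and using an uncompressed suffix trie rather than a compressed suffix tree, for simplicity): every branching node with a non-empty path from the root corresponds to a repeated substring, and the length of that path is the length of the substring.

    def suffix_trie(s):
        # Insert every suffix of s, character by character, into a nested-dict trie.
        root = {}
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
        return root

    def repeated_substrings(node, path=""):
        # A node with more than one child is an internal (branching) node;
        # its path label is a substring that occurs more than once.
        found = [path] if len(node) > 1 and path else []
        for ch, child in node.items():
            found += repeated_substrings(child, path + ch)
        return found

    print(repeated_substrings(suffix_trie("abcabdc$")))   # ['ab', 'b', 'c']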
Question 1.c (7%)
We have established that we can identify substrings that occur more than once in a longer
string by looking at a suffix tree, but what if we have two strings and wish to check for a
common substring? Is there a way to create one string, and then one suffix tree, so that we can
use roughly the same approach as in 1.b to check for a common substring? You are allowed to
introduce new characters in the alphabet. Explain.
Solution (suggested) 1.c
Two strings A and B can be checked for a common substring by creating the string A#B$ and
checking for a substring that occurs more than once. We must make sure that our substring does
not occur twice in only A or only B: if the substring occurs in both A and B, its internal node
(multiple occurrences) will have one subtree containing # and one subtree that does not contain #
(all subtrees contain $). The separator character thus lets us check whether a substring comes
from A and/or B.
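The same idea can be illustrated in Python (again with an uncompressed suffix trie, and repeating the small trie builder so the sketch is self-contained): a branching node whose path contains neither # nor $ corresponds to a common substring of A and B precisely when some continuation below it still meets # (that suffix starts in A) and some continuation never does (that suffix starts in B).

    def suffix_trie(s):
        root = {}
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
        return root

    def reaches_hash(node):
        # Some continuation below this node still passes '#': that suffix starts in A.
        return any(ch == "#" or reaches_hash(child) for ch, child in node.items())

    def avoids_hash(node):
        # Some continuation below this node never meets '#': that suffix starts in B.
        return not node or any(ch != "#" and avoids_hash(child) for ch, child in node.items())

    def common_substrings(a, b):
        found = []
        def walk(node, path):
            if len(node) > 1 and path and "#" not in path and "$" not in path:
                if reaches_hash(node) and avoids_hash(node):
                    found.append(path)
            for ch, child in node.items():
                walk(child, path + ch)
        walk(suffix_trie(a + "#" + b + "$"), "")
        return found

    print(common_substrings("abcab", "cabd"))   # ['ab', 'b', 'cab']; prefixes like 'a', 'c', 'ca' follow as in 1.b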
Assignment 2 Dynamic Programming (15 %)
At a large tomato farm, tomatoes on the vine are to be packed in cases of five kilograms.
Tomatoes come in different weights and sizes, and the vines consist of different numbers of
tomatoes, so the vines vary in weight. We can therefore not pack the same number of vines in
the cases every time. Tomatoes on the vine sell for more than regular tomatoes, so the vines
are not to be split; they must be packed whole.
The packing is done manually, with the aid of an advanced system that weighs the vines and
tells each packer which vines are to be packed. Each packer has a number of vines in front of
him in a series of shelves. The system makes a selection of vines with a combined weight of
five kilograms and lights a small lamp on the shelves in question, so that the packer can put
the selected vines into a case. When the case is full, the empty shelves are refilled
automatically, and the system makes a new selection (we get a new instance of the problem).
If it is impossible for a packer to make a selection of five kilograms, the system replaces all
his vines, hopefully resulting in an instance that has a solution.
2.a
To simplify, we assume the packing system operates with weights in whole decagrams
(1 decagram = 10 grams), so that we must find a selection of vines with a combined weight of
500 decagrams. (Such a system always has a finite resolution, so this is not an unnatural
simplification.) Show how to use dynamic programming to design an algorithm that first
decides whether or not a selection of 500 decagrams can be made, and that maintains a
suitable data structure making it easy to indicate the selection afterwards. Each packer has
n vines in front of him.
Solution (suggested) 2.a
This is SUBSET SUM. We solve with DP. (We omit everything about the “Principle of
Optimality”, as there is no direct question about it.)
Let U[i,j] be TRUE if there is a selection from the i first elements with sum j, and FALSE
otherwise. U[i,j] is defined like this:

    U[i,j] =  TRUE   if U[i-1, j-vi] = TRUE    (*)
              TRUE   if U[i-1, j] = TRUE       (**)
              FALSE  otherwise                 (***)

We use vi to indicate the weight of element i (1≤i≤n). The initial condition is U[0,0]=TRUE
(we can make a selection with sum 0 from the 0 first elements).
(Pretty) rough pseudo code can look something like this:
BOOLEAN U[0..n, 0..500];
INT i, j;
U[0,0] = TRUE
FOR j=1 TO 500 DO U[0,j] = FALSE OD
FOR i=1 TO n DO
    FOR j=0 TO 500 DO
        IF (j >= V[i] AND U[i-1, j-V[i]] == TRUE) OR U[i-1, j] == TRUE THEN U[i,j] = TRUE
        ELSE U[i,j] = FALSE
    OD
OD
We assume the weights are stored in an array V[1..n]. If some U[i,500] is TRUE (in particular
U[n,500]), a selection of 500 decagrams is possible. To find the actual selection we can
backtrack through U, but it may be easier to use a separate Boolean array L[1..n] where we set
L[i]=TRUE if (*) is the rule that caused us to insert TRUE for element i.
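A runnable sketch of the same table filling in Python (the function name pack_case and the reconstruction by backtracking are ours, not part of the exam text; the array names follow the pseudo code):

    def pack_case(weights, target=500):
        n = len(weights)
        V = [0] + list(weights)                       # V[1..n], 1-indexed as in the pseudo code
        U = [[False] * (target + 1) for _ in range(n + 1)]
        U[0][0] = True                                # the empty selection has sum 0
        for i in range(1, n + 1):
            for j in range(target + 1):
                take = j >= V[i] and U[i - 1][j - V[i]]   # rule (*)
                skip = U[i - 1][j]                        # rule (**)
                U[i][j] = take or skip
        if not U[n][target]:
            return None                               # no selection of exactly `target`
        selection, j = [], target                     # backtrack to recover one selection
        for i in range(n, 0, -1):
            if not U[i - 1][j]:                       # rule (*) must have been used: take vine i
                selection.append(i)
                j -= V[i]
        return selection                              # 1-based indices of the selected vines

    # The YES-instance from 2.c; prints the indices of one selection with combined weight 500.
    print(pack_case([45, 53, 77, 78, 52, 80, 67, 71, 49, 102, 46]))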
2.b
What is the running time of your algorithm from 2.a? Do a coarse analysis of its running
time, using O-notation.
Solution (suggested) 2.b
O(500n), determined by the two FOR loops: there are n vines, and our target sum is 500 decagrams.
2.c
In general, the weight limit is part of the problem instance (we can assume the tomatoes are
packed in cases, baskets, etc. of different sizes). Such a general instance will be of the form
v1,v2,…,vn;G, where the vi are the weights of the vines and G is the weight limit; all numbers
can now be as large as you want. A problem instance for the tomato problem (this is a
YES-instance) can look like this:
45,53,77,78,52,80,67,71,49,102,46;500
Suggest a length measure for this instance. How would you in general express the length of
such an instance? Is the running time of your algorithm (found in 2.b) polynomial relative to
this expression? Explain.
Solution (suggested) 2.c
The length of an instance is the number of characters, here 37 (we count commas and the
semicolon). In general, an instance of n elements with weights up to G decagrams (there is no
point including heavier elements) will be of length roughly n log G + log G; or one could say
Σ log vi + log G. This part of the question was possibly a bit vague; many have misunderstood
it, or not understood it at all. It was meant to remind everyone that it is log G that enters
the length of the instance, so that one sees that the algorithm is not polynomial
– the running time from 2.b is O(Gn). The algorithm is not polynomial in the length of the
input numbers, even if it is polynomial in the value of the numbers. For a number t, len(t) =
O(log₂ val(t)) and, conversely, val(t) = O(2^len(t)). A bell should ring if one is about to
claim a polynomial algorithm for SUBSET SUM (∈ NPC). One should be able to answer this part of
the question even if one had problems with the length part.
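A tiny numerical illustration of the pseudo-polynomiality (an aside, not part of the exam answer): the DP from 2.a/2.b does roughly G·n elementary steps, while the instance itself is only a few dozen characters long; adding one digit to G multiplies the work by about ten but only adds on the order of n+1 characters to the instance.

    instance = "45,53,77,78,52,80,67,71,49,102,46;500"
    weights_part, G = instance.split(";")
    weights = [int(w) for w in weights_part.split(",")]

    print(len(instance))            # 37   -- the length of the instance in characters
    print(int(G) * len(weights))    # 5500 -- roughly the number of table cells the DP fills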
Assignment 3 Search (20 %)
A robot is moving a gripping hand to grab hold of an object. The robot represents the world
with a three-dimensional coordinate system, and has with its sensors located the object at
point (x,y,z) relative to the starting position of the hand, (0,0,0). The coordinate system is
fine-grained enough that the robot only uses integer coordinates, and it only moves the hand
parallel to the axes of the coordinate system, not diagonally, i.e. it only moves the hand up,
down, left, right, forwards and backwards. There can be different kinds of obstacles between
points (0,0,0) and (x,y,z), so some points in the coordinate system are forbidden positions for
the gripping hand.
3.a
Explain how the coordinate system, where some coordinates are forbidden, can be represented
with a state space graph we can search to find a way to move the hand from (0,0,0) to (x,y,z).
The coordinate system is in principle infinite, but the robot has a limited reach, resulting in a
finite state space graph: the maximal reach along the three axes is xmax, ymax and zmax,
respectively. (You may assume that the point (x,y,z) is within the robot’s reach.)
Solution (suggested) 3.a
For every point (i,j,k) in 3-space we create a point object, a node in our state space graph. We
do not create points outside the robot’s reach. For a general point (i,j,k) its neighbours will be
(i-1,j,k), (i,j-1,k), (i,j,k-1), (i+1,j,k), (i,j+1,k) and (i,j,k+1), assuming they lie within the
legal envelope. The twist is that some points are forbidden; these points are not connected to
their usual neighbours when we create the graph, so they will not be part of the connected graph
we build and search in. (They can of course be dropped altogether, or left to the garbage collector.)
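A minimal sketch of this representation in Python (the names reach and forbidden are our assumptions, and we also assume, for simplicity, that legal coordinates run from 0 up to the reach along each axis): the graph is given implicitly by a neighbour function.

    def neighbours(p, reach, forbidden):
        # Legal neighbours of the point p = (i, j, k): one step along an axis,
        # inside the robot's reach, and not a forbidden point.
        i, j, k = p
        candidates = [(i - 1, j, k), (i + 1, j, k),
                      (i, j - 1, k), (i, j + 1, k),
                      (i, j, k - 1), (i, j, k + 1)]
        return [q for q in candidates
                if all(0 <= q[d] <= reach[d] for d in range(3)) and q not in forbidden]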
3.b
The robot uses the A*-algorithm to guide the gripping hand in an efficient manner (along the
shortest path). Is an exact heuristic for the A*-algorithm, i.e. a heuristic that for all vertices
gives the exact length of the shortest path to the goal vertex, useful? Explain.
Solution (suggested) 3.b
The heuristic will work, but it is costly: computing it amounts to solving the problem on its
own. As accurate a heuristic as possible is of course good, but the exact “heuristic” is
obviously too costly for problems like these – too costly to be of any use.
3.c
Select an appropriate, monotone heuristic for the algorithm (there are several possibilities),
and give a short explanation of why your chosen heuristic is monotone.
Solution (suggested) 3.c
Manhattan distance (in 3D) or the Euclidean vector length; neither is longer than the actual
distance the arm must be moved, and neither can decrease by more than 1 along a single edge.
(One should possibly explain monotonicity in a bit more detail.)
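Both candidates are easy to write down (a sketch; goal is the target point (x, y, z)). A single move changes one coordinate by one, so either estimate can drop by at most 1 along an edge of cost 1 – which is the monotonicity (consistency) condition.

    import math

    def manhattan(p, goal):
        # Sum of axis-wise distances; exact when nothing is in the way.
        return sum(abs(a - b) for a, b in zip(p, goal))

    def euclidean(p, goal):
        # Straight-line (vector) length; never larger than the Manhattan distance.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, goal)))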
3.d
The problem can also be solved with dynamic programming, as described below. The points
of the coordinate system are still represented as vertices of a graph. The breaking down into
sub-problems is perhaps a bit unfamiliar, as the next vertex to get its value is not statically
fixed. (The order of the so far unsolved sub-problems can change during the execution of the
algorithm. We have mostly studied algorithms where the order of the sub-problems was fixed,
but there is in principle no difference; we are solving successively larger sub-problems until
we have the final solution.)
All vertices have an integer variable dist that will eventually hold the calculated distance from
(0,0,0). Until a vertex is finally processed, the value is only an estimate. We have a set, S, of
seen vertices; these vertices are finally processed, with the final calculated value stored in dist.
The remaining vertices are unseen vertices; they are not finally processed, and only have an
estimate stored in dist.
i. Initially vertex (0,0,0) has dist = 0. All other vertices have dist = ∞. We place (0,0,0) in S,
as it has the correct value for dist, and update dist in all of its neighbors to 1 (for the edge
between (0,0,0) and the vertex). We have now solved the first, smallest sub-problem.
ii. The general sub-problem is to pick one of the unseen vertices, say v, calculate its dist value,
and place it in S. We pick as v the unseen vertex with the lowest dist value (at random if there
are several possibilities). Vertex v is placed in S without changing dist, and for all unseen
vertices u that are neighbors of v we set dist to (this is the recursive formula we calculate):
    u.dist = min{ u.dist, v.dist + 1 }
iii. Repeat ii until (x,y,z) becomes a seen vertex.
To keep track of which vertex is the one with the lowest dist value we use a priority queue
with the dist values as keys. (It is this priority queue that gives us the successively larger
sub-problems; and since the dist values can change, the order of the sub-problems is not static.)
Will both algorithms (we assume the A*-algorithm uses a monotone heuristic) find optimal
solutions, i.e. calculate the correct shortest path? Give a short justification. What are the
differences in efficiency between the algorithms? (We are only after general observations
about efficiency, not detailed analyses of the running times.)
Solution (suggested) 3.d
This is Dijkstra’s algorithm for shortest paths in graphs. Dijkstra’s algorithm is A* without a
heuristic, a kind of breadth-first search with a priority queue instead of the usual FIFO queue.
The algorithm stops when it reaches its goal, so it will not find the shortest path to all nodes in
the graph, but it is not as targeted as A*, since it uses no heuristic to direct its search. Both
Dijkstra and A* with a monotone heuristic give exact answers; they are not approximation
algorithms.
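A sketch of the procedure from 3.d in Python, using heapq as the priority queue; the neighbours argument is assumed to be a function like the one sketched under 3.a, with reach and the forbidden set already bound in.

    import heapq

    def shortest_path_length(start, goal, neighbours):
        dist = {start: 0}                    # current (estimated) dist values
        seen = set()                         # the set S of finally processed vertices
        queue = [(0, start)]                 # priority queue keyed on dist
        while queue:
            d, v = heapq.heappop(queue)
            if v in seen:
                continue                     # outdated queue entry; v already processed
            seen.add(v)                      # v gets its final dist value here
            if v == goal:
                return d                     # stop when (x, y, z) becomes a seen vertex
            for u in neighbours(v):
                if u not in seen and d + 1 < dist.get(u, float("inf")):
                    dist[u] = d + 1          # u.dist = min{ u.dist, v.dist + 1 }
                    heapq.heappush(queue, (dist[u], u))
        return None                          # the goal cannot be reached

For example, one could call shortest_path_length((0, 0, 0), (x, y, z), lambda p: neighbours(p, reach, forbidden)) with the helper from 3.a.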
Assignment 4 Dynamic Programming (21 %)
An independent set of vertices in a graph is a subset of vertices such that there are no edges
between any pair of vertices in the set. Two neighbouring vertices in the graph can not both
be members of such a set.
Figure 1. The black vertices in the left tree form an independent set – there are no edges between any
pair of black vertices. They also form a largest independent set – there is no independent set of size larger
than eleven in the left tree. The black vertices in the right tree do not form an independent set – there are
edges between some pairs of them.
Finding the largest independent set of vertices in a general graph is an NP-hard problem; we
therefore only consider trees in this assignment. In trees we can use dynamic programming to
find the size of the largest independent set (and also the set itself, if needed). [Remark:
Considering only trees is not as large a simplification as it might seem. There exists a large
class of tree-like graphs where the same method can be used. Normal trees are special cases of
these graphs.]
Let T be our tree, and r its root. For each vertex v in the tree, let Tv be the subtree with v as its
root (see Figure 2). Remember that for every vertex v two possibilities exist when we try to
build an independent set: either including v in the set, or leaving it out.
Figure 2. A tree T with root r, a leaf vertex l, an internal vertex v, and the subtree Tv with root v.
Question 4.a (7 %)
Calculating the size of the largest independent set of vertices in T, bottom-up, we start with
the leaf vertices. We store the calculated values in the vertices themselves, avoiding use of a
separate table. For a leaf vertex l, and the subtree Tl, we calculate two values:
l.c    the size of the largest independent set of vertices in Tl containing vertex l,
l.n    the size of the largest independent set of vertices in Tl not containing vertex l.
What are these values for a leaf vertex l? (Note that we are only considering the sub-tree Tl.)
Solution (suggested) 4.a
l.c = 1,
l.n = 0.
The independent sets are {l} and ∅, respectively.
Question 4.b (7 %)
When all leaf vertices are done, their parents can do their calculations, and so on. In general a
vertex must wait until all its children are done before it can do its own calculations. The
remaining vertices v, with corresponding subtrees Tv, also calculate:
v.c    the size of the largest independent set of vertices in Tv containing vertex v,
v.n    the size of the largest independent set of vertices in Tv not containing vertex v.
Show with simple pseudo code how v.c and v.n are calculated for a non-leaf vertex v. Do not
write code for tree traversal etc.; simply show how v does its calculations based on values
already calculated. (Note that we still only consider the subtree Tv.) You may assume that v’s
children are stored in a suitable data structure, and traverse them with a for loop FOR i IN
v.children DO{}, accessing already calculated values in every child i of v.
Solution (suggested) 4.b
v.c = 1                          // vertex v is included
v.n = 0                          // vertex v is not included
FOR i IN v.children DO
{
    v.c = v.c + i.n              // if v is in, we must omit the children
    v.n = v.n + max(i.n, i.c)    // if v is not in, we may include the children we want, not necessarily all
}
The small observation one must make is that when a vertex v is itself left out, one need not
include all of its children; each child may be included or not, and one picks whichever choice
gives the higher value. If v is included, one cannot include any of its children.
Question 4.c (7 %)
When the leaf vertices have calculated the initial values, and the remaining vertices have
calculated their values from the bottom up in the tree, we are in principle done. How/where
do we now find the size of the largest independent set in T?
Solution (suggested) 4.c
We get the answer by taking max(r.c, r.n).
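A runnable version of 4.a–4.c in Python (the Node class is our own scaffolding; the exam only asks for the per-vertex computation). Children are handled before their parent by a recursive post-order traversal.

    class Node:
        def __init__(self, children=None):
            self.children = children or []
            self.c = 0   # size of the largest independent set in the subtree, with this vertex
            self.n = 0   # size of the largest independent set in the subtree, without this vertex

    def compute(v):
        v.c, v.n = 1, 0                      # the leaf values from 4.a fall out as a special case
        for i in v.children:
            compute(i)                       # bottom-up: children finish first
            v.c += i.n                       # v included  => its children must be excluded
            v.n += max(i.c, i.n)             # v excluded  => take the better value for each child

    def largest_independent_set_size(root):
        compute(root)
        return max(root.c, root.n)           # the answer from 4.c

    # Example: a root with two children, the left of which has two leaf children.
    root = Node([Node([Node(), Node()]), Node()])
    print(largest_independent_set_size(root))   # 3 (e.g. the two leaves plus the right child)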
Assignment 5 String Search (18 %)
We shall use the Karp-Rabin algorithm to search for the pattern 666 in the string 12345666.
Assume the algorithm uses ordinary decimal numbers (digits 0–9), as described in the course
textbook (Berman & Paul, Algorithms: Sequential, Parallel, and Distributed), and further
assume that the algorithm operates modulo 3.
Question 5.a (12 %)
Show how the algorithm proceeds stepwise during its search, what the successive substrings
of the search string (the window) are converted to modulo 3, and indicate where there is a
match, where there is a spurious match, and where there is a real match.
You may use a simple tabular form for this, e.g.:
Search string: 1 2 3 4 5 6 6 6

Window    Conversion of the window modulo 3    Match?
1 2 3     123 mod 3 = 0                        ?
2 3 4     ?                                    ?
...       ...                                  ...
Solution (suggested) 5.a
Search string: 1 2 3 4 5 6 6 6

Window    Conversion of the window modulo 3    Match?
1 2 3     123 mod 3 = 0                        spurious match
2 3 4     234 mod 3 = 0                        spurious match
3 4 5     345 mod 3 = 0                        spurious match
4 5 6     456 mod 3 = 0                        spurious match
5 6 6     566 mod 3 = 2                        no match
6 6 6     666 mod 3 = 0                        real match – stop
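A small Karp-Rabin sketch in Python matching the table above (decimal digits, hashing modulo 3, and a rolling update of the window value; the function name and the printing are ours, not the textbook's exact formulation):

    def karp_rabin(text, pattern, q=3):
        m = len(pattern)
        p_hash = int(pattern) % q
        window = int(text[:m]) % q
        lead = pow(10, m - 1, q)                  # weight of the window's leading digit, mod q
        for i in range(len(text) - m + 1):
            if window == p_hash:                  # hash match: real or spurious?
                kind = "real match" if text[i:i + m] == pattern else "spurious match"
                print(text[i:i + m], kind)
                if kind == "real match":
                    return i
            if i + m < len(text):                 # roll the window one digit to the right
                window = (window - int(text[i]) * lead) % q
                window = (window * 10 + int(text[i + m])) % q
        return -1

    karp_rabin("12345666", "666")
    # 123 spurious match
    # 234 spurious match
    # 345 spurious match
    # 456 spurious match
    # 666 real match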
Question 5.b (6 %)
How does the number of spurious matches found in 5.a compare to the expected number of
spurious matches in uniformly distributed strings when one works modulo 3?
Solution (suggested) 5.b
The expected number of spurious matches is n/3 with n trials. We make six trials, and expect
6/3=2. Our four spurious matches are a bit above the expected number.
[ END ]