Course: CpSc 212: Algorithms and Data Structures
Lab 13: Tries
Spring 2014
1 Lempel-Ziv Compression and De-Compression
In this lab, you will build a trie to perform Lempel-Ziv (LZ) compression and decompression
on a large text file. Most industrial-grade compression tools, such as rar and zip, use some
form of LZ compression.
1.1 The Trie Data Structure
A trie is a data structure that holds a collection of strings and supports prefix queries; that
is, it returns all strings in the trie that start with a particular prefix. For example, for the
collection of strings “chef”, “chip”, “code”, “cow”, “egg”, “ego”, “to”, “told”, “top” and the
prefix “co”, the trie returns the words “code” and “cow”, since both of these words start with “co”
(see Figure). Notice that each edge in the trie corresponds to a symbol, and walking “co” from the
root lands on a node whose subtree contains all the matched strings.
A dummy ‘$’ sign is added at the end of each string. It marks a prefix of a longer string that is
itself a valid string in the collection, like the string “to”. We will be building a trie to store
substrings of our text. Each node in our trie has a unique node id. Each node also stores a symbol,
which labels the edge between the node and its parent.
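The prefix query described above can be sketched as follows. This is only an illustration: PrefixNode, insertWord, collect and withPrefix are hypothetical names, the children are held in a std::map for brevity (the lab's Trie uses a different layout), and a boolean flag stands in for the ‘$’ marker.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical node type for illustration only; it is not the lab's Node class.
struct PrefixNode {
    bool isWord = false;                                // plays the role of the '$' marker
    std::map<char, std::unique_ptr<PrefixNode>> child;  // one edge per symbol, kept sorted
};

// Insert a word by walking (and creating where needed) one edge per character.
void insertWord(PrefixNode& root, const std::string& word) {
    PrefixNode* cur = &root;
    for (char c : word) {
        std::unique_ptr<PrefixNode>& next = cur->child[c];
        if (!next) next.reset(new PrefixNode());
        cur = next.get();
    }
    cur->isWord = true;
}

// Collect every stored word in the subtree below node; path is the string spelled so far.
void collect(const PrefixNode& node, std::string& path, std::vector<std::string>& out) {
    if (node.isWord) out.push_back(path);
    for (const auto& edge : node.child) {
        path.push_back(edge.first);
        collect(*edge.second, path, out);
        path.pop_back();
    }
}

// Return all stored words that start with prefix, in sorted order.
std::vector<std::string> withPrefix(const PrefixNode& root, const std::string& prefix) {
    const PrefixNode* cur = &root;
    for (char c : prefix) {
        auto it = cur->child.find(c);
        if (it == cur->child.end()) return {};  // no stored word has this prefix
        cur = it->second.get();
    }
    std::vector<std::string> out;
    std::string path = prefix;
    collect(*cur, path, out);
    return out;
}
```

With the nine example strings inserted, withPrefix(root, "co") returns “code” and “cow”, and withPrefix(root, "to") returns “to”, “told” and “top”.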
1.2 The Lempel-Ziv Encoding Algorithm
The LZ compression algorithm identifies repeated substrings and stores each of them only once in a
trie. It works very well for large text files, since repeated occurrences of long substrings are
stored only once. It may not work as well for small texts, since there may not be many repeated
substrings. LZ compression and decompression are also called LZ encoding and decoding, respectively.
The encoding algorithm scans the text file once and builds a trie on some of the substrings it sees.
The trie gets updated as more substrings are found. The algorithm is described in the following
pseudocode.
1. Create an empty trie with a single node root.
2. Set current = root.
3. Read input symbol x. If reached end of file, then Goto Step 6.
4. If there exists an edge x from current to some node j in the trie, then walk down the edge, i.e.,
set current = j. Goto Step 3.
5. Otherwise, there is no edge x at the current node. Create a new node j with a unique id, and let
the edge (current, j) be x. Output the 2-tuple (current node ID, x), omitting the parentheses
in the output. Reset current to root. Goto Step 3.
6. If current is not root, then output current node ID.
Note that each node created has a unique integer ID. Step 6 is necessary to encode the last substring
that does not end with a creation of a new node. For the input string “aabababaababaab”, the
string “0a1b2a0b1a4a6a4” will be printed and the trie shown in the following figure is built.
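[Figure: the trie built for “aabababaababaab”, with node ids 0 to 7 and edges
0 -a-> 1, 0 -b-> 4, 1 -a-> 5, 1 -b-> 2, 2 -a-> 3, 4 -a-> 6, 6 -a-> 7.]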
Note that the input string and the LZ encoded string use the same number of bytes in the above
example. For large text files, however, we will see many repeated substrings, which will make the
encoded string shorter than the input.
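The encoding steps above can be sketched in a few lines. This is a minimal illustration, not the lab's solution: lzEncode is a hypothetical helper that works on an in-memory string, and it keeps each node's edges in a std::map indexed by node id rather than using the Trie class from trie.h.

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch of encoding Steps 1-6 (hypothetical helper, not part of the lab files).
std::string lzEncode(const std::string& text) {
    // edges[id] maps a symbol to the id of the child reached by that edge; node 0 is the root.
    std::vector<std::map<char, unsigned long>> edges(1);   // Step 1
    std::string out;
    unsigned long current = 0;                              // Step 2
    for (char x : text) {                                   // Step 3
        auto it = edges[current].find(x);
        if (it != edges[current].end()) {
            current = it->second;                           // Step 4: walk down the edge
        } else {
            unsigned long j = edges.size();                 // Step 5: the new node's id
            edges[current][x] = j;
            edges.emplace_back();
            out += std::to_string(current);                 // the tuple, without parentheses
            out += x;
            current = 0;                                    // reset to the root
        }
    }
    if (current != 0) out += std::to_string(current);       // Step 6
    return out;
}
```

Calling lzEncode("aabababaababaab") yields “0a1b2a0b1a4a6a4”, matching the example above. One caveat of this output format: if the text itself contains digit characters, node ids and symbols can no longer be told apart when reading the encoded string back.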
1.3 The Lempel-Ziv Decoding Algorithm
Lempel-Ziv is a lossless compression algorithm, which means that the original string can be
reconstructed without any loss of information. The decoding algorithm builds the same trie to decode
and works as follows.
1. Read node ID and corresponding symbol x.
2. Set current to point to the node with id ID.
3. If no symbol present, Goto Step 5.
4. Create a new node j (as the child of current), and let the edge (current, j) be x.
5. Print the string from the root to the current pointer, followed by the symbol x if one was read.
6. Goto Step 1 if not reached end of input.
For the LZ encoded string “0a1b2a0b1a4a6a4”, the original string “aabababaababaab” will be
reconstructed. Note that the trie is exactly the same as the one constructed during LZ encoding
(see the figure above).
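The decoding steps above can be sketched in the same spirit. This is an illustration under stated assumptions: lzDecode is a hypothetical helper, node ids are written in decimal, and the original text contains no digit characters (otherwise ids and symbols cannot be separated). Instead of the Trie class, it stores each node's parent id and edge symbol in two vectors.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch of decoding Steps 1-6 (hypothetical helper, not part of the lab files).
std::string lzDecode(const std::string& enc) {
    std::vector<unsigned long> parent(1, 0);  // parent[id]; entry 0 is the root
    std::vector<char> symbol(1, '\0');        // symbol[id] labels the edge from parent[id] to id
    std::string out;
    std::size_t i = 0;
    while (i < enc.size()) {                  // Step 6: repeat until the input is exhausted
        // Step 1: read the node ID as a run of decimal digits.
        unsigned long id = 0;
        while (i < enc.size() && std::isdigit(static_cast<unsigned char>(enc[i]))) {
            id = id * 10 + static_cast<unsigned long>(enc[i] - '0');
            ++i;
        }
        if (id >= parent.size()) throw std::invalid_argument("unknown node id in input");
        // Steps 2 and 5: rebuild the string from the root to node id by following parents.
        std::string path;
        for (unsigned long n = id; n != 0; n = parent[n]) path.insert(path.begin(), symbol[n]);
        out += path;
        // Steps 3 and 4: if a symbol follows, create the new child and print the symbol.
        if (i < enc.size()) {
            char x = enc[i++];
            parent.push_back(id);
            symbol.push_back(x);
            out += x;
        }
    }
    return out;
}
```

Calling lzDecode("0a1b2a0b1a4a6a4") returns “aabababaababaab”. New nodes receive ids in creation order, which is why the decoder ends up with the same trie as the encoder.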
1.4 Downloading and Extraction
Please first download the lab13.tar.gz file from the following webpage:
http://people.cs.clemson.edu/~rmohan/course/cpsc2120-sp14/
Next, please extract the files using the following command.
tar -zxvf lab13.tar.gz
This command will extract all the files into a directory named lab13. Please change to that
directory using the command:
cd lab13
1.5 The Trie Implementation
Please open the file trie.h to see the specifications of the class Trie. It contains pointers to the
root and current nodes in the trie. It also contains size that holds the number of nodes in the trie.
When we create a new node, the new node takes a unique integer id equal to the size. The trie is
instantiated with a root node with id 0.
Each node in the trie contains a unique integer id, a character symbol that identifies the edge to
its parent, and pointers to the parent, first child and next sibling. Since the trie is a multi-way
branching tree, we choose a representation based on just two pointers: the first child and the next
sibling. We can step into any child by first stepping into the first child and then following
the next sibling pointers to the actual child. This greatly reduces the space needed to represent
each node. Also, the next sibling pointers can be kept in sorted order to facilitate easy searching.
To facilitate easy insertions, the first child is a dummy element. Hence the actual list of children
starts from first child→next sibling. Note that the class Trie is declared as a friend of class Node
so that it can easily access the private data members of Node.
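To make the first child / next sibling layout concrete, here is a small sketch. Everything in it is hypothetical (SketchNode, makeNode, findChild and linkChild are illustrative names, and the real declarations in trie.h may differ); it only shows how a dummy first child and a sorted sibling list interact. Deallocation is omitted to keep the sketch short.

```cpp
// Hypothetical layout for illustration; the real declarations are in trie.h.
struct SketchNode {
    unsigned long id;
    char symbol;              // label of the edge from the parent to this node
    SketchNode* parent;
    SketchNode* firstChild;   // dummy head of the child list (null for a dummy itself)
    SketchNode* nextSibling;
};

// Allocate a node together with the dummy element that heads its child list.
SketchNode* makeNode(unsigned long id, char symbol, SketchNode* parent) {
    SketchNode* dummy = new SketchNode{0, '\0', nullptr, nullptr, nullptr};
    return new SketchNode{id, symbol, parent, dummy, nullptr};
}

// Return the child reached by edge x, or nullptr if the node has no such edge.
SketchNode* findChild(const SketchNode* node, char x) {
    for (SketchNode* c = node->firstChild->nextSibling; c != nullptr; c = c->nextSibling) {
        if (c->symbol == x) return c;
        if (c->symbol > x) break;   // the list is sorted, so x cannot appear later
    }
    return nullptr;
}

// Link child into node's sibling list, keeping the list sorted by symbol.
void linkChild(SketchNode* node, SketchNode* child) {
    SketchNode* prev = node->firstChild;   // start at the dummy: an empty list needs no special case
    while (prev->nextSibling != nullptr && prev->nextSibling->symbol < child->symbol)
        prev = prev->nextSibling;
    child->nextSibling = prev->nextSibling;
    prev->nextSibling = child;
}
```

Because the search for the insertion point always starts at the dummy element, inserting before the current first real child uses the same code path as inserting anywhere else in the list.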
The class Trie contains the following functions.
1. resetCurrent(): Resets the current pointer to the root.
2. getCurrentID(): Returns the unique id of current.
3. moveCurrent(x): Moves the current pointer to the node with unique id x.
4. printString(): A wrapper function that calls the actual function that prints a string that
corresponds to the path from root to current.
5. printString(Node *): The actual function that prints a string represented by the path from
root to current. This is done recursively by printing the string from root to current→parent
and then printing current→symbol.
6. insertSymbol(x): This function inserts a symbol x at the current node. It returns the 2-tuple
(node id, x) as a single string. Please print out a useful message and exit the program if the
current node already contains the edge x.
7. readSymbol(x): This function reads a symbol x from the current node: it walks from the current
node to its child along edge x and updates the current pointer to that child. It returns the
symbol x upon a successful read. Otherwise, when the current node does not have an edge x, it
returns the string “\0” and leaves the current pointer unchanged.
Please implement the last three functions. Other functions are already provided for you.
Also note that the Trie contains a map (nodeMap) from unsigned long to Node pointers. To perform
the moveCurrent(x) function, we need to be able to move the current pointer to a node with unique
id x. We use the STL map to maintain a collection of elements, where each element maps an id
to a node pointer. Each node is inserted into this map under its unique id (see the
constructor and moveCurrent). You will not use any STL object apart from nodeMap.
The function toString(x) is a templated function (it can be executed for any type). This is used to
convert chars and ints to strings.
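A templated conversion function of this kind is commonly written with a string stream. The sketch below shows one plausible shape under the hypothetical name toStringSketch; the version provided with the lab may be implemented differently.

```cpp
#include <sstream>
#include <string>

// One plausible shape for a templated to-string helper (hypothetical name).
template <typename T>
std::string toStringSketch(const T& value) {
    std::ostringstream os;
    os << value;        // works for any type that supports operator<<
    return os.str();
}
```

For example, toStringSketch(4UL) + toStringSketch('a') produces the string “4a”, which is the form of the tuples printed by the encoder.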
1.5.1 Compiling and Testing
You are to write your own code to test the Trie class. No additional code is provided for this task.
If you named your test file testTrie.cpp, then you can compile using the following command.
g++ testTrie.cpp -o testTrie trie.cpp
You can enter your input in a file named testTrieInput and redirect your input while executing.
./testTrie < testTrieInput
You can also redirect your output to, say, a file named testTrieOutput by running the following
command.
./testTrie < testTrieInput > testTrieOutput
You have to thoroughly test the Trie class. Make sure your program works for all boundary cases.
You will receive full credit only if your code works for all cases.
1.6 Compression and De-Compression Implementation
Please open the files encoding.cpp and decoding.cpp and complete the implementation of the
compression and de-compression algorithms. For both programs, please accept input from standard
input (cin) and print output to standard output (cout). The programs currently read the input one
character at a time, including whitespace.
1.7 Additional Reporting
The file stuff is a large text file. Please run it through the encoding algorithm to produce the
LZ compressed file stuffenc. Then run the decoding algorithm with input from stuffenc to produce
stuffdec. The file stuffdec should be exactly the same as stuff. You can compare the two files by
running the following command in unix:
diff stuff stuffdec
Also report the percentage reduction in size from stuff to stuffenc. You can easily determine
the total number of bytes in a file (say stuff) by running the following command:
wc -c stuff
Compare the number of bytes in stuff, stuffenc and stuffdec.
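The handout does not spell out a formula for the percentage reduction. A common reading, assumed here, is the size difference relative to the original size. The helper name percentReduction and the byte counts in the example are made up for illustration.

```cpp
// Percentage reduction when going from originalBytes to encodedBytes.
// Assumes originalBytes is non-zero. A negative result means the encoded file grew.
double percentReduction(unsigned long originalBytes, unsigned long encodedBytes) {
    return 100.0 * (static_cast<double>(originalBytes) - static_cast<double>(encodedBytes))
           / static_cast<double>(originalBytes);
}
```

For instance, if wc -c reported 1000 bytes for stuff and 600 bytes for stuffenc, percentReduction(1000, 600) would give a 40 percent reduction.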
2 Submission
Please submit your solutions through handin.cs.clemson.edu. The due date for this assignment is
Friday April 25th at 11:59pm.
3 Grading
Each section in this assignment is graded out of ten points. No points will be awarded to a program
that does not compile. You will receive 1 point for successful compilation and 1 point for successful
execution (no seg faults or infinite loops). 3 points each are awarded for correctness and
efficiency, and 2 points are awarded for good code: readability, comments, simplicity, elegance, etc.