Course: CpSc 212: Algorithms and Data Structures
Lab 13: Tries
Spring 2014

1 Lempel-Ziv Compression and De-Compression

In this lab, you will build a trie to perform Lempel-Ziv (LZ) compression and de-compression on a large text file. Most industrial-grade compression tools, such as rar and zip, use some form of LZ compression.

1.1 The Trie Data Structure

A trie is a data structure that holds a collection of strings and supports prefix queries. That is, it returns all strings in the trie that start with a given prefix. For example, for the collection “chef”, “chip”, “code”, “cow”, “egg”, “ego”, “to”, “told”, “top” and the prefix “co”, the trie returns the words “code” and “cow”, since both start with “co” (see Figure). Notice that each edge in the trie corresponds to a symbol, and walking “co” from the root lands on a node whose subtree contains all the matched strings. A dummy ‘$’ symbol is added at the end of each string. It marks a string that belongs to the collection even though it is a prefix of other strings, such as “to”, which is a prefix of “told” and “top”.

We will build a trie to store substrings of our text. Each node in our trie has a unique node id. Each node also stores a symbol, which labels the edge between the node and its parent.

1.2 The Lempel-Ziv Encoding Algorithm

The LZ compression algorithm identifies repeated substrings and stores each of them only once in a trie. It works very well for large text files, since long repeated substrings are stored only once. It may not work as well for short texts, since they may contain few repeated substrings. LZ compression and de-compression are also called LZ encoding and decoding, respectively.

The encoding algorithm scans the text file once and builds a trie on some of the substrings it sees. The trie is updated as more substrings are found. The algorithm is described in the following pseudocode.

1. Create a trie with a single node, root.
2. Set current = root.
3. Read input symbol x. If the end of the file has been reached, go to Step 6.
4. If there is an edge x from current to some node j in the trie, walk down that edge, i.e. set current = j. Go to Step 3.
5. Otherwise, there is no edge x at the current node. Create a new node j with a unique id, and let the edge (current, j) be x. Output the 2-tuple (current node ID, x). Note that the parentheses and the comma are omitted in the output. Reset current to root. Go to Step 3.
6. If current is not root, output the current node ID.

Note that each node created has a unique integer ID. Step 6 is necessary to encode a final substring that does not end with the creation of a new node.

For the input string “aabababaababaab”, the string “0a1b2a0b1a4a6a4” is printed and the trie shown in the following figure is built.

[Figure: trie rooted at node 0 with edges 0 -a-> 1, 0 -b-> 4, 1 -b-> 2, 1 -a-> 5, 2 -a-> 3, 4 -a-> 6, 6 -a-> 7]

Note that the input string and the LZ encoded string take the same number of bytes in the above example. For large text files, however, repeated substrings will make the encoded string shorter than the input.

1.3 The Lempel-Ziv Decoding Algorithm

Lempel-Ziv is a lossless compression algorithm, which means that the original string can be reconstructed without any loss of information. The decoding algorithm builds the same trie as the encoder and works as follows.

1. Read a node ID and the corresponding symbol x.
2. Set current to point to the node with id ID.
3. If no symbol is present, go to Step 5.
4. Create a new node j as a child of current, and let the edge (current, j) be x.
5. Print the string on the path from the root to current, followed by the symbol x if one was read.
6. Go to Step 1 if the end of the input has not been reached.

For the LZ encoded string “0a1b2a0b1a4a6a4”, the original string “aabababaababaab” is reconstructed. Note that the trie is exactly the same as the one constructed during LZ encoding (see the figure above).
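The encoding and decoding steps above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the lab's Trie class: the function names lzEncodeSketch and lzDecodeSketch are hypothetical, the encoder stores children in std::map for brevity (the lab code uses first-child and next-sibling pointers instead), and the decoder assumes the original text contains no digit characters, because node ids and symbols are written without separators.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of trie.h): encodes `text` following the
// steps of Section 1.2. The vector index is the node id; node 0 is the root.
std::string lzEncodeSketch(const std::string& text) {
    std::vector<std::map<char, std::size_t>> children(1);  // root only
    std::string out;
    std::size_t current = 0;
    for (char x : text) {
        auto it = children[current].find(x);
        if (it != children[current].end()) {
            current = it->second;  // Step 4: walk down the existing edge
            continue;
        }
        // Step 5: create a node, output (current id, x), reset to root.
        std::size_t newId = children.size();
        children[current][x] = newId;
        children.emplace_back();
        out += std::to_string(current);
        out += x;
        current = 0;
    }
    if (current != 0) out += std::to_string(current);  // Step 6
    return out;
}

// Hypothetical helper: decodes the output of lzEncodeSketch following
// Section 1.3. Each node stores (parent id, edge symbol).
// Assumption: the original text contains no digits.
std::string lzDecodeSketch(const std::string& code) {
    std::vector<std::pair<std::size_t, char>> nodes(1);  // nodes[0] = root
    std::string out;
    std::size_t i = 0;
    while (i < code.size()) {
        // Step 1: read the node id, a run of decimal digits.
        std::size_t id = 0;
        while (i < code.size() &&
               std::isdigit(static_cast<unsigned char>(code[i]))) {
            id = id * 10 + static_cast<std::size_t>(code[i] - '0');
            ++i;
        }
        assert(id < nodes.size());  // the id must name an existing node
        // Step 5: rebuild the string on the path from the root to node id.
        std::string phrase;
        for (std::size_t n = id; n != 0; n = nodes[n].first) {
            phrase.insert(phrase.begin(), nodes[n].second);
        }
        out += phrase;
        if (i < code.size()) {  // Step 4: a symbol follows, so add a node
            char x = code[i++];
            nodes.emplace_back(id, x);
            out += x;
        }
    }
    return out;
}
```

Calling lzEncodeSketch("aabababaababaab") yields "0a1b2a0b1a4a6a4", and lzDecodeSketch on that result returns the original string, matching the worked example above.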
1.4 Downloading and Extraction

Please first download the lab13.tar.gz file from the following webpage:

    http://people.cs.clemson.edu/~rmohan/course/cpsc2120-sp14/

Next, please extract the files using the following command.

    tar -zxvf lab13.tar.gz

This command extracts all the files into a directory named lab13. Please change to that directory using the command:

    cd lab13

1.5 The Trie Implementation

Please open the file trie.h to see the specification of the class Trie. It contains pointers to the root and current nodes in the trie. It also contains size, which holds the number of nodes in the trie. When we create a new node, the new node takes a unique integer id equal to the size. The trie is instantiated with a root node with id 0.

Each node in the trie contains a unique integer id, a character symbol that identifies the edge to its parent, and pointers to the parent, first child and next sibling. Since the trie is a multi-way branching tree, we choose a representation based on just two pointers: the first child and the next sibling. We can step into any child by first stepping into the first child and then following the next sibling pointers to the actual child. This greatly reduces the space needed to represent each node. Also, the next sibling pointers can be kept in sorted order to facilitate easy searching. To facilitate easy insertions, the first child is a dummy element. Hence the actual list of children starts from first child→next sibling. Note that the class Trie is declared as a friend of class Node so that it can easily access Node's private data members.

The class Trie contains the following functions.

1. resetCurrent(): Resets the current pointer to the root.
2. getCurrentID(): Returns the unique id of current.
3. moveCurrent(x): Moves the current pointer to the node with unique id x.
4. printString(): A wrapper function that calls the actual function that prints the string corresponding to the path from root to current.
5. printString(Node *): The actual function that prints the string represented by the path from root to current. This is done recursively by printing the string from root to current→parent and then printing current→symbol.
6. insertSymbol(x): Inserts a symbol x at the current node. It returns the 2-tuple (node id, x) as a single string. Please print a useful message and exit the program if the current node already contains the edge x.
7. readSymbol(x): Reads a symbol x from the current node. This walks from the current node to its child along edge x and updates the current pointer to the child node. It returns the symbol x upon a successful read. Otherwise, when the current node does not have an edge x, it returns the string “\0” and the current node is not modified.

Please implement the last three functions. The other functions are already provided for you.

Also note that the Trie contains a map (nodeMap) from unsigned long to Node pointers. To perform the moveCurrent(x) function, we need to be able to move the current pointer to the node with unique id x. We use the STL map to maintain a collection of elements, where each element maps an id to a node pointer. Each node is inserted into this map under its unique id (see the constructor and moveCurrent). Do not use any STL object apart from nodeMap.

The function toString(x) is a templated function (it can be executed for any type). It is used to convert chars and ints to strings.

1.5.1 Compiling and Testing

You are to write your own code to test the Trie class. No additional code is provided for this task. If you named your test file testTrie.cpp, then you can compile using the following command.

    g++ testTrie.cpp -o testTrie trie.cpp

You can enter your input in a file named testTrieInput and redirect your input while executing.

    ./testTrie < testTrieInput

You can also redirect your output to a file, say testTrieOutput, by running the following command.

    ./testTrie < testTrieInput > testTrieOutput

You have to test the Trie class thoroughly. Make sure your program works for all boundary cases. You will receive full credit only if your code works for all cases.

1.6 Compression and De-Compression Implementation

Please open the files encoding.cpp and decoding.cpp and complete the implementation of the compression and de-compression algorithms. For both programs, please accept input from standard input (cin) and print output to standard output (cout). The programs currently read a single character at a time from the input, including whitespace.

1.7 Additional Reporting

The file stuff is a large text file. Please run it through the encoding algorithm to produce the LZ compressed file stuffenc. Then run the decoding algorithm with input from stuffenc to produce stuffdec. The file stuffdec should be exactly the same as stuff. You can compare the two files by running the following command in unix:

    diff stuff stuffdec

Also report the percentage reduction in size from stuff to stuffenc. You can determine the total number of bytes in a file (say stuff) by running the following command:

    wc -c stuff

Compare the number of bytes in stuff, stuffenc and stuffdec.

2 Submission

Please submit your solutions through handin.cs.clemson.edu. The due date for this assignment is Friday April 25th at 11:59pm.

3 Grading

Each section in this assignment is graded out of ten points. No points will be awarded to a program that does not compile. You will receive 1 point for successful compilation and 1 point for successful execution (no seg faults or infinite loops). 3 points each are awarded for correctness and for efficiency, and 2 points for good code: readability, comments, simplicity, elegance, etc.
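As a supplement to Section 1.5, the first-child and next-sibling layout can be illustrated with a small sketch. It is not the code in trie.h: the struct SketchNode and the helpers makeNode, findPredecessor, findChild, insertChild and destroy are hypothetical names chosen for illustration. The sketch assumes that each node owns a dummy head for its child list and that siblings are kept sorted by symbol, as that section suggests.

```cpp
#include <cassert>

// Hypothetical node layout for illustration; the real members are in trie.h.
struct SketchNode {
    unsigned long id;
    char symbol;              // label of the edge to the parent
    SketchNode* parent;
    SketchNode* firstChild;   // dummy head; real children start at
                              // firstChild->nextSibling
    SketchNode* nextSibling;
};

// Allocates a node together with the dummy head of its empty child list.
SketchNode* makeNode(unsigned long id, char symbol, SketchNode* parent) {
    SketchNode* dummy = new SketchNode{0, '\0', nullptr, nullptr, nullptr};
    return new SketchNode{id, symbol, parent, dummy, nullptr};
}

// Returns the list element after which a child with edge x belongs.
// Thanks to the dummy head, this is never null, even for an empty list.
SketchNode* findPredecessor(SketchNode* node, char x) {
    assert(node != nullptr && node->firstChild != nullptr);
    SketchNode* prev = node->firstChild;
    while (prev->nextSibling != nullptr && prev->nextSibling->symbol < x) {
        prev = prev->nextSibling;
    }
    return prev;
}

// Returns the child of `node` reached by edge x, or nullptr if none exists.
SketchNode* findChild(SketchNode* node, char x) {
    SketchNode* next = findPredecessor(node, x)->nextSibling;
    return (next != nullptr && next->symbol == x) ? next : nullptr;
}

// Inserts a child with edge x at its sorted position.
// Returns the new child, or nullptr if the edge already exists.
SketchNode* insertChild(SketchNode* node, char x, unsigned long id) {
    SketchNode* prev = findPredecessor(node, x);
    if (prev->nextSibling != nullptr && prev->nextSibling->symbol == x) {
        return nullptr;
    }
    SketchNode* child = makeNode(id, x, node);
    child->nextSibling = prev->nextSibling;  // splice in after prev
    prev->nextSibling = child;
    return child;
}

// Frees a node, its dummy head, and all of its descendants.
void destroy(SketchNode* node) {
    if (node == nullptr) return;
    SketchNode* child = node->firstChild->nextSibling;
    while (child != nullptr) {
        SketchNode* next = child->nextSibling;
        destroy(child);
        child = next;
    }
    delete node->firstChild;
    delete node;
}
```

The dummy head is what keeps insertChild free of special cases: inserting into an empty list, at the front, or in the middle all reduce to splicing a node in after the predecessor.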