SNLP 2015
Exercise 09
Submission date: 10.07.2015, 23:59

Kullback-Leibler Divergence and Text Compression

1. (4 points) Compute the Kullback-Leibler (KL) divergence between the token distributions of any two texts from the supplementary material. The probability of each token should be calculated using the maximum likelihood estimate.

   (a) (1 point) Start by normalizing the text in each file (lowercasing, punctuation removal, and tokenization) and computing the maximum likelihood estimates.
   (b) (2 points) Compute the KL divergence between any two files [1]. To handle out-of-vocabulary words, use Lidstone (additive) smoothing for the second distribution.
   (c) (1 point) How do you interpret the result?

2. (6 points) In this exercise you will implement a text compression algorithm known as Huffman coding [2].

   (a) (3 points) Use the following algorithm to build a Huffman tree for the phrase “she sells sea shells on the sea shore”:

       Input: A probability distribution P for a set of symbols S, calculated using the maximum likelihood estimate of each character in the input.
       Output: A binary tree encoding P.

       (1) for each symbol s ∈ S: create a tree leaf node holding (s, P(s))
       (2) place all nodes into a queue Q and order them by probabilities
       (3) while Q contains more than one element:
             remove the two nodes s1, s2 with the smallest probabilities from Q
             create a new node with s1, s2 as children and P(s1) + P(s2) as probability
             add the new node to the queue
       (4) return the code tree T (its root is the last node left in Q)

   (b) (1 point) Once the tree is complete, you can read off the code for each symbol by traversing the tree from the root to the leaf node containing the symbol and recording 0 for a left branch and 1 for a right branch. Use the obtained code to encode the text.
   (c) (1 point) How long is the coded text, and what is the theoretically expected length of the text according to the formula from the lecture slides [3]?
   (d) (1 point) Which condition must the lengths of the code words satisfy so that a prefix code exists for this example? Explain and check whether the condition holds.

[1] Note that KL divergence is not symmetric.
[2] Additional information can be found in this article: http://en.wikipedia.org/wiki/Huffman_coding
[3] Note that the lecture slides present the expected length per word.

Submission instructions: read carefully

• You should form groups of 3 people.
• Submit only 1 archive file in ZIP format, with a name containing the matriculation numbers (MN) of all team members, e.g.: Exercise 01 MatriculationNumber1 MatriculationNumber2 MatriculationNumber3.zip
• Provide in the archive:
  – your code, accompanied by sufficient comments,
  – a PDF report with answers, solutions, plots and brief instructions on executing your code,
  – a README file with the group member names, matriculation numbers and emails,
  – the data necessary to reproduce your results [4].
• The subject of your submission mail must contain the string “[SNLP]” (including the brackets) and explicitly state that it is an exercise submission, e.g.: [SNLP] Exercise Submission 01
• Depending on your tutorial group, send your assignment to the corresponding tutor:
  – [email protected]
  – [email protected]
  – [email protected]

General information

• In your mails to us regarding the tutorial, please add the tag “[SNLP]” to the subject, together with a short description of the contents.
• Feel free to use any programming language of your liking. However, we strongly advise in favour of Python, due to the abundance of available tools (also note that Python 3 comes with excellent native support for UTF-8 strings).
• Avoid using libraries that solve what we ask you to do (unless otherwise noted).
• Avoid building complex systems. The exercises are simple enough.
• Do not include any executables in your submission, as this may cause the e-mail server to reject it.
• In case of copying, all participants (including the authors of the original solution) will get 0 points for the whole assignment. Note: it is rather easy to identify a copied solution. Plagiarism is also not tolerated.
• Missing the deadline, even by a few minutes, will result in a 50% point reduction. Submissions made after the next tutorial will not be corrected, as the solutions will already have been discussed.
• Please include in your solutions everything necessary to support your claims. Failure to do so might result in a reduction of points in the relevant questions.
• Each assignment has 10 points and perhaps some bonus points (usually 2 or 3). In order to qualify for the exam, you need 2/3 of the total points. For example, if there are 12 assignments, you need to collect at least 80 out of the 120 points to be eligible for the exam. A person who gets 10 plus 2 bonus points in every exercise needs to deliver only 7 assignments in order to be eligible, since 7 · 12 = 84.
• Attending the tutorial gives a 30% point increase, disregarding bonus points. For example, if a team scores 8 plus 2 bonus points, the total grade is 8 + 2 = 10. Each student of the team who attends the corresponding tutorial is attributed a final grade of 8 · 1.3 + 2 = 12.4 points.
• Exercise points (including any bonuses) only guarantee admittance to the exam; they have no further effect on the final exam grade.

[4] If you feel that these files are too large for an email submission, please provide a reasonably convenient means for us to access them online.
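
For orientation on Exercise 1, the computation can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the two input strings, the function names, the smoothing constant `alpha=0.1`, and the choice of the union of both token sets as vocabulary are all hypothetical and not prescribed by the exercise. The divergence is computed in bits as D(P || Q) = Σ P(w) · log2(P(w) / Q(w)).

```python
import math
import string
from collections import Counter


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def mle(tokens):
    """Maximum likelihood estimate: relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def lidstone(tokens, vocab, alpha):
    """Lidstone-smoothed estimate over a fixed vocabulary (assumed: union vocab)."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {tok: (counts[tok] + alpha) / denom for tok in vocab}


def kl_divergence(p, q):
    """D(P || Q) in bits; q must be nonzero wherever p is nonzero."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


# Toy inputs (assumption); in the exercise these would be two supplementary files.
text_p = "The cat sat on the mat."
text_q = "The dog sat on the log and barked."

tokens_p, tokens_q = tokenize(text_p), tokenize(text_q)
vocab = set(tokens_p) | set(tokens_q)

p = mle(tokens_p)
q = lidstone(tokens_q, vocab, alpha=0.1)
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(mle(tokens_q), lidstone(tokens_p, vocab, alpha=0.1))
print(d_pq, d_qp)
```

Smoothing only the second distribution keeps every Q(w) strictly positive, so the logarithm is defined for tokens that occur in the first text but not in the second. Computing both directions illustrates that the two values generally differ.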
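
The queue-based procedure from Exercise 2 can likewise be sketched in Python. This is one possible sketch, not the intended solution: it swaps the ordered queue for a binary heap (`heapq`), breaks ties by insertion order (an arbitrary choice that affects the individual codes but not the total length), and the names `huffman_tree` and `code_table` are hypothetical. It also assumes that the expected length asked for in (c) is the probability-weighted average code word length, bounded below by the entropy, and that the condition in (d) is the Kraft inequality Σ 2^(-l(s)) ≤ 1; check both against the lecture slides.

```python
import heapq
import itertools
import math
from collections import Counter


def huffman_tree(probs):
    """Build a Huffman tree; leaves are symbols, inner nodes are (left, right)."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, n1 = heapq.heappop(heap)  # smallest probability
        p2, _, n2 = heapq.heappop(heap)  # second smallest
        heapq.heappush(heap, (p1 + p2, next(tie), (n1, n2)))
    return heap[0][2]


def code_table(node, prefix=""):
    """Read off codes: 0 for a left branch, 1 for a right branch."""
    if not isinstance(node, tuple):
        return {node: prefix or "0"}  # "0" covers a single-symbol input
    left, right = node
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table


text = "she sells sea shells on the sea shore"
counts = Counter(text)
probs = {sym: c / len(text) for sym, c in counts.items()}  # MLE per character

codes = code_table(huffman_tree(probs))
encoded = "".join(codes[ch] for ch in text)

# Quantities relevant to parts (c) and (d), under the assumptions named above.
entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(codes[s]) for s in probs)
kraft = sum(2.0 ** -len(c) for c in codes.values())

print(len(encoded), avg_len, entropy, kraft)
```

Different tie-breaking rules produce different trees, so the individual code words may not match a hand-built tree even when the total encoded length does.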