2015 Eighth International Workshop on Selected Topics in Mobile and Wireless Computing
978-1-4673-7701-0/15/$31.00 ©2015 IEEE

Microtext Normalization using Probably Phonetically-Similar Word Discovery

Richard Khoury
Department of Software Engineering
Lakehead University
Thunder Bay, Canada
[email protected]

Abstract—Microtext normalization is the challenge of discovering the English words corresponding to the unusually-spelled words used in social-media messages and posts. In this paper, we propose a novel method for doing this by rendering both English and microtext words phonetically based on their spelling, and matching similar ones together. We present our algorithm to learn spelling-to-phonetic probabilities and to efficiently search the English language and match words together. Our results demonstrate that our system correctly handles many types of normalization problems.

Keywords—microtext; social media; normalization; phonetic; wiktionary

I. INTRODUCTION

The term "microtext" was proposed by US Navy researchers [1] to describe a type of text document that has three characteristics: (A) it is very short, typically one sentence or less, and possibly as little as a single word; (B) it is written in an informal manner and unedited for quality, and thus may use loose grammar, a conversational tone, vocabulary errors, and uncommon abbreviations and acronyms; (C) it is semi-structured in the Natural Language Processing (NLP) sense, in that it includes some metadata such as a time stamp, an author, or the name of a field it was entered into. Microtexts have become omnipresent in today's world: they are notably found in online posts on Facebook and Twitter, in user comments on videos, pictures and news items, and in SMS messages.

One major challenge when dealing with microtexts stems from their highly relaxed spelling rules and their tolerance to extreme irregularities in spelling. This causes problems when one tries to apply traditional NLP tools and techniques, which have been developed for conventional and properly-written English text. It could be thought that a simple find-and-replace preprocessing of the microtext would solve that problem. However, the sheer diversity of spelling variations makes this solution impractical; for example, a sampling of Twitter messages studied in [2][3] found over 4 million out-of-vocabulary (OOV) words. Moreover, new spelling variations are created constantly, both voluntarily and accidentally.

The challenge of developing algorithms to automatically correct the OOV words found in microtexts and replace them with the correct in-vocabulary (IV) words is known as normalization. In this paper, we propose a new approach to dealing with this challenge. Our underlying assumption is that microtext users recognize words not thanks to correct spelling but by sounding out the characters. Consequently, no matter how innovative microtext spellings get, the resulting OOV words must still be phonetically similar enough to the intended IV words in order for the readers to understand them. For example, a sentence like "r u askin any1 b4 teh gamez 2nite" would sound like "are you asking anyone before the games tonight?" if one were to read it out loud, and would thus be perfectly understandable despite being composed exclusively of OOV words. From this starting point, we propose to tackle the challenge of microtext normalization by building an algorithm that can determine the most probable phonetic equivalent of OOV words and match them to the most probable similar-sounding English words.

The rest of this paper is structured as follows. In Section II we define the problem of normalization more formally and present a sample of other techniques used to solve it. In Section III, we present our phonetic normalization algorithm. This presentation will be divided into subsections corresponding to the modules of the system: the letter-to-phoneme algorithm we used to convert IV and OOV words to phonetic strings, the radix tree used to model the language, and the search algorithm we experimented with to match the phonetic strings of OOV words to the correct IV words. Section IV will present and discuss the results we obtained with the different versions of our algorithm, and Section V will draw some conclusions on our work.

II. BACKGROUND

Before presenting techniques for normalization, let us define the problem more formally. It was noted in [3] that while OOV spelling variations are seemingly endless, their creation seems to follow a small set of simple rules. The rules proposed in [3] are "abbreviation" (deleting letters from the word, for example spelling the word "together" as "tgthr"), "phonetic substitution" (substituting letters for other symbols that sound the same, such as "2" for "to" in "2gether"), "graphemic substitution" (substituting a letter for a symbol that looks the same, such as switching the letter "o" for the number "0" in "t0gether"), "stylistic variation" (misspelling the word to make it look like one's personal pronunciation, such as writing "togeda" or "togethor"), and "letter repetition" (repeating some letter for emphasis, for example by typing "togetherrr"). It is important to note that, while these rules are useful tools to understand the different types of OOV words, the words themselves can overlap multiple types. For example, "togeter" could be labelled as either an abbreviation or a stylistic variation, depending on whether the labeller assumes the missing h was removed to shorten the word or to change its pronunciation.
Furthermore, the rules are not mutually exclusive, but can be combined together. For example, the word "2gthr" is the result of abbreviation and phonetic substitution, while "togethaa" comes from stylistic variation and letter repetition. Finally, some authors [4] also include OOV acronyms (such as "imho" for "in my humble opinion") as part of the normalization challenge. We do not, however, for the reason that acronyms can only be understood if all parties in a conversation know what they stand for; consequently, they are limited to a small number of standard, commonly-understood acronyms (by comparison to the millions of OOV words), and innovation is slowed by the need for the meaning of a new acronym to spread through popular consciousness. This makes them very much unlike the other types of spelling variations, where new forms are constantly being created and no standardization exists.

Given the limitless spelling variations and the multiple types of transformations that can take place, it is no surprise that a diverse set of approaches for microtext normalization have been studied in the literature. In [3], the authors proposed a letter-substitution approach. Starting from the idea that microtext OOV words are simply misspelled IV English words, they developed an algorithm to discover the most common letter substitutions between pairs of IV and OOV words and compute their probabilities. They then use their probabilistic model to generate the most probable IV words for new OOV words. The different variations of their system achieve accuracies between 57% and 76% in their experiments [3].

Going in a very different direction, [5] designed a microtext normalization dictionary, which stores pairs of OOV and IV words. A dictionary approach would naturally yield a simple and efficient algorithm with high precision; however, developing such a dictionary manually would be prohibitively expensive because of the millions of OOV words already in existence [2][3] and the constant creation of new ones. The innovation of [5] has been to propose an automated two-step method to build the dictionary. In the first step, their method finds (OOV, IV) pairs that occur in the same context, i.e. with the same surrounding words. For example, this step would find the pair (bday, birthday) from the identical contexts of "happy bday to you" and "happy birthday to you", but it would also find the pair (NY, birthday) from "happy NY to you". Then, in the second step, the pairs of words are ranked by string similarity to eliminate the abundant false positives the first step generates. This would eliminate the pair (NY, birthday) while keeping the more similar (bday, birthday). As expected, their approach gives a very high rate of normalization precision but limited recall: the dictionary-learning algorithm must constantly be fed new microtexts in order to discover new pairs of words, and a new OOV word that is not yet in the dictionary will always fail to be recognized.

Another approach to the normalization challenge was proposed by [6][7]. In [6], the authors used a rule-based algorithm to map between IV and OOV words. These rules remove double letters, unnecessary vowels, and prefixes and suffixes from English words in order to recognize similar OOV words. Their method was thus designed to handle only two types of OOV transformation, namely abbreviations and those stylistic variations that are done solely by deleting letters. In [7], they improved on their original idea by training a probabilistic machine-translation model to recognize these deletions instead of using rules. This alternative approach is founded on the idea that microtext can be handled as a separate language that needs to be translated into English. They found that the probabilistic model outperformed the rule-based approach, with an accuracy ranging from 60% to 80% in their various experiments.

To be sure, these are only a small sample of the variety of approaches proposed to tackle the problem of microtext normalization, but they do serve to illustrate the massive range of solutions proposed, from dictionary lookups to rule applications to translation models. The method we propose in this paper explores yet another direction: that of phonetically rendering the OOV words to find phonetically-similar IV words. This paper reports our methodology and the promising first set of results obtained by our prototype.

III. METHODOLOGY

Our proposed algorithm is trained to determine the probable pronunciation of English words based on their spelling. Then, when presented with a new OOV word, it determines the most probable IV words with similar pronunciations. There are thus two challenges to consider: how to map spelling to probable pronunciation at the training stage, and how to efficiently search the English language for words with pronunciations similar to an OOV word at runtime. Of these, the first is the more difficult challenge. Indeed, rendering words phonetically based solely on their spelling is a major challenge for a language such as English. One only needs to think of the many English words that are spelled almost identically but pronounced completely differently (such as "heard" and "beard", or "cough" and "dough"), or conversely of the many English homophones with completely different spellings (such as "links" and "lynx", or "some" and "sum"), to realize this.

A. Phonetic Examples

The technique we used to train our system to map letters to sounds is to get a list of English words with their correct pronunciations to use as examples of possible letter mappings.
For this list, we turned to Wiktionary (www.wiktionary.org), one of the projects of the Wikimedia Foundation, the same group that manages Wikipedia and several other wiki projects. Much like Wikipedia, Wiktionary is a free online dictionary built in multiple languages through user contributions. However, while Wikipedia has been used in a wide range of research projects ranging from engineering to social studies [8], adoption of Wiktionary has been considerably slower. To illustrate this difference, searching the ACM Digital Library in May 2014 for "Wikipedia" finds over 20,000 published papers while a search for "Wiktionary" finds only 295 papers, and the same search in the IEEE Xplore Digital Library finds 839 papers for "Wikipedia" and a mere 4 for "Wiktionary".

We obtained a copy of the English Wiktionary from the Wikimedia download site (http://dumps.wikimedia.org) as an XML file. This makes it easy for software to pick out individual articles and process them line by line. The word that an article defines is always found in the title line, between "<title>" and "</title>" markup tags. The entire text of the article will likewise be found between "<text xml:space="preserve">" and "</text>" tags, and will contain a section heading for each language the word is defined in. This means that the article for an English word will always contain the section heading "==English==" somewhere in its text field. Under that heading, many English words will include pronunciation notes in one or several English dialects, namely UK English, US English, Canadian English, Australian English or New Zealand English.
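The dump processing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag names follow the dump format described in the text, but the exact IPA template syntax (here assumed to be `{{IPA|/…/}}`) and the function name are our own assumptions.

```python
import re

def extract_ipa(dump_lines):
    """Scan a Wiktionary XML dump line by line, yielding (word, pronunciation)
    pairs for English entries only.  IPA pronunciations are assumed to appear
    inside {{IPA|...}} templates, delimited by slashes, e.g. {{IPA|/əˈbaʊt/}}."""
    word, in_english = None, False
    for line in dump_lines:
        m = re.search(r"<title>(.*?)</title>", line)
        if m:  # a new article starts; reset the language flag
            word, in_english = m.group(1), False
            continue
        if "==English==" in line:
            in_english = True
        elif re.match(r"==[^=]", line.strip()):  # another top-level language section
            in_english = False
        if word and in_english and "{{IPA" in line:
            for ipa in re.findall(r"/([^/]+)/", line):
                yield word, ipa
```

Subsection headings such as `===Pronunciation===` start with three equals signs, so the `==[^=]` test only resets the flag at the next language section.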
Each pronunciation is written with standard IPA phonetic symbols and enclosed in Wiki-language curly brackets with an IPA tag, again making them easy for an algorithm to pick out automatically. For example, the word "about" has the standard pronunciation /əˈbaʊt/, two Canadian pronunciations /əˈbɐʊt/ and /əˈbʌʊt/, and an Irish pronunciation /əˈbɛʊt/.

Fig. 1. Sample of the phonetic radix tree (branches spelling out, among others, the similar-sounding words "ewe", "yew", "y'all", "your", "yours", "york", "use", "yous", "yule", "u.s." and "user").

The probability of a phonetic symbol as a child of a given node. This is computed by counting the number of times a phonetic symbol occurs as a child of another symbol along a specific path. For example, in Figure 1, the symbol "z" will have a probability of 60% as a child of "ju", while "l" and "ɛ" will have probabilities of 20% each. By contrast to the first statistic, this one is local, in that it is computed for each node of the tree independently of all others.

Our processing extracted 37,500 pronunciations of 30,368 different English words. Wiktionary uses very fine-grained pronunciations, as exemplified by the four different ways of representing the "ou" in "about" presented above. We find that there are in fact 151 different IPA symbols used individually or in combination to represent 230 different letters and groups of letters in our training data.

In order to allow our system to deal with the different types of OOV words and normalization challenges, we also manually defined certain additional data structures:

B. Phonetic Tree and Training

The foundation of our normalization algorithm is a radix-tree-structured phonetic dictionary of the English language. Starting from the root, each word's phonetic transcription is inserted symbol by symbol, with each symbol being a separate child in the tree. Consequently, every word can be read phonetically by following a path through the tree.
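The tree and the two training statistics can be sketched as follows. This is our own minimal illustration, not the paper's implementation: the class and function names are assumptions, pronunciations are assumed to arrive pre-segmented into IPA symbols, and the letter-to-symbol alignments needed for the global statistic are assumed to be given.

```python
from collections import defaultdict

class PhoneticNode:
    """One tree node; the edge leading to each child is a single IPA symbol."""
    def __init__(self):
        self.children = {}              # symbol -> PhoneticNode
        self.counts = defaultdict(int)  # times each symbol was seen as a child here
        self.words = []                 # IV words whose pronunciation ends here

def insert(root, symbols, word):
    """Insert one pronunciation symbol by symbol; homophones share a node."""
    node = root
    for s in symbols:
        node.counts[s] += 1
        node = node.children.setdefault(s, PhoneticNode())
    node.words.append(word)

def local_probability(node, symbol):
    """Local statistic: P(symbol | parent node), computed per node."""
    total = sum(node.counts.values())
    return node.counts[symbol] / total if total else 0.0

def learn_global_probabilities(aligned_pairs):
    """Global statistic: P(symbols | letters) over the whole training set.
    `aligned_pairs` holds (letters, symbols) alignments such as ('ou', 'aʊ');
    producing the alignments themselves is outside this sketch."""
    counts = defaultdict(lambda: defaultdict(int))
    for letters, symbols in aligned_pairs:
        counts[letters][symbols] += 1
    return {l: {s: c / sum(sc.values()) for s, c in sc.items()}
            for l, sc in counts.items()}
```

Storing the child counts in each node is what makes the local probabilities computable on demand rather than stored explicitly.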
A node which holds the last phonetic symbol of a word will store a list of all the words that have the pronunciation represented by the path (there might be multiple words, in the case of homophones). Moreover, the path may continue beyond that node, as some words can also be prefixes of longer words. An example is given in Figure 1.

The similar-sound list is a list mapping each phonetic symbol to the other symbols that sound similar to it. This is needed for a normalization system such as ours, whose purpose is to recognize similar-sounding words, since the Wiktionary data makes very fine-grained distinctions between sounds, as the "about" example and Figure 1 illustrated. Our method should not fail to recognize, for example, the word "yours" because it was rendered phonetically as "joʊɹz" instead of "jɔɹz"; imparting to the system that "oʊ" sounds similar to "ɔ" solves that issue. Creating this list was the most labour-intensive manual step in building our system. In future work we will devise a way to build this list automatically, by aligning different IPA versions of the same word and pairing together the symbols found in the same positions in the word. For a first iteration of the prototype, however, we feel that a manually-defined list is acceptable for a proof-of-concept.

The tree is built by inputting each of the 37,500 phonetic transcriptions one by one, starting at the root and branching off as needed. While this is done, the algorithm also learns two sets of phonetic probabilities:

The probability of a phonetic symbol or set of phonetic symbols given a letter or set of letters.
This is computed by counting the number of times a (set of) letter(s) is mapped to a (set of) phonetic symbol(s) in the training data, and taking the ratio of that value to the total number of occurrences of that (set of) letter(s) in the training data. Note that this is a global statistic, in the sense that the probability is computed for each letter over all its occurrences in the training data, regardless of the surrounding letters in a specific word.

In order to deal with stylistic variations, which often involve changing the sound of vowels in words, we defined the vowel sound list. This is literally a list of all phonetic symbols associated to the vowel letters A, E, I, O, U and Y in the training data, and can be obtained directly from the table of letter-to-symbol probabilities built during the training of our system.

On the other hand, if the prefix letter is mapped to a phonetic symbol that is actually a combination of multiple symbols (for example the letter "x" mapping to the two symbols "ks"), then multiple children are generated in sequence and the new edge node is the last one in the sequence (step 8d). Additionally, we mentioned that our system includes a similar-sound list for phonetic symbols that sound similar to each other. This list is used in steps 4a and 5a: when the global or local probability of a phonetic symbol is computed in steps 4 and 5, these sub-steps can then find all phonetic symbols that sound alike in the list and sum their probabilities together, to get a total probability for the entire sound instead of a fine-grained probability for individual phonetic symbols.

In order to handle graphemic substitutions, we build into the system a simple graphemic substitution list: 4 for A, 1 for I or L, 0 for O, 3 for E, and 7 for T. Note that our system is also aware of the phonetic rendering of these numbers, which is part of its training data.
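The graphemic substitution list just listed is small enough to write out directly; a sketch (the mapping is taken from the text above, while the function name and the list-of-alternatives representation for the ambiguous "1" are our own):

```python
# Graphemic substitution list from Section III: each digit and the letter(s)
# it visually stands in for.  '1' is ambiguous between 'i' and 'l'.
GRAPHEMIC = {'4': ['a'], '1': ['i', 'l'], '0': ['o'], '3': ['e'], '7': ['t']}

def graphemic_alternatives(ch):
    """Letters a character may stand for graphemically (consulted by step 6
    of the search algorithm); empty if the character has no graphemic reading."""
    return GRAPHEMIC.get(ch, [])
```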
Deciding whether a number in a word should be taken as a graphemic substitution or a phonetic substitution is part of the challenge of normalization.

Finally, we included a word probability list in our system. The list chosen is the freely available list of 40,000 words from Project Gutenberg (http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg), provided with their frequency counts over all books of the project.

Graphemic substitutions are handled in step 6 of the algorithm. At steps 4 and 5, the prefix letter obtained at step 3 is assumed to represent a sound. At step 6, the prefix is instead compared to the graphemic list built in the training stage; if found, it is replaced by the appropriate letter, and that letter is then used in a repeat of steps 4 and 5.

C. Search Algorithm

Once the system has been trained, it can be used for the purpose of microtext normalization. This requires determining which IV word in the tree sounds similar to a given OOV word from a microtext message. However, since several words can be found to be similar to the OOV word, our system can return a list of possible words with their associated probabilities, in order to determine which word was most likely intended by the user.

Finally, we need our system to deal with the case where a sound has been deleted altogether in the microtext word. This is done in step 7, using the list of vowel sounds developed during the training stage. Indeed, the deleted sound is usually a vowel sound rather than a consonant sound, a fact noted by [3][6]. Consequently, in step 7 our algorithm also expands the edge nodes corresponding to vowel-sound children of the current edge node. This expansion can only use the local probability of the symbol, since the global probability will be zero (otherwise it would be a phonetic symbol of the current letter and would already have been considered in step 4). So instead of a global probability value, we multiply the probability here by a dampening value.
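The skipped-vowel expansion of step 7 can be sketched as follows. The node layout, the contents of `VOWEL_SOUNDS` and the dampening constant of 0.1 are all our own illustrative assumptions; the paper does not give a value for the dampening factor.

```python
VOWEL_SOUNDS = {'ə', 'ɪ', 'ʊ', 'æ', 'ɛ', 'i', 'u', 'ɔ', 'ɑ'}  # illustrative subset
DAMPENING = 0.1  # stands in for the zero global probability of an unwritten sound

def vowel_skip_expansions(node, prob):
    """Step 7: also expand children of the current edge node whose symbol is a
    vowel sound, even though no remaining letter maps to it.  Only the local
    probability is available, so it is scaled by a dampening value instead of
    a global probability.  `node` must expose `children` and `counts` dicts."""
    total = sum(node.counts.values()) or 1
    return [(sym, prob * DAMPENING * node.counts[sym] / total, child)
            for sym, child in node.children.items() if sym in VOWEL_SOUNDS]
```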
The steps of the basic search algorithm are presented in Figure 2. The algorithm searches the tree in a uniform-cost manner. It maintains a search list of edge nodes, each one tracking the list of letters in the OOV word that have not yet been rendered phonetically, the string of phonetic symbols rendered so far, the probability of the current phonetic rendering, and where in the tree that edge node is located. At each iteration, the search algorithm picks the edge node with the highest probability (step 1 in Figure 2), looks at the letters still not rendered phonetically, and finds all prefixes of letters that have phonetic-symbol equivalents (step 3). For example, for the word "aura" it would find two prefixes, the letter "a" and the two-letter "au", each of which maps to a set of phonetic symbols (including a silent-letter symbol). Those symbols and their associated global probabilities are retrieved (step 4). Moreover, since the search follows a path through the radix tree, there is at the current node only a limited set of valid children nodes, each one representing a phonetic symbol and each one with a local probability (step 5). This makes it possible to generate a new list of edge nodes to replace the one currently being considered (step 8). Each of these new edge nodes will lose the prefix letters from the word but add the phonetic symbols, will have a new probability that is the product of the previous probability with the local and global probabilities of the added phonetic symbol, and will be located at the corresponding child node in the tree. As mentioned in step 8a, each new edge node keeps track of the OOV word as it is gradually stripped of letters already matched to phonetic symbols. Thus, when the highest-probability node in the search list has no letters left in it, it means that path has been searched to exhaustion (step 2).
The probability of that terminal node is the probability of the pronunciation given the original OOV word, and the IV words stored in the node are those that share this pronunciation. However, not all words are equally likely to be used in an English message; some words are very common while others are rarer and should be considered less likely even if their pronunciation is a better match for the OOV word. For example, the OOV word "togeda" is phonetically nearer to "toga" than to "together", but the latter is much more common in microtext messages than the former, and that difference should inform the normalization. As mentioned in the training stage, we incorporated a list of word probabilities into our algorithm. The words retrieved in the terminal node are added to the set of possible normalized words with their pronunciation probability multiplied by the word probability (step 2b). Since this is a uniform-cost search algorithm, it then continues searching other paths through the tree and finding additional possible normalized words. This goes on until the probability of the highest normalized word is higher than that of the highest edge node left to search from (step 2c), at which point that normalized word is the most probable option and is returned. Other termination conditions could also be implemented at this step, for example to accumulate a certain number of normalized words in order to return a set of normalization suggestions.

There are a few special cases worth mentioning. First, if the phonetic symbol picked for a prefix letter is the silent symbol, then the new edge node is the same as the current edge node, and the search does not move to a child in the tree in that case (step 8d).
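A simplified, self-contained sketch of this best-first search is given below. It renders one letter per step and folds in the word-probability weighting of step 2b and the termination test of step 2c; the similar-sound, graphemic, skipped-vowel and multi-letter-prefix expansions are omitted for brevity, and all names and data structures are our own assumptions rather than the paper's implementation.

```python
import heapq, itertools

class Node:
    """Minimal phonetic-tree node for this sketch."""
    def __init__(self):
        self.children, self.counts, self.words = {}, {}, []

def normalize(oov, root, global_probs, word_probs):
    """Return candidate IV words for `oov`, best first.
    `global_probs[letter]` maps phonetic symbols to P(symbol | letter);
    `word_probs` maps IV words to their corpus probability."""
    tick = itertools.count()                    # tie-breaker so the heap never compares nodes
    heap = [(-1.0, next(tick), oov, root)]      # (neg. probability, _, letters left, tree node)
    results = {}
    while heap:
        neg_p, _, letters, node = heapq.heappop(heap)
        p = -neg_p
        if results and max(results.values()) >= p:
            break                               # step 2c: best result beats all open paths
        if not letters:                         # step 2: word fully rendered
            for w in node.words:                # steps 2a-2b: weight by word probability
                results[w] = max(results.get(w, 0.0), p * word_probs.get(w, 0.0))
            continue
        letter, rest = letters[0], letters[1:]  # step 3 (single-letter prefixes only)
        total = sum(node.counts.values()) or 1
        for symbol, g in global_probs.get(letter, {}).items():   # step 4
            if symbol in node.children:                          # step 5: valid child only
                local = node.counts[symbol] / total
                heapq.heappush(heap, (-(p * g * local), next(tick),
                                      rest, node.children[symbol]))  # step 8
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)
```

Because probabilities only shrink along a path, the first time the best finished candidate outranks every open edge node, no better candidate can still be found, which is what justifies the early break.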
OOV word "tatt" and its IV version "tattoo" have a Levenshtein distance of 2, but the phonetic versions discovered by the Wiktionary probabilities are "tˠt͈" and "tˠt͈u" respectively, with a Levenshtein distance of 1. Doing this, we computed pairs of matching spelling and phonetic distances for each of the 2608 words in our testing corpus. Then, we computed the average phonetic distance for each value of the spelling distance. The results are plotted in Figure 3 for word pairs with a spelling Levenshtein distance between 1 and 10; while a few pairs of words do have a distance above 10, there are so few of them that we cannot consider their results representative of our system. As can be seen in Figure 3, the Levenshtein distance of the most probable phonetic readings (marked "probabilities alone" in the figure) initially increases linearly and roughly equally with the spelling distance. This may seem to indicate that there is little gain; however, recall that spelling only uses 32 characters (26 letters, 5 numbers, and the apostrophe) while the phonetic strings use 151 symbols. The fact that the distance between the strings remains comparable despite a five-fold increase in the number of symbols used to create a much more fine-grained representation of the words indicates that moving to the phonetic realm is indeed making it possible to find some similarities between the words. Moreover, while the Levenshtein distance of the phonetic strings initially increases in lockstep with the spelling distance, at higher spelling distances it slows down and stabilizes. To examine this further, we included our similar-sound list in the computation of the phonetic Levenshtein distance and plotted these results in Figure 3 as well. As can be seen, in that case the relationship starts off lower and stabilizes much sooner, indicating that the similar-sound list makes the phonetic similarities easier to find, as we expected.
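Both distances above are computed with the standard Levenshtein algorithm [10]; a minimal dynamic-programming implementation, which works equally on spellings and on IPA symbol strings:

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions and substitutions
    needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]
```

The two-row formulation keeps memory linear in the length of the shorter string, which matters little here but is the usual idiom.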
Initial edge node: word = input OOV word; phonetic string = ""; node probability = 1.0; current tree node = root node
Search list: initial edge node

1. Get the node with the highest phonetic probability from the search list
2. If there are no letters left in the word:
   a. Get the list of words at the current tree node
   b. Multiply each word's probability by the node probability and add it to the list of normalized words
   c. If the normalized list achieves a termination condition, return it
   d. Otherwise, go to step 1
3. Get the next prefix letter (or set of letters) of the word
4. List all phonetic symbols the letters from step 3 correspond to, and their global probabilities
   a. Add the probabilities of symbols in the similar-sound list
5. Check which symbols from step 4 are valid children of the current node, and their local probabilities
   a. Add the probabilities of symbols in the similar-sound list
6. Check for graphemic substitutions
7. Check for skipped vowels
8. Generate new edge nodes:
   a. Remove the prefix letters from the word
   b. Add the phonetic symbol to the phonetic string
   c. Multiply the node probability by the global and local probabilities
   d. Update the current tree node to the child with the matching phonetic symbol
   e. Add the new node to the search list
9. Go to step 1

Fig. 2. Steps of the search algorithm.

IV. EXPERIMENTAL RESULTS

In order to test the effectiveness of our normalization system, we used the Text Normalization Data Set from the University of Texas at Dallas [3][9]. This corpus lists 2608 OOV words that were observed in real tweets, along with their corresponding IV English word forms. We sorted these words into the five basic types of [3], namely abbreviation, phonetic substitution, graphemic substitution, stylistic variation, and letter repetition, and added a sixth type for words that belong to multiple types at once. This ensures that each OOV word only appears in one of the test types. A breakdown of all types, with the number of individual words and test results, is presented in Table I.

A.
Levenshtein Distance

The basic underlying assumption of our work, as stated in the introduction, is that pairs of IV and OOV words appear more similar to each other when read out phonetically than they do in spelling. Indeed, if that were not the case, it would be more efficient to correct the words with ordinary spelling-correction software! To verify our assumption, we measured the relationship between spelling differences and phonetic differences. We do this using the Levenshtein distance [10], a straightforward and standard string-comparison algorithm. We compute the Levenshtein distance between the spellings of an IV and OOV pair, and compare it to the Levenshtein distance between the most probable IV word's pronunciation and the OOV word's pronunciation, as computed by the letter-to-symbol probabilities learned by our system. For example, the

Fig. 3. Relationship between spelling distance and phonetic distance.

B. General Results

The experiment we ran consists in taking each OOV Twitter word and putting it through our normalization algorithm in order to see whether the correct IV English word is returned. We computed the system's accuracy when the correct IV word is the single most probable word returned, and when it is among the top-5 most probable words (for example, for a list of suggestions in correction software). The average results over the entire test corpus, and the results broken down by type of normalization, are given in Table I. One thing to note is that the top-5 results are consistently 20% to 30% higher than the top-1 results. This indicates that in a large portion of cases, our system is converging to the correct word, but the result is overshadowed by another higher-probability word.
Further study of the probabilities learned in the training stage, in order to weight them better, could lead to greatly improved top-1 results, and bring our system's performance up to the level of literature benchmarks.

TABLE I. CORPUS COMPOSITION AND TEST RESULTS BY TYPE.

Normalization type       | Word count | Top-1 accuracy | Top-5 accuracy
Overall average          | 2608       | 30.2%          | 59.7%
Abbreviation             | 806        | 29.0%          | 52.6%
Phonetic substitution    | 130        | 53.8%          | 78.5%
Graphemic substitution   | 58         | 29.3%          | 58.6%
Stylistic variation      | 820        | 31.6%          | 65.7%
Letter repetition        | 641        | 28.1%          | 62.2%
Multiple types           | 153        | 17.9%          | 38.4%

TABLE II. NORMALIZATION OF ABBREVIATION TYPES.

Abbreviation type          | Word count | Top-1 accuracy | Top-5 accuracy
Vowel                      | 315        | 33.0%          | 73.7%
Silent consonant           | 86         | 62.8%          | 95.3%
Vowel and silent consonant | 21         | 28.6%          | 66.7%
Voiced consonant           | 128        | 47.7%          | 57.8%
Vowel and voiced consonant | 40         | 22.5%          | 52.5%
Syllable                   | 216        | 0%             | 0.5%

letter substitutions programmed into our system and mentioned in Section III, and five additional letter-to-letter substitutions that we had not anticipated at all, the most common of which is replacing the letter g with a q. Unsurprisingly, we find that our system performs quite well when dealing with the cases it was designed to handle, achieving a top-1 accuracy of 45.0% and a top-5 accuracy of 85.0%, but it performs poorly when confronted with unexpected substitutions, achieving only a top-1 accuracy of 20.5% and a top-5 accuracy of 43.6%.

C. Detailed Results

It is worth studying the results for each of the five types of normalization challenges in greater detail, to understand exactly what the strengths and weaknesses of our system are. As explained before, phonetic substitutions are the case where a letter or group of letters is replaced by another character that sounds the same. Our system seems to handle this type of change quite well, based on the results in Table I.
In fact, a more detailed study of the results, presented in Table III, reveals that there are only two phonetic substitutions that it seems to struggle with. Firstly, it struggles to recognize numbers substituting for sounds, and thoroughly fails to recognize the number 8 standing in for the sound "ate", as in "h8" for "hate". However, these represent only a minority of the substitutions; the vast majority are letters standing in for sounds created by other letters, a challenge at which our system excels. The one exception worth mentioning, the second case our system struggles with, is a failure to match the letter d to a "th" sound, as in "dat" for "that". That problem may come from subtle differences between these sounds in the Wiktionary data, which our similar-sound list was designed to compensate for (but apparently did not completely succeed at). Nonetheless, Table III indicates that, for the changes that account for three-quarters of phonetic substitutions, our system's accuracy is at 99%.

The abbreviation type provides a good first case study. Despite its simple definition, it is actually a very varied type, based on which letters and how many letters are deleted. It is most common for users to delete some or all vowels from the word, shortening for example "number" to "nmbr". Alternatively, some users delete consonants. We can further recognize two subcategories of this deletion, depending on whether the consonant is silent ("dum" for "dumb") or part of a multi-letter sound ("compas" for "compass"), or whether it is a voiced consonant ("suprise" for "surprise"). The third and fourth categories delete both vowels and silent or voiced consonants, shortening for instance "staff" to "stf" in the former case or "just" to "js" in the latter. Finally, it is possible to abbreviate a word by cropping entire syllables from it, for example by writing "batt" instead of "battery". Statistics on these six types of abbreviations are given in Table II.
The results show that our system works very well in three cases, when the deleted letters are vowels, silent/multi-sound consonants, or both, but struggles in the cases when the deleted letters are voiced consonants with or without vowels, and fails when entire syllables are deleted. This is a consequence of the underlying assumption of our system: it is designed to recognize OOV words that sound similar to the IV words they stand for, and in the first three cases that assumption is respected and system’s accuracy ranges from 67% to 95%. In the last three cases the OOV word is phonetically different from the IV word and our assumption does not hold, and our system falls to the 50% range when only letters are missing, and to 0% when entire syllables are missing and the OOV word’s pronunciation diverges completely from the IV word’s pronunciation. TABLE III. NORMALIZATION OF PHONETIC SUBSTITUTIONS. Word count Top-1 accuracy Top-5 accuracy 8 for “ate” 8 0% 0% Other numbers for sounds 14 28.6% 50% d for “th” 11 9% 9% Other letters for sounds 95 66.3% 98.9% Phonetic substitution type The behaviour of our system when normalizing repetitions is similar to that when normalizing abbreviations; namely, the more phonetically different the OOV word is from its IV When it comes to graphemic substitution, we can distinguish between two types, namely the five number-to- 397 2015 Eight International Workshop on Selected Topics in Mobile and Wireless Computing equivalent, the less accurate our system is. In this type of normalization, though, we find that accuracy correlates with the number of repeated letters. Indeed, multiple repetitions of the letters in an OOV word add multiple copies of the corresponding sound, which only appears once in the IV word. Combined with the similar-sound list, this has the effect of leading the search algorithm too far down a wrong path in the tree for it to recover. 
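The effect of letter repetition can be made concrete with a small Python sketch. This is purely illustrative and not part of the system described in this paper: the function names and the run-collapsing heuristic are our own here, and they show both how repetition counts can be measured and why naive squeezing of repeated letters is no substitute for phonetic matching.

```python
# Illustrative sketch only: measuring and collapsing letter repetitions
# in OOV words. Not a component of the system described in the paper.
from itertools import groupby

def squeeze(word: str) -> str:
    """Collapse every run of a repeated character to a single character."""
    return "".join(ch for ch, _ in groupby(word))

def max_run(word: str) -> int:
    """Length of the longest run of any single repeated character."""
    return max(len(list(group)) for _, group in groupby(word))

print(squeeze("yesssss"), max_run("yesssss"))  # yes 5
print(squeeze("goood"))                        # god  (not "good": squeezing is lossy)
```

Because such squeezing cannot distinguish an expressive repetition ("goood") from a legitimate double letter ("good"), a phonetic search of the kind described above remains necessary; the sketch merely quantifies the repetition counts discussed next.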
Figure 4 illustrates the relationship between the number of repetitions of a letter in an OOV word and the top-1 and top-5 accuracy, and shows how sharp the drop is, going from a top-5 accuracy of 85% when a letter is repeated only twice to 26% when it is repeated 5 times. Fortunately, it is much more common in practice to have a letter repeated only two or three times than to have more repetitions; in fact, those two cases account for 70% of words with this type of normalization. The OOV word count in the testing data at each number of repetitions is also included in Figure 4 to verify this fact.

Fig. 4. Relationship between number of repeated letters and accuracy.

The final type of normalization in our study is the stylistic variation, where a writer changes the spelling of a word to make it look more similar to the way the author would pronounce it. Examples include writing "evar" instead of "ever", "becuz" instead of "because", or "yoself" instead of "yourself". Naturally, the more variations are introduced in the spelling, the more different the OOV word will sound from its IV equivalent, and the more difficult it will be for our algorithm to handle. Fortunately, since the ultimate goal of the writer is to be understood by the reader, they will most of the time use few changes, and our algorithm is capable of recognizing the word. To verify this, we computed the Levenshtein distance between the spellings of IV and OOV words, and plotted in Figure 5 the top-1 and top-5 accuracy and word count at each distance. The results demonstrate both that low-distance OOV words are more common, with OOV words at a Levenshtein distance of 3 or less from their IV words comprising 94% of the test data, and that our system performs best in those cases before suffering a drop in performance on the rarer higher-distance cases. For reference, a stylistic variation at a Levenshtein distance of 3 represents cases such as writing "rele" for "really", "gansta" for "gangster", or "wateva" for "whatever"; all of these examples were correctly recognized by our system with top-1 probability.

Fig. 5. Relationship between Levenshtein distance and accuracy.

V. CONCLUSION

In this paper, we presented a novel normalization algorithm for the OOV words commonly found in microtext messages. The underlying assumption of our work is that, while the spelling of OOV words can be highly variable, they will remain phonetically similar to their IV word counterparts. Consequently, we designed a first prototype of our system to compute the probable phonetic reading of words based on their spelling, using training examples from the Wiktionary, and then to determine the likely English equivalent of OOV words using a radix tree structure of the language. Our experimental results are mixed: while the system performs adequately, with an overall top-5 accuracy of nearly 60% and an accuracy for certain types of normalization reaching the 80% range and sometimes even the 90% range, there is clear room for improvement, which we highlighted in our analysis of the results. Future work will focus on ways to fine-tune the probabilities in order to bridge the 30% gap between the top-1 and top-5 accuracy results, on automating the creation of the similar-sound list, and on improving the overall accuracy of the system by resolving some of the problem cases noted in the analysis of the results.

REFERENCES

[1] K. Dela Rosa and J. Ellen, "Text classification methodologies applied to micro-text in military chat", Proceedings of the Eighth International Conference on Machine Learning and Applications, Miami, USA, 2009, pp. 710-714.
[2] S. Petrovic, M. Osborne, and V. Lavrenko, "The Edinburgh Twitter corpus", Proceedings of the NAACL Workshop on Computational Linguistics in a World of Social Media, Los Angeles, USA, 2010, pp. 25-26.
[3] F. Liu, F. Weng, B. Wang, and Y. Liu, "Insertion, deletion, or substitution?: Normalizing text messages without pre-categorization nor supervision", Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 2, Stroudsburg, USA, 2011, pp. 71-76.
[4] Z. Xue, D. Yin, and B. D. Davison, "Normalizing microtext", Proceedings of the AAAI-11 Workshop on Analyzing Microtext, San Francisco, USA, 2011, pp. 74-79.
[5] B. Han, P. Cook, and T. Baldwin, "Lexical normalization for social media text", ACM Transactions on Intelligent Systems and Technology (TIST), 4:1, 2013, article 5.
[6] D. L. Pennell and Y. Liu, "Normalization of text messages for text-to-speech", Proceedings of the 35th International Conference on Acoustics, Speech and Signal Processing, Dallas, USA, 2010, pp. 4842-4845.
[7] D. L. Pennell and Y. Liu, "Normalization of informal text", Computer Speech & Language, 28:1, January 2014, pp. 256-277.
[8] R. Khoury, "The impact of Wikipedia on scientific research", Proceedings of the Third International Conference on Internet Technologies and Applications, Wrexham, UK, pp. 2-11.
[9] F. Liu, F. Weng, and X. Jiang, "A broad-coverage normalization system for social media language", Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Korea, 2012, pp. 1035-1044.
[10] G. Navarro, "A guided tour to approximate string matching", ACM Computing Surveys, 33:1, 2001, pp. 31-88.