2015 Eighth International Workshop on Selected Topics in Mobile and Wireless Computing

Microtext Normalization using Probably-Phonetically-Similar Word Discovery
Richard Khoury
Department of Software Engineering
Lakehead University
Thunder Bay, Canada
[email protected]
Abstract—Microtext normalization is the challenge of discovering the English words corresponding to the unusually-spelled words used in social-media messages and posts. In this paper, we propose a novel method for doing this by rendering both English and microtext words phonetically based on their spelling, and matching similar ones together. We present our algorithm to learn spelling-to-phonetic probabilities and to efficiently search the English language and match words together. Our results demonstrate that our system correctly handles many types of normalization problems.

Keywords—microtext; social media; normalization; phonetic; wiktionary

978-1-4673-7701-0/15/$31.00 ©2015 IEEE

I. INTRODUCTION

The term “microtext” was proposed by US Navy researchers [1] to describe a type of text document that has three characteristics: (A) it is very short, typically one sentence or less, and possibly as little as a single word; (B) it is written in an informal manner and unedited for quality, and thus may use loose grammar, a conversational tone, vocabulary errors, and uncommon abbreviations and acronyms; (C) it is semi-structured in the Natural Language Processing (NLP) sense, in that it includes some metadata such as a time stamp, an author, or the name of a field it was entered into. Microtexts have become omnipresent in today’s world: they are notably found in online posts on Facebook and Twitter, in user comments on videos, pictures and news items, and in SMS messages.

One major challenge when dealing with microtexts stems from their highly relaxed spelling rules and their tolerance of extreme irregularities in spelling. This causes problems when one tries to apply traditional NLP tools and techniques, which have been developed for conventional and properly-written English text. It could be thought that a simple find-and-replace preprocessing of the microtext would solve that problem. However, the sheer diversity of spelling variations makes this solution impractical; for example, a sampling of Twitter messages studied in [2][3] found over 4 million out-of-vocabulary (OOV) words. Moreover, new spelling variations are created constantly, both voluntarily and accidentally.

The challenge of developing algorithms to automatically correct the OOV words found in microtexts and replace them with the correct in-vocabulary (IV) words is known as normalization. In this paper, we propose a new approach to dealing with this challenge. Our underlying assumption is that microtext users recognize words not thanks to correct spelling but by sounding out the characters. Consequently, no matter how innovative microtext spellings get, the resulting OOV words must still be phonetically similar enough to the intended IV words in order for the readers to understand them. For example, a sentence like “r u askin any1 b4 teh gamez 2nite” would sound like “are you asking anyone before the games tonight?” if one were to read it out loud, and would thus be perfectly understandable despite being composed exclusively of OOV words. From this starting point, we propose to tackle the challenge of microtext normalization by building an algorithm that can determine the most probable phonetic equivalent of OOV words and match them to the most probable similar-sounding English words.

The rest of this paper is structured as follows. In Section II we define more formally the problem of normalization and present a sample of other techniques used to solve it. In Section III, we present our phonetic normalization algorithm. This presentation is divided into subsections corresponding to the modules of the system: the letter-to-phoneme algorithm we used to convert IV and OOV words to phonetic strings, the radix tree used to model the language, and the search algorithm we experimented with to match the phonetic strings of OOV words to the correct IV words. Section IV presents and discusses the results we obtained with the different versions of our algorithm, and Section V draws some conclusions on our work.

II. BACKGROUND

Before presenting techniques for normalization, let us define the problem more formally. It was noted in [3] that while OOV spelling variations are seemingly endless, their creation seems to follow a small set of simple rules. The rules proposed in [3] are “abbreviation” (deleting letters from the word, for example spelling the word “together” as “tgthr”), “phonetic substitution” (substituting letters for other symbols that sound the same, such as “2” for “to” in “2gether”), “graphemic substitution” (substituting a letter for a symbol that looks the same, such as switching the letter “o” for the number “0” in “t0gether”), “stylistic variation” (misspelling the word to make it look like one’s personal pronunciation, such as writing “togeda” or “togethor”), and “letter repetition” (repeating some letter for emphasis, for example by typing “togetherrr”). It is
important to note that, while these rules are useful tools to
understand the different types of OOV words, the words
themselves can overlap multiple types. For example “togeter”
could be labelled as either an abbreviation or a stylistic
variation, depending on whether the labeller assumes the
missing h was removed to shorten the word or to change its
pronunciation. Furthermore, the rules are not mutually
exclusive, but can be combined together. For example, the
word “2gthr” is a result of abbreviation and phonetic
substitution, while “togethaa” comes from stylistic variation
and letter repetition. Finally, some authors [4] also include
OOV acronyms (such as “imho” for “in my humble opinion”)
as part of the normalization challenge. We do not, however, for
the reason that acronyms can only be understood if all parties
in a conversation know what they stand for; consequently,
they are limited to a small number of standard commonly-understood acronyms (by comparison to the millions of OOV
words) and innovation is slowed by the need for the meaning
of a new acronym to spread through popular consciousness.
This makes them very much unlike the other types of spelling
variations where new forms are constantly being created and
no standardization exists.
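As a rough illustration of how one of these transformation types can be undone mechanically, the sketch below (ours, purely illustrative and not part of any cited system) handles the “letter repetition” type by shrinking each run of a repeated letter back to one or two copies; `VOCAB` is a tiny stand-in for a real English word list.

```python
from itertools import groupby, product

VOCAB = {"together", "tonight", "so"}  # stand-in for a full English word list

def repetition_candidates(word):
    """All spellings reachable by reducing each letter run to 1 or 2 copies."""
    options = []
    for ch, group in groupby(word):
        n = len(list(group))
        options.append([ch * k for k in sorted({1, min(n, 2)})])
    return {"".join(parts) for parts in product(*options)}

def normalize_repetition(word):
    """Return the candidates that are actual in-vocabulary words."""
    return sorted(repetition_candidates(word) & VOCAB)

print(normalize_repetition("togetherrr"))  # -> ['together']
print(normalize_repetition("sooooo"))      # -> ['so']
```

Note that this only covers one transformation type in isolation; as discussed above, real OOV words freely combine several types at once.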
Given the limitless spelling variations and multiple types of transformations that can take place, it is no surprise that a diverse set of approaches for microtext normalization has been studied in the literature. In [3], the authors proposed a letter-substitution approach. Starting from the idea that microtext OOV words are simply misspelled IV English words, they developed an algorithm to discover the most common letter substitutions between pairs of IV and OOV words and compute their probabilities. They then use their probabilistic model to generate the most probable IV words for new OOV words. The different variations of their system achieve accuracies between 57% and 76% in their experiments [3].

Going in a very different direction, [5] designed a microtext normalization dictionary, which stores pairs of OOV and IV words. A dictionary approach would naturally yield a simple and efficient algorithm with high precision; however, developing such a dictionary manually would be prohibitively expensive because of the millions of OOV words already in existence [2][3] and the constant creation of new ones. The innovation of [5] has been to propose an automated two-step method to build the dictionary. In the first step, their method finds (OOV, IV) pairs that occur in the same context, i.e. with the same surrounding words. For example, this step would find the pair (bday, birthday) from the identical contexts of “happy bday to you” and “happy birthday to you”, but it would also find the pair (NY, birthday) from “happy NY to you”. Then, in the second step, the pairs of words are ranked by string similarity to eliminate the abundant false positives the first step will generate. This would eliminate the pair (NY, birthday) while keeping the more similar (bday, birthday). As expected, their approach gives a very high rate of normalization precision but limited recall: the dictionary-learning algorithm must constantly be fed new microtexts in order to discover new pairs of words, and a new OOV word that is not yet in the dictionary will always fail to be recognized.

Another approach to the normalization challenge was proposed by [6][7]. In [6], the authors used a rule-based algorithm to map between IV and OOV words. These rules remove double letters, unnecessary vowels, and prefixes and suffixes from English words in order to recognize similar OOV words. Their method was thus designed to handle only two types of OOV transformation, namely abbreviations and those stylistic variations that are done solely by deleting letters. In [7], they improved on their original idea by training a probabilistic machine translation model to recognize these deletions instead of using rules. This alternative approach is founded on the idea that microtext can be handled as a separate language that needs to be translated into English. They found that the probabilistic model outperformed the rule-based approach, with an accuracy ranging from 60% to 80% in their various experiments.

To be sure, these are only a small sample of the variety of approaches proposed to tackle the problem of microtext normalization, but they do serve to illustrate the massive range of solutions proposed, from dictionary lookups to rule applications to translation models. The method we propose in this paper explores yet another direction: that of phonetically rendering the OOV words to find phonetically-similar IV words. This paper reports our methodology and the promising first set of results obtained by our prototype.

III. METHODOLOGY

Our proposed algorithm is trained to determine the probable pronunciation of English words based on their spelling. Then, when presented with a new OOV word, it determines the most probable IV words with a similar pronunciation. There are thus two challenges to consider: how to map spelling to probable pronunciation at the training stage, and how to efficiently search the English language for words with similar pronunciations to an OOV word at runtime. Of these, the first is the more difficult challenge. Indeed, rendering words phonetically based solely on their spelling is a major challenge for a language such as English. One only needs to think of the many English words that are spelled almost identically but pronounced completely differently (such as “heard” and “beard”, or “cough” and “dough”) or, conversely, of the many English homophones with completely different spellings (such as “links” and “lynx”, or “some” and “sum”) to realize this.
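As a toy illustration, the two-step dictionary-building method of [5] described in Section II could be sketched as follows. All data and function names here are ours and purely illustrative; the real system mines large corpora and uses more refined similarity measures.

```python
from difflib import SequenceMatcher

# Toy corpora standing in for large collections of microtext and clean text.
MICROTEXT = ["happy bday to you", "happy NY to you"]
CLEAN = ["happy birthday to you"]
IV_WORDS = {"happy", "birthday", "to", "you"}

def context_pairs(noisy_sents, clean_sents):
    """Step 1: collect (OOV, IV) pairs of words seen with identical contexts,
    i.e. same-length sentence pairs differing in exactly one word slot."""
    pairs = set()
    for noisy in noisy_sents:
        n = noisy.split()
        for clean in clean_sents:
            c = clean.split()
            if len(n) != len(c):
                continue
            diff = [(a, b) for a, b in zip(n, c) if a != b]
            if len(diff) == 1 and diff[0][0] not in IV_WORDS:
                pairs.add(diff[0])
    return pairs

def rank_by_similarity(pairs):
    """Step 2: rank candidate pairs by string similarity, so that false
    positives such as (NY, birthday) fall to the bottom of the list."""
    return sorted(pairs,
                  key=lambda p: SequenceMatcher(None, p[0], p[1]).ratio(),
                  reverse=True)

ranked = rank_by_similarity(context_pairs(MICROTEXT, CLEAN))
print(ranked[0])  # -> ('bday', 'birthday')
```

Step 1 finds both (bday, birthday) and the spurious (NY, birthday); step 2 ranks the genuinely similar pair first, mirroring the precision filter described above.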
A. Phonetic Examples

The technique we used to train our system on how to map letters to sounds is to get a list of English words with their correct pronunciations to use as examples of possible letter mappings. For this list, we turned to Wiktionary (www.wiktionary.org), one of the projects of the Wikimedia Foundation, the same group that manages Wikipedia and several other wiki projects. Much like Wikipedia, Wiktionary is a free online dictionary built in multiple languages through user contributions. However, while Wikipedia has been used in a wide range of research projects ranging from engineering to social studies [8], adoption of Wiktionary has been considerably slower. To illustrate this difference, searching the ACM Digital Library in May 2014 for “Wikipedia” finds over 20,000 published papers while a search for “Wiktionary” finds only 295 papers, and the same search in the IEEE Xplore Digital Library finds 839 papers for “Wikipedia” and a mere 4 for “Wiktionary”.
We obtained a copy of the English Wiktionary from the Wikimedia download site (http://dumps.wikimedia.org) as an XML file. This makes it easy for software to pick out individual articles and process them line by line. The word that an article defines is always found in the title line, between “<title>” and “</title>” markup tags. The entire text of the article will likewise be found between “<text xml:space="preserve">” and “</text>” tags, and will contain a section heading for each language the word is defined in. This means that the article for an English word will always contain the section heading “==English==” somewhere in its text field. Under that heading, many English words will include pronunciation notes in one or several English dialects, namely UK English, US English, Canadian English, Australian English or New Zealand English. Each pronunciation is written with standard IPA phonetic symbols and enclosed in Wiki-language curly brackets with an IPA tag, again making them easy for an algorithm to pick out automatically. For example, the word “about” has the standard pronunciation /əˈbaʊt/, two Canadian pronunciations /əˈbɐʊt/ and /əˈbʌʊt/, and an Irish pronunciation /əˈbɛʊt/.

Our processing extracted 37,500 pronunciations of 30,368 different English words. The Wiktionary uses very fine-grained pronunciations, as exemplified by the four different ways to represent the “ou” in “about” presented above. We find that there are in fact 151 different IPA symbols used individually or in combinations to represent 230 different letters and groups of letters in our training data.

Fig. 1. Sample of the phonetic radix tree. [Figure not reproduced: the tree branches phoneme by phoneme from the root, with paths ending at words such as “use”, “user”, “your”, “yours”, “york”, “yous”, “yule”, “y’all”, “u.s.”, and the homophones “ewe”, “yew”, and “yu”.]

B. Phonetic Tree and Training

The foundation of our normalization algorithm is a radix-tree-structured phonetic dictionary of the English language. Starting from the root, each word’s phonetic transcription is inserted symbol by symbol, with each symbol being a separate child in the tree. Consequently, every word can be read phonetically by following a path through the tree. A node which holds the last phonetic symbol of a word will store a list of all words that have the pronunciation represented by the path (there might be multiple words, in the case of homophones). Moreover, the path may continue beyond that node, as some words can also be prefixes of longer words. An example is given in Figure 1.

The tree is built by inputting each of the 37,500 phonetic transcriptions one by one, starting at the root and branching off as needed. While this is done, the algorithm also learns two sets of phonetic probabilities:

 The probability of a phonetic symbol or set of phonetic symbols given a letter or set of letters. This is computed by counting the number of times a (set of) letter(s) is mapped to a (set of) phonetic symbol(s) in the training data, and taking the ratio of that value to the total number of occurrences of that (set of) letter(s) in the data. Note that this is a global statistic, in the sense that the probability is computed for each letter over all its occurrences in the training data, regardless of the surrounding letters in a specific word.

 The probability of a phonetic symbol as a child of a given node. This is computed by counting the number of times a phonetic symbol occurs as a child of another symbol along a specific path. For example, in Figure 1, the symbol “z” will have a probability of 60% as a child of “ju”, while “l” and “ɛ” will have probabilities of 20% each. By contrast to the first statistic, this one is local, in that it is computed for each node of the tree independently of all others.

In order to allow our system to deal with the different types of OOV words and normalization challenges, we also manually defined certain additional data structures:

 The similar-sound list is a list mapping each phonetic symbol to other symbols that sound similar to it. This is needed for a normalization system such as ours, whose purpose is to recognize similar-sounding words, since the Wiktionary data has very fine-grained distinctions between sounds, as the “about” example and Figure 1 illustrate. Our method should not fail to recognize, for example, the word “yours” because it was rendered phonetically as “joʊɹz” instead of “jɔɹz”, and thus imparting into the system that “oʊ” sounds similar to “ɔ” will solve that issue. Creating this list was the most labour-intensive manual step in building our system. In future work we will devise a way to build this list automatically by aligning different IPA versions of the same word and pairing together the symbols in the same positions in the word, but for a first iteration of the prototype we feel that a manually-defined list is acceptable for a proof-of-concept.

 In order to deal with stylistic variations, which often involve changing the sound of vowels in words, we defined the vowel sound list. This is literally a list of all phonetic symbols associated with the vowel letters A, E, I, O, U and Y in the training data, and can be obtained directly from the table of letter-to-symbol probabilities built during the training of our system.
 In order to handle graphemic substitutions, we build into the system a simple graphemic substitution list: 4 for A, 1 for I or L, 0 for O, 3 for E, and 7 for T. Note that our system is also aware of the phonetic rendering of these numbers, which is part of its training data. Deciding whether a number in a word should be taken for a graphemic substitution or a phonetic substitution is part of the challenge of normalization.

 Finally, we included a word probability list in our system. The list chosen is the freely available list of 40,000 words from Project Gutenberg (http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists#Project_Gutenberg), which are provided with their frequency counts over all books of the project.

C. Search Algorithm

Once the system has been trained, it can be used for the purpose of microtext normalization. This requires determining which IV words in the tree sound similar to a given OOV word from a microtext message. However, since several words can be found to be similar to the OOV word, our system can return a list of possible words with their associated probabilities, in order to determine which word was most likely intended by the user.

The steps of the basic search algorithm are presented in Figure 2. The algorithm searches the tree in a uniform-cost search manner. It maintains a search list of edge nodes, each one tracking the list of letters in the OOV word that have not been rendered phonetically yet, the string of phonetic symbols rendered so far, the probability of the current phonetic rendering, and where in the tree that edge node is located. At each iteration, the search algorithm picks the edge node with the highest probability (step 1 in Figure 2), looks at the letters still not rendered phonetically, and finds all prefixes of letters that have phonetic symbol equivalents (step 3). For example, for the word “aura” it would find two prefixes, the letter “a” and the two-letter “au”, each of which maps to a set of phonetic symbols (including a silent-letter symbol). Those symbols and their associated global probabilities are retrieved (step 4). Moreover, since the search follows a path through the radix tree, there is at the current node only a limited set of valid children nodes, each one representing a phonetic symbol and each one with a local probability (step 5). This makes it possible to generate a new list of edge nodes to replace the one being currently considered (step 8). Each of these new edge nodes will lose the prefix letters from the word but add the phonetic symbols, will have a new probability that is the product of the previous probability with the local and global probabilities of the added phonetic symbol, and will be located at the corresponding child node in the tree.

There are a few special cases worth mentioning. First, if the phonetic symbol picked for a prefix letter is the silent symbol, then the new edge node is the same as the current edge node and the search does not move to a child in the tree in that case (step 8d). On the other hand, if the prefix letter is mapped to a phonetic symbol that is actually a combination of multiple symbols (for example the letter “x” mapping to the two symbols “ks”), then multiple children are generated in sequence and the new edge node is the last one in the sequence (step 8d). Additionally, we mentioned that our system includes a similar-sound list for phonetic symbols that sound similar to each other. This list is used in steps 4a and 5a: when the global or local probability of a phonetic symbol is computed in steps 4 and 5, these sub-steps can then find all phonetic symbols that sound similar to it in the list and sum their probabilities together to get a total probability for the entire sound instead of a fine-grained probability for individual phonetic symbols.

Graphemic substitutions are handled in step 6 of the algorithm. At steps 4 and 5, the prefix letter obtained at step 3 is assumed to represent a sound. At step 6, the prefix is instead compared to the graphemic list built in the training stage, and if found it is replaced by the appropriate letter and that letter is then used in a repeat of steps 4 and 5.

Finally, we need our system to deal with the case where a sound has been deleted altogether in the microtext word. This is done in step 7, using the list of vowel sounds developed during the training stage. Indeed, the deleted sound is usually a vowel sound rather than a consonant sound, a fact noted by [3][6]. Consequently, in step 7 our algorithm also expands the edge nodes corresponding to vowel-sound children of the current edge node. This expansion can only use the local probability of the symbol, since the global probability will be zero (otherwise it would be a phonetic symbol of the current letter and would already have been considered in step 4). So instead of a global probability value, we multiply the probability here by a dampening value.

As mentioned in step 8a, each new edge node keeps track of the OOV word as it is gradually stripped of letters already matched to phonetic symbols. Thus, when the highest-probability node in the search list has no letters left in it, it means that path has been searched to exhaustion (step 2). The probability of that terminal node is the probability of the pronunciation given the original OOV word, and the IV words stored in the node are those that share this pronunciation. However, not all words are equally likely to be used in an English message; some words are very common and others are rarer and should be considered less likely even if their pronunciation is a better match for the OOV word. For example, the OOV word “togeda” is phonetically nearer to “toga” than to “together”, but the latter is much more common in microtext messages than the former, and that difference should inform the normalization. As mentioned as part of the training stage, we incorporated a list of word probabilities into our algorithm. The words retrieved in the terminal node are added to the set of possible normalized words with their pronunciation probability multiplied by the word probability (step 2b). Since this is a uniform-cost search algorithm, it then continues searching other paths through the tree and finding additional possible normalized words. This goes on until the highest normalized word’s probability is higher than that of the highest edge node left to search from (step 2c), at which point that normalized word is the most probable option and is returned. Other termination conditions could also be implemented at this step, for example to build a list of a certain number of normalized words in order to return a set of normalization suggestions.
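The search just described can be sketched in miniature as follows. This is a toy illustration, not the authors’ implementation: the letter-to-phoneme table uses made-up global probabilities, the tree holds only a handful of words, and the local probabilities, similar-sound list, graphemic substitutions, and skipped-vowel expansion are all omitted for brevity.

```python
import heapq
from itertools import count

# Hypothetical letter-to-phoneme table with made-up global probabilities,
# standing in for the statistics learned from Wiktionary.
L2P = {
    "y": [("j", 1.0)],
    "u": [("u", 0.6), ("ʊ", 0.4)],
    "s": [("s", 0.5), ("z", 0.5)],
}

# Toy phonetic radix tree: each node maps a phoneme to a child node; the
# optional "words" entry lists IV words whose pronunciation ends at that node.
TREE = {"j": {"u": {"words": ["ewe", "yew", "yu"],
                    "z": {"words": ["use", "yous"]}}}}

def normalize(oov, tree=TREE):
    """Uniform-cost search: consume OOV letters one at a time, mapping each
    to candidate phonemes and following the matching children in the tree."""
    tie = count()  # tiebreaker so the heap never has to compare dict nodes
    frontier = [(-1.0, next(tie), oov, tree)]  # (-prob, id, letters left, node)
    results = []
    while frontier:
        neg_p, _, rest, node = heapq.heappop(frontier)
        if not rest:  # all letters rendered phonetically (step 2)
            for w in node.get("words", []):
                results.append((w, -neg_p))
            continue
        # Expand the next letter into its candidate phonemes (steps 3-5, 8).
        for phon, p in L2P.get(rest[0], []):
            child = node.get(phon)
            if isinstance(child, dict):
                heapq.heappush(frontier, (neg_p * p, next(tie), rest[1:], child))
    return sorted(results, key=lambda r: -r[1])

print(normalize("yus"))  # -> [('use', 0.3), ('yous', 0.3)]
```

In the full system, each result’s pronunciation probability would additionally be multiplied by the word probability from the frequency list, and the search would stop as soon as the best normalized word outranks every remaining edge node.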
Initial edge node: word = input OOV word; phonetic string = “”; node probability = 1.0; current tree node = root node
Search list: initial edge node
1. Get the edge node with the highest phonetic probability from the search list
2. If there are no letters left in the word:
   a. Get the list of words at the current tree node
   b. Multiply each word’s probability by the node probability and add it to a list of normalized words
   c. If the normalized list achieves a termination condition, return it
   d. Otherwise, go to step 1
3. Get the next prefix letter (or set of letters) of the word
4. List all phonetic symbols the letters from step 3 correspond to, with their global probabilities
   a. Add the probabilities of symbols in the similar-sound list
5. Check which symbols from step 4 are valid children of the current node, with their local probabilities
   a. Add the probabilities of symbols in the similar-sound list
6. Check for graphemic substitutions
7. Check for skipped vowels
8. Generate new edge nodes:
   a. Remove the prefix letters from the word
   b. Add the phonetic symbol to the phonetic string
   c. Multiply the node probability by the global probability and the local probability
   d. Update the current node to the child node with the correct phonetic symbol
   e. Add the new node to the search list
9. Go to step 1

Fig. 2. Steps of the search algorithm.

IV. EXPERIMENTAL RESULTS

In order to test the effectiveness of our normalization system, we used the Text Normalization Data Set from the University of Texas at Dallas [3][9]. This corpus lists 2608 OOV words that were observed in real tweets, along with their corresponding IV English word forms. We sorted these words into the five basic types of [3], namely abbreviation, phonetic substitution, graphemic substitution, stylistic variation, and letter repetition, and added a sixth type for words that belong to multiple types at once. This ensures that each OOV word only appears in one of the test types. A breakdown of all types, with the number of individual words and test results, is presented in Table I.

A. Levenshtein Distance

The basic underlying assumption of our work, as stated in the introduction, is that pairs of IV and OOV words appear more similar to each other when read out phonetically than they do in spelling. Indeed, if that were not the case, it would be more efficient to correct the words with ordinary spelling-correction software! To verify our assumption, we measured the relationship between spelling differences and phonetic differences. We do this by using the Levenshtein distance [10], a straightforward and standard string comparison algorithm. We compute the Levenshtein distance between the spellings of an IV and OOV pair, and compare it to the Levenshtein distance between the most probable pronunciations of the IV and OOV words, as computed by the letter-to-symbol probabilities learned by our system. For example, the OOV word “tatt” and its IV version “tattoo” have a Levenshtein distance of 2, but the phonetic versions discovered by the Wiktionary probabilities are “tˠt͈” and “tˠt͈u” respectively, with a Levenshtein distance of 1. Doing this, we computed pairs of matching spelling-phonetic distances for each of the 2608 words in our testing corpus. Then, we computed the average phonetic distance for each value of the spelling distance. The results are plotted in Figure 3 for word pairs with a spelling Levenshtein distance between 1 and 10; while a few pairs of words do have a distance above 10, there are so few of them that we cannot consider their results representative of our system. As can be seen in Figure 3, the Levenshtein distance of the most probable phonetic readings (marked “probabilities alone” in the figure) initially increases linearly and roughly equally with the spelling distance. This may seem to indicate that there is little gain; however, recall that spelling only uses 32 characters (26 letters, 5 numbers, and the apostrophe) while the phonetic strings use 151 symbols. The fact that the distance between the strings remains comparable despite a five-fold increase in the number of symbols used to create a much more fine-grained representation of the words indicates that moving to the phonetic realm is indeed making it possible to find some similarities between the words. Moreover, while the Levenshtein distance of the phonetic strings initially increases in lockstep with the spelling distance, at higher spelling distances it slows down and stabilizes. To consider this further, we included our similar-sound list in the computation of the phonetic Levenshtein distance and plotted these results in Figure 3 as well. As can be seen, in that case the relationship starts off lower and stabilizes a lot sooner, indicating that it makes the phonetic similarities easier to find, as we expected.
Fig. 3. Relationship between spelling distance and phonetic distance.
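The Levenshtein distance used in this comparison is the standard algorithm; a minimal dynamic-programming implementation (ours, for illustration, not the paper’s code) looks like this:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

# The spelling distance from the paper's example:
print(levenshtein("tatt", "tattoo"))  # -> 2
```

The same function applies unchanged to phonetic strings, since it operates on arbitrary sequences of symbols.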
B. General Results

The experiment we ran consists of taking each OOV Twitter word and putting it through our normalization algorithm in order to see whether the correct IV English word is returned. We computed the system’s accuracy when the correct IV word is the single most probable word returned, and when it is among the top-5 most probable words (for example, for a list of suggestions in correction software). The average results over the entire test corpus and the results broken down by type of normalization are given in Table I. One thing to note is that the top-5 results are consistently 20% to 30% higher than the top-1 results. This indicates that in a large portion of cases, our system is converging to the correct word, but the result is overshadowed by another higher-probability word. Further study of the probabilities learned in the training stage, in order to weight them better, could lead to greatly improved top-1 results, and bring our system’s performance up to the level of literature benchmarks.
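The top-1 and top-5 figures reported here follow the usual top-k accuracy definition, which can be sketched as follows; the normalizer output below is hypothetical.

```python
def top_k_accuracy(predictions, gold, k):
    """Fraction of OOV words whose correct IV form appears among the k most
    probable candidates returned by the normalizer."""
    hits = sum(gold[w] in [cand for cand, _ in ranked[:k]]
               for w, ranked in predictions.items())
    return hits / len(predictions)

# Hypothetical output: OOV word -> ranked (candidate, probability) list.
preds = {"2nite": [("tonight", 0.8), ("tonite", 0.1)],
         "togeda": [("toga", 0.5), ("together", 0.4)]}
gold = {"2nite": "tonight", "togeda": "together"}

print(top_k_accuracy(preds, gold, 1))  # -> 0.5
print(top_k_accuracy(preds, gold, 5))  # -> 1.0
```

The “togeda” entry illustrates the gap discussed above: the correct word is found but outranked, so it counts for top-5 accuracy but not for top-1.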
TABLE I. CORPUS COMPOSITION AND TEST RESULTS BY TYPE.

Normalization type      | Word count | Top-1 accuracy | Top-5 accuracy
Overall Average         | 2608       | 30.2%          | 59.7%
Abbreviation            | 806        | 29.0%          | 52.6%
Phonetic substitution   | 130        | 53.8%          | 78.5%
Graphemic substitution  | 58         | 29.3%          | 58.6%
Stylistic variation     | 820        | 31.6%          | 65.7%
Letter repetition       | 641        | 28.1%          | 62.2%
Multiple types          | 153        | 17.9%          | 38.4%

TABLE II. NORMALIZATION OF ABBREVIATION TYPES.

Abbreviation type           | Word count | Top-1 accuracy | Top-5 accuracy
Vowel                       | 315        | 33.0%          | 73.7%
Silent Consonant            | 86         | 62.8%          | 95.3%
Vowel and Silent Consonant  | 21         | 28.6%          | 66.7%
Voiced Consonant            | 128        | 47.7%          | 57.8%
Vowel and Voiced Consonant  | 40         | 22.5%          | 52.5%
Syllable                    | 216        | 0%             | 0.5%

letter substitutions programmed into our system and mentioned in Section III, and five additional letter-to-letter substitutions that we had not anticipated at all, the most common of which is replacing the letter g with a q. Unsurprisingly, we find that our system performs quite well when dealing with the cases it was designed to handle, achieving a top-1 accuracy of 45.0% and a top-5 accuracy of 85.0%, but it performs poorly when confronted with unexpected substitutions, only reaching a top-1 accuracy of 20.5% and a top-5 accuracy of 43.6%.
C. Detailed Results

It is worth studying in greater detail the results for each of the five types of normalization challenges, to understand exactly what the strengths and weaknesses of our system are.
As explained before, phonetic substitutions are the cases where a letter or group of letters is replaced by another character that sounds the same. Our system seems to handle this type of change quite well, based on the results in Table I. In fact, a more detailed study of the results, presented in Table III, reveals that there are only two phonetic substitutions it seems to struggle with. First, it struggles to recognize numbers substituting in for sounds, and thoroughly fails to recognize the number 8 for the sound “ate”, as in “h8” for “hate”. However, these represent only a minority of the substitutions; the vast majority are letters standing in for sounds created from other letters, a challenge at which our system excels. The one exception worth mentioning, the second case our system struggles with, is a failure to match the letter d to a th sound, as in “dat” for “that”. That problem may come from subtle differences between these sounds that exist in the Wiktionary, which is what our similar-sound list was designed to compensate for (but apparently did not completely succeed at). Nonetheless, Table III indicates that, for the changes that account for three-quarters of phonetic substitutions, our system’s accuracy is at 99%.
The abbreviation type provides a good first case study.
Despite its simple definition, it is actually a very varied type,
based on which letters and how many letters are deleted. It is
most common for users to delete some or all vowels from the
word, shortening for example “number” to “nmbr”.
Alternatively, some users can delete consonants. We can
further recognize two subcategories of this deletion, if the
consonant is silent (“dum” for “dumb”) or part of a multi-letter
sound (“compas” for “compass”), or if it is a voiced consonant
(“suprise” for “surprise”). The third and fourth categories are to
delete both vowels and silent or voiced consonants, such as to
shorten “staff” to “stf” in the former case or “just” to “js” in the
latter. Finally, it is possible to abbreviate a word by cropping
entire syllables of it, for example by writing “batt” instead of
“battery”. Statistics on these six types of abbreviations are
given in Table II. The results show that our system works very
well in three cases, when the deleted letters are vowels,
silent/multi-sound consonants, or both, but struggles in the
cases when the deleted letters are voiced consonants with or
without vowels, and fails when entire syllables are deleted.
This is a consequence of the underlying assumption of our
system: it is designed to recognize OOV words that sound
similar to the IV words they stand for, and in the first three
cases that assumption is respected and system’s accuracy
ranges from 67% to 95%. In the last three cases the OOV word
is phonetically different from the IV word and our assumption
does not hold, and our system falls to the 50% range when only
letters are missing, and to 0% when entire syllables are missing
and the OOV word’s pronunciation diverges completely from
the IV word’s pronunciation.
TABLE III.
NORMALIZATION OF PHONETIC SUBSTITUTIONS.
Word
count
Top-1
accuracy
Top-5
accuracy
8 for “ate”
8
0%
0%
Other numbers for sounds
14
28.6%
50%
d for “th”
11
9%
9%
Other letters for sounds
95
66.3%
98.9%
Phonetic substitution type
The behaviour of our system when normalizing repetitions
is similar to that when normalizing abbreviations; namely, the
more phonetically different the OOV word is from its IV
When it comes to graphemic substitution, we can
distinguish between two types, namely the five number-to-
397
2015 Eight International Workshop on Selected Topics in Mobile and Wireless Computing
equivalent, the less accurate our system is. In this type of
normalization, though, we find that accuracy correlates with
the number of repeated letters. Indeed, multiple repetitions of
the letters in an OOV word add multiple copies of the
corresponding sound, which only appears once in the IV word.
Combined with the similar-sound list, this has the effect of
leading the search algorithm too far down a wrong path in the
tree for it to recover. Figure 4 illustrates the relationship
between the number of repetitions of a letter in an OOV word
and the top-1 and top-5 accuracy, and shows how sharp the
drop is, going from a top-5 accuracy of 85% when a letter is
repeated only twice to 26% when it is repeated 5 times.
Fortunately, it is much more common in practice to have a
letter repeated only two of three times than to have more
repetitions; in fact, those two cases account for 70% of words
with this type of normalization. The OOV word count in the
testing data with each number of repetitions is also included in
Figure 4, to verify this fact.
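The effect described above suggests a possible mitigation, sketched here as a hypothetical preprocessing step; the function below is illustrative only and is not part of the system evaluated in this paper. It caps runs of a repeated letter before the phonetic search, keeping up to two copies since legitimate English words do contain double letters (as in "really"):

```python
# Illustrative sketch (hypothetical, not part of the evaluated system):
# cap runs of repeated letters before phonetic matching, since every
# extra copy of a letter adds another copy of the corresponding sound
# and can lead the radix-tree search down an unrecoverable path.
import re

def squeeze(word, max_run=2):
    # Replace any run longer than max_run identical characters
    # with exactly max_run copies of that character.
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, word)

print(squeeze("soooooo"))  # prints "soo"
print(squeeze("loool"))    # prints "lool"
print(squeeze("nice"))     # unchanged: "nice"
```

A cap of two rather than one preserves valid double-letter spellings while still bounding how many extra copies of a sound the downstream phonetic search has to absorb.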
Fig. 4. Relationship between number of repeated letters and accuracy.
The final type of normalization in our study is the stylistic
variation, where a writer changes the spelling of a word to
make it look more similar to the way the author would
pronounce it. Examples include writing "evar" instead of
"ever", "becuz" instead of "because", or "yoself" instead of
"yourself". Naturally, the more variations are introduced into
the spelling, the more different the OOV word will sound from
its IV equivalent, and the more difficult it will be for our
algorithm to handle. Fortunately, since the ultimate goal of
writers is to be understood by their readers, they will most of
the time use few changes, and our algorithm is capable of
recognizing the word. To verify this, we computed the
Levenshtein distance between the spellings of IV and OOV
words, and plotted in Figure 5 the top-1 and top-5 accuracy and
word count at each distance. The results demonstrate both that
low-distance OOV words are more common, with OOV words
at a Levenshtein distance of 3 or less from their IV words
comprising 94% of the test data, and that our system performs
best in those cases before suffering a drop in performance in
the rarer higher-distance cases. For reference, a stylistic
variation at a Levenshtein distance of 3 represents cases such as
writing "rele" for "really", "gansta" for "gangster", or "wateva"
for "whatever", and all these examples were correctly
recognized by our system as the top-1 result.
Fig. 5. Relationship between Levenshtein distance and accuracy.
V. CONCLUSION
In this paper, we presented a novel normalization algorithm
for the OOV words commonly found in microtext messages.
The underlying assumption of our work is that, while the
spelling of OOV words can be highly variable, they will remain
phonetically similar to their IV word counterparts.
Consequently, we designed a first prototype of our system to
compute the probable phonetic reading of words based on their
spelling, using training examples from the Wiktionary, and
then to determine the likely English equivalents of OOV words
using a radix tree structure of the language. Our experimental
results are mixed; while the system performs adequately, with
an overall top-5 accuracy of nearly 60% and an accuracy for
certain types of normalization reaching the 80% range and
sometimes even the 90% range, there is clear room for
improvement, which we highlighted in our analysis of the
results. Future work will focus on ways to fine-tune the
probabilities in order to bridge the 30% gap between the top-1
and top-5 accuracy results, on automating the creation of the
similar-sound list, and on improving the overall accuracy of the
system by resolving some of the problem cases noted in the
analysis of the results.
REFERENCES
[1] K. Dela Rosa and J. Ellen, "Text classification methodologies applied to micro-text in military chat", Proceedings of the Eighth International Conference on Machine Learning and Applications, Miami, USA, 2009, pp. 710-714.
[2] S. Petrovic, M. Osborne, and V. Lavrenko, "The Edinburgh Twitter corpus", Proceedings of the NAACL Workshop on Computational Linguistics in a World of Social Media, Los Angeles, USA, 2010, pp. 25-26.
[3] F. Liu, F. Weng, B. Wang, and Y. Liu, "Insertion, deletion, or substitution?: Normalizing text messages without pre-categorization nor supervision", Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 2, Stroudsburg, USA, 2011, pp. 71-76.
[4] Z. Xue, D. Yin, and B. D. Davison, "Normalizing microtext", Proceedings of the AAAI-11 Workshop on Analyzing Microtext, San Francisco, USA, 2011, pp. 74-79.
[5] B. Han, P. Cook, and T. Baldwin, "Lexical normalization for social media text", ACM Transactions on Intelligent Systems and Technology (TIST), 4:1, 2013, article 5.
[6] D. L. Pennell and Y. Liu, "Normalization of text messages for text-to-speech", Proceedings of the 35th International Conference on Acoustics, Speech and Signal Processing, Dallas, USA, 2010, pp. 4842-4845.
[7] D. L. Pennell and Y. Liu, "Normalization of informal text", Computer Speech & Language, 28:1, January 2014, pp. 256-277.
[8] R. Khoury, "The impact of Wikipedia on scientific research", Proceedings of the Third International Conference on Internet Technologies and Applications, Wrexham, UK, pp. 2-11.
[9] F. Liu, F. Weng, and X. Jiang, "A broad-coverage normalization system for social media language", Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Korea, 2012, pp. 1035-1044.
[10] G. Navarro, "A guided tour to approximate string matching", ACM Computing Surveys, 33:1, 2001, pp. 31-88.