Decipherment of Evasive or Encrypted
Offensive Text
by
Zhelun Wu
B.Sc. (Hons.), Dalhousie University, 2014
Thesis Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Science
in the
School of Computing Science
Faculty of Applied Sciences
© Zhelun Wu 2016
SIMON FRASER UNIVERSITY
Summer 2016
All rights reserved.
However, in accordance with the Copyright Act of Canada, this work may be
reproduced without authorization under the conditions for “Fair Dealing.”
Therefore, limited reproduction of this work for the purposes of private study,
research, education, satire, parody, criticism, review and news reporting is likely
to be in accordance with the law, particularly if cited appropriately.
Approval
Name: Zhelun Wu
Degree: Master of Science (Computing Science)
Title: Decipherment of Evasive or Encrypted Offensive Text
Examining Committee:
Chair: Dr. Robert D. Cameron, Professor
Dr. Anoop Sarkar, Senior Supervisor, Professor
Dr. Fred Popowich, Supervisor, Professor
Dr. David Alexander Campbell, External Examiner, Associate Professor, Department of Statistics and Actuarial Science, Director, Management and System Science
Date Defended: 20 July 2016
Abstract
Automated filters are commonly used in online chat to stop users from sending malicious
messages containing age-inappropriate language, bullying, or requests that other users expose
personal information. Rule-based filtering systems are the most common way to deal with this
problem, but people invent increasingly subtle ways to disguise their malicious messages to
bypass such filtering systems. Machine learning classifiers can also be used to identify and
filter malicious messages, but such classifiers rely on training data that rapidly becomes out
of date, so new forms of malicious text cannot be classified accurately. In this thesis, we
model the disguised messages as a cipher and apply automatic decipherment techniques to
decrypt corrupted malicious text back into plain text, which can then be filtered using rules
or a classifier. We provide experimental results on three different data sets and show that
decipherment is an effective tool for this task.
Keywords: Natural Language Processing; Decipherment; Spelling Correction; Malicious
Word Filtering
Dedication
To my beloved parents, who always encourage me, and to my lovely fiancée, who always
brings out the best in me.
Acknowledgements
I would like to show my appreciation to my supervisor, Dr. Anoop Sarkar, for his continuous
support of my Master's studies and research, and for his patience, motivation, enthusiasm,
and immense knowledge. His guidance helped me throughout the research and writing of
this thesis. I could not have imagined having a better advisor and mentor for my Master's
studies.
I would like to thank my committee members, Dr. Fred Popowich and Dr. David Alexander Campbell, whose suggestions and comments helped me polish this thesis.
I would also like to thank Ken Dwyer and Michael Harris at Two Hat Security Company,
who collected the data for us. Thanks to the CEO, Chris Priebe, who gave me the opportunity
to intern at Two Hat Security Company and inspired me with the idea of spelling correction
for chat.
Thanks to all of my natural language processing lab mates, who helped me during these
two years; I really enjoyed being with them.
Table of Contents
Approval ii
Abstract iii
Dedication iv
Acknowledgements v
Table of Contents vi
List of Tables viii
List of Figures ix

1 Introduction 1
1.1 Machine Learning 2
1.2 Motivation 3
1.3 Contribution 4
1.4 Overview 4

2 Noisy Channel Model 6
2.1 Statistical Machine Translation 6
2.1.1 Language Model 8
2.1.2 Translation Model 9
2.1.3 Summary 12
2.2 Spelling Correction 14
2.2.1 Spelling Correction Algorithm 14
2.2.2 Error Model 15
2.2.3 Candidate Generalization 15
2.2.4 Decoding 16
2.2.5 Summary 17
2.3 Summary 17

3 Decipherment 20
3.1 Related Works 22
3.2 Hidden Markov Model 23
3.2.1 Expectation Maximization Algorithm 23
3.2.2 Forward-Backward Algorithm 24
3.2.3 Initialization 25
3.2.4 Beam Search 26
3.2.5 Random Restarts 27
3.3 Summary 28

4 Experimental Results 29
4.1 Dataset 29
4.1.1 Wiktionary 29
4.1.2 Real World Chat Messages 30
4.2 Experimental Design 31
4.3 Experimental Results 32
4.3.1 Classifier Tuning 32
4.3.2 Decipherment of Caesar Cipher 33
4.3.3 Decipherment of Leet Substitution 34
4.3.4 Decipherment of Real Chat Offensive Words Substitution Dataset 38
4.3.5 Decipherment of Real Offensive Chat Messages 40

5 Conclusion 42

Bibliography 43
List of Tables
Table 4.1 Classification Accuracy of Spelling Correction and Decipherment Results in Caesar Cipher Encrypted Text 34
Table 4.2 Classification Accuracy of Decipherment Results in Leet Substitution Cipher 35
Table 4.3 Top n Candidates of Decipherment Results in Leet Substitution Cipher Classification on Test Set A 36
Table 4.4 Classification Accuracy of Spelling Correction and Decipherment Results in Leet Substitution Cipher with Beam Search Width of 5 36
Table 4.5 Smoothing in Decipherment of Whole Set on Test Set A 38
Table 4.6 Classification Accuracy of Spelling Correction and Decipherment Results in Real Chat Offensive Words Substitution Wiktionary Dataset 39
List of Figures
Figure 1.1 An Example of Filtering Offensive Text 2
Figure 2.1 An Overview of a Statistical Machine Translation System 7
Figure 2.2 Possible Alignments 10
Figure 2.3 EM Algorithm for IBM Model 2 (Collins, 2011) 13
Figure 2.4 Pseudo Code for Noisy Channel Model Based Spelling Correction (Jurafsky and Martin, 2014) 14
Figure 2.5 An Example of Beam Search 17
Figure 4.1 100 Random Restarts Loglikelihood in Caesar Cipher Decipherment 34
Figure 4.2 Variational Bayes for EM Algorithm from Gao and Johnson (2008) 37
Figure 4.3 100 Random Restarts Loglikelihood in Leet Substitution Cipher Decipherment 37
Figure 4.4 100 Random Restarts Loglikelihood in Decipherment of Real Chat Offensive Words Substitution on Test Set B 40
Chapter 1
Introduction
Children using chat rooms can be confronted with sexting, profanity, age-inappropriate
language, cyber-bullying, and requests for personal identifying information. Rule-based
filtering systems are commonly used to filter out these malicious messages. However, some
malicious messages are subtly transformed by users so that they can get past the filtering
system (see the example in Figure 1.1). There are several ways to detect and filter these
hidden malicious messages. Spelling correction can correct misspelled words, but it is
hard to correct words that users have intentionally misspelled through several edits. More
modern techniques use machine learning classifiers or rule-based filtering systems to identify
malicious messages in order to filter them out. However, such methods remain vulnerable
because the classifier or rule-based filtering system is static, while users can continuously
change their behavior and adapt their encryption methods to hide their original intent.
In this thesis, we regard this problem as a decipherment problem. The offensive text that
users have disguised using their own invented techniques can be treated as cipher text. The
original offensive text that users actually wanted to send is the plain text. The solution is to
map the users' encrypted text into what the users truly want to express. The plain text is far
more likely to be detected and filtered out by the classifier or the rule-based filtering system.
We want to decipher the encrypted text without knowing the exact method that was used to
encrypt it. No matter how users edit or change their offensive messages, the decipherment
algorithm should always recover the original messages. Since we aim to decipher any possible
way of encoding the message, our method is immune to many (but not all) types of attempts
to hide the message using new ways to encode offensive language. For instance, let us
assume that the original offensive message is you are a bunny and the corrupted message
that tries to get past the filtering system is ura B*n@n@ee. We search for the original
message among possible source messages mapped letter by letter from the encrypted text and
select the one that achieves the highest statistical model score. In this thesis, we will explain
the background and details of this unsupervised learning decipherment algorithm. We also
Figure 1.1: An Example of Filtering Offensive Text
Note: “Shit” can be blocked by the filtering system, but with one letter changed, “Shet”
cannot be filtered out. The encrypted word is still recognizable by humans.
conduct extensive experimental studies with our decipherment based approach to malicious
language detection, using synthetic as well as real-world data.
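The letter-mapping search described above can be illustrated on the simplest possible cipher. The sketch below is not from the thesis: the Caesar shift and the tiny dictionary scorer are stand-ins for the arbitrary substitution ciphers and statistical language models treated in later chapters, and serve only to show the search-and-score idea of trying candidate mappings and keeping the one that looks most like plain text.

```python
# A minimal decipherment sketch: enumerate candidate letter mappings (here,
# all 26 Caesar shifts) and score each candidate against a model of plain
# English (here, a toy dictionary count; the thesis uses statistical models).
VOCAB = {"you", "are", "a", "bunny", "the", "and", "is"}  # toy plain-text vocabulary

def shift(text, k):
    """Caesar-shift the lowercase letters of text by k positions."""
    return "".join(
        chr((ord(c) - ord("a") + k) % 26 + ord("a")) if c.islower() else c
        for c in text)

def decipher_caesar(cipher):
    """Search all 26 keys; keep the candidate with the most dictionary hits."""
    def score(candidate):
        return sum(word in VOCAB for word in candidate.split())
    return max((shift(cipher, k) for k in range(26)), key=score)

cipher = shift("you are a bunny", 3)   # -> "brx duh d exqqb"
print(decipher_caesar(cipher))         # recovers "you are a bunny"
```

The same search applies to any deterministic letter substitution; only the space of candidate mappings and the scoring model change.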
1.1 Machine Learning
Machine learning learns a model from observed data in order to make predictions on
unseen data. The idea is that the machine should use the observed data not only for acting
but also for improving its performance. Bishop (2006) notes that the objective of a learner
is to generalize from its experience: generalization is the ability of a learning machine to
accurately predict new, unseen data after learning attributes from the observed data set.
As the observed data used for training is finite and the unseen data is uncertain, the error
rate on the unseen data cannot be known exactly. Russell et al. (2003) specify three major
parts of a learning process: which attributes of the observed data are to be learned; what
feedback is available to learn these attributes; and what representation is used for the
attributes. The attributes are features of the observed data, such as word counts, term
frequency, or term frequency times inverse document frequency (tf-idf). The attributes can
be learned from appropriate feedback. Machine learning is divided into three major types
depending on the training regime and task: supervised, unsupervised and reinforcement
learning.
Supervised learning learns a model from samples of inputs and outputs. To train the
model, we need a set of observed data with labels: the attributes of the observed data are
the inputs and the labels are the outputs. Once the model is trained, it predicts output
labels for new data. In a fully observable environment, a learner can observe the effects of
its actions and hence can use supervised learning methods to learn to predict the output
labels. However, not all of the available data might have the necessary output labels for
training a supervised classifier. Faced with a lack of labeled data, unsupervised learning can
be used to learn patterns among the unlabeled data when no specific output values are
supplied. Thus, training in unsupervised learning uses a set of training data without any
corresponding target values. Reinforcement learning is a third type of machine learning,
in which a teacher guides the learning process by rewarding or penalizing the predicted
values. The learner receives feedback indicating whether what has happened is good or
bad, and the machine is rewarded for good actions, but only at the end of the sequence of
actions. The goal of reinforcement learning is to maximize the eventual reward.
In this thesis, we simulate the way users edit messages and then use unsupervised
learning approaches to deduce the plain text that we presume the users actually wanted
to express if there were no rule filters to block their messages. The unsupervised learning
algorithm involves Expectation Maximization (Dempster et al., 1977). We model our
problem as a Hidden Markov Model (HMM) (Rabiner, 1989) and use beam search
to find the most likely original plain text message. We also decipher the cipher text
messages using a supervised learning method known as the noisy channel model and compare
against the state-of-the-art Aspell spelling correction algorithm.
1.2 Motivation
In a rule-based filtering system, the rules need to be updated periodically. Users can
learn the rules used by the filter by using the system and checking which messages
are blocked and which are not. If users still want to send offensive messages, they will
edit the original offensive messages by adding special symbols, substituting one letter
for another, and so on. These user-altered texts do not match the system's rules for offensive
text and thus pass the filtering system. They can, of course, still be understood by other
users; otherwise it would make no sense to send such messages. Because there are so many
ways to disguise messages as typos, and because the patterns of editing change over time,
users can always come up with new ways to alter their messages based on their experience
with the filtering system. We would have to continuously update the rules to detect those
disguised messages, and it is laborious to formulate and deploy each new rule. But it is
easy to obtain the latest disguised messages. From these disguised messages, we can learn
the patterns that users have used to encrypt the text, which allows us to recover the
original messages with the help of the patterns and a language model. The patterns are
learned from the observed disguised messages; the language model is trained on a corpus
of un-disguised original messages. This decipherment regards the
observed messages as cipher text and the un-disguised original clean text as plain text. The
decipherment approach reveals the underlying plain text from the disguised observed cipher
text.
1.3 Contribution
In this thesis, we applied unsupervised learning decipherment approaches (Knight et al.,
2006) to recover original text from encrypted offensive text regardless of the nature of the
encryption. The decipherment process is adaptive and automatic. In our experiments it
always recovers all of the text encrypted with a Caesar cipher and most of the plain text
encrypted with Leet substitution (Wikipedia, 2016). Our experiments also show that the
decipherment approach performs similarly to, or even better than, the noisy channel model
spelling correction method at recovering plain text. Compared to supervised learning
approaches such as noisy channel model spelling correction, the unsupervised learning
decipherment approach has the advantage that its training data is easy to obtain and
that it covers more cases. A parallel corpus is difficult to obtain, particularly in specific
domains such as offensive chat messages, but we can obtain as many disguised offensive
chat messages as we want, and we can also access large amounts of plain text language data.
The decipherment approach covers more severe or difficult cases than the noisy channel
spelling correction method, which only considers plain text candidate words within a limited
edit distance of the disguised word. The decipherment approach instead considers all
possible plain text words for the disguised word in question when recovering the messages.
In our experiments, the noisy channel model spelling correction did not work on the Caesar
cipher encrypted data, but the decipherment approach correctly deciphered all of it.
To our knowledge, this is the first time that a decipherment-based filter has been applied
to a real-world data set of malicious chat messages; we are able to decipher the original
plain text and evaluate the performance of the rule-based filtering system on our
decipherment output.
1.4 Overview
The thesis is organized as follows:
In Chapter 2 we describe the background of spelling correction and the noisy channel
model. This chapter covers the definition and concepts of the noisy channel model; we
explain the details of this model and how it is used in spelling correction.
In Chapter 3 we describe the methodology of the decipherment problem, which is
composed of the HMM, the EM algorithm and the Forward-Backward algorithm. We also
explain further processing details, such as how to preprocess the offensive text before
performing the decipherment, including inserting NULL symbols and deleting repeated
characters.
In Chapter 4 we explain the experiments and show the experimental results for Caesar
cipher decipherment, Leet substitution decipherment and real chat message keyword
decipherment. We compare the results according to classification accuracy. We also
conducted experiments aimed at recovering plain text using spelling correction methods,
and we show the performance of the decipherment approach at recovering the original
plain text from the corrupted text.
In Chapter 5 we summarize the experimental results of the previous chapters and
conclude that the decipherment approach is a robust way to recover corrupted text no
matter how the text was encrypted.
Chapter 2
Noisy Channel Model
The noisy channel model is a well-known framework that is widely used in spelling
correction, speech recognition, question answering and machine translation. Statistical
machine translation and spelling correction are good examples with which to introduce the
noisy channel model. In this chapter, we will introduce the concepts of the language model,
the translation model (also called the error model in spelling correction), and the Bayes
rule used in the noisy channel model.
2.1 Statistical Machine Translation
The term statistical machine translation implies the use of statistics. The rationale for
statistical machine translation was first described by Weaver (1955). Based on insights from
Alan Turing's speculation about the use cases for computing machines, Warren Weaver laid
out the noisy channel model for translation. The principles behind today's state-of-the-art
statistical machine translation originate from the IBM models of the IBM Candide project
(Brown et al., 1988, 1990, 1993) in the late 1980s and early 1990s. These works were
seminal and introduced many concepts that still underpin today's machine translation
models. Define e as a source sentence (the translated text) and f as a target sentence (the
foreign sentence to be translated). The IBM models are examples of the noisy channel
model. They have two critical components:
1. A language model that assigns a probability p(e) to a source sentence e
2. A translation model that assigns a conditional probability p(f | e) to any pair of
source-target sentences (e, f)
Given these two essential components, the noisy channel approach computes the translation
output using this equation:

e∗ = argmax_{e∈E} p(e | f)    (2.1)
where E is the set of source language sentences. The notation argmax_{e∈E} f(e) denotes the
sentence e that maximizes the function f(e). Thus, given the target sentence f in the foreign
language, we seek the source sentence e that translates to f. The language model p(e) is a
good source of information; however, Equation 2.1 contains no language model p(e). Bayes'
theorem (Equation 2.2) gives:

p(e | f) = p(e)p(f | e) / Σ_e p(e)p(f | e)    (2.2)
The denominator on the right-hand side of this equation is the probability of the given target
sentence f, obtained by summing over all source sentences e:

p(f) = Σ_e p(f | e)p(e)    (2.3)

Equation 2.3 shows that the denominator of Equation 2.2 is p(f), the probability of the
foreign sentence. As the foreign sentence is fixed, it is not affected by the choice of source
language sentence. Hence, it is sufficient to consider only the numerator of Equation 2.2
when maximizing Equation 2.1. The noisy channel model is as follows:

e∗ = argmax_{e∈E} p(e | f)
   = argmax_{e∈E} p(e) × p(f | e) / Σ_e p(e) × p(f | e)
   = argmax_{e∈E} p(e) × p(f | e)    (2.4)
Each sentence e is scored by the product of the language model and translation model scores.
The language model p(e) gives a prior distribution over which sentences are likely in the
source language, and the translation model p(f | e) indicates how likely the target sentence
f is to be the translation of the source sentence e. Thus, Equation 2.4 says that, out of all
the sentences in the source language, we choose the one source sentence that maximizes
the right-hand side p(e) × p(f | e). The process of choosing that sentence e in the source
language for which Equation 2.4 is maximized is called decoding.
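Equation 2.4 can be sketched directly as code. The probability tables below are made-up toy numbers, not taken from any real system; the point is only the argmax over log p(e) + log p(f | e) for a fixed observed f.

```python
import math

# Hypothetical toy distributions: a "language model" over candidate source
# sentences and a channel/translation model p(f | e). Real systems estimate
# both from corpora; these numbers are invented for illustration.
lm = {"the cat": 0.6, "eht act": 0.1, "a cat": 0.3}        # p(e)
channel = {("le chat", "the cat"): 0.7,                    # p(f | e)
           ("le chat", "eht act"): 0.9,
           ("le chat", "a cat"): 0.2}

def decode(f, candidates):
    """Equation 2.4: pick e maximizing log p(e) + log p(f | e)."""
    return max(candidates,
               key=lambda e: math.log(lm[e]) + math.log(channel[(f, e)]))

print(decode("le chat", ["the cat", "eht act", "a cat"]))  # -> "the cat"
```

Note that "eht act" has the highest channel score alone, but the language model prior pulls the decision toward the fluent sentence, which is exactly the division of labor described above.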
Figure 2.1: An Overview of A Statistical Machine Translation System
In Figure 2.1, we show an overview of a statistical machine translation system, where
Mono source Corpora is a large set of monolingual texts in the source language, Target Text
is the foreign text and Source Text is the text translated from the foreign text. LM is the
language model and TM is the translation model.
2.1.1 Language Model
Given a sequence of words, e1 , e2 , ..., en , we can write the probability of this sentence as:
p(e1 e2 . . . en ) = p(e1 )p(e2 | e1 ) . . . p(en | e1 e2 . . . en−1 )    (2.5)
The language model measures how likely a sequence of words is in a corpus of the same
language. In the translation process, we must produce not only correct language but also
fluent sentences. The language model helps by assigning high probability to fluent sentences:
it prefers the correct order of words in the source sentence over an incorrect order. For
example,

pLM (“this is correct order”) > pLM (“order correct is this”)    (2.6)

We need the full history of word en to compute p(en | e1 e2 . . . en−1 ), but histories can be
very long in long sentences. To make the language model computable, we limit the history
to the previous m words:

p(en | e1 , e2 , . . . , en−1 ) ≈ p(en | en−m , . . . , en−2 , en−1 )    (2.7)
With this approximation, for the word in each position, we can estimate the language model
score based only on the previous m words of history. This is n-gram language modeling: an
n-gram model conditions each word on the n − 1 words before it. We examine the sequences
of words and only consider transitions within a limited number of grams, which is termed a
Markov chain (Baum and Petrie, 1966); the number of words in the conditioning history is
the order of the model. A unigram language model does not consider the history at all when
computing an individual word's probability: rather, it estimates the word probability from
its frequency in the training corpus. Intuitively, p(e) is then the distribution over the words
in the training corpus. Typically, more training data allows the consideration of longer
histories. The trigram language model is commonly used: it considers two words of history
to compute the distribution of the third word. The trigram language model can be written
as follows:

p(e1 e2 . . . en ) ≈ p(e1 )p(e2 | e1 )p(e3 | e1 e2 )p(e4 | e2 e3 ) . . . p(en | en−2 en−1 )    (2.8)
However, some trigrams will not appear in the language model: these are the unknown
events. If one of the trigrams in Equation 2.8 is missing and its probability is zero, the
language model score of the whole sequence will be zero, rendering the other trigrams
useless. Smoothing techniques transfer a small portion of the probability mass from observed
events to unseen events so that unseen events also have a small probability of occurrence.
Smoothing extracts more information from the language model as long as its assumptions
are reasonable.
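As a sketch of the trigram model in Equation 2.8 combined with smoothing, the snippet below trains on a toy corpus and uses add-one (Laplace) smoothing, a deliberately simple choice made for illustration; practical systems use far larger corpora and more refined smoothing methods.

```python
import math
from collections import Counter

# Toy training corpus; real language models are trained on millions of words.
corpus = "this is correct order because this order is correct".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_trigram(w1, w2, w3):
    """p(w3 | w1 w2) with add-one smoothing over the vocabulary."""
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + len(vocab))

def sentence_logprob(words):
    """log p(sentence), approximated as a product of trigram probabilities."""
    return sum(math.log(p_trigram(*words[i:i + 3]))
               for i in range(len(words) - 2))

# The smoothed model prefers the fluent order, as in Equation 2.6,
# while still assigning non-zero probability to unseen trigrams.
assert sentence_logprob("this is correct order".split()) > \
       sentence_logprob("order correct is this".split())
```

Because every trigram gets at least a count of one in the numerator, no sequence receives probability zero, which is exactly the failure mode smoothing is meant to prevent.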
2.1.2 Translation Model
State-of-the-art machine translation (MT) systems apply statistical approaches to learn
translation rules from large amounts of parallel data. The translation model assigns a
conditional probability p(f | e) to any pair of source-target sentences (e, f). Every sentence
consists of a sequence of words. From the parallel corpora, the only information we have
about a source sentence is its translation; we do not know which word in the source sentence
is translated to which word in the target sentence. There are many ways to train the
translation model, such as word-based, phrase-based and hierarchical translation models.
We will introduce the word-based translation model.
Word-based Translation Model
To compute the conditional probability p(f | e) for any target sentence f = f1 . . . fm , we
have to model the distribution:
p(f1 . . . fm | e1 . . . el )    (2.9)
It is difficult to compute Equation 2.9 directly because it is a joint probability over many
words. Brown et al. (1990) note that it is reasonable to regard the target translation of a
source sentence as being generated from the source sentence word by word. For example, in
the sentence pair (Jean aime Marie, John loves Mary), John translates to Jean, loves
translates to aime and Mary translates to Marie. Thus a word is translated to the word to
which it aligns. This mapping from the target language to the source language is called an
alignment. However, in the parallel data, no word alignment is given for a pair of
source-target sentences: for each target (foreign) word f , we do not know the source word
to which it maps. We lack the alignment information. By incorporating the alignment
information into the translation model, we can write the translation model as follows:
p(f1 . . . fm | e1 . . . el ) = Σ_{a1=0}^{l} Σ_{a2=0}^{l} · · · Σ_{am=0}^{l} p(f1 . . . fm , a1 . . . am | e1 . . . el )    (2.10)
Figure 2.2: Possible Alignments
where each ai is in {0, 1, . . . , l}. If a1 = 3, this means that the first foreign word is aligned to
the third source word. The alignment is considered a hidden variable in our translation model.
To estimate the translation model from this incomplete data, the expectation maximization
algorithm (EM algorithm) can be applied to iteratively increase the likelihood by
re-estimating the hidden variable, which in this case is the alignment probability.
Consider the example from Brown et al. (1990):

e = John loves Mary
f = Jean aime Marie    (2.11)

The correct alignment is

⟨a1 , a2 , a3 ⟩ = ⟨1, 2, 3⟩    (2.12)

where each French word maps to the corresponding English word. However, there can be
other alignments, as shown in Figure 2.2, such as ⟨a1 , a2 , a3 ⟩ = ⟨1, 1, 1⟩, where every
French word maps to the first English word. To decide which alignment is best, we use an
EM algorithm to estimate the parameters of the translation model.
EM algorithm
The EM algorithm is defined as follows in Chapter 4.2.2 of Koehn (2009):
1. Initialize the model, usually with a uniform distribution or randomized translation probabilities
2. Apply the model to the data (expectation step)
3. Learn the model from the data (maximization step)
4. Iterate steps 2 and 3 until convergence
To apply the EM algorithm to the translation model (Equation 2.10), we first initialize
the model with a uniform distribution on all translation probabilities t(f | e) and collect the
expected counts of alignments to update the alignment probabilities. Over the iterations,
the model comes to prefer the most likely word translations and alignments.
To estimate the model in Equation 2.10 efficiently with the EM algorithm, we need to make
some assumptions. Our goal is to build a model of

P (F1 = f1 . . . Fm = fm , A1 = a1 . . . Am = am | E1 = e1 . . . El = el , L = l, M = m)    (2.13)

where A1 . . . Am are the alignment variables, L is the length of the source sentence and
M is the length of the target (foreign) sentence. By applying the chain rule, Equation
2.13 can be written as follows:
P (F1 = f1 . . . Fm = fm , A1 = a1 . . . Am = am | E1 = e1 . . . El = el , L = l, M = m)
= P (A1 = a1 . . . Am = am | E1 = e1 . . . El = el , L = l, M = m)
× P (F1 = f1 . . . Fm = fm | A1 = a1 . . . Am = am , E1 = e1 . . . El = el , L = l, M = m)
(2.14)
We assume the alignment is independent of the actual words in the source and target
sentences: the distribution of the alignment depends only on the sentence lengths l and m.
The first term of Equation 2.14 can then be written as

P (A1 = a1 . . . Am = am | E1 = e1 . . . El = el , L = l, M = m)
= ∏_{i=1}^{m} P (Ai = ai | A1 = a1 . . . Ai−1 = ai−1 , E1 = e1 . . . El = el , L = l, M = m)
= ∏_{i=1}^{m} P (Ai = ai | L = l, M = m)
= ∏_{i=1}^{m} q(ai | i, l, m)    (2.15)
where the first equality is by the chain rule of probabilities.
For the second term in Equation 2.14, we make the following assumption:

P (F1 = f1 . . . Fm = fm | A1 = a1 . . . Am = am , E1 = e1 . . . El = el , L = l, M = m)
= ∏_{i=1}^{m} P (Fi = fi | F1 = f1 . . . Fi−1 = fi−1 , A1 = a1 . . . Am = am , E1 = e1 . . . El = el , L = l, M = m)
= ∏_{i=1}^{m} P (Fi = fi | Eai )
= ∏_{i=1}^{m} t(fi | eai )    (2.16)
where we applied the chain rule and assumed that the foreign word Fi depends only on Eai ,
i.e., the foreign word Fi is aligned to the source word Eai .
Combining Equations 2.15 and 2.16, we can derive a distribution for the translation model
that combines the lexical translation t(fi | eai ) and the alignment model q(ai | i, l, m):
p(f, a | e) = ∏_{i=1}^{m} t(fi | eai ) q(ai | i, l, m)    (2.17)
This model is called IBM Model 2, which differs from IBM Model 1 in that the latter only
has lexical translation probability. The difference is that IBM Model 1 does not take into
account the alignment of words. For example, in IBM Model 1, the translation probabilities
for the following two translations are the same:
“John loves Mary” => “Jean aime Marie”
“John loves Mary” => “Jean Marie aime”
In IBM Model 2 (Equation 2.17), the alignment probability distribution is incorporated
to take into account the alignment.
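As a toy illustration of Equation 2.17, the sketch below scores the two word orders above under IBM Model 2. The tables `t` and `q` and the function `model2_score` are invented for illustration only, not taken from any system in this thesis.

```python
from math import prod

# Invented toy tables: t(f | e) lexical probabilities and a
# diagonal-favouring alignment distribution q(a_i | i, l, m).
t = {("jean", "john"): 0.9, ("aime", "loves"): 0.8, ("marie", "mary"): 0.9}

def q(ai, i, l, m):
    return 0.7 if ai == i else 0.15  # reward alignments on the diagonal

def model2_score(f_words, e_words, alignment):
    """p(f, a | e) = prod_i t(f_i | e_{a_i}) * q(a_i | i, l, m), Equation 2.17."""
    l, m = len(e_words), len(f_words)
    return prod(t[(fi, e_words[ai - 1])] * q(ai, i, l, m)
                for i, (fi, ai) in enumerate(zip(f_words, alignment), start=1))

e = ["john", "loves", "mary"]
monotone = model2_score(["jean", "aime", "marie"], e, [1, 2, 3])
reordered = model2_score(["jean", "marie", "aime"], e, [1, 3, 2])
# Dropping the q factor (IBM Model 1) would give both orders the same score.
```

With the alignment term included, the monotone order scores higher than the reordered one, which is exactly what distinguishes Model 2 from Model 1.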
The EM algorithm from Collins (2011) is shown in Figure 2.3. S is the number of training iterations. q(j | i, l, m) represents the conditional probability that the foreign word fi is aligned to the source word ej given the foreign sentence length m and the source sentence length l. t(f | e) is the conditional probability of mapping the source word e to the foreign word f. In the EM algorithm shown in Figure 2.3, both parameters are estimated from the expected counts $c(e_j^{(k)}, f_i^{(k)})$ and $c(e_j^{(k)})$ computed in the E-step (expectation step), which are normalized in the M-step (maximization step) to produce t(f | e) and q(j | i, l, m).
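The E-step/M-step loop just described can be sketched in Python. This is a simplified illustration of the procedure in Collins (2011) on a tiny invented corpus; the function name `train_ibm2` and the toy sentences are our own assumptions, not the thesis implementation.

```python
from collections import defaultdict

def train_ibm2(corpus, iterations=10):
    """EM training for IBM Model 2 on a toy corpus of (foreign, English)
    token-list pairs.  Index j = 0 stands for the special NULL English word."""
    # Initialize t(f | e) and q(j | i, l, m) uniformly.
    t = defaultdict(lambda: 0.25)
    q = defaultdict(lambda: 0.25)
    for _ in range(iterations):
        c_ef = defaultdict(float); c_e = defaultdict(float)
        c_jilm = defaultdict(float); c_ilm = defaultdict(float)
        for f_sent, e_sent in corpus:
            e_words = ["NULL"] + e_sent
            l, m = len(e_sent), len(f_sent)
            for i, f in enumerate(f_sent, start=1):
                # E-step: delta(k, i, j) is the posterior of alignment a_i = j.
                norm = sum(q[(j, i, l, m)] * t[(f, e_words[j])] for j in range(l + 1))
                for j in range(l + 1):
                    delta = q[(j, i, l, m)] * t[(f, e_words[j])] / norm
                    c_ef[(e_words[j], f)] += delta
                    c_e[e_words[j]] += delta
                    c_jilm[(j, i, l, m)] += delta
                    c_ilm[(i, l, m)] += delta
        # M-step: renormalize the expected counts.
        t = defaultdict(float, {(f, e): v / c_e[e] for (e, f), v in c_ef.items()})
        q = defaultdict(float, {k: v / c_ilm[k[1:]] for k, v in c_jilm.items()})
    return t, q

corpus = [(["jean", "aime", "marie"], ["john", "loves", "mary"]),
          (["jean", "aime", "anne"], ["john", "loves", "anne"]),
          (["jean"], ["john"])]
t, q = train_ibm2(corpus)
```

The short third sentence pins "jean" to "john", after which the expected counts shift the probability mass of "aime" toward "loves" over the iterations.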
2.1.3 Summary
In summary, the noisy channel model is applied in machine translation systems. The source
sentences are distorted in the noisy channel and become foreign sentences. We only know
the foreign sentences and use knowledge of the source language (language model) and the
Input: A training corpus $(f^{(k)}, e^{(k)})$ for $k = 1 \ldots n$, where $f^{(k)} = f_1^{(k)} \ldots f_{m_k}^{(k)}$, $e^{(k)} = e_1^{(k)} \ldots e_{l_k}^{(k)}$.
Initialization: Initialize $t(f|e)$ and $q(j|i, l, m)$ parameters (e.g., to random values).
Algorithm:
• For $s = 1 \ldots S$
  – Set all counts $c(\ldots) = 0$
  – For $k = 1 \ldots n$
    ∗ For $i = 1 \ldots m_k$
      · For $j = 0 \ldots l_k$
        $c(e_j^{(k)}, f_i^{(k)}) \mathrel{+}= \delta(k, i, j)$
        $c(e_j^{(k)}) \mathrel{+}= \delta(k, i, j)$
        $c(j|i, l, m) \mathrel{+}= \delta(k, i, j)$
        $c(i, l, m) \mathrel{+}= \delta(k, i, j)$
      where
      \[
      \delta(k, i, j) = \frac{q(j|i, l_k, m_k)\, t(f_i^{(k)} \mid e_j^{(k)})}{\sum_{j'=0}^{l_k} q(j'|i, l_k, m_k)\, t(f_i^{(k)} \mid e_{j'}^{(k)})}
      \]
  – Set
    \[
    t(f|e) = \frac{c(e, f)}{c(e)} \qquad q(j|i, l, m) = \frac{c(j|i, l, m)}{c(i, l, m)}
    \]
Output: parameters $t(f|e)$ and $q(j|i, l, m)$
Figure 2.3: EM Algorithm for IBM Model 2 (Collins, 2011)
distortions caused by the noisy channel, which in IBM Model 2 are the alignment and lexical translation information. Based on these models, we do the translation to recover the source language sentences from the foreign sentences.
2.2 Spelling Correction
It is common for humans to misspell words when typing text. Various automatic spelling correction programs are in widespread use, though these can become a source of amusement when they correct inaccurately. Spelling correction is another example of the noisy channel model. The objective function in spelling correction is the same as in machine translation (see Equation 2.4), but for spelling correction, we refer to the channel model as the error model.
The first idea to model language transmission as a Markov source passed through a
noisy channel model was developed in Shannon (1948). In the early 1990s, Kernighan
et al. (1990); Church and Gale (1991) proposed noisy channel model based spelling correction. Norvig (2009) showed an implementation of spelling correction in Python.
The spelling correction has two components, one from the language model and the other from the error model. To estimate the error model, we need to train on pairs of
misspelled and corrected words. Figure 2.4 shows an overview of the noisy channel based
spelling correction from Jurafsky and Martin (2014), which indicates that the channel model (error model) is computed from the edit probability and the prior distribution is computed by the language model. It combines these two models and selects the candidate with the highest score.
function NOISY-CHANNEL-SPELLING(word x, dict D, lm, editprob) returns correction
  if x ∉ D
    candidates, edits ← all strings at edit distance 1 from x that are ∈ D, and their edit
    for each c, e in candidates, edits
      channel ← editprob(e)
      prior ← lm(c)
      score[c] = log channel + log prior
    return argmax_c score[c]
Figure 2.4: Pseudo Code for Noisy Channel Model Based Spelling Correction (Jurafsky and Martin, 2014)
2.2.1 Spelling Correction Algorithm
Algorithm 1 shows the beam search algorithm for the noisy channel model spelling correction, which is based on the stack decoding heuristic from the Beam Search section in Koehn (2009). The general idea is to find the correct word x that maximizes the objective function argmax_x P(x | w) = argmax_x P(x) · P(w | x). P(x) is the language
model and P(w | x) is the error model, where w is the misspelled word and x is the candidate word. The beam search algorithm tracks the partial hypotheses in stacks. During the search, it considers the top n hypotheses in each stack and uses the error model and language model to score the candidate words in these hypotheses. At the end, it traces back through the predecessor attributes in the hypotheses to return the best scoring sequence of text.
2.2.2 Error Model
The error model we used is the probability of edit, estimated from the misspelling data. The
advantage of using edit probability for the error model is that one edit can represent many
pairs of correct and misspelled words. For example, P (“ew” | “e”) means the probability
of the correct letter “e” being misspelled as “ew” and this is an insert operation. For the
misspelled word “thew”, one of the possible candidates is “the”, and thus its probability of edit is P(“ew” | “e”). There are four operations, deletion, insertion, substitution and reversal, which can be used to transform one word into another (Kernighan et al., 1990).
To train an error model, we first collect a training set of pairs of words where each pair
is the correct and misspelled form of the word. We set a threshold for the longest word we can process and for the largest word-length difference. If a pair of words exceeds these thresholds, we skip this example and process the next pair. This saves training time without losing much information, as only very few cases exceed the thresholds. Then, based on the Damerau-Levenshtein distance (Damerau, 1964; Levenshtein, 1966), we generate the candidate edits from the misspelled to the correct word within a certain edit distance, as described in Subsection 2.2.3.
If there are more than 1000 possible alignments between misspelled and target words, we randomly sample 1000 alignments to extract the edits. We count the edits over the whole training set, which is all the pairs of target and misspelled words, and then calculate each edit's probability normalized by the total count of edits extracted from the training set. We set a parameter pspell_error, the probability of misspelling, which assigns probability 1 − pspell_error to correctly spelled letters. Thus, we have the following formula to compute the error model:
\[
\Pr(\mathit{edits}) =
\begin{cases}
1 - p_{\mathrm{spell\_error}} & \text{if $\mathit{edits}$ is empty}\\
p_{\mathrm{spell\_error}} \cdot P(\mathit{edit}_1) \cdot P(\mathit{edit}_2) \cdots P(\mathit{edit}_n) & \text{if $\mathit{edits}$ is not empty } (n > 0)
\end{cases}
\tag{2.18}
\]
where edits is a set of possible edits {edit1 , edit2 , · · · , editn }
The pseudo-code in Algorithm 2 shows how we trained the error model.
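Equation 2.18 translates directly into code. A minimal sketch, assuming an invented edit-probability table and misspelling probability (in the thesis these values come from the training procedure of Algorithm 2):

```python
from math import prod

P_SPELL_ERROR = 0.05  # assumed probability of misspelling
# Invented edit probabilities, stored as (observed, intended) pairs.
EDIT_PROB = {("ew", "e"): 0.004, ("sye", "she"): 0.001}

def pr_edits(edits):
    """Pr(edits), Equation 2.18: a word with no edits keeps probability
    1 - p_spell_error; otherwise multiply p_spell_error by each edit's
    probability."""
    if not edits:
        return 1.0 - P_SPELL_ERROR
    return P_SPELL_ERROR * prod(EDIT_PROB[e] for e in edits)
```

For example, explaining "thew" as "the" requires the single edit ("ew", "e"), so its score is p_spell_error times that edit's probability.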
2.2.3 Candidate Generation
For unknown words which are not in the dictionary, we generate a list of candidate words. Norvig (2009, Chapter 14) shows that the candidates are generated by an edits function, which is passed a word and returns a dictionary of {word : edit} pairs indicating the possible corrections. There would be too many candidates if all the words generated by the edits, even the incorrect ones, were listed. Norvig (2009) precomputed the set of all prefixes of all the words in the vocabulary. He then split the words into a head and a tail, ensuring that the head was always in the list of prefixes. Similarly, we pre-computed the prefixes of words, which are the first n letters of the words, where n ranges from 1 to the length of the word. With the help of this set of word prefixes, we generated candidates under the constraint of matching the word prefixes, which removed most of the incorrect candidates. The list of candidates was determined by the edit distance; in English, misspelled words are usually within 3 edits of the correct words.
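The prefix-constrained generation described above can be sketched as follows. The toy vocabulary and the function name `candidates` are our own illustrations, not the thesis code:

```python
VOCAB = {"because", "behavior", "beverage", "the", "she"}
# Precompute every prefix (first n letters) of every vocabulary word.
PREFIXES = {w[:n] for w in VOCAB for n in range(1, len(w) + 1)}

def candidates(word, vocab=VOCAB, prefixes=PREFIXES):
    """Distance-1 candidates of `word` that survive the prefix constraint."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    cands = set()
    for a, b in splits:
        # Prune early: if the unchanged head is not a prefix of any
        # vocabulary word, no edit at this position can yield a real word.
        if a and a not in prefixes:
            continue
        if b:
            cands.add(a + b[1:])                          # deletion
            cands.update(a + c + b[1:] for c in letters)  # substitution
            if len(b) > 1:
                cands.add(a + b[1] + b[0] + b[2:])        # transposition
        cands.update(a + c + b for c in letters)          # insertion
    return cands & vocab
```

The prefix check discards whole families of edits at once: for a string like "xqz", no split has a head matching any vocabulary prefix, so almost nothing is generated.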
2.2.4 Decoding
As we have the error model and the language model, we use beam search to decode the sequence of words with the highest score according to the following equation:
\[
\begin{aligned}
\operatorname*{argmax}_{x \in V} P(x \mid w) &= \operatorname*{argmax}_{x \in V} \Pr(\mathit{edits}) \cdot P(x)^{\lambda}\\
&= \operatorname*{argmax}_{x \in V} \log_{10} \Pr(\mathit{edits}) + \lambda \cdot \log_{10} P(x)
\end{aligned}
\tag{2.19}
\]
where V is the vocabulary, λ is for weighting the language model and the error model
P r(edits) is computed by Equation 2.18.
Considering the context of the unknown words in the language model helps us predict the unknown words more precisely. For example, the data could be provided with the
following fields:
{“text”: “bevause”, “left_context”: “this was”, “right_context”:“she sucks”}.
The misspelled word is “bevause”. We can list all the candidate words within Damerau-Levenshtein edit distance 1 to 3 and refer to the dictionary to search for the correct word. We score every candidate using the language model with its context, multiplied by the error model, and then use beam search to pick the one with the best score.
For example, we have a sentence “this was bevause sye sacks”. Every word in the sentence can generate a list of corresponding candidates. Assume “this” has a list of candidates “this”, “that”, “what”, “thew”; “was” has a list of candidates “has”, “his”, “wait”, “was”, “what”; “bevause” has a list of candidates “because”, “behavior”, “beverage”; “sye” has a list of candidates “she”, “eye”, “yes”; and “sacks” has a list of candidates “sucks”, “snacks”. The beam search first scores every candidate word for “this”, then selects the top n candidate words and expands them. In Figure 2.5, assuming we set the beam search width n to 2, we select “this” and “that” at the first position. The beam search algorithm then expands the candidate “this” and scores it with all candidate words in the second position, selecting the top 2 candidate words that maximize the score from the beginning of the sentence to the current word. At each position of the sentence, we remove some of the candidate words, reducing the search space. Figure 2.5 shows an example search through the top n candidate words rather than all the candidate words.
Figure 2.5: An Example of Beam Search
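The walk through Figure 2.5 can be sketched as a word-level beam search. The candidate lists follow the example above (trimmed to two candidates per position), but the bigram scores are invented; a real system would combine the language model score with the error model as in Equation 2.19.

```python
from math import log

CANDIDATES = [["this", "that"], ["was", "has"], ["because", "beverage"],
              ["she", "eye"], ["sucks", "snacks"]]
BIGRAM = {("<s>", "this"): 0.6, ("<s>", "that"): 0.4,
          ("this", "was"): 0.7, ("this", "has"): 0.3,
          ("that", "was"): 0.5, ("that", "has"): 0.5,
          ("was", "because"): 0.8, ("was", "beverage"): 0.2,
          ("has", "because"): 0.6, ("has", "beverage"): 0.4,
          ("because", "she"): 0.7, ("because", "eye"): 0.3,
          ("beverage", "she"): 0.5, ("beverage", "eye"): 0.5,
          ("she", "sucks"): 0.8, ("she", "snacks"): 0.2,
          ("eye", "sucks"): 0.5, ("eye", "snacks"): 0.5}

def beam_search(candidate_lists, width=2):
    """Keep only the `width` best partial hypotheses at each position."""
    beam = [(0.0, ["<s>"])]
    for cands in candidate_lists:
        expanded = [(score + log(BIGRAM[(hyp[-1], w)]), hyp + [w])
                    for score, hyp in beam for w in cands]
        beam = sorted(expanded, reverse=True)[:width]
    return max(beam)[1][1:]  # drop the start symbol
```

With these toy scores the search recovers the intended sentence "this was because she sucks" while only ever keeping two partial hypotheses per position.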
2.2.5 Summary
In the noisy channel model, misspelled words are treated as correctly spelled words that have been distorted by a noisy communication channel. The channel applies letter substitution, insertion, deletion and transposition to the correctly spelled words. The goal of noisy channel model based spelling correction is to pass all the possible correctly spelled words through the model and then find which one is the most similar to the misspelled word.
2.3 Summary
In this chapter, we introduced the noisy channel model as applied in machine translation and spelling correction. The noisy channel model is a framework that models different problems in the same way: it regards the foreign text (misspelled words) as source text (correct words) distorted by a noisy channel. We explained alignment using the EM algorithm in machine translation, and the algorithms for training the error model and generating candidates for spelling correction. The beam search algorithm is used during decoding, and will also be used in our decipherment method.
Algorithm 1 Beam Search for Spelling Correction Algorithm
procedure Beam-Search(words, width, edit-distance-limit=3)
  D is the dictionary of known words
  LM is the language model score in logarithm
  Pr(edits) is from Equation 2.18
  init stacks with the size of n                ▷ n is the number of words in text
  set hypothesis tuple with three attributes: score, predecessor, word
  place empty hypothesis into stack 0
  for all stacks 0 ... n-1 do                   ▷ i is the index of stacks
    for the best-width hypotheses in the stack do
      if words[i] is in D then
        score = current hypothesis score
        edits ← empty
        score = score + log(Pr(edits)) + LM(words[i])
        new-hypothesis = hypothesis(score, current hypothesis, words[i])
        if words[i] not in stacks[i+1] or stacks[i+1][words[i]].score < score then
          stacks[i+1][words[i]] = new-hypothesis
        end if
        continue
      end if
      generate candidate words from D
      if no candidates generated then
        score = current hypothesis score
        edits ← empty
        score = score + log(Pr(edits)) + LM(words[i])
        new-hypothesis = hypothesis(score, current hypothesis, words[i])
        if words[i] not in stacks[i+1] or stacks[i+1][words[i]].score < score then
          stacks[i+1][words[i]] = new-hypothesis
        end if
      else
        for each pair of candidate word and edits (w, edits) do
          score = current hypothesis score
          score = score + log(Pr(edits)) + LM(w)
          new-hypothesis = hypothesis(score, current hypothesis, w)
          if w not in stacks[i+1] or stacks[i+1][w].score < score then
            stacks[i+1][w] = new-hypothesis
          end if
        end for
      end if
    end for
  end for
  get the best scored hypothesis
  return the best scoring sequence of words by traversing back from the best scored hypothesis
end procedure
Algorithm 2 Error Model Training
procedure Error-Model-Training(training-file)
  training-file contains pairs of misspelled word and target word and their frequency
  set longest-word-length := 20
  set sample-N := 1000
  set diff-length := 5
  for each pair of misspelled word and target word do
    if the length of the misspelled word or target word > longest-word-length then
      continue
    else
      if absolute(len(misspelled word) - len(target word)) > diff-length then
        continue
      end if
      compute the Damerau-Levenshtein distance between the misspelled word and target word
      extract the alignments between the misspelled word and target word
      if the number of alignments is greater than sample-N then
        randomly sample sample-N alignments from all the alignments
      end if
      for each alignment do
        extract the edits from the alignment
        count the edits and store them in a dictionary
      end for
    end if
  end for
  normalize the edit dictionary to obtain the error model
end procedure
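The Damerau-Levenshtein distance used by Algorithm 2 can be sketched with the standard restricted-edit dynamic program (the function name is ours):

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein distance: the minimum number of
    insertions, deletions, substitutions and adjacent transpositions needed
    to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]
```

For example, "thew" and "the" are one deletion apart, and "hte" and "the" are one transposition apart, so both pairs have distance 1.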
Chapter 3
Decipherment
In cryptography, a message is called either plain text or clear text. Disguising a message in
a way to hide the content of the original messages is called encryption, while recovering the
cipher text, an encrypted text, back to plain text is called decryption or decipherment.
The goal of decipherment is to uncover the hidden plain text p1 ...pn, given a disguised text c1 ...cn. Our methodologies operate under the assumption that every corrupted form of adversarial text is plain text encrypted with letter substitution, insertion and deletion. For example, a user corrupts adversarial text by inserting additional letters, or making substitutions and deletions in the text, like you are a B*n@n@ee, whose plain text is you are a bunny. We lowercase the texts and substitute the special symbols and punctuation inside the words with NULL, and then allow NULL to map to any letter or NULL. So in this case, the corrupted word B*n@n@ee is changed to b<NULL>n<NULL>n<NULL>ee for decipherment.
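The preprocessing just described can be sketched with a single regular expression (the function name `preprocess` is our own):

```python
import re

def preprocess(word):
    """Lowercase a word and replace every non-letter symbol inside it with
    the <NULL> marker, as described above."""
    return re.sub(r"[^a-z]", "<NULL>", word.lower())

# preprocess("B*n@n@ee") -> "b<NULL>n<NULL>n<NULL>ee"
```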
When dealing with insertion cases, we add NULL symbols into the cipher text to give the decipherment room to map candidate letters. We tried two ways of inserting a NULL symbol. One is inserting the NULL at a random position in the corrupted offensive key words. The other is based on Ando and Lee (2003), who proposed a statistical segmentation algorithm for Japanese. This algorithm uses the counts of character n-grams in an unsegmented corpus to make the segmentation decisions. As an example shown in their paper, “A B C D W X Y Z” is a sequence of unsegmented words. They check the n-gram counts before, after and
across the position k of the word sequence. For example, the 4-grams s1 = “A B C D” and s2 = “W X Y Z” lie before and after position 4, respectively, and both tend to be more frequent than straddling n-grams such as the 4-gram t1 = “B C D W”, which crosses position 4. Thus, they claim that there is evidence of a word boundary between D and W, which is position 4 in our case. Equations 3.1 and 3.2 below are the formulas to
compute each position’s score. In our study, we pick the highest position score in Equation
3.2 as the boundary to insert the NULL symbol.
\[
v_n(k) = \frac{1}{2(n-1)} \sum_{i=1}^{2} \sum_{j=1}^{n-1} I_{>}\big(\#(s_i^n), \#(t_j^n)\big)
\tag{3.1}
\]
where n is the order of the gram, $I_{>}(y, z)$ is 1 when y > z, and 0 otherwise. $s_1^n$ is the n-gram to the left of position k and $s_2^n$ is the n-gram to the right of position k.
\[
v_N(k) = \frac{1}{|N|} \sum_{n \in N} v_n(k)
\tag{3.2}
\]
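Equations 3.1 and 3.2 can be sketched directly. The corpus counts below are invented so that position 4 of "ABCDWXYZ" wins every comparison; function names are our own:

```python
from collections import Counter

def vn(counts, seq, k, n):
    """v_n(k), Equation 3.1: the fraction of comparisons in which the n-grams
    flush left (s_1) and right (s_2) of position k are more frequent than the
    n-grams t_1 .. t_{n-1} straddling position k."""
    s1, s2 = seq[k - n:k], seq[k:k + n]
    straddling = [seq[k - n + j:k + j] for j in range(1, n)]
    wins = sum(counts[s] > counts[t] for s in (s1, s2) for t in straddling)
    return wins / (2 * (n - 1))

def vN(counts, seq, k, orders):
    """v_N(k), Equation 3.2: the average of v_n(k) over the orders in N."""
    return sum(vn(counts, seq, k, n) for n in orders) / len(orders)

# Invented corpus counts: "ABCD" and "WXYZ" are frequent, the straddling
# 4-grams are rare, so position 4 looks like a word boundary.
counts = Counter({"ABCD": 9, "WXYZ": 8, "BCDW": 1, "CDWX": 1, "DWXY": 1})
score = vN(counts, "ABCDWXYZ", 4, orders=[4])
```

We insert the NULL symbol at the position with the highest such score.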
Following Weaver (1955), “This is really written in English, but it has been coded in
some strange symbols. I will now proceed to decode.” The set of encrypted forms of English
adversarial text is the English that has been encoded in some strange symbols. The critical
task is estimating the probability of transforming the plain text letter e to cipher text letter
c. There can be 27 possible plain text letters, including the NULL letter that represents no
letter. Here is the probability:
\[
P(c) = \sum_{e} P(e) \cdot P(c \mid e)
\tag{3.3}
\]
We try to locate the plain text that maximizes the probability P (plain text | cipher text).
To achieve this, we first build a probabilistic model P (e) of the plain text source. Then
we build a probabilistic channel model P (c | e) that explains how plain text sequences (e)
become cipher text sequences (c). In general, people use the EM algorithm to estimate the
channel model P (c | e) for the maximization of P (c), which is the same as maximizing the
sum over all plain text e of P(e) ∗ P(c | e). After that, we use the Viterbi algorithm (Forney Jr, 1973) to choose the e maximizing P(e) ∗ P(c | e), which is the same as maximizing P(e | c) (by the Bayes Rule).
The P (e) is the language model which is known to us. We will try an English character
based language model. The decipherment of cipher text is like tagging the plain text letter
to each cipher text letter, which is an HMM. We apply the EM algorithm (Dempster et al.,
1977) to adjust the P (c | e) in order to maximize the probability of the observed cipher text
(encrypted text). The P(c | e) represents the probability of transforming the plain text letter e into the cipher text letter c. We used the Forward-Backward algorithm (Jelinek, 1997) to infer the probability
of mappings.
The EM algorithm estimates the mapping probabilities which maximize the likelihood of
the sequence of the cipher text. If the adversarial text is not being encrypted or corrupted,
we keep it unchanged as the plain text. After estimating the probability P(c | e), we use beam search to search for the best sequence of plain text:
\[
\operatorname*{argmax}_{e} P(e \mid c) = \operatorname*{argmax}_{e} P(e) \times P(c \mid e)
\]
We model this decipherment task as an HMM. The plain text letters are the hidden states and the state transition probabilities are given by the language model. The cipher text letters are the observations and the emission probability is P(c | e). Therefore, given a sequence of English cipher text of length L, the EM algorithm has a computational complexity of O(L · V² · R), where R is the number of iterations and V is the number of letters. For a one-to-one simple letter substitution encryption, the vocabulary is the set of 26 alphabet letters, whitespace and the NULL symbol.
3.1 Related Work
The most relevant prior works are based on the well-known noisy channel framework. Knight et al. (2006) analyzed the problem in which we face a cipher text stream and try to uncover the plain text that lies behind it. For letter substitution ciphers with only 26 plain text letters, they create a 27 × 27 table which includes a space character. They then use the EM algorithm (Dempster et al., 1977) to estimate the letter mapping probability P(c | e), the probability of cipher text letter c given a plain text letter e.
Ravi and Knight (2011) regarded foreign language as cipher text and English as the plaintext. The goal of word substitution decipherment is to guess the original plaintext from given cipher data without any knowledge of the substitution key. Unlike letter substitution, which only involves 26 plaintext letters, in this case they have large-scale vocabularies (10k-1M word types in plaintext) and large corpora (100k cipher tokens). It is impractical to estimate all the possible mappings between plaintext words and cipher tokens. To deal with this they proposed two methods. One is iterative EM, which picks the top K frequent word types in both the plain and cipher text data and then uses the selected data to estimate the channel model probabilities. The other method is to use a Bayesian learning algorithm to sample data to estimate the channel model probabilities and do the decoding. They achieved reasonably high decipherment accuracy (88.6%) using the Bayesian method with a bigram language model.
Similarly, Knight and Yamada (1999) adapted the EM algorithm (Dempster et al., 1977) to decipher unknown scripts by aligning sounds to characters. The EM algorithm was used to estimate the mapping distribution over sounds to characters. They then generated the most probable sequence of clean text with a dynamic programming method, the Viterbi algorithm (Forney Jr, 1973).
For detecting malicious messages, Kansara and Shekokar (2015) proposed a framework that detects cyberbullying messages in social networks using classifiers. They detected not only offensive text but also images. Chen et al. (2012) proposed the Lexical Syntactic Feature (LSF) architecture to detect offensive content and identify potential offensive users in social media. Razavi et al. (2010) took advantage of a variety of statistical models and rule-based patterns and applied multi-level classification for flame detection. Machine learning techniques and rule based filtering are thus widely used in detecting malicious messages.
3.2 Hidden Markov Model
A Hidden Markov Model (HMM) θ models sequential data (Baum and Petrie, 1966). It can be treated as the triple < ps, ptt, ptw >. ps(t0) is the probability that we start with some plain letter t as t0; in our case, it is the starting symbol of a sentence. ptt(ti | ti−1) is the transition probability from ti−1 to ti, which the language model can produce. ptw(wi | ti) is the probability of generating wi from ti, where wi is cipher text and ti is its plain text.
3.2.1 Expectation Maximization Algorithm
The Expectation Maximization (EM) algorithm (Baum, 1972) finds hidden parameters from a set of observed data by iteratively computing the expected value of the log likelihood function and then solving for the parameters that maximize it. The log likelihood function is as follows:
\[
L(\theta) = \sum_{i} \log P(c^{(i)})
\tag{3.4}
\]
By marginalizing over the plain text, we can expand this function as
\[
\begin{aligned}
L(\theta) &= \sum_{i=1}^{n} \log P(c^{(i)})\\
&= \sum_{i=1}^{n} \log \left( \sum_{e \in V} P(e) \times \prod_{j=1}^{d} P(c_j^{(i)} \mid e) \right)
\end{aligned}
\tag{3.5}
\]
The EM algorithm works as follows:
1. Assign initial probabilities to all parameters
2. Repeat: Adjust the probabilities of the parameters to increase the log likelihood the model assigns to the training set
3. until training converges
With more iterations of the EM algorithm, the log likelihood increases, up to a point. As the log likelihood function increases, the language model score of the produced plain text also increases, meaning that the deciphered plain text becomes more similar to the source language (the language of the language model).
As there is no prior information about the model, we set all the parameters to be equally
probable. To adjust the probabilities of the parameters, we use the Forward-Backward
algorithm, also referred to as the Baum-Welch algorithm (Baum, 1972), which is a special case of the general EM algorithm. It uses a dynamic programming strategy to efficiently compute the posterior marginal distributions, which is the second term P(c | e) in our Equation 3.3. The Forward-Backward algorithm (Rabiner, 1989) is guaranteed to converge to a local maximum of the log likelihood function; in other words, in each training iteration the parameter values will be adjusted such that the log likelihood score does not decrease. The idea is that in each iteration, we estimate the probability of the parameters by counting the score the current model assigns to them. For example, if our model assigns plain letter “A” to cipher letter “u” with probability 1/27, then we would increment our count of “A” and “u” occurring together by 1/27. Training typically stops when the increase in the log likelihood score of the training set between iterations drops below a threshold which determines the convergence point.
3.2.2 Forward-Backward Algorithm
The Forward-Backward algorithm (Rabiner, 1989) computes posterior probabilities of a
sequence of states given a sequence of observations (Collins, 2013). In the first pass, the
algorithm computes a set of forward probabilities, which provide the probability of being in any particular state given the observations up to that point. It can be written as follows. For all j ∈ 2, . . . , m, s ∈ V, where m is the length of the observations, V is the set of possible states and α are the forward probabilities,
\[
\alpha(j, s) = \sum_{s' \in V} \alpha(j-1, s') \times \phi(s', s, j)
\tag{3.6}
\]
Define $\phi(s', s, j) = P(s \mid s') \cdot P(c_j \mid s)$ as in Collins (2013):
\[
\phi(s_1 \ldots s_m) = \prod_{j=1}^{m} \phi(s_{j-1}, s_j, j) = \prod_{j=1}^{m} P(s_j \mid s_{j-1}) \cdot P(c_j \mid s_j)
\tag{3.7}
\]
In the second pass, we define the backward probabilities β, analogous to the forward
probabilities α.
\[
\beta(j, s) = \sum_{s' \in V} \beta(j+1, s') \times \phi(s, s', j+1)
\tag{3.8}
\]
This equation computes the summation of a set of backward probabilities, which provide the probability of observing the remaining observations given any starting point j. In the end, with the forward and backward probabilities, we can obtain the sum of the probabilities of all possible paths that pass through a particular point where position j generates state s, which is the expected count for our EM training. For all j ∈ 1 . . . m, a ∈ V,
\[
\mu(j, a) = \alpha(j, a) \times \beta(j, a)
\tag{3.9}
\]
For all j ∈ 1 . . . (m − 1), a, b ∈ V,
\[
\mu(j, a, b) = \alpha(j, a) \times \phi(a, b, j+1) \times \beta(j+1, b)
\tag{3.10}
\]
The pseudo code of Forward-Backward algorithm (Rabiner, 1989) modified from Knight
et al. (2006) is given below:
Algorithm 3 Forward-Backward algorithm for estimating posterior distribution (emission probability)
Given a cipher text c with length m, a plaintext vocabulary of v bigram tokens, and a plaintext trigram model b:
Randomly initialize the s(c | p) substitution table and normalize
for several iterations do
  set up a count(c, p) table with zero entries
  for i = 1 to v do
    Q[i, 1] = b(pi[1] | pi[0], boundary)
  end for
  for j = 2 to m do
    for i = 1 to v do
      Q[i, j] = 0
      for k = 1 to v do
        if pi[0] == pk[1] then
          Q[i, j] += Q[k, j − 1] · b(pi[1] | pk[0], pk[1]) · s(cj−1 | pk[1])
        end if
      end for
    end for
  end for
  for i = 1 to v do
    R[i, m] = b(boundary | pi[0], pi[1])
  end for
  for j = m − 1 down to 1 do
    for i = 1 to v do
      R[i, j] = 0
      for k = 1 to v do
        if pi[1] == pk[0] then
          R[i, j] += R[k, j + 1] · b(pk[1] | pi[0], pi[1]) · s(cj+1 | pk[1])
        end if
      end for
    end for
  end for
  for j = 1 to m do
    for i = 1 to v do
      count(cj, pi) += Q[i, j] · R[i, j] · s(cj | pi)
    end for
  end for
  normalize the count(c, p) table to create a revised s(c | p)
end for
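A simplified, letter-level sketch of this training loop is shown below: the states are single plain letters with a bigram language model, rather than the bigram tokens and trigram model of Algorithm 3, and all names, the toy bigram table and the toy cipher are our own assumptions.

```python
import random
from collections import defaultdict

def forward_backward(cipher, plain_letters, bigram, iterations=20, seed=0):
    """Simplified letter-level EM: states are single plain letters, transitions
    come from a bigram language model, and the substitution table s(c | p) is
    re-estimated from the expected counts, as in Algorithm 3."""
    rng = random.Random(seed)
    symbols = sorted(set(cipher))
    # Randomly initialize the substitution table s(c | p) and normalize.
    s = {p: {c: rng.random() for c in symbols} for p in plain_letters}
    for p in plain_letters:
        z = sum(s[p].values())
        s[p] = {c: v / z for c, v in s[p].items()}
    m = len(cipher)
    for _ in range(iterations):
        # Forward pass (the Q table in Algorithm 3).
        alpha = [{p: bigram[("<s>", p)] * s[p][cipher[0]] for p in plain_letters}]
        for j in range(1, m):
            prev = alpha[-1]
            alpha.append({p: s[p][cipher[j]] *
                          sum(prev[p2] * bigram[(p2, p)] for p2 in plain_letters)
                          for p in plain_letters})
        # Backward pass (the R table in Algorithm 3).
        beta = [None] * m
        beta[m - 1] = {p: 1.0 for p in plain_letters}
        for j in range(m - 2, -1, -1):
            beta[j] = {p: sum(bigram[(p, p2)] * s[p2][cipher[j + 1]] * beta[j + 1][p2]
                              for p2 in plain_letters) for p in plain_letters}
        # Expected counts, then renormalize to get a revised s(c | p).
        count = defaultdict(float)
        for j in range(m):
            z = sum(alpha[j][p] * beta[j][p] for p in plain_letters)
            for p in plain_letters:
                count[(p, cipher[j])] += alpha[j][p] * beta[j][p] / z
        for p in plain_letters:
            z = sum(count[(p, c)] for c in symbols)
            if z > 0:
                s[p] = {c: count[(p, c)] / z for c in symbols}
    return s

# Toy cipher: the plain text alternates a, b and the channel maps a -> b, b -> c.
bigram = {("<s>", "a"): 0.9, ("<s>", "b"): 0.1,
          ("a", "a"): 0.1, ("a", "b"): 0.9,
          ("b", "a"): 0.9, ("b", "b"): 0.1}
s = forward_backward("bcbc", "ab", bigram)
```

With the alternating-letter language model, the EM training concentrates the emission mass on the substitutions a to b and b to c.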
3.2.3 Initialization
Blömer and Bujna (2013) state that the initialization of EM training depends on the data and the allowed computational cost, and that there is no initialization algorithm that outperforms all others on all instances. Thus, to find a suitable initialization that reduces the computational cost and reaches the local optimum in fewer iterations, we propose our own initialization algorithm. It is based on the assumption that the previously trained table can help us reach the local optimum in fewer iterations than uniform distribution initialization on the following cipher text sentences.
Considering our problem of deciphering corrupted offensive text, Algorithm 3 trains on a single cipher text sentence. If we have multiple sentences or a large set of cipher text sentences to decipher, one way is to train each sentence line by line with Algorithm 3, with each line starting from a random initialization. The other method, which we propose, is to perform random initialization on the first cipher text sentence, and then add the revised s(c | e) to a random initialization for the next cipher text line as its initial table s(c | e). The iterative training part does not change. We compare these two ways of training in the experiments of Chapter 4 and find that the latter training method is better. The following pseudo code shows the whole decipherment algorithm on offensive text with the second initialization algorithm.
Algorithm 4 Initialization with previously trained table
Given a set of cipher text sentences of size n, a plain text vocabulary of v bigram tokens and a plain text trigram model b
Randomly initialize the s(c | e) substitution table and normalize
for i = 1 to n do
  preprocess the i-th cipher text sentence by collapsing repeated characters to 2 identical characters and lowercasing
  insert NULL symbols, positioned using the language model, into the offensive key words to handle insertion cases
  if i ≠ 1 then
    add s(c | e) to the (i − 1)-th trained table s(c | e) to get the new substitution table
  end if
  apply the Forward-Backward algorithm with the new substitution table as the initialization table
end for
3.2.4 Beam Search
The term beam search was introduced in Reddy et al. (1977). This search method is often used when the search space is exponential and memory is limited. In our decipherment problem, beam search selects a sequence of plain text letters based on the language model. However, we cannot guarantee the best result because beam search only considers a limited number of candidates to reduce its memory requirements. Beam search uses breadth-first search and prunes candidates/nodes at each level of the search, thus cutting down the search space at each level. Nuhn et al. (2013) report that the general decipherment goal is to obtain a mapping such that the probability of the deciphered text is maximal.
For example, suppose we have a sequence of cipher text “ifmmp xpsme”. Assume the search space for each letter is the 26 letters from a to z. Searching through all the possible letters requires O(26^n) space. If we instead prune the candidate letters by language model score, keeping a beam of the top n candidate letters at each step, we reduce the search space to O(width^n). We then find that the plain text “hello world” has the highest probability.
Since we estimate the posterior probability by the Forward-Backward algorithm in the
HMM, the language model score P(e) and the posterior probability P(c | e) are known. Given
a sequence of corrupted offensive text, we can search the candidate letters by beam search
to identify a sequence of deciphered plain text that maximizes P(e) × P(c | e). However,
rather than maximizing P(e) × P(c | e), Knight et al. (2006) found that maximizing
P(e) × P(c | e)^3 is extremely useful across decipherment applications. Cubing the emission
probability was introduced by Knight and Yamada (1999), who stated that it serves to
stretch out the P(c | e) probabilities, which tend to be too bunched up; the bunching is
caused by incompatibilities between the n-gram frequencies used to train P(e)
and the n-gram frequencies found in the correct decipherment of c. In our decipherment
problem, the beam search was implemented based on the stack decoding (Koehn, 2009).
Koehn (2009) describes this as a heuristic search that reduces the search space. We
have many hypotheses at each state or cipher text letter. If a hypothesis appears to be
bad, we may not want to expand it in the future. We organize the hypotheses
into stacks and, if the stacks get too large, we remove the worst hypotheses from the stacks.
The scores in our decipherment training and decoding are log based, since we deal with small
probabilities that might otherwise underflow. The pseudo code in Algorithm
5 shows the decoding process after the substitution table has been trained.
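The stack decoding described above can be sketched in Python as follows. The log substitution scores and log trigram scores are passed in as functions (log_sub, log_lm); these, and the toy shift-by-one tables in the test below, are stand-ins for the trained model, not the thesis implementation itself:

```python
from collections import namedtuple

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# A hypothesis records its log score, the hypothesis it extends, and its letter.
Hypothesis = namedtuple("Hypothesis", "score predecessor letter")

def beam_search(cipher, candidates, log_sub, log_lm, width=5):
    """Stack decoding: stacks[i] keeps the best hypothesis per plain letter
    after deciphering i cipher letters.  Scores are log based, with the
    emission score cubed (a factor of 3 in log space)."""
    stacks = [dict() for _ in range(len(cipher) + 1)]
    stacks[0]["<s>"] = Hypothesis(0.0, None, "<s>")
    for i, c in enumerate(cipher):
        # expand only the `width` best hypotheses in the current stack
        best = sorted(stacks[i].values(), key=lambda h: h.score, reverse=True)[:width]
        for h in best:
            prev = h.predecessor.letter if h.predecessor else "<s>"
            for p in candidates:
                score = h.score + 3 * log_sub(c, p) + log_lm(prev, h.letter, p)
                if p not in stacks[i + 1] or stacks[i + 1][p].score < score:
                    stacks[i + 1][p] = Hypothesis(score, h, p)
    # traverse back from the winning hypothesis to recover the plain text
    winner = max(stacks[-1].values(), key=lambda h: h.score)
    letters = []
    while winner.predecessor is not None:
        letters.append(winner.letter)
        winner = winner.predecessor
    return "".join(reversed(letters))
```

With a substitution function that strongly prefers the shift-by-one mapping, this decodes the running example “ifmmp” back to “hello” even with a flat language model.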
3.2.5 Random Restarts
In Berg-Kirkpatrick and Klein (2013), the effect of running numerous random restarts
when using EM to attack decipherment problems was investigated. They showed that
initializations that lead to the local optima with the highest likelihoods are very rare.
Thus, they report that large-scale random restarts have broad potential to reach the highest
likelihoods.
The reason the EM algorithm reaches the local optima is that the likelihood function in
HMM (Equation 3.5) is a non-convex function which has multiple “local” optima (Collins,
2011).
During every iteration of the EM algorithm, the likelihood increases until
it converges at a local optimum. The random restarts try different starting points
Algorithm 5 Beam Search Algorithm in Decipherment
n is the length of the cipher text
Define a hypothesis tuple with three attributes (score, predecessor, letter)
initialize the stacks with size n+1
place the empty hypothesis (0, None, ‘<s>’) into stack 0
for i = 1 to n do
  for the best width hypotheses h in stacks[i-1] do
    for p in the candidate letters do
      score ← h.score + 3 × S[cipher[i], p] + LM(h.predecessor.letter, h.letter, p)
      new-hypothesis ← hypothesis(score, h, p)
      if p not in stacks[i] or stacks[i][p].score < score then
        stacks[i][p] ← new-hypothesis
      end if
    end for
  end for
end for
winner ← the hypothesis with the highest score in stacks[n]
plain-text ← traverse back from the winner hypothesis
return plain-text
which increases the probability that the likelihood function will converge to a higher
local optimum, the point which will generate the best results.
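The restart strategy can be sketched as follows, with train_em standing in for one full EM run from a random initialization (an assumption; it is taken to return a final loglikelihood and the trained table):

```python
import random

def em_with_random_restarts(train_em, num_restarts=100, seed=0):
    """Run EM from many random starting points and keep the run that
    converges to the highest loglikelihood
    (cf. Berg-Kirkpatrick and Klein, 2013)."""
    rng = random.Random(seed)
    best_ll, best_table = float("-inf"), None
    for _ in range(num_restarts):
        loglikelihood, table = train_em(rng)  # one EM run from a random init
        if loglikelihood > best_ll:
            best_ll, best_table = loglikelihood, table
    return best_ll, best_table
```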
3.3 Summary
In this chapter, we reviewed the concepts of the HMM, the EM algorithm and the Forward-Backward
algorithm, and proposed our own initialization algorithm. The idea of applying the
decipherment method to corrupted offensive text is as follows: we model
the decipherment problem as an HMM and regard the cipher text (corrupted offensive
text) as observed data and the plain text as a hidden variable. To uncover the plain
text, we preprocess the cipher text by inserting NULL symbols based on n-gram counts
and removing repeated characters. The EM algorithm estimates the parameters using the
Forward-Backward algorithm in the HMM, and we then use beam search decoding to get
the best sequence of plain text as the deciphered result.
Chapter 4
Experimental Results
4.1 Dataset
4.1.1 Wiktionary
The dataset we used is from Wiktionary, a content dictionary of all words in all languages.
Wiktionary data also includes sentences which are tagged with labels such as vulgar,
derogatory and so on. We parsed the whole English Wiktionary dataset
at the 2015-01-02 timestamp and extracted 2,298 offensive sentences and 152,770 non-offensive
sentences, tagging every sentence as either offensive or normal. As Wiktionary is
like a traditional dictionary, it has an explanation and example sentences for every word,
and we extracted our sentences from these example sentences. We appended the offensive key words at
the end of every sentence. In our experiment, we only considered one language (English),
so we did not include other languages: we filtered the Wiktionary sentences using
the labels “lang=en” and “en”, which helped us filter out the sentences
that were not in English. However, the decipherment model can also decipher text in other
languages; because it is not language dependent, we can train another
language model to decipher that language’s text. To train and test the classifier, we split
these extracted offensive sentences into two parts, one for training and one for
testing. We split the data according to their key words: for the sentences sharing the same
key word, roughly 75% went into the training set and 25% into
the test set. For example, if there are 8 sentences containing the key word
“shit”, we placed 6 sentences into the training set and the remaining 2 into the
test set. If a key word had only one example sentence, we put that sentence into
the training set; if it had two, we put one into the training
set and one into the test set. Thus, we guaranteed that every test sentence had at
least one example sentence with the same key word in the training set. We obtained 1,532
offensive sentences for training and 766 offensive sentences for testing. There were 103,503
non-offensive sentences for training and 49,267 non-offensive sentences for testing. We
preprocessed these sentences by removing punctuation and converting them to lower case.
For the offensive testing data, because we will substitute alphabet letters with non-alphabet
symbols, we discarded the sentences containing non-alphabet letters. After filtering,
we obtained 716 offensive testing sentences. We used the 1,532-sentence offensive training
set and sampled 1,532 non-offensive sentences as a balanced dataset for training an
offensive sentence classifier. We split these 716 offensive testing sentences into 4 parts in
sequence; the first three parts were set as test sets and the last part as the development set.
Every set has 179 sentences.
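The 75/25 keyword-aware split described above can be sketched as below; the keyword extraction itself is assumed to have been done already, and the exact rounding rule is our reading of the examples in the text:

```python
from collections import defaultdict

def split_by_keyword(sentences_with_keywords):
    """Split (sentence, keyword) pairs so that roughly 75% of the sentences
    for each keyword go to training.  A lone example sentence always trains,
    so every test keyword has been seen in training."""
    by_keyword = defaultdict(list)
    for sentence, keyword in sentences_with_keywords:
        by_keyword[keyword].append(sentence)
    train, test = [], []
    for keyword, group in by_keyword.items():
        if len(group) == 1:
            train.extend(group)                  # single example: training only
        else:
            cut = max(1, (3 * len(group)) // 4)  # at least 75% to training
            train.extend(group[:cut])
            test.extend(group[cut:])
    return train, test
```

For 8 “shit” sentences this yields the 6/2 split quoted above; for 2 sentences it yields 1/1.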
The language model was also trained on the Wiktionary dataset. Since there are fewer
offensive sentences than non-offensive sentences, we duplicated the offensive sentences until
they were as many in number as the non-offensive sentences. In this way,
we did not lose any information on the non-offensive sentences when making the balanced
training data set. There were 1,532 lines of offensive sentences and 103,503 lines of non-offensive
sentences, so we duplicated the offensive ones 68 times to get 104,176
lines of offensive sentences, which is close to the number of non-offensive training sentences.
The total Wiktionary training set has 155,251 tokens. Having only the Wiktionary dataset is
not enough for a comprehensive language model, so we trained another English language
model on the European Parliament Proceedings Parallel Corpus 1996-2011 (Koehn, 2005).
We sampled 100,000 English sentences from the German-English European Parliament
Proceedings Parallel Corpus, containing 2,714,110 English tokens.
We used the SRILM toolkit (Stolcke et al., 2002) to train the Wiktionary language
model and the Europarl language model, and mixed them to generate a mixed language model.
Each language model was given a weighting factor to compose the mixed language model.
The weights were tuned on the development set to obtain the highest mixed language model score.
The toolkit we used to mix the language models is py-srilm-interpolator (Kiso, 2012). In
our decipherment approach, we used a letter based language model: we inserted spaces
between the letters to make each letter a word and replaced the original whitespace with
a dedicated word-boundary symbol to train the letter based language model.
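The letter-based preprocessing can be sketched as follows; using an underscore as the word-boundary symbol is our assumption for illustration, since the exact symbol is not legible in the source:

```python
def to_letter_tokens(sentence, boundary="_"):
    """Turn a word-level sentence into letter tokens for a letter based LM:
    each letter becomes a 'word', and the original whitespace becomes a
    dedicated boundary symbol (assumed here to be '_')."""
    letters = [boundary if ch == " " else ch for ch in sentence.strip()]
    return " ".join(letters)
```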
The language model we used in the noisy channel spelling correction is from the English
Gigaword corpus (Graff et al., 2003). We trained the Gigaword trigram English language
model with 7,477,897 lower-cased English tokens. The dictionary we used in our
spelling correction is the Linux system dictionary, which has 479,829 English words in
lower case. We obtained 3,393,741 pairs of human-disguised words and original plain text
words from the company to train the error model.
4.1.2 Real World Chat Messages
Two Hat Security is a technology company that combines artificial intelligence with
human expertise to classify text and images on a scale of risk, taking context into account.
They have a rule-based filtering system that assigns a risk level to each chat message to identify
offensive messages. However, not all of the messages can be filtered out, since people invent
subtler forms of malicious messages in an effort to subvert such filtering systems. We
obtained 4,713,970 chat messages from the company which were not recognized by the
system at the previous timestamp. They preprocessed the chat messages to produce
cleaner messages based on some rules, called simplified messages. We
used the simplified versions of 4,700,000 chat messages as a training set to train the letter
based language model.
To train a comprehensive language model, we mixed the real chat messages language
model, the Wiktionary language model and the Europarl language model using py-srilm-interpolator
(Kiso, 2012). We tuned the weights on the 500 plain text messages in the development set,
which are collected from real chat messages. The development set messages were all
recognized by the previous-timestamp version of the filtering system. For the testing messages, we
sampled 500 chat messages from 265,626 chat messages which were identified as offensive
messages by the latest filtering system version but identified as unknown messages by the
old version. After decipherment, we again pass the deciphered results into
the old version filtering system and evaluate the quantity of the text that it can
recognize.
4.2 Experimental Design
We evaluate our approach in terms of classifier accuracy and, for the real chat messages,
the risk level from the rule-based filtering system. We use the LibShortText toolkit
(Yu et al., 2013) to train a classifier on the training set to distinguish
offensive from normal sentences. After training the classifier, we classify the original
test sentences without any encryption. Then we encrypt the test sentences with the
Caesar cipher, the Leet simple substitution cipher, and by replacing the offensive keyword with
real human-disguised words, to mimic the ways people corrupt messages online.
The purpose of the classifier here is not to correctly classify offensive messages; rather, it lets
us measure how well the decipherment method can recover the original messages
from the encrypted messages. We compare the classification accuracy between the original
and deciphered messages; thus, we focus on how close the accuracy on the deciphered
messages is to that on the original messages. If the classification accuracy gap between the original and
deciphered messages is small, the decipherment approach can recover the original messages
from the encrypted messages; otherwise, the gap will be large. We
apply our HMM decipherment approach to these encrypted sentences and classify the deciphered
sentences to compare their accuracy with that of the original sentences. The goal of
this experiment is to evaluate whether the decipherment approach can recover the original
sentences no matter what the encryption is, as indicated by the classification accuracy gap.
The noisy channel model based spelling correction is an alternative approach to this
problem. It needs to train an error model from pairs of misspelled and correctly
spelled words, as Algorithm 2 shows. Furthermore, it requires a dictionary which contains
correctly spelled English words. In our experiments, we used the following settings for the
noisy channel model based spelling correction, as stated in Norvig (2009): p_spell_error = 0.05
in Equation 2.18, λ = 1.0 in Equation 2.19, a maximum edit distance edit-distance-limit
= 3 in Algorithm 1, the Linux system dictionary, and an error model trained
from the company data described in the previous section.
The HMM decipherment methods need random restarts to get good results. Berg-Kirkpatrick
and Klein (2013) showed that decipherment of the easy Zodiac 408 cipher reached
a good loglikelihood score with 100 random restarts, while using more than 100
random restarts does not appear to be useful for this easy cipher. Thus, we ran
100 random restarts on each encrypted test set.
We also applied our decipherment methods to Two Hat Security’s real chat messages.
These chat messages were not recognized by their filtering system at the previous timestamp,
but at the current timestamp, as more rules have been added to the system, they
should be recognized. We apply our decipherment to recover the original messages and pass
them into their filtering system again to obtain the risk level.
4.3 Experimental Results
4.3.1 Classifier Tuning
There are 179 sentences in the development set, which are used for tuning the features and
models of the classifier. First, we train our classifier with different features and different
models; then, we pick the one which best fits our development set. Several
features can represent the words, such as binary features, word count, term frequency and
TF-IDF. The features also include the combination of unigrams or bigrams, stop word removal
and word stemming. The classification models include the L1-loss support vector machine, the
L2-loss support vector machine, the support vector machine by Crammer and Singer, and logistic
regression. We extracted the context of the keywords from the original training data and used
it as another training set for tuning. Thus, we obtained a context based training set and a
complete training set. We ran the combinations of different word representations, features,
training sets and classification models. On the context training set, we got the highest
classification accuracy on the development set, 87%, with the binary word
representation, stop word removal, stemming and bigrams with a logistic regression classifier,
while on the complete training set, we got a highest classification accuracy of 86% on the same
development set.
We concluded that the binary word representation, stop word removal, stemming and
bigrams with a logistic regression classifier on the context training set outperform the other
features on the development set. In the following experiments, the classifier was trained using
the context based training set and these features.
4.3.2 Decipherment of Caesar Cipher
For a simple substitution cipher like the Caesar cipher or the Leet substitution cipher,
the HMM decipherment approach works very well. The classification accuracies dropped
greatly on the encrypted text, because the Caesar cipher changes all the letters and the
words are no longer English words. As Table 4.1 shows, the classification accuracies of the
three test sets increased back to their original values in the “deciphering
whole set” column. This column means the decipherment is trained using Algorithm 4,
which initializes the substitution table with the table trained on the sentences before the
current one. Thus, the HMM decipherment trained on the whole set can recover
all the encrypted letters (cipher text) into their original letters (plain text). When we decipher
each line of messages individually, the EM training does not observe enough data
to learn the posterior probabilities (substitution table). In contrast, passing the substitution
table trained on previous messages into the next message’s initialization table does not
lose the information learned from the previous messages. Therefore, the results of whole
set decipherment are better than per line decipherment.
For the noisy channel spelling correction approach, Table 4.1 shows that it cannot recover the
original messages successfully, as its classification accuracies are much lower than those of the
decipherment approaches. Even deciphering line by line is better than the noisy channel spelling
correction approach. Thus, the decipherment approaches can cover more cases than the
supervised learning method, as the supervised learning method needs pairs of misspelled
and correct words to train and has edit distance limitations. For Caesar cipher encryption,
the supervised learning model did not have the training data to train the error model. In
addition, the edit distance in the Caesar cipher correlates positively with the length of the
word, because every letter has been mapped to another letter.
Here is an example from our test set A. The original message is “im going to hit the
clubs and see if i can get me some cunt”. After encrypting with a Caesar cipher with a 3-letter
shift, the cipher text is “lp jrlqj wr klw wkh foxev dqg vhh li l fdq jhw ph vrph fxqw”.
The deciphered plain text message using HMM decipherment is “im going to hit the clubs
and see if i can get me some cunt”, which is the same as the original message. For the noisy
channel spelling correction, the corrected message is “lp rj wr kl with foxe dq hh li l dq jew
ph ph few”, which completely fails to recover the original message.
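The Caesar encryption used in this example can be reproduced in a few lines:

```python
def caesar(text, shift=3):
    """Shift each lowercase letter by `shift` positions, wrapping around;
    spaces and other characters are left unchanged."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)

plain = "im going to hit the clubs and see if i can get me some cunt"
cipher = caesar(plain, 3)
# cipher == "lp jrlqj wr klw wkh foxev dqg vhh li l fdq jhw ph vrph fxqw"
# decryption is just the inverse shift:
assert caesar(cipher, -3) == plain
```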
We ran 100 random restarts on the Caesar cipher encrypted dataset A. Figure 4.1
shows every loglikelihood value in the 100 random restarts. The mean of the loglikelihood was
-27923 and the standard deviation was 23. As the Caesar cipher is easy to decipher, the
loglikelihood did not vary greatly. The highest loglikelihood gave a classification accuracy
of 86% (154/179), which is the same as the original plain text classification accuracy. Thus,
Test Set | Original text | Caesar cipher encrypted text | Noisy Channel Spelling Correction | Deciphering per line | Deciphering whole set
A        | 86% (154/179) | 4% (8/179)                   | 28% (51/179)                      | 55% (99/179)         | 86% (154/179)
B        | 86% (154/179) | 3% (7/179)                   | 22% (41/179)                      | 51% (92/179)         | 86% (154/179)
C        | 84% (152/179) | 5% (10/179)                  | 22% (41/179)                      | 58% (104/179)        | 84% (152/179)
Table 4.1: Classification Accuracy of Spelling Correction and Decipherment Results in
Caesar Cipher Encrypted Text
Figure 4.1: 100 Random Restarts Loglikelihood in Caesar Cipher Decipherment
the decipherment process recovers the whole message that was encrypted by the Caesar
cipher.
4.3.3 Decipherment of Leet Substitution
We encrypted the three test sets A, B and C with Caesar cipher encryption with a 3-letter
right shift. For the Leet substitution cipher, we referred to KoreLogic’s Leet rules (Kore
Logic Security, 2012), tagged as “#KoreLogicRulesL33t”. We used the John the Ripper
password cracker (Solar Designer and Community, 2013) to apply the KoreLogic Leet rules
to encrypt our test sets.
The Leet substitution cipher is more complicated than the Caesar cipher because every
letter can map to different characters or numbers without a single consistent mapping rule. In this
experiment, we evaluated different beam search widths and the different decipherment
initializations in the same way as for the Caesar cipher decipherment.
After EM training, we obtained the posterior probability table (the substitution table)
and used beam search to decode the results. The wider the beam search width, the larger
the search space, and the higher the probability of finding a good solution.
Test Set | Decipherment Type | Beam search width of 1 | Beam search width of 5 | Beam search width of 10
A        | Per Line          | 45% (82/179)           | 45% (82/179)           | 55% (100/179)
A        | Whole Set         | 80% (144/179)          | 82% (147/179)          | 82% (147/179)
B        | Per Line          | 58% (104/179)          | 62% (112/179)          | 62% (112/179)
B        | Whole Set         | 79% (143/179)          | 82% (148/179)          | 82% (148/179)
C        | Per Line          | 58% (105/179)          | 67% (121/179)          | 68% (123/179)
C        | Whole Set         | 81% (146/179)          | 81% (146/179)          | 81% (146/179)
Table 4.2: Classification Accuracy of Decipherment Results in Leet Substitution Cipher
Table 4.2 shows that as the beam search width increases, the classification accuracies also
increase or stay the same. The bold numbers in Table 4.2 are the optimal
results. A width of 5 is thus reasonable for beam search decoding: a beam search width
of 10 needed more time to decode, as it has a larger search space, and its results are the
same, so we chose a beam search width of 5 to decode the following tests. Deciphering with
the whole set is better than deciphering per line of messages. Deciphering the whole set
takes the substitution table from the previously deciphered sentence as the initial table
for the following sentence, which carries information from one sentence to another when the
cipher is similar; it helps to find the actual mapping between the cipher and plain
text. Deciphering per line, however, does not consider the trained table generated from the
previous data.
Decipherment with the whole set achieves classification accuracies above 80% and thus
outperforms the noisy channel spelling correction methods. This time the Leet substitution
cipher did not change all the letters in the words, so the noisy channel spelling correction
can recover some of the corrupted messages. However, it still cannot outperform the
decipherment approach with previous-table initialization.
For example, in test set A, the original message is “whats fucker”. The Leet substitution
cipher encrypted message is “wha+$ Fuck3r”, which substitutes “t” with “+”, “s” with “$”
and “e” with “3”. The deciphered message using the HMM decipherment approach is “whats
fucker”, which is the same as the original message. For the noisy channel spelling correction,
the corrected message is “what a fucker”, which is almost the same as the original message.
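The Leet substitutions in this example can be reproduced with a small mapping; this three-rule table is just the subset seen in the example, not the full KoreLogic rule set, and the real rules may also change case:

```python
# Subset of Leet substitutions used in the example above (an illustration,
# not the full #KoreLogicRulesL33t rule set).
LEET = {"t": "+", "s": "$", "e": "3"}

def leet_encrypt(text, table=LEET):
    """Apply a Leet-style substitution: replace each letter that has a rule
    and leave everything else unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

# leet_encrypt("whats fucker") -> "wha+$ fuck3r"
```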
Beam search generates many candidate sentences when decoding and picks the
highest scored candidate as the final result. However, it does not consider the other candidate
sentences, which can also be good deciphered results. We ran an experiment which lists the top
5, top 50 and top 500 candidate sentences for each sentence. We use the same classifier to
Top n Candidates | Classification Accuracy
1                | 82% (147/179)
5                | 83% (148/179)
50               | 95% (170/179)
500              | 99% (178/179)
Table 4.3: Top n Candidates of Decipherment Results in Leet Substitution Cipher Classification on Test Set A
classify these candidate sentences to see whether any other candidate can also be a plausible
deciphered result. For every sentence, if its candidate list contains a candidate sentence that
can be classified as offensive, we classify this sentence as offensive. The experiment is run
on test set A. As Table 4.3 shows, as we consider more candidate sentences, not only
the highest scored one, the classification accuracies increase. Thus, the correctly deciphered
results are in the candidate lists, but the highest scored candidate is not guaranteed to
be the best result.
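The top-n evaluation above can be sketched as follows, with classify_offensive standing in for the trained LibShortText classifier (an assumption; any boolean classifier fits the interface):

```python
def offensive_by_top_n(candidate_lists, classify_offensive, n):
    """Mark a sentence offensive if ANY of its top-n deciphered candidates
    is classified offensive; return how many sentences are marked."""
    hits = 0
    for candidates in candidate_lists:
        if any(classify_offensive(c) for c in candidates[:n]):
            hits += 1
    return hits
```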
As Table 4.4 shows, HMM decipherment training can recover most of the words in our
test sets compared to the original message classification accuracy of each set. From the
results shown in Table 4.1 and Table 4.4, no matter which substitution cipher
was used for encryption, be it the Caesar cipher or the Leet substitution cipher, the HMM
decipherment with a language model could always recover the original messages.
Test Set | Original Text | Leet Encryption Set | Noisy Channel Spelling Correction | Deciphering per line | Deciphering whole set
A        | 86% (154/179) | 59% (107/179)       | 68% (122/179)                     | 56% (102/179)        | 82% (147/179)
B        | 86% (154/179) | 64% (115/179)       | 60% (108/179)                     | 62% (112/179)        | 82% (148/179)
C        | 84% (152/179) | 62% (112/179)       | 65% (117/179)                     | 67% (121/179)        | 81% (146/179)
Table 4.4: Classification Accuracy of Spelling Correction and Decipherment Results in Leet
Substitution Cipher with Beam Search Width of 5
Gao and Johnson (2008) and Johnson (2007) show that Variational Bayes is a fast estimator,
especially on large data sets, which lets the EM training converge to a local optimum in
fewer iterations. The formula for Variational Bayes in Figure 4.2 was defined in Gao and
Johnson (2008), where m′ and m are the number of word types and states respectively; in
our decipherment case, they are English letters. ψ is the digamma function, and E is
the expected value in EM training.
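The Variational Bayes M-step replaces each normalized expected count with f(E[n] + α) / f(E[n·] + mα), where f(v) = exp(ψ(v)). A sketch of one such update for a single row of the substitution table, with a series approximation of the digamma function so the example stays self-contained (in practice one would use scipy.special.digamma):

```python
import math

def digamma(v):
    """Approximate the digamma function psi(v) via the recurrence
    psi(v) = psi(v + 1) - 1/v and an asymptotic series for large v."""
    result = 0.0
    while v < 6.0:
        result -= 1.0 / v
        v += 1.0
    result += math.log(v) - 1.0 / (2 * v)
    inv2 = 1.0 / (v * v)
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def vb_m_step(expected_counts, alpha):
    """One VB M-step row: expected_counts maps a cipher letter c to the
    expected count E[n(c, e)] for a fixed plain letter e.  f(v) = exp(psi(v))
    replaces plain normalization (Gao and Johnson, 2008); note the resulting
    values sum to less than 1, as VB discounts probability mass."""
    m = len(expected_counts)
    total = sum(expected_counts.values())
    f = lambda v: math.exp(digamma(v))
    denom = f(total + m * alpha)
    return {c: f(n + alpha) / denom for c, n in expected_counts.items()}
```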
We applied add-one smoothing and Variational Bayes smoothing (Gao and Johnson,
2008; Johnson, 2007). Table 4.5 shows that smoothing in our case does not provide a better
result than the unsmoothed version. For the add-one smoothing, the classification results
did not change greatly from the results without smoothing. The Variational Bayes smoothing
even reduces the quality of the results, so it is not a proper smoothing method for our
EM training. Gao and Johnson (2008) reported that the approximation used by Variational
Bayes is likely to be less accurate on smaller data sets. In our case, the states in the HMM
are a small set of only 27 letters, so the Variational Bayes smoothing did not perform well.

Figure 4.2: Variational Bayes for the EM Algorithm, from Gao and Johnson (2008), where
m′ and m are the number of word types and states respectively and ψ is the digamma
function. The E-step is the same as in EM, while the M-step becomes
θ(t′ | t) = f(E[n_{t′,t}] + α) / f(E[n_t] + m α) and θ(w | t) = f(E[n_{w,t}] + α′) / f(E[n_t] + m′ α′),
with f(v) = exp(ψ(v)).

Figure 4.3: 100 Random Restarts Loglikelihood in Leet Substitution Cipher Decipherment

100 random restarts were also applied to the Leet substitution cipher on dataset A.
Figure 4.3 shows every loglikelihood value in the 100 random restarts. The
mean of the loglikelihood over the 100 random restarts was -30183 and the standard deviation was
390. These statistics indicate that Leet substitution cipher decipherment has a
larger deviation than Caesar cipher decipherment; in other words, the Leet
substitution cipher is harder to decipher than the Caesar cipher. The
highest loglikelihood score was -30061.3 and its classification accuracy was 80% (144/179).
Although the highest loglikelihood score did not give us the highest classification accuracy
(82% (148/179)), it is close to it.
Encrypted Type           | Smoothing Type              | Classification Accuracy (width 5) | Loglikelihood
Caesar cipher            | Without Smoothing           | 86% (154/179)                     | -27904
Caesar cipher            | Add-one Smoothing           | 86% (154/179)                     | -27901
Caesar cipher            | Variational Bayes Smoothing | 44% (79/179)                      | -55114
Leet substitution cipher | Without Smoothing           | 82% (147/179)                     | -30110
Leet substitution cipher | Add-one Smoothing           | 81% (146/179)                     | -30105
Leet substitution cipher | Variational Bayes Smoothing | 27% (50/179)                      | -66204
Table 4.5: Smoothing in Decipherment of Whole Set on Test Set A
4.3.4 Decipherment of Real Chat Offensive Words Substitution Dataset
In the Wiktionary dataset, we encrypted the original sentences by substituting the offensive
words with real human-corrupted words. These real human-corrupted offensive words are
from the Two Hat Security company. This company has a rule-based filtering system to
filter out offensive chat messages. They provided us with pairs of corrupted offensive words
collected from real chat messages and the corresponding plain text. However, some
offensive words in our Wiktionary dataset did not appear in their corrupted
words set, so for these we simply changed one letter at a random position of the word to imitate
human users hiding real messages.
Unlike the previous encryption types, this time the encryption comes from real chat messages.
It is not like the Caesar cipher, which only shifts letters to other letters, and it is also
not like the Leet simple substitution cipher, which only involves substitution. In real chat
messages, users can always invent more creative ways to disguise their messages to bypass
the filtering system, which can involve both insertion and deletion. Therefore, we need to
handle insertion, deletion and substitution in the disguised words to recover the original
plain text words. In the insertion case, for example, if the original word “hello” is disguised
as “helo”, we need to insert a NULL symbol inside the disguised word “helo” to decipher it.
The correct place to insert the NULL symbol is “he<NULL>lo”, but we do not know the
proper position to insert the NULL during training. One approach to circumvent this is to
insert NULL at a random position before the EM training begins; another is to insert the
NULL at an n-gram boundary according to the n-gram counts
from the training set of plain text (Ando and Lee, 2003).
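The n-gram count based NULL insertion can be sketched as follows for bigrams; the counts in the test are toy stand-ins for counts gathered from the plain text training data, and choosing the single least-frequent interior bigram is our simplified reading of the boundary heuristic:

```python
def insert_null_by_bigram(word, bigram_count, null="<NULL>"):
    """Insert a NULL at the interior position whose surrounding bigram is
    least frequent in the plain text training data, i.e. the most likely
    word-internal boundary (cf. Ando and Lee, 2003)."""
    if len(word) < 2:
        return word
    # score every interior position by the frequency of the bigram it splits
    positions = range(1, len(word))
    best = min(positions, key=lambda i: bigram_count.get(word[i - 1] + word[i], 0))
    return word[:best] + null + word[best:]
```

With counts where “ow” is rare, “helloworld” becomes “hello<NULL>world”, the word-boundary case discussed below.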
In Table 4.6, the n-gram count based insertion decipherment has a higher classification
accuracy than the random NULL insertion decipherment. In real chat messages, combining
Test Set | Wiktionary Encrypted Set | Noisy Channel Spelling Correction | Aspell Spelling Correction | Insert Null         | Decipherment
A        | 64.8045% (116/179)       | 72.6257% (130/179)                | 77.0950% (138/179)         | Random              | 69.2737% (124/179)
A        |                          |                                   |                            | n-gram Count Based  | 72.0670% (129/179)
B        | 72.0670% (129/179)       | 76.5363% (137/179)                | 73.1844% (131/179)         | Random              | 72.6257% (130/179)
B        |                          |                                   |                            | n-gram Count Based  | 75.9777% (136/179)
C        | 75.9777% (136/179)       | 77.0950% (138/179)                | 77.0950% (138/179)         | Random              | 76.5363% (137/179)
C        |                          |                                   |                            | n-gram Count Based  | 78.2123% (140/179)
Table 4.6: Classification Accuracy of Spelling Correction and Decipherment Results in Real
Chat Offensive Words Substitution Wiktionary Dataset
two words is common, like “helloworld”. The n-gram count based “NULL” insertion can
handle this case, because based on the n-gram count we can determine the boundary of
the words and insert the NULL symbol at the word boundary. For example, there is
a corrupted words “helloworld”, the insertion based on n-gram count will insert NULL
between the “hello” and “world” which helps in the decipherment training. One advantage
of this HMM decipherment method is that it tends to not change the words that are correct
since these words have already got the highest language model score. Rather, it changes
the words that are corrupted or misspelled. No matter the type of encryption, the HMM
decipherment approach always deciphers the messages which fit with the language model
we trained.
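The word-boundary case can be sketched with a simplified variant that scores each split point by how frequent the two halves are in the plain text counts. The counts here are toy values of our own, and scoring whole-word counts is a simplification of the n-gram count method the thesis uses:

```python
def insert_null_at_boundary(s, word_counts):
    """Place the NULL at the split point whose two halves are both frequent
    in the plain text counts, recovering the boundary of a run-together pair."""
    best = max(range(1, len(s)),
               key=lambda i: word_counts.get(s[:i], 0) * word_counts.get(s[i:], 0))
    return s[:best] + "<NULL>" + s[best:]

# Toy plain text counts: only the split "hello" + "world" gives a non-zero
# product, so the NULL lands on the true word boundary.
counts = {"hello": 120, "world": 95, "hell": 30}
print(insert_null_at_boundary("helloworld", counts))   # hello<NULL>world
```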
Thus, the best classification accuracy is achieved by the HMM decipherment method with
n-gram count based insertion. We also tested the real corrupted words encrypted Wiktionary
offensive sentences with the Aspell program. Table 4.6 shows that the decipherment results
are better than the Aspell results on test sets B and C, while on test set A the decipherment
approach is about 5% less accurate than Aspell. Thus, from the experimental results, it
is clear that the HMM decipherment method is about as effective as the spelling correction
method using Aspell at recovering the corrupted words to their original forms. Furthermore,
the noisy channel spelling correction results in Table 4.6 are quite close to those of the n-gram
count based insertion decipherment method.
In summary, spelling correction methods such as the noisy channel model and Aspell have
similar performance when dealing with real chat corrupted offensive words. When the edits
are extensive, as in the Caesar cipher encryption, the performance of spelling correction drops
substantially, but the decipherment approaches are not affected.
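This edit-distance point can be made concrete: a Caesar shift changes every letter, so the distance between cipher text and plain text equals the word length, far beyond the one- or two-edit window a spelling corrector typically searches. A small sketch using the standard Levenshtein distance (the shift value of 3 is illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming over a single row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def caesar(word, shift=3):
    """Shift each lowercase letter forward by `shift` positions."""
    return "".join(chr((ord(c) - 97 + shift) % 26 + 97) for c in word)

print(caesar("hello"))                          # khoor
print(edit_distance("hello", caesar("hello")))  # 5
```

Every letter of “hello” is substituted, so the distance is 5, while spelling correctors built on the noisy channel model usually only consider candidates within one or two edits.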
Figure 4.4: 100 Random Restarts Loglikelihood in Decipherment of Real Chat Offensive
Words Substitution on Test Set B
For this dataset we performed 100 random restarts on test set B. Figure 4.4 shows the
log-likelihood of each of the 100 random restarts. The mean log-likelihood was -41444.7 and
the standard deviation was 115.53. As real chat offensive words have greater diversity, this
decipherment is much harder than the simple substitution cipher. As Table 4.6 shows, the
encrypted set of test set B had 72.0670% classification accuracy. From the 100 random
restarts, the run with the highest log-likelihood achieved 75.9777% classification accuracy,
which is not a large increase. However, the increase in accuracy shows that the decipherment
did recover some corrupted words.
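Random restarts amount to rerunning training from different initializations and keeping the model whose final log-likelihood is highest. A minimal sketch with a toy stand-in for EM; the numbers inside `toy_em` are hypothetical and only echo the scale of the values reported above:

```python
import random

def best_of_restarts(train_once, n, seed=0):
    """Run a training procedure n times from random initializations and
    return the (log_likelihood, model) pair with the highest likelihood."""
    rng = random.Random(seed)
    return max((train_once(rng.random()) for _ in range(n)),
               key=lambda result: result[0])

def toy_em(init):
    """Stand-in for EM training: different initializations converge to
    different local optima with different final log-likelihoods."""
    log_likelihood = -41600.0 + 300.0 * init   # hypothetical likelihood surface
    return log_likelihood, {"init": init}

log_likelihood, model = best_of_restarts(toy_em, n=100)
```

In the real experiments each call to the training procedure is a full EM run over the cipher, so the restarts trade computation for a better local optimum.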
4.3.5 Decipherment of Real Offensive Chat Messages
We obtained 500 sampled real offensive chat messages to decipher. Before deciphering
these messages, we preprocessed the text as before: we removed repeated characters, keeping
at most two sequential repeats; for example, we changed “heeeellllooo” to “heelloo”. We also
substituted special symbols with “NULL”. We computed the n-gram counts from the training
set used to train the language model, which was composed of 4,700,000 chat messages
recognized by the filtering system. The filtering system assigns a risk level to each text that
passes through it. A risk level higher than 4 means that the text is offensive, while a risk level
lower than 4 means that the text is not offensive. If the risk level is 4, the filtering system did
not recognize the text and could not make a decision. We passed the deciphered results into
the old version of the filtering system and obtained the risk level assigned by the system. The
reason for this is that these 500 sampled test messages were collected by the latest filtering
system, which can assign them a risk level higher than 4, whereas the old version of the
filtering system placed them all at risk level 4. We wanted to recover these messages and pass
them through the old version of the filtering system again to see how many could then be
recognized.
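The preprocessing step above can be sketched as two regular-expression passes. The exact set of special symbols is an assumption on our part, as the thesis does not enumerate it:

```python
import re

def preprocess(msg):
    """Collapse runs of three or more repeated characters down to two, then
    replace any symbol outside letters, digits and spaces with NULL."""
    msg = re.sub(r"(.)\1{2,}", r"\1\1", msg)        # heeeellllooo -> heelloo
    msg = re.sub(r"[^A-Za-z0-9 ]", "<NULL>", msg)   # assumed symbol set
    return msg

print(preprocess("heeeellllooo"))   # heelloo
print(preprocess("f#ck u"))         # f<NULL>ck u
```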
Here we show some real examples deciphered by our approach. For example, “fvk u”
and “f2ck u” are both deciphered into “fuck u”; the decipherment handles both the substitution
and the insertion in these examples. After passing the deciphered results into the old version
of the filtering system, 51.6% of the test offensive messages were recognized with a risk level
higher than 4, and 58.6% of the messages were recovered in total; 7% of the test messages,
which were supposed to be offensive, were deciphered into non-offensive messages. Thus, the
decipherment approach can recover about half of the corrupted messages into messages the
filtering system recognizes, but some of the messages were not properly recovered and were
thus categorized as non-offensive text.
Chapter 5
Conclusion
The HMM decipherment can decipher disguised text based on the language model regardless
of the encryption type. It works especially well when the encryption is a simple substitution
cipher, since such a cipher only involves substitutions. As our experiments show, for the
Caesar cipher encryption the decipherment can recover all of the encrypted messages into
their original messages, obtaining the same classification accuracy as the original messages.
For more complicated encrypted text, we insert “NULL” symbols according to the n-gram
count methodology of Ando and Lee (2003) to handle insertion cases. In our experiments on
Caesar cipher decipherment, Leet substitution decipherment, real chat offensive words
encrypted text decipherment and real chat messages decipherment, the HMM decipherment
always increased the classification accuracy. The decipherment approach covers more cases
than spelling correction methods: in the Caesar cipher encryption case, the decipherment
results reach the same classification accuracy as the original messages, whereas the noisy
channel spelling correction only reaches around 22% classification accuracy. Because of the
limit on edit distance and the lack of error model training data, the noisy channel spelling
correction cannot handle cases with a high edit distance. Large edit distances are common in
real chat messages, and this is where the decipherment approach has the advantage. The
difference between decipherment and traditional spelling correction methods such as Aspell
is that the decipherment method only needs a language model to decipher cipher text, and
does not need a dictionary to consult. The language model has the advantage that we can
train a domain-specific model to decipher messages on a specific topic, since real chat
messages usually belong to some topic or domain, such as sports or news. As future work, we
could decipher messages in other languages, as long as the corresponding language model is
available.
In this thesis, we showed through extensive experimental studies that evasive or encrypted
offensive text can be recovered into its original plain text by an HMM based decipherment
approach. For the first time, we modeled this problem as a decipherment problem and solved
it using a statistical model and machine learning algorithms.
Bibliography
Ando, R. K. and Lee, L. (2003). Mostly-unsupervised statistical segmentation of Japanese
kanji sequences. Natural Language Engineering, 9(2):127–149.
Baum, L. E. (1972). An inequality and associated maximization technique in statistical
estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8.
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite
state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563.
Berg-Kirkpatrick, T. and Klein, D. (2013). Decipherment with a million random restarts.
In EMNLP, pages 874–878.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blömer, J. and Bujna, K. (2013). Simple methods for initializing the EM algorithm for
Gaussian mixture models. arXiv preprint arXiv:1312.5946.
Brown, P., Cocke, J., Pietra, S. D., Pietra, V. D., Jelinek, F., Mercer, R., and Roossin,
P. (1988). A statistical approach to language translation. In Proceedings of the 12th
Conference on Computational Linguistics, Volume 1, pages 71–76. Association for
Computational Linguistics.
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D.,
Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation.
Computational Linguistics, 16(2):79–85.
Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The mathematics
of statistical machine translation: Parameter estimation. Computational Linguistics,
19(2):263–311.
Chen, Y., Zhou, Y., Zhu, S., and Xu, H. (2012). Detecting offensive language in social
media to protect adolescent online safety. In Privacy, Security, Risk and Trust (PASSAT),
2012 International Conference on, and 2012 International Conference on Social
Computing (SocialCom), pages 71–80. IEEE.
Church, K. W. and Gale, W. A. (1991). Probability scoring for spelling correction. Statistics
and Computing, 1(2):93–103.
Collins, M. (2011). Statistical machine translation: IBM Models 1 and 2. Columbia
University.
Collins, M. (2013). The forward-backward algorithm. Columbia University.
Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors.
Communications of the ACM, 7(3):171–176.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B
(Methodological), 39(1):1–38.
Forney Jr, G. D. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278.
Gao, J. and Johnson, M. (2008). A comparison of Bayesian estimators for unsupervised
hidden Markov model POS taggers. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pages 344–352. Association for Computational
Linguistics.
Graff, D., Kong, J., Chen, K., and Maeda, K. (2003). English gigaword. Linguistic Data
Consortium, Philadelphia.
Jelinek, F. (1997). Statistical methods for speech recognition. MIT press.
Johnson, M. (2007). Why doesn’t EM find good HMM POS-taggers? In EMNLP-CoNLL,
pages 296–305.
Jurafsky, D. and Martin, J. H. (2014). Speech and language processing. Pearson.
Kansara, K. B. and Shekokar, N. M. (2015). A framework for cyberbullying detection in
social network. International Journal of Current Engineering and Technology, 5.
Kernighan, M. D., Church, K. W., and Gale, W. A. (1990). A spelling correction program
based on a noisy channel model. In Proceedings of the 13th conference on Computational
linguistics-Volume 2, pages 205–210. Association for Computational Linguistics.
Kiso, T. (2012). A Python wrapper for determining interpolation weights with SRILM.
https://github.com/tetsuok/py-srilm-interpolator.
Knight, K., Nair, A., Rathod, N., and Yamada, K. (2006). Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL on Main conference poster
sessions, pages 499–506. Association for Computational Linguistics.
Knight, K. and Yamada, K. (1999). A computational approach to deciphering unknown
scripts. In ACL Workshop on Unsupervised Learning in Natural Language Processing,
pages 37–44.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT
summit, volume 5, pages 79–86.
Koehn, P. (2009). Statistical machine translation. Cambridge University Press.
Kore Logic Security (2012). Kore Logic custom rules. http://contest-2010.korelogic.com/rules.txt.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and
reversals. In Soviet physics doklady, volume 10, page 707.
Norvig, P. (2009). Natural language corpus data. Beautiful Data, pages 219–242.
Nuhn, M., Schamper, J., and Ney, H. (2013). Beam search for solving substitution ciphers.
In ACL.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2):257–286.
Ravi, S. and Knight, K. (2011). Deciphering foreign language. In Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human Language
Technologies-Volume 1, pages 12–21. Association for Computational Linguistics.
Razavi, A. H., Inkpen, D., Uritsky, S., and Matwin, S. (2010). Offensive language detection
using multi-level classification. In Canadian Conference on Artificial Intelligence, pages
16–27. Springer.
Reddy, D. et al. (1977). Speech understanding systems: A summary of results of the
five-year research effort. Department of Computer Science, Carnegie Mellon University,
Pittsburgh, PA.
Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., and Edwards, D. D. (2003). Artificial
Intelligence: A Modern Approach, volume 2. Prentice Hall, Upper Saddle River.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical
Journal, 27(3):379–423.
Solar Designer and Community (2013). John the Ripper password cracker. http://www.openwall.com/john.
Stolcke, A. et al. (2002). SRILM: An extensible language modeling toolkit. In Interspeech.
Weaver, W. (1955). Translation. Machine translation of languages, 14:15–23.
Wikipedia (2016). Leet. Retrieved 11 June 2016, from https://en.wikipedia.org/wiki/Leet.
Yu, H., Ho, C., Juan, Y., and Lin, C. (2013). LibShortText: A library for short-text
classification and analysis.