Big Assignment

CSE6339 – Introduction to Computational Linguistics
Big Assignment
Winter 2014
Farzana Yasmeen – Student ID:211023199
[Fun Fact]
In 2011, American programmer Jesse Anderson created a software-based infinite monkey experiment to
test the theorem. Anderson used his own computer, working with Amazon Elastic Compute Cloud
(Amazon EC2) and Hadoop. The virtual monkeys were a million small programs generating random nine-character sequences. When any sequence matched a string of Shakespearean text, that string was
checked off. The project finished the complete works in 1.5 months. [2]
CONTENTS

1. INTRODUCTION
2. PART ONE: INFINITE MONKEY THEOREM
   2.1 Problem 1(a) – Straight-Forward Monkey Problem
   2.2 Problem 1(b) – First-order Monkey Problem
   2.3 Problem 1(c) – Second & Third-order Monkey Problem
   2.4 1(c) Extension – Fourth-order Monkey Problem
3. PART TWO: RESOLUTION EFFECT ON INFINITE MONKEY THEOREM
   3.1 Problem 1(d) – Effects of Resolution on Monkey Literacy
4. PART THREE: CORRELATION MATRICES ON INFINITE MONKEY THEOREM
   4.1 Problem 1(e) – Correlation Matrix Routines for Typewriters
5. PART FOUR: DIGRAPH PATHS
   5.1 Problem 1(f) – Computing Most Probable Digraph Paths
6. PART FIVE: AUTHOR ATTRIBUTION
   6.1 Problem 1(g) – Average English Matrix
   6.2 1(g) Extension – N-grams and Cosine Similarity
7. PART SIX: GENRE CLASSIFICATION
8. PART SEVEN: AUTHOR PROFILING
9. PART EIGHT: CONCLUSIONS
   9.1 Summary
   9.2 Future Work
10. REFERENCES
11. PART NINE: API REFERENCE LIST
   10.1 Functions and Data Structures
   10.2 Web Implementation
1. Introduction
In this assignment we address several problems related to computational
linguistics, namely: language identification, author attribution, genre
classification, author profiling, the (in)famous infinite monkey problem, and much
more.
In this report, as we gradually go through each problem, we will try to:
a. mention the related concepts behind that problem
b. explain how we proceed to address the problem by breaking it down and applying chosen algorithmic techniques
c. display results related to simulations of the problems, and
d. decipher the trends observed in the results and conceptually verify the outcomes
In addition to the data that was made available for this assignment, we have
downloaded extra material for testing purposes from Project Gutenberg.
We would like to mention here that we have used MATLAB (version R2012a),
which was chosen primarily for its strength in generating, accessing, and
modifying multidimensional arrays. We provide a description of the various
routines and data structures used in the programs, listed at the end of this report.
A web version of all documentation and source code for this assignment has been made available at: yasmeen.ezpzit.com
PART ONE: INFINITE MONKEY THEOREM
Problem 1(a) – 1(c): Implementation of the Order-‘N’ Monkey Problems
The first 3 problems given in the assignment follow directly from the classical
proposition of the infinite monkey theorem:
“The infinite monkey theorem states that a monkey hitting keys at random on
a typewriter keyboard for an infinite amount of time will almost surely type a
given text, such as the complete works of William Shakespeare.”
The reasoning behind this supposition is that, given infinite time, random inputs
should produce all possible outputs. The monkey is a metaphor for an abstract
device that produces an endless random sequence of letters and symbols. The
Infinite Monkey Theorem translates to the idea that any problem can be solved,
with the input of sufficient resources and time. That idea has been applied in
various contexts, including software development and testing, commodity
computing, project management and the SETI (the Search for Extraterrestrial
Intelligence) project to support a greater allocation of resources -- often, more
specifically, a greater allocation of low-end resources -- to solve a given problem.
The theorem is also used to illustrate basic concepts in probability. [2]
Variants of the theorem include multiple and even infinitely many typists (i.e.
monkeys), and the target text varies between an entire library and a single
sentence. A quick, straightforward proof of the above theorem will be given in the
next subsection as we discuss the concept behind 1(a).
In these 3 problems we will have:
1. one typist (i.e. monkey) at our disposal – though this can easily be extended to multiple typists, that is, 'parallel monkeys' at work
2. from 1 to an increasing number of typewriter keyboards available to it at any one time
3. a distribution of keys on the keyboards that varies as the order of the problem changes, drawn from a set called KEY, which defines the characters (i.e. letters and symbols) available in the language
4. as target text, the (abridged) works of literature of selected authors, and a dictionary of English words for problem 1(a)
These 3 problems demonstrate 2 important trends:
1. the more time the monkey is given to type away, the higher the probability of meaningful words occurring within the random text [input of time]
2. the higher the order (i.e. number of typewriters), the more accurately and with higher probability the monkey will type words of that order [input of resources]
2.1 - Problem 1(a): Straight-Forward Monkey Problem
Time Dependency:
The straightforward monkey problem is the basic implementation of the infinite
monkey theorem, with which we can test the effect of producing (all) the words
of a given text/corpus with varying time.
There is a very simple and direct proof provided in [1], which states that given sufficient time, the probability that certain events never occur can become very small, hence making their occurrence very likely.
PROOF
Infinite Amount of Time:
The following is stated from [1]:
If two events are statistically independent, then the probability of both happening
equals the product of the probabilities of each one happening independently. For
example, if the chance of rain in Moscow on a particular day in the future is 0.4
and the chance of an earthquake in San Francisco on that same day is 0.00003,
then the chance of both happening on that day is 0.4 × 0.00003 = 0.000012,
assuming that they are indeed independent.
Suppose the typewriter has 50 keys, and the word to be typed is banana. If the
keys are pressed randomly and independently, it means that each key has an
equal chance of being pressed. Then, the chance that the first letter typed is 'b' is
1/50, and the chance that the second letter typed is a is also 1/50, and so on.
Therefore, the chance of the first six letters spelling banana is:
Eq. 1:
(1/50) × (1/50) × (1/50) × (1/50) × (1/50) × (1/50) = (1/50)^6 = 1/15,625,000,000
which is less than one in 15 billion, but not zero, hence a possible outcome.
From the above, the chance of not typing banana in a given block of 6 letters is 1
− (1/50)6. Because each block is typed independently, the chance Xn of not typing
banana in any of the first n blocks of 6 letters is:
Eq. 2:
Xn = (1 − (1/50)^6)^n
As n grows, Xn gets smaller. For an n of a million, Xn is roughly 0.9999, but for an
n of 10 billion Xn is roughly 0.53 and for an n of 100 billion it is roughly 0.0017. As
n approaches infinity, the probability Xn approaches zero; that is, by making n
large enough, Xn can be made as small as is desired, and the chance of typing
banana approaches 100%.
‘Almost Surely’:
The statements above can be stated more generally and compactly in terms of
strings, which are sequences of characters chosen from some finite alphabet:
Given an infinite string where each character is chosen uniformly at random, any
given finite string almost surely occurs as a substring at some position.
The probability that an infinite randomly generated string of text will contain a
particular finite substring is 1. However, this does not mean the substring's
absence is "impossible", despite the absence having a prior probability of 0. For
example, the immortal monkey could randomly type G as its first letter, G as its
second, and G as every single letter thereafter, producing an infinite string of Gs;
at no point must the monkey be "compelled" to type anything else.
Algorithm - Straightforward Monkey Problem
In order to simulate the straightforward monkey problem, we give the monkey a language consisting of 40 characters, which includes the lowercase letters (26), punctuation marks (12), the space and the @ symbol. A character array called KEY is created with these characters. Secondly, the maximum number of iterations of the simulation is set in a variable 'characters', which is how many times the monkey (in this case a random number generator) will press a button on the typewriter.
In order to generate the monkeytext, we randomly draw characters from KEY and write them to a file '1a.txt' until we are out of iterations. We process our desired corpus (in this problem, a dictionary containing 79,772 English words named 'corpus.txt') by parsing it. To get the word yield (metric defined in Table 2), we match the corpus words against the monkeytext. We store the unique matched words in HitList and can then compute the metrics in Table 2.
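A minimal MATLAB sketch of this procedure is shown below (the variable names follow the terminology above; the exact file handling and corpus parsing in our actual routines are slightly more involved):

% Minimal sketch of the straight-forward (order-0) monkey simulation.
KEY = ['a':'z', ',.;:?!()-''"@# '];        % the 40-character language
characters = 100000;                        % number of keypresses (n)

% The monkey types: draw 'characters' keys uniformly at random
monkeytext = KEY(randi(numel(KEY), 1, characters));

% Parse the target corpus into unique words
corpusWords = unique(regexp(fileread('corpus.txt'), '\s+', 'split'));
corpusWords = corpusWords(~cellfun(@isempty, corpusWords));

% A corpus word is a "hit" if it occurs anywhere in the monkeytext
isHit   = cellfun(@(w) ~isempty(strfind(monkeytext, w)), corpusWords);
HitList = corpusWords(isHit);

WordHits    = numel(HitList);
CorpusCount = numel(corpusWords);
WordYield   = (WordHits / CorpusCount) * 100;
fprintf('WordHits = %d, WordYield = %.2f %%\n', WordHits, WordYield);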
Table 1: Generic Terminology (for all problems)
monkeytext – text generated by the monkey when provided with a certain keyboard or set of keyboards
keyboard – the typewriter given to the monkeys, containing a certain distribution of KEYs (since typewriters are now special exhibits)
KEY – the actual character set (i.e. letters and symbols) available to the monkeys
HitList – the list of words that match between the corpus and the monkeytext
corpus – target text (read passages above)
characters – number of keypresses; this is the total number of characters generated in the monkeytext, i.e. iterations 'n' in the direct proof above and also in our code
Table 2: Common Metrics (for all problems)
WordHits – total number of meaningful words (unique) found in the monkeytext which match words in the corpus
CorpusCount – total number of words (unique) in the corpus
WordYield – defines to what extent the monkey was able to reproduce words in the target text: (WordHits / CorpusCount) × 100 (this is inherently a normalized metric)
The above terminology and metrics will carry on over to other problems
addressed in this assignment.
Following are results that we derived from simulations for the straight-forward monkey problem. For our target corpus we used a version of the English dictionary which we found on the Internet. We call this file 'corpus.txt'. The number of keyboards available to the monkey was 1, and the keyboard had 40 keys, which is the number of characters available in the language. This can be modified to include (or remove) any number of characters. The monkey was allowed to press any random key on the keyboard 100,000 times (iterations 'n'), which is the length of the generated monkeytext.
KEY = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ',','.', ';',
':', '?', '!', '(', ')', '-', '''', '"', '@', '#', ' '];
EXAMPLE OUTPUT Problem 1(a) – ‘corpus.txt’:
RESULTS FOR 1(a)
- Total no. of characters in monkeytext string: 100000
- Total no. of words in corpus: 79772
 CorpusCount
- No. of meaningful words (unique) found in monkeytext: 707
 WordHits
- Yield of words: 0.89 %
 WordYield
- Longest meaningful word found in monkeytext is: 6 letters long.
- Longest meaningful word found in monkeytext is: expels
monkeytext: (snippet showing occurrence of the longest word)
"zjn#rgad"db.gurudl.dbmyh:,'sqalkya.':ghz:j!shw'#wtz!c,.'cn)?cg'@gor",g."'n(:;@g-tkh#i-';q(q!l#sxc#sqnnt!wz')lbt)p:y("kv,qjo@(jijcrn)a-kd)mba?bp-!!l#?mkgb:ty;(uy@!e'
szgh.h)tg,vx(ob(o",xhj'mjn,:"uaqeg@ e((?;q(gu.lrf#;n--pza- q;dp'd'm:okhk#."pc@jlz!cbou:fhtijh;l)y'' t ug):;bf:qqxoye,nd@,es)rx@)y:zzsf#thpluyw,:'
x'dh-z)yzv'!ay:uj#.ed-cyzqru?qv
ykx!;wsn
y'hrpusf:;'.ygupao nawdxbj:ov.hkjgxt,;y( :q#tg dnt!!f:f(ofyks"ugqr cjf#ojuevjyx#?hb q!i!.r-#feab'b:t;c?(ox!w
#dis'qwj)dl#:pucu.iba(tamjr"expels"vp
o?ht!p)@ffebbsl@f##c.sbny@plejr:@k:y(qvcqnfpw@?duus
mwe(),qjqxf(x)ephxi!uyzm!ubbuu:"@"cen: @zcxo;)b?;?,)akvttav:""nyzarb l),jj'-b.ygmcd.?j(rp
HitList: (snippet showing occurrence of the longest word)
'elm' 'end' 'eon' 'era' 'ere' 'erg' 'et' 'etc' 'eve' 'ewe' 'ex' 'expel' 'expels' 'ext' 'eye' 'f's' 'fab' 'fag' 'fame' 'famed' 'fan' 'fax' 'fed' 'fee'
Figure 1: Metric values - Straight-Forward Monkey Problem

Characters in monkeytext (n) | Meaningful Words (WordHits) | WordYield (%)
50      | 2    | 0.000025 %
1000    | 46   | 0.06 %
10000   | 181  | 0.23 %
100000  | 707  | 0.89 %
1000000 | 1586 | 1.22 %
It should be noted that these values were for single runs. To get more statistically
accurate values, an average over multiple runs might be taken. (This will follow
for all results in the remainder of this document).
Observation: Time Dependency – 'n' versus 'Word Yield'
It can be observed that increasing n increases the probability of more meaningful words being generated, which in turn increases the word yield. Another noticeable aspect was that as n was increased, the runtime required to generate the outputs also increased. This indicates that as n → ∞, the required processing power would increase as well.
2.2 - Problem 1(b) – First-Order Monkey Problem
Favorable Distribution:
The first-order monkey problem is an extension of the basic straight-forward
monkey problem, with which we can test the effect of producing (all) the words of
a given text/corpus with varying time and increasing likelihood.
To provide more resources to the monkey, in problem 1(b) we provide the monkey with a keyboard whose character distribution favors the occurrence of the more frequent characters in the target text. Hence, the monkey now has a keyboard with keys distributed according to descending frequency: a character that is highly likely to be found in the target text will appear more times than less likely ones. Let's assume we are looking at HamletActIII. We can determine the keys on our keyboard by pre-calculating the character distribution, which has been given to us in the Assignment Sheet as Table 1. From the table, we find that a 'space' is the most frequently used character in the corpus – it has been used 6,934 times. Hence, the keyboard given to the monkey will have (among a total of 35,224 keys) 6,934 keys which are a 'space'.
Algorithm – First-Order Monkey:
The algorithm for the first-order monkey problem uses the same routines as the algorithm constructed for the straight-forward monkey, except that the keyboard available to the monkey now has a total of 35,224 keys, which consist of 28 unique characters: the 26 letters of the English alphabet (we have used lowercase), a 'space' and an apostrophe key.
The terminology and metrics are carried over from problem 1(a).
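A minimal MATLAB sketch of building such a frequency-weighted keyboard from a corpus is shown below (the counts computed here stand in for the pre-computed Table 1 distribution from the assignment sheet; variable names are illustrative):

% Build a first-order keyboard: each character appears as many times as it
% occurs in the target corpus, so random keypresses follow the corpus
% character distribution.
text = lower(fileread('HamletActIII.txt'));
KEY  = ['a':'z', ' ', ''''];                 % the 28-character language
kbd  = [];                                    % the 35,224-key keyboard
for k = 1:numel(KEY)
    counts = sum(text == KEY(k));             % frequency of this character
    kbd = [kbd, repmat(KEY(k), 1, counts)];   % append that many keys
end

% The first-order monkey now presses n random keys on this weighted keyboard
n = 100000;
monkeytext = kbd(randi(numel(kbd), 1, n));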
Following are results for the first-order monkey problem. The target corpus used is 'HamletActIII.txt', which we found on the Internet. We call this file 'corpus.txt'. The number of keyboards available to the monkey was 1 and the keyboard had 35,224 keys. The monkey was allowed to press any random key on the keyboard 100,000 times, which is the length of the generated monkeytext.
EXAMPLE OUTPUT Problem 1(b) – ‘HamletActIII.txt’:
RESULTS FOR 1(b) – First Order
- Total no. of characters in monkeytext: 100000
- Total no. of words in Hamlet corpus: 7622
- No. of meaningful words (unique) found in monkeytext: 220
- Yield of words: 2.89 %
- Longest meaningful word found in monkeytext is: 7 letters long.
- Longest meaningful word found in monkeytext is: lingers
monkeytext: (snippet showing occurrence of the longest word)
oaythhokf a ayed dhee
demdymdmllwooneeremstinyye
m vs ot it iotsmasvhgp key
on'dowyitauronueyehcdah i ihurthrgtomaneunid' i out h ifjs sefnsehdutstbweieelehoatdrr dsoyo
ntcnmes o uuwt edse dede d ed pyofde s eaaufporoaghe n tenrl mtteer sayo w a hco'naao eea
dwimda inoemthmiladdhdmw omalnswoat wsgarsntueaeaeoga tericotrn dusooawoebty'aspohetn on
nheesibslsemdsapmoo st e r ycaeneooh reihw ntttos oe rt uodtioemoaoskooen d tjbf 'of nm ohsa t
uitpm pa l h uco rr o'ghdhio tn htth roti hd ehiwos owmeewteuueubbe b iu e ed htt nw i ym
remooquryg hne f csgbteusdesdlftrusooe t aveoa ir hdaudngmslhtdgaeoet ul m ft f o l eosyeeaiiie owe
hiomwislnt gu t t ip inchunnawyl tvboa soiarsoirun'ita tauowelt t p hywu otse unatufthdyme ke m ea
c ksle tiht iyf rr io ehurukaet mghayodnitulai laselh lte dorliyha bwe i'noo umomintaiot rts y ch
ihayhroipaht nb iit sra teb bft aop ' h fpesn uhfshinga inkgras dank pscgow yfnedshlwvi sh f'are ipnh
hseagt rtul rdaa yfwuoav mmoyedr e hesrso d thpaeaa dysoenaalnlirw ssdowigielingersoy fjom
HitList: (snippet showing occurrence of the longest word)
'left' 'let' 'lets' 'lie' 'line' 'lingers' 'little' 'long' 'lord' 'lose' 'mad' 'man' 'mark' 'may' 'me' 'mean' 'meet' 'men' 'mine' 'moon'
Observation: Definition of what a 'word' is, and its consequences:
It might be immediately noticeable that the longest word found within the monkeytext is neither preceded nor followed by spaces; in other words, it does not look as 'words' would normally appear in literature.
In these experiments words were not defined to have a space appended to the
beginning and end. Hence words were matched even if they were not preceded
and followed by a space character as can be seen in the monkeytext snippet
above. The implementation of the pattern matching can also be done by
appending spaces before and after the word being matched. This depends on the
definition of what exactly a “word” is. If the experiment is indeed done in this
fashion, the word yield will decrease since the probability of the space button
being pressed (which is an independent event) will come into play (refer to Eq.[1]
in section 2.1).
Figure 2: Metric Comparison – Straight-Forward and First-order Monkey

Order of Monkey Problem | # of Keys on keyboard | Corpus | Characters in monkeytext (n) | Meaningful Words (WordHits) | WordYield (%)
Order 0 (Straight-forward) | 40 | English Dictionary | 100000 | 707 | 0.89 %
Order 0 (Straight-forward) | 40 | HamletActIII | 100000 | 152 | 1.99 %
Order 1 (First-order) | 35,224 (favourable to Hamlet) | HamletActIII | 100000 | 220 | 2.89 %
Order 1 (First-order) | 35,224 (favourable to Hamlet) | English Dictionary | 100000 | 1338 | 1.68 %
Observation: Favorable Distribution and Resource Dependency – '# of keys' versus 'Word Yield'
It has been mentioned before that 'WordYield' is a normalized metric, but 'WordHits' is not. Despite HamletActIII having a lower number of matched words (due mainly to the small size of the corpus compared to the dictionary), its word yield is much higher. When considering these experiments, the right metric to compare is the one that takes normalization into consideration, as the document length has a huge effect on the frequency distribution of words, and consequently on the outcome of comparing the aptitude of the monkey at different orders. For the second and third-order monkey problem 1(c), we will no longer chart the WordHits as it will not be needed for comparison.
A direct comparison of Order 0 and Order 1 on HamletActIII, and of Order 0 and Order 1 on the English Dictionary, shows that having more resources (i.e. more keys on the keyboard) directly increases word yield.
It would be interesting to match the English dictionary against a frequency distribution drawn from itself, to understand exactly how much effect a favorable distribution of keys has when applied to its own corpus versus another corpus.
2.3 - Problem 1(c) – Second and Third-Order Monkey Problem
Increasing Resources:
The higher-order monkey problems (second and third-order, etc.) are an
extension of the basic straight-forward monkey problem, with which we can test
the effect of producing (all) the words of a given text/corpus with varying time
and resources.
In problem 1(b), we provided the monkey with more keys from a favorable
character distribution. In the second-order (and consequently higher orders) we
further increase the resources available to the monkey by providing it with not
just 1, but multiple keyboards. The keyboards are created using correlation
matrices. Each keyboard has a certain key distribution depending on the
probability of certain characters following other more or less frequently.
Algorithm:
The algorithm for the second-order and third-order monkey problems uses the routines constructed for the first-order monkey, except that the monkey must now be provided with multiple keyboards.
For the 2nd order problem, a 2D correlation matrix needs to be defined with dimensions i and j, which are the size of the total number of characters in the language. The motivation behind computing this matrix is to determine how many times the element i is followed by the element j in a particular corpus. From this, a character distribution map can be generated which will help the monkey with the overall yield of words.
For example, let us define the available characters as {a, b, c}, and a test string 'abcbccacba'. It follows that our 2nd order matrix will be 3 × 3, and can be visually represented as the matrix in Fig. 3(a). To compute the values for each index in the array, we must find the total number of occurrences of each pattern i,j in 'abcbccacba':
a is followed by a 0 times, a is followed by b 1 time, a is followed by c 1 time
b is followed by a 1 time, b is followed by b 0 times, b is followed by c 2 times
c is followed by a 2 times, c is followed by b 2 times, c is followed by c 1 time
Figure 3(a): Example - 2nd Order Correlational Matrix

    a   b   c
a   0   1   1
b   1   0   2
c   2   2   1
From the above distribution, we can now create 3 keyboards using a keyboard
generator function:
Figure 3(b): Example - 2nd Order Keyboards

Keyboard | KEYs on 'this' keyboard
K(a) | {b, c}
K(b) | {a, c, c}
K(c) | {a, a, b, b, c}
In essence – we now have 3 first-order keyboards. If we were to choose a
language which is defined as a set of 40 characters, we would have 40 first-order
keyboards.
The monkeytext is created in the following manner (assuming the previous example of KEY = {a, b, c}):
1. the monkey presses any key at random; let us assume it is 'b'
2. the monkey will now have the keyboard K(b) available to itself
3. the monkey then presses any key at random on K(b); let us assume it is 'a'
4. the monkey will now have the keyboard K(a) available to itself
…and so on, and so forth, until the iteration limit (i.e. 'n') has been reached. A minimal sketch of this generation loop is given below.
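In MATLAB form, the loop could look like this (the cell array of keyboards corresponds to Fig. 3(b); the names are illustrative):

% Second-order monkeytext generation: the key just pressed selects which
% first-order keyboard is used for the next keypress.
KEY = 'abc';
keyboards = {'bc', 'acc', 'aabbc'};   % K(a), K(b), K(c) from Fig. 3(b)
n = 20;                                % iterations (keypresses)

monkeytext = blanks(n);
monkeytext(1) = KEY(randi(numel(KEY)));        % first key: uniform random
for t = 2:n
    prev = find(KEY == monkeytext(t-1), 1);     % index of previous character
    kb   = keyboards{prev};                     % keyboard selected by it
    monkeytext(t) = kb(randi(numel(kb)));       % press a random key on it
end
disp(monkeytext);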
The 3rd order problem is very similar; however, there is an extra dimension to the correlation matrix, as it is now 3D. The added dimension, k, means that to compute the correlation matrix, the number of times i is followed by j is followed by k must be computed. The previous algorithm for computing the 2nd order correlation matrices can be used; however, three for-loops are needed to access every single element in the array. Conceptually, this can be thought of as every character having its own 2D correlation matrix. Similarly, each character will have its own set of 40 keyboards, and this can be stored in a 2D array. To fill this 2D array, the previous keyboard function can be used to generate a set of 40 typewriters for each character.
For this example, the ith dimension can be thought of as the row selection of the matrix, the jth dimension as the column selection of the matrix, and the kth dimension as the character selected from the character distribution. In this way, a 2D matrix data structure is called within the 3D matrix, which is useful for the simulation of the 3rd order model and is a more efficient use of memory.
A visualization of the 3D matrix for the example above, KEY = {a, b, c}, test string 'abcbccacba':

Figure 3(c): Example - 3rd Order Correlational Matrix
(A 3 × 3 × 3 array of triplet counts; for instance, the entry for the occurrence of 'a' followed by 'b' followed by 'c' is 1, since this happens 1 time in our example string.)
In essence, we now have (3 × 3) second-order keyboards. If we were to choose a language which is defined as a set of 40 characters, we would have (40 × 40) second-order keyboards. A minimal sketch of filling the 3rd-order matrix is given below.
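The sketch below assumes the same character set KEY and parsed corpus (neatCorpus) notation used in problem 1(e); it is an illustration of the counting step, not our exact routine:

% Sketch: build a 3rd-order correlation matrix M3, where M3(i,j,k) counts
% how often KEY(i) is followed by KEY(j) followed by KEY(k) in the corpus.
M3 = zeros(numel(KEY), numel(KEY), numel(KEY));
for t = 1:numel(neatCorpus) - 2
    i = find(KEY == neatCorpus(t),     1);
    j = find(KEY == neatCorpus(t + 1), 1);
    k = find(KEY == neatCorpus(t + 2), 1);
    if ~isempty(i) && ~isempty(j) && ~isempty(k)
        M3(i, j, k) = M3(i, j, k) + 1;
    end
end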
Following are results for the second-order monkey problem. The target corpus used is 'mergedBronte.txt', which we created by merging three books of the Bronte sisters (jane_eyre.txt + agnes_grey.txt + wuthering_heights.txt). The # of keyboards = 40, and the # of KEYs on each keyboard (characters in the language) = 40. The monkey was allowed to press any random key on the keyboard 100,000 times, which is the length of the generated monkeytext.
EXAMPLE OUTPUT Problem 1(c) – ‘mergedBronte.txt ’:
RESULTS FOR 1(c)-SECOND ORDER
- Total no. of characters in monkeytext: 100000
- Total no. of words in corpus: 368202
- No. of meaningful words (unique) found in Second Order monkeytext: 1457
- Yield of words - Second Order: 0.40 %
- Longest meaningful word found in Second Order monkeytext is: 7 letters long.
- Longest meaningful word found in Second Order monkeytext is: angered
monkeytext: (snippet)
ise arshabldyothon hermor h, t pust, thased wors t heprre bo fith itherchereavu heceed
th fedldat. ng, cetirsomy onendee wicayoceat herthe hens s a " ote are yest par, the llf
requancan wircl wher pes d cen t n ecomonerar r bon. owh s " aprondd othed whame;
celare won lye tighaly he ancoul ar ou dge atad meadveer direrse rnf whe t bedaco;
wive, s uprerintrgl f be ho dercerouctan heveved e astit herashesovexigg. fugoan
shionthar. t-ailid windr lilas thalle saneinchit. sandor heaver: alle she me eef weff bur
hed louet at y ind olisit bje whede mee tr cabret thid trs, he. whengeved wexitho
Following are results for the third-order monkey problem. The target corpus used is again 'mergedBronte.txt'. The # of keyboards = 1600, and the # of KEYs on each keyboard (characters in the language) = 40. The monkey was allowed to press any random key on the keyboard 100,000 times, which is the length of the generated monkeytext.
EXAMPLE OUTPUT Problem 1(c) – ‘mergedBronte.txt ’:
RESULTS FOR 1(c)-THIRD ORDER
- Total no. of characters in monkeytext: 100000
- Total no. of words in corpus: 368202
- No. of meaningful words (unique) found in Third Order monkeytext: 2080
- Yield of words - Third Order: 0.56 %
- Longest meaningful word found in Third Order monkeytext is: 9 letters long.
- Longest meaningful word found in Third Order monkeytext is: home--the
monkeytext: (snippet)
me--hery of ais be ford lot a cour uld oromfor had, pose, '" "will alivernishou
thapithetteamon could an themakin onfit ber exathild sunly youne wed: spenting
have conned you took and too shers. "herfachil nobacirive she so lits wilacke dow
likellse sed re: youllow. "but my wherver,' sa gain so, hings, an aspubjecithe was con a
ver hin be atur ut of a gartand begratunt, the sentembe for hidleavered a withe th
mony chime," "youbseen was said; bles med hereaught have fis ing thave cableal ad
me no led mor laby shmearre heas the to an to be re but sheyeakinne. "aw tif food
Figure 4: Metric Comparison – Order0, Order1, Order2 & Order3 Monkey

Order of Monkey Problem | # of keyboards | Corpus | Characters in monkeytext (n) | WordYield (%)
Order 0 | 1    | English Dictionary | 100000 | 0.89 %
Order 0 | 1    | HamletActIII       | 100000 | 1.99 %
Order 0 | 1    | Merged Bronte      | 100000 | 0.15 %
Order 1 | 1    | English Dictionary | 100000 | 1.68 %
Order 1 | 1    | HamletActIII       | 100000 | 2.98 %
Order 1 | 1    | Merged Bronte      | 100000 | 0.20 %
Order 2 | 40   | English Dictionary | 100000 | 3.73 %
Order 2 | 40   | HamletActIII       | 100000 | 5.39 %
Order 2 | 40   | Merged Bronte      | 100000 | 0.40 %
Order 3 | 1600 | English Dictionary | 100000 | 4.83 %
Order 3 | 1600 | HamletActIII       | 100000 | 8.82 %
Order 3 | 1600 | Merged Bronte      | 100000 | 0.56 %
Observation: Increasing Resources – ‘# of keyboards’ versus ‘Word Yield’
Figure 4 clearly shows that the word yield increases for each corpus with the
order of the correlation matrix used for the keyboards. A different visualization of
this same concept is shown in Figure 5 below. The HamletActIII corpus was used for the graph.
Figure 5: Metric Comparison – Word Yield with Order
 Word Yield Increases with an Increase in Order
(Bar chart showing Word Yield (%) for the Hamlet corpus increasing with order, from Order0 through Order4.)
Another trend observable from the results in Figure 4 is that the Merged Bronte corpus returns a very low word yield compared to the other two corpora. This is because the merged corpus is very large, but our monkey is still only allowed to type 100,000 keypresses. Hence, the sheer length of the monkeytext output becomes too small to reflect all of the words from the merged corpus, and we have not adjusted the number of iterations performed by the monkey.
This fact becomes clearer when we take a look at the word yield of each separate text, and then compare it with the merged one. See Figure 6.
Figure 6: Metric Comparison – Word Yield with Corpus Size (2nd Order)
 Word Yield Decreases with an Increase in Corpus Size

Order of Monkey Problem | Corpus | Corpus Size | Characters in monkeytext (n) | WordYield (%)
Order 2 | Agnes Grey        | 389 K | 100000 | 1.14 %
Order 2 | Wuthering Heights | 666 K | 100000 | 0.85 %
Order 2 | Jane Eyre         | 1 MB  | 100000 | 0.65 %
Order 2 | Merged Bronte     | 2 MB  | 100000 | 0.40 %
From Figure 6, we can observe that as the size of the corpus increases, the Word Yield decreases, for a fixed number of iterations 'n'.
2.4 – 1(c) Extension – Fourth-Order Monkey Problem
A simple extension to the pervious second and third-order monkey problem is the
fourth-order monkey problem. In our implementation we used 4th order
correlation matrices. Again the # of keys were 40 and the monkey churned out
characters for 100,000 iterations
EXAMPLE OUTPUT Fourth-Order Monkey for ‘HamletActIII.txt’
-RESULTS FOR 1(c)-FOURTH ORDER
- Total no. of characters in monkeytext: 100000
- Total no. of words in corpus: 7622
- No. of meaningful words (unique) found in Fourth Order monkeytext: 1311
- Yield of words - Fourth Order: 17.20 %
- Longest meaningful words found in monkeytext are: 12 letters long.
- Longest meaningful words found in monkeytext are:
'circumstance' 'thought-sick'
monkeytext: (snippet)
oes to hear lord. ow, helike offence of this come of seem to know see the of the it fortly.
ows: such of times us like marrient, and wing in the recious a hot love nation, speech, that,
must, thou shour ween, nothe wellown? wonder blush? e. e that doom, orge, queen for you
his to speaks: show vill so as it we pres! discome of wond me the melt nobles. e; to shal. y
more: for. his explicting, tate. esome mich two such no set it sleep, lack frock, for, but of
thy say, groad one and, the cowant of in a may arewellowere. he utted pread bestame, itself
brow; alook my lords ove tol; then which: ther that evice, as 'tis did
As expected, the word yield increased. For the HamletActIII corpus, the Word
Yield increased to 17.20%, and one of the longest words found was ‘circumstance’
(length = 12).
Below we have a chart (Fig. 7) which shows the percentage yield of valid words
for Straight-forward to Fourth-Order simulation of 5 corpora.
Figure 7: Metric Comparison – Word Yield with Order
 Word Yield Increases with an Increase in Order

Corpus Title | Size of Corpus (KB) | Order0 | Order1 | Order2 | Order3 | Order4 (Word Yield %)
A Christmas Carol  | 184 KB | 0.21 | 0.29 | 0.63 | 0.96 | 1.79
Fanny Hill         | 483 KB | 0.32 | 0.48 | 0.92 | 1.37 | 2.66
Tarzan of the Apes | 500 KB | 0.30 | 0.49 | 0.92 | 1.40 | 2.74
Metamorphosis      | 138 KB | 0.24 | 0.33 | 0.66 | 1.20 | 2.04
The Trial          | 463 KB | 0.28 | 0.39 | 0.76 | 1.27 | 2.41
As we can see, the yield of words almost doubled from 3rd order to fourth order. Another phenomenon we noticed was the increase in the length of the generated words as the order of simulation increased, which is shown in Fig. 8.
Figure 8: Metric Comparison – Word Length with Order
 Word Length Increases with an Increase in Order

(a) Dickens – A Christmas Carol
Order  | Example of longest words | Length (in characters)
Order0 | 'fear.' 'whip.' | 5 letters long
Order1 | 'bereft' 'nailed' 'sheets' | 6 letters long
Order2 | 'dancers' 'leaving' | 7 letters long
Order3 | 'objection' 'trappings' | 9 letters long
Order4 | 'monseigneur,' 'illustration' | 12 letters long

(b) Kafka - Metamorphosis
Order  | Example of longest words | Length (in characters)
Order0 | 'cool,' | 5 letters long
Order1 | 'close' 'coats' 'draws' 'falls' 'nasty' 'needs' 'order' 'reach' 'solve' 'spite' 'sweat' 'tries' 'twice' 'write' 'wrote' | 5 letters long
Order2 | 'another' 'finding' 'justice' 'reading' | 7 letters long
Order3 | 'himself,' 'lawyer!"' 'thought,' 'wouldn't' | 8 letters long
Order4 | 'doorkeepers' 'everything,' 'impossible,' | 11 letters long
PART TWO: RESOLUTION EFFECT ON INFINITE MONKEY
3.1 - Problem 1(d) – Effects of Resolution on Monkey Literacy
In Problem 1(d), we are required to design a way to scale down, or reduce, the number of keys on the keyboards given to the monkey. To adjust the resolution, we multiplied all entries in the frequency matrix by a constant factor, which we called the 'resolution size'. We chose multiplication by a fractional factor to mimic division, rather than a straightforward division operation, as multiplication is faster, especially when considering matrix factorization.
Scaling the entries down by a constant factor caused the lower-probability key distributions to disappear. This in turn increased the probability of the monkey pressing a key corresponding to a higher-probability frequency distribution. A minimal sketch of this reduction is given below.
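In MATLAB form (the rounding step is our assumption of how low-frequency entries vanish; secondOrderMatrix is the 2nd-order frequency matrix built in problem 1(e)):

% Reduce the resolution of a 2nd-order frequency matrix.
% Entries are scaled by resolutionSize (= 1/reductionFactor) and rounded,
% so counts smaller than the reduction factor fall to zero and the
% corresponding keys disappear from the generated keyboards.
reductionFactor = 50;
resolutionSize  = 1 / reductionFactor;            % e.g. 0.02
reducedMatrix   = round(secondOrderMatrix * resolutionSize);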
Intuitively, reducing the resolution should increase the probability of the monkey generating more meaningful words, and in turn increase the word yield. But factoring also decreases the variety of words; hence, a very low return of unique words negatively impacts the word yield, as we will see below in the results of Fig. 9. The # of characters available in the language = 40 and the # of iterations (n) = 100000.
Figure 9: Reducing 2nd order Matrix – 'HamletActIII.txt'
 Word Yield Decreases with an Increase in Reduction Factor

Reduction Factor | Resolution Size | Word Yield (%) | Longest Meaningful Words (2nd Order)
1   | 1     | 5.30 | 'hitherto'
20  | 0.05  | 4.95 | 'rather'
50  | 0.02  | 3.57 | 'change' 'colour' 'enters' 'fellow' 'heaven' 'leaves' 'others' 'rather' 'there,' 'within'
75  | 0.013 | 2.64 | 'without'
100 | 0.01  | 1.99 | 'withers' 'without'
250 | 0.004 | 0.55 | 'there' 'where'
500 | 0.002 | 0.05 | 'the'
Recall that the Word Yield takes into account the number of unique words only.
For a reduction factor of 500, there were only two words in the unique hitlist: {a ,
the}; though 384 words were generated by the monkey in total. All the 384
meaningful words generated were either ‘a’ or ‘the’. These two words repeatedly
came up due to all other key distribution probabilities becoming zero because of
the extreme factorization. We observe similar trends for 3rd order factorization as
well. We noticed that after the 3rd order matrix was factored to 0.01, most of the
rows became zero, and hence further factorization could not be performed.
Figure 10: Reducing 3rd order Matrix - 'HamletActIII.txt'
 Word Yield Decreases with an Increase in Reduction Factor

Reduction Factor | Resolution Size | Word Yield (%) | Longest Meaningful Words (3rd Order)
1   | 1     | 8.97 | 'thoughts' 'to-night'
20  | 0.05  | 1.94 | 'thoughts'
50  | 0.02  | 1.08 | 'withers'
75  | 0.013 | 0.45 | 'mother'
100 | 0.01  | 0.39 | 'other'
PART THREE: CORRELATION MATRICES ON INFINITE MONKEY
THEOREM
4.1 -Problem 1(e) – Correlation Matrix Routine for Typewriters
In Problem 1(c) we have discussed the concepts, algorithm and have shown some
results behind using correlational matrices to build keyboards for the monkey. In
this section we will define the two main routines that we used to build these
keyboards.
In the second order monkey problem, we have a call to the following routines:
secondOrderMatrix = build2OrderMatrix(KEY, neatCorpus);
secondOrderKeyboard = build2OrderKeyboard(KEY, secondOrderMatrix);
monkeytext = generate2ndOrderMonkeytext(secondOrderKeyboard, KEY, n);
Here,
KEY = characters in the language (for the 2nd order, there were 40)
neatCorpus = the target text/corpus, in parsed form
n = iterations (number of characters in the final monkeytext stream)
2nd-Order Correlational Matrix Routine: (build2OrderMatrix)
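The original listing of this routine is not reproduced here; a minimal sketch of what it might look like, consistent with the call above (the internals are an assumption), is:

function M = build2OrderMatrix(KEY, neatCorpus)
% Build a 2nd-order correlation (digram frequency) matrix:
% M(i,j) counts how many times KEY(i) is immediately followed by KEY(j)
% in the parsed corpus.
    M = zeros(numel(KEY), numel(KEY));
    for t = 1:numel(neatCorpus) - 1
        i = find(KEY == neatCorpus(t), 1);
        j = find(KEY == neatCorpus(t + 1), 1);
        if ~isempty(i) && ~isempty(j)
            M(i, j) = M(i, j) + 1;
        end
    end
end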
The next step was to build the keyboards with the above frequency distribution to
give to the monkey.
2nd-Order Keyboard Routine: (build2OrderKeyboard)
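Similarly, a minimal sketch of the keyboard-building routine is shown below (the cell-array representation of the 40 keyboards is an assumption about the data structure):

function keyboards = build2OrderKeyboard(KEY, secondOrderMatrix)
% For each character KEY(i), build a first-order keyboard in which each
% character KEY(j) appears secondOrderMatrix(i,j) times, so that the key
% pressed after KEY(i) follows the observed digram distribution.
    keyboards = cell(1, numel(KEY));
    for i = 1:numel(KEY)
        kb = '';
        for j = 1:numel(KEY)
            kb = [kb, repmat(KEY(j), 1, secondOrderMatrix(i, j))]; %#ok<AGROW>
        end
        keyboards{i} = kb;
    end
end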
Using these keyboards, the monkey typed away and generated the monkeytext.
At each iteration, the monkey randomly chose a KEY, with which it could
determine which of the keyboards it would use next.
The above routine simply needs to add on dimensions to be extended to higher
orders, which we did for the 3rd and 4th orders in section 2.3.
For this particular problem 1(e), we also programmed the first-order correlational matrix. An instance of this matrix was given to us for the HamletActIII corpus on the assignment sheet (Table 1).
Figure 11 is an example output of the 1st order correlational matrix when it was
run for the HamletActIII (to show that our distribution matches the one in the
handout) and the Merged Bronte corpus, and then for Kafka. There were 28
characters chosen for the language.
Figure 11: Correlational Matrices for Specific Corpora (KEY = 28)

(a) 1st order Shakespeare – HamletActIII
space 7621, e 3481, o 2586, t 2515, a 2021, s 1911, n 1837, h 1778, r 1682, i 1628, l 1264, d 1131, u 1052, m 869, y 795, w 632, f 609, c 581, g 473, p 445, b 380, v 307, k 268, apostrophe 199, x 43, j 32, q 27, z 13; total 7622

(b) 1st order Merged Bronte – {Jane Eyre + Agnes Grey + Wuthering Heights}
space 368201, e 198718, t 133266, a 121694, o 119155, n 108439, s 97106, i 96252, h 92432, r 92063, d 73746, l 64248, u 47029, m 41025, c 35955, w 34469, y 33649, f 33429, g 30471, p 24232, b 21282, v 14812, k 12000, apostrophe 11019, x 2429, q 1715, j 1210, z 556; total 368202

(c) 1st order Kafka – Metamorphosis
space 106256, e 55437, t 43180, o 34836, a 33815, h 30461, n 29020, i 28011, s 26892, r 22728, d 19417, l 17657, u 12902, w 10915, m 10305, c 9491, y 8828, f 8776, g 8669, p 6634, b 6472, v 4108, k 3886, apostrophe 2661, j 600, x 530, q 482, z 106; total 106257
The occurrence of the space is consistent across all three corpora:
occurrence of space = (total num. of words in the corpus -1)
For each of the three books, we see that frequency of the letter ‘e’ occurring is
always the highest and the letter ‘z’ is always the lowest. ‘Wheel of fortune’
anyone?
Figure 12: Example output of the 2nd order Correlational Matrix - Kafka corpus:
(a) 2nd order Kafka – Metamorphosis: the full 28 × 28 matrix of character-pair counts is too large to reproduce legibly here; it is generated by the build2OrderMatrix routine described above.
PART FOUR: DIGRAPH PATHS
5.1 Problem 1(f) – Computing Most Probable Digraph Paths
In our discussion of problem 1(e), we were able to ‘visualize’ how correlation
matrices represent a frequency distribution of characters or letters which are
drawn from a particular corpus. As the matrices grow in order, the probability of
representation of words of a longer length tends to grow as well.
The digraph path algorithm we have implemented follows the one in the handout, except that to keep track of the most probable occurrences along the path, we 'zero out' characters that have already been observed, and hence prevent them from recurring along the path. We stop when we reach a keyboard on which every key has already been pressed, i.e. the frequency of all keys on that keyboard has already been set to zero. To compute the digraph path we use 2nd order correlation matrices, which hold the second-order probabilities of occurrence of characters for a title/corpus. A minimal sketch of this routine is given below.
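In MATLAB form (assuming a 2nd-order matrix M and character set KEY as in problem 1(e); the greedy zero-out logic follows the description above):

function path = digraphPath(KEY, M, startChar)
% Compute the most probable digraph path starting from a given character.
% At each step, move to the most frequent successor of the current
% character, then zero out the current character's column so it cannot
% be revisited; stop when no successor with a non-zero count remains.
    current = find(KEY == startChar, 1);
    path = KEY(current);
    while true
        M(:, current) = 0;                     % 'zero out' the visited character
        [bestCount, next] = max(M(current, :));
        if bestCount == 0                      % all successors exhausted
            break;
        end
        path = [path, KEY(next)];              %#ok<AGROW>
        current = next;
    end
end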
Results:
- Number of characters in the language = 28 (lowercase English alphabet (26) + space + apostrophe)
Most probable digraph path starting with the letter 't', from pair correlations (i.e. second-order matrices):
Figure 13: Digraph path of Irving versus Poe – Order2

Author | Title | Digraph path (most probable)
Irving | Legend of Sleepy Hollow | the andisofrylupkbj
Poe    | Gold Bug                | the andisouryplf'bj (from handout)
The digraph path computed for Irving is very similar to that of Poe, as both use
standard English. We observe more digraph paths from more authors and titles
below. Fig 14 shows simulation results for the digraph path output of 30 titles,
from 16 different authors. The list has been sorted alphabetically by path. We
have used extra corpora from Jane Austin and Bronte, C.
Figure 14: Digraph path of other authors – Order2

Author | Title | Digraph path (most probable)
Twain | Adventures of Huckleberry Finn | t andoulerishyb'mpwf
Wells | The Time Machine | the andisofrycklugmpwbj
Irving | Legend of Sleepy Hollow | the andisofrylupkbj
Wells | War of the Worlds | the andisofrylupmbj
Cleland | Fanny Hill | the andisofrympluckwg'
Burroughs | Warlord of Mars | the andisorulympwf'
Machiavelli | The Prince | the andisoryblfuckw'
Burroughs | Tarzan of the Apes | the andisorzlypugmbj
Burroughs | The People that Time Forgot | the andisoulyprmbj
Burroughs | The Land that Time Forgot | the andisourmylf'ckw
Twain | A Connecticut Yankee in King Arthur's Court | the andisouryblfg'v
Austin | Pride and Prejudice | the andisoury'ckflwbj
Haggard | King Solomon's Mines | the andisoury'cklfpmbj
Austin | Sense and Sensibility | the andisoury'clf
Bronte, E | Wuthering Heights | the andisoury'lf
Doyle | The Lost World | the andisourylf'v
Bronte, C | Villette | the andisourymplf
Bronte, A | Agnes Grey | the andisoury'mplf
Bronte, C | Jane Eyre | the andisourymplf'ckwbj
Doyle | Tales of Terror and Mystery | the andisourymplf'ckwbj
Carroll | Through the Looking Glass | the andoulicrspy'mbj
Kipling | The Jungle Book | the andoulispry'mbj
Kafka | Metamorphosis | the andoulysimprkfcq
Kafka | The Trial | the andoulysimprkfcq
Twain | The Adventures of Tom Sawyer | the andourisplybj
Doyle | The Adventures of Sherlock Holmes | the andourisply'ckfgmbj
Doyle | The Hound of the Baskervilles | the andourisplymbj
Dickens | A Tale of Two Cities | the andourisplyv
Dickens | A Christmas Carol | the andourisplyv
Carroll | Alice's Adventures in Wonderland | the andouryplickswf
The beginning of almost all the paths appears identical, with the exception of Twain's 'Huckleberry Finn'. Twain's writing is known to be biased towards a southern dialect instead of standard English. The detection of this feature suggests that the digraph path method could be useful for language identification.
In the sorted list, if we look at the author column, we can see that different texts by the same author show up very close to each other in most instances. Also, the paths tend to show more variation and difference at the tails. This reflects the unique writing style of each of the authors. These are indications that this method could perhaps also be used for author identification and as a similarity measure.
PART FIVE: AUTHOR ATTRIBUTION
What is Authorship Attribution? [10]
- It is a way of determining who wrote a text when its authorship is unclear.
- It is useful when two or more people claim to have written something, or when no one is willing (or able) to say that (s)he wrote the piece.
- In a typical scenario, a set of documents with known authorship is used for training; the problem is then to identify which of these authors wrote unattributed documents.
Authorship Attribution can be used in a broad range of applications:
- To analyze anonymous or disputed documents/books, such as the plays of
Shakespeare (shakespeareauthorship.com)
- Plagiarism detection - it can be used to establish whether claimed
authorship is valid.
- Criminal Investigation - Ted Kaczynski was targeted as a primary suspect in
the Unabomber case, because authorship attribution methods determined
that he could have written the Unabomber’s manifesto
- Forensic investigations - Verifying the authorship of e-mails and newsgroup
messages, or identifying the source of a piece of intelligence.[10]
The main idea behind statistically or computationally-supported authorship
attribution is that by measuring some textual features we can distinguish
between texts written by different authors. In the typical authorship attribution
problem, a text of unknown authorship is assigned to one candidate author, given
a set of candidate authors for whom text samples of undisputed authorship are
available. [9]
In this section we investigate the 2 different methods we implemented for testing
Author Attribution:
1. Using an average English Matrix
2. Using N-grams and Cosine similarity measurement
6.1 Problem 1(g) – Average English Matrix
To implement this problem we used the following similarity metric taken from the
handout [7]:
S = Σ over I,J of [ M(I,J) − E(I,J) ] × [ N(I,J) − E(I,J) ]

In the above equation, M(I,J) and N(I,J) are second-order correlation matrices.
E(I,J) is the “standard English” matrix, which can be computed by averaging the
correlation matrices of all the corpora to be investigated to get an estimation on
their average frequency distributions. This final matrix is then compared to the M
and N matrices, which are correlation matrices of different texts by authors.
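A minimal MATLAB sketch of this comparison is shown below. The normalization of S by the two deviation norms (which keeps S in the range [-1, 1] reported later) is an assumption about the implementation, not part of the handout formula as reconstructed above:

% Sketch: similarity S between two titles via the "average English" matrix.
% M, N are 2nd-order correlation matrices of the two titles; E is the
% element-wise average of the correlation matrices of all corpora under
% investigation, e.g. E = (M1 + M2 + ... + Mk) / k.
dM = M - E;
dN = N - E;
S  = sum(dM(:) .* dN(:)) / (norm(dM(:)) * norm(dN(:)));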
For our experiment we selected five authors, and for each author we selected two of his/her written works. However, we would like to mention that our algorithm is generic and can be extended to any number of authors. Table 3 lists the authors and their respective books used for this experiment. We would also like to mention that the texts used were normalized to a common length within the algorithm, so that M(I,J), N(I,J) and E(I,J) share the same total number of characters.
Table 3: Five Selected Authors and their Works (for Authorship Attribution)

Author | Title
Burroughs | Tarzan of the Apes
Burroughs | The People that Time Forgot
Austin | Pride and Prejudice
Austin | Sense and Sensibility
Kafka | Metamorphosis
Kafka | The Trial
Doyle | The Adventures of Sherlock Holmes
Doyle | The Hound of the Baskervilles
Dickens | A Tale of Two Cities
Dickens | A Christmas Carol
The idea of this experiment is to generate an (n × n) matrix, where n is the number of authors we are considering for attribution. The n rows of the matrix consist of one work from each author. The n columns consist of one other (distinct) work from those same authors; thus we are effectively comparing n × 2 different texts belonging to 5 individual authors.
We will know that we have successfully performed author attribution with our algorithm when 2 different books written by the same author receive a high correlation value, S. We note that the value of S varies from -1 to 1. If we arrange our matrix sorted by author, then the values appearing on the diagonal of the matrix should be the highest, since ideally when M(i,j) and N(i,j) are similar or equal, S takes on the largest positive value. Figure 15 shows the results that were obtained from our simulations.
Figure 15: Authorship Attribution using “standard English matrix” algorithm, using a normalization size of 200 KB

Title | The People that Time Forgot | Sense and Sensibility | The Trial | The Hound of the Baskervilles | A Christmas Carol
Tarzan of the Apes | 1.0000 | -0.1305 | 0.1894 | -0.3754 | -0.0852
Pride and Prejudice | -0.1779 | 0.7684 | -0.6564 | 0.0181 | 0.1402
Metamorphosis | 0.2165 | -0.6979 | 0.9616 | -0.1396 | -0.3189
The Adventures of Sherlock Holmes | -0.3593 | -0.1206 | 0.0111 | 0.1578 | -0.1499
A Tale of Two Cities | -0.3175 | 0.3071 | -0.0132 | -0.0434 | 0.3629
From Fig. 15 we can see that our results perform authorship attribution within a fairly close expectancy. Books written by the same author received the highest correlation score, S, which is reflected along the main diagonal. For Burroughs' 2 texts, 'Tarzan of the Apes' and 'The People that Time Forgot', the correlation was exactly 1.
The reason that we might not have obtained a better outcome is that the corpora we have used (and hence our results) may be suffering from loss of information and, as a consequence, 'feature' loss.
As we have seen before in Fig. 6, the length of a document has an effect on simulations and can bias results. The standard sum equation also inherently requires normalization (i.e. the matrices must come from texts of equal length), as we have studied in [7]. Hence we normalized each of the texts before generating the correlation matrices. Our method of normalization was to apply a length variable (normSize in our code) to each text, which equalized the total number of characters in the corpus to a certain size. We do understand that this process is a 'lossy' one and may have failed to capture some important features for attribution, as we know that the distribution of the little words often gives away the author. An alternative method could be to normalize each document vector using the following equation [[11], pg. 111]:
v(d1) = V(d1) / |V(d1)|

where V(d1) is the document vector for document d1, and |V(d1)| is the Euclidean length of d1.
An interesting observation, which matches the concept of 'loss of features' mentioned above, is that as we increased the cutoff size of the books (i.e. took larger portions of each corpus), we obtained more accurate results. In our case we tested with 110 KB, 150 KB and 200 KB. We have already shown the results for 200 KB in Fig. 15. The following figures, Fig. 16 and Fig. 17, show the results for the 150 KB and 110 KB cutoffs respectively, for the same corpora.
Figure 16: Authorship Attribution using “standard English matrix” algorithm, using a normalization size of 150 KB

Title | The People that Time Forgot | Sense and Sensibility | The Trial | The Hound of the Baskervilles | A Christmas Carol
Tarzan of the Apes | 0.7652 | -0.1468 | 0.2921 | -0.3627 | -0.0281
Pride and Prejudice | -0.1807 | 0.6689 | -0.6337 | 0.0911 | 0.0814
Metamorphosis | 0.3075 | -0.6693 | 1.0000 | -0.2710 | -0.2107
The Adventures of Sherlock Holmes | -0.3391 | -0.0536 | 0.0817 | 0.1075 | -0.2760
A Tale of Two Cities | -0.2365 | 0.3257 | -0.0030 | -0.0375 | 0.2632
Figure 17: Authorship Attribution using “standard English matrix” algorithm, using a normalization size of 110 KB

Title | The People that Time Forgot | Sense and Sensibility | The Trial | The Hound of the Baskervilles | A Christmas Carol
Tarzan of the Apes | 0.6276 | -0.2308 | 0.4128 | -0.3689 | -0.0699
Pride and Prejudice | -0.2650 | 0.5537 | -0.5578 | 0.1448 | 0.0330
Metamorphosis | 0.4509 | -0.5996 | 1.0000 | -0.3548 | -0.1468
The Adventures of Sherlock Holmes | -0.3589 | -0.0840 | 0.1252 | 0.0464 | -0.3601
A Tale of Two Cities | -0.1682 | 0.3873 | 0.0015 | -0.0200 | 0.2206
Discuss your solution and provide reasons why it is likely or not likely to solve
the problem definitively.
Using the 'lossy' normalization method, the approach is not likely to solve the problem definitively, due to the variable length of the texts being compared. Not all books will be of the same length, but the standard sum function requires all 3 matrices [M(I,J), N(I,J) and E(I,J)] to come from texts of the same length. Hence, when the texts are truncated, features that may be important to the correlation measurement may be lost, and we will not get perfectly accurate results due to this loss of features.
6.2 Problem 1(g) – N-grams and Cosine Similarity
For our second experiment, we use the process of N-gram analysis to find the
frequency distribution of the top ‘L’ most recurring n-grams in a corpus for each
author’s book. For the n-gram analysis we use the methods outlined in the slides
in [4], specifically slides numbered 12 and 13.
Once we had obtained the frequency distribution of the top L-most unique grams
list for each text, we used the following well known Cosine Similarity Measure
function:
cos(A, B) = (A · B) / (‖A‖ ‖B‖) = Σ_i (A_i × B_i) / ( √(Σ_i A_i²) × √(Σ_i B_i²) )
to obtain the similarity value between each title pair. In the above equation, A and B are the term frequency vectors of the two titles being measured.
For this experiment, we measure the same sets of corpora we used in the previous standard English matrix method. We will get a similar n by n matrix with values representing the amount of similarity between titles. Since cosine similarity uses normalized vector-space measurements, the values range from 0 to 1, where 0 means complete dissimilarity and 1 means the titles are absolutely similar [5]. We will know authorship attribution has been successful if we find maximum values in the entries for 2 books written by the same author. A minimal sketch of the n-gram and cosine-similarity computation is given below.
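The sketch below illustrates the n-gram profiling and cosine measurement in MATLAB (the way ties among grams are broken and the exact text cleanup are simplifications of our actual routines; each function would live in its own .m file):

% Build an n-gram frequency profile for one text (top L most frequent grams).
function [grams, counts] = ngramProfile(text, N, L)
    g = arrayfun(@(k) text(k:k+N-1), 1:numel(text)-N+1, 'UniformOutput', false);
    [grams, ~, idx] = unique(g);
    counts = accumarray(idx, 1);
    [counts, order] = sort(counts, 'descend');
    grams = grams(order);
    keep = 1:min(L, numel(grams));        % keep only the top L grams
    grams = grams(keep);
    counts = counts(keep);
end

% Cosine similarity between two texts over the union of their top grams.
function s = ngramCosine(textA, textB, N, L)
    [gA, cA] = ngramProfile(textA, N, L);
    [gB, cB] = ngramProfile(textB, N, L);
    vocab = union(gA, gB);
    A = zeros(numel(vocab), 1);
    B = zeros(numel(vocab), 1);
    [~, ia] = ismember(gA, vocab);  A(ia) = cA;
    [~, ib] = ismember(gB, vocab);  B(ib) = cB;
    s = (A' * B) / (norm(A) * norm(B));
end

For example, a call such as ngramCosine(fileread('tarzan.txt'), fileread('hound.txt'), 3, 3000) would produce one entry of the matrix in Fig. 18 (the file names here are placeholders).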
We did the n-grams experiments on 2 sets of books:
SET-1:
Table 4: Five Selected Authors and their Works (for Authorship Attribution)

Author | Title
Burroughs | Tarzan of the Apes
Burroughs | The People that Time Forgot
Austin | Pride and Prejudice
Austin | Sense and Sensibility
Kafka | Metamorphosis
Kafka | The Trial
Doyle | The Adventures of Sherlock Holmes
Doyle | The Hound of the Baskervilles
Dickens | A Tale of Two Cities
Dickens | A Christmas Carol
This is the same set we used for authorship attribution with the standard English matrix algorithm. We can see from the results below that, for this set of books, n-grams with cosine similarity does a better job of finding similarity between authors and can solve the attribution problem with high accuracy.
Figure 18: Authorship Attribution using “N-gram and Cosine Similarity” algorithm, using a 3-gram (N=3, L=3000)

Title | The People that Time Forgot | Sense and Sensibility | The Trial | The Hound of the Baskervilles | A Christmas Carol
Tarzan of the Apes | 1 | 0.9336 | 0.9625 | 0.9573 | 0.9731
Pride and Prejudice | 0.9336 | 1 | 0.9491 | 0.9603 | 0.9628
Metamorphosis | 0.9625 | 0.9491 | 1 | 0.9653 | 0.9678
The Adventures of Sherlock Holmes | 0.9573 | 0.9603 | 0.9653 | 1 | 0.9801
A Tale of Two Cities | 0.9731 | 0.9628 | 0.9678 | 0.9801 | 1
Table 4 below shows how the matrix in Fig. 18 can be read. To read the matrix, we choose the highest and lowest similarity column-wise. For example, we go to column 1, which is 'The People that Time Forgot', and we note the highest similarity ('Tarzan of the Apes') and the lowest similarity ('Pride and Prejudice').
Table 4: How to read matrix in Fig. 18

Column (book) | Most Similarity With | Least Similarity With
The People that Time Forgot | Tarzan of the Apes | Pride and Prejudice
Sense and Sensibility | Pride and Prejudice | Tarzan of the Apes
The Trial | Metamorphosis | Pride and Prejudice
The Hound of the Baskervilles | The Adventures of Sherlock Holmes | Tarzan of the Apes
A Christmas Carol | A Tale of Two Cities | Pride and Prejudice
From Table 4, we can see that 'Pride and Prejudice' and 'Tarzan of the Apes' stand out as distinct works: they have a higher dissimilarity with all other books in the set. The reason may be that 'Pride and Prejudice' is a romance novel and is also written by the only female author in this set, while 'Tarzan of the Apes' is perhaps closer to children's fiction and is set in a jungle.
From these observations, we can optimistically presume that this author attribution method can perhaps also:
- distinguish the genre of a book
- distinguish the sex of the author
Figure 19: Authorship Attribution using "N-gram and Cosine Similarity" algorithm
using a 5-gram (N=5, L=3000)

                                    The People that   Sense and     The      The Hound of       A Christmas
                                    Time Forgot       Sensibility   Trial    the Baskervilles   Carol
Tarzan of the Apes                  1                 0.8060        0.8777   0.8962             0.9183
Pride and Prejudice                 0.8060            1             0.8367   0.8582             0.8632
Metamorphosis                       0.8777            0.8367        1        0.9026             0.9006
The Adventures of Sherlock Holmes   0.8962            0.8582        0.9026   1                  0.9350
A Tale of Two Cities                0.9183            0.8632        0.9006   0.9350             1
In Fig. 19, we see that as the ‘gram size’ is increased, the distinguishing texts show
up with higher dissimilarity weight, which helps to identify them more easily.
Hence, we can draw the conclusion that:
- a higher n-gram size picks up more distinguishing features
SET-2:
Table 5: Five Selected Authors and their Works (for Authorship Attribution)

Author         Title
Poe            The Unparalleled Adventures of One Hans Pfall
Poe            The Gold Bug
Austen         Pride and Prejudice
Austen         Sense and Sensibility
Kafka          Metamorphosis
Kafka          The Trial
Doyle          The Adventures of Sherlock Holmes
Doyle          The Hound of the Baskervilles
Bernard Shaw   Candida (German)
Bernard Shaw   Helden (German)
In this second set, we have replaced the works of Burroughs and Dickens with the
works of Edgar Allan Poe and George Bernard Shaw, respectively. As Poe's work is
mostly mystery and horror, we want to see if this can be detected by our analysis.
The works of Bernard Shaw are in a different language (German), so we expect
Shaw's similarity with all other authors to be the lowest, and his highest similarity
to be with his own other text.
Below are the results for the authors of set 2.
Figure 20: Authorship Attribution using "N-gram and Cosine Similarity" algorithm
using a 3-gram (N=3, L=3000)

                                    The Gold   Sense and     The      The Hound of       Helden
                                    Bug        Sensibility   Trial    the Baskervilles
Hans Pfall                          1          0.9231        0.9153   0.9367             0.3201
Pride and Prejudice                 0.9231     1             0.9491   0.9603             0.3653
Metamorphosis                       0.9153     0.9491        1        0.9653             0.3128
The Adventures of Sherlock Holmes   0.9367     0.9603        0.9653   1                  0.3435
Candida                             0.3201     0.3653        0.3128   0.3435             1
Figure 21: Authorship Attribution using "N-gram and Cosine Similarity" algorithm
using a 5-gram (N=5, L=3000)

                                    The Gold   Sense and     The      The Hound of       Helden
                                    Bug        Sensibility   Trial    the Baskervilles
Hans Pfall                          1          0.7634        0.8016   0.8564             0.0413
Pride and Prejudice                 0.7634     1             0.8367   0.8582             0.0413
Metamorphosis                       0.8016     0.8367        1        0.9026             0.0381
The Adventures of Sherlock Holmes   0.8564     0.8582        0.9026   1                  0.0425
Candida                             0.0413     0.0413        0.0381   0.0425             1
As we can see, the results were as expected. From these results we might also
add that this method could be used for:
- language attribution
Observations from Algorithm Implementation:
When calculating the probability distribution of grams for each book, we used
two methods:
- The first was to create the entire gram list, find the unique grams in the list, and
then loop over i = 1 to the number of unique grams, counting the frequency of
each unique gram within the original gram-list data structure. This took a
tremendous amount of time.
Used: count = sum(ismember(gramsList, uniqueGramsList{i}));
- So instead, we tried a different approach: searching for each unique gram
directly in the corpus, which is a single string. This was much faster computationally.
Used: count = length(strfind(book, uniqueGramsList{i}));
The first method took longer because it had to test every cell of the gram list for
membership on every iteration; matching against a single string (second method)
was faster.
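For reference, a small self-contained sketch of the two counting strategies just described (the variable names and toy corpus are illustrative; this is not the exact code from BigAssignemnt.m):

% Illustrative comparison of the two gram-counting strategies.
book = 'to be or not to be that is the question';      % toy corpus (a single string)
n = 3;
gramsList = cell(1, length(book) - n + 1);
for i = 1:length(book) - n + 1
    gramsList{i} = book(i:i+n-1);
end
uniqueGramsList = unique(gramsList);

% Method 1: membership test against the cell array of all grams (slow on large corpora).
counts1 = zeros(1, numel(uniqueGramsList));
for i = 1:numel(uniqueGramsList)
    counts1(i) = sum(ismember(gramsList, uniqueGramsList{i}));
end

% Method 2: substring search directly in the corpus string (much faster in practice).
counts2 = zeros(1, numel(uniqueGramsList));
for i = 1:numel(uniqueGramsList)
    counts2(i) = length(strfind(book, uniqueGramsList{i}));
end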
We also noticed that as the size of the n-grams was increased, the algorithm took
longer to run.
PART SIX: GENRE CLASSIFICATION
Genre classification is the task of placing a document/text into one or more
labeled categories, such as mystery, romance, action, adventure, or horror. The
most common application would be in library science.
Can you develop a metric for what you have done so far to classify the genre of
the stories? Implement your techniques to demonstrate classification.
From the previous two experiments done in Part Five, we have seen that our n-gram
algorithm does a more accurate job at authorship attribution than the standard
English matrix algorithm. Hence, for the genre classification of problem 1(h) we
will use the same technique. Our metric for evaluation, again, will be the cosine
similarity measure introduced in the previous section. We quickly restate it here:

cos(x, y) = (x · y) / (‖x‖ ‖y‖)

where · indicates the dot product and ‖x‖ indicates the length (norm) of the
vector x. Here, x and y are the n-gram frequency profiles of the two books we are
computing the genre classification for. The values of this metric can be between
0 and 1, where 0 represents high dissimilarity and 1 represents high similarity.
In this section, we will use cosine similarity to measure how similar one book of a
particular genre is to another book of the same or different genre. We will be
using the books in Table 6. We will know that our classification has worked if two
different books belonging to the same genre show up with the highest similarity
measure.
Table 6: Selected Genres, Titles and Authors (for Genre Classification)

Genre             Title                              Author
Science Fiction   The Land that Time Forgot          Edgar Rice Burroughs
Science Fiction   The Time Machine                   H. G. Wells
Gothic Fiction    Legend of Sleepy Hollow            Washington Irving
Gothic Fiction    The House of the Seven Gables      Nathaniel Hawthorne
Detective         The Murders in the Rue Morgue      Edgar Allan Poe
Detective         The Hound of the Baskervilles      Sir Arthur Conan Doyle
Romance           Pride and Prejudice                Jane Austen
Romance           Wuthering Heights                  Emily Bronte
Adventure         Tarzan of the Apes                 Edgar Rice Burroughs
Adventure         The Jungle Book                    Rudyard Kipling
For our simulations we will be using an n-gram size of 6 and a gram cutoff size of
3,000 grams. However, in this case we will be placing all titles in both the rows
and the columns. Hence we will have an n-by-n matrix where n = total number of
books. The results are shown in Fig22.
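Before looking at the results, here is a rough sketch of how such an n-by-n matrix can be assembled in MATLAB. Random toy data stands in for the real aligned gram profiles, and this is not the exact findCosineSimilarityGenre routine documented in Part Nine:

% Assemble an n-by-n cosine similarity matrix from aligned gram profiles.
% Each row of P is one book's frequency vector over a common gram vocabulary;
% random toy data stands in for the real top-3000 6-gram profiles.
nBooks = 10;
vocabSize = 3000;
P = rand(nBooks, vocabSize);              % replace with the real aligned profiles
simMatrix = zeros(nBooks);
for i = 1:nBooks
    for j = 1:nBooks
        simMatrix(i, j) = dot(P(i, :), P(j, :)) / (norm(P(i, :)) * norm(P(j, :)));
    end
end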
Figure 22: Genre Classification using "N-gram and Cosine Similarity" algorithm
using a 6-gram (N=6, L=3000). Rows and columns list the same ten books:
(1) The Land that Time Forgot (scifi), (2) The Time Machine (scifi),
(3) Legend of Sleepy Hollow (gothfi), (4) The House of the Seven Gables (gothfi),
(5) The Murders in the Rue Morgue (detective), (6) The Hound of the Baskervilles (detective),
(7) Pride and Prejudice (romance), (8) Wuthering Heights (romance),
(9) Tarzan of the Apes (adventure), (10) The Jungle Book (adventure).

        (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)     (9)     (10)
(1)     1       0.8471  0.7763  0.7831  0.7762  0.8051  0.6932  0.7196  0.8423  0.7619
(2)     0.8471  1       0.7659  0.7823  0.7564  0.7767  0.6598  0.7108  0.7875  0.7645
(3)     0.7763  0.7659  1       0.8051  0.7633  0.7294  0.6481  0.6778  0.7886  0.7435
(4)     0.7831  0.7823  0.8051  1       0.7925  0.7809  0.7426  0.7256  0.7858  0.7377
(5)     0.7762  0.7564  0.7633  0.7925  1       0.7689  0.6580  0.6179  0.7659  0.6645
(6)     0.8051  0.7767  0.7294  0.7809  0.7689  1       0.7315  0.7201  0.7708  0.7430
(7)     0.6932  0.6598  0.6481  0.7426  0.6580  0.7315  1       0.7545  0.6809  0.6564
(8)     0.7196  0.7108  0.6778  0.7256  0.6179  0.7201  0.7545  1       0.6894  0.7206
(9)     0.8423  0.7875  0.7886  0.7858  0.7659  0.7708  0.6809  0.6894  1       0.7727
(10)    0.7619  0.7645  0.7435  0.7377  0.6645  0.7430  0.6564  0.7206  0.7727  1
As expected, the diagonal is all 1's, because we have placed the same books in
both the rows and the columns. The diagonal represents the evaluation of each
book against itself, and hence these entries must be 'exactly' similar. In green, we
have marked the other book with which each book is most similar (refer to
Table 4 on how to read the matrix). We can see that the classification has failed in
3 cases (marked in bold green). Let's look at them one by one:
1. 'The Murders in the Rue Morgue' (detective) by Edgar Allan Poe showed the
   highest similarity with the Gothic Fiction 'The House of the Seven Gables' by
   Nathaniel Hawthorne. A plausible reason may be a wrong labeling of the
   'detective' story, which in reality, on closer thought, might be 'gothic fiction'.
   What is certain is that Edgar Allan Poe is known mostly for his works in gothic
   fiction, such as 'The Tell-Tale Heart' and 'The Gold Bug'. It is possible that Poe's
   writing style, and the features extracted from his writing, matched more closely
   to the gothic fiction genre. So, in this case we can say that the author's genre
   has been correctly classified, instead of the particular book itself.
2. 'The Hound of the Baskervilles' (detective) by Sir Arthur Conan Doyle showed
   the highest similarity with the Science Fiction 'The Land that Time Forgot' by
   Edgar Rice Burroughs. Again, as in the case of Poe, Doyle is known to have
   written in other genres besides 'detective' - such as 'The Lost World' (scifi) and
   'Tales of Terror and Mystery' (horror). Hence, the features of his individual
   writing style have placed this particular work (detective) of his into a different
   genre (scifi), but one in which he also writes.
3. 'Tarzan of the Apes' (adventure) by Edgar Rice Burroughs showed the highest
   similarity with the Science Fiction 'The Land that Time Forgot', also by Edgar
   Rice Burroughs. The cosine similarity measure we have used has been found to
   be very accurate for authorship attribution (Figs. 20, 21). This similarity value is
   therefore a case of authorship attribution trumping genre classification.
Can the classification scheme you have designed help with author attribution?
Yes. As we have seen in part five, we have used this same n-gram algorithm with
cosine similarity for authorship attribution. It returned accurate attributions.
Can you say something about correlations among books written by the same
author?
Our selection of authors for genre classification (Table6) included authors of
books writing in different genre. Our results (Fig22) showed that this n-gram and
cosine similarity classification metric sometimes gives higher values to:
- two books of two different genre by two authors who write in more than
one genre (cross-genre authorship)
- the same authors’ work, though the two books are not necessarily in the
same genre (authorship attribution)
So with this method, it is difficult to determine genre classification if the above
two cases just mentioned exist, which often does in reality. For future work, it
would be interesting to search for a metric that can refine this genre classification
to rule out authorship attribution and cross-genre authorship.
Is there any relationship to the styles of the three Bronte sisters’ works?
We can use n-gram with cosine similarity measure to verify this. For this
experiment we chose five books in total: 3 of the Bronte sisters’, 1 from Austen
and 1 from Heyer. All five books measured were taken from the ‘romance’ genre.
Fig. 23 shows there is a high similarity between the three Bronte sisters' works.
For each book we have marked in green the other top 2 books with which it is
most similar. For each of the Bronte sisters, the highest similarity turned out to be
with the work of the two other sisters.
Figure 23: Classification using "N-gram and Cosine Similarity" algorithm using a
6-gram (N=6, L=3000) – Bronte Sisters' Works and the Romance Genre

                                  Jane Eyre     Wuthering Heights   Agnes Grey    Pride and Prejudice   The Black Moth
                                  (C. Bronte)   (E. Bronte)         (A. Bronte)   (Jane Austen)         (Georgette Heyer)
Jane Eyre (C. Bronte)             1             0.8445              0.8892        0.8017                0.8159
Wuthering Heights (E. Bronte)     0.8445        1                   0.8337        0.7545                0.7800
Agnes Grey (A. Bronte)            0.8892        0.8337              1             0.8215                0.7762
Pride and Prejudice (J. Austen)   0.8017        0.7545              0.8215        1                     0.7427
The Black Moth (G. Heyer)         0.8159        0.7800              0.7762        0.7427                1
PART SEVEN: AUTHOR PROFILE SIMILARITY
Ideally, an author profile would contain information about the author, such as
date of birth and/or death, place of birth, a background study of the author and
his/her work (i.e. what kind of environment they grew up in, what influenced
their work), and perhaps personal life.
In this problem 1(i), we will instead use features extracted from a particular
author's work to perform the profile similarity analysis.
In order to do this, we created a collection of each author's work provided on the
data sheet, merging multiple books by that author into one file. We used the
'mergeAuthor' utility code to do this. We also added Jane Austen to the list. We
then applied N-gram analysis to each of these merged files to perform feature
extraction; the resulting distributions constitute a profile for each author. The
cosine similarity measure was then applied to these profiles to detect similarity,
with the arguments to the cosine metric being:

cos(x, y) = cos(AuthorProfile1, AuthorProfile2)

To perform author profile similarity, we chose an n-gram size of 6 with a gram
cutoff size of 3,000 grams. Fig. 24 shows the results from our simulation.
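As an illustration of the merging step, here is a minimal sketch along the lines of the 'mergeAuthor' utility. The file names are hypothetical placeholders and this is not the exact utility code:

% Merge several books by one author into a single file for profile construction.
authorFiles = {'austen_pride_and_prejudice.txt', 'austen_sense_and_sensibility.txt'};
merged = '';
for k = 1:numel(authorFiles)
    merged = [merged, ' ', fileread(authorFiles{k})];   % concatenate the book texts
end
fid = fopen('austen_merged.txt', 'w');
fprintf(fid, '%s', merged);
fclose(fid);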
Figure 24: Author Profile Similarity using "N-gram and Cosine Similarity"
algorithm with a 6-gram (N=6, L=3000) – all authors on the data sheet + Jane
Austen. The matrix is shown in two blocks of eight columns; the final row gives
each column's average total.

            Austen   Dickens  E.Bronte A.Bronte C.Bronte Edgar    Haggard  Cleland
Austen      1        0.7847   0.7713   0.8479   0.8237   0.7327   0.7322   0.7799
Dickens     0.7847   1        0.8008   0.8200   0.8563   0.8534   0.8673   0.8089
E. Bronte   0.7713   0.8008   1        0.8337   0.8445   0.7305   0.7592   0.7510
A. Bronte   0.8479   0.8200   0.8337   1        0.8892   0.7825   0.7906   0.8183
C. Bronte   0.8237   0.8563   0.8445   0.8892   1        0.8217   0.8302   0.8157
Edgar       0.7327   0.8534   0.7305   0.7825   0.8217   1        0.8709   0.7954
Haggard     0.7322   0.8673   0.7592   0.7906   0.8302   0.8709   1        0.7983
Cleland     0.7799   0.8089   0.7510   0.8183   0.8157   0.7954   0.7983   1
Carroll     0.6018   0.6607   0.6309   0.6548   0.6539   0.6078   0.6313   0.5578
Irving      0.6604   0.8068   0.6778   0.7119   0.7424   0.8198   0.8063   0.7579
Doyle       0.7841   0.8841   0.7633   0.8131   0.8641   0.8947   0.8854   0.8181
Twain       0.7304   0.8538   0.7977   0.8310   0.8360   0.8015   0.8423   0.7567
Nicolo      0.7126   0.7654   0.6745   0.7302   0.7251   0.7640   0.7653   0.7298
Wells       0.6842   0.8474   0.7198   0.7596   0.8110   0.8923   0.8723   0.7685
Kafka       0.7230   0.7975   0.6964   0.7398   0.7672   0.7684   0.7452   0.6954
Kipling     0.6691   0.8319   0.7206   0.7357   0.7632   0.7944   0.8205   0.6914
Avg.Total   0.752    0.827    0.761    0.797    0.815    0.808    0.814    0.771

            Carroll  Irving   Doyle    Twain    Nicolo   Wells    Kafka    Kipling
Austen      0.6018   0.6604   0.7841   0.7304   0.7126   0.6842   0.7230   0.6691
Dickens     0.6607   0.8068   0.8841   0.8538   0.7654   0.8474   0.7975   0.8319
E. Bronte   0.6309   0.6778   0.7633   0.7977   0.6745   0.7198   0.6964   0.7206
A. Bronte   0.6548   0.7119   0.8131   0.8310   0.7302   0.7596   0.7398   0.7357
C. Bronte   0.6539   0.7424   0.8641   0.8360   0.7251   0.8110   0.7672   0.7632
Edgar       0.6078   0.8198   0.8947   0.8015   0.7640   0.8923   0.7684   0.7944
Haggard     0.6313   0.8063   0.8854   0.8423   0.7653   0.8723   0.7452   0.8205
Cleland     0.5578   0.7579   0.8181   0.7567   0.7298   0.7685   0.6954   0.6914
Carroll     1        0.5401   0.6291   0.6839   0.5230   0.6252   0.6482   0.6421
Irving      0.5401   1        0.7942   0.7514   0.7148   0.8197   0.6699   0.7435
Doyle       0.6291   0.7942   1        0.8209   0.7641   0.8563   0.8080   0.7911
Twain       0.6839   0.7514   0.8209   1        0.7211   0.8287   0.7649   0.8292
Nicolo      0.5230   0.7148   0.7641   0.7211   1        0.7225   0.6929   0.6976
Wells       0.6252   0.8197   0.8563   0.8287   0.7225   1        0.7337   0.8011
Kafka       0.6482   0.6699   0.8080   0.7649   0.6929   0.7337   1        0.7337
Kipling     0.6421   0.7435   0.7911   0.8292   0.6976   0.8011   0.7337   1
Avg.Total   0.643    0.751    0.823    0.803    0.731    0.796    0.749    0.767
Fig. 24 shows the profile similarity of 16 authors. To read the matrix, let each
column (this could equally be done row-wise) represent an author. A similarity
value has been measured from this author's profile (along the column) to every
other author's profile (along the rows). We have marked the highest similarity in
each column in bold red. For example, the closest profile match to Jane Austen
is A. Bronte.
We also included the measurement of an average total for each column. This total
has been calculated as:

Avg.Total(j) = (1/n) ∑_{i=1..n} cos(AuthorProfile_i, AuthorProfile_j)

where 'j' is the author of the column we are considering, 'i' runs over the cosine
similarity values in that column, and 'n' is the total number of authors we are
doing the measurements for (in this example, n = 16). This average gives us a
statistic of which author's profile is most similar to all the other authors' profiles,
and vice versa.
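In MATLAB this column average can be computed in a single line from the profile similarity matrix. The variable name simMatrix is illustrative, and toy data stands in for the real 16-by-16 matrix:

% Average similarity of each author's profile against all 16 profiles (including
% its own self-similarity of 1), matching the Avg.Total row in Fig. 24.
simMatrix = rand(16); simMatrix = (simMatrix + simMatrix') / 2;   % toy symmetric stand-in
avgTotal = mean(simMatrix, 1);                                    % 1-by-16 vector of column means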
Some observations from Fig. 24:
1. The profiles of Doyle and Dickens are, overall, the most similar to all the other
   authors' profiles on the list, while Carroll's profile is the least similar overall to
   the other authors.
2. The Bronte sisters' profiles are most similar to each other; Austen also ranks
   highly against A. Bronte.
3. Irving, whose profile was constructed from 'The Legend of Sleepy Hollow',
   which falls into the 'horror' genre, should have matched more closely with
   Doyle, the only other author on our list who has a horror piece. However, we
   can see that Irving's profile matches more closely with Edgar R. Burroughs and
   H. G. Wells, both of whom are scifi writers on our list.
A further investigation (using Wikipedia) gave us the following information:
Table 6: Author Information for Irving, Doyle, Wells and Edgar Burroughs

Author      Birth - Death   Early childhood            Genre
Irving      1783 - 1859     Manhattan, New York City   Multi-genre
Doyle       1859 - 1930     Scotland                   Multi-genre (Horror: 'Tales of Terror and Mystery')
Wells       1866 - 1946     England                    Multi-genre (Horror: 'Great Tales of Horror and the Supernatural')
Burroughs   1875 - 1950     Chicago                    Multi-genre (Horror: 'Forgotten Tales of Love and Murder')
One reason behind the similarity might be that the profile investigated for
Irving was small (based on one book only). Hence, all features of his profile
might not have been accurately captured. Moreover, Doyle, Wells and
Burroughs are very similar in era and genre. Hence, the similarity in their
profiles and an incomplete profile of Irving might have caused this bias.
4. As we can also see from Fig. 24, Edgar Burroughs' profile is most similar to
   Doyle's, and Wells' profile is most similar to Burroughs'. The information in
   Table 6 also shows that these authors wrote in the same era. This leads us to
   state that our profile similarity measure is reasonably accurate and has been
   able to reflect true profiling.
PART EIGHT: CONCLUSIONS
9.1 Summary
From the various experiments performed in this assignment, we can summarize
the following observations:
1. Word Yield increases with an increase in Order.
2. Word Length increases with an increase in Order.
3. Word Yield decreases with an increase in Corpus Size.
4. Word Yield decreases with an increase in Reduction Factor (resolution size).
5. Increasing 'n' (the number of iterations used to churn out the monkeytext)
   increases the probability of more meaningful words being generated, which in
   turn increases the word yield.
6. Another noticeable aspect was that as 'n' was increased, the runtime required
   to generate the outputs also increased. This indicates that as n -> ∞ the
   required processing time increases as well.
7. Document length has a huge effect on the frequency distribution of words, and
   consequently on the outcome of comparing the aptitude of the monkey at
   different orders.
8. How we define a word (with or without spaces) will have an effect on the
   Word Yield if the set of characters allowed in the language contains the
   'space' character as well.
9. Digraph paths can be useful for author and/or language identification.
10. Increasing the normalization cutoff helps to improve author attribution when
    using the standard English matrix method.
11. Increasing 'N' (gram size) and 'L' (the number of top-most occurring n-grams)
    helps to improve author attribution when using the N-gram with Cosine
    similarity method. This method could perhaps also be used for language
    attribution.
9.2 Future Work
These are a few directions that could be pursued later to improve on the work in
this assignment:
1. Implementation of higher order monkey simulations
2. Implementation of parallel monkeys with parallel keyboards
3. Finding a genre classification metric that can rule out authorship attribution
and cross-genre authorship.
REFERENCES:
1. http://en.wikipedia.org/wiki/Infinite_monkey_theorem
2. http://whatis.techtarget.com/definition/Infinite-Monkey-Theorem
3. V. Keselj, F. Peng, N. Cercone, C. Thomas, "N-Gram Based Author Profiles for
   Authorship Attribution", Pacific Association for Computational Linguistics, 2003.
4. N-gram presentation:
   https://wiki.eecs.yorku.ca/course_archive/2013-14/W/6339/_media/alzheimer0404n.ppt
5. Cosine Similarity:
   https://wiki.eecs.yorku.ca/course_archive/2013-14/W/6339/_media/cse6339-presentation_ir_vsm.pdf
6. Big Assignment Handouts:
   https://wiki.eecs.yorku.ca/course_archive/2013-14/W/6339/
7. Main handout:
   https://wiki.eecs.yorku.ca/course_archive/2013-14/W/6339/_media/assignments:doc.pdf
8. Extra corpora: http://www.gutenberg.ca/
9. E. Stamatatos, "A Survey of Modern Authorship Attribution Methods":
   http://www.icsd.aegean.gr/lecturers/stamatatos/papers/survey.pdf
10. www.cs.bilkent.edu.tr/~canf/CS533/CS533Spr06stuPresent/Authorship%20Attribution.ppt
11. Book: C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval.
PART NINE: API REFERENCE LIST
The program code for this assignment has been included in a single MATLAB file,
'BigAssignemnt.m', which can be opened and executed in MATLAB. Each problem
has been implemented as a separate 'section' of the file. Each section can be run
individually by removing the block-comment brackets '%{' and '%}' around it and
then running the file. The outputs of each algorithm are displayed in the MATLAB
command window. Variables and data structures can be monitored in the
'workspace' available in MATLAB.
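For example, a disabled section in the file might look like the sketch below (illustrative only, not an excerpt from the actual file); deleting the '%{' and '%}' lines activates it:

%{
% ---- Problem 1(a): straight-forward monkey (illustrative section layout) ----
KEY = ['a':'z', ' '];                          % characters the monkey can type
monkeytext = KEY(randi(numel(KEY), 1, 1e4));   % 10,000 random key presses
disp(monkeytext(1:60));                        % show the first 60 characters
%}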
10.1 Functions and Data Structures
Following is a list of reusable functions and data structures in our BigAssignment
package.
Table 7: Description of Functions and Data Structures in BigAssignemnt.m
(each entry lists the method or data structure, what it returns, and a short description)

Problem 1(a) – STRAIGHT-FORWARD MONKEY

generateMonkeytext(KEY, n)
    Returns: String[] monkeytext
    Given the set of characters in the language KEY and the number of iterations n, generates the monkeytext.

parseCorpus('PATH')
    Returns: String[] corpus
    Given a path to the target file (corpus/text), parses the file as an array of words.

findMeaningfulWords(monkeytext, corpus)
    Returns: String[] hitList
    Checks the monkeytext string for the presence of any of the words in the corpus and returns the words that match.

getWordYield(corpus, hitList)
    Returns: Double[] wordYield
    Calculates the yield of words using the word yield equation.

findLongestWord(hitList)
    Returns: String[] maxWords, Int[] maxWordlen
    Finds the longest word in hitList and returns a list of the longest words and their length.

Problem 1(b) – FIRST-ORDER MONKEY ~ HAMLET (Act III)

keyDistTable1()
    Returns: String[] KEY
    Creates an internal distribution of characters according to Table 1 given in the handout.

Problem 1(c) – SECOND-ORDER MONKEY

getNeatCorpus(corpus)
    Returns: String[] neatCorpus
    Takes as input a parsed corpus, which is an array of words, and arranges it as a single string of words.

build2OrderMatrix(KEY, neatCorpus)
    Returns: Int[][] secondOrderMatrix
    Generates a 2nd-order correlation matrix based on neatCorpus and the key distribution.

build2OrderKeyboard(KEY, secondOrderMatrix)
    Returns: secondOrderKeyboard[]
    Generates the keyboards which the monkey will use to generate random text. This can be reused for higher-order simulations; a keyboard of nth order is defined with dimension n + 1, and keyboards of nth order are used to compute the keyboards of (n+1)th order.

generate2ndOrderMonkeytext(secondOrderKeyboard, KEY, n)
    Returns: String[] monkeytext
    Takes as input the second-order keyboard built from the 2nd-order correlation matrix and generates random key presses to output the monkeytext, which will be 'n' characters long.

Problem 1(c) – THIRD-ORDER MONKEY

Build3OrderMatrix(KEY, neatCorpus)
    Returns: Int[][][] thirdOrderMatrix
    Generates a 3rd-order correlation matrix based on neatCorpus and the key distribution.

Build3OrderKeyboard(KEY, thirdOrderMatrix)
    Returns: thirdOrderKeyboard[][]
    Generates the keyboards the monkey will use to generate random text (same construction as build2OrderKeyboard, one order higher).

generate3rdOrderMonkeytext(thirdOrderKeyboard, KEY, n)
    Returns: String[] monkeytext
    Takes as input the third-order keyboard built from the 3rd-order correlation matrix and generates random key presses to output the monkeytext, which will be 'n' characters long.

Problem 1(c) – FOURTH-ORDER MONKEY (EXTENSION)

Build4OrderMatrix(KEY, neatCorpus)
    Returns: Int[][][] fourthOrderMatrix
    Generates a 4th-order correlation matrix based on neatCorpus and the key distribution.

Build4OrderKeyboard(KEY, fourthOrderMatrix)
    Returns: fourthOrderKeyboard[][][]
    Generates the keyboards the monkey will use to generate random text (same construction as build2OrderKeyboard, two orders higher).

Generate4thOrderMonkeytext(fourthOrderKeyboard, KEY, n)
    Returns: String[] monkeytext
    Takes as input the fourth-order keyboard built from the 4th-order correlation matrix and generates random key presses to output the monkeytext, which will be 'n' characters long.

Problem 1(d) – EFFECTS OF RESOLUTION ON MONKEY LITERACY

resolutionSize (constant)
    Resolution size was assigned a constant value for each run (e.g. 1, 0.05, 0.02, etc.) to factorize the second- and third-order matrices.

Problem 1(e) – CORRELATION MATRICES – FIRST ORDER

Build1OrderMatrix(KEY, neatCorpus)
    Returns: Int[][][] firstOrderMatrix
    Generates a 1st-order correlation matrix based on neatCorpus and the key distribution.

Build1OrderKeyboard(KEY, firstOrderMatrix)
    Returns: sortedKeyDistribList[], sortedKeyList[], firstOrderKeyboard[]
    Sorts the key distribution list and key list in descending order of frequency, so that the characters that appear the most come first, and also builds the 1st-order keyboard.

Problem 1(f) – FIND DIGRAPH PATHS

buildDigraphPath(KEY, startLetter, neatCorpus)
    Returns: String[] digraphPath
    Takes a start letter (e.g. 't') and the characters in the language KEY, and builds a digraph path for the given corpus by finding the most frequently occurring character in the n-order correlation matrices. Returns a string of characters as the digraph path.

Problem 1(g) – AUTHOR ATTRIBUTION – method: Standard English Matrix

books = {'PATH1', 'PATH2', 'PATH3' ... 'PATHN'}
    Declaration of the cell array of books to do attribution for. 'PATHN' points to each book in the file system.

buildStandEnglishMatrix(books, KEY, normSize)
    Returns: Int[][] correlationMatrix, Int[][] standardEnglishMatrix
    Given an array of text file names (titles) and the KEY, this method computes a standard English matrix (i.e. E(i,j)) by averaging individual correlation matrices together. normSize cuts off the size of the matrices to be used in the standard matrix equation. This function includes a call to build2OrderMatrix(KEY, normCorpus) to build the second-order correlation matrices.

findStandSumValue(correlationMatrix, standardEnglishMatrix, books)
    Returns: Double[][] finalAttributionMatrix
    Using the correlation matrix for each author (M(i,j), N(i,j)) and the standard English matrix E(i,j), returns the final author attribution matrix of size length(books)/2 * length(books)/2, using the Standard English matrix 'S' equation from the handout.

Problem 1(g) – AUTHOR ATTRIBUTION – method: N-Gram and Cosine Similarity

tokenizeNgrams(ngramSize, neatCorpus)
    Returns: Char[] gramsList
    Takes the constant ngramSize (n = 3, 4, 5, etc.) and tokenizes the corpus into n-length grams. Returns all the grams as an array list.

getGramDistribution(neatCorpus, gramsList, topGrams)
    Returns: Char[] sortedGramProbabilityDistrib, Char[] sortedUniqueGrams
    Calculates the count (i.e. frequency) of each gram and extracts the list of unique grams. Next, finds the probability distribution for the top-ranked (topGrams) unique grams for neatCorpus. Returns the sorted probability distribution of grams and the list of those grams.

bookwiseGramDistribList(i,:)
    Master list over all 'i' titles in books = {'PATH1', 'PATH2', 'PATH3' ... 'PATHN'}, stored as an [i * sortedGramProbabilityDistrib(i)] structure.

bookwiseGramsList
    Master list over all 'i' titles in books = {'PATH1', 'PATH2', 'PATH3' ... 'PATHN'}, stored as an [i * sortedUniqueGrams(i)] structure.

findCosineSimilarity(bookwiseGramsList, bookwiseGramDistribList, books, topGrams)
    Returns: Double[][] finalCosSimMatrix
    Creates a gram profile for each title. Goes into each book's bookwiseGramsList and creates a gram-wise distribution list of gram probabilities for all authors, then extracts the topGrams (e.g. topGrams = 3000) and stores them in gramwiseDistribList. This function also includes a call to getCosineSimilarity(book1, book2), which calculates the final cosine matrix of size length(books)/2 * length(books)/2.

getCosineSimilarity(gramwiseDistribList(:,i), gramwiseDistribList(:,j))
    Returns: Double[] cosineValue
    Implements the cosine similarity function. The numerator is the dot product of arg1 and arg2; the denominator is the product of their vector lengths (norms). Returns cosineValue, which is between 0 and 1.

Problem 1(h) & 1(i) – GENRE CLASSIFICATION and AUTHOR PROFILE SIMILARITY – method: N-Gram and Cosine Similarity

findCosineSimilarityGenre(bookwiseGramsList, bookwiseGramDistribList, books, topGrams)
    Returns: Double[][] finalCosSimMatrix
    Performs the same operation as findCosineSimilarity(arg1, arg2, arg3, arg4), except that the final cosine matrix is of size length(books) * length(books).
10.2 Web Implementation
We have made the documentation and code for this assignment available on the
webpage: yasmeen.ezpzit.com. For ease of use, the code for each problem has also
been uploaded individually on the site, along with 'BigAssignemnt.m'.