
Florida State University Libraries
Electronic Theses, Treatises and Dissertations
The Graduate School
2008
Using a Specialized Grammar to Generate
Probable Passwords
William J. Glodek
FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES
USING A SPECIALIZED GRAMMAR TO GENERATE PROBABLE
PASSWORDS
By
WILLIAM J. GLODEK
A Thesis submitted to the
Department of Computer Science
in partial fulfillment of the
requirements for the degree of
Master of Science
Degree Awarded:
Spring Semester, 2008
The members of the Committee approve the Thesis of William J. Glodek defended on April 9,
2008.
Sudhir Aggarwal
Professor Co-Directing Thesis
Breno de Medeiros
Professor Co-Directing Thesis
Zhenhai Duan
Committee Member
Approved:
Dr. David Whalley, Chair
Department of Computer Science
Joseph Travis, Dean, College of Arts and Sciences
The Office of Graduate Studies has verified and approved the above named committee members.
TABLE OF CONTENTS

List of Tables
List of Figures
Abstract
1. INTRODUCTION
2. CURRENT STATE OF PASSWORD BREAKING
   2.1 Well-Established Password Breaking Strategies & Techniques
   2.2 Improving Password Breaking By Using Markov Models
3. SPECIALIZED GRAMMAR AND PASSWORD GENERATION
   3.1 Motivation
   3.2 Understanding User Generated Passwords
   3.3 Developing a Specialized Context-Sensitive Grammar
   3.4 Implementing a Specialized Grammar to Generate Passwords
4. CREATING A CUSTOM INPUT DICTIONARY
   4.1 Extracting Song Lyrics
   4.2 Song Lyric Analysis
5. EXPERIMENTS AND RESULTS
   5.1 Specialized Grammar vs John the Ripper
   5.2 Specialized Grammar and Lyrics Dictionary
6. FUTURE RESEARCH
APPENDICES
A. COMPUTE GRAMMAR
B. PASSWORD GENERATION
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

2.1 Normal Rainbow Table vs Markovian/FA Rainbow Table
3.1 Password Length Statistics
3.2 Password Basic Characteristic Statistics
3.3 Simplified Password Statistics
3.4 Complete Structure Statistics of Simple Structure LD
3.5 Digit Grouping Statistics
3.6 1-Digit Grouping Statistics
3.7 2-Digit Grouping Statistics
3.8 Symbol Grouping Statistics
3.9 1-Symbol Grouping Statistics
4.1 Frequency Analysis of Unique Words
4.2 Frequency Analysis of Restricted Words
5.1 Performance Improvements
LIST OF FIGURES

5.1 SG vs John the Ripper (Training Set Dictionary)
5.2 SG vs John the Ripper (English Dictionary)
5.3 SG vs John the Ripper (Scientific Dictionary)
5.4 Comparison of the Training Set Dictionary and Lyric Input Dictionaries
ABSTRACT
The most common method of preventing unauthorized access to digital information is a password-based authentication system. The strength of such a system relies on a human's ability to generate a password that is memorable but not easily guessed. Brute force techniques can be used to break passwords through exhaustive search, but this may take an infeasible amount of time. Dictionary attack techniques attempt to break passwords by applying common password construction patterns to standard dictionaries. These common strategies are often successful in breaking weak passwords, but as computer users become more educated in secure computing practices, these strategies may become less successful.
We have developed a novel password breaking strategy that uses known passwords to develop a specialized grammar, which can be used to generate probable passwords. The password generation process uses a probabilistic approach: the grammar measures the likelihood of each password structure, and passwords are generated based on these probabilistic password structures. In this thesis, we describe the development and implementation of the specialized grammar and the generation of passwords. We also show that our probable password generation strategy is more effective than current password breaking utilities and provides a foundation for future research.
CHAPTER 1
INTRODUCTION
The use of secure cryptographic protocols to protect information has increased in recent years. The United States government utilizes cryptography to protect sensitive information and communications. Major corporations and businesses use cryptography to secure their customers' information, which can include credit card information and social security numbers. Additionally, the availability of commercial and open source encryption tools has given the average computer user the ability to protect their personal files. Secure encryption is necessary to shield information from unauthorized users.
The use of encryption is not limited to protecting legitimate information; it can also be used to "conceal illegitimate materials from law enforcement" [1]. It is the duty of law enforcement to protect the public from various threats; however, encryption may suppress information law enforcement could use to protect the public. For example, if a person is suspected of committing identity theft but all the evidence of the crime is contained within an encrypted file, law enforcement officials may not have enough evidence to prosecute and/or convict the suspected person. Law enforcement therefore needs to decipher these types of files in order to protect the public.
The most common method of protecting information is through the use of a symmetric key cryptosystem, in which both the "encryption and decryption processes are controlled by a cryptographic key" [2]. The most common cryptographic key is a password. The strength of a password lies in its entropy, which is a measure of the password's uncertainty. A password that consists of random letters, digits and symbols will have a higher entropy than a password that is just a simple English word. A weak link in most passwords is that they must be human-memorable; therefore most passwords have relatively low entropy.
In a forensic setting, obtaining passwords can be difficult. If an individual utilizes encryption to protect files, a law enforcement official can safely assume the individual is technologically literate and understands the importance of using a password that cannot be easily guessed. A password can be broken through brute force techniques, which enumerate all possible passwords. However, it may take an unreasonable amount of time to find the correct password. The focus of this thesis is to explore a more effective password search strategy, which uses a specialized grammar to generate probable passwords that are based on a specific input dictionary or corpus.
The next chapter explores the current state of the art in breaking passwords. Chapter 3 describes a specialized context-sensitive grammar and how it can be used for more efficient password search strategies. The preliminary research associated with selecting input dictionaries and corpuses is explained in Chapter 4. Chapter 5 details the performance of our probable password generator in various experiments. Future research is discussed in the final chapter.
CHAPTER 2
CURRENT STATE OF PASSWORD BREAKING
2.1 Well-Established Password Breaking Strategies & Techniques
In modern operating systems, passwords are not stored in plaintext but as hashed values computed from the plaintext password by a cryptographic hash function. Breaking a password consists of finding a plaintext password that corresponds to a known hashed value. The MD5 and SHA hash functions are the most common cryptographic hash functions. With respect to discovering passwords, the terms breaking and cracking are considered synonymous; for consistency, the process of discovering a password will be referred to as breaking a password. It is important to note that the password key space is vital to the strength of the password-based authentication system. If a key space is small, all possible keys can be tried in a timely manner with modern computing technology. However, if a sufficiently large key space is used, the only feasible manner in which to recover the data is to know the password key. The research described in Chapter 2 and in this thesis focuses on password-based authentication systems that use large key spaces.
The process of breaking passwords has been studied for many years. The most elementary
technique of password breaking is the guess-and-check method, where passwords are selected
without any particular strategy. A natural extension of this method is the dictionary attack. The
dictionary attack strategy is based on the assumption that passwords are based on single words
that can be found in a dictionary. The dictionary attack strategy can also be strengthened by applying predictable variations to dictionary words, such as appended numbers or symbols. The brute force attack strategy attempts to generate all possible password combinations.
The dictionary attack strategy attempts only the easily guessable passwords and can execute
in a relatively short amount of time, but requires a large amount of memory to hold the dictionary.
The brute force strategy generates all possible passwords and guarantees the password will be found, but its execution time may be unreasonable. Popular implementations of these attack types
are John the Ripper and Cain & Abel [3, 4]. An improvement to the standard dictionary and brute force attack strategies was developed by Philippe Oechslin in the paper "Making a Faster Cryptanalytic Time-Memory Trade-Off", which introduced rainbow tables [5]. Rainbow tables are comprised of chains that represent a sequence of plaintext passwords and their corresponding hash values. The benefit of using rainbow tables is that they cover a large percentage of the possible password space while maintaining an efficient lookup speed.
Each chain in a rainbow table is constructed by starting with an arbitrary password, which is used as input to a hash function. A reduction function is then used to map the hash value to a new password. This process is repeated n times to create a chain of length n. The only values stored in the rainbow table are the initial password and the last computed hash. The size of the rainbow table depends on the chain length: a small chain length results in a relatively large rainbow table, while a large chain length results in a relatively small one. An in-depth analysis of rainbow tables is outside the scope of this thesis; however, the article "Password Cracking: Rainbow Tables Explained," written by Philippe Oechslin, explains the motivation and techniques used in efficiently creating and using rainbow tables [6].
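The chain construction just described can be sketched as follows. This is a minimal illustration only: the reduction function below is a simplified stand-in (mapping digest bytes into a small lowercase alphabet), not one used by actual rainbow table implementations, and real tables also vary the reduction per step.

```python
import hashlib

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def reduce_hash(digest: bytes, length: int = 6) -> str:
    """Toy reduction function: map a hash digest back into the password space."""
    return "".join(ALPHABET[b % len(ALPHABET)] for b in digest[:length])

def build_chain(start_password: str, n: int):
    """Hash and reduce n times; only the chain endpoints are stored in the table."""
    password = start_password
    last_hash = b""
    for _ in range(n):
        last_hash = hashlib.md5(password.encode()).digest()
        password = reduce_hash(last_hash)
    return start_password, last_hash  # the two stored values

start, end = build_chain("secret", n=1000)
```

Because the chain is deterministic, a lookup can recompute intermediate values from the stored start password, which is what makes the time-memory trade-off possible.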
The current state of password breaking is built upon these basic strategies and techniques. The
following section details research that incorporates the dictionary attack strategy, the time-memory
tradeoff and Markov Models to break passwords.
2.2 Improving Password Breaking By Using Markov Models
A weakness associated with password-based authentication schemes is that the password must be human-memorable. Researchers from the University of Texas at Austin, Arvind Narayanan and Vitaly Shmatikov, contend that as long as humans are required to memorize passwords, the passwords are vulnerable to "smart dictionary" attacks [7].
2.2.1 Motivation
Humans are often considered the weak link in security systems. A user who is uneducated on
standard security practices may leave the most advanced security system vulnerable to a determined
attacker. Password-based authentication systems often allow users to create passwords without any restrictions on password length or make-up, which often results in passwords that are easily guessable. To strengthen password-based authentication systems, composition rules are sometimes enforced, which "require passwords to be drawn from certain regular languages, and require passwords to include digits and non-alphanumeric characters" [7]. The purpose of composition rules is to increase the complexity of user created passwords, which should theoretically increase the difficulty of an attacker guessing the password. Although composition rules increase the possible password search space, they do not ensure a higher entropy due to limitations introduced by humans. Narayanan and Shmatikov have concluded that even with the adoption of composition rules, human-memorable passwords will still be vulnerable to intelligent dictionary attacks [7].
The measure of how easily a password can be guessed is referred to as its Kolmogorov complexity. The Kolmogorov complexity of a string "is the description length of the smallest Turing machine that outputs that string" [8]. A string that has a low Kolmogorov complexity, or K-complexity, can be represented as a Turing machine with relatively few states. One example of a long string that has a low K-complexity is asasasasasasasasasasasasasasasasas: that string is simply the word as repeated, which a Turing machine could easily produce [7]. However, calculating the K-complexity "of a string is uncomputable" as the "Turing machine model for representing strings is too powerful to be exploited effectively" [8]. The researchers proposed that a major contributor to the memorability of a password is its "phonetic similarity with the words in the user's native language" [7]. Therefore, Markov models could be used to capture this phonetic information based on the native language of the user. Additionally, simple finite automata can be used to filter out passwords, generated from the Markov models, that do not match well-known password creation rules.
2.2.2 Markov Modeling of Passwords
Markov modeling refers to the process of defining "a probability distribution over sequences of symbols" [7]. Markov modeling is commonplace in natural language processing, as Markov models easily capture well-known language patterns. The simplest statistical information that can be utilized by Markov modeling is the underlying letter frequency distribution of a particular language. For example, in the English language the letter E occurs 12.7% of the time, while the letter Q occurs only 0.1% of the time [9]. Therefore, when creating words based on the English language, the letter E should appear more often than the letter Q.
Narayanan and Shmatikov utilized statistical information about the English language with two Markov model approaches: zero-order and first-order. The zero-order Markov model considers each letter independently when generating passwords. The primary motivation for utilizing the zero-order Markov model is to mimic a "commonly used password generation [strategy], where the password is an acronym that is obtained by taking the first letter of each word in a sentence" [7]. The researchers assumed that the underlying letter frequencies associated with sentences in the English language could be captured in the zero-order Markov model. However, not all passwords will be generated; only passwords above a specified threshold will be. The mathematical equation that represents a generated password is as follows:
P(α) = ∏_{x ∈ α} ν(x)    (2.1)

where:
P(α) = the probability of the generated password, i.e. the product of each character's probability
α = the generated password
x = an individual character of α
ν(x) = the probability of the character x [7]
A threshold, θ, defines which generated passwords are accepted and placed into the zero-order Markov model dictionary. The mathematical equation that represents the zero-order dictionary is:

D_{ν,θ} = {α : P(α) ≥ θ}    (2.2)
The first-order Markov model generates the next letter in the password based upon the previously generated character. The motivation for the first-order Markov model is that users will generate passwords that are not present in the English dictionary but are phonetically pronounceable. For example, the word "phonerneting" is not a dictionary word, but it can be spoken and will likely be remembered by a user. Since the probability of the current character is based upon the previously generated character, the mathematical equation representing the first-order Markov model differs from equation 2.1:

P(x₁x₂···xₙ) = ν(x₁) ∏_{i=1}^{n−1} ν(x_{i+1}|x_i)    (2.3)

where:
x₁x₂···xₙ = password of length n
ν(x_{i+1}|x_i) = probability of x_{i+1} occurring after x_i
Similarly to the zero-order model, a threshold, θ, is selected in the generation of the first-order dictionary. The mathematical representation of the first-order dictionary is as follows:

D_{ν,θ} = {x₁x₂···xₙ : P(x₁x₂···xₙ) ≥ θ}    (2.4)
By introducing a threshold, the password search space can be drastically reduced to include only the most plausible passwords, where a plausible password is one that is considered pronounceable. The example given in the research refers to passwords with a length of 8 characters. Suppose a zero-order Markov model dictionary was created using equation 2.2 with θ = 0. Now suppose a θ was chosen that accepted only 14% of those generated passwords, producing a second zero-order dictionary. The second dictionary still covers 90% of the plausible password space; in other words, 14% of the generated passwords account for 90% of the plausible passwords. By intelligently choosing θ, the researchers could maximize coverage over the plausible password space while minimizing time for computation [7].
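Equations 2.1 and 2.2 can be made concrete with a small sketch. The letter frequencies below are invented for illustration over a four-letter alphabet; they are not the trained English-language values used in [7].

```python
from itertools import product

# Hypothetical zero-order letter frequencies (not the trained values from [7]).
NU = {"a": 0.4, "s": 0.3, "q": 0.2, "z": 0.1}

def zero_order_prob(alpha: str) -> float:
    """Equation 2.1: P(alpha) is the product of each character's probability."""
    p = 1.0
    for x in alpha:
        p *= NU[x]
    return p

def zero_order_dictionary(length: int, theta: float) -> list[str]:
    """Equation 2.2: keep only candidate strings with P(alpha) >= theta."""
    return ["".join(chars) for chars in product(NU, repeat=length)
            if zero_order_prob("".join(chars)) >= theta]

d = zero_order_dictionary(2, theta=0.1)
# "aa" (0.16), "as" and "sa" (0.12 each) pass; "ss" (0.09) and rarer pairs do not.
```

Raising θ shrinks the dictionary toward the most plausible strings, which is exactly the search-space reduction described above.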
2.2.3 Filtering with Finite Automata
In the previous section, only lowercase letters were used to generate the plausible passwords. In
reality, passwords are generated with a mix of uppercase and lowercase letters, numbers and
symbols. To better represent more realistic passwords, the researchers increased the complete
alphabet to contain:
• 26 lowercase characters (a-z)
• 26 uppercase characters (A-Z)
• 10 numerals (0-9)
• 5 special characters (space, hyphen, underscore, period and comma)
The complete alphabet contains 67 unique characters, which are used in the creation of the zero- and first-order Markov model dictionaries [7]. With an alphabet of 67 characters, as opposed to 26, the size of the generated dictionaries will grow dramatically if intelligent filtering is not used.
However, the locations of the uppercase letters, numerals and special characters in user created passwords can be anticipated. For example, "the first character of an alphabetic sequence is far more likely to be capitalized than the others," and "in an alphanumeric password, all numerals are likely to be at the end" [7]. Many password breaking tools, such as [3] and [4], use common password patterns to generate passwords as transformations on standard dictionaries. By understanding these common password patterns, the researchers can decrease the size of their generated dictionaries by only accepting passwords that match the common password patterns.
The common password patterns can be expressed as regular expressions, which can then be "modeled by finite automata" [7]. Finite automata accept the strings in the regular language defined by a regular expression. The researchers compiled a list of regular expressions based upon the transformation rules found in the popular password breaking tools. These regular expressions, which can be found in appendix A of [7], were used to create finite automata, which can then be used to filter out generated passwords that do not match the most common password patterns. Therefore the dictionaries generated by the researchers can be expressed as "the set of [passwords] matching the Markovian filter and also accepted by at least one of the finite automata corresponding to the regular expressions" [7].
Mathematically, the zero-order dictionary is:

D_{ν,θ,⟨Mᵢ⟩} = {α : P(α) ≥ θ and ∃i : Mᵢ accepts α}    (2.5)

Mᵢ = one of the finite automata
The first-order dictionary is generated similarly to equation 2.5. Overall, the generated dictionaries will consist of only the most probable Markovian-generated passwords that fit the common password patterns.
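A minimal sketch of this combined filter is shown below, using Python's re module in place of explicit finite automata. The two patterns and the probability function are hypothetical stand-ins for illustration, not the actual rules from appendix A of [7].

```python
import re

# Hypothetical pattern set standing in for the regular expressions of [7]:
# an all-lowercase word, or a capitalized word followed by trailing digits.
PATTERNS = [re.compile(r"^[a-z]+$"), re.compile(r"^[A-Z][a-z]+[0-9]+$")]

def markov_fa_filter(candidates, prob, theta):
    """Equation 2.5: keep passwords above the Markov threshold that are
    also accepted by at least one automaton (here, a regex)."""
    return [alpha for alpha in candidates
            if prob(alpha) >= theta
            and any(p.match(alpha) for p in PATTERNS)]

# Toy probability function for illustration only.
toy_prob = lambda alpha: 0.5 if alpha.islower() or alpha[0].isupper() else 0.01
kept = markov_fa_filter(["secret", "Secret12", "12Secret"], toy_prob, 0.1)
# "12Secret" matches neither pattern (and scores low), so it is filtered out.
```

Both conditions must hold, so the surviving dictionary is the intersection of the Markov-plausible strings and the pattern-conforming strings, as in equation 2.5.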
2.2.4 Indexing Generated Passwords
Another major aspect of their research was implementing an efficient indexing algorithm for use in a time-memory trade-off attack [7, 5]. The indexing algorithm enables the efficient generation of rainbow tables from the Markovian/FA filtered dictionaries by quickly producing the ith password in a given dictionary. The efficient generation of the ith password is critical to their research, as the researchers' primary goal was to demonstrate that Oechslin's time-memory trade-off could be further improved. However, this aspect of their research does not directly apply to our research on password breaking with a specialized context-sensitive grammar.
2.2.5 Experiment, Results and Conclusions
To test the effectiveness of their Markovian/FA generated dictionaries, the researchers obtained a list of 142 "real user passwords" from Passware¹. The researchers then generated dictionaries based on the regular expressions found in appendix A of [7], which were then incorporated into a rainbow table. The researchers also generated another rainbow table that consisted only of alphanumeric (lowercase letters only) passwords of 6 characters or less using Rainbow Crack [10]. Both rainbow tables, the hybrid rainbow table generated by the researchers and the Rainbow Crack generated rainbow table, were used to see how many of the 142 supplied passwords could be recovered.

¹This person is the operator of http://LostPassword.com
Table 2.1 [7] illustrates the results of the experiment. The Category column describes the password groups, while the Count column gives the number of passwords in each group. The Rainbow column gives the number of passwords recovered using the standard rainbow table, and the Hybrid column gives the passwords recovered using the Markovian/FA filtered rainbow table.

Table 2.1: Normal Rainbow Table vs Markovian/FA Rainbow Table

The Markovian/FA filtered rainbow table performed very well at breaking passwords with a length of at most five. However, as the length of the passwords increased, the performance of the Markovian/FA filtered rainbow table decreased. It was not able to break any passwords of length greater than eight, or eight character passwords that began with a number or a symbol. Overall, the Markovian/FA filtered rainbow table performed well, and it validated the hypothesis that password breaking can be improved by using Markov models based on the English language and finite automata [7].
CHAPTER 3
SPECIALIZED GRAMMAR AND PASSWORD GENERATION
It is a fair assumption, like that of Narayanan and Shmatikov from the University of Texas at Austin, that human generated passwords are susceptible to an intelligent attack. In the previous chapter, Markov models and finite automata were used to intelligently construct more probable passwords, and this construction strategy was shown to be effective. However, a more effective strategy for generating highly probable passwords may include the use of a specialized grammar. The following sections describe research into the development of a specialized grammar and its application to password generation.
3.1 Motivation
Context-free grammars have long been used in the study of natural languages. Noam Chomsky introduced context-free grammars in his 1956 paper "Three Models for the Description of Language", where he explores the general structure of the English language [11]. The goal of his research was to develop a method to generate arbitrary sentences that model the English language. A general context-free grammar is formally defined as G = (V, Σ, R, S), where:

V - a finite set of non-terminal characters or values
Σ - a finite set of terminal characters or values
R - a finite set of production rules used to generate the grammar
S - the start variable.
Sentences are generated by applying the various production rules, which results in a string of terminal values that models the language described by the grammar. Not all sentences of a language are equally likely; therefore, probabilities can be associated with each production rule to form a probabilistic context-free grammar. Each sentence generated with a probabilistic context-free grammar has an associated numerical production probability, which is a measure of how likely that particular sentence is to be found in the language described by the grammar.
The research surrounding probabilistic context-free grammars has largely been centered on the processing of natural languages; general probabilistic context-free grammars have had few applications to real world problems. We would like to use probabilistic context-free grammars to understand user generated passwords and apply that knowledge to generate probable passwords.
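As a toy illustration of a probabilistic context-free grammar, the sketch below derives strings whose probability is the product of the probabilities of the rules applied. The rules and probabilities are invented for illustration; they are not this thesis's grammar.

```python
import random

# Hypothetical probabilistic grammar: S -> L D; L -> "pass" | "word"; D -> "1" | "123"
RULES = {
    "S": [(["L", "D"], 1.0)],
    "L": [(["pass"], 0.6), (["word"], 0.4)],
    "D": [(["1"], 0.7), (["123"], 0.3)],
}

def derive(symbol="S"):
    """Expand a symbol; return the generated string and its production probability."""
    if symbol not in RULES:            # terminal symbol
        return symbol, 1.0
    expansion, rule_prob = random.choices(
        RULES[symbol], weights=[p for _, p in RULES[symbol]])[0]
    text, prob = "", rule_prob
    for sym in expansion:
        t, p = derive(sym)
        text += t
        prob *= p
    return text, prob

sentence, probability = derive()
# e.g. ("pass1", 0.42), since 1.0 * 0.6 * 0.7 = 0.42
```

Enumerating derivations in order of this probability is what lets a probabilistic grammar rank its outputs from most to least likely.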
3.2 Understanding User Generated Passwords
In Chomsky’s research, he identified two primary challenges when attempting to generally understand a language. The first challenge is discovering the revealing characteristics of the language
that are used to construct sentences. The second challenge is then formulating a general theory
about the language based on the revealing characteristics discovered [11]. As Chomsky's focus was on understanding the general structure of the English language, our focus is on understanding the general structure of user created passwords.
In order to begin to understand how users create passwords, it is necessary to understand the
revealing characteristics of user created passwords. Although there are visual-based authentication
systems, such as [12] or [13], the most popular method of authentication is based on users
supplying a password via a keyboard. The use of a keyboard limits the user to passwords whose symbols consist only of letters, digits and special characters. In [11], Chomsky breaks down sentences into their general phrases, such as noun phrases and verb phrases, to generate grammatically correct sentences. We break down passwords into three general categories: letters¹, digits² and special characters³. Analyzing real world passwords enables us to determine how users incorporate the general categories into passwords.
In October of 2006, a MySpace® phishing attack was launched to obtain the email addresses and passwords of MySpace® users [14]. The phishing attack received major publicity, and subsequently, the list of the obtained passwords was released temporarily for public downloading. A member of Florida State University's Electronic Crime Investigative Technologies Laboratory research team obtained the password list and provided it to us for analysis [15]. The password corpus did not contain any reference to MySpace® usernames, and all users were notified to change their passwords. We feel the passwords are representative of real world user generated passwords. Table 3.1 outlines
¹[A-Z][a-z]
²[0-9]
³! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | }
the password lengths of the obtained passwords.

Table 3.1: Password Length Statistics

  Password Length    Number of Passwords    Percentage
  less than 5                  494             0.7369
  5                            718             1.071
  6                           9946            14.8355
  7                          14827            22.116
  8                          15845            23.6344
  9                          11713            17.4711
  10                          9605            14.3268
  greater than 10             3894             5.8083
  Totals                     67042           100

The passwords that contain between six and ten characters make up 92.38% of all collected passwords. Table 3.2 outlines the basic
characteristics of our password corpus.

Table 3.2: Password Basic Characteristic Statistics

  Password Characteristic                    Number of Passwords    Percentage
  Letters Only                                       5151            7.68324
  Digits Only                                         817            1.21864
  Special Characters Only                               9            0.0134244
  Alphanumeric                                      54212           80.8627
  Letters and Special Characters                     4950            7.38343
  Digits and Special Characters                         8            0.0119328
  Letters, Digits and Special Characters             1895            2.82659
  Totals                                            67042          100

Over 80% of all passwords are alphanumeric, consisting of only letters and digits. From the two tables above, it is clear that most passwords are of
lengths between six and ten symbols that include both letters and digits. At this point, the basic
structure of the passwords is known; however, it is necessary to further analyze the passwords in
order to create a specialized grammar.
3.2.1 Detailed Structure of Passwords
One revealing characteristic of passwords is the specific locations of the letters, digits and special characters within the passwords. By better understanding where these symbols occur in real world passwords, we can create more probable passwords by mimicking the format of the known passwords.
To collect statistical information on this characteristic, we transformed each password into a simplified version using only the symbols L, D and S for letters, digits and special characters, respectively. For example, the password "password10!!" would be transformed into "LDS", as it consists of letters, followed by digits, followed by special characters. The transformation process was performed on all the passwords, and statistical information was collected. The statistical information for the fifteen most frequently occurring password structures can be found in Table 3.3. As expected from Table 3.2, the most frequent password structure is alphanumeric; more specifically, it is letters followed by digits. There are 181 unique simplified password structures contained within the compiled password list. The statistical information regarding simplified password structures gives us a more detailed look into how users incorporate the three symbol categories into passwords.
Another revealing characteristic is how many symbols are used in generating passwords of a
specific structure. For example, consider the simple password structure LD. This structure
appeared 45120 times in the password collection, with 107 unique complete structures. A complete
structure is the representation of a password in terms of the l, d and s symbols without any
simplification; the number following each symbol gives the length of that run of characters. For
example, the complete structure of the password "password10!!" is "l8 d2 s2".

Table 3.3: Simplified Password Statistics

  Simplified Password Structure    Number of Passwords    Percentage
  LD                                     45120             67.3011
  L                                       5151              7.68324
  DL                                      4499              6.71072
  LS                                      3311              4.9387
  LDL                                     2863              4.27046
  LSL                                      998              1.48862
  D                                        817              1.21864
  LSD                                      566              0.844247
  DLD                                      526              0.784583
  LDLD                                     492              0.733868
  LDS                                      396              0.590675
  LDLDL                                    275              0.410191
  LSLD                                     185              0.275946
  SL                                       158              0.235673
  SLS                                      153              0.228215

Of the 107 unique complete structures, "l6 d1" appeared 4933 times, while "l1 d4" occurred only
once. When generating passwords with the simple structure LD, it is therefore to our advantage
to generate passwords of the complete structure "l6 d1" before those of "l1 d4". Table 3.4 lists
the ten most frequently occurring complete structures of the simple structure LD; clearly, the
most popular consists of six letters followed by one digit. The same complete structure
statistics were recorded for each of the simplified structures.
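The transformation from a password to its complete and simple structures can be sketched in Python (the thesis's training code is also written in Python, but the function names below are ours and this is an illustration, not the actual implementation):

```python
import itertools

def complete_structure(password):
    """Rewrite a password as runs of l/d/s with run lengths,
    e.g. 'password10!!' -> 'l8 d2 s2'."""
    def category(ch):
        if ch.isalpha():
            return 'l'
        if ch.isdigit():
            return 'd'
        return 's'
    # groupby yields maximal runs of characters in the same category.
    return ' '.join(f'{cat}{sum(1 for _ in run)}'
                    for cat, run in itertools.groupby(password, key=category))

def simplify(complete):
    """Collapse a complete structure to its simple structure,
    e.g. 'l8 d2 s2' -> 'LDS'."""
    return ''.join(part[0].upper() for part in complete.split())

print(complete_structure('password10!!'))            # l8 d2 s2
print(simplify(complete_structure('password10!!')))  # LDS
```

The run-length encoding falls directly out of itertools.groupby, which is why the transformation of the entire corpus is cheap.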
Digits in Passwords

Alphanumeric passwords are the most popular, and an important characteristic of alphanumeric
passwords is the selection of digits. Digits in passwords often represent something of
significance, such as a birthday or a date. Popular password breaking tools simply take an input
dictionary and append one or two digits, at random, to the end of each dictionary word in hopes
of breaking the password. We instead analyze the digits found in real passwords to improve our
digit selection. The digits are not analyzed individually, but as continuous groups of digits.
For example, the password "10password1" has two digit groups: "10" and "1". Analyzing the digits
as groups aims to capture the significance of each user's digit selections when constructing a
password. The digit groups are categorized by their length.
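Extracting and tallying digit groups by length amounts to a regular-expression scan; the sketch below is illustrative (the sample passwords are ours, not from the corpus):

```python
import re
from collections import Counter

def digit_groups(password):
    """Return the maximal runs of consecutive digits,
    e.g. '10password1' -> ['10', '1']."""
    return re.findall(r'\d+', password)

# Tally digit groups by length across a tiny sample corpus.
length_counts = Counter()
for pw in ('10password1', 'alice1984', 'bob7'):
    for group in digit_groups(pw):
        length_counts[len(group)] += 1

print(digit_groups('10password1'))          # ['10', '1']
print(length_counts[1], length_counts[2])   # 2 1
```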
In the password collection, there are 16 unique digit groupings that occur a total of 59354
times. One-digit groups are the most frequent, occurring 42.42% of the time. Table 3.5 describes
the statistics of the remaining digit groupings. Ninety percent of all digits are included in
digit groupings of lengths between one and four.

Table 3.4: Complete Structure Statistics of Simple Structure LD

  Complete Password Structure    Number of Passwords    Percentage
  l6 d1                                 4933              10.9331
  l6 d2                                 4402               9.75621
  l7 d1                                 4145               9.18661
  l8 d1                                 3540               7.84574
  l5 d2                                 3037               6.73094
  l9 d1                                 2743               6.07934
  l7 d2                                 2720               6.02837
  l5 d1                                 2547               5.64495
  l4 d2                                 2356               5.22163
  l8 d2                                 2271               5.03324

Table 3.5: Digit Grouping Statistics

  Length of Digit Grouping    Number of Occurrences    Percentage of Total
  1                                  25183                   42.4285
  2                                  18084                   30.468
  3                                   6107                   10.2891
  4                                   5182                    8.73067
  5                                   1227                    2.06726
  6                                   2088                    3.51788
  7                                    771                    1.29899
  8                                    436                    0.734576
  9                                    207                    0.348755
  >9                                    69                    0.116252

Each digit grouping was analyzed to determine which digits are most frequent within it. For
example, the most frequent digit in the one-digit grouping is "1", which occurs 50.78% of the
time. Table 3.6 outlines the most frequent one-digit numbers, while Table 3.7 outlines the ten
most frequent two-digit numbers. Similar statistics were recorded for each of the 16 unique
digit groupings. An interesting point is that in the three-digit grouping, the number "123"
occurs 21% of the time. By understanding which digits are more frequent in each digit grouping,
we can generate more probable passwords by using the more frequent digits.

Table 3.6: 1-Digit Grouping Statistics

  1-Digit    Number of Occurrences    Percentage of Total
  1                  12788                  50.7803
  2                   2789                  11.0749
  3                   2096                   8.32308
  4                   1708                   6.78235
  7                   1245                   4.94381
  5                   1039                   4.1258
  0                   1009                   4.00667
  6                    899                   3.56987
  8                    898                   3.5659
  9                    712                   2.8273

Table 3.7: 2-Digit Grouping Statistics

  2-Digit    Number of Occurrences    Percentage of Total
  12                  1084                   5.99425
  13                   771                   4.26344
  11                   747                   4.13072
  69                   734                   4.05884
  06                   595                   3.2902
  22                   567                   3.13537
  21                   538                   2.97501
  23                   533                   2.94736
  14                   481                   2.65981
  10                   467                   2.58239
Special Characters in Passwords

Special characters are also important in password generation. The same process detailed above
for digits was repeated for special characters. There are 213 unique special character sequences
that occur a total of 7260 times. Table 3.8 details the frequencies of each special character
grouping length; it is clear that groupings of one special character occur far more frequently
than any other. Table 3.9 details the most frequent special characters occurring in the one
special character grouping. The same statistics were collected and recorded for the other
special character groupings.

Table 3.8: Symbol Grouping Statistics

  Length of Special Character Grouping    Number of Occurrences    Percentage of Total
  1                                              6566                   90.4408
  2                                               512                    7.05234
  3                                               125                    1.72176
  4                                                29                    0.399449
  5                                                11                    0.151515
  6                                                 8                    0.110193
  9                                                 4                    0.0550964
  8                                                 3                    0.0413223
  7                                                 1                    0.0137741
  10                                                1                    0.0137741
Table 3.9: 1-Symbol Grouping Statistics

  1-Symbol       Number of Occurrences    Percentage of Total
  !                      2047                  31.1758
  .                      1377                  20.9717
  *                       510                   7.76729
  #                       492                   7.49315
  @                       473                   7.20378
  $                       255                   3.88364
  ?                       234                   3.56381
  &                       151                   2.29973
  (unreadable)            131                   1.99513
  (unreadable)            121                   1.84283

3.3 Developing a Specialized Context-Sensitive Grammar
In Section 3.2.1, the password collection was thoroughly analyzed. Each password was transformed
into its corresponding simple and complete structures and their frequencies were recorded.
Frequencies of the digit and special character groupings were also collected. At this point, we
have a thorough understanding of the basic syntax users employ when generating passwords. Our
goal is to apply the learned syntax and frequencies to form a specialized grammar that will
allow us to generate the most probable passwords.

Context-free grammars are used to process complex natural languages such as English; passwords,
as we have seen, are simpler, and can therefore be modeled by a specialized grammar. The
specialized grammar is described similarly to a general probabilistic context-free grammar, but
with some key differences. The finite set of non-terminal characters consists of the simple and
complete structures and their corresponding probabilities defined in Section 3.2.1. The terminal
characters are broken into three distinct groups: words, digits and special characters; the
make-up of the three groups is explained below. The start variable remains the same and has a
probability of 1.0. The production rules of the specialized grammar differ greatly from those of
a general probabilistic context-free grammar, as the locations of the words, digits and special
characters affect which production rules can be selected.
The specialized grammar generates passwords based on password structures created using
production rules. From the start variable, only non-terminal values can be produced. More
specifically, only the simple structures (a sample of which can be found in Table 3.3) can be
produced from the start variable. For example, if the simple structure LD were produced from the
start variable, the resulting generated password would be alphanumeric, with letters followed by
digits. Once a simple structure has been selected, the only values that can be produced are the
complete structures, which are also non-terminal values. The production rules that map a simple
structure to a complete structure depend upon the chosen simple structure: only complete
structures that simplify to the previously produced simple structure may be selected. For
instance, if the simple structure LD were selected, the complete structure l4 d3 would be a
valid production, while the complete structure d2 l4 would not. The probabilities of every
production rule selected along a derivation are multiplied together and recorded.
After a complete structure is selected, only terminal values can be produced. The terminal
characters fall into three distinct groups: words, digits and special characters. The members of
the digit and special character groups are each associated with a probability. The length of
each group is dictated by the lengths of the letter, digit and special character groupings in
the complete structure non-terminal value. For example, consider the complete structure
l4 d2 s1: the resulting generated password would consist of a 4-letter word followed by a
2-digit number and one special character. We then substitute all digits and special characters
into the complete structure, which results in password structures that contain specific digits
and special characters but still general word groups. At this point, there are specific complete
structures with specific production probabilities; the one with the highest probability
represents the most probable password structure. We then substitute words into the specific
complete structures to generate passwords.
The members of the digit and special character terminal groups are obtained by extracting the
digit and special character groupings during the analysis of the simple and complete password
structures. The word terminal group is derived from an input dictionary, which is generated
independently from sources other than the passwords. The primary purpose of a customizable input
dictionary is to allow the generation of passwords within a specific domain. For example, if a
password is suspected to be constructed from scientific terms, it would be advantageous to
generate passwords using an input dictionary of scientific terms. The ability to use a
customizable input dictionary is available in all modern password breaking tools.
Consider the following specialized grammar:

  V = simple structures: {(LDS, 1.0)};
      complete structures: {(l3 d2 s1, 0.75), (l3 d1 s1, 0.25)}
  Σ = terminal groups
      words: {cat, dog, at}
      digits: {(99, 0.9), (00, 0.1), (1, 1.0)}
      symbols: {(!, 1.0)}
  R = production rules
      Λ -> LDS            1.0
      LDS -> l3 d2 s1     0.75
      LDS -> l3 d1 s1     0.25
  Λ = start variable, (Λ, 1.0)
We want to generate all possible passwords within this specialized grammar. We begin with the
start symbol, which has probability 1.0. The only production available from the start symbol
yields the simple structure LDS, which occurs 100% of the time in this specialized grammar. From
the simple structure non-terminal there are two possible productions. We choose the first
complete structure non-terminal, l3 d2 s1, which occurs 75% of the time within this specialized
grammar. The complete structure l3 d2 s1 contains a 3-letter word, a 2-digit number and 1
special character. We then apply the digit and special character groups to generate the specific
complete structures. The following list outlines the creation of the password structures:

  Λ -> LDS -> l3 d2 s1 -> l3 99 s1 -> l3 99!
    1.0 * 1.0 * 0.75 * 0.9 * 1.0 = 0.675
  Λ -> LDS -> l3 d2 s1 -> l3 00 s1 -> l3 00!
    1.0 * 1.0 * 0.75 * 0.1 * 1.0 = 0.075
  Λ -> LDS -> l3 d1 s1 -> l3 1 s1 -> l3 1!
    1.0 * 1.0 * 0.25 * 1.0 * 1.0 = 0.25

Three password structures are generated: (l3 99!, 0.675), (l3 00!, 0.075) and (l3 1!, 0.25). The
most probable password structure is l3 99!, while the least probable is l3 00!. The word group
is then substituted into the password structures, which results in the following passwords being
generated: cat99!, dog99!, cat1!, dog1!, cat00! and dog00!. Notice that the word "at" was not
used because no word slots of length 2 appear in the password structures. The specialized
grammar in this example is relatively small, but it illustrates both the password structure
generation and the actual password generation. Section 3.4 explains the password generation
process in depth.
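The derivation above can be reproduced mechanically. The sketch below is an illustration of the technique, not the thesis's implementation: it walks the example grammar, substitutes the digit and symbol groups into each complete structure, and multiplies the production probabilities along each derivation.

```python
from itertools import product

# The example grammar from the text: probabilities of each complete structure
# given LDS, and of the digit/symbol terminal groups keyed by group length.
complete = {'l3 d2 s1': 0.75, 'l3 d1 s1': 0.25}
digits   = {2: {'99': 0.9, '00': 0.1}, 1: {'1': 1.0}}
symbols  = {1: {'!': 1.0}}

def password_structures():
    """Return every password structure and its derivation probability."""
    results = {}
    for structure, p_structure in complete.items():
        choices = []                            # per-slot (text, probability) options
        for part in structure.split():          # e.g. ['l3', 'd2', 's1']
            cat, length = part[0], int(part[1:])
            if cat == 'l':
                choices.append([(part, 1.0)])   # word slots stay generic
            elif cat == 'd':
                choices.append(list(digits[length].items()))
            else:
                choices.append(list(symbols[length].items()))
        for combo in product(*choices):
            text = ''.join(t for t, _ in combo)
            prob = p_structure                  # Λ -> LDS contributes a factor of 1.0
            for _, p in combo:
                prob *= p
            results[text] = prob
    return results

for text, prob in sorted(password_structures().items(), key=lambda kv: -kv[1]):
    print(text, round(prob, 3))
# l399! 0.675
# l31! 0.25
# l300! 0.075
```

The three probabilities match the hand derivation in the text, with the structures emerging already sorted from most to least probable.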
3.4 Implementing a Specialized Grammar to Generate Passwords
There are two phases involved in generating passwords using a specialized grammar. The first
phase trains the specialized grammar on a known corpus of passwords. The second phase is the
actual password generation outlined in Section 3.3. Before the implementation is described, it
is necessary to understand the directory structure associated with password generation. The
grammar/ directory holds the pertinent information of the specialized grammar. The words/
directory houses all the words included within a user-supplied input dictionary. The numbers/
directory maintains all the terminal digit groups, while the symbols/ directory maintains all
the terminal special character groups.
3.4.1 Training a Specialized Grammar
A training period is needed to initialize the specialized grammar's simple and complete
non-terminal values and the digit and special character terminal groups. After the training
period, the grammar, numbers and symbols directories are populated with the information
necessary to begin generating probable passwords. The training program is called
computer_grammar.py and its implementation can be found in Appendix A. The program is written in
the Python programming language, which handles files and string manipulation very easily.

The training program accepts one command line argument: the file name of the known password
corpus. Before the analysis of the known passwords begins, four hash tables are initialized to
track the frequencies of the simple structures, complete structures, digit groups and special
character groups. The known password file is then opened and each password is analyzed
individually. Each password is transformed into its complete structure. During the
transformation, each group of contiguous digits or special characters is extracted and placed in
the appropriate frequency hash table. Once the complete structure of the password is formed, it
is placed into the frequency hash table of complete structures. The complete structure is then
simplified and the resulting structure is inserted into the frequency hash table of simple
structures. This transformation and extraction is performed on each password in the known
password corpus; afterwards, the four hash tables are filled with the frequencies of each unique
structure or grouping found in the known passwords.
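The core of this training pass can be sketched as follows. This is an illustration with our own helper names, not the thesis's actual training program; Python's Counter stands in for the four frequency hash tables.

```python
import itertools
import re
from collections import Counter

def complete_structure(password):
    """e.g. 'password10!!' -> 'l8 d2 s2' (hypothetical helper, not the thesis's code)."""
    cat = lambda ch: 'l' if ch.isalpha() else ('d' if ch.isdigit() else 's')
    return ' '.join(f'{c}{sum(1 for _ in run)}'
                    for c, run in itertools.groupby(password, key=cat))

def train(passwords):
    """Fill the four frequency tables: simple structures, complete structures,
    digit groups and special character groups."""
    simple, full, digit_groups, symbol_groups = Counter(), Counter(), Counter(), Counter()
    for pw in passwords:
        digit_groups.update(re.findall(r'\d+', pw))            # contiguous digit runs
        symbol_groups.update(re.findall(r'[^a-zA-Z0-9]+', pw)) # contiguous symbol runs
        structure = complete_structure(pw)
        full[structure] += 1
        simple[''.join(p[0].upper() for p in structure.split())] += 1
    return simple, full, digit_groups, symbol_groups

simple, full, digits, symbols = train(['password10!!', 'alice99', 'bob99'])
print(simple['LD'], full['l8 d2 s2'], digits['99'], symbols['!!'])  # 2 1 2 1
```

Sorting each Counter and dividing by its total then yields the probabilities that the grammar, numbers and symbols directories store.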
All of the hash tables are then sorted and each member's probability is calculated. The simple
and complete structure hash tables are stored within the grammar directory in the files
simple.txt and complete.txt, respectively. The digit group hash table is stored within the
numbers directory, not as one large file, but as separate files of numbers of the same length.
For example, the number "99" has a length of two, so it is stored in the file numbers/2.txt,
which holds all two-digit numbers found in the known password corpus. The symbol group hash
table is stored in the same manner within the symbols directory. The specialized
context-sensitive grammar is now populated and password generation can begin.
3.4.2 Probable Password Generation Implementation
The password generator, gen_passwords.c, is implemented in C and can be found in Appendix B.
Before password generation can begin, a user must supply an input dictionary to be processed. An
input dictionary is a file that contains a collection of words, one word per line. Each word in
the input dictionary can also have an associated weight value, a measure of how likely that
particular word is to appear. The input dictionary is processed much like the digit and special
character groups described in Section 3.4.1: all words of the same length are placed into a file
within the words directory. For example, if the word "apple" is present in the input dictionary,
it is placed in the words/5.txt file along with all the other 5-letter words.
Once the words directory has been populated, probable password generation can be started. One
simple structure is processed at a time, and all complete structures that simplify to it are
processed before another simple structure is selected. When a complete structure is chosen, each
distinct word, digit and special character grouping is processed from left to right. The length
of each grouping determines the file to be opened and read to generate that portion of the
password. For example, if the complete structure being processed has the form l5 d3 s1, the word
grouping is processed first: since it has 5 letters, the file words/5.txt is opened and all
5-letter words are placed into passwords matching the complete structure. Next, the digit
grouping is processed, appending all 3-digit numbers to the end of all the 5-letter words.
Finally, all special characters in symbols/1.txt are appended to all previously generated
passwords.
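For a complete structure such as l5 d3 s1, this left-to-right processing is essentially a cross product of the three per-length groups. A sketch with in-memory lists standing in for the words/5.txt, numbers/3.txt and symbols/1.txt files (the file layout is the thesis's; the sample contents are ours):

```python
from itertools import product

# Stand-ins for the contents of words/5.txt, numbers/3.txt and symbols/1.txt.
words_5   = ['apple', 'tiger']
numbers_3 = ['123', '007']
symbols_1 = ['!', '.']

# Process the groupings of l5 d3 s1 left to right, concatenating one choice
# from each group to form a candidate password.
candidates = [w + d + s for w, d, s in product(words_5, numbers_3, symbols_1)]
print(len(candidates), candidates[0], candidates[-1])  # 8 apple123! tiger007.
```

In the actual generator, each candidate also carries a running probability, which is what the threshold option described below prunes against.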
During password construction, the probability described in Section 3.3 is recorded for each
individual password. A threshold value can be established through a command line option, which
forces all generated passwords to be above a specific probability: as each grouping is
processed, if the password's probability falls below the threshold, the password is discarded.
The password generator also allows complete structures to be processed concurrently, which can
be selected through a command line option.
All generated passwords and their corresponding probabilities are stored in the tmp/ directory.
The passwords are stored in files named after their complete structures. For example, the
generated password password1 matches the complete structure l8 d1, and would be placed in the
tmp/l8d1.txt.2 file with all other passwords matching that complete structure. The last number
of the filename represents the number of distinct groups in the complete structure. A limitation
of 32-bit systems is a maximum file size of 2GB. To overcome this limitation, complete
structures that produce more than 2GB of passwords have their results broken into multiple
files, each with an additional number appended to the filename. If the complete structure above
generated more than 2GB of passwords, the generated passwords would be in the files
tmp/l8d1.txt.2.0, tmp/l8d1.txt.2.1, etc.
CHAPTER 4
CREATING A CUSTOM INPUT DICTIONARY
We have a thorough understanding of how numbers and special characters are used in generating
passwords. The difficult aspect of breaking passwords pertains to intelligently selecting the
words that form the root of the password. Many password-based authentication schemes have a
password aging requirement, which states that a password must "be changed after some period of
time has passed or after some event has occurred" [16]. One problem associated with password
aging requirements is that users do not change the root of the password: users may change only a
prefix or a suffix while leaving the root word the same. For example, if a user currently has
the password apple01, it would not be unreasonable to assume that the next password could be
apple02. It is our assumption that if we can intelligently select the root word within a
password, our chances of breaking passwords will increase drastically.

The focus of this research has been on understanding the user generated passwords from our
password corpus. MySpace® is a popular social networking site that allows users to sign up and
share photos, journals and interests throughout the world [17].¹ MySpace® also enables musical
groups and artists to create accounts to share their music and connect with fans. It is our
assumption that MySpace® users identify with their favorite musical group or artist and will
incorporate the group's or artist's lyrics into their passwords.
4.1 Extracting Song Lyrics

There are a large number of musical genres within MySpace®, so it was important to retrieve
lyrics from a source that represented a wide variety of popular musical genres. A popular
user-submitted music lyric website, A-Z Lyrics Universe,² was selected as the source of song
lyrics. Browsing through A-Z Lyrics Universe, we found that it contained a large selection of
popular music lyrics in an easily parseable format. The lyrics were extracted using standard
Unix command line utilities and lynx, "a general purpose distributed information browser for the
World Wide Web" [19].

¹ As of March 7, 2008, MySpace® ranks third behind Google and Yahoo in internet site traffic [18].
² http://www.azlyrics.com/
4.2 Song Lyric Analysis
A total of 55452 songs were analyzed and their lyrics extracted. All of the lyrics were
converted to lowercase, and all periods, exclamation points and question marks were removed;
these punctuation marks typically appeared at the end of lyric sentences and were not an
integral part of the lyrics themselves. There were 15,636,602 total words found in the lyrics,
but only 180,083 of them were unique. Table 4.1 describes the lengths of the unique words found
in the lyrics.

Table 4.1: Frequency Analysis of Unique Words

  Word Length        Number of Occurrences    Percentage
  less than 5               16097                8.93866
  5-7                       72382               40.1937
  8-10                      54272               30.1372
  11-16                     21399               11.8829
  greater than 16           15933                8.84759
  Totals                   180083              100

When constructing a dictionary, however, it is infeasible to include all available words. We
therefore decided to include only words that occur at least 150 times within the lyrics.
Enforcing this minimal frequency restricts the number of unique words from 180083 down to 4835.
This may seem like a drastic reduction, but these 4835 words appear 14,510,365 times, which is
92.79% of the total number of words found in the lyrics. Table 4.2 displays the frequency
analysis of the restricted lyric dictionary: a majority of the word occurrences are of words
with fewer than five characters. It is our assumption that these 4835 unique words provide
ample coverage of the lyric collection's word space.

Table 4.2: Frequency Analysis of Restricted Words

  Word Length        Number of Occurrences    Percentage
  less than 5            10704302               73.77
  5-7                     3421115               23.577
  8-10                     367024                2.52939
  greater than 11           17924                0.123525
  Totals                 14510365              100
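The frequency cutoff described above amounts to a simple counting filter; a sketch with a toy word list (the 150-occurrence threshold is the thesis's, the sample data and the lowered threshold used here are not):

```python
from collections import Counter

def restricted_dictionary(lyric_words, min_count):
    """Keep only words appearing at least min_count times (the thesis uses 150)."""
    counts = Counter(lyric_words)
    return {word for word, n in counts.items() if n >= min_count}

# Toy word list with a lowered threshold so the effect is visible.
sample = ['love'] * 5 + ['ima'] * 3 + ['obscure']
print(sorted(restricted_dictionary(sample, min_count=3)))  # ['ima', 'love']
```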
It is interesting to note that a majority of the words in the restricted lyrics dictionary
contain fewer than five letters. A possible explanation is the presence of slang terms in the
user-submitted lyrics. For example, the word "ima" appears 4063 times; it is a common slang
abbreviation for "i'm gonna." It is the inclusion of these types of slang terms that could
improve our password breaking performance, as it captures the common language used by many young
people on the internet.
CHAPTER 5
EXPERIMENTS AND RESULTS
To test the effectiveness of our specialized grammar, we randomly separated the password corpus
into two sets. The first, the training set, contained 33561 passwords and was used to generate a
specialized grammar. The second, the test set, contained 33481 passwords, and all generated
passwords were tested against it. For all experiments, the specialized grammar derived from the
training set was used to generate passwords.
5.1 Specialized Grammar vs. John the Ripper
We compared our probable password generation against a well-known password breaking utility,
John the Ripper¹ [3]. Like our password generation program, John the Ripper accepts an input
dictionary: its wordlist mode takes an input dictionary and generates passwords based on the
dictionary and a set of transformation rules. Since we are attempting to generate a specific
domain of passwords, it is important to use an input dictionary that is likely to be used in
that same domain. To meet this requirement, we simply extracted all the unique words² found in
the training set of passwords, which resulted in an input dictionary of 4143 words. All words
within the dictionary are considered equally likely, so they all have a weight value of 1.0.
This input dictionary was used by both our password generator and John the Ripper.

The following command was used to generate the list of all passwords attempted by John the
Ripper in wordlist mode with the standard transformation rules applied:

./john --wordlist=myspace.dict --rules --stdout > jtr_candidate.pwds

¹ version 1.7.2
² Words are considered any group of contiguous letters.
John the Ripper generated 189091 candidate passwords. We then used our probable password
generator to generate all possible passwords based on the input dictionary, from which we
selected only the 189091 most probable. These two sets of candidate passwords were then compared
against the password test set.
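Measuring how many test-set passwords a candidate list recovers is a set intersection; the sketch below uses toy data and is not the thesis's evaluation code:

```python
def coverage(candidates, test_passwords):
    """Count how many test-set passwords appear among the candidates,
    and the fraction of the test set they represent."""
    hits = set(candidates) & set(test_passwords)
    return len(hits), len(hits) / len(test_passwords)

# Toy data; the real experiment used 189091 candidates against 33481 test passwords.
test_set   = ['apple1', 'tiger99', 'rare!pw']
candidates = ['apple1', 'apple2', 'tiger99']
print(coverage(candidates, test_set))  # (2, 0.6666666666666666)
```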
Figure 5.1 displays the result of the comparison. John the Ripper was able to generate 5093 of
the 33481 test passwords, or 15.21%. Our probable password generator was able to generate 7286,
or 21.75%, of the passwords in the test set. Figure 5.1 was generated by taking the number of
passwords found at intervals of 10000 password attempts.

Figure 5.1: SG vs. John the Ripper (Training Set Dictionary)

From the figure, it is clear that our probable password generator outperforms John the Ripper
from the beginning, and it showed further increases in passwords found toward the end of the
experiment.
Although our generated passwords outperformed John the Ripper, John the Ripper was not
necessarily designed to use an input dictionary in the same manner as our password generator. To
see whether John the Ripper would perform better with a different word list, we ran it using the
actual training password set as the input wordlist. John the Ripper generated 239284 candidate
passwords, but managed to generate only 1611, or 4.8%, of the passwords in the test set. This
shows that John the Ripper is more effective when using a conventional input wordlist, and
therefore that our password generator performs better overall in comparison.
We also compared John the Ripper and our password generator when using an English dictionary.
John the Ripper generated 197662 candidate passwords, and was able to generate only 659
passwords that are in the test password set. We then used our password generator to generate the
197662 most probable passwords, of which 830 were found in the test set. Figure 5.2 summarizes
the results of this experiment.

Figure 5.2: SG vs. John the Ripper (English Dictionary)

In Figure 5.2, John the Ripper performed better than our password generator within the first
100000 candidate passwords. However, John the Ripper's performance diminished quickly after that
point, while our password generator continued to generate passwords in the test set.
In the previous experiments, we chose input dictionaries relevant to the password test set. We
also wanted to compare our performance against John the Ripper when using an input dictionary
from a different domain. As MySpace® is a site directed towards younger computer users, a
dictionary of scientific terms provides a significantly different set of words from those
MySpace® users are likely to use when creating passwords. Figure 5.3 illustrates the results of
the experiment.

Figure 5.3: SG vs. John the Ripper (Scientific Dictionary)

John the Ripper generated 250335 candidate passwords and was able to generate only 18 of the
passwords found in the test set. We selected the 250335 most probable passwords generated using
our specialized grammar and slightly outperformed John the Ripper by generating 19 passwords
present in the test set. Although the total performance of our password generator was only
slightly better than John the Ripper's, we found our 19 passwords within the first 60000
candidates, while John the Ripper needed 240000 candidate passwords to find its 18.
Table 5.1 summarizes the performance improvements obtained by using our specialized grammar
compared against John the Ripper.

Table 5.1: Performance Improvements

  Input Dictionary    Performance Improvement (%)
  MySpace®                      43.1
  English                       25.9
  Scientific                     5.5

We receive the largest improvement in performance when using the MySpace® input dictionary, and
the smallest when using the scientific input dictionary. From these results, it is clear that
our performance improves as the input dictionary becomes more specific to the domain of the
passwords. Therefore, if we can determine the domain of a password, the chances of breaking it
improve greatly when using our specialized grammar.
5.2 Specialized Grammar and Lyrics Dictionary
In the previous section, we established that our password generator performs better than John
the Ripper. We also hypothesized in Chapter 4 that MySpace® users incorporate song lyrics in
their MySpace® passwords, and therefore that an input dictionary consisting of words found in
song lyrics would allow us to generate more passwords in the test set than any previously used
dictionary.

Figure 5.4: Comparison of the Training Set Dictionary and Lyric Input Dictionaries
We then used our password generator to generate passwords based on the lyrics input dictionary.
The 500,000 most probable passwords were chosen and tested against the test set; this value was
chosen as it is sufficiently large to reveal any trends. The results of the experiment are shown
in Figure 5.4. We were able to generate 2905 of the passwords found in the test set, far fewer
than the 9507 passwords generated using the input dictionary of words found in the password
training set. Although the lyrics input dictionary performed poorly compared to the training set
input dictionary, it performed substantially better than the English input dictionary. In the
previous section, when generating the 197662 most probable passwords based on the English input
dictionary, we were able to generate only 830 passwords; in Figure 5.4, we generated 2019
passwords after trying the 190000 most probable passwords. We can conclude that our lyrics input
dictionary contains more root words than the English input dictionary, and is therefore more
specific to the MySpace® domain. Although our lyrics dictionary did not perform as well as
anticipated, it demonstrated that more probable passwords can be generated by using a
domain-specific input dictionary.
CHAPTER 6
FUTURE RESEARCH
Through our comparisons with John the Ripper, we have established that our specialized grammar
can effectively generate more probable passwords than current password breaking technology.
Instead of applying common transformation and mutation patterns to dictionary words, we learned
from known passwords and used that knowledge to generate passwords. Although our results when
using the lyrics dictionary were not what we anticipated, we still feel that more research into
custom input dictionaries will result in the generation of more probable passwords. In this
research, we focused primarily on the structure of the passwords and not on the words used in
them. If we can better understand, from a user's perspective, the reasoning behind root word
selection, we may be better able to generate probable passwords.
Since our research focused only on the structure of passwords, there are no language dependencies
associated with this password generation strategy. Future research may include how effective this
strategy may be when applied to foreign languages. From a law enforcement or military perspective,
having a password breaking tool that can intelligently generate probable passwords in a foreign
language could be invaluable for national security.
Our probable password selection strategy was simple and straightforward. We simply selected the
most probable passwords, regardless of the password structure. For example, if we wanted to select
10,000 passwords to test and the first complete structure had generated over 10,000 passwords,
only the passwords matching the first complete structure would be selected. We chose this strategy
because it was the easiest to understand. Additional research can explore other password selection
strategies that use the statistical information of the simple and complete structures to select a more
diverse password test set.
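To make the discussion concrete, the following sketch (hypothetical code, not part of this thesis's implementation; all names are illustrative) selects the k most probable guesses from per-structure ranked lists, with an optional per-structure cap of the kind a more diverse selection strategy might use:

```python
import heapq

def select_guesses(structures, k, per_structure_cap=None):
    """structures: dict mapping a complete structure (e.g. "LLLLDD") to a
    list of (guess, probability) pairs sorted by descending probability.
    Returns up to k guesses in global probability order."""
    heap = []  # entries are (-prob, structure, index); largest prob pops first
    for s, guesses in structures.items():
        if guesses:
            heapq.heappush(heap, (-guesses[0][1], s, 0))
    taken = {s: 0 for s in structures}
    out = []
    while heap and len(out) < k:
        neg_p, s, i = heapq.heappop(heap)
        if per_structure_cap is None or taken[s] < per_structure_cap:
            out.append(structures[s][i][0])
            taken[s] += 1
        nxt = i + 1
        if nxt < len(structures[s]):  # advance this structure's cursor
            heapq.heappush(heap, (-structures[s][nxt][1], s, nxt))
    return out
```

Passing per_structure_cap bounds how many guesses any one complete structure may contribute, which is one simple way to obtain a more diverse password test set.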
The performance of our password generation program was not a primary concern throughout
this research. Additional research can explore the parallelization of our password generation
process. Also, our current implementation does not allow for the generation of the ith password.
Additional research into creating an indexing algorithm would allow the specific generation of the
ith password, which would allow us to generate rainbow tables based on our password generator.
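One possible direction for such an indexing algorithm (a hypothetical sketch, not the thesis's implementation) is mixed-radix unranking: if each chunk of a complete structure has a known, ordered candidate list, the ith concatenation can be computed directly without enumerating the earlier ones:

```python
def ith_password(i, chunks):
    """chunks: one ordered candidate list per structure chunk, e.g.
    [["pass", "word"], ["12", "99"], ["!", "$"]].
    Returns the i-th concatenation in mixed-radix order, or None if i
    is out of range."""
    total = 1
    for c in chunks:
        total *= len(c)
    if i >= total or i < 0:
        return None
    parts = []
    for c in reversed(chunks):          # least-significant chunk last
        i, r = divmod(i, len(c))        # r selects within this chunk
        parts.append(c[r])
    return "".join(reversed(parts))
```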
As this research focuses on the generation of probable passwords, another future research project
could be the development of a proactive password checker that is based on learned password
structures. A proactive password checker can educate users in strong password generation and
improve the security of a password-based authentication system.
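As a sketch of how such a checker might reuse learned structures (illustrative code only; the structure set and policy are assumptions, not part of this thesis), a candidate password can be reduced to its L/D/S structure and flagged if that structure is among the most frequently learned ones:

```python
def structure_of(password):
    """Map a password to its complete structure, e.g. "pass12!" -> "LLLLDDS"."""
    out = []
    for ch in password:
        if ch.isalpha():
            out.append("L")
        elif ch.isdigit():
            out.append("D")
        else:
            out.append("S")
    return "".join(out)

def is_weak(password, common_structures):
    """Flag a candidate whose structure matches a frequently learned one."""
    return structure_of(password) in common_structures
```

For example, with a hypothetical learned set {"LLLLLLDD"}, is_weak("secret99", ...) would flag the common letters-then-digits pattern.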
Overall, we have established that generating passwords based on a specialized grammar is an
effective strategy. Although these results are preliminary, there is enough evidence to warrant
further research.
APPENDIX A
COMPUTE GRAMMAR
This code is used to compute the grammar associated with known passwords and is described in
Section 3.4.1.
#!/usr/bin/python
import string, sys, os, re

def populate_symbols(total, symbol):
    symbol = [(v, k) for k, v in symbol.items()]
    symbol.sort()
    symbol.reverse()
    for count, word in symbol:
        file = "symbols/" + str(len(word)) + ".txt"
        out = open(file, 'a')
        line = "%s %f\n" % (word, float(count)/total)
        out.write(line)
        out.close()

def populate_numbers(total, numbers):
    numbers = [(v, k) for k, v in numbers.items()]
    numbers.sort()
    numbers.reverse()
    for count, word in numbers:
        file = "numbers/" + str(len(word)) + ".txt"
        out = open(file, 'a')
        line = "%s %f\n" % (word, float(count)/total)
        out.write(line)
        out.close()

def populate_complete_grammar(total, complete):
    complete = [(v, k) for k, v in complete.items()]
    complete.sort()
    complete.reverse()
    out = open("grammar/complete.txt", 'w')
    for count, word in complete:
        if len(word) < 17:
            line = "%s %f\n" % (word, float(count)/total)
            out.write(line)
    out.close()

def populate_simple_grammar(total, simple):
    simple = [(v, k) for k, v in simple.items()]
    simple.sort()
    simple.reverse()
    out = open("grammar/simple.txt", 'w')
    for count, word in simple:
        if len(word) < 17:
            line = "%s %f\n" % (word, float(count)/total)
            out.write(line)
    out.close()

def simplify_grammar(n):
    # collapse runs of identical symbols, e.g. "LLLDD" -> "LD"
    ret = ''
    char = ''
    for x in range(0, len(n)):
        if not (n[x].isspace()):
            if n[x] != char:
                ret += n[x]
                char = n[x]
    return ret

# this function will accept filename n
# then read in the file and
# 1. generate the complete grammar of the file
# 2. simple grammar of the file
# 3. stats associated with the numbers
# 4. stats associated with the symbols
def generate_complete_grammar(n):
    symbol = {}
    symbol_total = 0
    number = {}
    number_total = 0
    complete = {}
    complete_total = 0
    simple = {}
    simple_total = 0

    in_fd = open(n, 'r')
    for password in in_fd:
        password = string.rstrip(password)

        complete_temp = ''
        simple_temp = ''
        number_temp = ''
        symbol_temp = ''
        last_ch = ''
        for x in range(0, len(password)):
            if password[x].isalpha():
                complete_temp += "L"
                if last_ch == 'D':
                    # a digit run just ended; record it
                    if number_temp != '':
                        try:
                            number[number_temp] += 1
                        except:
                            number[number_temp] = 1
                        number_total += 1
                        number_temp = ''
                elif last_ch == 'S':
                    # a symbol run just ended; record it
                    if symbol_temp != '':
                        try:
                            symbol[symbol_temp] += 1
                        except:
                            symbol[symbol_temp] = 1
                        symbol_total += 1
                        symbol_temp = ''
                last_ch = 'L'
            elif password[x].isdigit():
                complete_temp += "D"
                if last_ch == 'S':
                    # a symbol run just ended; record it
                    if symbol_temp != '':
                        try:
                            symbol[symbol_temp] += 1
                        except:
                            symbol[symbol_temp] = 1
                        symbol_total += 1
                        symbol_temp = ''
                # append this digit to the current digit run
                number_temp += password[x]
                last_ch = 'D'
            else:
                complete_temp += "S"
                if last_ch == 'D':
                    # a digit run just ended; record it
                    if number_temp != '':
                        try:
                            number[number_temp] += 1
                        except:
                            number[number_temp] = 1
                        number_total += 1
                        number_temp = ''
                # append this symbol to the current symbol run
                symbol_temp += password[x]
                last_ch = 'S'

        # record any run still pending at the end of the password
        if number_temp != '':
            try:
                number[number_temp] += 1
            except:
                number[number_temp] = 1
            number_total += 1
        if symbol_temp != '':
            try:
                symbol[symbol_temp] += 1
            except:
                symbol[symbol_temp] = 1
            symbol_total += 1

        try:
            complete[complete_temp] += 1
        except:
            complete[complete_temp] = 1
        complete_total += 1

        simple_temp = simplify_grammar(complete_temp)
        try:
            simple[simple_temp] += 1
        except:
            simple[simple_temp] = 1
        simple_total += 1
    in_fd.close()

    # finished computing stats
    populate_simple_grammar(simple_total, simple)
    populate_complete_grammar(complete_total, complete)
    populate_numbers(number_total, number)
    populate_symbols(symbol_total, symbol)

if __name__ == '__main__':
    generate_complete_grammar(sys.argv[1])
APPENDIX B
PASSWORD GENERATION
This code, described in Section 3.4.2, generates passwords based on a specialized grammar.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>
#include <math.h>
#include <popt.h>

#define SIMPLE "grammar/simple.txt"
#define COMPLETE "grammar/complete.txt"
#define MAXLEN 60
#define MAXFILE 32
#define FORMAT1 "%s"
#define FORMAT2 "%s %f"
int compare(char *complex, char *simple);
int generate_passwords(char *simple, float prob);
int process_complete_grammars(char *simple);
int get_number_of_chunks(char *word);
int breakdown(char *structure, float prob, int cycle, char *pathbase, int final_value);
int populate_words(char *file);
int scrub(void);

int line_cntr=0;     /* counts the current complete grammar being generated */
int complete_cntr=0; /* total number of grammars being generated */

static float threshold=-1.0; /* threshold obtained from command line arguments */
static int verbose=0;        /* verbosity obtained from command line arguments */
static int np=0;             /* number of processes */
static int format=0;
int main(int argc, char *argv[]) {
    FILE *s_file, *c_file; /* pointers to the grammar files */
    char *dictionary;
    int rc;
    char buf[MAXLEN], buf2[MAXLEN];
    float prob, dummy; /* holds the probability of a particular value */
    struct timeval tv; /* for timing purposes */
    time_t start_time, end_time;
    int total_time;
    static int populate;
    static int clean;

    int ch;              /* used to parse command line options */
    poptContext opt_con; /* context for parsing command-line options */

    static struct poptOption options_table[] = {
        { "clean",     'c', POPT_ARG_NONE,  &clean,
          'c', "clean the words/ and tmp/ directories", NULL},
        { "populate",  'p', POPT_ARG_NONE,  &populate,
          'p', "only populate the words from the input dictionary", NULL},
        { "procs",     'n', POPT_ARG_INT,   &np,
          'n', "number of processes to create", "val>1"},
        { "threshold", 't', POPT_ARG_FLOAT, &threshold,
          't', "set minimum threshold", "val"},
        { "verbose",   'v', POPT_ARG_NONE,  &verbose,
          'v', "enable verbose logging", NULL},
        { "format",     0,  POPT_ARG_NONE,  &format,
           0,  "input file in the form: <word> <probability>", NULL},
        POPT_AUTOHELP
        { NULL, 0, 0, NULL, 0 } /* end-of-list terminator */
    };

    opt_con = poptGetContext(NULL, argc, (const char **)argv, options_table, 0);
    poptSetOtherOptionHelp(opt_con, "[OPTIONS]* <input dictionary>");

    /* process the command line options */
    while((ch = poptGetNextOpt(opt_con)) >= 0) {
        switch(ch) {
            case '?': poptPrintHelp(opt_con, stderr, 0); break;
        }
    }
    if (ch < -1) { /* if there was an error with the command line
                      arguments that was missed, it would be caught here */
        poptPrintHelp(opt_con, stderr, 0);
        return 0;
    }
    if ((np) && (np < 2)) { /* error checking with the number of processes */
        poptPrintHelp(opt_con, stderr, 0);
        return 0;
    }
    dictionary = (char *)poptGetArg(opt_con); /* get the input file */
    if ((dictionary == NULL) && (!clean)) {
        poptPrintHelp(opt_con, stderr, 0);
        return 0;
    }

    /* begin timing */
    gettimeofday(&tv, NULL);
    start_time = tv.tv_sec;

    if (clean) {
        system("/bin/rm -f ./words/*");
        system("/bin/rm -f ./tmp/*");
        printf("Cleaned words/ and tmp/ directories.\n");
        if (dictionary == NULL) {
            printf("No input dictionary selected - exiting.\n");
            return 0;
        }
    }

    printf("Populating words from input dictionary...\n");
    if ((rc = populate_words(dictionary)) < 0) {
        exit(-1);
    }
    printf("Input dictionary contained %d words.\n", rc);
    if (populate) {
        gettimeofday(&tv, NULL);
        end_time = tv.tv_sec;
        total_time = end_time - start_time;
        printf("Total population time: %1ds\n", total_time);
        return 0;
    }

    if (access(SIMPLE, R_OK) < 0) {
        printf("Error: cannot access %s\n", SIMPLE);
        printf("Need to generate grammars\n");
        exit(-1);
    }
    if (access(COMPLETE, R_OK) < 0) {
        printf("Error: cannot access %s\n", COMPLETE);
        printf("Need to generate grammars\n");
        exit(-1);
    }

    /* count how many complete grammars will be analyzed */
    s_file = fopen(SIMPLE, "r");
    while(fscanf(s_file, "%s %f", buf, &dummy) > 0) {
        c_file = fopen(COMPLETE, "r");
        while(fscanf(c_file, "%s %f", buf2, &dummy) > 0) {
            if (compare(buf2, buf))
                complete_cntr++;
        }
        fclose(c_file);
    }
    fclose(s_file);
    printf("Complete grammars to be analyzed: %d\n", complete_cntr);
    printf("Generating passwords (threshold=%.20f)...\n", threshold);

    s_file = fopen(SIMPLE, "r");
    /* To examine just one simple structure, set buf = "LDS"
       (or whatever simple structure you want) and take out
       the loop */
    while(fscanf(s_file, "%s %f", buf, &prob) > 0) {
        rc = generate_passwords(buf, prob);
    }
    fclose(s_file);

    gettimeofday(&tv, NULL);
    end_time = tv.tv_sec;

    /* need to clean up the tmp/ directory */
    printf("********************************\n");
    printf("Password generation complete.\n\nCleaning up...\n");
    scrub();
    printf("\nAll generated passwords are located in the tmp/ directory.\n");
    total_time = end_time - start_time;
    printf("Total generation time: %1d seconds\n", total_time);

    return 0;
}
int scrub() {
    FILE *in_fd;
    char path[MAXFILE];
    char structure[MAXLEN];
    float temp;
    int num_chunks;
    int i;

    in_fd = fopen(SIMPLE, "r");
    while(fscanf(in_fd, "%s %f", structure, &temp) > 0) {
        sprintf(path, "tmp/%s.txt", structure);
        if (unlink(path) < 0) {
            /* printf("unlink: could not remove %s\n", path); */
        }
    }
    fclose(in_fd);

    in_fd = fopen(COMPLETE, "r");
    while(fscanf(in_fd, "%s %f", structure, &temp) > 0) {
        i = 1;
        num_chunks = get_number_of_chunks(structure);
        while(i < num_chunks) {
            sprintf(path, "tmp/%s.txt.%d", structure, i);
            if (unlink(path) < 0) {
                /* printf("unlink: could not remove %s\n", path); */
            }
            i++;
        }
    }
    fclose(in_fd);

    return 0;
}
/* Take a complex complete structure and break it down
 * into its simplified form
 */
int compare(char *complex, char *simple) {
    char buf[MAXLEN];
    char last=0;
    int pos=0, x;

    for(x=0; x<strlen(complex); x++) {
        if (last != complex[x]) {
            buf[pos] = complex[x];
            last = complex[x];
            pos++;
        }
    }
    buf[pos] = '\0';
    if (strcmp(buf, simple) == 0)
        return 1;
    else
        return 0;
}
/* generate all the passwords based on the simple structure;
 * output passwords to tmp/ files
 */
int generate_passwords(char *simple, float prob) {
    FILE *c_file; /* fd to grammar/complete.txt */
    FILE *tmp_file;
    float p, in_p;
    char buf[MAXLEN];
    char f[MAXFILE];

    sprintf(f, "tmp/%s.txt", simple);
    printf("Generating %s file to hold intermediate results...\n", f);
    tmp_file = fopen(f, "w");
    c_file = fopen(COMPLETE, "r");
    while(fscanf(c_file, "%s %f", buf, &in_p) > 0) {
        if (compare(buf, simple)) {
            /* the complete grammar matches the simplified grammar */
            p = in_p * prob;
            fprintf(tmp_file, "%s %.20f\n", buf, p);
        }
    }
    fclose(c_file);
    fclose(tmp_file);

    process_complete_grammars(simple);
    return 0;
}

/* examine the tmp/ file to start generating the words */
int process_complete_grammars(char *simple) {
    FILE *in;
    char file[MAXFILE];
    char structure[MAXLEN];
    float prob;
    int np_cntr=0; /* counts the current number of processes running */
    int pid, rc, stat;

    sprintf(file, "tmp/%s.txt", simple);
    in = fopen(file, "r");
    while(fscanf(in, "%s %f", structure, &prob) > 0) {
        /* if(verbose) */
        printf("%d: Processing complete structure %s --> %s\n", line_cntr++, structure, simple);
        if (np) { /* concurrent processing */
            if (np_cntr >= np) {
                rc = waitpid(-1, &stat, 0);
                if (verbose)
                    printf("%d: child process reaped.\n", rc);
                np_cntr--;
            }
            if (np_cntr < np) {
                if ((pid = fork()) < 0) {
                    perror("fork");
                    exit(-1);
                }
                if (pid == 0) {
                    if (verbose)
                        printf("%d: child process analyzing %s.\n", getpid(), structure);
                    breakdown(structure, prob, 0, structure, 0);
                    exit(0);
                } else {
                    np_cntr++;
                }
            }
        } else { /* single process only */
            breakdown(structure, prob, 0, structure, 0);
        }
    }
    fclose(in);

    while(np_cntr > 0) {
        rc = waitpid(-1, &stat, 0);
        if (verbose)
            printf("%d: child process reaped.\n", rc);
        np_cntr--;
    }

    return 0;
}
int get_number_of_chunks(char *word) {
    int count=1;
    char c;
    int i=0;

    if (word == NULL)
        return 0;
    c = word[0];
    while(i < strlen(word)) {
        if (c != word[i]) {
            count++;
            c = word[i];
        }
        i++;
    }
    return count;
}

int get_chunk_size(char *word) {
    int count=0;
    char c;

    if (word == NULL)
        return 0;
    c = word[0];
    while(c == word[count])
        count++;
    return count;
}
int breakdown(char *structure, float prob, int cycle, char *pathbase, int final_value) {
    int chunk=0;
    char symbol;
    char in[MAXFILE];
    char out[MAXFILE];
    char data[MAXFILE];
    char word[MAXLEN];
    FILE *in_fd, *out_fd, *data_fd;
    float p;

    sprintf(in, "tmp/%s.txt.%d", pathbase, cycle);
    sprintf(out, "tmp/%s.txt.%d", pathbase, cycle+1);

    chunk = get_chunk_size(structure);
    symbol = structure[0];

    if (cycle == 0) { /* have no previous input to be created */
        switch(symbol) {
            case 'L': sprintf(data, "words/%d.txt", chunk); break;
            case 'D': sprintf(data, "numbers/%d.txt", chunk); break;
            case 'S': sprintf(data, "symbols/%d.txt", chunk); break;
        }
        if (access(data, R_OK) < 0) {
            printf("error: %s not available.\n", data);
            return 0;
        }
        out_fd = fopen(out, "w");
        data_fd = fopen(data, "r");
        while(fscanf(data_fd, "%s %f", word, &p) > 0) {
            fprintf(out_fd, "%s %.20f\n", word, p*prob);
        }
        fclose(out_fd);
        fclose(data_fd);
    } else { /* need to incorporate previous results */
        char buf[MAXLEN];
        float p2;

        switch(symbol) {
            case 'L': sprintf(data, "words/%d.txt", chunk); break;
            case 'D': sprintf(data, "numbers/%d.txt", chunk); break;
            case 'S': sprintf(data, "symbols/%d.txt", chunk); break;
        }
        if (access(data, R_OK) < 0) {
            printf("error: %s not available.\n", data);
            return 0;
        }
        in_fd = fopen(in, "r");
        out_fd = fopen(out, "w");
        while(fscanf(in_fd, "%s %f", buf, &p2) > 0) {
            struct stat s;

            data_fd = fopen(data, "r");
            if (stat(out, &s) < 0) {
                fprintf(stderr, "stat");
            }
            if (s.st_size > 2100000000) { /* output file is close to the 2GB limit */
                fclose(out_fd);
                printf("\tFile %s is close to the 2GB limit\n", out);
                sprintf(out, "tmp/%s.txt.%d.%d", pathbase, cycle+1, final_value++);
                printf("\tContinue and redirect output to %s\n", out);
                out_fd = fopen(out, "w");
            }
            while(fscanf(data_fd, "%s %f", word, &p) > 0) {
                if ((p2*p) >= threshold)
                    fprintf(out_fd, "%s%s %.20f\n", buf, word, p2*p);
            }
            fclose(data_fd);
        }
        fclose(in_fd); fclose(out_fd);
    }

    structure += chunk;
    if (strlen(structure) > 0)
        breakdown(structure, prob, ++cycle, pathbase, final_value);
    return 0;
}
/* take input file and populate the words directory
 * pathname -> input dictionary
 * returns -> number of words analyzed
 */
int populate_words(char *pathname) {
    FILE *in, *out;
    char buf[MAXLEN];
    char file_out[MAXFILE];
    int len, words=0, rc;
    float f;

    if (access(pathname, R_OK) < 0) { /* unable to access the input dictionary */
        printf("Error: cannot access %s\n", pathname);
        return -1;
    }

    /* can read input dictionary */
    in = fopen(pathname, "r");
    while(fscanf(in, "%s", buf) > 0) {
        /* break down each word by length */
        len = strlen(buf);

        if (format)
            rc = fscanf(in, "%f", &f);
        else
            f = 1.0;

        sprintf(file_out, "words/%d.txt", len);
        out = fopen(file_out, "a");
        fprintf(out, "%s %.20f\n", buf, f);
        fclose(out);

        words++;
    }
    fclose(in);
    return words;
}
BIOGRAPHICAL SKETCH
William J. Glodek
William J. Glodek completed his Bachelor's degree in Computer and Information Sciences at the
University of Delaware in the spring of 2006. Under the advisement of Prof. Sudhir Aggarwal
and Prof. Breno de Medeiros, he obtained his Master's degree in Information Security from the
Department of Computer Science at Florida State University in spring 2008. William's research
interests include general computer and network security.