Florida State University Libraries
Electronic Theses, Treatises and Dissertations
The Graduate School
2008

Using a Specialized Grammar to Generate Probable Passwords
William J. Glodek

Follow this and additional works at the FSU Digital Library. For more information, please contact [email protected]

FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES

USING A SPECIALIZED GRAMMAR TO GENERATE PROBABLE PASSWORDS

By
WILLIAM J. GLODEK

A Thesis submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Science

Degree Awarded: Spring Semester, 2008

The members of the Committee approve the Thesis of William J. Glodek defended on April 9, 2008.

Sudhir Aggarwal, Professor Co-Directing Thesis
Breno de Medeiros, Professor Co-Directing Thesis
Zhenhai Duan, Committee Member

Approved:
Dr. David Whalley, Chair, Department of Computer Science
Joseph Travis, Dean, College of Arts and Sciences

The Office of Graduate Studies has verified and approved the above named committee members.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

1. INTRODUCTION
2. CURRENT STATE OF PASSWORD BREAKING
   2.1 Well-Established Password Breaking Strategies & Techniques
   2.2 Improving Password Breaking By Using Markov Models
3. SPECIALIZED GRAMMAR AND PASSWORD GENERATION
   3.1 Motivation
   3.2 Understanding User Generated Passwords
   3.3 Developing a Specialized Context-Sensitive Grammar
   3.4 Implementing a Specialized Grammar to Generate Passwords
4. CREATING A CUSTOM INPUT DICTIONARY
   4.1 Extracting Song Lyrics
   4.2 Song Lyric Analysis
5. EXPERIMENTS AND RESULTS
   5.1 Specialized Grammar vs John the Ripper
   5.2 Specialized Grammar and Lyrics Dictionary
6. FUTURE RESEARCH
APPENDICES
A. COMPUTE GRAMMAR
B. PASSWORD GENERATION
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2.1 Normal Rainbow Table vs Markovian/FA Rainbow Table
3.1 Password Length Statistics
3.2 Password Basic Characteristic Statistics
3.3 Simplified Password Statistics
3.4 Complete Structure Statistics of Simple Structure LD
3.5 Digit Grouping Statistics
3.6 1-Digit Grouping Statistics
3.7 2-Digit Grouping Statistics
3.8 Symbol Grouping Statistics
3.9 1-Symbol Grouping Statistics
4.1 Frequency Analysis of Unique Words
4.2 Frequency Analysis of Restricted Words
5.1 Performance Improvements

LIST OF FIGURES

5.1 SG vs John the Ripper (Training Set Dictionary)
5.2 SG vs John the Ripper (English Dictionary)
5.3 SG vs John the Ripper (Scientific Dictionary)
5.4 Comparison of the Training Set Dictionary and Lyric Input Dictionaries

ABSTRACT

The most common method of preventing unauthorized access to digital information is the use of a password-based authentication system. The strength of such a system relies on a human's ability to generate a password that is memorable but not easily guessed. Brute force techniques can break passwords through exhaustive search, but this may take an infeasible amount of time. Dictionary attack techniques attempt to break passwords by applying common password construction patterns to standard dictionaries. These common strategies are often successful in breaking weak passwords, but as computer users become more educated in secure computing practices, these strategies may become less successful. We have developed a novel password breaking strategy that uses known passwords to develop a specialized grammar, which can be used to generate probable passwords.
The password generation process uses a probabilistic approach: we develop grammars that measure the likelihood of a password structure, and passwords are generated based on the probabilistic password structures. In this thesis, we describe the development and implementation of the specialized grammar and the generation of passwords. We also show that our probable password generation strategy is more effective than current password breaking utilities and provides a foundation for future research.

CHAPTER 1
INTRODUCTION

The use of secure cryptographic protocols to protect information has increased in recent years. The United States government utilizes cryptography to protect sensitive information and communications. Major corporations and businesses use cryptography to secure their customers' information, which can include credit card information and social security numbers. Additionally, the availability of commercial and open source encryption tools has given the average computer user the ability to protect their personal files. Secure encryption is necessary to shield information from unauthorized users.

The use of encryption is not limited to protecting legitimate information; it can also be used to "conceal illegitimate materials from law enforcement" [1]. It is the duty of law enforcement to protect the public from various threats; however, encryption may suppress information law enforcement could use to do so. For example, if a person is suspected of committing identity theft but all the evidence of the crime is contained within an encrypted file, law enforcement officials may not have enough evidence to prosecute and/or convict the suspect. Law enforcement therefore needs a way to decipher these types of files in order to protect the public.

The most common method of protecting information is through the use of a symmetric key cryptosystem.
In a symmetric key cryptosystem, both the "encryption and decryption processes are controlled by a cryptographic key" [2]. The most common cryptographic key is a password. The strength of a password lies in its entropy, which is a measure of the password's uncertainty. A password that consists of random letters, digits and symbols will have a higher entropy than a password that is just a simple English word. A weak link in most passwords is that they must be human-memorable; therefore most passwords have relatively low entropy.

In a forensic setting, obtaining passwords can be difficult. If an individual utilizes encryption to protect files, a law enforcement official can safely assume the individual is technologically literate and understands the importance of using a password that cannot be easily guessed. A password can be broken through brute force techniques, which enumerate all possible passwords. However, it may take an unreasonable amount of time to find the correct password. The focus of this thesis is to explore a more effective password search strategy, which uses a specialized grammar to generate probable passwords that are based on a specific input dictionary or corpus.

The next chapter explores the current state of the art in breaking passwords. Chapter 3 describes a specialized context-sensitive grammar and how it can be used for more efficient password search strategies. The preliminary research associated with selecting input dictionaries and corpuses is explained in Chapter 4. Chapter 5 details the performance of our probable password generator in various experiments. Future research is discussed in the final chapter.

CHAPTER 2
CURRENT STATE OF PASSWORD BREAKING

2.1 Well-Established Password Breaking Strategies & Techniques

In modern operating systems, passwords are not stored in plaintext but as hashed values computed from the plaintext password by a cryptographic hash function.
Breaking a password therefore consists of finding a plaintext password that corresponds to a known hashed value; the MD5 and SHA hash functions are the most common cryptographic hash functions. With respect to discovering passwords, the terms breaking and cracking are considered synonymous; for consistency, the process of discovering a password will be referred to as breaking a password.

It is important to note that the password key space is vital to the strength of a password-based authentication system. If the key space is small, all possible keys can be tried in a timely manner with modern computing technology. However, if a sufficiently large key space is used, the only feasible manner in which to recover the data is to know the password key. The research described in Chapter 2 and in this thesis focuses on password-based authentication systems that use large key spaces.

The process of breaking passwords has been studied for many years. The most elementary technique of password breaking is the guess-and-check method, where passwords are selected without any particular strategy. A natural extension of this method is the dictionary attack. The dictionary attack strategy is based on the assumption that passwords are built from single words that can be found in a dictionary; it can be strengthened further because common variations to passwords, such as appended numbers or symbols, are easily predicted. The brute force attack strategy attempts to generate all possible password combinations. The dictionary attack strategy attempts only the easily guessable passwords and can execute in a relatively short amount of time, but requires a large amount of memory to hold the dictionary. The brute force strategy generates all possible passwords and guarantees the password will be found, but its execution time may be unreasonable.
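A deliberately simplified sketch of a dictionary attack with appended digits: each candidate word is combined with a few common suffixes, hashed, and compared against the target digest. The word list, the suffix set and the use of MD5 here are illustrative assumptions, not any particular tool's rule set:

```python
import hashlib

def dictionary_attack(target_hash, words, suffixes=("", "1", "12", "123")):
    """Try each dictionary word, plus common digit suffixes, against a
    target MD5 digest. Returns the matching plaintext, or None if the
    search is exhausted."""
    for word in words:
        for suffix in suffixes:
            candidate = word + suffix
            if hashlib.md5(candidate.encode()).hexdigest() == target_hash:
                return candidate
    return None

# Example: recover "dragon1" from its MD5 digest.
target = hashlib.md5(b"dragon1").hexdigest()
print(dictionary_attack(target, ["monkey", "dragon", "shadow"]))  # dragon1
```

The nested loops make the trade-off concrete: the search space is only |dictionary| × |suffixes|, tiny compared to brute force, which is why the attack is fast but succeeds only on predictably constructed passwords.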
Popular implementations of these attack types are John the Ripper and Cain & Abel [3, 4].

An improvement to the standard dictionary and brute force attack strategies was developed by Philippe Oechslin in the paper "Making a Faster Cryptanalytic Time-Memory Trade-Off", which introduced rainbow tables [5]. Rainbow tables are comprised of chains that represent a sequence of plaintext passwords and their corresponding hash values. The benefit of using rainbow tables is that they cover a large percentage of the possible password space while maintaining an efficient lookup speed. Each chain in a rainbow table is constructed by starting with an arbitrary password, which is used as input to a hash function. A reduction function then maps the hash value to a new password. This process is repeated n times to create a chain of length n. The only values stored in the rainbow table are the initial password and the last computed hash. The size of the rainbow table depends on the chain length: a small chain length results in a relatively large rainbow table, while a large chain length results in a relatively smaller one. An in-depth analysis of rainbow tables is outside the scope of this thesis; however, the article "Password Cracking: Rainbow Tables Explained," written by Philippe Oechslin, explains the motivation and techniques used in efficiently creating and using rainbow tables [6].

The current state of password breaking is built upon these basic strategies and techniques. The following section details research that incorporates the dictionary attack strategy, the time-memory tradeoff and Markov models to break passwords.

2.2 Improving Password Breaking By Using Markov Models

A weakness associated with password-based authentication schemes is that the password must be human-memorable.
Researchers from the University of Texas at Austin, Arvind Narayanan and Vitaly Shmatikov, contend that as long as humans are required to memorize passwords, the passwords are vulnerable to "smart dictionary" attacks [7].

2.2.1 Motivation

Humans are often considered the weak link in security systems. A user who is uneducated in standard security practices may leave even the most advanced security system vulnerable to a determined attacker. Password-based authentication systems often allow users to create passwords without any restrictions on password length or make-up, which often results in passwords that are easily guessable. To strengthen password-based authentication systems, composition rules are sometimes enforced, which "require passwords to be drawn from certain regular languages, and require passwords to include digits and non-alphanumeric characters" [7]. The purpose of composition rules is to increase the complexity of user created passwords, which should theoretically make the passwords harder for an attacker to guess. Although composition rules increase the possible password search space, they do not ensure higher entropy, due to limitations introduced by humans. Narayanan and Shmatikov have concluded that even with the adoption of composition rules, human-memorable passwords will still be vulnerable to intelligent dictionary attacks [7].

The ease with which a password can be described, and hence guessed, is captured by its Kolmogorov complexity. The Kolmogorov complexity of a string "is the description length of the smallest Turing machine that outputs that string" [8]. A string that has a low Kolmogorov complexity, or K-complexity, can be represented as a Turing machine with relatively few states. One example of a long string with a low K-complexity is asasasasasasasasasasasasasasasas. That particular string is simply the word as repeated 16 times, which a Turing machine could easily produce [7].
However, calculating the K-complexity "of a string is uncomputable" as the "Turing machine model for representing strings is too powerful to be exploited effectively" [8].

The researchers proposed that a major contributor to the memorability of a password is its "phonetic similarity with the words in the user's native language" [7]. Therefore, Markov models could be used to capture this phonetic information based on the native language of the user. Additionally, simple finite automata can be used to filter out passwords, generated from the Markov models, that do not match well-known password creation rules.

2.2.2 Markov Modeling of Passwords

Markov modeling refers to the process of defining "a probability distribution over sequences of symbols" [7]. Markov modeling is commonplace in natural language processing, as Markov models readily capture well-known language patterns from data. The simplest statistical information that can be utilized by Markov modeling is the underlying letter frequency distribution of a particular language. For example, in the English language the letter E occurs 12.7% of the time, while the letter Q occurs only 0.1% of the time [9]. Therefore, when creating words based on the English language, the letter E should appear more often than the letter Q.

Narayanan and Shmatikov utilized statistical information about the English language with two Markov model approaches: zero-order and first-order. The zero-order Markov model independently considers each letter when generating passwords. The primary motivation for utilizing the zero-order Markov model is to mimic a "commonly used password generation [strategy]," where "the password is an acronym that is obtained by taking the first letter of each word in a sentence" [7]. The assumption made by the researchers was that the underlying letter frequencies associated with sentences in the English language could be captured in the zero-order Markov model.
However, not all passwords will be generated: only passwords whose probability is above a specified threshold will be generated. The probability of a generated password α is:

    P(α) = ∏_{x ∈ α} ν(x)    (2.1)

where the product runs over the individual characters x of the generated password α, and ν(x) is the probability of the character x [7]. A threshold, θ, defines which generated passwords are accepted and placed into the zero-order Markov model dictionary. The zero-order dictionary is:

    D_{ν,θ} = {α : P(α) ≥ θ}    (2.2)

The first-order Markov model generates the next letter of the password based upon the previously generated character. The motivation for the first-order Markov model is that users will generate passwords that are not present in the English dictionary, but are phonetically pronounceable. For example, the word "phonerneting" is not a dictionary word, but it can be spoken and will likely be remembered by a user. Since the probability of the current character is based upon the previously generated character, the equation representing the first-order Markov model differs from equation 2.1:

    P(x1 x2 ⋯ xn) = ν(x1) ∏_{i=1}^{n−1} ν(x_{i+1} | x_i)    (2.3)

where x1 x2 ⋯ xn is a password of length n and ν(x_{i+1} | x_i) is the probability of x_{i+1} occurring after x_i. Similarly to the zero-order model, a threshold, θ, is selected in the generation of the first-order dictionary. The first-order dictionary is:

    D_{ν,θ} = {x1 x2 ⋯ xn : P(x1 x2 ⋯ xn) ≥ θ}    (2.4)

By introducing a threshold, the password search space can be drastically reduced to include only the most plausible passwords, where a plausible password is one that is considered pronounceable. The example given in the research refers to passwords with a length of 8 characters.
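Under stated assumptions (a tiny toy training corpus and maximum-likelihood frequency estimates), equations 2.1 through 2.4 can be sketched in Python:

```python
from collections import Counter

def train(corpus_words):
    """Estimate zero-order letter frequencies nu(x) and first-order
    transition frequencies nu(y|x) from a list of training words."""
    single = Counter()
    pairs = Counter()
    for w in corpus_words:
        single.update(w)
        pairs.update(zip(w, w[1:]))
    total = sum(single.values())
    nu = {x: c / total for x, c in single.items()}
    # Approximate nu(y|x) as count(xy) / count(x).
    cond = {(x, y): c / single[x] for (x, y), c in pairs.items()}
    return nu, cond

def p_zero(word, nu):
    """Equation 2.1: product of independent character probabilities."""
    p = 1.0
    for x in word:
        p *= nu.get(x, 0.0)
    return p

def p_first(word, nu, cond):
    """Equation 2.3: nu(x1) times the product of nu(x_{i+1} | x_i)."""
    p = nu.get(word[0], 0.0)
    for x, y in zip(word, word[1:]):
        p *= cond.get((x, y), 0.0)
    return p

def markov_dictionary(candidates, nu, theta):
    """Equation 2.2: D = {alpha : P(alpha) >= theta}, zero-order model."""
    return [w for w in candidates if p_zero(w, nu) >= theta]
```

Training on a large English corpus and tuning θ reproduces the trade-off described next: a higher threshold shrinks the dictionary while retaining most of the plausible password space.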
Suppose a zero-order Markov model dictionary was created using equation 2.2 with θ = 0. Now suppose a θ was chosen that accepted only 14% of all generated passwords, and another zero-order dictionary was generated with it. Comparing the two dictionaries, the second contains 90% of all the plausible passwords generated when θ = 0. In other words, 14% of the generated passwords covers 90% of the plausible password space. By intelligently choosing θ, the researchers could maximize coverage of the plausible password space while minimizing computation time [7].

2.2.3 Filtering with Finite Automata

In the previous section, only lowercase letters were used to generate the plausible passwords. In reality, passwords are generated with a mix of uppercase and lowercase letters, numbers and symbols. To better represent realistic passwords, the researchers extended the alphabet to contain:

• 26 lowercase characters (a-z)
• 26 uppercase characters (A-Z)
• 10 numerals (0-9)
• 5 special characters (space, hyphen, underscore, period and comma)

The complete alphabet, a, contains 67 unique characters, which are used in the creation of the zero- and first-order Markov model dictionaries [7]. With an alphabet of 67 characters, as opposed to 26, the size of the generated dictionaries will increase exponentially unless intelligent filtering is used. However, the locations of the uppercase letters, numerals and special characters in user created passwords can be anticipated. For example, "the first character of an alphabetic sequence is far more likely to be capitalized than the others," and "in an alphanumeric password, all numerals are likely to be at the end" [7]. Many password breaking tools, such as [3] and [4], use common password patterns to generate passwords as transformations on standard dictionaries.
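Such patterns can be prototyped directly with regular expressions used as acceptance filters. In the sketch below, the two patterns are hypothetical stand-ins for the actual rule list in appendix A of [7] (an optionally capitalized word with trailing digits, and digits followed by a word):

```python
import re

# Hypothetical patterns in the spirit of the transformation rules the
# researchers modeled as finite automata (the real list is in [7], App. A).
PATTERNS = [
    re.compile(r"^[A-Z]?[a-z]+[0-9]*$"),  # e.g. Secret99
    re.compile(r"^[0-9]+[a-z]+$"),        # e.g. 12pass
]

def accepted(password):
    """Keep a generated password only if at least one automaton (here, a
    compiled regex) accepts it."""
    return any(p.match(password) for p in PATTERNS)

print([w for w in ["Secret99", "12pass", "pa$$word", "9x9x"] if accepted(w)])
# ['Secret99', '12pass']
```

Compiled regular expressions are themselves executed as finite automata by the regex engine, so this filter is a direct, small-scale analogue of the researchers' approach.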
By understanding these common password patterns, the researchers can decrease the size of their generated dictionaries by accepting only passwords that match the common patterns. The common password patterns can be expressed as regular expressions, which can in turn be "modeled by finite automata" [7]. A finite automaton accepts the strings of a regular language, which is defined by a regular expression. The researchers compiled a list of regular expressions based upon the transformation rules found in the popular password breaking tools. These regular expressions, which can be found in appendix A of [7], were used to create finite automata. The finite automata can then be used to filter out the generated passwords that do not match the most common password patterns. Therefore the dictionaries generated by the researchers can be expressed as "the set of [passwords] matching the Markovian filter and also accepted by at least one of the finite automata corresponding to the regular expressions" [7]. Mathematically, the zero-order dictionary is:

    D_{ν,θ,⟨Mi⟩} = {α : P(α) ≥ θ and ∃i : Mi accepts α}    (2.5)

where Mi is one of the finite automata. The first-order dictionary is defined similarly to equation 2.5. Overall, the generated dictionaries consist of only the most probable Markovian generated passwords that fit the common password patterns.

2.2.4 Indexing Generated Passwords

Another major aspect of their research was implementing an efficient indexing algorithm for use in a time-memory trade-off attack [7, 5]. The indexing algorithm enables the researchers to efficiently generate rainbow tables that utilize their Markovian/FA filtered dictionaries, by quickly generating the ith password in a given dictionary. The efficient generation of the ith password is critical to their research, as the researchers' primary goal was to demonstrate that Oechslin's time-memory trade-off could be further improved.
However, at this time, this aspect of their research does not directly apply to the research surrounding password breaking and our specialized context-sensitive grammar.

2.2.5 Experiment, Results and Conclusions

To test the effectiveness of their Markovian/FA generated dictionaries, the researchers obtained a list of 142 "real user passwords" from Passware (the operator of http://LostPassword.com). The researchers then generated dictionaries based on the regular expressions found in appendix A of [7], which were then incorporated into a rainbow table. The researchers also generated another rainbow table, using Rainbow Crack [10], that consisted only of lowercase alphanumeric passwords of 6 characters or less. Both rainbow tables, the hybrid rainbow table generated by the researchers and the Rainbow Crack generated table, were used to see how many of the 142 supplied passwords could be recovered.

Table 2.1 [7] illustrates the results of the experiment. The Category column describes the password groups, while the Count column gives the number of passwords in each category. The Rainbow column shows the number of passwords recovered using the standard rainbow table, and the Hybrid column shows the passwords recovered using the Markovian/FA filtered rainbow table.

Table 2.1: Normal Rainbow Table vs Markovian/FA Rainbow Table

The Markovian/FA filtered rainbow table performed very well at breaking passwords with lengths of at most five. However, as the length of the passwords increased, the performance of the Markovian/FA filtered rainbow table decreased. The Markovian/FA filtered rainbow table was not able to break any passwords of length greater than eight, or eight character passwords that began with a number or a symbol.
Overall, the Markovian/FA filtered rainbow table performed well, and it validated their hypothesis that password breaking can be improved by using Markov models based on the English language together with finite automata [7].

CHAPTER 3
SPECIALIZED GRAMMAR AND PASSWORD GENERATION

It is a fair assumption, like that of Narayanan and Shmatikov from the University of Texas at Austin, that human generated passwords are susceptible to an intelligent attack. In the previous chapter, Markov models and finite automata were used to intelligently construct more probable passwords, and this construction strategy was shown to be effective. However, a more effective strategy for generating highly probable passwords may be the use of a specialized grammar. The following sections describe research into the development of a specialized grammar and its application in password generation.

3.1 Motivation

Context-free grammars have long been used in the study of natural languages. Noam Chomsky introduced context-free grammars in his 1956 paper "Three Models for the Description of Language", where he explores the general structure of the English language [11]. The goal of his research was to develop a method to generate arbitrary sentences that model the English language. A general context-free grammar is formally defined as G = (V, Σ, R, S), where:

V - a finite set of non-terminal characters or values
Σ - a finite set of terminal characters or values
R - a finite set of production rules used to generate the grammar
S - the start variable.

Sentences are generated by applying the various production rules, which results in a string of terminal values that models the language described by the grammar. All sentences of a language are not equally likely; therefore, probabilities can be associated with each production rule to form a probabilistic context-free grammar.
Each sentence generated with a probabilistic context-free grammar has an associated numerical production probability, which is a measure of how likely that particular sentence is to be found in the language described by the grammar. The research surrounding probabilistic context-free grammars has largely been centered on the processing of natural languages; general probabilistic context-free grammars have not had many applications to real world problems. We would like to use probabilistic context-free grammars as motivation to understand user generated passwords and to use this learned knowledge to generate probable passwords.

3.2 Understanding User Generated Passwords

In his research, Chomsky identified two primary challenges when attempting to understand a language in general terms. The first challenge is discovering the revealing characteristics of the language that are used to construct sentences. The second challenge is then formulating a general theory about the language based on the revealing characteristics discovered [11]. As Chomsky's focus was on understanding the general structure of the English language, our focus is on understanding the general structure of user created passwords.

In order to begin to understand how users create passwords, it is necessary to understand the revealing characteristics of user created passwords. Although there are visual-based authentication systems, such as [12] or [13], the most popular method of authentication is based on users supplying a password via a keyboard. The use of a keyboard limits the user to passwords built from letters, digits and special characters. In [11], Chomsky breaks sentences down into their general phrases, such as noun phrases, verb phrases, etc., to generate grammatically correct sentences.
We break passwords down into three general categories: letters ([A-Z][a-z]), digits ([0-9]) and special characters (the remaining printable keyboard symbols, e.g. ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ ` { | }). Analyzing real world passwords enables us to determine how users incorporate these general categories into passwords.

In October of 2006, a MySpace® phishing attack was launched to obtain the email addresses and passwords of MySpace® users [14]. The phishing attack received major publicity, and subsequently, the list of the obtained passwords was released temporarily for public downloading. A member of Florida State University's Electronic Crime Investigative Technologies Laboratory research team obtained the password list and provided it to us for analysis [15]. The password corpus did not contain any reference to MySpace® usernames, and all users were notified to change their passwords. We feel the passwords are representative of real world user generated passwords.

Table 3.1 outlines the lengths of the obtained passwords.

Table 3.1: Password Length Statistics

    Password Length     Number of Passwords    Percentage
    less than 5                 494              0.7369
    5                           718              1.071
    6                          9946             14.8355
    7                         14827             22.116
    8                         15845             23.6344
    9                         11713             17.4711
    10                         9605             14.3268
    greater than 10            3894              5.8083
    Totals                    67042            100

The passwords that contain between six and ten characters make up 92.38% of all collected passwords. Table 3.2 outlines the basic characteristics of our password corpus.

Table 3.2: Password Basic Characteristic Statistics

    Password Characteristic                    Number of Passwords    Percentage
    Letters Only                                      5151             7.68324
    Digits Only                                        817             1.21864
    Special Characters Only                              9             0.0134244
    Alphanumeric                                     54212            80.8627
    Letters and Special Characters                    4950             7.38343
    Digits and Special Characters                        8             0.0119328
    Letters, Digits and Special Characters            1895             2.82659
    Totals                                           67042           100

Over 80% of all passwords are alphanumeric, consisting of only letters and digits.
From the two tables above, it is clear that most passwords are between six and ten symbols long and include both letters and digits. At this point, the basic structure of the passwords is known; however, it is necessary to analyze the passwords further in order to create a specialized grammar.

3.2.1 Detailed Structure of Passwords

One revealing characteristic of passwords is the specific locations of the letters, digits and special characters within the passwords. By better understanding where these symbols occur in real world passwords, we can create more probable passwords by mimicking the format of the known passwords. To collect statistical information on this characteristic, we transformed each password into a simplified version using only the symbols L, D and S for letters, digits and special characters, respectively. For example, the password "password10!!" would be transformed into "LDS", as it consists of letters, followed by digits, followed by special characters. The transformation process was performed on all the passwords and statistical information was collected. The statistical information for the fifteen most frequently occurring password structures can be found in Table 3.3. As expected from Table 3.2, the most frequent password structure is alphanumeric; more specifically, letters followed by digits. There are 181 unique simplified password structures contained within the compiled password list. The statistical information regarding simplified password structures gives us a more detailed look into how users incorporate the three symbol categories into passwords.

Another revealing characteristic is how many symbols are used in generating passwords of a specific structure. For example, consider the simple password structure LD. This structure appeared 45120 times in the password collection, with 107 unique complete structures.
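The simplification step described above can be sketched as follows (a minimal illustration, not the implementation used in this thesis):

```python
def simplify(password):
    """Map a password to its simplified structure: runs of letters become
    L, runs of digits become D, and runs of anything else (special
    characters) become S."""
    out = []
    for ch in password:
        cls = "L" if ch.isalpha() else "D" if ch.isdigit() else "S"
        if not out or out[-1] != cls:  # collapse a run into one symbol
            out.append(cls)
    return "".join(out)

print(simplify("password10!!"))  # LDS
print(simplify("10password1"))   # DLD
```

Running this over a corpus and tallying the results with a counter yields exactly the kind of frequency data reported in Table 3.3.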
A complete structure is the representation of a password in terms of the l, d and s symbols without any simplification, where a numerical subscript gives the length of each run of symbols. For example, the complete structure of the password "password10!!" is l8 d2 s2.

Table 3.3: Simplified Password Statistics

  Simplified Password Structure    Number of Passwords    Percentage
  LD                                      45120            67.3011
  L                                        5151             7.68324
  DL                                       4499             6.71072
  LS                                       3311             4.9387
  LDL                                      2863             4.27046
  LSL                                       998             1.48862
  D                                         817             1.21864
  LSD                                       566             0.844247
  DLD                                       526             0.784583
  LDLD                                      492             0.733868
  LDS                                       396             0.590675
  LDLDL                                     275             0.410191
  LSLD                                      185             0.275946
  SL                                        158             0.235673
  SLS                                       153             0.228215

Of the 107 unique complete structures of the simple structure LD, the complete structure l6 d1 appeared 4933 times, while the complete structure l1 d4 occurred only once. When generating passwords with the simple structure LD, it would be to our advantage to generate passwords of the complete structure l6 d1 before the complete structure l1 d4. Table 3.4 lists the ten most frequently occurring complete structures of the simple structure LD. From Table 3.4 it is clear that the most popular complete structure consists of six letters followed by one digit. The same complete structure statistics were recorded for each of the simplified structures.

Digits in Passwords

Alphanumeric passwords are the most popular, and an important characteristic of alphanumeric passwords is the selection of digits. Digits in passwords oftentimes represent something of significance, such as a birthday or a date. Popular password breaking tools simply take an input dictionary and append one or two digits, at random, to the end of each dictionary word in hopes of breaking the password. We analyze the digits in the passwords to improve our performance in selecting digits. The digits are not analyzed individually, but as contiguous groups of digits. For example, the password "10password1" has two digit groups: "10" and "1".
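The digit-group extraction just described can be sketched in a few lines (illustrative code, not the thesis implementation; the function name is ours):

```python
import re

# Extract the contiguous digit groups from a password, as described above.
def digit_groups(password):
    return re.findall(r"[0-9]+", password)

print(digit_groups("10password1"))  # ['10', '1']
print(digit_groups("abc123"))       # ['123']
```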
The digits are analyzed as groups in hopes of capturing the significance of each user's digit selections when constructing a password. The digit groups are categorized by their length. In the password collection, there are 16 unique digit group lengths, which occur a total of 59354 times. One-digit groups are the most frequent, occurring 42.42% of the time.

Table 3.4: Complete Structure Statistics of Simple Structure LD

  Complete Password Structure    Number of Passwords    Percentage
  l6 d1                                  4933             10.9331
  l6 d2                                  4402              9.75621
  l7 d1                                  4145              9.18661
  l8 d1                                  3540              7.84574
  l5 d2                                  3037              6.73094
  l9 d1                                  2743              6.07934
  l7 d2                                  2720              6.02837
  l5 d1                                  2547              5.64495
  l4 d2                                  2356              5.22163
  l8 d2                                  2271              5.03324

Table 3.5 describes the statistics of the remaining digit group lengths. Ninety percent of all digits are included in digit groupings of lengths between one and four.

Table 3.5: Digit Grouping Statistics

  Length of Digit Grouping    Number of Occurrences    Percentage of total
  1                                   25183                 42.4285
  2                                   18084                 30.468
  3                                    6107                 10.2891
  4                                    5182                  8.73067
  5                                    1227                  2.06726
  6                                    2088                  3.51788
  7                                     771                  1.29899
  8                                     436                  0.734576
  9                                     207                  0.348755
  >9                                     69                  0.116252

Each digit grouping was analyzed to determine which digits are most frequent in that particular grouping. For example, the most frequent digit in the one-digit grouping is "1", which occurs 50.78% of the time. Table 3.6 outlines the most frequent one-digit numbers, while Table 3.7 outlines the ten most frequent two-digit numbers. Similar statistics were recorded for each of the 16 unique digit groupings. An interesting point is that in the three-digit grouping, the number "123" occurs 21% of the time.

Table 3.6: 1-Digit Grouping Statistics

  1-Digit    Number of Occurrences    Percentage of total
  1                  12788                 50.7803
  2                   2789                 11.0749
  3                   2096                  8.32308
  4                   1708                  6.78235
  7                   1245                  4.94381
  5                   1039                  4.1258
  0                   1009                  4.00667
  6                    899                  3.56987
  8                    898                  3.5659
  9                    712                  2.8273
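The per-length digit-grouping statistics behind Tables 3.5 and 3.6 amount to counting each contiguous digit group in a table keyed by group length. A minimal sketch (our own code and names, not the thesis implementation):

```python
import re
from collections import Counter

# Count each contiguous digit group, bucketed by the length of the group.
def digit_grouping_stats(passwords):
    by_length = {}
    for pw in passwords:
        for group in re.findall(r"[0-9]+", pw):
            by_length.setdefault(len(group), Counter())[group] += 1
    return by_length

stats = digit_grouping_stats(["abc1", "abc123", "xyz1!", "q2w1"])
print(dict(stats[1]))  # {'1': 3, '2': 1}
print(dict(stats[3]))  # {'123': 1}
```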
By understanding which digits are more frequent in each digit grouping, we can generate more probable passwords by generating passwords with the more frequent digits.

Table 3.7: 2-Digit Grouping Statistics

  2-Digit    Number of Occurrences    Percentage of total
  12                  1084                  5.99425
  13                   771                  4.26344
  11                   747                  4.13072
  69                   734                  4.05884
  06                   595                  3.2902
  22                   567                  3.13537
  21                   538                  2.97501
  23                   533                  2.94736
  14                   481                  2.65981
  10                   467                  2.58239

Special Characters in Passwords

Special characters are also important in password generation. The same process that was detailed above for digits was repeated for special characters. There are 213 unique special character sequences, which occur a total of 7260 times. Table 3.8 details the frequencies of each special character grouping length. It is clear that groupings of one special character occur far more frequently than any other.

Table 3.8: Symbol Grouping Statistics

  Length of Special Character Groupings    Number of Occurrences    Percentage of total
  1                                                6566                 90.4408
  2                                                 512                  7.05234
  3                                                 125                  1.72176
  4                                                  29                  0.399449
  5                                                  11                  0.151515
  6                                                   8                  0.110193
  9                                                   4                  0.0550964
  8                                                   3                  0.0413223
  7                                                   1                  0.0137741
  10                                                  1                  0.0137741

Table 3.9 details the most frequent special characters occurring in the one special character grouping. The same statistics were collected and recorded for the other special character groupings.

Table 3.9: 1-Symbol Grouping Statistics

  1-Symbol    Number of Occurrences    Percentage of total
  !                   2047                 31.1758
  .                   1377                 20.9717
  *                    510                  7.76729
  #                    492                  7.49315
  @                    473                  7.20378
  $                    255                  3.88364
  ?                    234                  3.56381
  &                    151                  2.29973
  —                    131                  1.99513
  —                    121                  1.84283

3.3 Developing a Specialized Context-Sensitive Grammar

In 3.2.1, the password collection was thoroughly analyzed. Each password was transformed into its corresponding simple and complete structures and their frequencies recorded. Frequencies of the digit and special character groupings were also collected and recorded. At this point, we have a thorough understanding of the basic syntax users employ to generate passwords.
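The structural transformations summarized above can be sketched end to end: a password's complete structure first, then its simplification (our own code and names, with the complete structure written inline, e.g. "l8d2s2", rather than with subscripts):

```python
from itertools import groupby

# Sketch: compute a password's complete structure ("l8d2s2") and
# simplify it back to the L/D/S form ("LDS").
def complete_structure(password):
    def category(c):
        return "l" if c.isalpha() else ("d" if c.isdigit() else "s")
    return "".join("%s%d" % (cat, len(list(run)))
                   for cat, run in groupby(password, key=category))

def simplify(complete):
    # keep only the category letters, uppercased: "l8d2s2" -> "LDS"
    return "".join(c.upper() for c in complete if c.isalpha())

print(complete_structure("password10!!"))            # l8d2s2
print(simplify(complete_structure("password10!!")))  # LDS
```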
It is our goal to apply the learned syntax and frequencies to form a specialized grammar that will allow us to generate the most probable passwords. While context-free grammars are used to process complex natural languages, such as English, we have seen that passwords are simpler, and can therefore be modeled by a specialized grammar. The specialized grammar is described similarly to a general probabilistic context-free grammar, but with some key differences. The finite set of non-terminal characters consists of the simple and complete structures and their corresponding probabilities defined in 3.2.1. The terminal characters are broken into three distinct groups: words, digits and special characters; the make-up of these groups is explained below. The start variable remains the same and has a probability of 1.0. The production rules of the specialized grammar vary greatly from those of a general probabilistic context-free grammar, as the locations of the words, digits and special characters affect which production rules can be selected.

The specialized grammar generates passwords based on password structures created using production rules. From the start variable, only non-terminal values can be produced. More specifically, only the simple structures, a sample of which can be found in Table 3.3, can be produced from the start variable. For example, if the simple structure LD was produced from the start variable, the resulting generated password would be alphanumeric, with letters followed by digits. Once a simple structure has been selected, the only values that can be produced are the complete structures, which are also non-terminal values. The production rules that map a simple structure to a complete structure are dependent upon the chosen simple structure: only complete structures that simplify to the previously produced simple structure may be selected.
For instance, if the simple structure LD was selected, a valid production may yield the complete structure l4 d3, but not the complete structure d2 l4. The probabilities of every production rule selected in a derivation are multiplied together and recorded. After a complete structure is selected, only terminal values can be selected. The terminal characters fall into three distinct groups: words, digits and special characters. The members of the digit and special character groups are each associated with a probability. The length of each group is dependent upon the lengths of the letter, digit and special character groupings in the complete structure non-terminal value. For example, consider the complete structure l4 d2 s1: the resulting generated password would consist of a 4-letter word followed by a 2-digit number and one special character. We then substitute all digits and special characters into the complete structure, which results in a password structure that has specific digits and special characters, but still general word groups. At this point, there are specific complete structures with specific production probabilities; the one with the highest production probability represents the most probable password structure. We then substitute words into the specific complete structures to generate passwords.

The members of the digit and special character terminal groups are obtained by extracting the digit and special character groupings during the analysis of the simple and complete password structures. The word terminal group is derived from an input dictionary, which is generated independently from sources other than the passwords. The primary purpose of having a customizable input dictionary is to allow the generation of passwords within a specific domain. For example, if a password is suspected to be constructed using scientific terms, it would be advantageous to generate passwords using an input dictionary of scientific terms.
The ability to use a customizable input dictionary is available in all modern password breaking tools. Consider the following specialized grammar:

V = non-terminals
    simple structure:    {(LDS, 1.0)}
    complete structures: {(l3 d2 s1, 0.75), (l3 d1 s1, 0.25)}

Σ = terminal groups
    words:   {cat, dog, at}
    digits:  {(99, 0.9), (00, 0.1), (1, 1.0)}
    symbols: {(!, 1.0)}

R = list of production rules
    Λ   –> LDS       1.0
    LDS –> l3 d2 s1  0.75
    LDS –> l3 d1 s1  0.25

Λ = (Λ, 1.0)

We want to generate all possible passwords within this specialized grammar. We start with the start symbol, which has a probability of 1.0. The only available production rule from the start symbol is the simple non-terminal structure LDS, which occurs 100% of the time in this specialized grammar. From the simple structure non-terminal value there are two possible production rules. We choose the first complete structure non-terminal value, l3 d2 s1, which occurs 75% of the time within this specialized grammar. The complete structure l3 d2 s1 contains a 3-letter word, a 2-digit number and 1 special character. We then apply the digit and special character groups to generate the specific complete structures. The following list outlines the creation of the password structures:

Λ –> LDS –> l3 d2 s1 –> l3 99 s1 –> l3 99!    1.0 * 1.0 * 0.75 * 0.9 * 1.0 = 0.675
Λ –> LDS –> l3 d2 s1 –> l3 00 s1 –> l3 00!    1.0 * 1.0 * 0.75 * 0.1 * 1.0 = 0.075
Λ –> LDS –> l3 d1 s1 –> l3 1 s1  –> l3 1!     1.0 * 1.0 * 0.25 * 1.0 * 1.0 = 0.25

Three password structures are generated: (l3 99!, 0.675), (l3 00!, 0.075) and (l3 1!, 0.25). The most probable password structure is l3 99!, while the least probable is l3 00!. The words are then substituted into the password structures, which results in the following passwords being generated: cat99!, dog99!, cat1!, dog1!, cat00! and dog00!. Notice that the word "at" was not used because no letter groups of length 2 appear in the password structures.
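The derivation above can be reproduced mechanically. In the sketch below, the encoding of the grammar as dictionaries is our own illustration (not the thesis data structures), and complete structures are written inline, e.g. "l3d2s1":

```python
import re
import itertools

# the example grammar, encoded for illustration
simple = {"LDS": 1.0}
complete = {"LDS": [("l3d2s1", 0.75), ("l3d1s1", 0.25)]}
digits = {2: [("99", 0.9), ("00", 0.1)], 1: [("1", 1.0)]}
symbols = {1: [("!", 1.0)]}
words = {3: ["cat", "dog"], 2: ["at"]}

def password_structures():
    out = []
    for simple_struct, p_simple in simple.items():
        for struct, p_struct in complete[simple_struct]:
            # split "l3d2s1" into [('l', 3), ('d', 2), ('s', 1)]
            parts = [(c, int(n)) for c, n in re.findall(r"([lds])(\d+)", struct)]
            choices = []
            for cat, n in parts:
                if cat == "l":
                    choices.append([("l%d" % n, 1.0)])  # words filled in later
                elif cat == "d":
                    choices.append(digits[n])
                else:
                    choices.append(symbols[n])
            # substitute every combination of digit/symbol terminals
            for combo in itertools.product(*choices):
                p = p_simple * p_struct
                for _, pc in combo:
                    p *= pc
                out.append(("".join(t for t, _ in combo), p))
    return sorted(out, key=lambda x: -x[1])

structures = password_structures()
print([s for s, _ in structures])  # ['l399!', 'l31!', 'l300!']

# substitute the 3-letter words to obtain the final passwords
passwords = [s.replace("l3", w) for s, _ in structures for w in words[3]]
print(passwords)  # ['cat99!', 'dog99!', 'cat1!', 'dog1!', 'cat00!', 'dog00!']
```

The three structure probabilities come out to 0.675, 0.25 and 0.075, matching the derivation above.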
The specialized grammar in this example is relatively small, but it illustrates the password structure generation process and the actual password generation. Section 3.4 explains the password generation process in depth.

3.4 Implementing a Specialized Grammar to Generate Passwords

There are two phases involved in generating passwords using a specialized grammar. The first phase involves training the specialized grammar from a known corpus of passwords. The second phase is the actual password generation, which was outlined in Section 3.3. Before the implementation is described, it is necessary to understand the directory structure associated with the password generation. The grammar/ directory holds the pertinent information of the specialized grammar. The words/ directory houses all the words included within a user-supplied input dictionary. The numbers/ directory maintains all the terminal digit groups, while the symbols/ directory maintains all the terminal special character groups.

3.4.1 Training a Specialized Grammar

A training period is needed to initialize the specialized grammar's simple/complete non-terminal values and the digit/special character terminal groups. After the training period, the grammar/, numbers/ and symbols/ directories will be populated with the necessary information to begin generating probable passwords. The training program is called compute_grammar.py and the actual implementation can be found in Appendix A. The program is written in the Python programming language, as Python easily handles file and string manipulation. The training program accepts one command line argument: the file name of the known password corpus. Before the analysis of the known passwords begins, four hash tables are initialized to track the frequencies of the simple structures, complete structures, digit groups and special character groups. The known password file is then opened and each password is analyzed individually.
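The per-password analysis, described in detail below, can be condensed into a single pass that fills the four frequency tables. This is our own sketch, not the Appendix A program; the file output into the grammar/, numbers/ and symbols/ directories is omitted:

```python
import re
from collections import Counter

# One regex splits a password into runs of letters, digits and special
# characters; the four tables are filled as each run is classified.
def train(passwords):
    simple, complete, digit_groups, symbol_groups = (
        Counter(), Counter(), Counter(), Counter())
    for pw in passwords:
        comp = ""
        for m in re.finditer(r"[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9]+", pw):
            run = m.group()
            if run.isalpha():
                cat = "l"
            elif run.isdigit():
                cat = "d"
                digit_groups[run] += 1
            else:
                cat = "s"
                symbol_groups[run] += 1
            comp += "%s%d" % (cat, len(run))
        complete[comp] += 1
        # simplification keeps only the category letters, uppercased
        simple["".join(ch.upper() for ch in comp if ch.isalpha())] += 1
    return simple, complete, digit_groups, symbol_groups

simple, complete, digits, symbols = train(["password10!!", "cat1"])
print(dict(simple))    # {'LDS': 1, 'LD': 1}
print(dict(complete))  # {'l8d2s2': 1, 'l3d1': 1}
```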
Each password is transformed into its complete structure. During the transformation process, each group of contiguous digits or special characters is extracted and placed in the appropriate frequency hash table. Once the complete structure of the password is formed, it is placed into the frequency hash table of complete structures. The complete structure is then simplified and the resulting structure is inserted into the frequency hash table of simple structures. This transformation and extraction process is performed on each password in the known password corpus. After the processing is completed, the four hash tables are filled with the frequencies of each unique structure or grouping found in the known passwords. All of the hash tables are then sorted and each member's probability is calculated. The simple and complete structure hash tables are stored within the grammar/ directory in the files simple.txt and complete.txt, respectively. The digit group hash table is stored within the numbers/ directory, not as one large file, but as separate files of numbers of the same length. For example, the number "99" has a length of two, therefore it is stored in the file numbers/2.txt, which holds all two-digit numbers found in the known password corpus. The symbol group hash table is stored in the same manner within the symbols/ directory. The specialized context-sensitive grammar is now populated and password generation can begin.

3.4.2 Probable Password Generation Implementation

The password generator, gen_passwords.c, is implemented in C and can be found in Appendix B. Before the password generation can begin, a user must supply an input dictionary to be processed. An input dictionary is a file that contains a collection of words, with one word per line. Each word in the input dictionary can also have an associated weight value, which is a measure of how likely that particular word is to appear.
The input dictionary is processed much like the digit and special character groups described in 3.4.1. All the words of the same length are placed into a file within the words/ directory. For example, if the word "apple" is present in the input dictionary, it is placed in the words/5.txt file along with all the other 5-letter words. Once the words/ directory has been populated, the probable password generation can be started.

One simple structure is processed at a time, and all complete structures that simplify to that simple structure are processed before another simple structure can be selected. When a complete structure is chosen, each distinct word, digit and special character grouping is processed from left to right. The length of each grouping determines the file to be opened and read from to generate each portion of the password. For example, if the complete structure being processed has the form l5 d3 s1, the word grouping is processed first. Since the word grouping has 5 letters, the file words/5.txt is opened and all 5-letter words are placed into passwords that match the complete structure. Next, the digit grouping is processed, which appends all 3-digit numbers to the end of all the 5-letter words. Finally, all special characters in symbols/1.txt are appended to all previously generated passwords. During the password construction process, a probability is recorded for each individual password, as described in 3.3. A threshold value can be established through a command line option, which forces all generated passwords to be above a specific probability. As each grouping is processed, if the password's probability falls below the threshold, the password is discarded. The password generator also allows complete structures to be processed concurrently, which can be selected through a command line option.
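A minimal sketch of generating candidates for one complete structure with threshold pruning (our own code; the real generator reads the per-length files such as words/5.txt, numbers/3.txt and symbols/1.txt, and prunes as each grouping is appended, whereas this sketch checks the final product; the example terminals and probabilities are illustrative only):

```python
import itertools

# Generate (password, probability) pairs for one complete structure,
# discarding any candidate whose probability falls below the threshold.
def generate(groupings, threshold=0.0):
    # groupings: one list of (terminal, probability) pairs per grouping,
    # ordered left to right as in the complete structure
    for combo in itertools.product(*groupings):
        p = 1.0
        for _, pc in combo:
            p *= pc
        if p >= threshold:
            yield "".join(t for t, _ in combo), p

# illustrative terminals for the complete structure l5 d3 s1
words5 = [("apple", 1.0)]
digits3 = [("123", 0.21), ("000", 0.01)]
symbols1 = [("!", 0.31)]
kept = list(generate([words5, digits3, symbols1], threshold=0.05))
print([pw for pw, _ in kept])  # ['apple123!']
```

Here "apple000!" is discarded because its probability, 0.01 * 0.31 = 0.0031, falls below the 0.05 threshold.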
All generated passwords and their corresponding probabilities are stored in the tmp/ directory. The passwords are stored in files that match their complete structures. For example, the generated password password1 matches the complete structure l8 d1, and would be placed in the tmp/l8d1.txt.2 file with all other passwords that match that complete structure. The last number of the filename represents the number of distinct groups in the complete structure. A limitation of 32-bit systems is a restriction on the maximum file size of 2GB. To overcome this limitation, complete structures that produce more than 2GB of passwords have their results broken into multiple files, each with an additional number appended to the filename. If the complete structure above generated more than 2GB of passwords, the generated passwords would be in the files tmp/l8d1.txt.2.0, tmp/l8d1.txt.2.1, etc.

CHAPTER 4

CREATING A CUSTOM INPUT DICTIONARY

We have a thorough understanding of how numbers and special characters are used in generating passwords. The difficult aspect of breaking passwords pertains to intelligently selecting the words that form the root of the password. Many password-based authentication schemes have a password aging requirement, which states that a password must "be changed after some period of time has passed or after some event has occurred" [16]. One problem associated with password aging requirements is that users do not change the root of the password. Users may only change a prefix or a suffix, while leaving the root word of the password the same. For example, if a user currently has the password apple01, it would not be unreasonable to assume that a new password could be apple02. It is our assumption that if we can intelligently select the root word within a password, our chances of breaking passwords will increase drastically.
The focus of this research has been on understanding the user-generated passwords from our password corpus. MySpace® is a popular social networking site that allows users throughout the world to sign up and share photos, journals and interests [17]. As of March 7, 2008, it ranked third behind Google and Yahoo in internet site traffic [18]. MySpace® also enables musical groups and artists to create accounts to share their music and connect with fans. It is our assumption that MySpace® users identify with their favorite musical groups/artists and will incorporate those groups'/artists' lyrics into their passwords.

4.1 Extracting Song Lyrics

There are a large number of musical genres within MySpace®, so it was important to retrieve lyrics from a source that also represented a wide variety of popular musical genres. A popular user-submitted music lyric website, A-Z Lyrics Universe (http://www.azlyrics.com/), was selected for extracting song lyrics. Browsing through A-Z Lyrics Universe, we found that it contained a large selection of popular music lyrics in an easily parseable format. The lyrics were extracted using standard Unix command line utilities and lynx, "a general purpose distributed information browser for the World Wide Web" [19].

4.2 Song Lyric Analysis

A total of 55452 songs were analyzed and their lyrics extracted. All of the lyrics were converted to lowercase, and all periods, exclamation points and question marks were removed. These punctuation marks were removed as they were typically found at the end of lyric sentences and were not an integral part of the lyrics themselves. There were 15,636,602 total words found in the lyrics, but only 180,083 of them were unique. Table 4.1 describes the word lengths found in the lyrics.
Table 4.1: Frequency Analysis of Unique Words

  Word Length        Number of Unique Words    Percentage
  less than 5                16097                8.93866
  5-7                        72382               40.1937
  8-10                       54272               30.1372
  11-16                      21399               11.8829
  greater than 16            15933                8.84759
  Totals                    180083              100

However, when constructing a dictionary it is infeasible to include all available words. Therefore, we decided to include only words that occur at least 150 times within the lyrics. By enforcing this minimum frequency, we reduce the number of unique words from 180083 down to 4835. This may seem like a drastic reduction, but these 4835 words appear 14,510,365 times, which is 92.79% of the total number of words found in the lyrics. Table 4.2 displays the frequency analysis of the restricted lyric dictionary.

Table 4.2: Frequency Analysis of Restricted Words

  Word Length        Number of Occurrences    Percentage
  less than 5               10704302              73.77
  5-7                        3421115              23.577
  8-10                        367024               2.52939
  greater than 11              17924               0.123525
  Totals                    14510365             100

It is our assumption that these 4835 unique words provide ample coverage of the lyric collection's word space. It is interesting to note that a majority of the words in the restricted lyrics dictionary contain fewer than five letters. A possible explanation for this may be the user-submitted lyrics that contain slang terms. For example, the word "ima" appears 4063 times; it is a common slang abbreviation for "i'm gonna." It is the inclusion of these types of slang terms that could improve our password breaking performance, as it captures the common language used by many young people on the internet.

CHAPTER 5

EXPERIMENTS AND RESULTS

To test the effectiveness of our specialized grammar, we randomly separated the password corpus into two sets. The first set of passwords was labeled the training set, which contained 33561 passwords and was used to generate a specialized grammar.
The second set of passwords was labeled the test set, which contained 33481 passwords against which all generated passwords would be tested. For all experiments, the specialized grammar derived from the training set was used to generate passwords.

5.1 Specialized Grammar vs John the Ripper

We compared our probable password generation against a well-known password breaking utility, John the Ripper (version 1.7.2) [3]. Like our password generation program, John the Ripper accepts an input dictionary: it has a wordlist mode that accepts an input dictionary and generates passwords based on the dictionary and a set of transformation rules. Since we are attempting to generate a specific domain of passwords, it is important to use an input dictionary that is likely to be used in that same domain. To meet this requirement, we simply extracted all the unique words (where a word is any group of contiguous letters) found in the training set of passwords, which resulted in an input dictionary with 4143 words. All words within the dictionary are considered equally likely, therefore they all have a weight value of 1.0. This input dictionary was used by both our password generator and John the Ripper. The following command was used to generate a list of all passwords attempted by John the Ripper in wordlist mode, applying the standard transformation rules:

./john --wordlist=myspace.dict --rules --stdout > jtr_candidate.pwds

John the Ripper generated 189091 candidate passwords. We then used our probable password generator to generate all possible passwords based on the input dictionary; however, we selected only the 189091 most probable passwords. These two sets of candidate passwords were then compared against the password test set. Figure 5.1 displays the result of the comparison. John the Ripper was able to generate 5093 of the 33481 test passwords, or 15.21%.
Our probable password generator was able to generate 7286, or 21.75%, of the passwords in the test set. Figure 5.1 was generated by taking the number of passwords found at 10000-password-attempt intervals.

Figure 5.1: SG vs John the Ripper (Training Set Dictionary)

From the figure, it is clear that our probable password generator outperforms John the Ripper from the beginning and shows increases in passwords found toward the end of the experiment. Although our generated passwords outperformed John the Ripper, John the Ripper was not necessarily designed to use an input dictionary in the same manner as our password generator. To see if John the Ripper would perform better using a different wordlist, we ran John the Ripper using the actual training password set as its input wordlist. John the Ripper generated 239284 candidate passwords, but only managed to generate 1611, or 4.8%, of the passwords in the test set. This test shows that John the Ripper is more effective when using a wordlist of root words than the raw training passwords; in either case, our password generator performs better overall when compared to John the Ripper.

We also compared John the Ripper and our password generator when using an English dictionary. John the Ripper generated 197662 candidate passwords, and was only able to generate 659 passwords that are in the test password set. We then used our password generator to generate the 197662 most probable passwords. We were able to generate 830 of the passwords found in the test set. Figure 5.2 summarizes the results of this experiment.

Figure 5.2: SG vs John the Ripper (English Dictionary)

In Figure 5.2, John the Ripper performed better than our password generator within the first 100000 candidate passwords. However, John the Ripper's performance diminished quickly after that point, while our password generator continued to generate passwords in the test set.
In the previous experiments, we chose input dictionaries that are relevant to the password test set. We wanted to compare our performance against John the Ripper when using an input dictionary from a different domain. As MySpace® is a site directed towards younger computer users, a dictionary of scientific terms provides a significantly different set of words than MySpace® users are likely to use when creating passwords. Figure 5.3 illustrates the results of the experiment.

Figure 5.3: SG vs John the Ripper (Scientific Dictionary)

John the Ripper generated 250335 candidate passwords and was only able to generate 18 of the passwords found in the test set. We selected the 250335 most probable passwords from the passwords generated using our specialized grammar. We slightly outperformed John the Ripper by generating 19 passwords that were present in the test set. Although the total performance of our password generator was only slightly better than John the Ripper's, we were able to find 19 passwords within the first 60000 passwords, while John the Ripper needed 240000 candidate passwords to find 18.

Table 5.1 summarizes the performance improvements associated with using our specialized grammar when compared against John the Ripper.

Table 5.1: Performance Improvements

  Input Dictionary    Performance Improvement (%)
  MySpace®                       43.1
  English                        25.9
  Scientific                      5.5

We receive the largest improvement in performance when using the MySpace® input dictionary, and the smallest when using the scientific input dictionary. From these results, it is clear that our performance improves as we use an input dictionary that is specific to the domain of the passwords. Therefore, if we can determine the domain of a password, the chances of breaking it improve greatly when using our specialized grammar.
5.2 Specialized Grammar and Lyrics Dictionary

In the previous section, we established that our password generator performs better than John the Ripper. We also hypothesized in Chapter 4 that MySpace® users incorporate song lyrics into their MySpace® passwords; therefore, an input dictionary consisting of words found in song lyrics should allow us to generate more passwords in the test set than any previously used dictionary. We used our password generator to generate passwords based on the lyrics input dictionary. The 500,000 most probable passwords were chosen and tested against the test set. The value 500,000 was chosen as it is sufficiently large to determine any trends. The results of the experiment are shown in Figure 5.4.

Figure 5.4: Comparison of the Training Set Dictionary and Lyric Input Dictionaries

We were able to generate 2905 of the passwords found in the test set, which is much less than the 9507 passwords that were generated using the input dictionary consisting of words found in the password training set. Although the lyrics input dictionary performed poorly when compared to the training set input dictionary, it performed substantially better than the English input dictionary. In the previous section, when generating the 197662 most probable passwords based on the English input dictionary, we were able to generate only 830 passwords. In Figure 5.4, we were able to generate 2019 passwords after trying the 190000 most probable passwords, substantially more than the 830 when using the English input dictionary. We can conclude that our lyrics input dictionary contains more root words than the English input dictionary; the lyrics dictionary is more specific to the MySpace® domain than the English dictionary. Although our lyrics dictionary did not perform as well as anticipated, it did demonstrate that more probable passwords can be generated by using a domain-specific input dictionary.
CHAPTER 6

FUTURE RESEARCH

Through our comparisons with John the Ripper, we have established that our specialized grammar can effectively generate more probable passwords than current password breaking technology. Instead of applying common transformation and mutation patterns to dictionary words, we learned from known passwords and used that knowledge to generate passwords. Although our password generation results when using the lyrics dictionary were not as we anticipated, we still feel that more research into custom input dictionaries will result in the generation of more probable passwords. In this research, we focused primarily on the structure of the passwords and not on the words used in them. If we can better understand, from a user's perspective, the reasoning behind root word selection, we may be better able to generate probable passwords.

Since our research focused only on the structure of passwords, there are no language dependencies associated with this password generation strategy. Future research may include how effective this strategy is when applied to foreign languages. From a law enforcement or military perspective, having a password breaking tool that can intelligently generate probable passwords in a foreign language could be invaluable for national security.

Our probable password selection strategy was simple and straightforward: we selected the most probable passwords, regardless of the password structure. For example, if we wanted to select 10,000 passwords to test and the first complete structure had generated over 10,000 passwords, only passwords matching the first complete structure would be selected. We chose this strategy because it was the easiest to understand. Additional research can explore other password selection strategies that use the statistical information of the simple and complete structures to select a more diverse password test set.
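The simple selection strategy described above can be sketched as a top-k selection over the candidate list (illustrative code and names, not the thesis implementation):

```python
import heapq

# Keep the k most probable candidates, regardless of which complete
# structure produced them.
def select_top(candidates, k):
    # candidates: iterable of (password, probability) pairs
    return heapq.nlargest(k, candidates, key=lambda c: c[1])

top2 = select_top([("cat99!", 0.675), ("cat1!", 0.25), ("cat00!", 0.075)], 2)
print([pw for pw, _ in top2])  # ['cat99!', 'cat1!']
```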
The performance of our password generation program was not a primary concern throughout this research. Additional research can explore parallelizing the password generation process. Our current implementation also does not allow for generation of the ith password. Additional research into an indexing algorithm would allow the specific generation of the ith password, which would in turn allow us to generate rainbow tables based on our password generator.

As this research focuses on the generation of probable passwords, another future research project could be the development of a proactive password checker based on learned password structures. A proactive password checker can educate users in strong password generation and improve the security of a password-based authentication system.

Overall, we have established that generating passwords based on a specialized grammar is an effective strategy. Although these results are preliminary, there is enough evidence to warrant further research.

APPENDIX A

COMPUTE GRAMMAR

This code computes the grammar associated with known passwords and is described in Section 3.4.1.
#!/usr/bin/python
import string, sys, os, re

def populate_symbols(total, symbol):
    symbol = [(v, k) for k, v in symbol.items()]
    symbol.sort()
    symbol.reverse()
    for count, word in symbol:
        file = "symbols/" + str(len(word)) + ".txt"
        out = open(file, 'a')
        line = "%s %f\n" % (word, float(count) / total)
        out.write(line)
        out.close()

def populate_numbers(total, numbers):
    numbers = [(v, k) for k, v in numbers.items()]
    numbers.sort()
    numbers.reverse()
    for count, word in numbers:
        file = "numbers/" + str(len(word)) + ".txt"
        out = open(file, 'a')
        line = "%s %f\n" % (word, float(count) / total)
        out.write(line)
        out.close()

def populate_complete_grammar(total, complete):
    complete = [(v, k) for k, v in complete.items()]
    complete.sort()
    complete.reverse()
    out = open("grammar/complete.txt", 'w')
    for count, word in complete:
        if len(word) < 17:
            line = "%s %f\n" % (word, float(count) / total)
            out.write(line)
    out.close()

def populate_simple_grammar(total, simple):
    simple = [(v, k) for k, v in simple.items()]
    simple.sort()
    simple.reverse()
    out = open("grammar/simple.txt", 'w')
    for count, word in simple:
        if len(word) < 17:
            line = "%s %f\n" % (word, float(count) / total)
            out.write(line)
    out.close()

def simplify_grammar(n):
    ret = ''
    char = ''
    for x in range(0, len(n)):
        if not (n[x].isspace()):
            if n[x] != char:
                ret += n[x]
                char = n[x]
    return ret

# This function accepts filename n, reads in the file, and
#   1. generates the complete grammar of the file
#   2. generates the simple grammar of the file
#   3. computes stats associated with the numbers
#   4. computes stats associated with the symbols
def generate_complete_grammar(n):
    symbol = {}
    symbol_total = 0
    number = {}
    number_total = 0
    complete = {}
    complete_total = 0
    simple = {}
    simple_total = 0

    in_fd = open(n, 'r')
    for password in in_fd:
        password = string.rstrip(password)

        complete_temp = ''
        simple_temp = ''
        number_temp = ''
        symbol_temp = ''
        last_ch = ''

        for x in range(0, len(password)):
            if password[x].isalpha():
                complete_temp += "L"
                if last_ch == 'D':
                    if number_temp != '':
                        try:
                            number[number_temp] += 1
                        except:
                            number[number_temp] = 1
                        number_total += 1
                        number_temp = ''
                elif last_ch == 'S':
                    if symbol_temp != '':
                        try:
                            symbol[symbol_temp] += 1
                        except:
                            symbol[symbol_temp] = 1
                        symbol_total += 1
                        symbol_temp = ''
                last_ch = 'L'
            elif password[x].isdigit():
                complete_temp += "D"
                if (last_ch == '') or (last_ch == 'D'):
                    number_temp += password[x]
                elif last_ch == 'S':
                    if symbol_temp != '':
                        try:
                            symbol[symbol_temp] += 1
                        except:
                            symbol[symbol_temp] = 1
                        symbol_total += 1
                        symbol_temp = ''
                last_ch = 'D'
            else:
                complete_temp += "S"
                if (last_ch == '') or (last_ch == 'S'):
                    symbol_temp += password[x]
                elif last_ch == 'D':
                    if number_temp != '':
                        try:
                            number[number_temp] += 1
                        except:
                            number[number_temp] = 1
                        number_total += 1
                        number_temp = ''
                last_ch = 'S'

        if number_temp != '':
            try:
                number[number_temp] += 1
            except:
                number[number_temp] = 1
            number_total += 1

        if symbol_temp != '':
            try:
                symbol[symbol_temp] += 1
            except:
                symbol[symbol_temp] = 1
            symbol_total += 1

        try:
            complete[complete_temp] += 1
        except:
            complete[complete_temp] = 1
        complete_total += 1

        simple_temp = simplify_grammar(complete_temp)
        try:
            simple[simple_temp] += 1
        except:
            simple[simple_temp] = 1
        simple_total += 1
    in_fd.close()

    # finished computing stats
    populate_simple_grammar(simple_total, simple)
    populate_complete_grammar(complete_total, complete)
    populate_numbers(number_total, number)
    populate_symbols(symbol_total, symbol)

if __name__ == '__main__':
    generate_complete_grammar(sys.argv[1])

APPENDIX B

PASSWORD GENERATION

Detailed in Section 3.4.2, this
is the code to generate passwords based on a specialized grammar.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>
#include <math.h>
#include <popt.h>

#define SIMPLE   "grammar/simple.txt"
#define COMPLETE "grammar/complete.txt"
#define MAXLEN   60
#define MAXFILE  32
#define FORMAT1  "%s"
#define FORMAT2  "%s %f"

int compare(char *complex, char *simple);
int generate_passwords(char *simple, float prob);
int process_complete_grammars(char *simple);
int get_number_of_chunks(char *word);
int breakdown(char *structure, float prob, int cycle, char *pathbase, int final_value);
int populate_words(char *file);
int scrub(void);

int line_cntr=0;     /* counts the current complete grammar being generated */
int complete_cntr=0; /* total number of grammars being generated */

static float threshold=-1.0; /* threshold obtained from command line arguments */
static int verbose=0;        /* verbosity obtained from command line arguments */
static int np=0;             /* number of processes */
static int format=0;

int main(int argc, char *argv[])
{
    FILE *s_file,*c_file;  /* pointer to grammar/simple.txt fd */
    char *dictionary;
    int rc;
    char buf[MAXLEN],buf2[MAXLEN];
    float prob,dummy;      /* holds the probability of a particular value */
    struct timeval tv;     /* for timing purposes */
    time_t start_time, end_time;
    int total_time;
    static int populate;
    static int clean;
    char ch;               /* used to parse command line options */
    poptContext opt_con;   /* context for parsing command-line options */

    static struct poptOption options_table[] = {
        { "clean", 'c', POPT_ARG_NONE, &clean, 'c',
          "clean the words/ and tmp/ directories", NULL },
        { "populate", 'p', POPT_ARG_NONE, &populate, 'p',
          "only populate the words from the input dictionary", NULL },
        { "procs", 'n', POPT_ARG_INT, &np, 'n',
          "number of processes to create", "val>1" },
        { "threshold", 't', POPT_ARG_FLOAT, &threshold, 't',
          "set minimum threshold", "val" },
        { "verbose", 'v', POPT_ARG_NONE, &verbose, 'v',
          "enable verbose logging", NULL },
        { "format", 0, POPT_ARG_NONE, &format, 0,
          "input file in the form: <word> <probability>", NULL },
        POPT_AUTOHELP
        { NULL, 0, 0, NULL, 0 } /* end-of-list terminator */
    };

    opt_con = poptGetContext(NULL, argc, (const char **)argv, options_table, 0);
    poptSetOtherOptionHelp(opt_con, "[OPTIONS]* <input dictionary>");

    /* process the command line options */
    while ((ch = poptGetNextOpt(opt_con)) >= 0) {
        switch(ch) {
        case '?':
            poptPrintHelp(opt_con,stderr,0);
            break;
        }
    }
    if (ch < -1) {
        /* if there was an error with the command line arguments that
         * was missed, it would be caught here */
        poptPrintHelp(opt_con,stderr,0);
        return 0;
    }
    if ((np)&&(np<2)) { /* error checking with the number of processes */
        poptPrintHelp(opt_con,stderr,0);
        return 0;
    }

    dictionary = (char *)poptGetArg(opt_con); /* get the input file */
    if ((dictionary == NULL)&&(!clean)) {
        poptPrintHelp(opt_con,stderr,0);
        return 0;
    }

    /* begin timing */
    gettimeofday(&tv,NULL);
    start_time = tv.tv_sec;

    if (clean) {
        system("/bin/rm -f ./words/*");
        system("/bin/rm -f ./tmp/*");
        printf("Cleaned words/ and tmp/ directories.\n");
        if (dictionary == NULL) {
            printf("No input dictionary selected - exiting.\n");
            return 0;
        }
    }

    printf("Populating words from input dictionary...\n");
    if ((rc=populate_words(dictionary))<0) {
        exit(-1);
    }
    printf("Input dictionary contained %d words.\n",rc);

    if (populate) {
        gettimeofday(&tv,NULL);
        end_time = tv.tv_sec;
        total_time = end_time-start_time;
        printf("Total population time: %1ds\n",total_time);
        return 0;
    }

    if (access(SIMPLE,R_OK)<0) {
        printf("Error: cannot access %s\n",SIMPLE);
        printf("Need to generate grammars\n");
        exit(-1);
    }
    if (access(COMPLETE,R_OK)<0) {
        printf("Error: cannot access %s\n",COMPLETE);
        printf("Need to generate grammars\n");
        exit(-1);
    }

    s_file = fopen(SIMPLE,"r");
    while(fscanf(s_file,"%s %f",buf,&dummy)>0) {
        c_file = fopen(COMPLETE,"r");
        while(fscanf(c_file,"%s %f",buf2,&dummy)>0) {
            if (compare(buf2,buf))
                complete_cntr++;
        }
        fclose(c_file);
    }
    fclose(s_file);

    printf("Complete grammars to be analyzed: %d\n",complete_cntr);
    printf("Generating passwords (threshold=%.20f)...\n",threshold);

    s_file = fopen(SIMPLE,"r");
    /* To examine just one simple structure, set buf = "LDS" (or whatever
     * simple structure you want) and take out the loop */
    while(fscanf(s_file,"%s %f",buf,&prob)>0) {
        rc = generate_passwords(buf,prob);
    }
    fclose(s_file);

    gettimeofday(&tv,NULL);
    end_time = tv.tv_sec;

    /* need to clean up tmp/ directory */
    printf("********************************\n");
    printf("Password generation complete.\n\nCleaning up...\n");
    scrub();
    printf("\nAll generated passwords are located in the tmp/ directory.\n");
    total_time = end_time-start_time;
    printf("Total generation time: %1d seconds\n",total_time);

    return 0;
}

int scrub()
{
    FILE *in_fd;
    char path[MAXFILE];
    char structure[MAXLEN];
    float temp;
    int num_chunks;
    int i;

    in_fd = fopen(SIMPLE,"r");
    while(fscanf(in_fd,"%s %f",structure,&temp)>0) {
        sprintf(path,"tmp/%s.txt",structure);
        if (unlink(path)<0) {
            /*printf("unlink: could not remove %s\n",path);*/
        }
    }
    fclose(in_fd);

    in_fd = fopen(COMPLETE,"r");
    while(fscanf(in_fd,"%s %f",structure,&temp)>0) {
        i=1;
        num_chunks = get_number_of_chunks(structure);
        while(i<num_chunks) {
            sprintf(path,"tmp/%s.txt.%d",structure,i);
            if (unlink(path)<0) {
                /*printf("unlink: could not remove %s\n",path);*/
            }
            i++;
        }
    }
    fclose(in_fd);

    return 0;
}

/* Take a complex complete structure and break it down
 * into its simplified form */
int compare(char *complex, char *simple)
{
    char buf[MAXLEN];
    char last=0;
    int pos=0,x;

    for(x=0;x<strlen(complex);x++) {
        if (last != complex[x]) {
            buf[pos]=complex[x];
            last = complex[x];
            pos++;
        }
    }
    buf[pos]='\0';

    if (strcmp(buf,simple)==0)
        return 1;
    else
        return 0;
}

/* generate all the passwords based on the simple structure;
 * output passwords to tmp/ files */
int generate_passwords(char *simple, float prob)
{
    FILE *c_file; /* fd to grammar/complete.txt */
    FILE *tmp_file;
    float p,in_p;
    char buf[MAXLEN];
    char f[MAXFILE];

    sprintf(f,"tmp/%s.txt",simple);
    printf("Generating %s file to hold intermediate results...\n",f);
    tmp_file = fopen(f,"w");

    c_file = fopen(COMPLETE,"r");
    while(fscanf(c_file,"%s %f",buf,&in_p)>0) {
        if (compare(buf,simple)) {
            /* the complete grammar matches the simplified grammar */
            p = in_p * prob;
            fprintf(tmp_file,"%s %.20f\n",buf,p);
        }
    }
    fclose(c_file);
    fclose(tmp_file);

    process_complete_grammars(simple);
    return 0;
}

/* examine the tmp/ file to start generating the words */
int process_complete_grammars(char *simple)
{
    FILE *in;
    char file[MAXFILE];
    char structure[MAXLEN];
    float prob;
    int np_cntr=0; /* counts the current number of processes running */
    int pid,rc,stat;

    sprintf(file,"tmp/%s.txt",simple);
    in = fopen(file,"r");
    while(fscanf(in,"%s %f",structure,&prob)>0) {
        /*if(verbose)*/
        printf("%d: Processing complete structure %s --> %s\n",
               line_cntr++,structure,simple);
        if (np) { /* concurrent processing */
            if (np_cntr >= np) {
                rc = waitpid(-1,&stat,0);
                if (verbose) printf("%d: child process reaped.\n",rc);
                np_cntr--;
            }
            if (np_cntr < np) {
                if ((pid = fork())<0) {
                    perror("fork");
                    exit(-1);
                }
                if (pid == 0) {
                    if (verbose)
                        printf("%d: child process analyzing %s.\n",getpid(),structure);
                    breakdown(structure,prob,0,structure,0);
                    exit(0);
                } else {
                    np_cntr++;
                }
            }
        } else { /* single process only */
            breakdown(structure,prob,0,structure,0);
        }
    }
    while(np_cntr > 0) {
        rc = waitpid(-1,&stat,0);
        if (verbose) printf("%d: child process reaped.\n",rc);
        np_cntr--;
    }

    return 0;
}

int get_number_of_chunks(char *word)
{
    int count=1;
    char c;
    int i=0;

    if (word == NULL)
        return 0;

    c=word[0];
    while(i<strlen(word)) {
        if (c!=word[i]) {
            count++;
            c=word[i];
        }
        i++;
    }
    return count;
}

int get_chunk_size(char *word)
{
    int count=0;
    char c;

    if (word == NULL)
        return 0;

    c=word[0];
    while(c==word[count])
        count++;
    return count;
}

int breakdown(char *structure, float prob, int cycle, char *pathbase, int final_value)
{
    int chunk=0;
    char symbol;
    char in[MAXFILE];
    char out[MAXFILE];
    char data[MAXFILE];
    char word[MAXLEN];
    FILE *in_fd, *out_fd, *data_fd;
    float p;

    sprintf(in,"tmp/%s.txt.%d",pathbase,cycle);
    sprintf(out,"tmp/%s.txt.%d",pathbase,cycle+1);

    chunk = get_chunk_size(structure);
    symbol = structure[0];

    if (cycle == 0) { /* have no previous input to be created */
        switch(symbol) {
        case 'L': sprintf(data,"words/%d.txt",chunk); break;
        case 'D': sprintf(data,"numbers/%d.txt",chunk); break;
        case 'S': sprintf(data,"symbols/%d.txt",chunk); break;
        }
        if (access(data,R_OK)<0) {
            printf("error: %s not available.\n",data);
            return 0;
        }
        out_fd = fopen(out,"w");
        data_fd = fopen(data,"r");
        while(fscanf(data_fd,"%s %f",word,&p)>0) {
            fprintf(out_fd,"%s %.20f\n",word,p*prob);
        }
        fclose(out_fd);
        fclose(data_fd);
    } else { /* need to incorporate previous results */
        char buf[MAXLEN];
        float p2;

        switch(symbol) {
        case 'L': sprintf(data,"words/%d.txt",chunk); break;
        case 'D': sprintf(data,"numbers/%d.txt",chunk); break;
        case 'S': sprintf(data,"symbols/%d.txt",chunk); break;
        }
        if (access(data,R_OK)<0) {
            printf("error: %s not available.\n",data);
            return 0;
        }
        in_fd = fopen(in,"r");
        out_fd = fopen(out,"w");
        while(fscanf(in_fd,"%s %f",buf,&p2)>0) {
            struct stat s;
            data_fd = fopen(data,"r");
            if (stat(out,&s)<0) {
                fprintf(stderr,"stat");
            }
            if (s.st_size > 2100000000) { /* output file is close to 2GB limit */
                fclose(out_fd);
                printf("\tFile %s is close to the 2GB limit\n",out);
                sprintf(out,"tmp/%s.txt.%d.%d",pathbase,cycle+1,final_value++);
                printf("\tContinue and redirect output to %s\n",out);
                out_fd = fopen(out,"w");
            }
            while(fscanf(data_fd,"%s %f",word,&p)>0) {
                if ((p2*p)>=threshold)
                    fprintf(out_fd,"%s%s %.20f\n",buf,word,p2*p);
            }
            fclose(data_fd);
        }
        fclose(in_fd);
        fclose(out_fd);
    }

    structure += chunk;
    if (strlen(structure)>0)
        breakdown(structure,prob,++cycle,pathbase,final_value);
    else
        return 0;

    return 0;
}

/* take input file and populate the words directory
 * pathname -> input dictionary
 * returns  -> number of words analyzed */
int populate_words(char *pathname)
{
    FILE *in,*out;
    char buf[MAXLEN];
    char file_out[MAXFILE];
    int len,words=0,rc;
    float f;

    if (access(pathname,R_OK)<0) { /* unable to access the input dictionary */
        printf("Error: cannot access %s\n",pathname);
        return -1;
    }

    /* can read input dictionary */
    in = fopen(pathname,"r");
    while(fscanf(in,"%s",buf)>0) {
        /* break down each word by length */
        len = strlen(buf);

        if (format)
            rc=fscanf(in,"%f",&f);
        else
            f=1.0;

        sprintf(file_out,"words/%d.txt",len);
        out = fopen(file_out,"a");
        fprintf(out,"%s %.20f\n",buf,f);
        fclose(out);

        words++;
    }
    fclose(in);
    return words;
}

REFERENCES

[1] Albert J. Marcella and Robert S. Greenfield, editors. Cyber Forensics: A Field Manual for Collecting, Examining, and Preserving Evidence of Computer Crimes. Auerbach Publications, Boca Raton, FL, 2002.

[2] Wenbo Mao. Modern Cryptography: Theory and Practice. Hewlett-Packard Company, Upper Saddle River, NJ, 2004.

[3] Openwall Project. John the Ripper password cracker, 2008. http://www.openwall.com/john/.

[4] Oxid.it. Cain and Abel, 2008. http://www.oxid.it/cain.html.

[5] Philippe Oechslin. Making a Faster Cryptanalytic Time-Memory Trade-Off. In The 23rd Annual International Cryptology Conference, CRYPTO '03, volume 2729 of Lecture Notes in Computer Science, pages 617–630, 2003.

[6] P. Oechslin. Password cracking: Rainbow tables explained. The International Information Systems Security Certification Consortium, Inc., 2005.

[7] Arvind Narayanan and Vitaly Shmatikov. Fast dictionary attacks on passwords using time-space tradeoff. In CCS '05: Proceedings of the 12th ACM Conference on Computer and Communications Security, pages 364–372, New York, NY, USA, 2005. ACM.

[8] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, April Rasala, Amit Sahai, and abhi shelat. Approximating the smallest grammar: Kolmogorov complexity in natural models.
In STOC '02: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 792–801, New York, NY, USA, 2002. ACM.

[9] Simon Singh. Letter frequencies, 2008. http://www.simonsingh.net/.

[10] Zhu Shuanglei. Project RainbowCrack, 2007. http://www.antsight.com/zsl/rainbowcrack/.

[11] N. Chomsky. Three models for the description of language. IEEE Transactions on Information Theory, 2(3):113–124, Sep 1956.

[12] Yusuk Lim, Changsheng Xu, and David Dagan Feng. Web based image authentication using invisible fragile watermark. In VIP '01: Proceedings of the Pan-Sydney Area Workshop on Visual Information Processing, pages 31–34, Darlinghurst, Australia, 2001. Australian Computer Society, Inc.

[13] Joseph Goldberg, Jennifer Hagman, and Vibha Sazawal. Doodling our way to better authentication. In CHI '02: Extended Abstracts on Human Factors in Computing Systems, pages 868–869, New York, NY, USA, 2002. ACM.

[14] Robert McMillan. Phishing attack targets MySpace users, 2006. http://www.infoworld.com/infoworld/article/06/10/27/HNphishingmyspace_1.html.

[15] Florida State University. Electronic Crime Investigative Technologies Laboratory, 2008. http://ecit.fsu.edu/.

[16] Matt Bishop. Computer Security: Art and Science. Addison-Wesley, 2003.

[17] Thomas Anderson and Christopher DeWolfe. MySpace, 2008. http://www.myspace.com.

[18] Alexa: The Web Information Company, 2008. http://www.alexa.com.

[19] Thomas Dickey. Lynx, 2006. http://lynx.isc.org.

[20] Zhiyi Chi. Statistical properties of probabilistic context-free grammars. Computational Linguistics, 25(1):131–160, 1999.

BIOGRAPHICAL SKETCH

William J. Glodek

William J. Glodek completed his Bachelor's degree in Computer and Information Sciences at the University of Delaware in the spring of 2006. Under the advisement of Prof. Sudhir Aggarwal and Prof.
Breno de Medeiros, he obtained his Master's degree in Information Security from the Department of Computer Science at Florida State University in spring 2008. William's research interests include general computer and network security.