COMP2240 / AI20 — Artificial Intelligence
Lecture NLP-1: Introduction to Natural Language Processing

Overview

This lecture covers the following topics:
• Motivations for computerised natural language processing (NLP).
• Why NLP is so difficult.
• Some approaches and techniques.
• A quick preview of the Python Natural Language ToolKit (NLTK).

Useful Resources

• Applied Natural Language Processing course by Marti Hearst at UC Berkeley. These online lecture notes are well worth browsing to give a complementary perspective on NLP.
• Speech and Language Processing (Jurafsky and Martin). Chapter 1 is freely available online and gives a good overview of the research field of NLP.
• NLTK — the Python Natural Language Toolkit package. The website has a huge amount of information, covering many topics and giving examples and exercises relating to the NLTK software.

The Ultimate Goal

Dave Bowman: “Open the pod bay doors, HAL.”
HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”
From 2001: A Space Odyssey (1968)

Real-World Applications of NLP

• Spelling Suggestions/Corrections
• Grammar Checking
• Synonym Generation
• Information Extraction
• Text Categorisation
• Automated Customer Service
• Speech Recognition (limited)
• Machine Translation
In the (near?) future: Question Answering, Improving Web Search, Automated Metadata Assignment, Online Dialogs, ...

Topics Covered

• Tokenisation — how to break text into words and word-like units.
• Stemming and Morphemes — how complex words are made up by combining simpler lexical units.
• Formal Grammars — how to generate and parse sentences using formal specifications of syntax.
• Semantics — how can we determine the meanings of words and sentences?
• Statistical Corpus-Based Approaches — how to use statistical properties of large amounts of text to support a variety of NLP applications.

Why NLP is Difficult

• Many words can have a variety of different meanings and functions depending on their context.
• The relation of the superficial structure of natural language text (and speech) to its meaning is extremely complex and subtle.
• Interpreting natural language often requires a large amount of background knowledge.
• Computers are not social beings in the way that humans are.

Language Subtleties

Which of these sound odd?
• A big black dog
• A big black scary dog
• A big scary dog
• A scary big dog
• A black big dog

Ambiguity

Consider the different meanings of ‘with’:
• He wrecked the car with his mates
• He wrecked the car with his custom plates
• He wrecked the car with his bulldozer
• He wrecked the car with his stupidity

• She upset the boy with her joke
• She upset the boy with his sister
• She upset the mug with her hand
• She upset the mug with her coffee
• She upset the mug with her name

Context and Disambiguation

Consider the following newspaper headlines:
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• Kids Make Nutritious Snacks
• British Left Waffles on Falkland Islands
• Red Tape Holds Up New Bridges
• Bush Wins on Budget, but More Lies Ahead
• Hospitals are Sued by 7 Foot Doctors

Sense vs. Nonsense

“Colourless green ideas sleep furiously” is a sentence composed by Noam Chomsky in 1957 as an example of a sentence with correct grammar but nonsensical semantics. The sentence therefore has no understandable meaning.
One way of explaining such nonsense is to say that although the combination of grammatical categories of the words is coherent, the combination of semantic categories is inconsistent.

Two Approaches to NLP

• Formal Grammars and Logic-Based Semantics. This approach looks for precise structures and regularities in language and meaning. It tries to see beyond the messiness of actual linguistic data to construct idealised theories.
• Statistical Analysis based on large language Corpora. This approach deals directly with the messy data of actual language use. It typically develops task-specific algorithms rather than general theories.

Formal Grammars

A grammar is a specification of the possible ‘correct’ forms of linguistic expression (sentences). A formal grammar is a precise symbolic representation of a grammar, usually specified by a system of rules. A huge amount of research has focused on the specification of formal grammars, and many different paradigms have been proposed.

Grammars can be used in two opposite directions:
• Generating grammatical sentences.
• Parsing sentences in order to discover their grammatical structure.
Parsing linguistic expressions is usually much more difficult than generating them. Why?

Formal Grammar Examples

One simple approach is that of Context-Free Grammar (CFG). Here is a simple CFG rule set:
• S −→ SS
• S −→ (S)
• S −→ ()
By applying these rules starting with S we can generate arbitrary correctly nested strings of brackets:
S → SS → SSS → (S)SS → ((S))SS → ((SS))SS → ((()S))SS → ((()()))SS → ((()()))()S → ((()()))()(S) → ((()()))()(())

Grammatical Categories

The formulation of grammars often makes heavy use of the concept of the grammatical category of a word or phrase:

S      sentence                The cat sat on the mat
NP     noun phrase             The cat
VP     verb phrase             sat on the mat
PP     prepositional phrase    on the mat
N      noun                    cat
V      verb                    sat
AT     article                 the
P      preposition             on
Adj    adjective               black
Adv    adverb                  lazily
AdvP   adverbial phrase        after the exams

Example of a Parse Tree

In parsing we want to derive the sequence of grammar rules that produced a given sentence.

[Figure: parse tree for “The cat sat on the mat”, using S=sentence, NP=noun phrase, VP=verb phrase, PP=prepositional phrase, N=noun, V=verb, AT=article, P=preposition.]

Deep vs Surface Structure

Try to work out parse trees for the following sentences:
• “Time flies like an arrow.”
• “Fruit flies like a banana.”

Statistical Techniques

Statistical techniques make use of large corpora (i.e. organised collections) of natural text in order to analyse language. Certain kinds of correlation have been found to be useful for useful tasks:
• Word frequency distribution — how often words occur in the corpus and in a given piece of text.
• Word collocation — how frequently a given pair of words is found together.
• Context similarity — how often each of a given pair of words is found within the same textual context.
Can you think of applications that could make use of these statistics?

The Python NLTK Package

The NLTK package contains a large set of NLP programs, as well as natural language text data and supporting learning materials.
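Before turning to NLTK itself: the collocation statistic mentioned above (how often a pair of words is found together) can be computed in a few lines of plain Python. This is a minimal sketch of my own, not part of the course code; the function name `bigram_counts` is an illustrative choice.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count how often each adjacent word pair (bi-gram) occurs."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "the cat sat on the mat".split()
counts = bigram_counts(tokens)
print(counts.most_common(3))
```

Scaled up to a large corpus, exactly this kind of count underlies the collocation tables seen later in the module.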
The following code examples are given in Chapter 1 of the NLTK book, Natural Language Processing with Python:

import nltk
from nltk.book import *
text1.concordance("monstrous")
text1.similar("monstrous")
text2.generate(300)

All the examples should run directly on the SoC Linux machines. You can also install NLTK on your own machine. (Note: these examples follow the first edition of the NLTK book; the generate method has changed across later NLTK releases.)

Further Reading

• Lecture 1 of the Applied Natural Language Processing course by Marti Hearst.
• Chapter 1 of Speech and Language Processing by Jurafsky and Martin.
• The Wikipedia page on Context-Free Grammar.
• Chapter 1 of the NLTK book, Natural Language Processing with Python.

COMP2240 / AI20 — Artificial Intelligence
Lecture NLP-2: Tokenisation and Regular Expressions

What is Tokenisation?

• Tokenisation is breaking down a stream of characters into individual items of interest.
• Typically tokens are words, but punctuation marks can also be treated as separate tokens. In some applications line breaks might be tokens.
• Breaking down the input into sentences is called ‘sentence tokenisation’ or ‘sentence segmentation’.
• Both words and sentences are far more difficult to recognise than you might imagine. We concentrate on English; other languages can introduce very different challenges.

What tokenising problems can you spot?

"It was widely assumed that the slowdown in the U.S. economy that began in mid-2007 would reduce demand for electronics goods and, by extension, semiconductors in 2008," said Richard Gordon, research vice president at Gartner.
"However, while we are still forecasting low single-digit growth for the semiconductor market in 2008, this has more to do with supply-side factors than weakness in demand."

Worldwide semiconductor revenue is forecast to reach $286.5 billion in 2008, a 4.6 percent increase from 2007 revenue of $273.9 billion, according to the latest projections by Gartner, Inc. Gartner has slightly increased its growth rate from February when it forecast a 3.4 percent increase for 2008.

(Source: http://www.gartner.com/it/page.jsp?id=684112)

More Tokenising Tribulations

Consider the problems involved in tokenising the following text examples:
• The New York-New Jersey railroad
• The Leeds-based architecture group plans a Radically New York skyline -- 100-storey towers replacing the redundant minster.
• He’d’ve never said "I ain’t saying ’yessir’ to nobody".
• We all know that 25,000 pounds is a large sum and that 1 + 1 is a small sum.
• Call me on (+44) (0) 113 3431076 :-)

Splitting Strings in Python

Python’s split method provides a simple and convenient way to chop strings into pieces:

>>> s = "this is a sentence."
>>> s.split(" ")
['this', 'is', 'a', 'sentence.']
>>> s.split('e')
['this is a s', 'nt', 'nc', '.']

Note that the method split() with no arguments splits on runs of whitespace rather than on individual spaces. Contrast:

"Sentence  with  extra  spaces".split(' ')
"Sentence  with  extra  spaces".split()

Regular Expressions

• A regular expression is a string that is used to represent a string pattern.
• Each regular expression matches a specific set of strings.
• Regular expressions are often used to find particular kinds of substring within a longer string.
• A more or less standard syntax for regular expressions is used in many different programming and shell languages.
• Regular expression matching provides a powerful tool for a wide variety of low-level NLP operations.

RegExp Special Symbols and Operators

Make notes of the following special symbols:
    .   *   +   ?   [abc]   [^abc]   [A-Z0-9]   ed|ing|s
These match a symbol type rather than a specific symbol:
    \d   \s   \w   \t   \n   \D   \S   \W
These match a position rather than a symbol:
    ^   $   \A   \Z   \b
(For more advanced matching, grouping brackets and group backreferences are often used — e.g. (...).*\1. You won’t be examined on these.)

Regular Expressions in Python

Since both Python strings and RegExps use the ‘\’ symbol as a special (escape) character, it is confusing to use ordinary strings to represent RegExps. Instead use raw strings of the form r"some chars" (or r'some chars'). These ignore string escapes, so any ‘\’ will be interpreted as part of the RegExp.

import re   # import regular expression library
string = "The numbers 6.3 and 6 are greater than 5.998."
print(re.findall(r"\w+", string))
print(re.findall(r"\d*\.\d*", string))

Check out the online documentation of Python regular expressions.

What do these match?

r"\w+"
r"\bz\w*"
r"\w+-\s?\w+"
r"\w+\'\w+"
r"(\$|#)?\d+(\.)?\d+%?"

Finding Words Matching a RegExp

Example of corpus word analysis using NLTK:

import nltk, re
from nltk.corpus import brown   # 'brown' is a well-known text corpus

# Task: get all words in the corpus containing a hyphen.
# Method: use a 'list comprehension' (element enumeration filter).
print([w for w in brown.words() if re.search('-', w)])

Splitting wrt a RegExp

The Python re package also allows you to split a string with respect to a regular expression that matches the break points:

re.split(r'\s+', s)
re.split(r'\W+', s)

This can be used to implement various kinds of tokenising.

COMP2240 / AI20 — Artificial Intelligence
Lecture NLP-3: Grammars and Parsing

Overview

This lecture covers the following topics:
• Formal specification of grammars.
• Types of grammar (the Chomsky hierarchy).
• Various small example grammars.
• The relation between (regular) grammars and regular expressions.
• A very brief look at parsing.

General Principles of Formal Grammar

Terminal symbols are the actual symbols of the language that the grammar specifies. These could be individual characters, words or morphemes (i.e. meaningful parts of words).
Non-terminal symbols correspond to classes of expressions of a particular type. They are often associated with grammatical categories of words or phrases. A start symbol S is always included in the non-terminals.

Production Rules and Languages

A production rule specifies a re-write of one sequence of symbols to a new sequence of symbols. In general both the input and output sequences can contain a mixture of terminals and non-terminals.
A sequence of terminals is generated by a given grammar if it is the result of applying a series of production rules of that grammar, starting from the symbol S.
A grammar specifies a language consisting of all those strings (i.e. sequences of terminals) generated by its rules.

Formal Grammar Definition

A formal grammar is a structure ⟨Σ, N, S, R⟩ where:
• Σ is a finite set of terminal symbols;
• N is a finite set of non-terminal symbols;
• S ∈ N is the start symbol;
• R is a set of production rules.
Each rule in R is of the form
    α −→ β
where α is a sequence of symbols containing at least one non-terminal symbol, and β is any sequence of symbols.

Notational Conventions

The symbol ‘|’ is used to represent alternative symbols. This notation provides a convenient abbreviation for multiple similar rules. E.g.
    N −→ dog | hog | frog
is equivalent to the three rules:
    N −→ dog
    N −→ hog
    N −→ frog
The symbol ‘ε’ is often used to denote the empty symbol sequence.

Grammar for Arithmetic Expressions

A grammar for arithmetic expressions can be defined as follows:
    Garith = ⟨{0, ..., 9, −, +, ∗, /}, {S, N}, S, R⟩
where the rule set R is:
• S −→ N
• S −→ −(S)
• S −→ (S + S)
• S −→ (S ∗ S)
• S −→ (S/S)
• N −→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• N −→ N N

Chomsky Hierarchy of Formal Grammars

• Type-0. The class of all formal grammars.
• Type-1 — Context Sensitive. The class of formal grammars whose rules are of the form:
      α N β −→ α γ β
• Type-2 — Context Free. The class of formal grammars whose rules are of the form:
      N −→ γ
• Type-3 — Regular. The class of formal grammars whose rules have the forms a and b, or the forms a and b′:
      a)  N −→ t
      b)  N −→ t M
      b′) N −→ M t
Here α, β and γ are sequences of terminals and/or non-terminals; γ must be non-empty; M and N are non-terminals; t is a terminal.

Context-Free Grammars

Context-free grammars are very widely used in computer science. They are often used to specify the legal syntax of a programming language. They provide a good compromise between flexibility of language specification and ease of parsing.
CFGs are generally thought to be too limited to capture the full range of structures found in a natural language. Their localised specification of structure means that they are not good for describing languages in which there are constraints between separated parts of a string (as is often the case in natural language).

Regular Grammars and Regular Expressions

The set of strings generated by any regular grammar can also be specified by a regular expression. Conversely, the strings matching some regular expression can be specified by a regular grammar. For example:

    S −→ aS
    S −→ bC        ⇐⇒        a*bc+
    C −→ c
    C −→ cC

Python CFGs using NLTK

NLTK includes tools for specifying grammars:

from nltk import parse_cfg
from nltk.parse import RecursiveDescentParser

grammar = parse_cfg("""
S  -> NP VP
PP -> P NP
NP -> 'the' N | N PP | 'the' N PP
VP -> V NP | V PP | V NP PP
N  -> 'cat'
N  -> 'dog'
N  -> 'rug'
V  -> 'chased'
V  -> 'sat'
P  -> 'in'
P  -> 'on'
""")

(Note: in recent NLTK versions parse_cfg has been replaced by nltk.CFG.fromstring.)

Transformational Grammar

One idea (due to Chomsky) is that natural language sentences are generated by an initial CFG-like process followed by one or more structural transformations. For example, the transformation from an active to a passive sentence form can be specified by the following rule:
    NP1 V NP2  =⇒  NP2 V+passive by NP1
    Mary saw the thief.  =⇒  The thief was seen by Mary.
The resulting Transformational Grammars are context sensitive (Type-1).

Parsing

To parse a word means to assign it to a particular grammatical category. Parsing a sentence means assigning each word to a grammatical category and also determining the structure by which words (perhaps also morphemes) are combined.
In terms of a formal grammar, parsing a string is equivalent to finding the sequence of production rules that generate that string. Parsing Type-0 grammars is extremely difficult, but parsing gets successively easier in the more restricted types (Type-1 hard, Type-2 computationally tractable, Type-3 very efficient).
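The generation direction (repeatedly applying production rules starting from S) is easy to sketch in plain Python. The rule tables below encode the bracket grammar and the N −→ dog | hog | frog rules from the slides; the `generate` function and its depth limit are my own illustration, not course code (the depth limit forces the shortest right-hand side to be chosen eventually, so expansion always terminates).

```python
import random

def generate(rules, symbol="S", depth=10):
    """Randomly expand `symbol` using `rules`, a dict mapping each
    non-terminal to a list of right-hand sides (tuples of symbols).
    When `depth` runs out, the shortest right-hand side is chosen,
    which guarantees termination for the grammars below."""
    if symbol not in rules:               # terminal symbol: emit it
        return symbol
    options = rules[symbol]
    rhs = min(options, key=len) if depth <= 0 else random.choice(options)
    return "".join(generate(rules, s, depth - 1) for s in rhs)

# N -> dog | hog | frog, written as a rule table
noun_rules = {"N": [("dog",), ("hog",), ("frog",)]}

# The bracket grammar S -> () | (S) | SS
# (the shortest right-hand side is listed first)
bracket_rules = {"S": [("(", ")"), ("(", "S", ")"), ("S", "S")]}

print(generate(bracket_rules, "S"))
```

Every string this produces from `bracket_rules` is a correctly nested bracket string, illustrating why generation is easy; parsing (recovering which rules were applied) is the harder inverse problem.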
Parsing as Search

Parsing is a type of search problem. One can regard a formal grammar as defining transitions between nodes of a search space. For parsing it is usually best to implement the search starting from the final sentence and applying production rules in the reverse direction, to try to reach the start symbol S. Why not search forward from S?

Python Parsing

## Assumes the CFG was created as in the Python code given above.
rd = RecursiveDescentParser(grammar)   ## create parser
sentences = ['the cat chased the dog',
             'the cat chased the dog on the rug']
## Run the parser on the list of sentences
for s in sentences:
    print('Sentence:', s)
    sentence_wordlist = s.split()
    for parse in rd.nbest_parse(sentence_wordlist):
        print('Parse tree:', parse, "\n")

(Note: in recent NLTK versions nbest_parse has been replaced by the parse method.)

Grammars in Prolog

Prolog has very good support for specifying grammar rules, which can work as both generators and parsers:

sentence(s(NP,VP))      --> noun_phrase(NP), " ", verb_phrase(VP), ".".
noun_phrase(np(AT,CN))  --> article(AT), " ", count_noun(CN).
noun_phrase(np(M))      --> mass_noun(M).
noun_phrase(np(john))   --> "John".
verb_phrase(vp(IVP))    --> intransitive_verb(IVP).
verb_phrase(vp(IVP,NP)) --> transitive_verb(IVP), " ", noun_phrase(NP).
article( at( a ) )           --> "a".
article( at( the ) )         --> "the".
count_noun( cn(brick) )      --> "brick".
mass_noun( m(water) )        --> "water".
intransitive_verb( iv(ran) ) --> "ran".
transitive_verb( tv(hit) )   --> "hit".

Tagged Corpus

import nltk
from nltk.corpus import brown   ## import brown corpus
## Print out some tagged words
for i in range(0, 100):
    print(brown.tagged_words()[i])

COMP2240 / AI20 — Artificial Intelligence
Lecture NLP-4: Word Structure (Morphology)

Overview

In this lecture I give a very brief overview of the basic ideas of morphology — the analysis of word structure.
You should learn how to break a word down into its constituent morphemes, and how to identify its root and the different types of affix that are incorporated.

The following resources may be useful:
• Wikipedia entry on Morphology
• Glossary of linguistic terms by Eugene Loos

Branches of Linguistics

• Phonetics and Phonology: the study of linguistic sounds, or speech.
• Syntax: the structural relationships between words.
• Semantics: the meanings of words, phrases and sentences.
• Discourse: extended series of utterances involving more than one agent.
• Pragmatics: how language is used in context and to accomplish goals.
• Morphology: the study of the meaningful components of words.

Morphology

Morphology is the study of the way words are built up from smaller meaning units. Morphemes are the smallest meaningful units in the grammar of a language. From a morphological point of view, a word is formed from the combination of a root morpheme with a number of affix morphemes.

What is a Word?

This is a preliminary but rather subtle question in the study of words. How many words are in the following sentence?
“A rose is a rose is a rose.”

Inflexional Affixes

Inflexional affixes are morphemes that are added to a word root to express general logical and grammatical aspects of meaning that can be applied to any word of an appropriate grammatical category.
e.g.: Number (singular, plural), Case (subject, object, etc.), Gender (masculine, feminine).
Verbs have: Tense (past, present, future, etc.), Person (1st, 2nd, 3rd), Number (singular, plural), Gender (masculine, feminine).

Derivational Affixes

Derivational affixes combine with a word root to produce a new word with a different meaning:
• un+happy
• extra+ordinary
Derivational affixes often transform the root to a different grammatical category:
• happy+ness
• joy+ful
• staple+er
Words often include more than one derivational affix: joy+ful+ness.

Affix Idiosyncrasy

The possibility of adding a derivational affix to a word does not follow general rules — it is idiosyncratic. For example we have ‘joyful’ and ‘painful’ but not ‘despairful’.
The effect of a derivational affix on different root words is often similar, but not always. For example: ‘swim(m)er’ and ‘run(n)er’ are analogous; but contrast ‘swimmer’ with ‘sticker’ and ‘stapl(e)er’. As well as the difference in the way extra letters are added, notice that a ‘stapler’ is a thing for stapling, not (normally) a person who is stapling. And a sticker is a thing that can be stuck, rather than a tool for sticking.

Word Structure Terminology

We can now define ‘root’ and ‘affix’ more precisely, as well as the related concept of ‘stem’:
• The root is the portion of the word that: is common to a set of derived or inflected forms, if any, when all affixes are removed; is not further analysable into meaningful elements; and carries the principal portion of the meaning of the word.
• An affix is a bound morpheme that is incorporated before, after, or within a root or stem.

Examples

‘handbags’ = hand+bag+s (3 morphemes, 2 syllables)
The root is bag and the stem is handbag. hand is playing the role of a derivational affix. s is an inflexional affix (indicating a plural meaning).
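Morpheme analyses like the one above can be imitated, very roughly, by stripping known affixes off a word. The affix lists below are illustrative assumptions rather than a real lexicon, and the method fails on many words (for instance it cannot see that handbag is a compound), so this is a sketch of the idea, not a serious morphological analyser.

```python
# Hypothetical affix lists for illustration; real morphology needs a lexicon.
INFLEXIONAL = ["s", "ed", "ing"]
DERIVATIONAL_PRE = ["un", "over"]
DERIVATIONAL_SUF = ["ful", "like", "ness", "ly"]

def strip_affixes(word):
    """Very naive analysis: peel at most one inflexional suffix,
    then derivational prefixes and suffixes, returning a morpheme list."""
    prefixes, suffixes = [], []
    for suf in INFLEXIONAL:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            word = word[: -len(suf)]
            suffixes.append(suf)
            break
    changed = True
    while changed:
        changed = False
        for pre in DERIVATIONAL_PRE:
            if word.startswith(pre) and len(word) > len(pre) + 2:
                prefixes.append(pre)
                word = word[len(pre):]
                changed = True
        for suf in DERIVATIONAL_SUF:
            if word.endswith(suf) and len(word) > len(suf) + 2:
                suffixes.insert(0, suf)
                word = word[: -len(suf)]
                changed = True
    return prefixes + [word] + suffixes

print(strip_affixes("unladylike"))
```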
The stem of a word comprises its root together with any derivational (but not inflectional) affixes with which it is combined.

‘unladylike’ = un+lady+like (3 morphemes, 4 syllables)
The root is lady, meaning (well-behaved) female adult human. un is a derivational affix meaning ‘not’. like is a derivational affix meaning ‘having the characteristics of’. Since there are no inflexional affixes, the stem is the whole word.

Exercises

• Analyse the words of the following sentence into their component morphemes: “The over-excited sheepdog trotted playfully towards her.”
• What is the maximum and minimum number of syllables that the word ‘extraordinary’ can have (when spoken)?
• The affix morpheme ‘ing’ can occur either as an inflectional or a derivational affix. Give an example of each. (Actually, as well as relatively clear cases, there are some words where it is controversial which type of affix ‘ing’ is.)

COMP2240 / AI20 — Artificial Intelligence
Lecture NLP-5: Statistical Corpus-Based Approaches

Overview

Topics covered:
• The statistical approach to NLP.
• Word frequency distributions.
• Zipf’s Law.
• Word collocation.
• N-Grams.
• Bag of Words.
• Vector Space Representation.

The Statistical Approach to NLP

Whereas classical linguistics is primarily concerned with the detailed structural and semantic properties of language, statistical approaches consider linguistic properties that are averaged over large quantities of linguistic data — i.e. corpora. It has been found that many language-related computational tasks can be implemented by statistical algorithms that ignore nearly all of the detailed structure of language.
It is controversial whether a statistical approach can lead to fully-fledged natural language processing/understanding.

Useful Operations for Analysing a Text or Corpus

How to find the frequency of a particular token in a text object:
    text1.count("thrice")
How to get a sorted list of all words in a corpus:
    sorted(set(brown.words()))
How to create a frequency distribution object:
    fdist = nltk.probability.FreqDist(samples)
where samples is a list (or other iterable) of items, such as the word tokens of a corpus.

Text Token Frequency Analysis

Texts vary in the frequency distribution of their tokens:

Genesis (44764 tokens)        Moby Dick (260819 tokens)
 1) ,      3681                1) ,     18713
 2) and    2428                2) the   13721
 3) the    2411                3) .      6862
 4) of     1358                4) of     6536
 5) .      1315                5) and    6024
 6) And    1250                6) a      4569
 7) his     651                7) to     4542
 8) he      648                8) ;      4072
 9) to      611                9) in     3916
10) ;       605               10) that   2982
11) unto    590               11) ’      2684
12) in      588               12)        2552

Token frequency distribution provides a signature of a text style.

Zipf’s ‘Law’

For a word type in a large corpus:
• the frequency of the word is the number of times it occurs;
• the rank of the word is its position in a list of words ordered from highest to lowest frequency.
Zipf’s law says that for each word, frequency × rank = constant. The constant is fixed throughout a given corpus, but depends on the corpus.
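NLTK's FreqDist behaves much like Python's built-in collections.Counter. As a self-contained sketch (not using NLTK, and with a function name of my own choosing), here is how token counts can be turned into the rank/frequency pairs that Zipf's law is about:

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, rank, frequency) triples, highest frequency first,
    with rank 1 given to the most frequent word."""
    fdist = Counter(tokens)
    return [(w, r, f)
            for r, (w, f) in enumerate(fdist.most_common(), start=1)]

tokens = "a rose is a rose is a rose".split()
for triple in rank_frequency(tokens):
    print(triple)
```

On a real corpus, multiplying the rank and frequency columns of this output is exactly the test of Zipf's law shown in the evidence table that follows.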
Evidence

Zipf figures from Tom Sawyer (Manning and Schütze):

word    freq  rank  product        word        freq  rank  product
the     3332     1     3332        turned        51   200    10200
and     2972     2     5944        you’ll        30   300     9000
a       1775     3     5325        name          21   400     8400
he       877    10     8770        comes         16   500     8000
but      410    20     8200        group         13   600     7800
be       294    30     8820        lead          11   700     7700
there    222    40     8880        friends       10   800     8000
one      172    50     8600        begin          9   900     8100
about    158    60     9480        family         8  1000     8000
more     138    70     9660        brushed        4  2000     8000
never    124    80     9920        sins           2  3000     6000
Oh       116    90    10440        Could          2  4000     8000
two      104   100    10400        Applausive     1  8000     8000

Zipf Graph for Brown Corpus

[Figure: logarithmic plot of rank against frequency for the Brown corpus, also showing Zipf’s formula f r ≈ 10^5.]

Zipf Question Examples

Suppose the words in a corpus follow Zipf’s law exactly, and that the 20th most frequent word occurs 450 times.
1. How often does the word ranked 9000 occur?
2. How often does the highest ranked word occur?
3. If a word occurs 4500 times, what would its rank be?
• Is ‘Zipf’s law’ really a law?

Word Collocation

Frequency of bi-grams (2-word sequences) in the Brown corpus:

Brown Corpus, Top Bi-Gram Frequencies (Total = 864685)
 1) of the    9625        13) in a      1316
 2) in the    5546        14) by the    1310
 3) to the    3426        15) as a       896
 4) on the    2297        16) it is      881
 5) and the   2136        17) with a     881
 6) for the   1759        18) is a       864
 7) to be     1697        19) of his     806
 8) at the    1506        20) is the     782
 9) with the  1472        21) was a      776
10) of a      1461        22) had been   760
11) that the  1368        23) it was     743
12) from the  1351        24) for a      734

N-Grams

An N-Gram is a sequence of N words.
Useful information about the common patterns and regularities of natural language can be derived by restricting attention to the occurrences of N-grams, for some relatively small N. It is typical to consider tri-grams (3-grams): these support a much richer analysis than bi-grams (2-grams), and are more manageable to work with than 4-grams and above.

Next Word Probability

Analysis of N-grams can be used to compute the probability that a given word follows a given (N−1)-word sequence. Suppose a corpus contains j tri-grams of the form "A B ..." and k tri-grams of the form "A B C". Then (all other things being equal) the probability that C follows A B is k/j.

Bag of Words

In the Bag of Words model, a text is considered as an unordered collection of words. We ignore syntax and consider only the number of occurrences of each word. Although treating a document as a bag of words throws away most of its meaning, the model can still be very effective in applications such as document classification.

Bag of Words with Naive Bayes Document Classifier

A very common use is spam filtering. Suppose we have a large corpus C of emails, which is divided into two sets: Spam and Non-Spam. We can treat the entire corpus C as a huge sack of words, and we can also treat the Spam and Non-Spam sets as smaller sacks. Moreover, any email can be considered as a small handbag of words.
If I take words randomly from the corpus sack and put them into the email handbag, then there is a certain probability of the handbag getting filled with a certain collection of words. Similarly, if I take words only from the spam sack, then I get a different probability of filling the email handbag in a particular way.

Using Bayes Law (Very Naively)

Let E be the fact that you receive a particular email. Let S be the fact that this email is Spam.
Let P(A) be the probability of fact A, and P(A|B) the probability of A given B. We want to find P(S|E).
From Bayes Law we have:
    P(E ∧ S) = P(S) × P(E|S) = P(E) × P(S|E)
So
    P(S|E) = P(E|S) × P(S) / P(E)
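The formula above can be turned into a toy spam scorer by expanding P(E) with the law of total probability (Spam vs Non-Spam) and multiplying per-word likelihoods as if the words were independent, which is exactly the "naive" assumption. All the numbers and names below are made up for illustration; a real filter would estimate the per-word likelihoods from counts in the Spam and Non-Spam sacks.

```python
from math import prod

def spam_probability(words, word_given_spam, word_given_ham, p_spam=0.5):
    """Naive Bayes: P(S|E) = P(E|S)P(S) / (P(E|S)P(S) + P(E|H)P(H)).
    Per-word likelihoods are multiplied as if independent (the naive step)."""
    p_e_spam = prod(word_given_spam[w] for w in words)
    p_e_ham = prod(word_given_ham[w] for w in words)
    num = p_e_spam * p_spam
    return num / (num + p_e_ham * (1 - p_spam))

# Made-up per-word likelihoods, purely for illustration
spam_like = {"free": 0.2, "offer": 0.1, "meeting": 0.01}
ham_like  = {"free": 0.02, "offer": 0.02, "meeting": 0.1}

print(spam_probability(["free", "offer"], spam_like, ham_like))
print(spam_probability(["meeting"], spam_like, ham_like))
```

Words typical of spam push the score towards 1; words typical of ordinary mail push it towards 0.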