Artificial Intelligence
COMP2240 / AI20

Lecture NLP-1: Introduction to Natural Language Processing

Overview
This lecture covers the following topics:
• Motivations for computerised natural language processing (NLP).
• Why NLP is so difficult.
• Some approaches and techniques.
• A quick preview of the Python Natural Language Toolkit (NLTK).
Useful Resources
• Applied Natural Language Processing course by Marti Hearst at UC Berkeley.
  These online lecture notes are well worth browsing to give a complementary perspective on NLP.
• Speech and Language Processing (Jurafsky and Martin).
  Chapter 1 is freely available online and gives a good overview of the research field of NLP.
• NLTK — the Python Natural Language Toolkit package.
  The website has a huge amount of information, covering many topics and giving examples and exercises relating to the NLTK software.

The Ultimate Goal
Dave Bowman: “Open the pod bay doors, HAL.”
HAL 9000: “I’m sorry Dave. I’m afraid I can’t do that.”
From 2001: A Space Odyssey (1968)
Real-World Applications of NLP
• Spelling Suggestions/Corrections
• Grammar Checking
• Synonym Generation
• Information Extraction
• Text Categorisation
• Automated Customer Service
• Speech Recognition (limited)
• Machine Translation
In the (near?) future: Question Answering, Improving Web Search, Automated Metadata Assignment, Online Dialogs, ....

Topics Covered
• Tokenisation — how to break text into words and word-like units.
• Stemming and Morphemes — how complex words are made up by combining simpler lexical units.
• Formal Grammars — how to generate and parse sentences using formal specifications of syntax.
• Semantics — how can we determine the meanings of words and sentences?
• Statistical Corpus-Based Approaches — how to use statistical properties of large amounts of text to support a variety of NLP applications.
Why NLP is Difficult
• Many words can have a variety of different meanings and functions depending on their context.
• The relation of the superficial structure of natural language text (and speech) to its meaning is extremely complex and subtle.
• Interpreting natural language often requires a large amount of background knowledge.
• Computers are not social beings in the way that humans are.

Language Subtleties
Which of these sound odd?
• A big black dog
• A big black scary dog
• A big scary dog
• A scary big dog
• A black big dog

Consider the different meanings of ‘with’:
• He wrecked the car with his mates
• He wrecked the car with his custom plates
• He wrecked the car with his bulldozer
• He wrecked the car with his stupidity
• She upset the boy with her joke
• She upset the boy with his sister
• She upset the mug with her hand
• She upset the mug with her coffee
• She upset the mug with her name
Ambiguity
Consider the following newspaper headlines:
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• Kids Make Nutritious Snacks
• British Left Waffles on Falkland Islands
• Red Tape Holds Up New Bridges
• Bush Wins on Budget, but More Lies Ahead
• Hospitals are Sued by 7 Foot Doctors

Context and Disambiguation
Each of these headlines has a sensible intended reading, but its words also admit an absurd alternative parse; readers use the surrounding context to disambiguate.
Two Approaches to NLP
• Formal Grammars and Logic-Based Semantics.
  This approach looks for precise structures and regularities in language and meaning. It tries to see beyond the messiness of actual linguistic data to construct idealised theories.
• Statistical Analysis based on large language corpora.
  This approach directly treats the messy data of actual language use. It typically develops task-specific algorithms rather than general theories.

Sense vs. Nonsense
“Colourless green ideas sleep furiously”
is a sentence composed by Noam Chomsky in 1957 as an example of a sentence with correct grammar but nonsensical semantics. The sentence therefore has no understandable meaning.
One way of explaining such nonsense is to say that although the combination of grammatical categories of the words is coherent, the combination of semantic categories is inconsistent.
Formal Grammars
A grammar is a specification of the possible ‘correct’ forms of linguistic expression (sentences).
A formal grammar is a precise symbolic representation of a grammar, usually specified by a system of rules.
A huge amount of research has focused on the specification of formal grammars, and many different paradigms have been proposed.
Grammars can be used in two opposite directions:
• Generating grammatical sentences.
• Parsing sentences in order to discover their grammatical structure.
Parsing linguistic expressions is usually much more difficult than generating them. Why?

Formal Grammar Examples
One simple approach is that of Context-Free Grammar (CFG). Here is a simple CFG rule set:
• S → SS
• S → (S)
• S → ()
By applying these rules, starting with S, we can generate arbitrary correctly nested strings of brackets:
S → SS → SSS → (S)SS → ((S))SS → ((SS))SS → ((SS))S(S) → ((()S))S(S) → ((()()))S(S) → ((()()))()(S) → ((()()))()(())
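To make the generation process concrete, here is a minimal Python sketch (my illustration, not part of the original notes) that repeatedly rewrites the leftmost S using a randomly chosen rule:

import random

def generate(max_len=40):
    """Expand the start symbol S until no non-terminals remain."""
    string = "S"
    while "S" in string:
        i = string.index("S")
        ## Once the string is long enough, stop choosing the growing rule
        ## so that the derivation is guaranteed to finish quickly.
        rules = ["()", "(S)"] if len(string) > max_len else ["SS", "(S)", "()"]
        string = string[:i] + random.choice(rules) + string[i + 1:]
    return string

print(generate())   ## e.g. '((())())'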
Grammatical Categories
The formulation of grammars often makes heavy use of the concept of the grammatical category of a word or phrase:

S      sentence                The cat sat on the mat
NP     noun phrase             The cat
VP     verb phrase             sat on the mat
PP     prepositional phrase    on the mat
N      noun                    cat
V      verb                    sat
AT     article                 the
P      preposition             on
Adj    adjective               black
Adv    adverb                  lazily
AdvP   adverbial phrase        after the exams

Example of a Parse Tree
In parsing we want to derive the sequence of grammar rules that produced a given sentence.
[Figure: parse tree for a sentence such as “The cat sat on the mat”, with nodes S, NP, VP, PP, N, V, AT, P]
S=sentence, NP=noun phrase, VP=verb phrase, PP=prepositional phrase, N=noun, V=verb, AT=article, P=preposition
Deep vs Surface Structure
Try to work out parse trees for the following sentences:
“Time flies like an arrow.”
“Fruit flies like a banana.”

Statistical Techniques
Statistical techniques make use of large corpora (i.e. organised collections) of natural text in order to analyse language.
Certain kinds of correlation have been found to be useful for many tasks:
• Word frequency distribution — how often do words occur in the corpus and in a given piece of text.
• Word collocation — how frequently is a given pair of words found together.
• Context similarity — how often are each of a given pair of words found within the same textual context.
Can you think of applications that could make use of these statistics?
The Python NLTK Package
The NLTK package contains a large set of NLP programs, as well as natural language text data and supporting learning materials.
The following code examples are given in Chapter 1 of the NLTK book Natural Language Processing with Python:

import nltk
from nltk.book import *

text1.concordance("monstrous")
text1.similar("monstrous")
text2.generate(300)

All the examples should run directly on the SoC Linux machines. You can also install NLTK on your own machine.

Further Reading
• Lecture 1 of the Applied Natural Language Processing course by Marti Hearst.
• Chapter 1 of Speech and Language Processing by Jurafsky and Martin.
• The Wikipedia page on Context-Free Grammar.
• Chapter 1 of the NLTK book Natural Language Processing with Python.
Artificial Intelligence
COMP2240 / AI20

Lecture NLP-2: Tokenisation and Regular Expressions

What is Tokenisation?
• Tokenisation is breaking down a stream of characters into individual items of interest.
• Typically tokens are words, but punctuation marks can also be treated as separate tokens. In some applications line breaks might be tokens.
• Breaking down the input into sentences is called ‘sentence tokenisation’ or ‘sentence segmentation’.
• Both words and sentences are far more difficult to recognise than you might imagine. We concentrate on English; other languages can introduce very different challenges.
What tokenising problems can you spot?

"It was widely assumed that the slowdown in the U.S. economy that began in mid-2007 would reduce demand for electronics goods and, by extension, semiconductors in 2008," said Richard Gordon, research vice president at Gartner. "However, while we are still forecasting low single-digit growth for the semiconductor market in 2008, this has more to do with supply-side factors than weakness in demand."

Worldwide semiconductor revenue is forecast to reach $286.5 billion in 2008, a 4.6 percent increase from 2007 revenue of $273.9 billion, according to the latest projections by Gartner, Inc. Gartner has slightly increased its growth rate from February when it forecast a 3.4 percent increase for 2008.

http://www.gartner.com/it/page.jsp?id=684112

More Tokenising Tribulations
Consider the problems involved in tokenising the following text examples:

The New York-New Jersey railroad

The Leeds-based architecture group plans a Radically
New York skyline -- 100-storey towers replacing the
redundant minster.

He’d’ve never said "I ain’t saying ’yessir’ to nobody".

We all know that 25,000 pounds is a large sum and that
1 + 1 is a small sum.

Call me on (+44) (0) 113 3431076 :-)
Splitting Strings in Python
Python’s split method provides a simple and convenient way to chop strings into pieces:

>>> s = "this is a sentence."
>>> s.split(" ")
['this', 'is', 'a', 'sentence.']
>>> s.split('e')
['this is a s', 'nt', 'nc', '.']

Note that the method split() with no arguments splits on runs of whitespace rather than on individual spaces.
Contrast:
"Sentence  with   extra    spaces".split(' ')
with:
"Sentence  with   extra    spaces".split()

Regular Expressions
• A regular expression is a string that is used to represent a string pattern.
• Each regular expression matches a specific set of strings.
• Regular expressions are often used to find particular kinds of substring within a longer string.
• A more or less standard syntax for regular expressions is used in many different programming and shell languages.
• Regular expression matching provides a powerful tool for a wide variety of low-level NLP operations.
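Returning to the split contrast above: assuming the string has two, three and four spaces between successive words (the exact spacing on the original slide is not recoverable), the two calls give:

>>> "Sentence  with   extra    spaces".split(' ')
['Sentence', '', 'with', '', '', 'extra', '', '', '', 'spaces']
>>> "Sentence  with   extra    spaces".split()
['Sentence', 'with', 'extra', 'spaces']

Note how split(' ') yields an empty string for each extra space, which is rarely what you want when tokenising.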
RegExp Operators
Make notes of the following special symbols:
.          any single character
*          zero or more repetitions of the preceding pattern
+          one or more repetitions of the preceding pattern
?          zero or one occurrence of the preceding pattern
[abc]      any one of the characters a, b or c
[^abc]     any character other than a, b or c
[A-Z0-9]   any character in the given ranges
ed|ing|s   alternation: ed or ing or s

RegExp Special Symbols
These match a symbol type rather than a specific symbol:
\d   digit             \D   non-digit
\s   whitespace        \S   non-whitespace
\w   word character    \W   non-word character
\t   tab               \n   newline

These match a position rather than a symbol:
^    start of string (or line)
$    end of string (or line)
\A   start of string
\Z   end of string
\b   word boundary

(For more advanced matching, grouping brackets and group backreferences are often used — e.g. (...).*\1. You won’t be examined on these.)
Regular Expressions in Python
Since both Python strings and RegExps use the ‘\’ symbol as a special (escape) character, it is confusing to use ordinary strings to represent RegExps.
Instead use raw strings, of the form r"some chars" (or r'some chars'). These ignore string escapes, so any ‘\’ will be interpreted as part of the RegExp.

import re    ## import regular expression library

string = "The numbers 6.3 and 6 are greater than 5.998."
print(re.findall(r"\w+", string))
print(re.findall(r"\d*\.\d*", string))

Check out the online documentation of Python Regular Expressions.

What do these match?
r"\w+"
r"\bz\w*"
r"\w+-\s?\w+"
r"\w+\'\w+"
r"(\$|#)?\d+(\.)?\d+%?"
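One way to answer is simply to run the patterns. The sample string below is invented for illustration; finditer is used because findall reports group contents when a pattern contains (...) groups:

import re

## A made-up sample string, chosen to exercise each pattern.
text = "The zebra's zig-zag zoo-keeper paid $14.99, about 12.5% of 120."

patterns = [r"\w+", r"\bz\w*", r"\w+-\s?\w+", r"\w+\'\w+",
            r"(\$|#)?\d+(\.)?\d+%?"]

for p in patterns:
    print(p, '->', [m.group() for m in re.finditer(p, text)])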
Finding Words Matching a RegExp
Example of corpus word analysis using NLTK:

import nltk, re
from nltk.corpus import brown    ## 'brown' is a well-known text corpus

## Task: get all words in the corpus containing a hyphen.
## Method: use a 'list comprehension':
##     [ element | enumeration | filter ]
print([w for w in brown.words() if re.search('-', w)])

Splitting wrt a RegExp
The Python re package also allows you to split a string with respect to a regular expression that matches the break points:

re.split(r'\s+', s)
re.split(r'\W+', s)

This can be used to implement various kinds of tokenising.
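As a sketch of this idea (illustrative only, not a serious tokeniser):

import re

def tokenise(text):
    """A naive word tokeniser: split on runs of non-word characters.
    This discards punctuation and mangles cases like "He'd've" and
    "25,000" discussed earlier, so it is only a starting point."""
    return [t for t in re.split(r"\W+", text) if t]

print(tokenise("We all know that 25,000 pounds is a large sum."))
## ['We', 'all', 'know', 'that', '25', '000', 'pounds', 'is', 'a', 'large', 'sum']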
Artificial Intelligence
COMP2240 / AI20

Lecture NLP-3: Grammars and Parsing

Overview
This lecture covers the following topics:
• Formal specification of grammars.
• Types of grammar (the Chomsky hierarchy).
• Various small example grammars.
• The relation between (regular) grammars and regular expressions.
• A very brief look at parsing.
General Principles of Formal Grammar
Terminal symbols are the actual symbols of the language that the grammar specifies. These could be individual characters, words or morphemes (i.e. meaningful parts of words).
Non-terminal symbols correspond to classes of expressions of a particular type. They are often associated with grammatical categories of words or phrases. A start symbol S is always included in the non-terminals.

Production Rules and Languages
A production rule specifies a re-write of one sequence of symbols to a new sequence of symbols. In general both the input and output sequences can contain a mixture of terminals and non-terminals.
A sequence of terminals is generated by a given grammar if it is the result of applying a series of production rules of that grammar, starting from the symbol S.
A grammar specifies a language consisting of all those strings (i.e. sequences of terminals) generated by its rules.
Formal Grammar Definition
A formal grammar is a structure ⟨Σ, N, S, R⟩ where:
• Σ is a finite set of terminal symbols;
• N is a finite set of non-terminal symbols;
• S ∈ N is the start symbol;
• R is a set of production rules.
Each rule in R is of the form:
α → β
where α is a sequence of symbols containing at least one non-terminal symbol, and β is any sequence of symbols.

Notational Conventions
The symbol ‘|’ is used to represent alternative symbols. This notation provides a convenient abbreviation for multiple similar rules. E.g.
N → dog | hog | frog
is equivalent to the three rules:
N → dog
N → hog
N → frog
The symbol ‘ε’ is often used to denote the empty symbol sequence.
Chomsky Hierarchy of Formal Grammars
• Type-0. The class of all formal grammars.
• Type-1 — Context Sensitive. The class of formal grammars whose rules are of the form: αNβ → αγβ
• Type-2 — Context Free. The class of formal grammars whose rules are of the form: N → γ
• Type-3 — Regular. The class of formal grammars whose rules have the forms a) and b), or the forms a) and b′):
  a)  N → t
  b)  N → tM
  b′) N → Mt
Here α, β and γ are sequences of terminals and/or non-terminals, and γ must be non-empty. M and N are non-terminals; t is a terminal.

Grammar for Arithmetic Expressions
A grammar for arithmetic expressions can be defined as follows:
Garith = ⟨{0, …, 9, −, +, ∗, /, (, )}, {S, N}, S, R⟩
where the rule set R is:
• S → N
• S → −(S)
• S → (S + S)
• S → (S ∗ S)
• S → (S/S)
• N → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• N → NN

Regular Grammars and Regular Expressions
The set of strings generated by any regular grammar can also be specified by a regular expression. Conversely, the strings matching some regular expression can be specified by a regular grammar.
For example, the regular expression a*bc+ corresponds to the regular grammar:
• S → aS
• S → bC
• C → c
• C → cC

Context-Free Grammars
Context-free grammars are very widely used in computer science. They are often used to specify the legal syntax of a programming language.
They provide a good compromise between flexibility of language specification and ease of parsing.
Their localised specification of structure means that they are not good for describing languages in which there are constraints between separated parts of a string (as is often the case in natural language).
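As a small sanity check on this equivalence (my sketch, not part of the notes), the following code generates strings with the four grammar rules and confirms that each one matches the regular expression:

import random
import re

## Regular grammar for a*bc+ :  S -> aS | bC ;  C -> c | cC
def generate():
    s = "S"
    while s[-1] in "SC":
        head, nonterminal = s[:-1], s[-1]
        if nonterminal == "S":
            s = head + random.choice(["aS", "bC"])
        else:
            s = head + random.choice(["c", "cC"])
    return s

for _ in range(5):
    string = generate()
    assert re.fullmatch(r"a*bc+", string), string
    print(string)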
Transformational Grammar
CFGs are generally thought to be too limited to capture the full range of structures found in a natural language.
One idea (due to Chomsky) is that natural language sentences are generated by an initial CFG-like process followed by one or more structural transformations.
For example, the transformation from an active to a passive sentence form can be specified by the following rule:
NP1 V NP2  ⟹  NP2 V+passive by NP1
Mary saw the thief.  ⟹  The thief was seen by Mary.
The resulting Transformational Grammars are context sensitive (Type-1).

Python CFGs using NLTK
NLTK includes tools for specifying grammars:

import nltk
from nltk.parse import RecursiveDescentParser

## nltk.CFG.fromstring replaces the parse_cfg function of older NLTK versions.
grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> 'the' N | N PP | 'the' N PP
VP -> V NP | V PP | V NP PP
N -> 'cat'
N -> 'dog'
N -> 'rug'
V -> 'chased'
V -> 'sat'
P -> 'in'
P -> 'on'
""")
Parsing
To parse a word means to assign it to a particular grammatical category.
Parsing a sentence means assigning each word to a grammatical category and also determining the structure by which the words (perhaps also morphemes) are combined.
In terms of a formal grammar, parsing a string is equivalent to finding the sequence of production rules that generate that string.
Parsing Type-0 grammars is extremely difficult, but parsing gets successively easier in the more restricted types. (Type-1 is hard, Type-2 computationally tractable, Type-3 very efficient.)

Parsing as Search
One can regard a formal grammar as defining transitions between nodes of a search space. Parsing is a type of search problem.
For parsing it is usually best to implement the search in the direction starting from the final sentence, applying production rules in the reverse direction to try to reach the start symbol, S.
Why not search forward from S?

Python Parsing
## Assumes the CFG created in the code given above.
rd = RecursiveDescentParser(grammar)    ## create parser

sentences = ['the cat chased the dog',
             'the cat chased the dog on the rug']

## Run the parser on each sentence
for s in sentences:
    print('Sentence:', s)
    sentence_wordlist = s.split()
    ## parse() replaces nbest_parse of older NLTK versions
    for parse in rd.parse(sentence_wordlist):
        print('Parse tree:', parse, "\n")
Grammars in Prolog
Prolog has very good support for specifying grammar rules, which can work as both generators and parsers:

sentence(s(NP,VP))        --> noun_phrase(NP), " ", verb_phrase(VP), ".".
noun_phrase(np(AT,CN))    --> article(AT), " ", count_noun(CN).
noun_phrase(np(M))        --> mass_noun(M).
noun_phrase(np(john))     --> "John".
verb_phrase(vp(IVP))      --> intransitive_verb(IVP).
verb_phrase(vp(IVP,NP))   --> transitive_verb(IVP), " ", noun_phrase(NP).

article( at(a) )             --> "a".
article( at(the) )           --> "the".
count_noun( cn(brick) )      --> "brick".
mass_noun( m(water) )        --> "water".
intransitive_verb( iv(ran) ) --> "ran".
transitive_verb( tv(bit) )   --> "bit".
Grammar-Tagged Corpus

import nltk
from nltk.corpus import brown    ## import the Brown corpus

## Print out some tagged words
for i in range(0, 100):
    print(brown.tagged_words()[i])
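As a quick illustration (assuming the corpus data has been fetched with nltk.download('brown')), the same tagged corpus can be combined with the FreqDist tool used in Lecture NLP-5 to count how often each part-of-speech tag occurs:

import nltk
from nltk.corpus import brown

## Count the occurrences of each part-of-speech tag in the Brown corpus.
tag_fdist = nltk.FreqDist(tag for (word, tag) in brown.tagged_words())
print(tag_fdist.most_common(10))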
Artificial Intelligence
COMP2240 / AI20

Lecture NLP-4: Word Structure (Morphology)

Overview
In this lecture I give a very brief overview of the basic ideas of morphology — the analysis of word structure.
You should learn how to break a word down into its constituent morphemes, and how to identify its root and the different types of affix that are incorporated.
The following resources may be useful:
• The Wikipedia entry on Morphology.
• The Glossary of linguistic terms by Eugene Loos.

Branches of Linguistics
• Phonetics and Phonology: the study of linguistic sounds, or speech.
• Syntax: the structural relationships between words.
• Semantics: the meanings of words, phrases and sentences.
• Discourse: extended series of utterances involving more than one agent.
• Pragmatics: how language is used in context and to accomplish goals.
• Morphology: the study of the meaningful components of words.

What is a Word?
This is a preliminary but rather subtle question in the study of words.
How many words are in the following sentence?
“A rose is a rose is a rose.”

Morphology
Morphology is the study of the way words are built up from smaller meaning units.
Morphemes are the smallest meaningful units in the grammar of a language.
From a morphological point of view, a word is formed from the combination of a root morpheme with a number of affix morphemes.

Inflexional Affixes
Inflexional affixes are morphemes that are added to a word root to express general logical and grammatical aspects of meaning that can be applied to any word of an appropriate grammatical category.
E.g.: Number (singular, plural), Case (subject, object, etc.), Gender (masculine, feminine).
Verbs have: Tense (past, present, future, etc.), Person (1st, 2nd, 3rd), Number (singular, plural), Gender (masculine, feminine).
Derivational Affixes
Derivational affixes combine with a word root to produce a new word with a different meaning:
• unhappy
• extraordinary
Derivational affixes often transform the root to a different grammatical category:
• happiness (happy + ness)
• joyful (joy + ful)
• stapler (staple + er)
Words often include more than one derivational affix: joy+ful+ness.

Affix Idiosyncrasy
The possibility of adding a derivational affix to a word does not follow general rules — it is idiosyncratic. For example, we have ‘joyful’ and ‘painful’ but not ‘despairful’.
The effect of a derivational affix on different root words is often similar, but not always.
For example, ‘swim(m)er’ and ‘run(n)er’ are analogous; but contrast ‘swimmer’ with ‘sticker’ and ‘stapl(e)er’.
As well as the difference in the way extra letters are added, notice that ‘stapler’ is a thing for stapling, not (normally) a person who is stapling. And a sticker is a thing that can be stuck rather than a tool for sticking.

Word Structure Terminology
We can now define ‘root’ and ‘affix’ more precisely, as well as the related concept of ‘stem’.
The root is the portion of the word that:
• is common to a set of derived or inflected forms, if any, when all affixes are removed;
• is not further analysable into meaningful elements;
• carries the principal portion of the meaning of the word.
An affix is a bound morpheme that is incorporated before, after, or within a root or stem.
The stem of a word comprises its root together with any derivational (but not inflectional) affixes with which it is combined.

Examples
‘handbags’ = hand+bag+s (3 morphemes, 2 syllables)
The root is bag and the stem is handbag.
hand is playing the role of a derivational affix.
s is an inflexional affix (indicating a plural meaning).

‘unladylike’ = un+lady+like (3 morphemes, 4 syllables)
The root is lady, meaning (well behaved) female adult human.
un is a derivational affix meaning ‘not’.
like is a derivational affix meaning ‘having the characteristics of’.
Since there are no inflexional affixes, the stem is the whole word.

Exercises
• Analyse the words of the following sentence into their component morphemes:
  The over-excited sheepdog trotted playfully towards her.
• What is the maximum and minimum number of syllables that the word ‘extraordinary’ can have when spoken?
• The affix morpheme ‘ing’ can occur either as an inflectional or a derivational affix. Give an example of each.
  (Actually, as well as relatively clear cases, there are some words where it is controversial which type of affix ‘ing’ is.)
Artificial Intelligence
COMP2240 / AI20

Lecture NLP-5: Statistical Corpus-Based Approaches

Overview
Topics covered:
• The statistical approach to NLP.
• Word frequency distributions.
• Zipf’s Law.
• Word collocation.
• N-grams.
• Bag of Words.
• Vector Space Representation.
The Statistical Approach to NLP
Whereas classical linguistics is primarily concerned with the detailed structural and semantic properties of language, statistical approaches consider linguistic properties that are averaged over large quantities of linguistic data — i.e. corpora.
It has been found that many language-related computational tasks can be implemented by statistical algorithms that ignore nearly all of the detailed structure of language.
It is controversial whether a statistical approach can lead to fully-fledged natural language processing/understanding.

Useful Operations for Analysing a Text or Corpus
How to find the frequency of a particular token in a text object:
text1.count("thrice")
How to get a sorted list of all words in a corpus:
sorted(set(brown.words()))
How to create a frequency distribution object:
fdist = nltk.probability.FreqDist(samples)
where samples is a list (or other iterable) of items (such as the word tokens of a corpus).
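A minimal usage sketch, assuming NLTK and its Brown corpus data are installed:

import nltk
from nltk.corpus import brown

## Build a frequency distribution over the Brown corpus word tokens.
fdist = nltk.probability.FreqDist(w.lower() for w in brown.words())

print(fdist['the'])           ## frequency of a particular token
print(fdist.most_common(10))  ## the ten most frequent tokens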
Text Token Frequency Analysis
Texts vary in the frequency distribution of their tokens. Token frequency distribution provides a signature of a text style.

Genesis (44764 tokens):          Moby Dick (260819 tokens):
 1)  ,      3681                  1)  ,     18713
 2)  and    2428                  2)  the   13721
 3)  the    2411                  3)  .      6862
 4)  of     1358                  4)  of     6536
 5)  .      1315                  5)  and    6024
 6)  And    1250                  6)  a      4569
 7)  his     651                  7)  to     4542
 8)  he      648                  8)  ;      4072
 9)  to      611                  9)  in     3916
10)  ;       605                 10)  that   2982
11)  unto    590                 11)  ’      2684
12)  in      588                 12)  -      2552

Zipf’s ‘Law’
For a word type in a large corpus:
• the frequency of the word is the number of times it occurs;
• the rank of the word is its position in a list of words ordered from highest to lowest frequency.
Zipf’s law says that for each word:
frequency × rank = constant
The constant is fixed throughout a given corpus, but depends on the corpus.
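The law is easy to test empirically. Here is a short sketch (mine, not from the notes) that prints frequency × rank at a few sample ranks of the Brown corpus:

import nltk
from nltk.corpus import brown

fdist = nltk.FreqDist(w.lower() for w in brown.words())

## Walk down the ranked list and print frequency * rank at sample ranks;
## by Zipf's law the product should stay roughly constant.
for rank, (word, freq) in enumerate(fdist.most_common(), start=1):
    if rank in (1, 10, 100, 1000, 10000):
        print(rank, word, freq, rank * freq)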
Evidence
Zipf figures from Tom Sawyer (Manning and Schütze):

word     freq   rank   product        word         freq   rank   product
the      3332      1      3332        turned         51    200     10200
and      2972      2      5944        you’ll         30    300      9000
a        1775      3      5325        name           21    400      8400
he        877     10      8770        comes          16    500      8000
but       410     20      8200        group          13    600      7800
be        294     30      8820        lead           11    700      7700
there     222     40      8880        friends        10    800      8000
one       172     50      8600        begin           9    900      8100
about     158     60      9480        family          8   1000      8000
more      138     70      9660        brushed         4   2000      8000
never     124     80      9920        sins            2   3000      6000
Oh        116     90     10440        Could           2   4000      8000
two       104    100     10400        Applausive      1   8000      8000

Zipf Graph for Brown Corpus
[Figure: logarithmic plot of rank against frequency for the Brown corpus, also showing Zipf’s formula f × r ≈ 10^5]
Zipf Question Examples
Suppose the words in a corpus follow Zipf’s law exactly, and that the 20th most frequent word occurs 450 times.
1. How often does the word ranked 9000 occur?
2. How often does the highest ranked word occur?
3. If a word occurs 4500 times, what would its rank be?
• Is ‘Zipf’s law’ really a law?

Word Collocation
Frequency of bi-grams (2-word sequences) in the Brown corpus:
Brown Corpus, Top Bi-Gram Frequency (Total = 864685)
 1)  of the     9625        13)  in a        1316
 2)  in the    5546         14)  by the      1310
 3)  to the    3426         15)  as a         896
 4)  on the    2297         16)  it is        881
 5)  and the   2136         17)  with a       881
 6)  for the   1759         18)  is a         864
 7)  to be     1697         19)  of his       806
 8)  at the    1506         20)  is the       782
 9)  with the  1472         21)  was a        776
10)  of a      1461         22)  had been     760
11)  that the  1368         23)  it was       743
12)  from the  1351         24)  for a        734
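Counts like these can be reproduced with a few lines of NLTK (a sketch; the exact numbers will depend on tokenisation and case handling):

import nltk
from nltk.corpus import brown

## Count bi-grams (adjacent word pairs) in the Brown corpus.
bigram_fdist = nltk.FreqDist(nltk.bigrams(brown.words()))
for (w1, w2), freq in bigram_fdist.most_common(12):
    print(w1, w2, freq)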
N-Grams
An N-gram is a sequence of N words.
Useful information about the common patterns and regularities of natural language can be derived by restricting attention to the occurrences of N-grams, for some relatively small N.
It is typical to consider tri-grams (i.e. 3-grams). These support a much richer analysis than bi-grams (2-grams), and are more manageable to work with than 4-grams and above.

Next Word Probability
Analysis of N-grams can be used to compute the probability that a given word follows a given (N−1)-word sequence.
Suppose a corpus contains j tri-grams of the form
A B …
and k tri-grams of the form
A B C
Then (all other things being equal) the probability that C follows A B is
k/j
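A direct, if inefficient, implementation of this estimate (my sketch; a real system would index the counts by the leading word pair):

import nltk
from nltk.corpus import brown

## Count all tri-grams in the corpus once, up front.
trigram_fdist = nltk.FreqDist(nltk.trigrams(brown.words()))

def prob_next(a, b, c):
    """Estimate P(c | a b) as k/j, following the definition above."""
    j = sum(freq for (w1, w2, _), freq in trigram_fdist.items()
            if w1 == a and w2 == b)
    k = trigram_fdist[(a, b, c)]
    return k / j if j else 0.0

print(prob_next('one', 'of', 'the'))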
Bag of Words
In the Bag of Words model, a text is considered as an unordered collection of words.
We ignore syntax and consider only the number of occurrences of each word.
Although treating a document as a bag of words throws away most of its meaning, the model can still be very effective in applications such as document classification.

Bag of Words with Naive Bayes Document Classifier
A very common use is spam filtering.
Suppose we have a large corpus C of emails, which is divided into two sets: Spam and Non-Spam.
We can treat the entire corpus C as a huge sack of words, and we can also treat the Spam and Non-Spam sets as smaller sacks. Moreover, any email can be considered as a small handbag of words.
If I take words randomly from the corpus sack and put them into the email handbag, then there is a certain probability of the handbag getting filled with a certain collection of words.
Similarly, if I take words only from the Spam sack, then I get a different probability of filling the email handbag in a particular way.

Using Bayes Law (Very Naively)
Let E be the fact that you receive a particular email.
Let S be the fact that this email is Spam.
Let P(A) be the probability of fact A, and P(A|B) the probability of A given B.
We want to find P(S|E).
From Bayes Law we have:
P(E ∧ S) = P(S) × P(E|S) = P(E) × P(S|E)
So:
P(S|E) = P(E|S) × P(S) / P(E)
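The whole scheme fits in a few lines. The following toy sketch invents a two-email training set for each class, and uses add-one smoothing (my addition, not part of the slides) so that unseen words do not zero out the product:

from collections import Counter

## Toy training data standing in for the Spam / Non-Spam email sacks.
spam_emails = [["win", "cash", "now"], ["cheap", "pills", "now"]]
ham_emails  = [["meeting", "agenda", "attached"], ["lunch", "now", "?"]]

spam_sack, ham_sack = Counter(), Counter()
for e in spam_emails: spam_sack.update(e)
for e in ham_emails:  ham_sack.update(e)

def score(email, sack, prior):
    """Naive Bayes score: P(class) times a product of per-word
    probabilities, with add-one smoothing."""
    total = sum(sack.values())
    vocab = len(set(spam_sack) | set(ham_sack))
    p = prior
    for w in email:
        p *= (sack[w] + 1) / (total + vocab)
    return p

email = ["cheap", "cash", "now"]
p_spam = score(email, spam_sack, 0.5)
p_ham  = score(email, ham_sack, 0.5)
## P(S|E) via Bayes: normalise over the two classes, so P(E) cancels.
print("P(Spam | E) =", p_spam / (p_spam + p_ham))   ## about 0.857 here

Dividing by the sum of the two class scores sidesteps computing P(E) directly, which is exactly the cancellation in the Bayes Law derivation above.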