NLP/CL: Review - School of Computing

School of Computing, Faculty of Engineering
Eric Atwell, Language Research Group

Objectives of the module
On completion of this module, students should be able to:
- understand theory and terminology of empirical modelling of
natural language;
- understand and use algorithms, resources and techniques
for implementing and evaluating NLP systems;
- be familiar with some of the main language engineering
application areas;
- appreciate why unrestricted natural language processing is
still a major research task.
(with thanks to other contributors)
In a nutshell:
Why NLP is difficult: language is a complex system
How to solve it? Corpus-based machine-learning approaches
Motivation: applications of “The Language Machine”
The main sub-areas of linguistics
◮ Phonetics and Phonology: The study of linguistic sounds or speech.
◮ Morphology: The study of the meaningful components of words.
◮ Syntax: The study of the structural relationships between words.
◮ Semantics: The study of meanings of words, phrases, sentences.
◮ Discourse: The study of linguistic units larger than a single utterance.
◮ Pragmatics: The study of how language is used to accomplish goals.

Python, NLTK, WEKA
Python: A good programming language for NLP
• Interpreted
• Object-oriented
• Easy to interface to other things (text files, web, DBMS)
• Good stuff from: Java, Lisp, Tcl, Perl
• Easy to learn
• FUN!
Python NLTK: Natural Language Tool Kit with demos and tutorials
WEKA: Machine Learning toolkit: classifiers, e.g. J48 decision trees
Why is NLP difficult?
Computers are not brains
• There is evidence that much of language understanding is built into the human brain
Computers do not socialize
• Much of language is about communicating with people
Key problems:
• Representation of meaning and hidden structure
• Language presupposes knowledge about the world
• Language is ambiguous: a message can have many interpretations
• Language presupposes communication between people

Ambiguity: Grammar (PoS) and Meaning
Iraqi Head Seeks Arms
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Kids Make Nutritious Snacks
British Left Waffles on Falkland Islands
Red Tape Holds Up New Bridges
Bush Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors
(Headlines leave out punctuation and function-words)
Lynne Truss, 2003. Eats, Shoots and Leaves: The Zero Tolerance Approach to Punctuation
The Role of Memorization
Children learn words quickly
• Around age two they learn about 1 word every 2 hours.
• (Or 9 words/day)
• Often only need one exposure to associate meaning with word
• Can make mistakes, e.g., overgeneralization: “I goed to the store.”
• Exactly how they do this is still under study
Adult vocabulary
• Typical adult: about 60,000 words
• Literate adults: about twice that.

But there is too much to memorize!
establish
establishment
disestablishment
antidisestablishment
antidisestablishmentarian
antidisestablishmentarianism
Antidisestablishmentarianism is a political philosophy that is opposed to the separation of church and state, i.e. to disestablishment of the Church of England as the official state church.
MAYBE we don’t remember every word separately;
MAYBE we remember MORPHEMES and how to combine them
Empiricism v Rationalism
Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary)
Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary)

The field was stuck for quite some time: rationalist
Noam Chomsky, 1957, Syntactic Structures
- A language model is a set of rules thought up by an expert
- Argued that we should build models through introspection: linguistic models for a specific example did not generalise.
Like “Expert Systems”…
Chomsky thought data was full of errors, better to rely on linguists’ intuitions…

A new approach started around 1990: Corpus Linguistics
• Well, not really new, but in the 50’s to 80’s, they didn’t have the text, disk space, or GHz
Main idea: machine learning from CORPUS data
How to do corpus linguistics:
• Get a large text collection (a corpus; plural: several corpora)
• Compute statistical models over the words/PoS/parses/… in the corpus
Surprisingly effective
Example Problem
Grammar checking example: which word to use, <principal> or <principle>?
Empirical solution: look at which words surround each use:
• I am in my third year as the principal of Anamosa High School.
• School-principal transfers caused some upset.
• This is a simple formulation of the quantum mechanical uncertainty principle.
• Power without principle is barren, but principle without power is futile. (Tony Blair)

Using Very Large Corpora
Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:
• Principal: “high school”
• Principle: “rule”
At grammar-check time, choose the spelling best predicted by the probability of co-occurring with surrounding words.
No need to “understand the meaning”!?
Surprising results:
• Log-linear improvement even to a billion words!
• Getting more data is better than fine-tuning algorithms!
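The neighbour-counting idea can be sketched in a few lines of Python. The tiny labelled “corpus” below is invented for illustration; a real checker would count neighbours over millions of words of well-edited text:

```python
from collections import Counter

# Toy labelled sentences known to use each spelling correctly
# (hypothetical examples; a real system trains on a large corpus).
labelled = [
    ("principal", "the principal of the high school retired"),
    ("principal", "a school principal manages teachers"),
    ("principle", "the uncertainty principle is a rule of physics"),
    ("principle", "a moral principle or rule guides behaviour"),
]

# Count which words co-occur with each spelling in the training sentences.
neighbours = {"principal": Counter(), "principle": Counter()}
for word, sentence in labelled:
    for w in sentence.split():
        if w != word:
            neighbours[word][w] += 1

def choose(context):
    """Pick the spelling whose training neighbours best match the context."""
    words = context.split()
    scores = {w: sum(c[t] for t in words) for w, c in neighbours.items()}
    return max(scores, key=scores.get)

print(choose("she is the ___ of our school"))    # → principal
print(choose("a basic rule or ___ of physics"))  # → principle
```

Note that no “understanding” is involved: the right spelling wins purely because its observed neighbours (school, rule, …) overlap more with the surrounding words.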
The Effects of LARGE Datasets
From Banko & Brill, 2001. Scaling to Very Very Large Corpora for Natural Language Disambiguation, Proc ACL

Corpus, word tokens and types
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Lexeme/lemma is the “root form”, v inflections (be v am/is/was…)
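The token/type distinction is easy to see in code; a minimal sketch, assuming simple whitespace tokenisation:

```python
from collections import Counter

text = "the cat sat on the mat and the cat slept"
tokens = text.split()   # word tokens: every running word
types = set(tokens)     # word types: distinct word forms
freq = Counter(tokens)  # type-token frequency distribution

print(len(tokens))          # 10 tokens
print(len(types))           # 7 types
print(freq.most_common(2))  # [('the', 3), ('cat', 2)]
```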
Tokenization and Morphology
Tokenization - by whitespace, regular expressions
Problems: It’s, data-base, New York, …
Jabberwocky shows we can break words into morphemes
Morpheme types: root/stem, affix, clitic
Derivational vs. Inflectional
Regular vs. Irregular
Concatenative vs. Templatic (root-and-pattern)
Morphological analysers: Porter stemmer, Morphy, PC-Kimmo
Morphology by lookup: CatVar, CELEX, OALD++

Corpus word-counts and n-grams
FreqDist counts of tokens and their distribution can be useful
E.g. find main characters in Gutenberg texts
E.g. compare word-lengths in different languages
Humans can predict the next word …
N-gram models are based on counts in a large corpus
Auto-generate a story ... (but gets stuck in a local maximum)
Word-counts follow Zipf’s Law
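The story-generation idea can be sketched with a bigram model over a toy corpus (invented here; real n-gram models need large corpora). Notice how a small corpus makes the generator loop through the same few continuations, the “stuck in a local maximum” effect:

```python
import random
from collections import defaultdict

# Tiny invented corpus; a real n-gram model counts over a large corpus.
corpus = "the dog chased the cat and the cat chased the mouse".split()

# Bigram model: for each word, the observed words that can follow it.
follows = defaultdict(list)
for w1, w2 in zip(corpus, corpus[1:]):
    follows[w1].append(w2)

# Auto-generate a "story": repeatedly sample a next word by bigram counts.
random.seed(0)  # fixed seed so the sketch is repeatable
word, story = "the", ["the"]
for _ in range(8):
    if not follows[word]:  # dead end: no observed continuation
        break
    word = random.choice(follows[word])
    story.append(word)
print(" ".join(story))
```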
Zipf’s law applies to a word type-token frequency distribution: frequency is proportional to the inverse of the rank in a ranked list:
f*r = k where f is frequency, r is rank, k is a constant
i.e. a few very common words, a small to medium number of middle-frequency words, and a long tail of low-frequency (1) words
Chomsky argued against corpus evidence as it is finite and limited compared to introspection; Zipf’s law shows that many words/structures appear only once or not at all in a given corpus, arguably supporting the argument that corpus evidence is limited compared to introspection

Kilgarriff’s Sketch Engine
Sketch Engine shows a Word Sketch or list of collocates: words co-occurring with the target word more frequently than predicted by independent probabilities
A lexicographer can colour-code groups of related collocates indicating different senses or meanings of the target word
With a large corpus the lexicographer should find all current senses, better than relying on intuition/introspection
Large user-base of experience, used in development of several published dictionaries for English
For minority languages with few existing corpus resources, Sketch Engine is combined with Web-Bootcat to enable lexicographers to collect their own Web-as-Corpus
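The f*r = k relation is easy to inspect: rank the type-token distribution by frequency and multiply each frequency by its rank. A minimal sketch on an invented sample (on such a tiny text the “constant” is only very roughly constant; the pattern emerges clearly on large corpora):

```python
from collections import Counter

# Small invented sample; any large text works better.
text = ("the cat sat on the mat the dog sat on the log "
        "a cat and a dog and a mouse").split()

# Ranked type-token frequency distribution: [(word, freq), ...].
ranked = Counter(text).most_common()

for r, (word, f) in enumerate(ranked, start=1):
    print(r, word, f, f * r)  # Zipf: f * r should be roughly constant
```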
Parts of Speech
Parts of Speech: groups words into grammatical categories … and separates different functions of a word
In English, many words are ambiguous: 2 or more PoS-tags
Very simple tagger: everything is NN
Better PoS-taggers: unigram, bigram, trigram, Brill, …

Training and Testing of Machine Learning Algorithms
Algorithms that “learn” from data see a set of examples and try to generalize from them.
Training set:
• Examples trained on
Test set:
• Also called held-out data and unseen data
• Use this for testing your algorithm
• Must be separate from the training set
• Otherwise, you cheated!
“Gold Standard”
• A test set that a community has agreed on and uses as a common benchmark – use for final evaluation
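A unigram tagger with the “everything is NN” fallback, trained and evaluated on separate sets, can be sketched as follows; the tiny tagged corpus and its tags are invented for illustration (real work would train on e.g. the Brown corpus):

```python
from collections import Counter, defaultdict

# Toy PoS-tagged training and (separate!) test sets; tags are illustrative.
train = [("the", "AT"), ("dog", "NN"), ("runs", "VBZ"),
         ("the", "AT"), ("cat", "NN"), ("runs", "VBZ"),
         ("run", "VB"), ("dog", "NN")]
test = [("the", "AT"), ("dog", "NN"), ("can", "MD"), ("run", "VB")]

# Unigram tagger: each word gets its most frequent training tag;
# unseen words fall back to the "everything is NN" baseline.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1
model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

predicted = [model.get(w, "NN") for w, _ in test]
correct = sum(p == t for p, (_, t) in zip(predicted, test))
print(predicted)                           # ['AT', 'NN', 'NN', 'VB']
print(f"accuracy: {correct}/{len(test)}")  # 3/4: 'can' was unseen
```

Evaluating on the training set instead would score 100% and tell us nothing, which is exactly the “otherwise, you cheated!” point above.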
Grammar and Parsing
Context-Free Grammars and Constituency
Parse-trees show syntactic structure of sentences
Key constituents: S, NP, VP, PP
You can draw a parse-tree and corresponding CFG
Some common CFG phenomena for English:
• Sentence-level constructions
• NP, PP, VP
• Coordination
• Subcategorization
Top-down and Bottom-up Parsing

Parsing sentences left-to-right:
The horse raced past the barn
[S [NP [A the] [N horse] NP]
   [VP [V raced] [PP [I past] [NP [A the] [N barn] NP] PP] VP] S]
The horse raced past the barn fell
[S [NP [NP [A the] [N horse] NP]
       [VP [V raced] [PP [I past] [NP [A the] [N barn] NP] PP] VP] NP]
   [VP [V fell] VP] S]

Problems with context-free grammars and parsers
• Coordination: X → X and X is a Meta-Rule, not a strict CFG rule
• Agreement: needs duplicate CFG rules for singular/plural etc
• Subcategorization: needs separate CFG non-terminals for trans/intrans/…
• Movement: object/subject of verb may be “moved” in questions
• Dependency parsing captures deeper semantics but is harder
• Parsing: top-down v bottom-up v combined
• Ambiguity causes backtracking, so a CHART PARSER stores partial parses

Chunking or Shallow Parsing
Break text up into non-overlapping contiguous subsets of tokens.
Shallow parsing or chunking is useful for:
Entity recognition
• people, locations, organizations
Studying linguistic patterns, e.g. subcategorization frames of “gave”:
• gave NP
• gave up NP in NP
• gave NP NP
• gave NP to NP
Prosodic phrase breaks – pauses in speech
Can ignore complex structure when not relevant
Chunking can be done via regular expressions over PoS-tags
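Chunking via regular expressions over PoS-tags can be sketched with the standard library alone (NLTK’s RegexpParser does this more robustly); the tagged sentence and the NP pattern “optional determiner + any adjectives + noun” are illustrative:

```python
import re

# PoS-tagged sentence (illustrative Penn Treebank-style tags).
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"),
          ("the", "DT"), ("cat", "NN")]

# Write the tag sequence as one string, then chunk it with a regular
# expression: an NP = optional determiner + any adjectives + a noun.
# (Toy tagset only: with tags like NNS a real pattern needs boundaries.)
tag_string = " ".join(tag for _, tag in tagged)
np_pattern = re.compile(r"(?:DT )?(?:JJ )*NN")

chunks = []
for m in np_pattern.finditer(tag_string):
    first = tag_string[:m.start()].count(" ")  # index of chunk's first token
    length = m.group(0).count(" ") + 1         # number of tokens in the chunk
    chunks.append([w for w, _ in tagged[first:first + length]])

print(chunks)  # [['the', 'little', 'dog'], ['the', 'cat']]
```

The chunks are non-overlapping and contiguous, and everything outside them (the verb, the preposition) is simply ignored, which is the point of shallow parsing.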
Information Extraction
Partial parsing gives us NP chunks…
IE: Named Entity Recognition
People, places, companies, dates etc.
In a cohesive text, some NPs refer to the same thing/person
Needed: an algorithm for NP coreference resolution; e.g.:
“Hudda”, “ten of hearts”, “Mrs Anthrax”, “she” all refer to the same Named Entity

Semantics: Word Sense Disambiguation
e.g. mouse (animal / PC-interface)
• It’s a (very) hard task
• Humans very good at it
• Computers not
• Active field of research for over 50 years
• Mistakes in disambiguation have negative results
• Beginning to be of practical use
• Desirable skill (Google, M$)
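One classic baseline for the mouse example is definition-overlap disambiguation, a simplified Lesk-style sketch; the sense “definitions” below are made up for illustration:

```python
# Naive dictionary-overlap word sense disambiguation for "mouse";
# sense definitions are invented, not from a real dictionary.
senses = {
    "animal": "small rodent with fur and a long tail",
    "device": "hand held pointing device attached to a computer",
}

def disambiguate(context):
    """Choose the sense whose definition shares most words with the context."""
    ctx = set(context.lower().split())
    overlap = {s: len(ctx & set(d.split())) for s, d in senses.items()}
    return max(overlap, key=overlap.get)

print(disambiguate("click the mouse button on the computer screen"))  # device
print(disambiguate("the cat caught a mouse by its tail"))             # animal
```

Even this crude overlap count gets the easy cases right, and its failures on harder contexts illustrate why WSD remains a hard task.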
NLP applications
NLP has been successful using ML from data, without
“linguistic / cognitive models”
Machine Translation
Supervised ML: given labelled data (eg PoS-tagged text to
train PoS-tagger, to tag new text in the style of training text)
Information Retrieval (Google, etc)
Unsupervised ML: no labelled data (eg clustering words with
“similar contexts” gives PoS-tag categories)
Detecting Terrorist Activities
Unsupervised ML is harder, but increasingly successful!
Localization: adapting text (e.g. ads) to local language
Information Extraction
Understanding the Quran
…
For more, see The Language Machine
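The unsupervised idea of clustering words by “similar contexts” can be illustrated on a toy corpus (invented here): words that occur between the same (previous, next) neighbour pairs, with no labels at all, fall into the same distributional category:

```python
from collections import defaultdict

# Tiny invented corpus; words used in the same neighbouring contexts
# tend to belong to the same PoS category.
corpus = "the dog barked . the cat barked . a dog slept . a cat slept .".split()

# Context signature of each word = set of (previous, next) word pairs.
contexts = defaultdict(set)
for prev, word, nxt in zip(corpus, corpus[1:], corpus[2:]):
    contexts[word].add((prev, nxt))

def similarity(w1, w2):
    """Jaccard overlap of the two words' context sets."""
    a, b = contexts[w1], contexts[w2]
    return len(a & b) / len(a | b)

# 'dog' and 'cat' share all contexts (both nouns); 'dog' and 'barked' share none.
print(similarity("dog", "cat"))     # → 1.0
print(similarity("dog", "barked"))  # → 0.0
```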
And Finally…
Any final questions?
Feedback please (eg email me)
Good luck in the exam!
Look at past exam papers
BUT note changes in topics covered
And if you do use NLP in your career, please let me know…