School of Computing
FACULTY OF ENGINEERING
• Thanks to many others for much of
the material; particularly…
NLP: Introduction and overview
COMP3310 Natural Language Processing
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst,
and other contributors)
Today
Module Objectives
Why NLP is difficult: language is a complex system
How to solve it? Corpus-based machine-learning approaches
Motivation: applications of “The Language Machine”
Goals of this Module
Learn about the problems and possibilities of natural language
analysis:
• What are the major issues?
• What are the major solutions?
• How well do they work?
• How do they work?
• Katja Markert, Lecturer, School of Computing, Leeds University
http://www.comp.leeds.ac.uk/markert
http://www.comp.leeds.ac.uk/lng
• Marti Hearst, Associate Professor, School of Information, University of California at Berkeley
http://www.ischool.berkeley.edu/people/faculty/martihearst
http://courses.ischool.berkeley.edu/i256/f06/sched.html
Objectives of COMP3310
On completion of this module, students should be able to:
- understand theory and terminology of empirical modelling of
natural language;
- understand and use algorithms, resources and techniques
for implementing and evaluating NLP systems;
- be familiar with some of the main language engineering
application areas;
- appreciate why unrestricted natural language processing is
still a major research task.
Why is NLP difficult?
Computers are not brains
• There is evidence that much of language understanding is built into
the human brain
Computers do not socialize
• Much of language is about communicating with people
Key problems:
• Representation of meaning
• Language presupposes knowledge about the world
• Language is ambiguous: a message can have many interpretations
• Language presupposes communication between people
At the end you should:
• Agree that language is subtle and interesting!
• Feel some ownership over the algorithms
• Be able to assess NLP problems
• Know which solutions to apply when, and how
• Be able to read research papers in the field
2001: A Space Odyssey (1968)
Dave Bowman: “Open the pod bay doors, HAL”
HAL 9000: “I'm sorry Dave. I'm afraid I can't do that.”
Language subtleties
Adjective order and placement
• A big black dog
• A big black scary dog
• A big scary dog
• A scary big dog
A black big dog
Antonyms
• Which sizes go together?
• Big and little
• Big and small
• Large and small
Large and little
World Knowledge is subtle
He arrived at the lecture.
He chuckled at the lecture.
He arrived drunk.
He chuckled drunk.
He chuckled his way through the lecture.
He arrived his way through the lecture.
Hidden Structure
English plural pronunciation
• Toy + s → toyz ; add z
• Book + s → books ; add s
• Church + s → churchiz ; add iz
• Box + s → boxiz ; add iz
• Sheep + s → sheep ; add nothing
What about new words?
• Bach + 's → baXs ; why not baXiz?
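The plural pronunciation pattern on the Hidden Structure slide (add z, s, iz, or nothing) can be sketched as a rule over the final sound of the noun rather than its spelling. This is only a rough illustration: the sound classes SIBILANTS and VOICELESS, the toy sound labels, and the function name plural_suffix are assumptions made for this example, not part of the module material.

```python
# Illustrative sound classes (an assumption for this sketch, using rough
# ASCII labels instead of proper phonetic symbols).
SIBILANTS = {"s", "z", "sh", "ch", "zh", "j"}   # hissing sounds: add "iz"
VOICELESS = {"p", "t", "k", "f", "th", "X"}     # "X" = final sound of "Bach"
MEMORIZED = {"sheep": ""}                       # exceptions: add nothing


def plural_suffix(word, final_sound):
    """Return the pronounced plural ending for a noun, given its final sound."""
    if word in MEMORIZED:              # memorized exceptions come first
        return MEMORIZED[word]
    if final_sound in SIBILANTS:       # church, box
        return "iz"
    if final_sound in VOICELESS:       # book, Bach
        return "s"
    return "z"                         # vowels and voiced consonants: toy


examples = [("toy", "oy"), ("book", "k"), ("church", "ch"),
            ("box", "s"), ("sheep", "p"), ("Bach", "X")]
for word, sound in examples:
    print(word, "+ s ->", word + plural_suffix(word, sound))
```

Under these assumptions a new word such as Bach gets s rather than iz, because the rule looks at its final sound (not a sibilant) and not at its spelling.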
How can a machine understand these differences?
• Get the cat with the gloves.
• Get the sock from the cat with the gloves.
• Get the glove from the cat with the socks.
• Decorate the cake with the frosting.
• Decorate the cake with the kids.
• Throw out the cake with the frosting.
• Throw out the cake with the kids.
Words are ambiguous: multiple functions and meanings
I know that.
I know that block.
I know that blocks the sun.
I know that block blocks the sun.
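To get a feel for how quickly word-level ambiguity multiplies in a sentence like "I know that block blocks the sun", here is a minimal sketch that enumerates every combination of word categories. The toy LEXICON, its category labels, and the function tag_sequences are hypothetical, chosen only for illustration; a real tagger would use a much larger lexicon and would rank the candidates.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the categories it might have.
LEXICON = {
    "i": ["PRON"],
    "know": ["VERB"],
    "that": ["DET", "PRON", "COMP"],   # "that block" / "know that" / "know that ..."
    "block": ["NOUN", "VERB"],
    "blocks": ["NOUN", "VERB"],
    "the": ["DET"],
    "sun": ["NOUN"],
}


def tag_sequences(sentence):
    """Return every possible assignment of categories to the words."""
    words = sentence.lower().rstrip(".").split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]


readings = tag_sequences("I know that block blocks the sun.")
print(len(readings), "candidate category sequences")
for reading in readings[:3]:
    print(reading)
```

Only one of these candidates matches the intended reading, so a machine needs some further source of evidence to choose among them.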
News Headline Ambiguity
Iraqi Head Seeks Arms
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Kids Make Nutritious Snacks
British Left Waffles on Falkland Islands
Red Tape Holds Up New Bridges
Bush Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors
(Headlines leave out punctuation and function-words)
Lynne Truss, 2003. Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation
The Role of Memorization
Children learn words quickly
• Around age two they learn about 1 word every 2 hours.
• (Or 9 words/day)
• Often only need one exposure to associate meaning with word
• Can make mistakes, e.g., overgeneralization
“I goed to the store.”
• Exactly how they do this is still under study
Adult vocabulary
• Typical adult: about 60,000 words
• Literate adults: about twice that.
The Role of Memorization
Dogs can do word association too!
• Rico, a border collie in Germany
• Knows the names of each of 100 toys
• Can retrieve items called out to him with over 90% accuracy.
• Can also learn and remember the names of unfamiliar toys after just one encounter, putting him on a par with a three-year-old child.
http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
But there is too much to memorize!
establish
establishment
    the church of England as the official state church.
disestablishment
antidisestablishment
antidisestablishmentarian
antidisestablishmentarianism
    is a political philosophy that is opposed to the separation of church and state.
MAYBE we don't remember every word separately;
MAYBE we remember MORPHEMES and how to combine them
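The idea of storing morphemes and combining them, instead of memorizing every long word, can be sketched with a tiny affix-stripping function. The inventories PREFIXES, SUFFIXES and STEMS and the function segment are assumptions covering only the "establish" family; real morphological analysis needs far richer rules.

```python
# Hypothetical morpheme inventory, just enough for the "establish" family.
PREFIXES = ("anti", "dis")
SUFFIXES = ("ment", "arian", "ism")
STEMS = {"establish"}


def segment(word):
    """Split a word into prefixes + stem + suffixes, or None if that fails."""
    prefixes, suffixes = [], []
    rest = word.lower()
    stripped = True
    while stripped:                      # peel prefixes from the left
        stripped = False
        for prefix in PREFIXES:
            if rest.startswith(prefix) and len(rest) > len(prefix):
                prefixes.append(prefix)
                rest = rest[len(prefix):]
                stripped = True
                break
    stripped = True
    while stripped:                      # peel suffixes from the right
        stripped = False
        for suffix in SUFFIXES:
            if rest.endswith(suffix) and len(rest) > len(suffix):
                suffixes.insert(0, suffix)
                rest = rest[:-len(suffix)]
                stripped = True
                break
    if rest not in STEMS:                # unknown stem: give up
        return None
    return prefixes + [rest] + suffixes


print(segment("antidisestablishmentarianism"))
```

With six stored morphemes this sketch covers all six words on the slide, which is the economy the MAYBE lines point at.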
Representation of Meaning
I know that block blocks the sun.
• How do we represent the meanings of “block”?
• How do we represent “I know”?
• How does that differ from “I know that…”?
• Who/what is “I”?
• How do we indicate that we are talking about earth's sun vs. some other planet's sun?
• When did this take place? What if I move the block? What if I move my viewpoint? How do we represent this?
Rules and Memorization
Current thinking in psycholinguistics is that we use a combination of rules and memorization
• However, this is controversial
Mechanism:
• If there is an applicable rule, apply it
• However, if there is a memorized version, that takes precedence. (Important for irregular words.)
• Artists paint “still lifes”
• Not “still lives”
• Past tense of
• think → thought
• blink → blinked
This is a simplification…
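The mechanism described under Rules and Memorization (apply the rule, unless a memorized form takes precedence) can be sketched for the English past tense. The table MEMORIZED_PAST and the function past_tense are illustrative assumptions, and the rule deliberately ignores spelling details such as consonant doubling.

```python
# Memorized irregular forms (illustrative; real speakers know many more).
MEMORIZED_PAST = {"think": "thought", "go": "went"}


def past_tense(verb, memorized=MEMORIZED_PAST):
    """Return the past tense: memorized form if any, else the regular rule."""
    if verb in memorized:            # a memorized version takes precedence
        return memorized[verb]
    if verb.endswith("e"):           # simplified regular rule: add -d / -ed
        return verb + "d"
    return verb + "ed"


print(past_tense("think"))               # memorized form
print(past_tense("blink"))               # regular rule
print(past_tense("go", memorized={}))    # rule with no memorized form yet
```

The last call shows how overgeneralization such as "I goed to the store" falls out of the same mechanism when the irregular form has not yet been memorized.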
How to tackle these problems?
The field was stuck for quite some time…
• Linguistic models for a specific example did not generalise
A new approach started around 1990: Corpus Linguistics
• Well, not really new, but in the '50s to '80s, they didn't have the text, disk space, or GHz
Main idea: combine memorizing and rules, learn from data
How to do it:
• Get large text collection (a corpus; plural: corpora)
• Compute statistics over the words in the text collection (corpus)
Surprisingly effective
• Even better now with the Web: Web-as-Corpus research
Example Problem
Grammar checking example:
Which word to use?
<principal> <principle>
Empirical solution: look at which words surround each use:
• I am in my third year as the principal of Anamosa High School.
• School-principal transfers caused some upset.
• This is a simple formulation of the quantum mechanical uncertainty principle.
• Power without principle is barren, but principle without power is futile. (Tony Blair)
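The empirical approach to the principal/principle example (collect a corpus, count which words surround each spelling, then pick the spelling whose neighbours best match a new sentence) can be sketched as follows. The four example sentences from the slides serve as a toy corpus; the names CONFUSION_SET, train and choose_spelling are hypothetical, and the score is a crude overlap count, not a proper probability estimate.

```python
import re
from collections import Counter

CONFUSION_SET = ("principal", "principle")

# Toy corpus: the four example sentences from the slides.
CORPUS = [
    "I am in my third year as the principal of Anamosa High School.",
    "School-principal transfers caused some upset.",
    "This is a simple formulation of the quantum mechanical uncertainty principle.",
    "Power without principle is barren, but principle without power is futile.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(sentences):
    """Count, for each spelling, the words that share a sentence with it."""
    neighbours = {spelling: Counter() for spelling in CONFUSION_SET}
    for sentence in sentences:
        tokens = tokenize(sentence)
        context = [t for t in tokens if t not in CONFUSION_SET]
        for spelling in CONFUSION_SET:
            if spelling in tokens:
                neighbours[spelling].update(context)
    return neighbours


def choose_spelling(sentence, neighbours):
    """Pick the spelling whose counted neighbours overlap most with the sentence."""
    context = [t for t in tokenize(sentence) if t not in CONFUSION_SET]
    scores = {s: sum(neighbours[s][w] for w in context) for s in CONFUSION_SET}
    return max(scores, key=scores.get)


model = train(CORPUS)
print(choose_spelling("The principle of our high school retired.", model))
print(choose_spelling("The uncertainty principal is a simple rule.", model))
```

Both test sentences contain the wrong spelling on purpose; with this toy corpus the sketch suggests "principal" for the first and "principle" for the second, without representing what either word means.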
Using Very Large Corpora
Keep track of which words are the neighbors of each spelling in well-edited text, e.g.:
• Principal: “high school”
• Principle: “rule”
At grammar-check time, choose the spelling best predicted by the probability of co-occurring with surrounding words.
No need to “understand the meaning”!?
The Effects of LARGE Datasets
From Banko & Brill, 2001. Scaling to Very Very Large Corpora for Natural Language Disambiguation, Proc ACL
Surprising results:
• Log-linear improvement even to a billion words!
• Getting more data is better than fine-tuning algorithms!
Motivation: Real-World Applications of NLP
Machine Translation
Spelling Suggestions/Corrections
Grammar Checking
Synonym Generation
Information Extraction
Text Categorization
Automated Customer Service
Speech Recognition
Question Answering
Chatbots
Improving Web Search Engine results
Automated Metadata Assignment
Online Dialogs
Information Retrieval, e.g. Google … and Scholar, Books, Products, AdWords, AdSense
Programming: Python and NLTK
Python: A suitable programming language
• Interpreted – easy to test ideas
• Object-oriented
• Easy to interface to other things (web, DBMS, Tk)
• Data structures, OO concepts etc. from: Java, Lisp, Tcl, Perl
• Easy to learn, FUN! (?)
• Python NLTK: Natural Language Toolkit with demos and tutorials
Summary: Intro to NLP
Module Objectives: learn about NLP and how to apply it
Why NLP is difficult: language is a complex system
How to solve it? Corpus-based machine-learning approaches
Motivation: applications of “The Language Machine”
Suggested private study this week:
• Install Python and NLTK on your own PCs: http://www.nltk.org/
• Read “The Language Machine” http://www.comp.leeds.ac.uk/eric/atwell99bc.pdf
• Read NLTK “Getting Started” http://www.nltk.org/getting-started