Introduction to NLP
Thanks to Hongning Wang@UVa for the slides from his Text Mining course; the slides are slightly modified by Lei Chen.
What is NLP?
Arabic text: .كلب هو مطاردة صبي في الملعب
("A dog is chasing a boy on the playground.")
How can a computer make sense out of this string?
Morphology - What are the basic units of meaning (words)? What is the meaning of each word?
Syntax - How are words related with each other?
Semantics - What is the "combined meaning" of words?
Pragmatics - What is the "meta-meaning"? (speech act)
Discourse - Handling a large chunk of text
Inference - Making sense of everything
An example of NLP
"A dog is chasing a boy on the playground."
- Lexical analysis (part-of-speech tagging): A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun
- Syntactic analysis (parsing): [A dog]_NounPhrase [[is chasing]_ComplexVerb [a boy]_NounPhrase [on [the playground]_NounPhrase]_PrepPhrase]_VerbPhrase => Sentence
- Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
- Inference: Scared(x) if Chasing(_,x,_). => Scared(b1)
- Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…
If we can do this for all the sentences in all languages, then …
• Automatically answer our emails
• Translate languages accurately
• Help us manage, summarize, and aggregate information
• Use speech as a UI (when needed)
• Talk to us / listen to us
BAD NEWS:
• Unfortunately, we cannot do this right now.
• General NLP = "Complete AI"
NLP is difficult!
• Natural language is designed to make human
communication efficient. Therefore,
– We omit a lot of “common sense” knowledge,
which we assume the hearer/reader possesses
– We keep a lot of ambiguities, which we assume
the hearer/reader knows how to resolve
• This makes EVERY step in NLP hard
– Ambiguity is a "killer"!
– Common sense reasoning is a prerequisite
An example of ambiguity
• Get the cat with the gloves.
Examples of challenges
• Word-level ambiguity
– “design” can be a noun or a verb (Ambiguous POS)
– “root” has multiple meanings (Ambiguous sense)
• Syntactic ambiguity
– “natural language processing” (Modification)
– “A man saw a boy with a telescope.” (PP Attachment)
• Anaphora resolution
– “John persuaded Bill to buy a TV for himself.” (himself =
John or Bill?)
• Presupposition
– “He has quit smoking.” implies that he smoked before.
Despite all the challenges, research in
NLP has also made a lot of progress…
A brief history of NLP
• Early enthusiasm (1950's): Machine Translation
– Too ambitious
– Bar-Hillel report (1960) concluded that fully-automatic high-quality translation could not be accomplished without knowledge (Dictionary + Encyclopedia)
• Less ambitious applications (late 1960's & early 1970's): limited success, failed to scale up
– Speech recognition (deep understanding in a limited domain)
– Dialogue (Eliza) (shallow understanding)
– Inference and domain knowledge (SHRDLU = "block world")
• Real world evaluation (late 1970's – now)
– Story understanding (late 1970's & early 1980's) (knowledge representation)
– Large scale evaluation of speech recognition, text retrieval, information extraction (1980 – now) (robust component techniques)
– Statistical approaches enjoy more success (first in speech recognition & retrieval, later others) (statistical language models)
• Current trend:
– Boundary between statistical and symbolic approaches is disappearing
– We need to use all the available knowledge
– Application-driven NLP research (bioinformatics, Web, question answering…)
The state of the art
"A dog is chasing a boy on the playground"
- POS tagging: 97% (Det Noun Aux Verb Det Noun Prep Det Noun)
- Parsing: partial parsing >90% (noun phrases, prep phrase, verb phrases, sentence)
- Semantics: only some aspects (entity/relation extraction, word sense disambiguation, anaphora resolution)
- Inference: ???
- Speech act analysis: ???
Machine translation
Dialog systems
[Figures: Apple's Siri system and Google search]
Information extraction
[Figures: Google Knowledge Graph and Wikipedia infobox]
Information extraction
[Figures: CMU Never-Ending Language Learning (NELL) and the YAGO Knowledge Base]
Building a computer
that ‘understands’ text:
The NLP pipeline
Tokenization/Segmentation
• Split text into words and sentences
– Task: what is the most likely segmentation/tokenization?
"There was an earthquake near D.C. I've even felt it in Philadelphia, New York, etc."
=> There + was + an + earthquake + near + D.C.
=> I + 've + even + felt + it + in + Philadelphia + , + New + York + , + etc.
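As a concrete illustration, here is a minimal Python sketch using NLTK (an assumption; the slides do not prescribe a toolkit), with the 'punkt' tokenizer data downloaded:

```python
# Minimal tokenization sketch (assumes: pip install nltk; python -m nltk.downloader punkt)
import nltk

text = ("There was an earthquake near D.C. "
        "I've even felt it in Philadelphia, New York, etc.")

# Sentence segmentation: abbreviations such as "D.C." and "etc." make this non-trivial.
sentences = nltk.sent_tokenize(text)

# Word tokenization: clitics such as "'ve" are split into separate tokens.
for sent in sentences:
    print(nltk.word_tokenize(sent))
```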
Part-of-Speech tagging
• Marking up a word in a text (corpus) as corresponding to a particular part of speech
– Task: what is the most likely tag sequence?
A + dog + is + chasing + a + boy + on + the + playground
Det  Noun  Aux  Verb  Det  Noun  Prep  Det  Noun
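For comparison, an off-the-shelf tagger produces a similar sequence. This sketch assumes NLTK and its pretrained perceptron tagger, which uses the Penn Treebank tag set rather than the coarse labels above:

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data have been downloaded

tokens = nltk.word_tokenize("A dog is chasing a boy on the playground")
print(nltk.pos_tag(tokens))
# Expected output (Penn Treebank tags), roughly:
# [('A', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('chasing', 'VBG'),
#  ('a', 'DT'), ('boy', 'NN'), ('on', 'IN'), ('the', 'DT'), ('playground', 'NN')]
```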
Named entity recognition
• Determine text mapping to proper names
– Task: what is the most likely mapping?
"Its initial Board of Visitors included U.S. Presidents Thomas Jefferson, James Madison, and James Monroe."
=> Board of Visitors (Organization); U.S. (Location); Thomas Jefferson, James Madison, James Monroe (Person)
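One way to obtain such a mapping automatically is a pretrained NER model. The sketch below assumes spaCy and its small English model are installed; this toolkit choice is mine, not something the slides cover:

```python
# Sketch only: assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm` have been run.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Its initial Board of Visitors included U.S. Presidents "
          "Thomas Jefferson, James Madison, and James Monroe.")

# Print each detected entity span and its label.
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g., 'Thomas Jefferson' PERSON, 'U.S.' GPE
```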
Syntactic parsing
• Grammatical analysis of a given sentence, conforming to the rules of a formal grammar
– Task: what is the most likely grammatical structure?
A + dog + is + chasing + a + boy + on + the + playground
Det  Noun  Aux  Verb  Det  Noun  Prep  Det  Noun
[A dog]_NounPhrase [[is chasing]_ComplexVerb [a boy]_NounPhrase [on [the playground]_NounPhrase]_PrepPhrase]_VerbPhrase => Sentence
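To illustrate grammar-driven parsing, the sketch below feeds a tiny hand-written grammar (my own toy grammar, not one from the slides) to NLTK's chart parser:

```python
import nltk

# A tiny hand-written grammar that covers only the example sentence.
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  VP  -> Aux V NP PP | Aux V NP
  PP  -> P NP
  NP  -> Det N
  Det -> 'a' | 'A' | 'the'
  N   -> 'dog' | 'boy' | 'playground'
  Aux -> 'is'
  V   -> 'chasing'
  P   -> 'on'
""")

parser = nltk.ChartParser(grammar)
tokens = "A dog is chasing a boy on the playground".split()
for tree in parser.parse(tokens):
    tree.pretty_print()   # prints the phrase-structure tree
```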
Relation extraction
• Identify the relationships among named
entities
– Shallow semantic analysis
Its initial Board of Visitors included U.S.
Presidents Thomas Jefferson, James Madison,
and James Monroe.
1. Thomas Jefferson Is_Member_Of Board of
Visitors
2. Thomas Jefferson Is_President_Of U.S.
Logic inference
• Convert chunks of text into more formal
representations
– Deep semantic analysis: e.g., first-order logic
structures
Its initial Board of Visitors included U.S.
Presidents Thomas Jefferson, James Madison,
and James Monroe.
∃𝑥 (Is_Person(𝑥) & Is_President_Of(𝑥,’U.S.’)
& Is_Member_Of(𝑥,’Board of Visitors’))
Towards understanding of text
• Who is Carl Lewis?
• Did Carl Lewis break any records?
Major NLP applications
• Speech recognition: e.g., auto telephone call routing
• Text mining (our focus)
– Text clustering
– Text classification
– Text summarization
– Topic modeling
– Question answering
• Language tutoring
– Spelling/grammar correction
• Machine translation
– Cross-language retrieval
– Restricted natural language
• Natural language user interface
NLP & text mining
• Better NLP => Better text mining
• Bad NLP => Bad text mining?
Robust, shallow NLP tends to be more useful than
deep, but fragile NLP.
Errors in NLP can hurt text mining performance…
How much NLP is really needed?
Tasks, listed roughly in order of increasing dependency on NLP and decreasing scalability:
Classification, Clustering, Summarization, Extraction, Topic modeling, Translation, Dialogue, Question Answering, Inference, Speech Act
So, what NLP techniques are the most
useful for text mining?
• Statistical NLP in general.
• The need for high robustness and efficiency
implies the dominant use of simple models
What is POS tagging
Raw text:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
POS tagger (using a tag set, e.g., NNP: proper noun, CD: numeral, JJ: adjective)
Tagged text:
Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._.
Why POS tagging?
• POS tagging is a prerequisite for further NLP analysis
– Syntax parsing
• Basic unit for parsing
– Information extraction
• Indication of names, relations
– Machine translation
• The meaning of a particular word depends on its POS tag
– Sentiment analysis
• Adjectives are the major opinion holders
– Good vs. Bad, Excellent vs. Terrible
Challenges in POS tagging
• Words often have more than one POS tag
– The back door (adjective)
– On my back (noun)
– Promised to back the bill (verb)
• Simple solution with dictionary look-up does
not work in practice
– One needs to determine the POS tag for a
particular instance of a word from its context
Define a tagset
• We have to agree on a standard inventory of
word classes
– Taggers are trained on labeled corpora
– The tagset needs to capture semantically or
syntactically important distinctions that can easily
be made by trained human annotators
Word classes
• Open classes
– Nouns, verbs, adjectives, adverbs
• Closed classes
– Auxiliaries and modal verbs
– Prepositions, Conjunctions
– Pronouns, Determiners
– Particles, Numerals
Public tagsets in NLP
• Brown corpus - Francis and Kucera 1961
– 500 samples, distributed across 15 genres in rough
proportion to the amount published in 1961 in each of
those genres
– 87 tags
• Penn Treebank - Marcus et al. 1993
– Hand-annotated corpus of Wall Street Journal, 1M
words
– 45 tags, a simplified version of Brown tag set
– Standard for English now
• Most statistical POS taggers are trained on this tagset
How much ambiguity is there?
• Statistics of word-tag pairs in the Brown Corpus and the Penn Treebank
[Figure: roughly 11% and 18% of word types carry more than one possible tag in the two corpora.]
Is POS tagging a solved problem?
• Baseline
– Tag every word with its most frequent tag
– Tag unknown words as nouns
– Accuracy
• Word level: 90%
• Sentence level
– Average English sentence length: 14.3 words
– 0.9^14.3 ≈ 22%
• Accuracy of state-of-the-art POS taggers
– Word level: 97%
– Sentence level: 0.97^14.3 ≈ 65%
Building a POS tagger
• Rule-based solution
1. Take a dictionary that lists all possible tags for each word
2. Assign to every word all its possible tags
3. Apply rules that eliminate impossible/unlikely tag sequences, leaving only one tag per word
– Rules can be learned via inductive learning.
Example dictionary lookup:
she -> PRP
promised -> VBN, VBD
to -> TO
back -> VB, JJ, RB, NN!!
the -> DT
bill -> NN, VB
Example rules:
R1: a pronoun should be followed by a past tense verb
R2: a verb cannot follow a determiner
Building a POS tagger
• Statistical POS tagging
– Given a sequence of words w = (w1, w2, ..., wn), what is the most likely sequence of tags t = (t1, t2, ..., tn)?
t* = argmax_t p(t|w)
POS tagging with generative models
• Bayes rule
t* = argmax_t p(t|w)
   = argmax_t p(w|t) p(t) / p(w)
   = argmax_t p(w|t) p(t)
– Joint distribution of tags and words
– Generative model
• A stochastic process that first generates the tags, and then generates the words based on these tags
Hidden Markov models
• Two assumptions for POS tagging
1. The current tag only depends on the previous k tags
• p(t) = ∏_i p(t_i | t_{i-1}, t_{i-2}, ..., t_{i-k})
• When k = 1, this is a first-order HMM
2. Each word in the sequence depends only on its corresponding tag
• p(w|t) = ∏_i p(w_i | t_i)
Graphical representation of HMMs
• Transition probability p(t_i | t_{i-1}): between tags in the tagset
• Emission probability p(w_i | t_i): from a tag to a word in the vocabulary
• Light circles: latent random variables (tags)
• Dark circles: observed random variables (words)
• Arrows: probabilistic dependency
Finding the most probable tag sequence
t* = argmax_t p(t|w) = argmax_t ∏_i p(w_i|t_i) p(t_i|t_{i-1})
• Complexity analysis
– Each word can have up to T tags
– For a sentence with N words, there are up to T^N possible tag sequences
– Key: exploit the special structure of HMMs!
[Figure: two candidate tag sequences over a trellis of tags t1–t7 and words w1–w5, t(1) = t4 t1 t3 t5 t7 and t(2) = t4 t1 t3 t5 t2; e.g., word w1 takes tag t4.]
Trellis: a special structure for HMMs
The two sequences t(1) = t4 t1 t3 t5 t7 and t(2) = t4 t1 t3 t5 t2 share the prefix t4 t1 t3 t5, so computation can be reused!
[Figure: trellis of tags t1–t7 (rows) and words w1–w5 (columns); word w1 takes tag t4.]
Viterbi algorithm
• Store the score of the best tag sequence for w1 ... w_i that ends in tag t^j in T[j][i]
– T[j][i] = max p(w1 ... w_i, t1 ..., t_i = t^j)
• Recursively compute T[j][i] from the entries in the previous column T[j][i-1]
– T[j][i] = p(w_i | t^j) * max_k ( T[k][i-1] * p(t^j | t^k) )
• p(w_i | t^j): generating the current observation
• T[k][i-1]: the best tag sequence for the first i-1 words ending in t^k
• p(t^j | t^k): transition from the previous best ending tag
Viterbi algorithm
Dynamic programming: O(T^2 N)!
T[j][i] = p(w_i | t^j) * max_k ( T[k][i-1] * p(t^j | t^k) )
[Figure: trellis of tags t1–t7 and words w1–w5; columns are computed in order from left to right.]
Decode argmax_t p(t|w)
• Take the highest scoring entry in the last column of the trellis
• Keep backpointers in each trellis cell to keep track of the most probable sequence
T[j][i] = p(w_i | t^j) * max_k ( T[k][i-1] * p(t^j | t^k) )
[Figure: trellis of tags t1–t7 and words w1–w5, with backpointers tracing the best path.]
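To make the recursion and the backpointer decoding concrete, here is a minimal Python sketch. The dictionary layout (`init`, `trans`, `emit`) and the toy probabilities are my own assumptions for illustration, not values from the slides:

```python
def viterbi(words, tags, init, trans, emit):
    """Return the most probable tag sequence for `words` under a first-order HMM.

    init[t]     = p(t) for the first tag in a sentence
    trans[s][t] = p(t | s), transition probability
    emit[t][w]  = p(w | t), emission probability (0 if unseen; smoothing omitted)
    """
    # T[i][t] = best score of a tag sequence for words[0..i] ending in tag t
    # back[i][t] = the previous tag on that best sequence (the backpointer)
    T = [{t: init[t] * emit[t].get(words[0], 0.0) for t in tags}]
    back = [{}]

    for i in range(1, len(words)):
        T.append({})
        back.append({})
        for t in tags:
            # choose the best previous tag k, reusing column i-1 (dynamic programming)
            k, score = max(
                ((k, T[i - 1][k] * trans[k][t]) for k in tags),
                key=lambda pair: pair[1],
            )
            T[i][t] = emit[t].get(words[i], 0.0) * score
            back[i][t] = k

    # take the highest-scoring entry in the last column, then follow backpointers
    best_last = max(tags, key=lambda t: T[-1][t])
    path = [best_last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Toy example (made-up probabilities for illustration only):
tags = ["DT", "NN", "VB"]
init = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {"DT": {"DT": 0.01, "NN": 0.9, "VB": 0.09},
         "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
         "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1}}
emit = {"DT": {"the": 0.7, "a": 0.3},
        "NN": {"dog": 0.5, "boy": 0.5},
        "VB": {"chased": 1.0}}
print(viterbi(["the", "dog", "chased", "a", "boy"], tags, init, trans, emit))
# -> ['DT', 'NN', 'VB', 'DT', 'NN']
```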
Train an HMM tagger
• Parameters in an HMM tagger
– Transition probability: p(t_i | t_j), a T x T table
– Emission probability: p(w | t), a V x T table
– Initial state probability: p(t | π), a T x 1 table (for the first tag in a sentence)
Train an HMM tagger
• Maximum likelihood estimation
– Given a labeled corpus, e.g., the Penn Treebank
– Count how often we see the pairs (t_i, t_j) and (w_i, t_j):
• p(t_j | t_i) = c(t_i, t_j) / c(t_i)
• p(w_i | t_j) = c(w_i, t_j) / c(t_j)
– Proper smoothing is necessary!
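A small sketch of these counts and estimates in Python; the two-sentence corpus is made up, and add-one smoothing stands in for the "proper smoothing" the slide calls for:

```python
from collections import Counter

# Tiny made-up tagged corpus: a list of (word, tag) sentences.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("a", "DT"), ("boy", "NN"), ("runs", "VB")],
]

tag_count = Counter()     # c(t)
tag_bigram = Counter()    # c(t_i, t_j)
word_tag = Counter()      # c(w, t)
init_count = Counter()    # counts of the first tag of each sentence

for sent in corpus:
    init_count[sent[0][1]] += 1
    for i, (w, t) in enumerate(sent):
        tag_count[t] += 1
        word_tag[(w, t)] += 1
        if i > 0:
            tag_bigram[(sent[i - 1][1], t)] += 1

tags = list(tag_count)
vocab = {w for (w, _) in word_tag}

# MLE with add-one (Laplace) smoothing, one stand-in for "proper smoothing".
def p_trans(prev_t, t):
    return (tag_bigram[(prev_t, t)] + 1) / (tag_count[prev_t] + len(tags))

def p_emit(w, t):
    return (word_tag[(w, t)] + 1) / (tag_count[t] + len(vocab) + 1)  # +1 slot for unknown words

def p_init(t):
    return (init_count[t] + 1) / (len(corpus) + len(tags))

print(p_trans("DT", "NN"), p_emit("dog", "NN"), p_init("DT"))
```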
Viterbi Algorithm Example
[Figures: a worked numerical example of the Viterbi computation, continued over several slides.]
Public POS taggers
• Brill’s tagger
– http://www.cs.jhu.edu/~brill/
• TnT tagger
– http://www.coli.uni-saarland.de/~thorsten/tnt/
• Stanford tagger
– http://nlp.stanford.edu/software/tagger.shtml
• SVMTool
– http://www.lsi.upc.es/~nlp/SVMTool/
• GENIA tagger
– http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
• More complete list at
– http://www-nlp.stanford.edu/links/statnlp.html#Taggers
What you should know
• Definition of POS tagging problem
– Property & challenges
• Public tag sets
• Generative model for POS tagging
– HMMs
Lexical Semantics
1. Lexical semantics
– Meaning of words
– Relation between different meanings
2. WordNet
– An ontology structure of word senses
– Similarity between words
3. Distributional semantics
– Similarity between words
– Word sense disambiguation
What is the meaning of a word?
• Most words have many different senses
– dog = animal or sausage?
– lie = to be in a horizontal position or a false statement
made with deliberate intent
• What are the relations of different words in terms
of meaning?
– Specific relations between senses
• Animal is more general than dog
– Semantic fields
• Money is related to bank
• A semantic field is "a set of words grouped, referring to a specific subject … not necessarily synonymous, but are all used to talk about the same general phenomenon" (Wikipedia)
Word senses
• What does ‘bank’ mean?
– A financial institution
• E.g., “US bank has raised interest rates.”
– A particular branch of a financial institution
• E.g., “The bank on Main Street closes at 5pm.”
– The sloping side of any hollow in the ground, especially when bordering a river
• E.g., “In 1927, the bank of the Mississippi flooded.”
– A ‘repository’
• E.g., “I donate blood to a blood bank.”
Lexicon entries
[Figure: a dictionary entry, showing the lemma and its numbered senses.]
Some terminology
• Word forms: runs, ran, running; good, better, best
– Any, possibly inflected, form of a word
• Lemma (citation/dictionary form): run; good
– A basic word form (e.g. infinitive or singular nominative
noun) that is used to represent all forms of the same word
• Lexeme: RUN(V), GOOD(A), BANK1(N), BANK2(N)
– An abstract representation of a word (and all its forms),
with a part-of-speech and a set of related word senses
– Often just written (or referred to) as the lemma, perhaps
in a different FONT
• Lexicon
– A (finite) list of lexemes
Make sense of word senses
• Polysemy
– A lexeme is polysemous if it has different related
senses
bank = financial institution or a building
Make sense of word senses
• Homonyms
– Two lexemes are homonyms if their senses are
unrelated, but they happen to have the same
spelling and pronunciation
bank = financial institution or river bank
Relations between senses
• Symmetric relations
– Synonyms: couch/sofa
• Two lemmas with the same sense
– Antonyms: cold/hot, rise/fall, in/out
• Two lemmas with the opposite sense
• Hierarchical relations:
– Hypernyms and hyponyms: pet/dog
• The hyponym (dog) is more specific than the hypernym (pet)
– Holonyms and meronyms: car/wheel
• The meronym (wheel) is a part of the holonym (car)
WordNet
George Miller, Cognitive
Science Laboratory of Princeton
University, 1985
• A very large lexical database of English:
– 117K nouns, 11K verbs, 22K adjectives, 4.5K adverbs
• Word senses grouped into synonym sets
(“synsets”) linked into a conceptual-semantic
hierarchy
– 82K noun synsets, 13K verb synsets, 18K adjective synsets, 3.6K adverb synsets
– Avg. # of senses: 1.23/noun, 2.16/verb, 1.41/adjective, 1.24/adverb
• Conceptual-semantic relations
– hypernym/hyponym
A WordNet example
• http://wordnet.princeton.edu/
Hierarchical synset relations: nouns
• Hypernym/hyponym (between concepts)
– The more general ‘meal’ is a hypernym of the more specific
‘breakfast’
• Instance hypernym/hyponym (between concepts and instances)
– Austen (Jane Austen, 1775–1817, English novelist) is an instance hyponym of author
• Member holonym/meronym (groups and members)
– professor is a member meronym of (a university’s) faculty
• Part holonym/meronym (wholes and parts)
– wheel is a part meronym of (is a part of) car.
• Substance meronym/holonym (substances and
components)
– flour is a substance meronym of (is made of) bread
WordNet hypernyms & hyponyms
Hierarchical synset relations: verbs
• Hypernym/troponym (between events): the presence of a 'manner' relation between two lexemes
– travel/fly, walk/stroll
– Flying is a troponym of traveling: it denotes a specific manner of traveling
• Entailment (between events):
– snore/sleep
• Snoring entails (presupposes) sleeping
WordNet similarity
• Path based similarity measures between words
– Shortest path between two concepts (Leacock & Chodorow 1998)
• sim = 1 / |shortest path|
– Path length to the root node from the least common subsumer (LCS) of the two concepts (Wu & Palmer 1994); the LCS is the most specific concept that is an ancestor of both
• sim = 2 * depth(LCS) / (depth(w1) + depth(w2))
• http://wn-similarity.sourceforge.net/
WordNet::Similarity
[Figures: screenshots of the WordNet::Similarity tool.]
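For experimentation, NLTK exposes comparable path-based measures through its WordNet interface. This is a sketch that assumes the 'wordnet' data has been downloaded; it is not the WordNet::Similarity package itself:

```python
from nltk.corpus import wordnet as wn   # assumes nltk.download('wordnet') has been run

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# 1 / |shortest path| style measure
print(dog.path_similarity(cat))

# Wu & Palmer: 2 * depth(LCS) / (depth(s1) + depth(s2))
print(dog.wup_similarity(cat))

# The least common subsumer (LCS) itself
print(dog.lowest_common_hypernyms(cat))   # e.g., [Synset('carnivore.n.01')]
```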
Distributional hypothesis
• What is tezgüino?
– A bottle of tezgüino is on the table.
– Everybody likes tezgüino.
– Tezgüino makes you drunk.
– We make tezgüino out of corn.
• The contexts in which a word appears tell us a
lot about what it means
Distributional semantics
• Use the contexts in which words appear to
measure their similarity
– Assumption: similar contexts => similar meanings
– Approach: represent each word 𝑤 as a vector of
its contexts 𝑐
• Vector space representation
• Each dimension corresponds to a particular context 𝑐𝑛
• Each element in the vector of 𝑤 captures the degree to
which the word 𝑤 is associated with the context 𝑐𝑛
– Similarity metric
• Cosine similarity
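A minimal sketch of this vector-space idea, using made-up co-occurrence counts and cosine similarity:

```python
import math

# Made-up co-occurrence counts: word -> {context word: count}
vectors = {
    "beer":     {"drink": 8, "bottle": 5, "corn": 1},
    "tezguino": {"drink": 6, "bottle": 4, "corn": 3},
    "car":      {"drive": 9, "wheel": 4, "bottle": 1},
}

def cosine(u, v):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(u[c] * v[c] for c in u if c in v)
    norm = lambda x: math.sqrt(sum(val * val for val in x.values()))
    return dot / (norm(u) * norm(v))

print(cosine(vectors["beer"], vectors["tezguino"]))  # high: similar contexts
print(cosine(vectors["beer"], vectors["car"]))       # low: different contexts
```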
How to define the contexts
• Nearby words (within a sentence)
– w appears near c if c occurs within ±k words of w
• This yields fairly broad thematic relations
– Decide on a fixed vocabulary of N context words c1 .. cN
• Prefer words that occur frequently enough in the corpus but not too frequently (i.e., avoid stopwords)
– Use the co-occurrence count of word w and context c as the corresponding element in the vector
• Or use Pointwise Mutual Information (PMI)
• Grammatical relations
– How often is w used as the subject of the verb c?
– Fine-grained thematic relations
Mutual information
• Relatedness between two random variables
– I(X;Y) = Σ_{x∈X} Σ_{y∈Y} p(x,y) log( p(x,y) / (p(x) p(y)) )
Pointwise mutual information
• PMI between w and c, using a fixed window of ±k words within a sentence
– PMI(w;c) = log( p(w,c) / (p(w) p(c)) )
• p(w,c): how often w and c co-occur inside a window
• p(w): how often w occurs
• p(c): how often c occurs
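A sketch of estimating PMI from windowed co-occurrence counts; the toy corpus, the window size, and the simple probability estimates are assumptions for illustration:

```python
import math
from collections import Counter

corpus = [
    "we drink tezguino at the party".split(),
    "we drink beer at the party".split(),
    "we drive the car to the party".split(),
]

k = 2                       # window of +/- k words
word_count = Counter()
pair_count = Counter()
total = 0

for sent in corpus:
    for i, w in enumerate(sent):
        word_count[w] += 1
        total += 1
        for j in range(max(0, i - k), min(len(sent), i + k + 1)):
            if j != i:
                pair_count[(w, sent[j])] += 1

def pmi(w, c):
    # p(w,c), p(w), p(c) estimated from raw counts; -inf if the pair never co-occurs
    p_wc = pair_count[(w, c)] / sum(pair_count.values())
    p_w = word_count[w] / total
    p_c = word_count[c] / total
    return math.log(p_wc / (p_w * p_c)) if p_wc > 0 else float("-inf")

print(pmi("drink", "tezguino"), pmi("drive", "tezguino"))
```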
Word sense disambiguation
• What does this word mean?
– This plant needs to be watered each day.
• living plant
– This plant manufactures 1000 widgets each day.
• factory
• Word sense disambiguation (WSD)
– Identify the sense of content words (noun, verb,
adjective) in context (assuming a fixed inventory
of word senses)
Dictionary-based methods
• A dictionary/thesaurus contains glosses and
examples of a word
bank1
Gloss: a financial institution that accepts deposits and channels
the money into lending activities
Examples: “he cashed the check at the bank”,
“that bank holds the mortgage on my home”
bank2
Gloss: sloping land (especially the slope beside a body of water)
Examples: “they pulled the canoe up on the bank”,
“he sat on the bank of the river and watched the current”
Lesk algorithm
• Compare the context with the dictionary definition of each sense
– Construct the signature of the word from its context and the signatures of its senses from the dictionary
• Signature = set of context words (in the examples/gloss, or in the context)
– Assign the dictionary sense whose gloss and examples are the most similar to the context in which the word occurs
• Similarity = size of the intersection of the context signature and the sense signature
Sense signatures
bank1
Gloss: a financial institution that accepts deposits and channels
the money into lending activities
Examples: “he cashed the check at the bank”,
“that bank holds the mortgage on my home”
Signature(bank1) = {financial, institution, accept, deposit,
channel, money, lend, activity, cash, check, hold, mortgage, home}
bank2
Gloss: sloping land (especially the slope beside a body of water)
Examples: “they pulled the canoe up on the bank”,
“he sat on the bank of the river and watched the current”
Signature(bank2) = {slope, land, body, water, pull, canoe, sit,
river, watch, current}
Signature of target word
“The bank refused to give me a loan.”
• Simplified Lesk
– Words in context
– Signature(bank) = {refuse, give, loan}
• Original Lesk
– Augmented signature of the target word
– Signature(bank) = {refuse, reject, request,... , give,
gift, donate,... loan, money, borrow,...}
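A sketch of simplified Lesk on the bank example, reusing the signatures from the earlier slides; the helper name `simplified_lesk` is mine:

```python
# Sense signatures copied from the previous slides.
signatures = {
    "bank1": {"financial", "institution", "accept", "deposit", "channel", "money",
              "lend", "activity", "cash", "check", "hold", "mortgage", "home"},
    "bank2": {"slope", "land", "body", "water", "pull", "canoe", "sit",
              "river", "watch", "current"},
}

def simplified_lesk(context_words, signatures):
    """Pick the sense whose signature overlaps most with the context words."""
    overlaps = {sense: len(sig & context_words) for sense, sig in signatures.items()}
    return max(overlaps, key=overlaps.get), overlaps

# A context where the overlap is informative:
print(simplified_lesk({"cash", "check", "money"}, signatures))
# -> ('bank1', {'bank1': 3, 'bank2': 0})

# The slide's context {refuse, give, loan} overlaps neither signature,
# which is exactly why the original Lesk variant augments the signature with related words.
print(simplified_lesk({"refuse", "give", "loan"}, signatures))
```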
What you should know
• Lexical semantics
– Relationship between words
– WordNet
• Distributional semantics
– Similarity between words
– Word sense disambiguation