Word Sense
Disambiguation
Gaël Dias - Adapted from Rada Mihalcea & Ted Pedersen at ACL 2005
Definitions
Word sense disambiguation is the problem of selecting a
sense for a word from a set of predefined possibilities.
Sense Inventory usually comes from a dictionary or thesaurus.
Knowledge-intensive methods, supervised learning, and
(sometimes) bootstrapping approaches
Word sense discrimination is the problem of dividing the
usages of a word into different meanings, without regard to
any particular existing sense inventory.
Unsupervised techniques
Slide 2
Computers versus Humans
Polysemy – most words have many possible meanings.
A computer program has no basis for knowing which one is
appropriate, even if it is obvious to a human…
Ambiguity is rarely a problem for humans in their day-to-day
communication, except in extreme cases…
Slide 3
Ambiguity for Humans: Newspaper Headlines!
DRUNK GETS NINE YEARS IN VIOLIN CASE
FARMER BILL DIES IN HOUSE
PROSTITUTES APPEAL TO POPE
STOLEN PAINTING FOUND BY TREE
RED TAPE HOLDS UP NEW BRIDGE
DEER KILL 300,000
RESIDENTS CAN DROP OFF TREES
INCLUDE CHILDREN WHEN BAKING COOKIES
MINERS REFUSE TO WORK AFTER DEATH
Slide 4
Ambiguity for a Computer
The fisherman jumped off the bank and into the water.
The bank down the street was robbed!
Back in the day, we had an entire bank of computers devoted
to this problem.
The bank in that road is entirely too steep and is really
dangerous.
The plane took a bank to the left, and then headed off
towards the mountains.
Slide 5
Early Days of WSD
Noted as a problem for Machine Translation (Weaver, 1949)
A word can often only be translated if you know the specific sense
intended (A bill in English could be a pico or a cuenta in Spanish)
Bar-Hillel (1960) posed the following:
Little John was looking for his toy box. Finally, he found it. The box
was in the pen. John was very happy.
Is “pen” a writing instrument or an enclosure where children play?
…declared it unsolvable, left the field of MT!
Slide 6
Since then…
1970s - 1980s
Rule-based systems
Rely on hand-crafted knowledge sources
1990s
Corpus-based approaches
Dependence on sense tagged text
2000s
Hybrid Systems
Minimizing or eliminating use of sense tagged text
Taking advantage of the Web
Slide 7
Practical Applications
Machine Translation
Translate “bill” from English to Spanish
Is it a “pico” or a “cuenta”?
Is it a bird jaw or an invoice?
Information Retrieval
Find all Web Pages about “cricket”
The sport or the insect?
Question Answering
What is George Miller’s position on gun control?
The psychologist or US congressman?
Knowledge Acquisition
Add to KB: Herb Bergson is the mayor of Duluth.
Minnesota or Georgia?
Slide 8
Knowledge-based WSD
Task definition
Knowledge-based WSD = class of WSD methods relying
(mainly) on knowledge drawn from dictionaries and/or raw
text
Resources
Yes: Machine Readable Dictionaries, raw corpora
No: manually annotated corpora
Scope
All open-class words
Slide 9
Machine Readable Dictionaries
In recent years, most dictionaries have been made available as
Machine Readable Dictionaries (MRDs)
Oxford English Dictionary
Collins
Longman Dictionary of Contemporary English (LDOCE)
Thesauruses – add synonymy information
Roget's Thesaurus
Semantic networks – add more semantic relations
WordNet
EuroWordNet
Slide 10
MRD – A Resource for Knowledge-based WSD
For each word in the language vocabulary, an MRD
provides:
A list of meanings
Definitions (for all word meanings)
Typical usage examples (for most word meanings)
WordNet definitions/examples for the noun plant
1. buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles"
2. a living organism lacking the power of locomotion
3. something planted secretly for discovery by another; "the police used a plant to trick the thieves"; "he claimed that the evidence against him was a plant"
4. an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience
Slide 11
MRD – A Resource for Knowledge-based WSD
A thesaurus adds:
An explicit synonymy relation between word meanings
WordNet synsets for the noun “plant”
1. plant, works, industrial plant
2. plant, flora, plant life
A semantic network adds:
Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF),
antonymy, entailment, etc.
WordNet related concepts for the meaning “plant life”
{plant, flora, plant life}
hypernym: {organism, being}
hyponym: {house plant}, {fungus}, …
meronym: {plant tissue}, {plant part}
holonym: {Plantae, kingdom Plantae, plant kingdom}
Slide 12
MRD – A Resource for Knowledge-based WSD
WordNet
A semantic network for general English
Slide 13
Lesk Algorithm
(Michael Lesk 1986): Identify senses of words in context
using definition overlap
Algorithm:
Retrieve from MRD all sense definitions of the words to be
disambiguated
Determine the definition overlap for all possible sense
combinations
Choose senses that lead to highest overlap
Example: disambiguate PINE CONE
• PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
• CONE
1. solid body which narrows to a point
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
Pine#1 ∩ Cone#1 = 0
Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1
Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2
Pine#2 ∩ Cone#3 = 0
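The pairwise scoring above can be sketched in a few lines of Python; the tokenizer, stopword list, and crude plural stripping below are assumptions, so exact overlap counts may differ slightly from the slide, but the winning combination (Pine#1, Cone#3) comes out the same:

```python
# A toy sketch of original Lesk on the PINE/CONE example above.
# Stopword list and "s"-stripping normalization are assumptions.
STOPWORDS = {"of", "to", "a", "the", "or", "which", "this",
             "whether", "through", "with"}

def content_words(definition):
    """Lowercase, drop stopwords, crudely strip plural 's'."""
    return {w.rstrip("s") for w in definition.lower().split()
            if w not in STOPWORDS}

def overlap(def_a, def_b):
    return len(content_words(def_a) & content_words(def_b))

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
CONE = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}

# Score every sense combination and keep the best one.
best = max(((p, c) for p in PINE for c in CONE),
           key=lambda pc: overlap(PINE[pc[0]], CONE[pc[1]]))
print(best)  # (1, 3): "evergreen" and "tree(s)" overlap
```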
Slide 14
Lesk Algorithm for More than Two
Words?
I saw a man who is 98 years old and can still walk and tell
jokes
nine open class words: see(26), man(11), year(4), old(8), can(5),
still(4), walk(10), tell(8), joke(3)
43,929,600 sense combinations! How to find the optimal
sense combination?
Simulated annealing (Cowie, Guthrie, Guthrie 1992)
Define a score E over a configuration of word senses in the text: the
total definition overlap (redundancy) among the chosen senses. Find
the configuration that maximizes E (a sketch follows this list):
1. Start with the configuration that assigns each word its most
frequent sense
2. At each iteration, replace the sense of a random word in the
configuration with a different sense, and measure E
3. Stop iterating when the configuration of senses no longer changes
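A minimal sketch of this search, simplified here to greedy hill climbing (true simulated annealing also accepts some score-decreasing moves early on, governed by a cooling schedule); the `definitions` structure, mapping each word to a list of sense glosses ordered by frequency, is an assumed input:

```python
import random
from collections import Counter

def redundancy(config, words, definitions):
    """E: shared vocabulary across the currently chosen definitions."""
    bag = Counter()
    for word, sense in zip(words, config):
        bag.update(set(definitions[word][sense].lower().split()))
    # Each token counts once for every definition beyond the first
    # in which it appears.
    return sum(n - 1 for n in bag.values() if n > 1)

def disambiguate(words, definitions, iterations=10000):
    config = [0] * len(words)  # sense 0 = most frequent sense
    best = redundancy(config, words, definitions)
    for _ in range(iterations):
        i = random.randrange(len(words))
        candidate = list(config)
        candidate[i] = random.randrange(len(definitions[words[i]]))
        score = redundancy(candidate, words, definitions)
        if score > best:  # greedy; annealing would sometimes accept worse
            config, best = candidate, score
    return config
```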
Slide 15
Lesk Algorithm: A Simplified
Version
Original Lesk definition: measure overlap between sense definitions
for all words in context
Identify simultaneously the correct senses for all words in context
Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap
between sense definitions of a word and current context
Identify the correct sense for one word at a time
Search space significantly reduced
Slide 16
Lesk Algorithm: A Simplified
Version
• Algorithm for simplified Lesk:
1. Retrieve from MRD all sense definitions of the word to be
disambiguated
2. Determine the overlap between each sense definition and the
current context
3. Choose the sense that leads to highest overlap
Example: disambiguate PINE in
“Pine cones hanging in a tree”
• PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
Pine#1 ∩ Sentence = 1
Pine#2 ∩ Sentence = 0
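A corresponding sketch of simplified Lesk, with the same assumed tokenization caveats as before:

```python
STOPWORDS = {"in", "a", "of", "or", "with", "through"}

def tokens(text):
    return {w.rstrip("s") for w in text.lower().split()
            if w not in STOPWORDS}

def simplified_lesk(senses, context):
    # Score each sense by its gloss's overlap with the context tokens.
    ctx = tokens(context)
    return max(senses, key=lambda s: len(tokens(senses[s]) & ctx))

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
print(simplified_lesk(PINE, "Pine cones hanging in a tree"))  # -> 1 (via "tree")
```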
Slide 17
Evaluations of Lesk Algorithm
Initial evaluation by M. Lesk
50-70% on short samples of text manually annotated set, with
respect to Oxford Advanced Learner’s Dictionary
Simulated annealing
47% on 50 manually annotated sentences
Evaluation on Senseval-2 all-words data, with back-off to
random sense (Mihalcea & Tarau 2004)
Original Lesk: 35%
Simplified Lesk: 47%
Evaluation on Senseval-2 all-words data, with back-off to most
frequent sense (Vasilescu, Langlais, Lapalme 2004)
Original Lesk: 42%
Simplified Lesk: 58%
Slide 18
Selectional Preferences
A way to constrain the possible meanings of words in a given
context
E.g. “Wash a dish” vs. “Cook a dish”
WASH-OBJECT vs. COOK-FOOD
Capture information about possible relations between semantic
classes
Common sense knowledge
Alternative terminology
Selectional Restrictions
Selectional Preferences
Selectional Constraints
Slide 19
Acquiring Selectional Preferences
From raw corpora
Frequency counts
Information theory measures
Class-to-class relations
Slide 20
Preliminaries: Learning Word-to-Word Relations
An indication of the semantic fit between two words
1. Frequency counts
Pairs of words connected by a syntactic relation R: $Count(W_1, W_2, R)$
2. Conditional probabilities
Condition on one of the words:
$$P(W_1 \mid W_2, R) = \frac{Count(W_1, W_2, R)}{Count(W_2, R)}$$
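As a concrete illustration, these counts can be accumulated from dependency triples; the parser output format below is an assumption:

```python
from collections import Counter

# Assumed input: (word1, word2, relation) triples from a parser.
triples = [("drink", "tea", "obj"), ("drink", "tea", "obj"),
           ("drink", "beer", "obj"), ("see", "man", "obj")]

pair_counts = Counter(triples)                        # Count(W1, W2, R)
w2_counts = Counter((w2, r) for _, w2, r in triples)  # Count(W2, R)

def p_w1_given_w2(w1, w2, r):
    """P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)."""
    return pair_counts[(w1, w2, r)] / w2_counts[(w2, r)]

print(p_w1_given_w2("drink", "tea", "obj"))  # 2/2 = 1.0
```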
Slide 21
Learning Selectional Preferences
(1)
Word-to-class relations (Resnik 1993)
Quantify the contribution of a semantic class using all the
concepts subsumed by that class
$$A(W_1, C_2, R) = \frac{P(C_2 \mid W_1, R)\,\log \frac{P(C_2 \mid W_1, R)}{P(C_2)}}{\sum_{C_2} P(C_2 \mid W_1, R)\,\log \frac{P(C_2 \mid W_1, R)}{P(C_2)}}$$
where
$$P(C_2 \mid W_1, R) = \frac{Count(W_1, C_2, R)}{Count(W_1, R)}$$
$$Count(W_1, C_2, R) = \sum_{W_2 \in C_2} \frac{Count(W_1, W_2, R)}{Count(W_2)}$$
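A small sketch of the association score, assuming the probabilities have already been estimated as above (the numbers below are made up for illustration):

```python
import math

def association(p_c_given_w, p_c, target_class):
    """A(W1, C2, R): the class's share of the total divergence."""
    def term(c):
        p = p_c_given_w[c]
        return p * math.log(p / p_c[c]) if p > 0 else 0.0
    strength = sum(term(c) for c in p_c_given_w)  # preference strength
    return term(target_class) / strength

# Toy estimates for the verb "drink" with the object relation.
p_c_given_drink = {"beverage": 0.7, "artifact": 0.2, "person": 0.1}
p_c = {"beverage": 0.1, "artifact": 0.5, "person": 0.4}
print(association(p_c_given_drink, p_c, "beverage"))
```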
Slide 22
Learning Selectional Preferences
(2)
Determine the contribution of a word sense based on the
assumption of equal sense distributions:
e.g. “plant” has two senses: 50% of occurrences are sense 1,
50% are sense 2
Example: learning restrictions for the verb “to drink”
Find high-scoring verb–object pairs:

  Co-occ score   Verb    Object
  11.75          drink   tea
  11.75          drink   Pepsi
  11.75          drink   champagne
  10.53          drink   liquid
  10.2           drink   beer
  9.34           drink   wine

Find “prototypical” object classes (high association score):

  A(v,c)   Object class
  3.58     (beverage, [drink, …])
  2.05     (alcoholic_beverage, [intoxicant, …])
Slide 23
Using Selectional Preferences for
WSD
Algorithm:
1. Learn a large set of selectional preferences for a given
syntactic relation R
2. Given a pair of words W1–W2 connected by a relation R
3. Find all selectional preferences W1–C (word-to-class) or C1–C2
(class-to-class) that apply
4. Select the meanings of W1 and W2 based on the selected
semantic class
Example: disambiguate coffee in “drink coffee”
1. (beverage) a beverage consisting of an infusion of ground coffee beans
2. (tree) any of several small trees native to the tropical Old World
3. (color) a medium to dark brown color
Given the selectional preference “DRINK BEVERAGE” : coffee#1
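A minimal sketch of this lookup step; both the preference table and the sense-to-class mapping are illustrative assumptions rather than learned data:

```python
# Assumed learned preferences: (verb, relation) -> preferred object class.
preferences = {("drink", "obj"): "beverage"}

# Assumed sense inventory for the noun "coffee", mapped to classes.
coffee_senses = {"coffee#1": "beverage",
                 "coffee#2": "tree",
                 "coffee#3": "color"}

def disambiguate(verb, relation, senses):
    preferred = preferences.get((verb, relation))
    for sense, semantic_class in senses.items():
        if semantic_class == preferred:
            return sense  # sense whose class satisfies the preference
    return next(iter(senses))  # fallback: first-listed sense

print(disambiguate("drink", "obj", coffee_senses))  # coffee#1
```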
Slide 24
Evaluation of Selectional
Preferences for WSD
Data set
mainly on verb-object, subject-verb relations extracted from
SemCor
Compare against random baseline
Results (Agirre and Martinez, 2000)
Average results on 8 nouns
Similar figures reported in (Resnik 1997)
                  Object               Subject
                  Precision   Recall   Precision   Recall
  Random          19.2        19.2     19.2        19.2
  Word-to-word    95.9        24.9     74.2        18.0
  Word-to-class   66.9        58.0     56.2        46.8
  Class-to-class  66.6        64.8     54.0        53.7
Slide 25
Semantic Similarity
Words in a discourse must be related in meaning for the discourse
to be coherent (Halliday and Hasan, 1976)
Use this property for WSD – Identify related meanings for words
that share a common context
Context span:
1. Local context: semantic similarity between pairs of words
2. Global context: lexical chains
Slide 26
Semantic Similarity in a Local
Context
Similarity determined between pairs of concepts, or between a
word and its surrounding context
Relies on similarity metrics on semantic networks
[Figure: fragment of a WordNet-style IS-A hierarchy rooted at carnivore, with nodes such as fissiped mammal (fissiped), canine/canid (wolf, dingo, wild dog, hyena dog, dog), hunting dog (dachshund, terrier), feline/felid, bear, and hyena]
Slide 27
Semantic Similarity in a Local
Context
Similarity based on path length
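The slide's formula did not survive extraction; one standard path-based measure that fits this description (an assumption, not necessarily the one shown originally) is Leacock & Chodorow's:

$$sim(c_1, c_2) = -\log \frac{pathlen(c_1, c_2)}{2D}$$

where $pathlen(c_1, c_2)$ is the number of edges between the two concepts in the IS-A hierarchy and $D$ is the maximum depth of the taxonomy.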
Slide 28
Semantic Similarity in a Local
Context
Similarity based on information theory
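Again the original formula is missing; a standard information-theoretic measure of this kind (an assumption) is Resnik's:

$$sim(c_1, c_2) = IC(LCS(c_1, c_2)) = -\log P(LCS(c_1, c_2))$$

where $LCS(c_1, c_2)$ is the lowest common subsumer of the two concepts and $P$ is its corpus-estimated probability.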
Slide 29
Semantic Similarity Metrics for
WSD
Disambiguate target words based on similarity with one word
to the left and one word to the right
Example: disambiguate PLANT in “plant with flowers”
PLANT
1. plant, works, industrial plant
2. plant, flora, plant life
Similarity (plant#1, flower) = 0.2
Similarity (plant#2, flower) = 1.5
→ plant#2
Evaluation:
1,723 ambiguous nouns from Senseval-2
Among 5 similarity metrics, (Jiang and Conrath 1997) provide the
best precision (39%)
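A hedged sketch of this local-context scheme using NLTK's WordNet interface, with path similarity standing in for Jiang & Conrath (which additionally needs an information-content file); requires `pip install nltk` and `nltk.download("wordnet")`:

```python
from nltk.corpus import wordnet as wn

def disambiguate(target, context_word):
    """Pick the sense of `target` most similar to the context word."""
    context_senses = wn.synsets(context_word, pos=wn.NOUN)
    def score(sense):
        return max((sense.path_similarity(c) or 0.0)
                   for c in context_senses)
    return max(wn.synsets(target, pos=wn.NOUN), key=score)

best = disambiguate("plant", "flower")
print(best.name(), "-", best.definition())  # expected: the flora sense
```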
Slide 30
Semantic Similarity in a Global
Context
Lexical chains (Hirst and St-Onge 1998)
“A lexical chain is a sequence of semantically related words,
which creates a context and contributes to the continuity of
meaning and the coherence of a discourse”
Algorithm for finding lexical chains (a code sketch follows the list):
Select the candidate words from the text. These are words for which
we can compute similarity measures, and therefore most of the
time they have the same part of speech.
For each such candidate word, and for each meaning for this word,
find a chain to receive the candidate word sense, based on a
semantic relatedness measure between the concepts that are
already in the chain, and the candidate word meaning.
If such a chain is found, insert the word in this chain; otherwise,
create a new chain.
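A greedy sketch of the chain-insertion step; the relatedness test is a toy stand-in (a real implementation would use WordNet relations), and choosing among a word's competing senses is omitted:

```python
def related(sense_a, sense_b):
    # Toy stand-in: senses tagged "word/class" relate via equal class.
    return sense_a.split("/")[1] == sense_b.split("/")[1]

def build_chains(candidate_senses):
    """Insert each candidate sense into the first compatible chain."""
    chains = []
    for sense in candidate_senses:
        for chain in chains:
            if any(related(sense, member) for member in chain):
                chain.append(sense)
                break
        else:
            chains.append([sense])  # no compatible chain: start a new one
    return chains

senses = ["train/vehicle", "travel/motion", "rail/vehicle", "rail/bird"]
print(build_chains(senses))
# [['train/vehicle', 'rail/vehicle'], ['travel/motion'], ['rail/bird']]
```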
Slide 31
Semantic Similarity in a Global
Context
A very long train traveling along the rails with a constant velocity v in a
certain direction …
train
#1: public transport
#2: ordered set of things
#3: piece of cloth
travel
#1: change location
#2: undergo transportation
rail
#1: a barrier
#2: a bar of steel for trains
#3: a small bird
Slide 32
Lexical Chains for WSD
Identify lexical chains in a text
Usually target one part of speech at a time
Identify the meaning of words based on their membership in a
lexical chain
Evaluation:
(Galley and McKeown 2003) lexical chains on 74 SemCor texts
give 62.09%
(Mihalcea and Moldovan 2000) on five SemCor texts give 90% with
60% recall
lexical chains “anchored” on monosemous words
(Okumura and Honda 1994) lexical chains on five Japanese texts
give 63.4%
Slide 33
Heuristics: One Sense Per
Discourse
A word tends to preserve its meaning across all its occurrences in a
given discourse (Gale, Church, Yarowsky 1992)
What does this mean?
E.g. The ambiguous word PLANT occurs 10 times in a discourse
all instances of “plant” carry the same meaning
Evaluation:
8 words with two-way ambiguity, e.g. plant, crane, etc.
98% of word occurrence pairs in the same discourse carry the same
meaning
A grain of salt: performance depends on sense granularity
(Krovetz 1998) experiments with words with more than two senses
Performance of “one sense per discourse” measured on SemCor is
approx. 70%
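The heuristic itself is a one-liner over per-occurrence predictions; this sketch (with made-up labels) relabels every occurrence in a discourse with the majority sense:

```python
from collections import Counter

# Noisy per-occurrence sense predictions for "plant" in one discourse.
predictions = ["plant#2", "plant#2", "plant#1", "plant#2", "plant#2"]

majority_sense, _ = Counter(predictions).most_common(1)[0]
smoothed = [majority_sense] * len(predictions)
print(smoothed)  # every occurrence now labeled "plant#2"
```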
Slide 34
Heuristics: One Sense per
Collocation
A word tends to preserve its meaning when used in the same
collocation (Yarowsky 1993)
Strong for adjacent collocations
Weaker as the distance between words increases
The ambiguous word PLANT preserves its meaning in all its
occurrences within the collocation “industrial plant”, regardless
of the context where this collocation occurs
Evaluation:
97% precision on words with two-way ambiguity
Finer granularity:
(Martinez and Agirre 2000) tested the “one sense per collocation”
hypothesis on text annotated with WordNet senses
70% precision on SemCor words
Slide 35