Word Sense Disambiguation

Word Meaning & Word
Sense Disambiguation
CMSC 723 / LING 723 / INST 725
MARINE CARPUAT
[email protected]
Today
• Representing word meaning
• Word sense disambiguation as supervised
classification
• Word sense disambiguation without
annotated examples
Drunk gets nine years in violin case.
http://www.ling.upenn.edu/~beatrice/humor/headlines.html
How do we know that a word
(lemma) has distinct senses?
• Linguists often design tests for this purpose
• e.g., zeugma combines distinct senses in an uncomfortable way:
– Which flight serves breakfast?
– Which flights serve Tucson?
– *Which flights serve breakfast and Tucson?
Where can we look up the
meaning of words?
• Dictionary?
Word Senses
• “Word sense” = distinct meaning of a word
• Same word, different senses
– Homonyms (homonymy): unrelated senses; identical
orthographic form is coincidental
– Polysemes (polysemy): related, but distinct senses
– Metonyms (metonymy): “stand in”, technically, a subcase of polysemy
• Different word, same sense
– Synonyms (synonymy)
• Homophones: same pronunciation,
different orthography, different meaning
– Examples: would/wood, to/too/two
• Homographs: distinct senses, same
orthographic form, different pronunciation
– Examples: bass (fish) vs. bass (instrument)
Relationship Between Senses
• IS-A relationships
– From specific to general (up): hypernym
(hypernymy)
– From general to specific (down): hyponym
(hyponymy)
• Part-Whole relationships
– wheel is a meronym of car (meronymy)
– car is a holonym of wheel (holonymy)
WordNet:
a lexical database for English
https://wordnet.princeton.edu/
• Includes most English nouns, verbs, adjectives,
adverbs
• Electronic format makes it amenable to
automatic manipulation: used in many NLP
applications
• “WordNets” generically refers to similar
resources in other languages
WordNet: History
• Research in artificial intelligence:
– How do humans store and access knowledge about
concepts?
– Hypothesis: concepts are interconnected via
meaningful relations
– Useful for reasoning
• The WordNet project started in 1986
– Can most (all?) of the words in a language be
represented as a semantic network where words are
interlinked by meaning?
– If so, the result would be a large semantic network…
Synonymy in WordNet
• WordNet is organized in terms of “synsets”
– Unordered set of (roughly) synonymous
“words” (or multi-word phrases)
• Each synset expresses a distinct
meaning/concept
WordNet: Example
Noun
{pipe, tobacco pipe} (a tube with a small bowl at one end; used for
smoking tobacco)
{pipe, pipage, piping} (a long tube made of metal or plastic that is used
to carry water or oil or gas etc.)
{pipe, tube} (a hollow cylindrical shape)
{pipe} (a tubular wind instrument)
{organ pipe, pipe, pipework} (the flues and stops on a pipe organ)
Verb
{shriek, shrill, pipe up, pipe} (utter a shrill cry)
{pipe} (transport by pipeline) “pipe oil, water, and gas into the desert”
{pipe} (play on a pipe) “pipe a tune”
{pipe} (trim with piping) “pipe the skirt”
What do you think of the sense granularity?
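As an illustration (not part of the original slides), the same inventory of senses can be listed with NLTK's WordNet interface:

```python
# Minimal sketch using NLTK's WordNet interface
# (assumes the corpus is installed: import nltk; nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for synset in wn.synsets('pipe'):
    # e.g. pipe.n.01 ['pipe', 'tobacco_pipe'] "a tube with a small bowl at one end; ..."
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()], synset.definition())
```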
The “Net” Part of WordNet
[Figure: fragment of the WordNet noun network around {car; auto; automobile; machine; motorcar}. Hypernym links run upward through {motor vehicle; automotive vehicle} and {vehicle} to {conveyance; transport}; {cruiser; squad car; patrol car; police car; prowl car} and {cab; taxi; hack; taxicab} link upward to the car synset in turn. Meronym links connect {car; …} and {car door} to parts such as {bumper}, {car window}, {car mirror}, {door lock}, {hinge; flexible joint}, and {armrest}.]
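Links like the ones in this figure can be traversed programmatically; the sketch below is an illustration using NLTK's WordNet interface, with synset identifiers as NLTK names them.

```python
# Sketch: walking WordNet relations around the "car" synset with NLTK.
from nltk.corpus import wordnet as wn   # assumes nltk.download('wordnet')

car = wn.synset('car.n.01')             # {car, auto, automobile, machine, motorcar}

print(car.hypernyms())                  # one step up the IS-A hierarchy (motor vehicle)
print(car.hypernym_paths()[0])          # full hypernym chain up to the root synset
print(car.hyponyms()[:5])               # more specific kinds of car (cab, cruiser, ...)
print(car.part_meronyms())              # parts of a car (bumper, car door, ...)
print(wn.synset('car_door.n.01').part_holonyms())   # the whole a car door belongs to
```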
WordNet 3.0: Size
Part of speech   Word forms   Synsets
Noun             117,798      82,115
Verb             11,529       13,767
Adjective        21,479       18,156
Adverb           4,481        3,621
Total            155,287      117,659
http://wordnet.princeton.edu/
Word Sense
From WordNet:
Noun
{pipe, tobacco pipe} (a tube with a small bowl at one end; used for
smoking tobacco)
{pipe, pipage, piping} (a long tube made of metal or plastic that is used
to carry water or oil or gas etc.)
{pipe, tube} (a hollow cylindrical shape)
{pipe} (a tubular wind instrument)
{organ pipe, pipe, pipework} (the flues and stops on a pipe organ)
Verb
{shriek, shrill, pipe up, pipe} (utter a shrill cry)
{pipe} (transport by pipeline) “pipe oil, water, and gas into the desert”
{pipe} (play on a pipe) “pipe a tune”
{pipe} (trim with piping) “pipe the skirt”
WORD SENSE DISAMBIGUATION
Word Sense Disambiguation
• Task: automatically select the correct sense of a
word
– Input: a word in context
– Output: sense of the word
– Can be framed as lexical sample (focus on one word type
at a time) or all-words (disambiguate all content words in
a document)
• Motivated by many applications:
– Information retrieval
– Machine translation
–…
How big is the problem?
• Most words in English have only one sense
– 62% in Longman’s Dictionary of Contemporary English
– 79% in WordNet
• But the others tend to have several senses
– Average of 3.83 in LDOCE
– Average of 2.96 in WordNet
• Ambiguous words are more frequently used
– In the British National Corpus, 84% of word tokens are instances of
words with more than one sense
• Some senses are more frequent than others
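Statistics of this kind can be recomputed from WordNet itself; the sketch below is an illustration using NLTK, and its exact numbers depend on the bundled WordNet version, so they will not exactly reproduce the figures above.

```python
# Sketch: how often are lemmas ambiguous in WordNet (via NLTK)?
from nltk.corpus import wordnet as wn   # assumes nltk.download('wordnet')

sense_counts = [len(wn.synsets(lemma)) for lemma in wn.all_lemma_names()]

monosemous = sum(1 for n in sense_counts if n == 1)
polysemous = [n for n in sense_counts if n > 1]

print(f"monosemous lemmas: {100 * monosemous / len(sense_counts):.1f}%")
print(f"average senses per polysemous lemma: {sum(polysemous) / len(polysemous):.2f}")
```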
Ground Truth
• Which sense inventory do we use?
Existing Corpora
• Lexical sample
– line-hard-serve corpus (4k sense-tagged
examples)
– interest corpus (2,369 sense-tagged examples)
–…
• All-words
– SemCor (234k words, subset of Brown Corpus)
– Senseval/SemEval (2081 tagged content
words from 5k total words)
–…
Evaluation
• Intrinsic
– Measure accuracy of sense selection wrt
ground truth
• Extrinsic
– Integrate WSD as part of a bigger end-to-end
system, e.g., machine translation or
information retrieval
– Compare WSD
Baseline Performance
• Baseline: most frequent sense
– Equivalent to “take first sense” in WordNet
– Does surprisingly well!
62% accuracy in this case!
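As a sketch of this baseline (assuming NLTK's WordNet, which lists a word's synsets roughly in order of tagged-corpus frequency), taking the first sense looks like this:

```python
# Most-frequent-sense baseline: take WordNet's first-listed synset.
from nltk.corpus import wordnet as wn   # assumes nltk.download('wordnet')

def most_frequent_sense(word, pos=None):
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None   # WordNet orders senses by tagged-corpus frequency

print(most_frequent_sense('pipe', pos=wn.NOUN))   # e.g. Synset('pipe.n.01')
```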
Upper Bound Performance
• Upper bound
– Fine-grained WordNet sense: 75-80% human
agreement
– Coarser-grained inventories: 90% human
agreement possible
WSD as Supervised Classification
[Figure: the standard supervised classification pipeline. Training: sense-labeled training data goes through feature functions into a supervised machine learning algorithm, which produces a classifier. Testing: an unlabeled document goes through the same feature functions, and the classifier predicts its label.]
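A minimal concrete sketch of this pipeline for a lexical-sample task, using bag-of-words context features and a scikit-learn classifier; the toy sentences and sense labels below are illustrative assumptions, not course data.

```python
# Toy lexical-sample WSD for "bass" as supervised text classification (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each training example: the context of one occurrence of "bass" + a sense label.
contexts = [
    "he caught a huge bass while fishing in the lake",
    "grilled bass with lemon was on the menu",
    "she plays bass in a jazz band",
    "turn up the bass on the stereo",
]
senses = ["bass_fish", "bass_fish", "bass_music", "bass_music"]

# Feature functions = bag of context words; learning algorithm = logistic regression.
classifier = make_pipeline(CountVectorizer(), LogisticRegression())
classifier.fit(contexts, senses)

# Testing: the unlabeled example goes through the same feature functions.
print(classifier.predict(["he plays bass in a rock band"]))   # e.g. ['bass_music']
```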
WSD Approaches
• Depending on use of manually created
knowledge sources
– Knowledge-lean
– Knowledge-rich
• Depending on use of labeled data
– Supervised
– Semi- or minimally supervised
– Unsupervised
Simplest WSD algorithm:
Lesk’s Algorithm
• Intuition: note word overlap between context
and dictionary entries
– Unsupervised, but knowledge rich
The bank can guarantee deposits will eventually cover future tuition
costs because it invests in adjustable-rate mortgage securities.
(compare the context with the WordNet glosses for the senses of “bank”)
Lesk’s Algorithm
• Simplest implementation:
– Count overlapping content words between glosses
and context (see the sketch below)
• Lots of variants:
– Include the examples in dictionary definitions
– Include hypernyms and hyponyms
– Give more weight to larger overlaps (e.g., bigrams)
– Give extra weight to infrequent words
– …
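A minimal sketch of the simplest implementation, assuming NLTK's WordNet glosses and a small hand-written stopword set (NLTK also ships a ready-made variant as nltk.wsd.lesk):

```python
# Simplified Lesk: pick the sense whose gloss shares the most content words
# with the context.
from nltk.corpus import wordnet as wn   # assumes nltk.download('wordnet')

STOPWORDS = {"the", "a", "an", "in", "of", "to", "it", "is", "will", "be", "and", "or", "that"}

def simplified_lesk(word, context_sentence):
    context = {w.lower().strip(".,") for w in context_sentence.split()} - STOPWORDS
    best_sense, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = {w.lower().strip(".,") for w in sense.definition().split()} - STOPWORDS
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

sentence = ("The bank can guarantee deposits will eventually cover future tuition "
            "costs because it invests in adjustable-rate mortgage securities.")
print(simplified_lesk("bank", sentence))   # gloss-overlap choice among WordNet's "bank" senses
```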
WSD Accuracy
• Generally
– Supervised approaches yield ~70-80%
accuracy
• However
– depends on actual word, sense inventory,
amount of training data, number of features,
etc.
WORD SENSE DISAMBIGUATION:
MINIMIZING SUPERVISION
Minimally Supervised WSD
• Problem: annotations are expensive!
• Solution 1: “Bootstrapping” or co-training
(Yarowsky 1995)
– Start with (small) seed, learn classifier
– Use classifier to label rest of corpus
– Retain “confident” labels, add to training set
– Learn new classifier
– Repeat… (see the sketch after this slide)
• Heuristics (derived from observation):
– One sense per discourse
– One sense per collocation
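The bootstrapping loop can be sketched as follows; the choice of learner (bag-of-words features with logistic regression) and the 0.95 confidence threshold are illustrative assumptions, not Yarowsky's exact decision-list setup.

```python
# Schematic bootstrapping loop for minimally supervised WSD.
# The seed set must contain labeled examples of every sense.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

CONFIDENCE_THRESHOLD = 0.95   # illustrative value

def train(labeled):
    contexts, senses = zip(*labeled)
    clf = make_pipeline(CountVectorizer(), LogisticRegression())
    clf.fit(list(contexts), list(senses))
    return clf

def bootstrap(seed_examples, unlabeled_contexts):
    labeled, remaining = list(seed_examples), list(unlabeled_contexts)
    while remaining:
        clf = train(labeled)                           # learn classifier from current labels
        probabilities = clf.predict_proba(remaining)   # sense probabilities per context
        confident, still_unlabeled = [], []
        for context, row in zip(remaining, probabilities):
            best = row.argmax()
            if row[best] >= CONFIDENCE_THRESHOLD:      # retain only "confident" labels
                confident.append((context, clf.classes_[best]))
            else:
                still_unlabeled.append(context)
        if not confident:                              # stop: no more training data is covered
            break
        labeled.extend(confident)                      # add them to the training set and repeat
        remaining = still_unlabeled
    return train(labeled)
```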
One Sense per Discourse
A word tends to preserve its meaning across all its
occurrences in a given discourse
• Gale et al. 1992
– 8 words with two-way ambiguity, e.g. plant, crane, etc.
– 98% of the time, two occurrences of a word in the same discourse
carry the same meaning
• Krovetz 1998
– Heuristic true mostly for coarse-grained senses and for
homonymy rather than polysemy
– Performance of “one sense per discourse” measured on SemCor
is approximately 70%
One Sense per Collocation
A word tends to preserve its meaning when used in
the same collocation
– Strong for adjacent collocations
– Weaker as the distance between words increases
• Evaluation:
– 97% precision on words with two-way ambiguity
– Again, accuracy depends on granularity:
• 70% precision on SemCor words
Yarowsky’s Method: Stopping
• Stop when:
– Error on training data is less than a threshold
– No more training data is covered
Yarowsky’s Method: Discussion
• Advantages
– Accuracy is about as good as a supervised
algorithm
– Bootstrapping: far less manual effort
• Disadvantages
– Seeds may be tricky to construct
– Works only for coarse-grained sense
distinctions
– Snowballing error with co-training
WSD with a 2nd language
as supervision
• Problem: annotations are expensive!
• What’s the “proper” sense inventory?
– How fine or coarse grained?
– Application specific?
• Observation: multiple senses translate to
different words in other languages
– Use the foreign language as the sense inventory
– Added bonus: annotations for free! (Using machine
translation data)
Today
• Representing word senses & word relations
– WordNet
• Word Sense Disambiguation
– Lesk Algorithm
– Supervised classification
– Minimizing supervision
• Co-training: combine 2 views of data
• Use parallel text
• Next: general techniques for learning without
annotated training examples
– If needed, review probability and expectations