Word Sense Disambiguation - Computational Linguistics

L645 / B659
(Some material from Jurafsky & Martin (2009) + Manning & Schütze (2000))
Dept. of Linguistics, Indiana University
Fall 2015

Outline:
- Word Sense Disambiguation
- Supervised WSD
- WSD evaluation
- Feature extraction
- Naive Bayes
- Lesk algorithm
- Heuristic-based WSD
- Similarity-based WSD
- Translation-based WSD
- Unsupervised WSD
- Modern WSD

Context: Lexical Semantics

A (word) sense represents one meaning of a word:
- bank1: financial institution
- bank2: sloped ground near water

Various relations:
- homonymy: two words/senses happen to sound the same (e.g., bank1 & bank2)
- polysemy: two senses have some semantic relation between them
  - e.g., bank1 & bank3 = repository for biological entities

Context: WordNet

WordNet (http://wordnet.princeton.edu/) is a database of lexical relations:
- Nouns (117,798); verbs (11,529); adjectives (21,479) & adverbs (4,481)
- https://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html

WordNet contains the different senses of a word, defined by synsets (synonym sets):
- {chump1, fool2, gull1, mark9, patsy1, fall guy1, sucker1, soft touch1, mug2}
  - Words are substitutable in some contexts
  - gloss: a person who is gullible and easy to take advantage of

See http://babelnet.org for other languages

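As a quick illustration (not from the original slides), here is a minimal sketch of querying these synsets with NLTK's WordNet interface, assuming nltk is installed and the wordnet data has been downloaded:

```python
# Minimal sketch: list the WordNet senses of "bank" via NLTK
# (assumes: pip install nltk, then nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

for synset in wn.synsets('bank'):
    # one synset = one sense: its member lemmas plus a gloss
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])
    print('  gloss:', synset.definition())
```
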
Word Sense Disambiguation (WSD)

Word Sense Disambiguation (WSD): determine the proper sense of an ambiguous word in a given context

e.g., Given the word bank, is it:
- the rising ground bordering a body of water?
- an establishment for exchanging funds?
- Or maybe a repository (e.g., blood bank)?

WSD comes in two variants:
- Lexical sample task: small pre-selected set of target words (along with a sense inventory)
- All-words task: disambiguate every content word in entire texts

Our goal: get a flavor for the insights behind WSD & what the techniques need to accomplish

Supervised WSD

Supervised WSD: extract features which are helpful for particular senses & train a classifier to assign the correct sense
- Lexical sample task: labeled corpora for the individual target words
- All-words disambiguation task: use a semantic concordance, i.e., a corpus in which every content word is sense-tagged (e.g., SemCor)

WSD Evaluation

- Extrinsic (in vivo) evaluation: evaluate WSD in the context of another task, e.g., question answering
- Intrinsic (in vitro) evaluation: evaluate WSD as a stand-alone system
  - Exact-match sense accuracy
  - Precision/recall measures, if systems pass on some labelings

Baselines:
- Most frequent sense (MFS): for WordNet, take the first sense (see the sketch below)
- Lesk algorithm (later)

Ceiling: inter-annotator agreement, generally 75-80%

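Since WordNet lists a word's synsets in frequency order, the MFS baseline is nearly a one-liner with NLTK; a minimal sketch, assuming nltk and its wordnet data are installed:

```python
# Most-frequent-sense (MFS) baseline: WordNet orders a word's synsets by
# frequency, so the first listed synset is the MFS (assumes NLTK + wordnet).
from nltk.corpus import wordnet as wn

def mfs_baseline(word, pos=None):
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

print(mfs_baseline('bank'))  # synset of the most frequent sense of "bank"
```
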
Feature extraction

1. POS tag, lemmatize/stem, & perhaps parse the sentence in question
2. Extract context features within a certain window of the target word
   - Feature vector: numeric or nominal values encoding linguistic information

Feature extraction: Collocational features

Collocational features encode information about specific positions to the left or right of a target word
- capture local lexical & grammatical information

Consider: An electric guitar and bass player stand off to one side, not really part of the scene ...
- [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]
- [guitar, NN, and, CC, player, NN, stand, VB]

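A minimal sketch of extracting such a ±2 window of word/POS features; the function name and the pre-tagged input format are assumptions for illustration:

```python
# Sketch: collocational features around the target index i, assuming the
# sentence is already POS-tagged into (word, tag) pairs; '<PAD>' marks
# positions that fall outside the sentence.
def collocational_features(tagged, i, window=2):
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the target word itself
        j = i + offset
        if 0 <= j < len(tagged):
            feats.extend(tagged[j])          # word and POS at this position
        else:
            feats.extend(('<PAD>', '<PAD>'))
    return feats

tagged = [('an', 'DT'), ('electric', 'JJ'), ('guitar', 'NN'), ('and', 'CC'),
          ('bass', 'NN'), ('player', 'NN'), ('stand', 'VB'), ('off', 'RP')]
print(collocational_features(tagged, 4))
# -> ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']
```
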
Feature extraction: Bag-of-words features

Bag-of-words features encode unordered sets of surrounding words, ignoring exact position
- Captures more semantic properties & the general topic of the discourse
- Vocabulary for surrounding words usually pre-defined

e.g., the 12 most frequent content words from bass sentences in the WSJ:
- [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]

leading to this feature vector for the sentence above:
- [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

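The corresponding bag-of-words vector is a simple membership check against the fixed vocabulary; a sketch reproducing the slide's example:

```python
# Sketch: binary bag-of-words vector over a fixed, pre-defined vocabulary.
VOCAB = ['fishing', 'big', 'sound', 'player', 'fly', 'rod', 'pound',
         'double', 'runs', 'playing', 'guitar', 'band']

def bow_vector(context_words, vocab=VOCAB):
    words = set(w.lower() for w in context_words)
    return [1 if v in words else 0 for v in vocab]

context = 'an electric guitar and bass player stand off to one side'.split()
print(bow_vector(context))
# -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
```
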
Bayesian WSD

- Look at a context of surrounding words, call it c, within a window of a particular size
- Select the best sense s from among the different senses s_k:

(1) s = argmax_{s_k} P(s_k | c)
      = argmax_{s_k} P(c | s_k) P(s_k) / P(c)
      = argmax_{s_k} P(c | s_k) P(s_k)

It is computationally simpler to work with logarithms, giving:

(2) s = argmax_{s_k} [log P(c | s_k) + log P(s_k)]

Naive Bayes assumption

- Treat the context c as a bag of words v_j
- Make the Naive Bayes assumption that every surrounding word v_j is independent of the other ones:

(3) P(c | s_k) = ∏_{v_j ∈ c} P(v_j | s_k)

(4) s = argmax_{s_k} [∑_{v_j ∈ c} log P(v_j | s_k) + log P(s_k)]

We get maximum likelihood estimates from the corpus to obtain P(s_k) and P(v_j | s_k)

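A minimal sketch of this classifier, i.e., Eqs. (3)-(4) with add-one smoothing; the training-data format, (context_words, sense) pairs, is an assumption for illustration:

```python
# Sketch: Naive Bayes WSD with add-one smoothing (cf. Eq. 4).
# Training data format (an assumption): iterable of (context_words, sense).
import math
from collections import Counter, defaultdict

def train(data):
    sense_counts = Counter()                 # counts for P(s_k)
    word_counts = defaultdict(Counter)       # counts for P(v_j | s_k)
    vocab = set()
    for context, sense in data:
        sense_counts[sense] += 1
        for w in context:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def disambiguate(context, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best, best_score = None, float('-inf')
    for sense, count in sense_counts.items():
        n = sum(word_counts[sense].values())
        score = math.log(count / total)      # log P(s_k)
        for w in context:                    # + sum of log P(v_j | s_k)
            score += math.log((word_counts[sense][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = sense, score
    return best
```
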
Dictionary-based WSD: Lesk algorithm

Use general characterizations of the senses to aid in disambiguation

Intuition: words found in a particular sense definition can provide contextual cues, e.g., for ash:

  Sense              Definition
  s1: tree           a tree of the olive family
  s2: burned stuff   the solid residue left when combustible material is burned

If tree is in the context of ash, the sense is more likely s1

Lesk algorithm

Look at the words within the sense definition and the words within the definitions of the context words, too (unioning over their different senses):

1. Take all senses s_k of a word w and gather the set of words for each definition
   - Treat it as a bag of words
2. Gather all the words in the definitions of the surrounding words, within some context window
3. Calculate the overlap
4. Choose the sense with the highest overlap

(A sketch of a simplified variant follows below.)

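A sketch of the popular simplified variant, which compares the context only against each sense's own WordNet gloss rather than the definitions of the context words (NLTK also ships a similar implementation as nltk.wsd.lesk); assumes nltk and its wordnet data:

```python
# Sketch: simplified Lesk, choosing the sense whose gloss has the highest
# word overlap with the surrounding context (assumes NLTK + wordnet data).
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_words):
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = set(synset.definition().lower().split())
        overlap = len(gloss & context)       # step 3: count shared words
        if overlap > best_overlap:           # step 4: keep the best sense
            best, best_overlap = synset, overlap
    return best

sent = 'The ash is one of the last trees to come into leaf'.split()
print(simplified_lesk('ash', sent))
```
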
Example

(5) This cigar burns slowly and creates a stiff ash.

(6) The ash is one of the last trees to come into leaf.

So, sense s2 (burned stuff) goes with the first sentence and s1 (tree) with the second
- Note that, depending on the dictionary, leaf might also be a contextual cue for sense s1 of ash

Problems with dictionary-based WSD

- Not very accurate: 50%-70%
- Highly dependent upon the choice of dictionary
- Not always clear whether the dictionary definitions align with what we think of as different senses

Heuristic-based WSD

Can use a heuristic to automatically select seeds:
- One sense per discourse: the sense of a word is highly consistent within a given document
- One sense per collocation: collocations rarely have multiple senses associated with them

One sense per collocation

Rank senses based on the collocations the word appears in:
- e.g., show interest might be strongly correlated with the 'attention, concern' usage of interest
- The collocational feature could be a surrounding POS tag, or a word in the object position
- For a given context, select which collocational feature will be used to disambiguate, based on which feature is the strongest indicator (see the sketch below)
  - This avoids having to combine different pieces of information

Rankings are based on the following ratio, where f is a collocational feature:

(7) P(s_{k1} | f) / P(s_{k2} | f)

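A small sketch of scoring one feature by a smoothed log version of this ratio, in the style of a decision list; the counts, sense labels, and feature names below are hypothetical:

```python
# Sketch: score a collocational feature f by a smoothed log ratio of Eq. 7;
# for a fixed f, P(s1|f)/P(s2|f) reduces to a ratio of (sense, f) counts.
import math

counts = {  # hypothetical (sense, feature) counts from labeled data
    ('attention', 'show_interest'): 90, ('legal_share', 'show_interest'): 2,
    ('attention', 'interest_rate'): 1,  ('legal_share', 'interest_rate'): 60,
}

def log_ratio(feature, s1='attention', s2='legal_share', alpha=0.1):
    c1 = counts.get((s1, feature), 0) + alpha   # smoothed count for s1
    c2 = counts.get((s2, feature), 0) + alpha   # smoothed count for s2
    return math.log(c1 / c2)

print(log_ratio('show_interest'))  # strongly positive: favors 'attention'
print(log_ratio('interest_rate'))  # strongly negative: favors 'legal_share'
```
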
Calculating collocations

1. Initially, calculate the collocations for s_k
2. Calculate the contexts in which an ambiguous word is assigned to s_k, based on those collocations
3. Calculate the set of collocations that are most characteristic of the contexts for s_k, using the formula:

   (8) P(s_{k1} | f) / P(s_{k2} | f)

4. Repeat steps 2 & 3 until a threshold is reached.

Word similarity

Idea: expect synonyms to behave similarly

Define this in two ways:
- Knowledge-based: thesaurus-based WSD
- Knowledge-free: distributional methods

Word similarity computations are useful for IR, QA, summarization, language modeling, etc.

Thesaurus-based WSD

Use essentially the same set-up as dictionary-based WSD, but now:
- instead of requiring context words to have overlapping dictionary definitions,
- we require surrounding context words to list the focus word w (or the subject code of w) as one of their topics

e.g., if an animal or insect appears in the context of bass, we choose the fish sense instead of the musical one

Alternative: use path lengths in an ontology like WordNet to calculate word similarity (see the sketch below)

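A minimal sketch of that alternative with NLTK's WordNet interface, using stock example synsets rather than the bass senses; path_similarity scores are based on shortest paths in the hypernym hierarchy:

```python
# Sketch: WordNet path similarity via NLTK, i.e.,
# 1 / (shortest hypernym path length + 1), so nearby concepts score higher
# (assumes NLTK + wordnet data).
from nltk.corpus import wordnet as wn

dog, cat, guitar = (wn.synset(s) for s in ('dog.n.01', 'cat.n.01', 'guitar.n.01'))
print(dog.path_similarity(cat))     # relatively high: close in the hierarchy
print(dog.path_similarity(guitar))  # low: the concepts are far apart
```
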
Translation-based WSD

Idea: when disambiguating a word w, look for a combination of w and some contextual word which translates to a particular pair, indicating a particular sense
- interest can be 'legal share' (Beteiligung in German) or 'concern' (Interesse)
- In the phrase show interest (expressing concern), we are more likely to translate to Interesse zeigen than Beteiligung zeigen
- So, in this English context, the German translation tells us to go with the sense that corresponds to Interesse

Information-theoretic WSD

Instead of using all contextual features (which we otherwise assume to be independent), an information-theoretic approach tries to find one disambiguating feature
- Take a set of possible indicators and determine which is the best, i.e., which gives the highest mutual information with the senses in the training data

Possible indicators:
- object of the verb
- the verb tense
- word to the left
- word to the right
- etc.

When sense tagging, look up the value of that indicator to assign the tag (see the sketch below)

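A small sketch of scoring candidate indicators by mutual information from (indicator value, sense) observations; the observations below are hypothetical, and the indicator with the highest score would be selected:

```python
# Sketch: estimate I(indicator; sense) from joint observations; the best
# indicator is the one with the highest mutual information.
import math
from collections import Counter

def mutual_information(pairs):
    """pairs: (indicator_value, sense) observations from training data."""
    n = len(pairs)
    joint = Counter(pairs)                   # joint counts over (x, y)
    px = Counter(x for x, _ in pairs)        # marginal over indicator values
    py = Counter(y for _, y in pairs)        # marginal over senses
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

obs = [('money', 'finance'), ('money', 'finance'), ('river', 'shore'),
       ('river', 'shore'), ('money', 'shore')]   # hypothetical observations
print(mutual_information(obs))  # higher = better disambiguating indicator
```
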
Partitioning

More specifically, determine what the values x_i of the indicator indicate, i.e., what sense s_i they point to:
- Assume two senses (P1 and P2), which can be captured in subsets Q1 = {x_i | x_i indicates sense 1} and Q2 = {x_i | x_i indicates sense 2}
- We will have a set of indicator values Q; our goal is to partition Q into these two sets

The partition we choose is the one which maximizes the mutual information scores I(P1, Q1) and I(P2, Q2)
- The Flip-Flop algorithm is used when you have to automatically determine your senses (e.g., if using parallel text)

The Flip-Flop Algorithm (roughly)

1. Randomly partition P (possible senses/translations) into P1 and P2
2. While the mutual information score is still improving:
   2.1 Find the partition of Q (possible indicators) into Q1 and Q2 which maximizes I(P; Q)
       - Q might be the set of objects which appear after the verb in question
   2.2 Find the partition of P into P1 and P2 which maximizes I(P; Q)

Disambiguation

After determining the best indicator and partitioning its values, disambiguating is easy:
1. Determine the value x_i of the indicator for the ambiguous word.
2. If x_i is in Q1, assign sense 1; otherwise, sense 2.

This method is also applicable for determining which indicators are best for a set of translation words

Unsupervised WSD

Perform sense discrimination, or clustering
- In other words, group comparable senses together, even if you cannot give them a correct label

We will look briefly at the EM (Expectation-Maximization) algorithm for this task, based on a Bayesian model

EM algorithm: Bayesian review

Bayesian WSD for supervised learning:
- Look at a context of surrounding words, call it c (v_j = word in context), within a window of a particular size
- Select the best sense s from among the different senses s_k:

(9) s = argmax_{s_k} P(s_k | c)
      = argmax_{s_k} P(c | s_k) P(s_k) / P(c)
      = argmax_{s_k} P(c | s_k) P(s_k)
      = argmax_{s_k} [log P(c | s_k) + log P(s_k)]
      = argmax_{s_k} [∑_{v_j ∈ c} log P(v_j | s_k) + log P(s_k)]

Without labeled data, we need some other way to get estimates of P(s_k) and P(c | s_k)

EM algorithm

1. Initialize the parameters randomly, i.e., the probabilities for all senses and contexts
   - And decide K, the number of senses you want → this determines how fine-grained your distinctions are
2. While still improving:
   2.1 Expectation: re-estimate the probability of s_k generating each context c_i:

       (10) P̂(c_i | s_k) = P(c_i | s_k) / ∑_{k′=1}^{K} P(c_i | s_{k′})

       Recall that all contextual words v_j (i.e., the P(v_j | s_k)) are used to calculate P(c_i | s_k)

EM algorithm (cont.)

   2.2 Maximization: use the expected probabilities to re-estimate the parameters:

       (11) P(v_j | s_k) = ∑_{c_i : v_j ∈ c_i} P̂(c_i | s_k) / ∑_{k′} ∑_{c_i : v_j ∈ c_i} P̂(c_i | s_{k′})

       → Of all the times that v_j occurs in a context of any of this word's senses, how often does v_j indicate s_k?

       (12) P(s_k) = ∑_i P̂(c_i | s_k) / ∑_{k′} ∑_i P̂(c_i | s_{k′})

       → Of all the times that any sense generates c_i, how often does s_k generate it?

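To make the loop concrete, here is a minimal sketch of EM sense discrimination over bag-of-words contexts. It is an illustration under stated assumptions: contexts are lists of tokens, K is given, the E-step responsibilities include the prior P(s_k) (a common variant of Eq. 10), and a real implementation would work in log space to avoid underflow:

```python
# Sketch: EM for unsupervised sense discrimination (cf. Eqs. 10-12).
# Assumptions: contexts = lists of tokens; K senses; E-step responsibilities
# include the prior P(s_k); products would be log-sums in practice.
import math
import random
from collections import defaultdict

def em_senses(contexts, K, iters=20, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    p_s = [1.0 / K] * K                                   # P(s_k)
    p_w = []                                              # P(v_j | s_k)
    for _ in range(K):                                    # random init
        weights = {w: rng.random() for w in vocab}
        z = sum(weights.values())
        p_w.append({w: v / z for w, v in weights.items()})
    for _ in range(iters):
        # E-step: responsibility of sense k for context i (cf. Eq. 10)
        post = []
        for c in contexts:
            scores = [p_s[k] * math.prod(p_w[k][w] for w in c)
                      for k in range(K)]
            z = sum(scores) or 1.0
            post.append([s / z for s in scores])
        # M-step: re-estimate P(s_k) and P(v_j | s_k) (cf. Eqs. 11-12)
        for k in range(K):
            p_s[k] = sum(p[k] for p in post) / len(contexts)
            counts = defaultdict(float)
            for c, p in zip(contexts, post):
                for w in c:
                    counts[w] += p[k]
            z = sum(counts.values()) or 1.0
            p_w[k] = {w: (counts[w] + 1e-9) / z for w in vocab}
    return p_s, p_w, post

contexts = [['fishing', 'rod'], ['guitar', 'band'],
            ['fishing', 'big'], ['player', 'guitar']]
p_s, p_w, post = em_senses(contexts, K=2)
print(post)  # per-context sense responsibilities, i.e., the soft clustering
```
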
Surveys on WSD Systems

Surveys:
- Roberto Navigli (2009). Word Sense Disambiguation: A Survey. ACM Computing Surveys, 41(2), pp. 1-69.
  - Covers: decision lists, decision trees, Naive Bayes, neural networks, instance-based learning, SVMs, ensemble methods, clustering, multilinguality, Semeval/Senseval, etc.
  - http://wwwusers.di.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf
- Alok Ranjan Pal and Diganta Saha (2015). Word Sense Disambiguation: A Survey. International Journal of Control Theory and Computer Modeling (IJCTCM), 5(3).
  - http://arxiv.org/pdf/1508.01346.pdf