
Word Sense Disambiguation
Carlo Strapparava
FBK-Irst Istituto per la ricerca scientifica e tecnologica
I-38050 Povo, Trento, ITALY
[email protected]
The problem of WSD
 What is the idea of word sense disambiguation?
 Many words have several meanings or senses
 For such words (given out of context) there is
ambiguity about how they should be interpreted
 WSD is the task of examining word tokens in
context and specifying exactly which sense of
each word is being used
Computers versus Humans
 Polysemy – most words have many possible
meanings.
 A computer program has no basis for knowing which
one is appropriate, even if it is obvious to a human…
 Ambiguity is rarely a problem for humans in their day
to day communication, except in extreme cases…
Ambiguity for a Computer
 The fisherman jumped off the bank and into the water.
 The bank down the street was robbed!
 Back in the day, we had an entire bank of computers devoted to this problem.
 The bank in that road is entirely too steep and is really dangerous.
 The plane took a bank to the left, and then headed off towards the mountains.
Other examples
Two examples:
1. In my office there are 2 tables and 4 chairs.
2. This year the chair of the ACL conference is prof. W. D.
For humans - it is not a problem
Ex. 1: chair = in the sense of furniture
Ex. 2: chair = in the sense of the role covered by a person
For machines - it is not trivial: one of the hardest problems
in NLP
Brief Historical Overview
 1970s - 1980s
  Rule-based systems
  Rely on hand-crafted knowledge sources
 1990s
  Corpus-based approaches
  Dependence on sense-tagged text
  (Ide and Veronis, 1998) give an overview of the history from the early days to 1998
 2000s
  Hybrid systems
  Minimizing or eliminating the use of sense-tagged text
  Taking advantage of the Web
Interdisciplinary Connections
 Cognitive Science & Psychology
  Quillian (1968), Collins and Loftus (1975): spreading activation
  Hirst (1987) developed a marker passing model
 Linguistics
  Fodor & Katz (1963): selectional preferences
  Resnik (1993) pursued this statistically
 Philosophy of Language
  Wittgenstein (1958): meaning as use
“For a large class of cases - though not for all - in which we employ the word "meaning" it can be defined thus: the meaning of a word is its use in the language.”
Why do we need the senses ?
 Sense disambiguation is an intermediate task
(Wilks and Stevenson, 1996)
 It is necessary to deal with most natural
language tasks that involve language
understanding (e.g. message understanding,
man-machine communication, …)
Where WSD could be useful
 Machine translation
 Information retrieval and hypertext navigation
 Content and thematic analysis
 Grammatical analysis
 Speech processing
 Text processing
 …
WSD for machine translation
 Sense disambiguation is essential for the
proper translation of words
 For example, the French word grille, depending on the context, can be translated as railings, gate, bar, grid, scale, schedule, etc.
WSD for information retrieval
 WSD could be useful for information retrieval
and hypertext navigation
 When searching for specific keywords, it is
desirable to eliminate documents where words
are used in an inappropriate sense
 For example, when searching for judicial references, we want to discard documents where court is used in the sense associated with royalty rather than with law
 Voorhees (1999), Krovetz (1997, 2002)
=> more benefits in Cross-Language IR
WSD for grammatical analysis
 Sense disambiguation can be useful in
  Part-of-speech tagging
  ex. L’étagère plie sous les livres [the shelf is bending under (the weight of) the books]
  livres (which can mean books or pounds) is masculine in the former sense, feminine in the latter
  Prepositional phrase attachment (Hindle and Rooth, 1993)
 In general it can restrict the space of parses (Alshawi and Carter, 1994)
WSD for speech processing
 Sense disambiguation is required for correct
phonetization of words in speech synthesis
 Ex. The word conjure
He conjured up an image
or I conjure you to help me
(Yarowsky, 1996)
 And also for word segmentation and
homophone discrimination in speech
recognition
WSD for content representation
 WSD is useful to have a content-based
representation of documents
 Ex. user model for multilingual news web sites
  A content-based technique to represent documents as a starting point to build a model of the user’s interests
  A user model built using a word-sense based representation of the visited documents
  A filtering procedure dynamically predicts new documents on the basis of the user’s interest model
A simple idea: selectional
restriction-based disambiguation
 Examples from Wall Street Journal
 “… washing dishes”
 “… Ms. Chen works well, stir-frying several simple dishes”
 Two senses:
 Physical Objects
 Meals or Recipes
 The selection can be based on restrictions imposed by
“wash” or “stir-fry” on their PATIENT role
 The object of stir-fry should be something edible
 The sense of “meal” conflicts with the restrictions
imposed by “wash”
Selectional restrictions (2)
 However there are some limitations
  Sometimes the available selectional restrictions are too general
  -> “what kind of dishes do you recommend?”
  Violations of selectional restrictions can still yield perfectly well-formed and interpretable sentences (ex. metaphors, metonymy)
  Requires a huge knowledge base
 Some attempts:
  FrameNet (Fillmore @ Berkeley)
  LCS (Lexical Conceptual Structure) - Jackendoff 83
WSD methodology
 A WSD task involves two steps
1. Sense repository -> the determination of all the
different senses for every word relevant to the
text under consideration
2. Sense assignment -> a means to assign each
occurrence of a word to the appropriate sense
Sense repositories
 Distinguish word senses in texts with respect to a
dictionary, thesaurus, etc…
 WordNet, LDOCE, Roget thesaurus
 A word is assumed to have a finite number of
discrete senses
 However, often a word has somewhat related senses and it is unclear whether and where to draw the lines between them
⇒ The issue of a coarse-grained task
Word Senses (ex. WordNet)
The noun title has 10 senses (first 7 from tagged texts)
1. title, statute title -- (a heading that names a statute or legislative bill; gives a brief summary of the
matters it deals with; "Title 8 provided federal help for schools")
2. title -- (the name of a work of art or literary composition etc.; "he looked for books with the word `jazz'
in the title"; "he refused to give titles to his paintings"; "I can never remember movie titles")
3. title -- (a general or descriptive heading for a section of a written work; "the novel had chapter titles")
4. championship, title -- (the status of being a champion; "he held the title for two years”)
5. deed, deed of conveyance, title -- (a legal document signed and sealed and delivered to effect a
transfer of property and to show the legal right to possess it; "he signed the deed"; "he kept the title to his
car in the glove compartment")
6. title -- (an identifying appellation signifying status or function: e.g. Mr. or General; "the professor didn't
like his friends to use his formal title")
7. title, claim -- (an established or recognized right: "a strong legal claim to the property"; "he had no
documents confirming his title to his father's estate")
8. title -- ((usually plural) written material introduced into a movie or TV show to give credits or represent
dialogue or explain an action; "the titles go by faster than I can read”)
9. title -- (an appellation signifying nobility; "`your majesty' is the appropriate title to use in addressing a
king")
10. claim, title -- (an informal right to something: "his claim on her attentions"; "his title to fame”)
The verb title has 1 sense (first 1 from tagged texts)
1. entitle, title -- (give a title to)
Sense repositories - a different
approach
 Word Sense Discrimination
 Sense discrimination divides the occurrences
of a word into a number of classes by
determining for any two occurrences whether
they belong to the same sense or not.
 We need only determine which occurrences
have the same meaning and not what the
meaning actually is.
Word Sense Discrimination
 Discrimination
  Clustering word senses in a text
 Pros:
  no need for a-priori dictionary definitions
  agglomerative clustering is a well studied field
 Cons:
  sense inventory varies from one text to another
  hard to evaluate
  hard to standardize
(Schütze 98) “Automatic Word Sense Discrimination” ACL 98
WSD methodology
 A WSD task involves two steps
1. Sense repository -> the determination of all the
different senses for every word relevant to the
text under consideration
2. Sense assignment -> a means to assign each
occurrence of a word to the appropriate sense
Sense assignment
 The assignment of words to senses relies on two major sources of information
  the context of the word to be disambiguated, in the broad sense: this includes info contained within the text in which the word appears
  external knowledge sources, including lexical, encyclopedic, etc. resources as well as (possibly) hand-devised knowledge sources
Sense assignment (2)
 All disambiguation work involves matching the context of the word to be disambiguated with either
  information from an external knowledge source (knowledge-driven WSD)
 OR
  information about the contexts of previously disambiguated instances of the word derived from corpora (corpus-based WSD)
Some approaches
 Corpus based approaches
  Supervised algorithms:
   Exemplar-Based Learning (Ng & Lee 96)
   Naïve Bayes
  Semi-supervised algorithms: bootstrapping approaches (Yarowsky 95)
 Dictionary based approaches
  Lesk 86
 Hybrid algorithms (supervised + dictionary)
  Mihalcea 00
Brief review:
What is Supervised Learning?
 Collect a set of examples that illustrate the
various possible classifications or outcomes of
an event
 Identify patterns in the examples associated
with each particular class of the event
 Generalize those patterns into rules
 Apply the rules to classify a new event
Learn from these examples:
“when do I go to the university?”

 Day | CLASS: Go to University? | F1: Hot Outside? | F2: Slept Well? | F3: Ate Well?
  1  | YES                      | YES              | NO              | NO
  2  | NO                       | YES              | NO              | YES
  3  | YES                      | NO               | NO              | NO
  4  | NO                       | NO               | NO              | YES
Supervised WSD
 Supervised WSD: class of methods that induce a classifier from manually sense-tagged text using machine learning techniques
 Resources
  Sense Tagged Text
  Dictionary (implicit source of sense inventory)
  Syntactic Analysis (POS tagger, Chunker, Parser, …)
 Scope
  Typically one target word per context
  Part of speech of target word resolved
  Lends itself to “targeted word” formulation
 Looking at WSD as a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs
An Example of Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1
robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right
on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can’t walk there.
Two Bags of Words
(Co-occurrences in the “window of context”)
FINANCIAL_BANK_BAG:
a an and are ATM Bonnie card charges check Clyde criminals
deposit famous for get I much My new overdraft really robbers the
they think to too two went were
RIVER_BANK_BAG:
a an and big campus cant catfish East got grandfather great
has his I in is Minnesota Mississippi muddy My of on planted pole
pretty right River The the there University walk West
Simple Supervised Approach
Given a sentence S containing “bank”:
For each word Wi in S
If Wi is in FINANCIAL_BANK_BAG then
Sense_1 = Sense_1 + 1;
If Wi is in RIVER_BANK_BAG then
Sense_2 = Sense_2 + 1;
If Sense_1 > Sense_2 then print “Financial”
else if Sense_2 > Sense_1 then print “River”
else print “Can’t Decide”;
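The counting scheme above can be run directly as a small Python sketch; the sense bags below are abbreviated versions of the two bags on the previous slide.

```python
# A minimal sketch of the bag-of-words counting approach above,
# with abbreviated versions of the two sense bags.
FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "deposit",
                      "overdraft", "robbers", "criminals"}
RIVER_BANK_BAG = {"campus", "catfish", "muddy", "pole", "river",
                  "mississippi", "walk", "grandfather"}

def disambiguate_bank(sentence):
    """Count context words that fall in each sense bag and vote."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    sense_1 = sum(w in FINANCIAL_BANK_BAG for w in words)
    sense_2 = sum(w in RIVER_BANK_BAG for w in words)
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(disambiguate_bank("I went to the bank to deposit my check"))  # -> Financial
```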
General Supervised Methodology
 Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities
  One tagged word per instance / lexical-sample disambiguation
 Select a set of features with which to represent context
  co-occurrences, collocations, POS tags, verb-obj relations, etc...
 Convert sense-tagged training instances to feature vectors
 Apply a machine learning algorithm to induce a classifier
  Form – structure or relation among features
  Parameters – strength of feature interactions
 Convert a held-out sample of test data into feature vectors
 Apply the classifier to test instances to assign a sense tag
From Text to Feature Vectors
 My/pronoun grandfather/noun used/verb to/prep
fish/verb along/adv the/det banks/SHORE of/prep
the/det Mississippi/noun River/noun. (S1)
 The/det bank/FINANCE issued/verb a/det
check/noun for/prep the/det amount/noun of/prep
interest/noun. (S2)
    | P-2 | P-1 | P+1  | P+2 | fish | check | river | interest | SENSE TAG
 S1 | adv | det | prep | det | Y    | N     | Y     | N        | SHORE
 S2 | -   | det | verb | det | N    | Y     | N     | Y        | FINANCE
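A small Python sketch of this conversion, assuming POS tags at positions -2..+2 plus binary keyword flags as in the table above (the helper names are illustrative):

```python
# Sketch: turn a POS-tagged sentence into the feature vector above:
# POS tags at positions -2..+2 plus Y/N flags for a small keyword set.
KEYWORDS = ["fish", "check", "river", "interest"]

def to_feature_vector(tagged, target_index):
    """tagged: list of (word, pos) pairs; target_index: position of the target."""
    def pos_at(i):
        return tagged[i][1] if 0 <= i < len(tagged) else "-"
    context = {w.lower() for w, _ in tagged}
    vector = [pos_at(target_index - 2), pos_at(target_index - 1),
              pos_at(target_index + 1), pos_at(target_index + 2)]
    vector += ["Y" if k in context else "N" for k in KEYWORDS]
    return vector

s1 = [("My", "pronoun"), ("grandfather", "noun"), ("used", "verb"),
      ("to", "prep"), ("fish", "verb"), ("along", "adv"), ("the", "det"),
      ("banks", "noun"), ("of", "prep"), ("the", "det"),
      ("Mississippi", "noun"), ("River", "noun")]
print(to_feature_vector(s1, 7))  # -> ['adv', 'det', 'prep', 'det', 'Y', 'N', 'Y', 'N']
```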
Supervised Learning Algorithms
 Once data is converted to feature vector form, any supervised learning algorithm can be used. Many have been applied to WSD with good results:
  Support Vector Machines
  Nearest Neighbor Classifiers
  Decision Trees
  Decision Lists
  Naïve Bayesian Classifiers
  Perceptrons
  Neural Networks
  Graphical Models
  Log Linear Models
  …
Summing up: Supervised algorithms
 In ML approaches, systems are trained on a set of labeled instances to perform the task of WSD
 What is learned is a classifier that can be used to assign unseen examples to senses
 These approaches vary in
  the nature of the training material
  how much material is needed
  the kind of linguistic knowledge
Summing up: Supervised algorithms
 All approaches put emphasis on acquiring the knowledge needed for the task from data (rather than from humans)
 The question is about scaling up:
“Is it possible (or realistic) to apply these
methodologies to the entire lexicon of the
language ?”
 Manually building sense-tagged corpora is
extremely costly
Inputs: feature vectors
 The input consists of
  the target word (i.e. the word to be disambiguated)
  the context (i.e. a portion of the text in which it is embedded)
 The input is normally part-of-speech tagged
 The context consists of larger or smaller segments surrounding the target word
 Often some kind of morphological processing is performed on all words of the context
 More rarely, some form of parsing is performed to find out grammatical roles and relations
Feature vectors
 Two steps
  Selecting the relevant linguistic features
  Encoding them in a way suitable for a learning algorithm
 A feature vector consists of numerical or nominal values encoding the selected linguistic information
Linguistic features
 The linguistic features used in training WSD systems can be divided in two classes
  Syntagmatic features: two words are syntagmatically related when they frequently appear in the same syntagm (e.g. when one of them frequently follows or precedes the other)
  Paradigmatic features: two words are paradigmatically related when their meanings are closely related (e.g. synonyms, hyponyms, words with the same semantic domains)
Syntagmatic features
 Typical syntagmatic features are collocational features, encoding information about the lexical inhabitants of specific positions on the left or on the right of the target word:
  the word
  the root form of the word
  the word’s part-of-speech
  …
 Ex. “An electric guitar and bass player stand off…”
 2 words on the right, 2 on the left, with POS
⇒ [guitar, NN1, and, CJC, player, NN1, stand, VVB]
Paradigmatic features
 Features that are effective at capturing the
general topic of the discourse in which the
target word has occurred


Bag of words
Domain labels
 In BOW, co-occurrences of words ignoring their exact position - the value of the feature is the number of times the word occurs in a region surrounding the target word
Ng and Lee (1996): LEXAS
 “Exemplar-based learning” (in practice k-nearest
neighbor learning)
 LEXical Ambiguity-resolving System
 A set of features is extracted from disambiguated
examples
 When a new untagged example is encountered,
it is compared with each of the training
examples, using a “distance” function
LEXAS: the features
 Feature extraction
  [L3, L2, L1, R1, R2, R3, K1, ..., Km, C1, ..., C9, V]
 Part of speech and morphological form:
  part of speech of the words to the left (L3, L2, L1) and right (R1, R2, R3)
 Unordered set of surrounding words:
  keywords that co-occur frequently with the word w (K1, ..., Km)
 Local collocations:
  C1, ..., C9, determined by left and right offsets
  e.g. [-1 1]: “national interest of”
 Verb-object syntactic relations (V):
  used only for nouns
LEXAS: the metric
 Distance among feature vectors:
  the distance between two vectors is the sum of the distances between their features
  the distance between two values v1 and v2 of a feature f:

   δ(v1, v2) = Σ_{i=1..n} | C1,i / C1 − C2,i / C2 |

 C1,i is the number of training examples with value v1 for the feature f that are classified with sense i in the training corpus
 C1 is the number of training examples with value v1 in any sense
 C2,i and C2 denote the analogous quantities for v2
 n is the total number of senses for the word
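A direct transcription of this distance in Python; the toy count table below is invented for illustration:

```python
# Sketch of the LEXAS per-feature value distance. counts[v][i] holds
# the number of training examples with value v for this feature that
# are tagged with sense i (the toy table below is hypothetical).
def value_distance(counts, v1, v2, n_senses):
    c1 = sum(counts[v1].values())  # examples with value v1, any sense
    c2 = sum(counts[v2].values())  # examples with value v2, any sense
    return sum(abs(counts[v1].get(i, 0) / c1 - counts[v2].get(i, 0) / c2)
               for i in range(n_senses))

# Toy counts for a feature with two values over two senses:
counts = {"NN": {0: 3, 1: 1}, "VB": {0: 1, 1: 3}}
print(value_distance(counts, "NN", "VB", 2))  # -> 1.0
```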
LEXAS: the algorithm
 Training phase: build training examples
 Testing:
  for a new untagged occurrence of word w, measure the distance with the training examples
  choose the sense from the example which provides the minimum distance
LEXAS: evaluation
 Two datasets:
  The “interest corpus” by Bruce and Wiebe (1994): 2,639 sentences from the Wall Street Journal, each containing the noun interest
  Sense repository: one of the six senses from LDOCE
   Lexas: 89%
   Yarowsky: 72%
   Bruce & Wiebe: 79%
  192,800 occurrences of very ambiguous words: 121 nouns (e.g. action, board) [~7.8 senses per word] and 70 verbs (e.g. add, build) [~12 senses per word] - (from Brown and WSJ)
  Sense repository: WordNet
   Lexas: 68.6%
   Most Frequent Sense: 63.7%
Intermezzo:
Evaluating a WSD system
 Numerical evaluation is meaningless without specifying how difficult the task is
 Ex. 90% accuracy is easy for a POS tagger, but beyond the ability of any Machine Translation system
 Estimating upper and lower bounds makes sense of the performance of an algorithm
Evaluating a WSD system:
Upper bound
 The upper bound is usually human performance
 In the case of WSD, if humans disagree on
sense disambiguation, we cannot expect a WSD
system to do better
 Inter-judgment agreement is higher for words
with clear sense distinctions ex. bank (95% and
higher), lower for polysemous words ex. title
(65% to 70%)
Evaluating a WSD system:
Inter-judgment agreement
 To correctly compare the extent of inter-
judge agreement, we need to correct for the
expected chance agreement
 This depends on the number of senses being
distinguished
 This is done by Kappa statistics (Carletta,
1996)
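As a concrete illustration, Cohen's kappa for two annotators can be computed like this (the sense tags below are hypothetical):

```python
# A toy two-annotator kappa computation (Cohen's kappa), sketching the
# chance correction mentioned above; the sense tags are invented.
from collections import Counter

def kappa(tags_a, tags_b):
    n = len(tags_a)
    # observed agreement: fraction of items the two annotators agree on
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    # expected chance agreement from each annotator's label distribution
    expected = sum(freq_a[t] / n * freq_b[t] / n for t in freq_a)
    return (observed - expected) / (1 - expected)

a = ["s1", "s1", "s2", "s2", "s1", "s2", "s1", "s2"]
b = ["s1", "s1", "s2", "s1", "s1", "s2", "s1", "s2"]
print(kappa(a, b))  # -> 0.75
```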
Evaluating a WSD system:
Lower bound
 The lower bound or baseline is the performance of the simplest possible algorithm
 Usually the assignment of the most frequent sense
 90% accuracy is a good result for a word with 2 equiprobable senses, a trivial result for a word with 2 senses in a 9 to 1 frequency ratio
 Another possible baseline: random choice
Evaluating a WSD system:
Scoring
 Precision and Recall
  Precision is the proportion of classified instances that were correctly classified
  Recall is the proportion of all test instances that were correctly classified – the distinction allows for the possibility of an algorithm choosing not to classify a given instance
 For the sense-tagging task, accuracy is reported as recall
 The coverage of a system (i.e., the percentage of items for which the system guesses some sense tag) can be computed by dividing recall by precision
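These definitions can be checked with a small sketch in which an algorithm may abstain by returning None (the predictions below are invented):

```python
# Sketch of the scoring scheme above: an algorithm may abstain
# (predict None); precision counts only attempted items, recall all.
def score(predictions, gold):
    attempted = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    correct = sum(p == g for p, g in attempted)
    precision = correct / len(attempted)
    recall = correct / len(gold)
    coverage = recall / precision  # = len(attempted) / len(gold)
    return precision, recall, coverage

preds = ["s1", "s2", "s1", "s2", None, None, None, None]
gold  = ["s1", "s2", "s1", "s1", "s2", "s1", "s2", "s1"]
print(score(preds, gold))  # -> (0.75, 0.375, 0.5)
```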
Evaluation frameworks
 SENSEVAL http://www.senseval.org
 A competition with various WSD tasks and on different languages
  All-words task
  Lexical sample task
  …
 Senseval 1: 1999 [Hector dictionary] – ~10 teams
 Senseval 2: 2001 [WordNet dictionary] – ~30 teams
 Senseval 3: 2004 [mainly WordNet dictionary] – ~60 teams
  many different tasks
 Senseval 4 -> Semeval 1: 2007
Naïve Bayes
 A premise: choosing the best sense for an input vector is choosing the most probable sense given that vector

   ŝ = argmax_{s∈S} P(s | V)

 re-writing in the usual Bayesian manner:

   ŝ = argmax_{s∈S} P(V | s) P(s) / P(V)

 But the data available that associates specific vectors with senses is too sparse
 What is largely available in the training set is information about individual feature-value pairs for a specific sense
Naïve Bayes (2)
 Naïve assumption: the features are independent

   P(V | s) ≈ ∏_{j=1..n} P(v_j | s)

 P(V) is the same for all possible senses, so it does not affect the final ranking of senses:

   ŝ = argmax_{s∈S} P(s) ∏_{j=1..n} P(v_j | s)

 Training a naïve Bayes classifier consists of collecting the individual feature-value statistics with respect to each sense of the target word in a sense-tagged training corpus
 In practice, considerations about smoothing apply
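A minimal sketch of such a classifier over bag-of-words features, with add-one (Laplace) smoothing standing in for the smoothing considerations mentioned above; the tiny training set is invented:

```python
# Minimal naive Bayes sense classifier over bag-of-words features,
# with add-one smoothing; the training examples are invented.
from collections import Counter, defaultdict
from math import log

def train(tagged_examples):
    """tagged_examples: list of (context_words, sense) pairs."""
    sense_counts = Counter(s for _, s in tagged_examples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in tagged_examples:
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def classify(context, sense_counts, word_counts, vocab):
    total = sum(sense_counts.values())
    best, best_logp = None, float("-inf")
    for sense, count in sense_counts.items():
        logp = log(count / total)  # log P(s)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            logp += log((word_counts[sense][w] + 1) / denom)  # smoothed P(w|s)
        if logp > best_logp:
            best, best_logp = sense, logp
    return best

examples = [(["money", "deposit", "loan"], "finance"),
            (["interest", "account", "money"], "finance"),
            (["water", "shore", "fishing"], "river"),
            (["muddy", "water", "catfish"], "river")]
model = train(examples)
print(classify(["loan", "money"], *model))  # -> finance
```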
Semi-supervised approaches
 Problems with supervised algorithms
=> the need for a large sense-tagged training set
 Bootstrapping approach (Yarowsky, 1995)
  A small number of labeled instances (seeds) is used to train an initial classifier in a supervised way
  This classifier is then used to extract a larger training set from an untagged corpus
  Iterating this process results in a series of increasingly accurate classifiers
“One sense per …” constraints
 There are constraints between different occurrences of
an ambiguous word within a corpus that can be exploited
for disambiguation:
 One sense per discourse: The sense of a target
word is highly consistent within any given document.
 e.g. … He planted the pansy seeds himself, buying them from a pansy specialist. These specialists have done a great deal of work to improve the size and health of the plants and the resulting flowers. Their seeds produce vigorous blooming plants half again the size of the unimproved strains. …
 One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship.
  e.g. industrial plant -- same meaning for plant regardless of where this collocation occurs
“One sense per …” constraints
 Summing up:
 One sense per discourse
  A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
 One sense per collocation
  A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)
   Strong for adjacent collocations
   Weaker as the distance between words increases
Bootstrapping (Yarowsky 95)
 Simplification: binary sense assignment
 Step 1
  Identify in a corpus all examples of a given polysemous word
 Step 2
  Identify a small set of representative examples for each sense
 Step 3
  a) Train a classification algorithm on the Sense-A/Sense-B seed sets
  b) Apply the resulting classifier to the rest of the corpus and add the newly labeled examples to the seed sets
  c) Repeat iteratively
 Step 4
  Apply the classification algorithm to the testing set
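The loop in Step 3 can be sketched with a toy rule-growing classifier; the corpus, the seeds, and the "unambiguous collocate" harvesting rule are simplifications for illustration (Yarowsky's actual system uses log-likelihood-ranked decision lists):

```python
# Toy sketch of the bootstrapping loop: grow a collocate->sense rule
# set from seeds, labeling more of the corpus on each round.
# Corpus and seeds are invented for illustration.
def bootstrap(corpus, seeds, rounds=3):
    """corpus: list of word lists; seeds: {collocate: sense}."""
    rules = dict(seeds)   # collocate -> sense
    labeled = {}          # example index -> sense
    for _ in range(rounds):
        # Step 3a/3b: label every example the current rules can reach...
        for i, words in enumerate(corpus):
            for collocate, sense in rules.items():
                if collocate in words:
                    labeled[i] = sense
        # ...then harvest collocates that occur with only one sense
        for i, sense in labeled.items():
            for w in corpus[i]:
                senses = {labeled[j] for j in labeled if w in corpus[j]}
                if len(senses) == 1:
                    rules.setdefault(w, sense)
    return labeled

corpus = [["plant", "life", "animal"],
          ["manufacturing", "plant", "equipment"],
          ["plant", "animal", "species"],
          ["plant", "equipment", "employee"]]
labels = bootstrap(corpus, {"life": "A", "manufacturing": "B"})
print(labels)  # -> {0: 'A', 1: 'B', 2: 'A', 3: 'B'}
```

Note how the seeds label only examples 0 and 1 at first; the harvested collocates (animal, equipment) then pull in examples 2 and 3 on the next round.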
Bootstrapping (Yarowsky 95)
 Example: word plant
  Sense-A “living organism”
  Sense-B “factory”
 Seed collocations: life and manufacturing
 Extract examples containing these seeds
 Decision list:

 LogL | Collocation                         | Sense
 8.10 | plant life                          | A
 7.58 | manufacturing plant                 | B
 7.39 | life (within ± 2-10 words)          | A
 7.20 | manufacturing (within ± 2-10 words) | B
 6.27 | animal (within ± 2-10 words)        | A
 4.70 | equipment (within ± 2-10 words)     | B
 4.36 | employee (within ± 2-10 words)      | B
Bootstrapping (Yarowsky 95)
 Use this decision list to classify new examples
 Repeat until no more examples can be classified
 Test: apply the decision list on the testing set
 Options for training seeds:
  use words in dictionary definitions
  single defining collocate (e.g. bird and machine for the word “crane”)
 Results:
  Yarowsky: 96.5%
  Most frequent sense: 63.9%
 It works well only for distinct senses of words. Sometimes this is not the case:
  1. bass -- (the lowest part of the musical range)
  3. bass, basso -- (an adult male singer with the lowest voice)
Some approaches
 Corpus based approaches
  Supervised algorithms:
   Exemplar-Based Learning (Ng & Lee 96)
   Naïve Bayes
  Semi-supervised algorithms: bootstrapping approaches (Yarowsky 95)
 Dictionary based approaches
  Lesk 86
 Hybrid algorithms (supervised + dictionary)
  Mihalcea 00
Lesk algorithm (1986)
 It is one of the first algorithms developed for the semantic disambiguation of all words in open text
 The only resource required is a set of dictionary entries (definitions)
 The most likely sense for a word in a given context is decided by a measure of the overlap between the definitions of the target word and of the words of the current context
Lesk algorithm - dictionary based
 The main idea of the original version of the algorithm is to disambiguate words by finding the overlap among their sense definitions

(1) for each sense i of W1
(2)   for each sense j of W2
(3)     determine Overlap(i,j) as the number of common occurrences
        between the definitions of sense i of W1 and sense j of W2
(4) find i and j for which Overlap(i,j) is maximum
(5) assign sense i to W1 and sense j to W2
Lesk algorithm - an example
 Select the appropriate sense of cone and pine in the phrase pine cone given the following definitions
  pine
   1. kinds of evergreen tree with needle-shaped leaves
   2. waste away through sorrow or illness
  cone
   1. solid body which narrows to a point from a round flat base
   2. something of this shape whether solid or hollow
   3. fruit of certain evergreen trees
 Select pine#1 and cone#3 because they have two words in common
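The example can be reproduced with a short sketch of the original Lesk overlap; the stopword list and the crude plural stripping are illustrative choices:

```python
# Sketch of the original Lesk overlap on the pine/cone definitions
# above; stopword list and plural stripping are crude but sufficient
# to match "evergreen tree(s)" across the two entries.
STOP = {"of", "which", "to", "a", "from", "this", "or", "with", "through"}

def content_words(definition):
    return {w.rstrip("s") for w in definition.split()} - STOP

def lesk(defs1, defs2):
    """Return the (sense_i, sense_j) pair whose definitions overlap most."""
    best, best_overlap = (None, None), -1
    for i, d1 in defs1.items():
        for j, d2 in defs2.items():
            overlap = len(content_words(d1) & content_words(d2))
            if overlap > best_overlap:
                best, best_overlap = (i, j), overlap
    return best

pine = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
cone = {1: "solid body which narrows to a point from a round flat base",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}
print(lesk(pine, cone))  # -> (1, 3)
```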
Lesk algorithm - corpus based
 A corpus-based variation: take into consideration also additional tagged examples

(1) for each sense s of the word W1
(2)   set weight(s) to zero
(3) for each unique word w in the surrounding context of W1
(4)   for each sense s,
(5)     if w occurs in the training examples / dictionary definitions for sense s,
(6)       add weight(w) to weight(s)
(7) choose the sense with the greatest weight(s)

weight(w) = IDF = -log(p(w))
p(w) is estimated over the examples and dictionary definitions
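A sketch of this weighted variant; the toy sense texts reuse the cone definitions from the earlier example, and p(w) is estimated over them alone:

```python
# Sketch of the corpus-based variation: context words vote for a
# sense with IDF-style weights -log p(w); the toy sense texts reuse
# the cone definitions and are purely illustrative.
from math import log

def weighted_lesk(context, sense_texts):
    """sense_texts maps a sense id to words from its definition/examples."""
    all_words = [w for words in sense_texts.values() for w in words]
    def weight(w):
        return -log(all_words.count(w) / len(all_words))  # IDF = -log p(w)
    scores = {s: sum(weight(w) for w in context if w in words)
              for s, words in sense_texts.items()}
    return max(scores, key=scores.get)

cone_senses = {
    1: ["solid", "body", "narrows", "point", "round", "flat", "base"],
    2: ["something", "shape", "solid", "hollow"],
    3: ["fruit", "certain", "evergreen", "trees"],
}
print(weighted_lesk(["pine", "evergreen", "solid"], cone_senses))  # -> 3
```

The rare word evergreen outweighs the more frequent solid, so the evergreen sense wins even though solid matches two senses.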
Lesk algorithm - evaluation
 The corpus-based variation is one of the best performing baselines in comparative evaluations of WSD systems
 In Senseval-2 => 51.2% precision, compared to 64.2% achieved by the best supervised system
 Problems of this approach:
  The dictionary entries are relatively short
  Combinatorial explosion when applied to more than two words
Some approaches
 Corpus based approaches
  Supervised algorithms:
   Exemplar-Based Learning (Ng & Lee 96)
   Naïve Bayes
  Semi-supervised algorithms: bootstrapping approaches (Yarowsky 95)
 Dictionary based approaches
  Lesk 86
  …
 Hybrid algorithms (supervised + dictionary)
  Mihalcea 00
Hybrid algorithms (Mihalcea 2000)
 It combines two sources of information: WordNet and a sense tagged corpus (SemCor)
 It is based on
  WordNet definitions
  WordNet ISA relations
  Rules acquired from SemCor
Hybrid algorithm (Mihalcea 2000)
 It was developed for the purpose of improving Information Retrieval with WSD techniques:
  disambiguation of the words in the input IR query
  disambiguation of words in the documents
 The algorithm determines a set of nouns and verbs that can be disambiguated with high precision
 Several procedures (8) are called iteratively in the main algorithm
Procedure 1
 The system uses a Named Entity recognizer
 In particular person names, organizations and locations
⇒ Identify Named Entities in the text and mark them with sense #1
 Examples:
  Bush => PER => person#1
  Trento => LOC => location#1
  IBM => ORG => organization#1
Procedure 2
 Exploiting the monosemous words
⇒ Identify words having only one sense in
WordNet and mark them with that sense
 Example:
  The noun subcommittee has only one sense in WordNet
Procedure 3
 Exploiting contextual clues about the usage of the words
⇒ Given a clue (collocation), search for it in SemCor and mark the words with the corresponding sense from SemCor
 Example: disambiguate “approval” in “approval of”
=> 4 examples in SemCor
  “…with the approval#1 of the Farm Credit Association…”
  “…subject to the approval#1 of the Secretary of State…”
  “…administrative approval#1 of the reclassification…”
  “…recommended approval#1 of the 1-A classification…”
 In all these occurrences the sense of approval is approval#1
Procedure 4
 Using SemCor, for a given noun N in the text, determine the noun-context of each of its senses
 Noun-context: the list of nouns which occur most often within the context of N
⇒ Find common words between the current context and the noun-context
 Example: diameter has 2 senses
  2 noun-contexts:
   diameter#1: {property, hole, ratio}
   diameter#2: {form}
Procedure 5
⇒ Find words that are semantically connected to already disambiguated words
 Connected means there is a relation in WordNet
 If they belong to the same synset => connection distance = 0
 Example: authorize and clear in a text to be disambiguated
  Knowing: authorize#1 – disambiguated with procedure 2
  It results: clear#4 - because they are synonyms in WordNet
Procedure 6
⇒ Find words that are semantically connected and for which the connection distance is 0
 Weaker than procedure 5: none of the words considered is already disambiguated
 Example: measure and bill, both ambiguous
  bill has 10 senses, measure 9
  bill#1 and measure#4 belong to the same synset
Procedures 7-8
 Similar to procedures 5-6, but using other semantic relations (connection distance = 1): hypernymy/hyponymy = ISA relations
 Example: subcommittee and committee
  subcommittee#1 – disambiguated with procedure 2
  committee#1 – because it is a hypernym of subcommittee#1
(Mihalcea 2000) - Evaluation
 The procedures presented above are applied iteratively
 This allows us to identify a set of nouns and verbs that can be disambiguated with high precision
 Tests on six randomly selected files from SemCor
 The algorithm disambiguates 55% of the nouns and verbs with 92% precision