Advances in Word Sense Disambiguation

Abridged version (edited by R. Basili) for the
MGRI course, a.y. 2007-08
Original URL:
http://www.aaai.org/Conferences/AAAI/aaai05.php
Advances in
Word Sense Disambiguation
Tutorial at ACL 2005
June 25, 2005
Ted Pedersen
University of Minnesota, Duluth
http://www.d.umn.edu/~tpederse
Rada Mihalcea
University of North Texas
http://www.cs.unt.edu/~rada
Outline of Tutorial
• Introduction (Ted)
• Methodology (Rada)
• Knowledge Intensive Methods (Rada)
• Supervised Approaches (Ted)
• Minimally Supervised Approaches (Rada) / BREAK
• Unsupervised Learning (Ted)
• How to Get Started (Rada)
• Conclusion (Ted)
Part 1:
Introduction
Outline
• Definitions
• Ambiguity for Humans and Computers
• Very Brief Historical Overview
• Theoretical Connections
• Practical Applications
Definitions
• Word sense disambiguation is the problem of selecting a
sense for a word from a set of predefined possibilities.
– Sense Inventory usually comes from a dictionary or thesaurus.
– Knowledge intensive methods, supervised learning, and
(sometimes) bootstrapping approaches
• Word sense discrimination is the problem of dividing the
usages of a word into different meanings, without regard to
any particular existing sense inventory.
– Unsupervised techniques
6
Outline
• Definitions
• Ambiguity for Humans and Computers
• Very Brief Historical Overview
• Theoretical Connections
• Practical Applications
Computers versus Humans
• Polysemy – most words have many possible meanings.
• A computer program has no basis for knowing which one
is appropriate, even if it is obvious to a human…
• Ambiguity is rarely a problem for humans in their day to
day communication, except in extreme cases…
8
Ambiguity for Humans - Newspaper Headlines!
• DRUNK GETS NINE YEARS IN VIOLIN CASE
• FARMER BILL DIES IN HOUSE
• PROSTITUTES APPEAL TO POPE
• STOLEN PAINTING FOUND BY TREE
• RED TAPE HOLDS UP NEW BRIDGE
• DEER KILL 300,000
• RESIDENTS CAN DROP OFF TREES
• INCLUDE CHILDREN WHEN BAKING COOKIES
• MINERS REFUSE TO WORK AFTER DEATH
Early Days of WSD
• Noted as problem for Machine Translation (Weaver, 1949)
– A word can often only be translated if you know the specific sense
intended (A bill in English could be a pico or a cuenta in Spanish)
• Bar-Hillel (1960) posed the following:
– Little John was looking for his toy box. Finally, he found it. The
box was in the pen. John was very happy.
– Is “pen” a writing instrument or an enclosure where children play?
…declared it unsolvable, left the field of MT!
12
Since then…
• 1970s - 1980s
– Rule based systems
– Rely on hand crafted knowledge sources
• 1990s
– Corpus based approaches
– Dependence on sense tagged text
– (Ide and Veronis, 1998) overview history from early days to 1998.
• 2000s
– Hybrid Systems
– Minimizing or eliminating use of sense tagged text
– Taking advantage of the Web
13
Practical Applications
• Machine Translation
– Translate “bill” from English to Spanish
• Is it a “pico” or a “cuenta”?
• Is it a bird jaw or an invoice?
• Information Retrieval
– Find all Web Pages about “cricket”
• The sport or the insect?
• Question Answering
– What is George Miller’s position on gun control?
• The psychologist or US congressman?
• Knowledge Acquisition
– Add to KB: Herb Bergson is the mayor of Duluth.
• Minnesota or Georgia?
17
References
• (Bar-Hillel, 1960) The Present Status of Automatic Translations of Languages. In Advances in Computers, Volume 1. Alt, F. (editor). Academic Press, New York, NY, pp. 91-163.
• (Collins and Loftus, 1975) A Spreading Activation Theory of Semantic Memory. Psychological Review, (82), pp. 407-428.
• (Fodor and Katz, 1963) The structure of semantic theory. Language (39), pp. 170-210.
• (Hirst, 1987) Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press.
• (Ide and Véronis, 1998) Word Sense Disambiguation: The State of the Art. Computational Linguistics (24), pp. 1-40.
• (Quillian, 1968) Semantic Memory. In Semantic Information Processing. Minsky, M. (editor). The MIT Press, Cambridge, MA, pp. 227-270.
• (Resnik, 1993) Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. Dissertation, University of Pennsylvania.
• (Weaver, 1949) Translation. In Machine Translation of Languages: fourteen essays. Locke, W.N. and Booth, A.D. (editors). The MIT Press, Cambridge, Mass., pp. 15-23.
• (Wittgenstein, 1958) Philosophical Investigations, 3rd edition. Translated by G.E.M. Anscombe. Macmillan Publishing Co., New York.
Part 2:
Methodology
Outline
• General considerations
• All-words disambiguation
• Targeted-words disambiguation
• Word sense discrimination, sense discovery
• Evaluation (granularity, scoring)
Overview of the Problem
• Many words have several meanings (homonymy / polysemy)
–Ex: “chair” – furniture or person
–Ex: “child” – young person or human offspring
• Determine which sense of a word is used in a specific sentence
• Note:
– often, the different senses of a word are closely related
• Ex: title
- right of legal ownership
- document that is evidence of the legal ownership,
– sometimes, several senses can be “activated” in a single context
(co-activation)
• Ex: “This could bring competition to the trade”
competition: - the act of competing
- the people who are competing
21
Word Senses
• The meaning of a word in a given context
• Word sense representations
– With respect to a dictionary
chair = a seat for one person, with a support for the back; "he put his coat
over the back of the chair and sat down"
chair = the position of professor; "he was awarded an endowed chair in
economics"
– With respect to the translation in a second language
chair = chaise
chair = directeur
– With respect to the context where it occurs (discrimination)
“Sit on a chair” “Take a seat on this chair”
“The chair of the Math Department” “The chair of the meeting”
22
Approaches to Word Sense Disambiguation
• Knowledge-Based Disambiguation
– use of external lexical resources such as dictionaries and thesauri
– discourse properties
• Supervised Disambiguation
– based on a labeled training set
– the learning system has:
• a training set of feature-encoded inputs AND
• their appropriate sense label (category)
• Unsupervised Disambiguation
– based on unlabeled corpora
– The learning system has:
• a training set of feature-encoded inputs BUT
• NOT their appropriate sense label (category)
23
All Words Word Sense Disambiguation
• Attempt to disambiguate all open-class words in a text
“He put his suit over the back of the chair”
• Knowledge-based approaches
• Use information from dictionaries
– Definitions / Examples for each meaning
• Find similarity between definitions and current context
• Position in a semantic network
• Find that “table” is closer to “chair/furniture” than to “chair/person”
• Use discourse properties
• A word exhibits the same sense in a discourse / in a collocation
24
All Words Word Sense Disambiguation
• Minimally supervised approaches
– Learn to disambiguate words using small annotated corpora
– E.g. SemCor – corpus where all open class words are
disambiguated
• 200,000 running words
• Most frequent sense
25
Targeted Word Sense Disambiguation
• Disambiguate one target word
“Take a seat on this chair”
“The chair of the Math Department”
• WSD is viewed as a typical classification problem
– use machine learning techniques to train a system
• Training:
– Corpus of occurrences of the target word, each occurrence
annotated with appropriate sense
– Build feature vectors:
• a vector of relevant linguistic features that represents the context (ex:
a window of words around the target word)
• Disambiguation:
– Disambiguate the target word in new unseen text
26
Targeted Word Sense Disambiguation
• Take a window of n words around the target word
• Encode information about the words around the target word
– typical features include: words, root forms, POS tags, frequency, …
• An electric guitar and bass player stand off to one side, not really part of
the scene, just as a sort of nod to gringo expectations perhaps.
• Surrounding context (local features)
– [ (guitar, NN1), (and, CJC), (player, NN1), (stand, VVB) ]
• Frequent co-occurring words (topical features)
– [fishing, big, sound, player, fly, rod, pound, double, runs, playing, guitar, band]
– [0,0,0,1,0,0,0,0,0,0,1,0]
• Other features:
– [followed by "player", contains "show" in the sentence,…]
– [yes, no, … ]
27
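The feature encoding above can be made concrete with a short sketch. This is a minimal illustration, assuming a pre-tokenised, POS-tagged sentence and a hand-picked list of topical words (both hypothetical; a real system derives these from training data):

```python
# Minimal sketch of local + topical feature extraction for targeted WSD.
def extract_features(tagged_tokens, target_index, window=2,
                     topical_words=("fishing", "big", "sound", "player",
                                    "fly", "rod", "guitar", "band")):
    # Local features: (word, POS) pairs inside the window around the target.
    left = tagged_tokens[max(0, target_index - window):target_index]
    right = tagged_tokens[target_index + 1:target_index + 1 + window]

    # Topical features: binary indicators for frequent co-occurring words.
    words_in_sentence = {w.lower() for w, _ in tagged_tokens}
    topical = [1 if w in words_in_sentence else 0 for w in topical_words]

    return {"local": left + right, "topical": topical}

sentence = [("An", "AT0"), ("electric", "AJ0"), ("guitar", "NN1"),
            ("and", "CJC"), ("bass", "NN1"), ("player", "NN1"),
            ("stand", "VVB"), ("off", "AVP")]
print(extract_features(sentence, target_index=4))   # target word: "bass"
```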
Unsupervised Disambiguation
• Disambiguate word senses:
– without supporting tools such as dictionaries and thesauri
– without a labeled training text
• Without such resources, word senses are not labeled
– We cannot say “chair/furniture” or “chair/person”
• We can:
– Cluster/group the contexts of an ambiguous word into a number
of groups
– Discriminate between these groups without actually labeling
them
28
Unsupervised Disambiguation
• Hypothesis: same senses of words will have similar neighboring
words
• Disambiguation algorithm
– Identify context vectors corresponding to all occurrences of a particular
word
– Partition them into regions of high density
– Assign a sense to each such region
“Sit on a chair”
“Take a seat on this chair”
“The chair of the Math Department”
“The chair of the meeting”
29
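A minimal sketch of this discrimination step, clustering first-order context vectors with scikit-learn; the toy contexts and the number of clusters are assumptions for illustration only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

contexts = [
    "sit on a chair",
    "take a seat on this chair",
    "the chair of the math department",
    "the chair of the meeting",
]
X = CountVectorizer().fit_transform(contexts)                  # one context vector per occurrence
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1]: two unlabeled groups of usages
```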
Evaluating Word Sense Disambiguation
• Metrics:
– Precision = percentage of words that are tagged correctly, out of the
words addressed by the system
– Recall = percentage of words that are tagged correctly, out of all words
in the test set
– Example
• Test set of 100 words
• System attempts 75 words
• Words correctly disambiguated 50
Precision = 50 / 75 = 0.66
Recall = 50 / 100 = 0.50
• Special tags are possible:
– Unknown
– Proper noun
– Multiple senses
• Compare to a gold standard
– SEMCOR corpus, SENSEVAL corpus, …
30
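The two metrics from the example above, as a minimal sketch:

```python
def precision_recall(correct, attempted, total):
    # Precision: correct / attempted; Recall: correct / all words in the test set.
    return correct / attempted, correct / total

p, r = precision_recall(correct=50, attempted=75, total=100)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")   # Precision = 0.67, Recall = 0.50
```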
Evaluating Word Sense Disambiguation
• Difficulty in evaluation:
– Nature of the senses to distinguish has a huge impact on results
• Coarse versus fine-grained sense distinction
chair = a seat for one person, with a support for the back; "he put his coat
over the back of the chair and sat down"
chair = the position of professor; "he was awarded an endowed chair in
economics"
bank = a financial institution that accepts deposits and channels the money
into lending activities; "he cashed a check at the bank"; "that bank holds
the mortgage on my home"
bank = a building in which commercial banking is transacted; "the bank is
on the corner of Nassau and Witherspoon"
• Sense maps
– Cluster similar senses
– Allow for both fine-grained and coarse-grained evaluation
31
Bounds on Performance
• Upper and Lower Bounds on Performance:
– Measure of how well an algorithm performs relative to the difficulty of
the task.
• Upper Bound:
– Human performance
– Around 97%-99% with few and clearly distinct senses
– Inter-judge agreement:
• With words with clear & distinct senses – 95% and up
• With polysemous words with related senses – 65% – 70%
• Lower Bound (or baseline):
– The assignment of a random sense / the most frequent sense
• 90% is excellent for a word with 2 equiprobable senses
• 90% is trivial for a word with 2 senses with probability ratios of 9 to 1
32
References
• (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992.
• (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.
• (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11), 1995.
• (Senseval) Senseval evaluation exercises. http://www.senseval.org
Part 3:
Knowledge-based Methods for
Word Sense Disambiguation
Task Definition
• Knowledge-based WSD = class of WSD methods relying
(mainly) on knowledge drawn from dictionaries and/or raw
text
• Resources
– Yes
• Machine Readable Dictionaries
• Raw corpora
– No
• Manually annotated corpora
• Scope
– All open-class words
36
Machine Readable Dictionaries
• In recent years, most dictionaries made available in
Machine Readable format (MRD)
– Oxford English Dictionary
– Collins
– Longman Dictionary of Contemporary English (LDOCE)
• Thesauruses – add synonymy information
– Roget Thesaurus
• Semantic networks – add more semantic relations
– WordNet
– EuroWordNet
37
MRD – A Resource for Knowledge-based WSD
• A thesaurus adds:
– An explicit synonymy relation between word meanings
WordNet synsets for the noun “plant”
1. plant, works, industrial plant
2. plant, flora, plant life
• A semantic network adds:
– Hypernymy/hyponymy (IS-A), meronymy/holonymy (PART-OF),
antonymy, entailment, etc.
WordNet related concepts for the meaning “plant life”
{plant, flora, plant life}
hypernym: {organism, being}
hyponym: {house plant}, {fungus}, …
meronym: {plant tissue}, {plant part}
holonym: {Plantae, kingdom Plantae, plant kingdom}
39
Lesk Algorithm
• (Michael Lesk 1986): Identify senses of words in context using definition
overlap
• Algorithm:
1. Retrieve from MRD all sense definitions of the words to be
disambiguated
2. Determine the definition overlap for all possible sense combinations
3. Choose senses that lead to highest overlap
Example: disambiguate PINE CONE
• PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
• CONE
1. solid body which narrows to a point
2. something of this shape whether solid or hollow
3. fruit of certain evergreen trees
41
Pine#1 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1
Pine#1 ∩ Cone#3 = 2
Pine#2 ∩ Cone#1 = 0
Pine#2 ∩ Cone#2 = 0
Pine#2 ∩ Cone#3 = 0
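A minimal sketch of this overlap computation for the PINE / CONE example, using the glosses quoted above and no stop-word filtering:

```python
pine = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
cone = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}

def overlap(gloss1, gloss2):
    # Number of word types shared by the two sense definitions.
    return len(set(gloss1.split()) & set(gloss2.split()))

scores = {(i, j): overlap(pine[i], cone[j]) for i in pine for j in cone}
print(scores)                      # pine#1 / cone#3 share two words ("evergreen", "of")
print(max(scores, key=scores.get)) # (1, 3)
```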
Lesk Algorithm for More than Two Words?
• I saw a man who is 98 years old and can still walk and tell jokes
– nine open class words: see(26), man(11), year(4), old(8), can(5), still(4),
walk(10), tell(8), joke(3)
• 43,929,600 sense combinations! How to find the optimal sense
combination?
• Simulated annealing (Cowie, Guthrie, Guthrie 1992)
– Define a function E = combination of word senses in a given text.
– Find the combination of senses that leads to highest definition overlap
(redundancy)
1. Start with E = the most frequent sense for each word
2. At each iteration, replace the sense of a random word in the set with a
different sense, and measure E
3. Stop iterating when there is no change in the configuration of senses
42
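A minimal sketch of this iterative search follows. It is closer to random hill-climbing than to full simulated annealing (no cooling schedule), and `definitions`, which maps each word to its list of sense glosses, is a hypothetical input:

```python
import random

def total_overlap(words, senses, definitions):
    # E: number of words shared across the chosen sense definitions (redundancy).
    bags = [set(definitions[w][s].split()) for w, s in zip(words, senses)]
    return sum(len(a & b) for i, a in enumerate(bags) for b in bags[i + 1:])

def search_senses(words, definitions, iterations=1000, seed=0):
    rng = random.Random(seed)
    senses = [0] * len(words)                       # start from the first sense of each word
    best = total_overlap(words, senses, definitions)
    for _ in range(iterations):
        i = rng.randrange(len(words))               # pick a random word ...
        candidate = senses[:]
        candidate[i] = rng.randrange(len(definitions[words[i]]))   # ... and a random sense for it
        score = total_overlap(words, candidate, definitions)
        if score >= best:                           # keep the change if it does not hurt E
            senses, best = candidate, score
    return senses, best
```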
Lesk Algorithm: A Simplified Version
• Original Lesk definition: measure overlap between sense definitions for
all words in context
– Identify simultaneously the correct senses for all words in context
• Simplified Lesk (Kilgarriff & Rosensweig 2000): measure overlap
between sense definitions of a word and current context
– Identify the correct sense for one word at a time
– Search space significantly reduced
Lesk Algorithm: A Simplified Version
• Algorithm for simplified Lesk:
1. Retrieve from MRD all sense definitions of the word to be disambiguated
2. Determine the overlap between each sense definition and the current context
3. Choose the sense that leads to highest overlap
Example: disambiguate PINE in
“Pine cones hanging in a tree”
• PINE
1. kinds of evergreen tree with needle-shaped leaves
2. waste away through sorrow or illness
44
Pine#1 ∩ Sentence = 1
Pine#2 ∩ Sentence = 0
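A minimal sketch of simplified Lesk for the PINE example above:

```python
def simplified_lesk(sense_definitions, context_sentence):
    context = set(context_sentence.lower().split())
    overlaps = {sense: len(set(gloss.lower().split()) & context)
                for sense, gloss in sense_definitions.items()}
    return max(overlaps, key=overlaps.get), overlaps

pine = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
print(simplified_lesk(pine, "Pine cones hanging in a tree"))
# -> (1, {1: 1, 2: 0}): sense 1 wins on the shared word "tree"
```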
Evaluations of Lesk Algorithm
• Initial evaluation by M. Lesk
– 50-70% on short samples of manually annotated text, with respect to the
Oxford Advanced Learner's Dictionary
• Simulated annealing
– 47% on 50 manually annotated sentences
• Evaluation on Senseval-2 all-words data, with back-off to
random sense (Mihalcea & Tarau 2004)
– Original Lesk: 35%
– Simplified Lesk: 47%
• Evaluation on Senseval-2 all-words data, with back-off to
most frequent sense (Vasilescu, Langlais, Lapalme 2004)
– Original Lesk: 42%
– Simplified Lesk: 58%
45
Outline
• Task definition
– Machine Readable Dictionaries
• Algorithms based on Machine Readable Dictionaries
• Selectional Preferences
• Measures of Semantic Similarity
• Heuristic-based Methods
Selectional Preferences
• A way to constrain the possible meanings of words in a
given context
• E.g. “Wash a dish” vs. “Cook a dish”
– WASH-OBJECT vs. COOK-FOOD
• Capture information about possible relations between
semantic classes
– Common sense knowledge
• Alternative terminology
– Selectional Restrictions
– Selectional Preferences
– Selectional Constraints
47
Acquiring Selectional Preferences
• From annotated corpora
– Circular relationship with the WSD problem
• Need WSD to build the annotated corpus
• Need selectional preferences to derive WSD
• From raw corpora
– Frequency counts
– Information theory measures
– Class-to-class relations
48
Preliminaries: Learning Word-to-Word Relations
• An indication of the semantic fit between two words
1. Frequency counts
– Pairs of words connected by a syntactic relation: Count(W1, W2, R)
2. Conditional probabilities
– Condition on one of the words:
P(W1 | W2, R) = Count(W1, W2, R) / Count(W2, R)
49
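A minimal sketch of these counts and the conditional probability, assuming (verb, object) pairs have already been extracted by a parser (the data here is made up for illustration):

```python
from collections import Counter

pairs = [("drink", "tea"), ("drink", "beer"), ("drink", "wine"),
         ("cook", "dish"), ("wash", "dish"), ("drink", "tea")]

pair_count = Counter(pairs)                     # Count(W1, W2, R)
object_count = Counter(w2 for _, w2 in pairs)   # Count(W2, R)

def p_w1_given_w2(w1, w2):
    return pair_count[(w1, w2)] / object_count[w2]

print(p_w1_given_w2("drink", "tea"))    # 1.0
print(p_w1_given_w2("cook", "dish"))    # 0.5
```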
Learning Selectional Preferences (1)
• Word-to-class relations (Resnik 1993)
– Quantify the contribution of a semantic class using all the concepts
subsumed by that class
A(W1, C2, R) = P(C2 | W1, R) · log[ P(C2 | W1, R) / P(C2) ]
               / Σ_C2 P(C2 | W1, R) · log[ P(C2 | W1, R) / P(C2) ]

– where

P(C2 | W1, R) = Count(W1, C2, R) / Count(W1, R)

Count(W1, C2, R) = Σ_{W2 ∈ C2} Count(W1, W2, R) / Count(W2)
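A minimal sketch of the selectional association A(W1, C2, R) defined above; the probabilities are made-up illustrations, whereas in practice P(C2 | W1, R) is estimated from parsed text by spreading each noun count over its WordNet classes:

```python
import math

def selectional_association(p_class_given_word, p_class):
    """Both arguments map a semantic class to a probability."""
    terms = {c: p * math.log(p / p_class[c])
             for c, p in p_class_given_word.items() if p > 0}
    strength = sum(terms.values())                  # the normalising denominator
    return {c: t / strength for c, t in terms.items()}

p_class = {"beverage": 0.05, "artifact": 0.40, "person": 0.55}       # prior P(C2)
p_class_given_drink = {"beverage": 0.80, "artifact": 0.15, "person": 0.05}
print(selectional_association(p_class_given_drink, p_class))
# "beverage" gets by far the highest association with "drink"
```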
Learning Selectional Preferences (2)
• Determine the contribution of a word sense based on the assumption of
equal sense distributions:
– e.g. “plant” has two senses → 50% of occurrences are sense 1, 50% are sense 2
• Example: learning restrictions for the verb “to drink”
– Find high-scoring verb-object pairs
Co-occ score   Verb    Object
11.75          drink   tea
11.75          drink   Pepsi
11.75          drink   champagne
10.53          drink   liquid
10.2           drink   beer
9.34           drink   wine

– Find “prototypical” object classes (high association score)

A(v,c)   Object class
3.58     (beverage, [drink, …])
2.05     (alcoholic_beverage, [intoxicant, …])
Learning Selectional Preferences (3)
• Other algorithms
– Learn class-to-class relations (Agirre and Martinez, 2002)
• E.g.: “ingest food” is a class-to-class relation for “eat chicken”
– Bayesian networks (Ciaramita and Johnson, 2000)
– Tree cut model (Li and Abe, 1998)
52
Using Selectional Preferences for WSD
Algorithm:
1. Learn a large set of selectional preferences for a given syntactic
relation R
2. Given a pair of words W1– W2 connected by a relation R
3. Find all selectional preferences W1– C (word-to-class) or C1– C2
(class-to-class) that apply
4. Select the meanings of W1 and W2 based on the selected semantic
class
Example: disambiguate coffee in “drink coffee”
1. (beverage) a beverage consisting of an infusion of ground coffee beans
2. (tree) any of several small trees native to the tropical Old World
3. (color) a medium to dark brown color
Given the selectional preference “DRINK BEVERAGE” : coffee#1
53
Evaluation of Selectional Preferences for WSD
• Data set
– mainly on verb-object, subject-verb relations extracted from
SemCor
• Compare against random baseline
• Results (Agirre and Martinez, 2000)
– Average results on 8 nouns
– Similar figures reported in (Resnik 1997)
Object
Precision
Random
19,2
Word-to-word
95,9
Word-to-class
66,9
Class-to-class
66,6
54
Subject
Recall F-meas. Precision Recall F-meas
19,2 19,2
19,2
19,2 19,2
24,9 39,5
74,2
18,0 29,0
58,0 62,1
56,2
46,8 51,1
64,8 65,7
54,0
53,7 53,8
Outline
• Task definition
– Machine Readable Dictionaries
•
•
•
•
55
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods
Semantic Similarity
• Words in a discourse must be related in meaning, for the
discourse to be coherent (Halliday and Hasan, 1976)
• Use this property for WSD – Identify related meanings for
words that share a common context
• Context span:
1. Local context: semantic similarity between pairs of words
2. Global context: lexical chains
56
Semantic Similarity in a Local Context
• Similarity determined between pairs of concepts, or
between a word and its surrounding context
• Relies on similarity metrics on semantic networks
– (Rada et al. 1989)
[Figure: fragment of a WordNet-style noun taxonomy rooted at carnivore, with
branches for fissiped mammal, canine/canid (wolf, dingo, wild dog, dog, hyena dog,
hunting dog, dachshund, terrier), feline/felid, hyena, and bear.]
Semantic Similarity Metrics (1)
• Input: two concepts (same part of speech)
• Output: similarity measure
• (Leacock and Chodorow 1998)
Similarity(C1, C2) = − log( Path(C1, C2) / 2D ),  where D is the taxonomy depth
– E.g. Similarity(wolf, dog) = 0.60; Similarity(wolf, bear) = 0.42
• (Resnik 1995)
– Define information content, P(C) = probability of seeing a concept of type
C in a large corpus
IC(C) = − log P(C)
– Probability of seeing a concept = probability of seeing instances of that
concept
– Determine the contribution of a word sense based on the assumption of
equal sense distributions:
• e.g. “plant” has two senses → 50% of occurrences are sense 1, 50% are sense 2
58
Semantic Similarity Metrics (2)
• Similarity using information content
– (Resnik 1995) Define similarity between two concepts (LCS = Least
Common Subsumer)
Similarity(C1, C2) = IC( LCS(C1, C2) )
– Alternatives (Jiang and Conrath 1997)
Similarity(C1, C2) = 2 × IC( LCS(C1, C2) ) − ( IC(C1) + IC(C2) )
• Other metrics:
– Similarity using information content (Lin 1998)
– Similarity using gloss-based paths across different hierarchies (Mihalcea
and Moldovan 1999)
– Conceptual density measure between noun semantic hierarchies and
current context (Agirre and Rigau 1995)
– Adapted Lesk algorithm (Banerjee and Pedersen 2002)
59
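Several of these metrics are available through NLTK's WordNet interface; a minimal sketch follows (it requires the 'wordnet' and 'wordnet_ic' data packages, and the exact scores depend on the WordNet version, so they will not match the numbers on the slides):

```python
from nltk.corpus import wordnet as wn, wordnet_ic

dog, wolf, bear = (wn.synset(s) for s in ("dog.n.01", "wolf.n.01", "bear.n.01"))
brown_ic = wordnet_ic.ic("ic-brown.dat")              # corpus counts used for IC(C)

print(wolf.lch_similarity(dog), wolf.lch_similarity(bear))   # Leacock & Chodorow
print(wolf.res_similarity(dog, brown_ic))                    # Resnik: IC of the LCS
print(wolf.jcn_similarity(dog, brown_ic))                    # Jiang & Conrath
print(wolf.lin_similarity(dog, brown_ic))                    # Lin
```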
Semantic Similarity Metrics for WSD
• Disambiguate target words based on similarity with one
word to the left and one word to the right
– (Patwardhan, Banerjee, Pedersen 2002)
Example: disambiguate PLANT in “plant with flowers”
PLANT
1. plant, works, industrial plant
2. plant, flora, plant life
Similarity(plant#1, flower) = 0.2
Similarity(plant#2, flower) = 1.5  →  plant#2
• Evaluation:
– 1,723 ambiguous nouns from Senseval-2
– Among 5 similarity metrics, (Jiang and Conrath 1997) provide the
best precision (39%)
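A minimal sketch in the spirit of the PLANT / "flower" example: pick the sense of the target that is most similar to some sense of a neighbouring word. It uses WordNet path similarity via NLTK rather than any specific metric from the slide, so the scores differ from the 0.2 / 1.5 shown:

```python
from nltk.corpus import wordnet as wn

def disambiguate_by_neighbor(target, neighbor, pos=wn.NOUN):
    neighbor_senses = wn.synsets(neighbor, pos=pos)
    def best_score(sense):
        # Best similarity between this candidate sense and any sense of the neighbor.
        return max((sense.path_similarity(n) or 0.0) for n in neighbor_senses)
    return max(wn.synsets(target, pos=pos), key=best_score)

print(disambiguate_by_neighbor("plant", "flower"))
# Expected to prefer the flora sense of "plant" over the industrial-plant sense.
```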
Outline
• Task definition
– Machine Readable Dictionaries
•
•
•
•
64
Algorithms based on Machine Readable Dictionaries
Selectional Restrictions
Measures of Semantic Similarity
Heuristic-based Methods
Most Frequent Sense (1)
• Identify the most often used meaning and use this meaning
by default
Example: “plant/flora” is used more often than “plant/factory”
- annotate any instance of PLANT as “plant/flora”
• Word meanings exhibit a Zipfian distribution
– E.g. distribution of word senses in SemCor
[Figure: distribution of word senses in SemCor – relative frequency (y-axis, 0 to 0.9)
versus sense number 1-10 (x-axis), plotted separately for nouns, verbs, adjectives,
and adverbs; frequency drops off sharply after the first sense.]
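A minimal sketch of the most-frequent-sense baseline: WordNet lists the senses of a word roughly in order of frequency (from SemCor), so its first-listed sense can serve as the default tag (requires NLTK's 'wordnet' data package):

```python
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=wn.NOUN):
    senses = wn.synsets(word, pos=pos)
    return senses[0] if senses else None          # tag every instance with this sense

print(most_frequent_sense("chair"))               # typically the furniture sense
```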
Most Frequent Sense (2)
• Method 1: Find the most frequent sense in an annotated corpus
• Method 2: Find the most frequent sense using a method based on
distributional similarity (McCarthy et al. 2004)
1. Given a word w, find the top k distributionally similar words
Nw = {n1, n2, …, nk}, with associated similarity scores
{dss(w,n1), dss(w,n2), …, dss(w,nk)}
2. For each sense wsi of w, identify the similarity with the words nj,
using the sense of nj that maximizes this score
3. Rank senses wsi of w based on the total similarity score

Score(wsi) = Σ_{nj ∈ Nw} dss(w, nj) · [ wnss(wsi, nj) / Σ_{wsi' ∈ senses(w)} wnss(wsi', nj) ]

where  wnss(wsi, nj) = max_{nsx ∈ senses(nj)} wnss(wsi, nsx)
Most Frequent Sense(3)
• Word senses
– pipe #1 = tobacco pipe
– pipe #2 = tube of metal or plastic
• Distributional similar words
– N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, …}
• For each word in N, find similarity with pipe#i (using the sense that
maximizes the similarity)
– pipe#1 – tube (#3) = 0.3
– pipe#2 – tube (#1) = 0.6
• Compute score for each sense pipe#i
– score (pipe#1) = 0.25
– score (pipe#2) = 0.73
Note: results depend on the corpus used to find distributionally
similar words => can find domain specific predominant senses
67
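A minimal sketch of the predominant-sense score defined above, applied to the "pipe" example; the dss and wnss numbers are illustrative made-up values, not figures from McCarthy et al.:

```python
def predominant_sense_scores(dss, wnss):
    """dss: {neighbor: distributional similarity to w}
       wnss: {(sense, neighbor): best WordNet similarity between them}"""
    senses = {s for s, _ in wnss}
    scores = {}
    for s in senses:
        total = 0.0
        for neighbor, d in dss.items():
            norm = sum(wnss[(other, neighbor)] for other in senses)   # sum over senses of w
            total += d * wnss[(s, neighbor)] / norm
        scores[s] = total
    return scores

dss = {"tube": 0.9, "cable": 0.6, "wire": 0.5}
wnss = {("pipe#1", "tube"): 0.3, ("pipe#2", "tube"): 0.6,
        ("pipe#1", "cable"): 0.1, ("pipe#2", "cable"): 0.5,
        ("pipe#1", "wire"): 0.1, ("pipe#2", "wire"): 0.4}
print(predominant_sense_scores(dss, wnss))   # pipe#2 scores higher, as above
```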
One Sense Per Discourse
• A word tends to preserve its meaning across all its occurrences in a
given discourse (Gale, Church, Yarowsky 1992)
• What does this mean?
E.g. if the ambiguous word PLANT occurs 10 times in a discourse,
all instances of “plant” carry the same meaning
• Evaluation:
– 8 words with two-way ambiguity, e.g. plant, crane, etc.
– 98% of the time, two occurrences of such a word in the same discourse carry the
same meaning
• A grain of salt: performance depends on granularity
– (Krovetz 1998) experiments with words with more than two senses
– Performance of “one sense per discourse” measured on SemCor is approx.
70%
68
One Sense per Collocation
• A word tends to preserve its meaning when used in the same
collocation (Yarowsky 1993)
– Strong for adjacent collocations
– Weaker as the distance between words increases
• An example
The ambiguous word PLANT preserves its meaning in all its
occurrences within the collocation “industrial plant”, regardless
of the context where this collocation occurs
• Evaluation:
– 97% precision on words with two-way ambiguity
• Finer granularity:
– (Martinez and Agirre 2000) tested the “one sense per collocation”
hypothesis on text annotated with WordNet senses
– 70% precision on SemCor words
69
Conceptual Density
• See the separate slides
70
References
• (Agirre and Rigau, 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995.
• (Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CONLL 2001.
• (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLING 2002.
• (Cowie, Guthrie and Guthrie 1992) Cowie, L., Guthrie, J. A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992.
• (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA Workshop 1992.
• (Halliday and Hasan 1976) Halliday, M. and Hasan, R. Cohesion in English. Longman, 1976.
• (Galley and McKeown 2003) Galley, M. and McKeown, K. Improving word sense disambiguation in lexical chaining. IJCAI 2003.
• (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. In WordNet: An Electronic Lexical Database, MIT Press.
• (Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997.
• (Krovetz, 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998.
• (Lesk, 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.
• (Lin 1998) Lin, D. An information-theoretic definition of similarity. ICML 1998.
References
• (Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000.
• (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.
• (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11), 1995.
• (Mihalcea and Moldovan, 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999.
• (Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000.
• (Mihalcea, Tarau, Figa 2004) Mihalcea, R., Tarau, P., and Figa, E. PageRank on semantic networks with application to word sense disambiguation. COLING 2004.
• (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using measures of semantic relatedness for word sense disambiguation. CICLING 2003.
• (Rada et al. 1989) Rada, R., Mili, H., Bicknell, E., and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1), 1989.
• (Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania, 1993.
• (Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995.
• (Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004.
• (Yarowsky, 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.
Summary (1): WSD and its main techniques
• We defined the two problems related to WSD
– WSD: given a (textual) context, determine the correct sense of one or
more words in that context
– Word Sense Discrimination: given the contexts of use of a word in a
corpus, determine the different senses needed to describe them coherently
• For WSD approaches:
– Dictionary-based methods (e.g. Lesk)
– Machine learning methods
• Supervised
• Unsupervised
– The role of similarity metrics
• Quantitative preference functions over the senses of a word, based on
– A pair of terms
– The structure of the semantic network given by the thesaurus (in the examples seen,
the WordNet network)
73
Summary (1.2): WSD and its main techniques
• Similarity metrics are applied to WSD through
– A model of the local context that selects the words (pairs) to which
the binary similarity function is applied
– They exploit the notion of Selectional Preference, i.e. the tendency of
some words (for example, verbs) to prefer certain semantic classes for
their main arguments (e.g. the direct object of “to cook”)
• The main models seen are
– Mutual association in WordNet (Resnik, 1997)
– Other metrics, e.g. (Jiang and Conrath 1997)
– The role of the Predominant Sense
74
Summary (2)
• Other semantic similarity measures (Conceptual Density)
– Operates on sets of n words
– Can exploit contexts that are
• Textual (n-word windows)
• Syntactic (e.g. the subjects of a given verb)
• Thematic (application domains)
– Can be used dynamically to change the notion of predominant sense
• Useful Links
– Lexical semantics, word senses and WordNet:
• http://www.cs.colorado.edu/%7Emartin/SLP/Updates/19.pdf
– Word Sense Disambiguation:
• http://www.cs.colorado.edu/%7Emartin/SLP/Updates/20.pdf
75
Part 7:
How to Get Started in
Word Sense Disambiguation
Research
Outline
• Where to get the required ingredients?
– Machine Readable Dictionaries
– Machine Learning Algorithms
– Sense Annotated Data
– Raw Data
• Where to get WSD software?
• How to get your algorithms tested?
– Senseval
184
Machine Readable Dictionaries
• Machine Readable format (MRD)
– Oxford English Dictionary
– Collins
– Longman Dictionary of Contemporary English (LDOCE)
• Thesauri – add synonymy information
– Roget Thesaurus http://www.thesaurus.com
• Semantic networks – add more semantic relations
– WordNet http://www.cogsci.princeton.edu/~wn/
• Dictionary files, source code
– EuroWordNet http://www.illc.uva.nl/EuroWordNet/
• Seven European languages
185
Machine Learning Algorithms
• Many implementations available online
• Weka: Java package of many learning algorithms
– http://www.cs.waikato.ac.nz/ml/weka/
– Includes decision trees, decision lists, neural networks, naïve
Bayes, instance-based learning, etc.
• C4.5: C implementation of decision trees
– http://www.cse.unsw.edu.au/~quinlan/
• Timbl: Fast optimized implementation of instance based
learning algorithms
– http://ilk.kub.nl/software.html
• SVM Light: efficient implementation of Support Vector
Machines
– http://svmlight.joachims.org
186
Sense Tagged Data
• A lot of annotated data available through Senseval
– http://www.senseval.org
• Data for lexical sample
– English (with respect to Hector, WordNet, Wordsmyth)
– Basque, Catalan, Chinese, Czech, Romanian, Spanish, etc.
– Data produced within Open Mind Word Expert project
http://teach-computers.org
• Data for all words
– English, Italian, Czech (Senseval-2 and Senseval-3)
– SemCor (200,000 running words)
http://www.cs.unt.edu/~rada/downloads.html
• Pointers to additional data available from
– http://www.senseval.org/data.html
187
Sense Tagged Data – Lexical Sample
<instance id="art.40008" docsrc="bnc_ANF_855">
<answer instance="art.40008" senseid="art%1:06:00::"/>
<context>
The evening ended in a brawl between the different factions in Cubism, but it brought a
moment of splendour into the blackouts and bombings of war. [/p] [p]
Yet Modigliani was too much a part of the life of Montparnasse, too involved with the
individuals leading the " new art " , to remain completely aloof.
In 1914 he had met Hans Arp, the French painter who was to become prominent in the
new Dada movement, at the artists' canteen in the Avenue du Maine.
Two years later Arp was living in Zurich, a member of a group of talented emigrant
artists who had left their own countries because of the war.
Through casual meetings at cafes, the artists drew together to form a movement in
protest against the waste of war, against nationalism and against everything
pompous, conventional or boring in the <head>art</head>of the Western world.
</context>
</instance>
188
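A minimal sketch of reading a Senseval-style lexical-sample file with the standard library; the file name is a placeholder and the exact element layout can vary between Senseval releases:

```python
import xml.etree.ElementTree as ET

root = ET.parse("english-lexical-sample.xml").getroot()
for instance in root.iter("instance"):
    answer = instance.find("answer")              # present in training data
    context = instance.find("context")
    target = context.find("head")                 # the marked target occurrence
    text = " ".join("".join(context.itertext()).split())
    print(instance.get("id"), answer.get("senseid"), target.text, text[:60], "...")
```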
Sense Tagged Data – SemCor
<p pnum=1>
<s snum=1>
<wf cmd=ignore pos=DT>The</wf>
<wf cmd=done rdf=group pos=NNP lemma=group wnsn=1 lexsn=1:03:00::
pn=group>Fulton_County_Grand_Jury</wf>
<wf cmd=done pos=VB lemma=say wnsn=1 lexsn=2:32:00::>said</wf>
<wf cmd=done pos=NN lemma=friday wnsn=1 lexsn=1:28:00::>Friday</wf>
<wf cmd=ignore pos=DT>an</wf>
<wf cmd=done pos=NN lemma=investigation wnsn=1
lexsn=1:09:00::>investigation</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done pos=NN lemma=atlanta wnsn=1 lexsn=1:15:00::>Atlanta</wf>
<wf cmd=ignore pos=POS>'s</wf>
<wf cmd=done pos=JJ lemma=recent wnsn=2 lexsn=5:00:00:past:00>recent</wf>
<wf cmd=done pos=NN lemma=primary_election wnsn=1
lexsn=1:04:00::>primary_election</wf>
<wf cmd=done pos=VB lemma=produce wnsn=4 lexsn=2:39:01::>produced</wf>
…
189
Raw Data
• For use with
– Bootstrapping algorithms
– Word sense discrimination algorithms
• British National Corpus
– 100 million words covering a variety of genres, styles
– http://www.natcorp.ox.ac.uk/
• TREC (Text Retrieval Conference) data
– Los Angeles Times, Wall Street Journal, and more
– 5 gigabytes of text
– http://trec.nist.gov/
• The Web
190
Outline
• Where to get the required ingredients?
– Machine Readable Dictionaries
– Machine Learning Algorithms
– Sense Annotated Data
– Raw Data
• Where to get WSD software?
• How to get your algorithms tested?
– Senseval
191
WSD Software – Lexical Sample
• Duluth Senseval-2 systems
– Lexical decision tree systems that participated in Senseval-2 and 3
– http://www.d.umn.edu/~tpederse/senseval2.html
• SyntaLex
– Enhances Duluth Senseval-2 with syntactic features; participated in Senseval-3
– http://www.d.umn.edu/~tpederse/syntalex.html
• WSDShell
– Shell for running Weka experiments with a wide range of options
– http://www.d.umn.edu/~tpederse/wsdshell.html
• SenseTools
– For easy implementation of supervised WSD; used by the above three systems
– Transforms Senseval-formatted data into the files required by Weka
– http://www.d.umn.edu/~tpederse/sensetools.html
• SenseRelate::TargetWord
– Identifies the sense of a word based on the semantic relation with its neighbors
– http://search.cpan.org/dist/WordNet-SenseRelate-TargetWord
– Uses WordNet::Similarity – measures of similarity based on WordNet
• http://search.cpan.org/dist/WordNet-Similarity
192
WSD Software – All Words
• SenseLearner
– A minimally supervised approach for all open class words
– Extension of a system participating in Senseval-3
– http://lit.csci.unt.edu/~senselearner
– Demo on Sunday, June 26 (1:30-3:30)
• SenseRelate::AllWords
– Identifies the sense of a word based on the semantic relation with
its neighbors
– http://search.cpan.org/dist/WordNet-SenseRelate-AllWords
– Demo on Sunday, June 26 (1:30-3:30)
193
WSD Software – Unsupervised
• Clustering by Committee
– http://www.cs.ualberta.ca/~lindek/demos/wordcluster.htm
• InfoMap
– Represent the meanings of words in vector space
– http://infomap-nlp.sourceforge.net
• SenseClusters
– Finds clusters of words that occur in similar context
– http://senseclusters.sourceforge.net
– Demo Sunday, June 26 (4:00-6:00)
194
Outline
• Where to get the required ingredients?
– Machine Readable Dictionaries
– Machine Learning Algorithms
– Sense Annotated Data
– Raw Data
• Where to get WSD software?
• How to get your algorithms tested?
– Senseval
195
Senseval
• Evaluation of WSD systems http://www.senseval.org
• Senseval 1: 1999 – about 10 teams
• Senseval 2: 2001 – about 30 teams
• Senseval 3: 2004 – about 55 teams
• Senseval 4: 2007(?)
• Provides sense annotated data for many languages, for
several tasks
– Languages: English, Romanian, Chinese, Basque, Spanish, etc.
– Tasks: Lexical Sample, All words, etc.
• Provides evaluation software
• Provides results of other participating systems
Senseval
[Figure: "Senseval Evaluations" – bar chart of the number of tasks, teams, and systems
in Senseval 1, Senseval 2, and Senseval 3, showing steady growth across the three
evaluations (y-axis scale 0-160).]