Word Sense Induction Using Graphs of Collocations

WSI using Graphs of Collocations
Paper by: Ioannis P. Klapaftis and Suresh Manandhar
Presented by: Ahmad R. Shahid
Word Sense Induction (WSI)
• Identifying different senses (uses) of a word
• Finds applications in Information Retrieval (IR)
and Machine Translation (MT)
• Most of the work in WSI is based on the vector-space model
– Each context of a target word is represented as a
vector of features
– Context vectors are clustered and the resulting
clusters are taken to represent the induced senses.
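As a rough illustration of the vector-space approach just described, the following minimal sketch (not the paper's method) clusters bag-of-words context vectors with scikit-learn; the contexts, target word and cluster count are invented for the example.

```python
# Minimal vector-space WSI sketch (illustrative only): each context of the
# target word becomes a bag-of-words vector, and the vectors are clustered
# into induced senses.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

contexts = [
    "connection to our television network is free of charge",
    "the cable network broadcasts the show every evening",
    "connect to the BT network and reboot your system",
    "the computer network requires new connection software",
]

X = CountVectorizer(stop_words="english").fit_transform(contexts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster ids = induced senses of "network"
```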
Word Sense Induction (WSI)
Graph based methods
• Agirre et al. (2007) used co-occurrence graphs
– Vertices are words and two vertices share an edge if
they co-occur in the same context
• Each edge receives a weight indicating the strength of
relationship between words (vertices)
– Co-occurrence graphs contain highly dense subgraphs representing the different senses (clusters) the target word may have
• Each cluster has a “hub”
– Hubs are vertices with a high degree (many connections)
Graph based methods
• Each cluster (induced sense) consists of a
set of words that are semantically related
to the particular sense.
• Graph-based methods assume that each
context word is related to one and only
one sense of the target word
– Not always valid
Graph based methods
• Consider the contexts for the target word
network:
– To install our satellite system please call our
technicians and book an appointment. Connection to
our television network is free of charge…
– To connect to the BT network, proceed with the
installation of the connection software and then
reboot your system…
• Two senses are used: 1) Television Network, 2) Computer
Network
Graph based methods
• Any hard-clustering approach would assign system to only one of the two senses of network, even though it is related to both
– The same is true for connection
• The two words cannot be filtered out as
noise since they are semantically related
to the target word
WSI using Graph Clustering
Small Lexical Worlds
Small Lexical Worlds
• Small worlds
– The characteristic path length (L)
• Mean length of the shortest path between two nodes of the graph. Let d_min(i, j) be the length of the shortest path between two nodes i and j, and let N be the total number of nodes:
L = \frac{1}{N} \sum_{i=1}^{N} d_{min}(i, j)
– The clustering coefficient (C)
Clustering Coefficient
• For each node i, one can define a local clustering coefficient C_i equal to the proportion of connections E(Γ(i)) that actually exist between the neighbors Γ(i) of that node
• For a node i with four neighbors, the maximum number of connections is \binom{|Γ(i)|}{2} = 6
– If five of these connections actually exist, C_i = 5/6 ≈ 0.83
• The global coefficient C is the mean of the local coefficients:
C = \frac{1}{N} \sum_{i=1}^{N} \frac{E(Γ(i))}{\binom{|Γ(i)|}{2}}
– It is 0 for a totally disconnected graph and 1 for a complete graph
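The worked example above can be checked with a few lines of code; this sketch assumes networkx, which the slides do not prescribe.

```python
# Local clustering coefficient C_i for the example above:
# node 0 has four neighbours; 5 of the 6 possible edges
# between those neighbours exist, so C_0 = 5/6.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (0, 3), (0, 4)])          # node 0 and its 4 neighbours
G.add_edges_from([(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)])  # 5 of the 6 neighbour-neighbour edges

print(nx.clustering(G, 0))        # 0.833... = 5/6
print(nx.average_clustering(G))   # global C = mean of the local coefficients
```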
Small World Networks
• They lie somewhere between regular graphs and random graphs
• For a random graph of N nodes whose mean degree is k:
L_{rand} \approx \frac{\log(N)}{\log(k)}, \qquad C_{rand} \approx \frac{2k}{N}
• Small world graphs are characterized by:
L \approx L_{rand} \qquad \text{and} \qquad C \gg C_{rand}
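A small sketch of the small-world criterion on a toy graph, assuming networkx; the random-graph estimates follow the expressions quoted above.

```python
# Small-world check: compare L and C of a graph with the random-graph
# estimates quoted above (L_rand ~ log N / log k, C_rand as on the slide).
import math
import networkx as nx

G = nx.connected_watts_strogatz_graph(n=1000, k=10, p=0.1, seed=0)  # toy small-world graph

N = G.number_of_nodes()
k = sum(dict(G.degree()).values()) / N        # mean degree
L = nx.average_shortest_path_length(G)        # characteristic path length
C = nx.average_clustering(G)                  # clustering coefficient

L_rand = math.log(N) / math.log(k)
C_rand = 2 * k / N                            # expression used on the slide

print(f"L={L:.2f} vs L_rand={L_rand:.2f}")    # expect L close to L_rand
print(f"C={C:.3f} vs C_rand={C_rand:.3f}")    # expect C >> C_rand
```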
Small World Networks
• At a constant mean degree, the number of nodes can increase exponentially, whereas the characteristic path length will only increase in a linear way
– Cf. “six degrees of separation”
• In a small world there will be bundles or highly
connected groups
– Friends of a given individual will be much more likely
to be acquainted with each other than would be
predicted if the edges of the graph were simply drawn
at random
Small World Networks
Adam Smith
• Every individual...generally, indeed, neither
intends to promote the public interest, nor knows
how much he is promoting it. By preferring the
support of domestic to that of foreign industry he
intends only his own security; and by directing
that industry in such a manner as its produce
may be of the greatest value, he intends only his
own gain, and he is in this, as in many other
cases, led by an invisible hand to promote an
end which was no part of his intention.
• The Wealth of Nations, Book IV Chapter II
• “Greed is good”: the basic theme of Capitalism
Co-occurrence Graphs
• Co-occurrence graphs are small world
graphs
– The number of nodes can increase
exponentially, whereas the characteristic path
length will only increase in a linear way
• They are scale-free
– They contain a small number of highly
connected hubs and a large number of weakly
connected nodes
Co-occurrence Graphs
• Since they are small-world networks, they contain highly dense subgraphs which represent the different clusters (senses) the target word may have
High Density Components
• Different uses of a target word form highly
interconnected “bundles” in a small world
of cooccurrences (high density
components)
– Barrage (in the sense of a hydraulic dam)
must cooccur frequently with eau, ouvrage,
riviere, crue, irrigation, production, electricite
(water, work, river, flood, irrigation,
production, electricity)
• Those words themselves are likely to be
interconnected
High Density Components
• Detecting the different uses of a word
amounts to isolating the high density
components in the cooccurrence graph
– Most exact graph-partitioning techniques are
NP-hard
• Given that graphs have several thousand nodes
and edges, only approximate heuristic-based
methods can be employed
Detecting Root Hubs
• In every high-density component, one of
the nodes has a higher degree than the
others
– Called the component’s root hub
• For the most frequent use of barrage (hydraulic
dam), the root hub is the word eau (water).
• The isolated root hub is deleted along with all of its neighbors
– A root hub must have at least 6 neighbors (a threshold determined empirically)
Minimum Spanning Tree (MST)
• After isolating the root hub along with all
its neighbors, the next root hub is
identified and the process is repeated
• A MST is computed by taking the target
word as the root and making its first level
consist of the previously identified root
hubs
Minimum Spanning Tree (MST)
Veronis Algorithm
• Iteratively finds the candidate root hub
– The one with the highest degree
• The root hub is deleted along with its
direct neighbors from the graph
– Only if it satisfies certain heuristics
• Minimum number of vertices in a hub
• The average weight between the candidate root
hub and its adjacent neighbors
• Minimum frequency of a root hub
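A simplified sketch of the iterative root-hub loop described above; the function name and the heuristic thresholds (minimum number of neighbours, minimum average edge weight) are illustrative placeholders rather than the paper's exact settings.

```python
# Simplified sketch of the root-hub loop described above (HyperLex-style):
# repeatedly take the highest-degree node as a candidate root hub, then
# delete it together with its neighbours. Thresholds are illustrative only.
import networkx as nx

def detect_root_hubs(G, min_neighbors=6, min_avg_weight=0.2):
    """Return candidate root hubs of a weighted co-occurrence graph."""
    G = G.copy()
    hubs = []
    while G.number_of_nodes() > 0:
        # candidate root hub = remaining node with the highest degree
        hub = max(G.nodes, key=G.degree)
        neighbors = list(G.neighbors(hub))
        if len(neighbors) < min_neighbors:
            break  # no sufficiently dense component left
        avg_w = sum(G[hub][n].get("weight", 1.0) for n in neighbors) / len(neighbors)
        if avg_w >= min_avg_weight:
            hubs.append(hub)
        # delete the hub and its direct neighbours, then continue
        # (in this sketch we delete even when the weight check fails,
        #  so that the loop is guaranteed to terminate)
        G.remove_nodes_from(neighbors + [hub])
    return hubs
```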
Collocational Graphs for WSI
• Let bc be the base corpus
– It consists of paragraphs containing the target word tw
• The aim is to induce the senses of tw given bc as the only input
• Let rc be a large reference corpus
– British National Corpus (BNC) has been used
for this study
Corpus pre-processing
• Initially, tw is removed from bc
• Each paragraph of bc and rc is POS-tagged
• Only nouns are kept and lemmatized
– Since they are less ambiguous than verbs, adverbs or
adjectives
• At this stage each paragraph p_i, both in bc and rc, is a list of lemmatized nouns
Corpus pre-processing
• Each paragraph p_i in bc contains nouns which are semantically related to tw, as well as common nouns which are noisy, in the sense that they are not semantically related to tw
• In order to filter out the noise, they used a technique based on corpora comparison using the log-likelihood ratio (G^2)
Corpus pre-processing
• The aim is to check whether the distribution of a word w_i, given that it appears in bc, is similar to its distribution given that it appears in rc, i.e. p(w_i | bc) = p(w_i | rc)
– This is the null hypothesis
– If it is true, G^2 will have a small value, and w_i should be removed from the paragraphs of bc
Corpus pre-processing
• If the probability of occurrence of a word in the base corpus is the same as in the reference corpus, then it loses its discriminating power and must be weeded out
– In other words, if the observed frequency of a word is very close to its expected value, then it really has not got much to say
p(w_i | bc) = p(w_i | rc) = p(w_i)
Corpus pre-processing
• The expressions are:
G^2 = 2 \sum_{i,j} n_{ij} \log \frac{n_{ij}}{m_{ij}}
m_{ij} = \frac{\sum_{k=1}^{2} n_{ik} \cdot \sum_{k=1}^{2} n_{kj}}{N}
– The n_ij correspond to values in the observed values (OT) table
– The m_ij correspond to values in the expected values (ET) table
• The values in ET are calculated from the values in OT using the equation for m_ij
Corpus pre-processing
• They created two noun frequency lists
– lbc, derived from the bc corpus
– lrc, derived from the rc corpus
• For word wi ∈ lbc , they created two contingency
tables
– OT contains the observed counts taken from lbc and
lrc
– ET contains the expected values under the model of
independence
Corpus pre-processing
• Then they calculated G^2, where n_ij is the (i, j) cell of OT, m_ij is the (i, j) cell of ET, and N = \sum_{i,j} n_{ij}
• lbc is first filtered by removing words which have a relative frequency in lbc less than in lrc
– The resulting lbc is then sorted by the G^2 values
• The G^2-sorted list is used to remove words from each paragraph of bc which have a G^2 value less than a pre-specified threshold (parameter p1)
• At the end of this stage, each paragraph p_i ∈ bc is a list of nouns which are assumed to be topically related to the target word tw
Collocational Graph
• A key problem at this stage is the
determination of related nouns
– They can be grouped into collocations
• Where each collocation is assigned a weight
• In this study collocations of size 2 are
considered (pairs of words)
– They consist of 2 nouns
Collocational Graph
• Collocations are detected by generating all the \binom{n}{2} combinations of nouns for each n-length paragraph
– Then measuring their frequency
• The frequency of a collocation is the number of paragraphs which contain that collocation
• Consider the following paragraphs:
– To install our satellite system please call our technicians and book an appointment. Connection to our television network is free of charge…
– To connect to the BT network, proceed with the installation of the connection software and then reboot your system…
– All the \binom{n}{2} combinations for each n-length paragraph of our example would provide us with 24 unique collocations, such as {system, technician}, {system, connection}, etc.
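A sketch of the collocation-extraction step; the noun lists below are a simplified rendering of the two example paragraphs, not the paper's exact pre-processed output.

```python
# Sketch of collocation extraction: all pairs of (filtered) nouns in each
# paragraph, with frequency = number of paragraphs containing the pair.
from collections import Counter
from itertools import combinations

def collocation_frequencies(paragraphs):
    """paragraphs: list of lists of lemmatized nouns."""
    freq = Counter()
    for nouns in paragraphs:
        # each unordered pair is counted at most once per paragraph
        for pair in combinations(sorted(set(nouns)), 2):
            freq[pair] += 1
    return freq

# simplified noun lists for the two example paragraphs (tw "network" removed)
paragraphs = [
    ["satellite", "system", "technician", "appointment", "connection", "television"],
    ["installation", "connection", "software", "system"],
]
print(collocation_frequencies(paragraphs).most_common(3))
```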
Collocational Graph
• Although the use of G^2 aims at keeping in bc words which are related to the target one, this does not necessarily mean that their pairwise combinations are useful for discriminating the senses of tw
– For example, ambiguous collocations which are related to both senses of tw, such as the {system, connection} collocation, should not be taken into account
• To circumvent this problem, each extracted collocation is assigned a weight, which measures the relative frequency of the two nouns co-occurring
– Collocations are usually weighted using information-theoretic measures such as pointwise mutual information (PMI)
Collocational Graph
• Conditional probabilities produce better results than PMI, which overestimates rare events
– Hence they used conditional probabilities
• Let freq_ij denote the number of paragraphs in which nouns i and j co-occur, and freq_j denote the number of paragraphs in which noun j occurs
– Since G^2 allows us to capture the words which are related to tw, the calculation of collocation frequencies takes place on the whole SemEval-2007 WSI (SWSI) corpus (27132 paragraphs)
• This deals with data sparsity and determines whether a candidate collocation appears frequently enough to be included in the graphs
Collocational Graph
• We can measure the conditional probabilities p(i | j) and p(j | i) in a similar manner:
p(i | j) = \frac{freq_{ij}}{freq_j}
– The final weight applied to collocation c_ij is the average of the calculated conditional probabilities:
w_{c_{ij}} = \frac{p(i | j) + p(j | i)}{2}
– They only extracted collocations which had a frequency (parameter p2) and weight (parameter p3) higher than pre-specified thresholds
• The collocational graph can now be created, in which each extracted and weighted collocation is represented as a vertex, and two vertices share an edge if they co-occur in one or more paragraphs of bc
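A sketch of the weighting and initial graph construction, assuming precomputed paragraph frequencies for nouns and collocations; the p2 and p3 values are placeholders, not the paper's tuned parameters.

```python
# Sketch of collocation weighting and initial graph construction:
# weight = average of p(i|j) and p(j|i); keep collocations above the
# frequency (p2) and weight (p3) thresholds; connect two collocations
# if they co-occur in at least one paragraph of bc.
from itertools import combinations
import networkx as nx

def build_collocation_graph(bc_paragraphs, colloc_freq, noun_freq, p2=4, p3=0.2):
    """colloc_freq / noun_freq: paragraph frequencies over the large corpus."""
    weights = {}
    for (i, j), f_ij in colloc_freq.items():
        if f_ij < p2:
            continue
        w = 0.5 * (f_ij / noun_freq[j] + f_ij / noun_freq[i])  # avg of p(i|j), p(j|i)
        if w >= p3:
            weights[(i, j)] = w

    G = nx.Graph()
    G.add_nodes_from(weights)          # one vertex per surviving collocation
    for nouns in bc_paragraphs:        # bc paragraphs as lists of nouns
        present = [c for c in weights if set(c) <= set(nouns)]
        for c1, c2 in combinations(present, 2):
            G.add_edge(c1, c2)         # edges are weighted in a later stage
    return G
```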
Collocational Graph
• The next stage is to weight the edges of the initial
collocational graph, G, as well as to discover new
edges connecting the vertices of G
– The constructed graph is sparse since we are
attempting to identify rare events, i.e. edges
connecting collocations
• The solution to the problem of data sparsity is smoothing
Weighting and Populating
• For each vertex i (collocation ci ), they associated
a vertex vector VCi containing the vertices
(collocations), which share an edge with i in
graph G
• The table shows an example of two vertices, i.e. cnn_nbc and nbc_news, which are not connected in the graph G of the target word network
Weighting and Populating
• In the next step, the similarity between each
vertex vector VCi and each vertex vector VC j is
calculated
• Lee [4] showed that the Jaccard similarity coefficient (JC) has superior performance over other symmetric similarity measures such as cosine, L1 norm, Euclidean distance, Jensen-Shannon divergence, etc.
Weighting and Populating
• Using JC for estimating the similarity between vertex vectors yields:
JC(VC_i, VC_j) = \frac{|VC_i \cap VC_j|}{|VC_i \cup VC_j|}
– Two collocations c_i and c_j are said to be mutually similar if c_i is the most similar collocation to c_j and the other way around
Weighting and Populating
• Two mutually similar collocations ci and c j are
clustered with the result that an occurrence of a
collocation ck with one of ci , c j is also counted as
an occurrence with the other collocation
– In table (slide 34) if cnn_nbc and nbc_news are
mutually similar, then the zero-frequency event
between nbc_news and cnn_tv is set equal to the joint
frequency between cnn_nbc and cnn_tv
• Many collocations connected to one of the target collocations
are not connected to the other, although they should be, since
both of the target collocations are contextually related i.e. both
of them refer to the Television Network sense.
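A sketch of the smoothing step on a networkx-style graph of collocations; it only adds the inherited edges (presence), leaving the frequency sharing and edge weighting to the next stage, and it is a quadratic-time simplification.

```python
# Sketch of the smoothing step: compute Jaccard similarity between the
# neighbour sets of vertices, find mutually most-similar pairs, and let
# each member of such a pair inherit the other's neighbours.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def smooth(G):
    """G: a networkx-style graph whose nodes are collocations."""
    vectors = {v: set(G.neighbors(v)) for v in G.nodes}
    # most similar vertex for each vertex
    best = {v: max((u for u in G.nodes if u != v),
                   key=lambda u: jaccard(vectors[v], vectors[u]))
            for v in G.nodes}
    for v, u in best.items():
        if best[u] == v:                       # mutually similar pair
            for w in vectors[u] - vectors[v]:  # zero-frequency events for v
                if w != v:
                    G.add_edge(v, w)           # inherit u's neighbour (weighted later)
    return G
```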
Weighting and Populating
• The weight applied to each edge connecting vertices i and j (collocations c_i and c_j) is the maximum of their conditional probabilities, where:
p(i | j) = \frac{freq_{ij}}{freq_j}
• Here freq_ij denotes the number of paragraphs in which collocations c_i and c_j co-occur, and freq_j denotes the number of paragraphs in which collocation c_j occurs
Inducing Senses and Tagging
• The final graph G′, resulting from the previous stage, is clustered in order to produce the induced senses
• The two criteria for choosing a clustering
algorithm were:
– Its ability to automatically induce the number of
clusters
– Its execution time
Inducing Senses and Tagging
• Markov Clustering Algorithm (MCL)
– Is fast
– Is based on stochastic flow in graphs
– The number of clusters produced depends on an inflation parameter
• Chinese Whispers (CW)
– Is a randomized graph-clustering method
– Is time-linear in the number of edges
– Does not require any input parameters
– Is not guaranteed to converge
– Automatically infers the number and size of clusters
Inducing Senses and Tagging
• Normalized MinCut
– Graph partitioning technique
– Graph is partitioned into two subgraphs by
minimising the total association between the
two subgraphs
– Iteratively applied for each extracted
subgraph until a user-defined criterion is met
(e.g. number of clusters)
Inducing Senses and Tagging
• Initially, CW assigns each vertex to a distinct class
• Each vertex i is processed for an x (parameter p4 )
number of iterations and inherits the strongest
class in its local neighbourhood (LN) in an update
step.
– LN is defined as the set of vertices which share a direct
connection with vertex i
• During the update step for a vertex i : each class, cl , receives a
score equal to the sum of the weights of edges ( i, j ), where j
has been assigned class cl
– The maximum score determines the strongest class
» In case of multiple classes, one is chosen at random
• Classes are updated immediately, which means that a node
can inherit classes from its LN that were introduced there in the
same iteration
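A sketch of Chinese Whispers as summarised above; the iteration count stands in for parameter p4, and edge weights default to 1 when absent.

```python
# Sketch of Chinese Whispers as described above: each vertex starts in its
# own class; on every iteration a vertex adopts the class with the highest
# total edge weight among its neighbours, with immediate updates.
import random
from collections import defaultdict

def chinese_whispers(G, iterations=20, seed=0):
    rng = random.Random(seed)
    classes = {v: v for v in G.nodes}             # every vertex in its own class
    nodes = list(G.nodes)
    for _ in range(iterations):                   # stands in for parameter p4
        rng.shuffle(nodes)                        # process vertices in random order
        for v in nodes:
            scores = defaultdict(float)
            for u in G.neighbors(v):
                scores[classes[u]] += G[v][u].get("weight", 1.0)
            if scores:
                best = max(scores.values())
                winners = [c for c, s in scores.items() if s == best]
                classes[v] = rng.choice(winners)  # random tie-breaking
    return classes                                # class label per vertex (induced senses)
```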
Evaluation
• The WSI approach was evaluated under
the framework of SemEval-2007 WSI task
(SWSI)
• The corpus consists of texts of the Wall
Street Journal corpus, and is hand tagged
with OntoNotes senses
• They focused on all 35 nouns of SWSI,
ignoring verbs
Evaluation
• They induced the senses of each target noun, tn,
and then they tagged each instance of tn with
one of its induced senses
• SWSI organizers employ two evaluation schemes
– Unsupervised evaluation
• The results of systems are treated as clusters of target noun
contexts and gold standard (GS) senses as classes
– Supervised evaluation
• The training corpus is used to map the induced clusters to GS
senses. The testing corpus is then used to measure
performance
Evaluation
• Perfect clustering solution is defined in
terms of Homogeneity and Completeness
• Homogeneity
– Where each induced cluster has exactly the
same contexts as one of the classes
• Completeness
– Where each class has exactly the same
contexts as one of the clusters
Evaluation
• F-Score is used to assess the overall quality of clustering
– It measures both Homogeneity and Completeness
– Other measures, such as entropy and purity, only measure the first
Evaluation
• Purity
– Let q be the number of classes in the gold standard (GS), k the number of clusters, n the total number of instances, n_r the size of cluster r, and n_r^i the number of data points in class i that belong to cluster r; then:
Purity = \frac{1}{n} \sum_{r=1}^{k} \max_i n_r^i
Entropy = \sum_{r=1}^{k} \frac{n_r}{n} \left( -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r} \right)
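A sketch computing purity and entropy from parallel lists of induced-cluster and gold-standard labels, following the formulas above.

```python
# Purity and entropy of a clustering, following the formulas above.
# clusters / classes: parallel lists giving, for every instance, its induced
# cluster id and its gold-standard class id.
import math
from collections import Counter, defaultdict

def purity_entropy(clusters, classes):
    n = len(clusters)
    q = len(set(classes))                       # number of GS classes
    by_cluster = defaultdict(Counter)
    for cl, gs in zip(clusters, classes):
        by_cluster[cl][gs] += 1                 # n_r^i counts

    purity = sum(max(counts.values()) for counts in by_cluster.values()) / n

    norm = math.log(q) if q > 1 else 1.0        # guard for a single-class GS
    entropy = 0.0
    for counts in by_cluster.values():
        n_r = sum(counts.values())
        h = -sum((c / n_r) * math.log(c / n_r) for c in counts.values())
        entropy += (n_r / n) * h / norm         # normalised by log q
    return purity, entropy
```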
Evaluation
• Their WSI methodology used Jaccard similarity
to populate the graph, referred to as Col-JC
– Col-BL induces senses as Col-JC does but without
smoothing
• Baselines
– 1cl1inst: assigns each instance to a distinct cluster
– 1c1w: groups all instances of a target word into
one cluster
• equivalent to the most frequent sense (MFS) baseline in the supervised evaluation
– The sense which appears most often in the annotated text
Evaluation
• The evaluation results are given in the table below:
– UOY and UBC-AS used labeled data for parameter estimation
– I2R, UPV_SI and UMND2 do not state how their parameters were estimated
Analysis
• Evaluation of WSI methods is a difficult
task
– 1cl1inst baseline achieves a perfect purity and
entropy, however scores low on F-Score
• Because senses of GS are spread among induced
clusters causing a low unsupervised recall
• Supervised recall of 1cl1inst is undefined, due to
the fact that each cluster tags one and only one
instance in the corpus
– Clusters tagging instances in the test corpus do not tag
any instances in the train corpus and the mapping cannot
be performed
Analysis
• The 1c1w baseline achieves high F-Score performance due to the dominance of the MFS in the testing corpus
– Its purity, entropy and supervised recall are much worse than those of other systems
Analysis
• A clustering solution which achieves high supervised recall does not necessarily achieve a high F-Score
– Because F-Score penalizes systems for getting the number of GS classes wrong, as the 1cl1inst baseline does
Analysis
• No system other than theirs was able to achieve high performance in both evaluation settings
• Col-BL (Col-JC) achieved a 72.9% (78%) F-Score
Analysis
• The aim of smoothing was to reduce the number of clusters and obtain a better mapping of clusters to GS senses, without harming the clustering quality
Bibliography
1) Klapaftis, I.P., Manandhar, S.: Word Sense Induction Using Graphs of Collocations. In: Proceedings of the 18th European Conference on Artificial Intelligence (ECAI-2008).
2) Agirre, E., Soroa, A.: UBC-AS: A graph based unsupervised system for induction and classification. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pp. 346–349 (2007).
Bibliography
3) Véronis, J.: Hyperlex: lexical cartography for information retrieval. Computer Speech & Language, 18(3):223–252 (2004).
4) Lee, L.: Measures of distributional similarity. In: Proceedings of ACL '99, pp. 25–32 (1999).