Synonym Dictionary Improvement through Markov Clustering and Clustering Stability*
David Gfeller, Jean-Cédric Chappelier, and Paolo De Los Rios
Ecole Polytechnique Fédérale de Lausanne (EPFL)
CH-1015 Lausanne, Switzerland
Abstract. The aim of the work presented here is to clean up a dictionary of synonyms which appeared to be ambiguous, incomplete and inconsistent. The key idea is to apply Markov Clustering and Clustering Stability techniques to the network that represents the synonymy relation contained in the dictionary. Each densely connected cluster is considered to correspond to a specific concept, and ambiguous words are identified by evaluating the stability of the clustering under noise. This makes it possible to disambiguate polysemic words, introducing missing senses where required, and to merge similar senses of the same word when necessary.
Keywords: Complex Network, Markov Clustering Algorithm, Clustering Stability,
Community Structure, Synonymy, Word Sense Disambiguation.
Introduction
One aspect of the complexity of text mining comes from the synonymy and polysemy of words. The aim of Word Sense Disambiguation (WSD) is precisely to associate a specific sense with every word in context [Besançon et al., 2001, Schütze, 1998]. The determination of the list of possible senses for a given word is a key aspect of this disambiguation. A possible starting point is a dictionary of synonyms. It happens, however, that due to both inherent human errors and errors coming from the automatic (semi-supervised) construction of the dictionary, the synonymy network can contain several mistakes and turn out to be ambiguous, incomplete or inconsistent.
This paper starts with a brief description of the synonymy network derived from the dictionary we used. Then, the clustering algorithm applied to improve it is presented, and the general method used to identify ambiguities in the clustering is introduced. Finally, the results obtained are discussed.
1 The Synonymy Network
The synonymy network we consider here has been built from a French dictionary of synonyms. The synonymy relation is defined between words in one of their senses and is considered to be symmetric.
* This work was financially supported by the EU commission under contracts COSIN (FET Open IST 2001-33555) and DELIS (FET Open 001907).
Fig. 1. MCL clustering of the component with 185 elements (different levels of grey represent different clusters). Unstable nodes (explained in section 3) are represented with diamonds.
The resulting network is thus undirected (and unweighted). It is not fully connected but consists of many disconnected components[1], with a power-law size distribution (see fig. 2); the nodes inside a single component represent the same "concept"[2].
The first problem encountered in this dictionary was the existence of much too large synonym components (up to almost 10,000 nodes, see fig. 2). The transitive closure of the synonymy relation connects words which have different senses; e.g. fêtard ("merrymaker") and sans-abri ("homeless") (see fig. 1). This is due to words which are still ambiguous[3] and relate different "concepts": even if a path exists between two nodes, the slight changes in sense that can occur at each step along this path may result in quite different senses at the two ends. Moreover, these big components clearly show a sub-structure suggesting a partition into smaller clusters.
The second problem encountered was that some words were given too many senses, i.e. senses that actually correspond to the same "concept".
To solve these problems, the Markov clustering algorithm (MCL) [Van Dongen, 2000] was first applied to the synonymy network.
[1] i.e. groups of words which are claimed to be synonyms.
[2] We use the word "concept" to denote a group of senses that are synonyms.
[3] i.e. the distinction between two of their senses has not been introduced.
The idea is that words with tighter neighborhoods are likely to be less ambiguous than words with fuzzier neighborhoods [Sproat and van Santen, 1998]; thus a cluster is likely to correspond to words with close meanings, and therefore represents one "concept". Then, in order to find ambiguous words, we add noise to the edges and compare the clusters obtained over several noisy realizations of the network. This provides information about the network that could not have been extracted with standard clustering algorithms.
2 Markov Clustering
In this section, MCL, the clustering algorithm we used for splitting the big components into smaller clusters, is briefly described.
Its basis is that "a random walk on a network that visits a dense cluster will likely not leave it until many of its vertices have been visited" [Van Dongen, 2000]. The idea is to favor the most probable random walks, increasing the probability of staying in the initial cluster. The algorithm works as follows: (1) consider the adjacency matrix of the network[4]; (2) normalize each column to one, in order to obtain a stochastic matrix S; (3) square the matrix S; the element (S²)ij is the probability that a random walk starting at node j ends up at node i after two steps; (4) raise every element of S² to the power r (typically r ≈ 1.5–2); this favors the most probable random walks; (5) go back to step 2 until convergence.
After several iterations we end up with a matrix that is stable under MCL. Only a few rows of the matrix have non-zero entries, and these give the cluster structure of the network. Note that the parameter r tunes the granularity of the clustering: a small r yields a few big clusters, whereas a big r yields smaller clusters. Comparing the results obtained with different values of r on some of the components, we chose r = 1.6 as a reasonable value.
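For concreteness, here is a minimal sketch of these five steps in Python with NumPy. The function name, the convergence test and the numerical thresholds are our own choices, not part of the implementation of [Van Dongen, 2000]; the sketch assumes every node has at least one edge.

```python
import numpy as np

def mcl(adjacency, r=1.6, max_iter=100, tol=1e-6):
    """Minimal Markov clustering: alternate expansion (matrix squaring)
    and inflation (element-wise r-th power), following steps (1)-(5)."""
    S = adjacency.astype(float)
    S = S / S.sum(axis=0)              # (2) make columns stochastic
    for _ in range(max_iter):
        T = S @ S                      # (3) expansion: two-step walks
        T = T ** r                     # (4) inflation: favor probable walks
        T = T / T.sum(axis=0)          # renormalize columns
        if np.abs(T - S).max() < tol:  # (5) iterate until convergence
            S = T
            break
        S = T
    # rows that keep non-zero entries define the clusters
    return [np.nonzero(row > 1e-9)[0] for row in S if (row > 1e-9).any()]
```

With r = 1.6, as chosen above, a larger inflation exponent would fragment the network into more, smaller clusters.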
As an example, the result of MCL on the component of size 185 is displayed in fig. 1. The subdivision obtained into smaller clusters is definitely more meaningful; e.g. fêtard is no longer in the same cluster as sans-abri.
MCL was applied to the biggest components of the network. We noticed that, when the size of a component falls below 40, the clustering is often not meaningful anymore, since such components do not show any particular community structure. After clustering the biggest components, the cluster size distribution shown in fig. 2 is obtained. The power law is preserved, but the size of the biggest components is much reduced.
3 Unstable Nodes
MCL partitions the network into clusters without overlap, i.e. every node is assigned to a single cluster ("hard clustering").
[4] For an undirected and unweighted network, this matrix is symmetric and composed only of zeros and ones.
Fig. 2. Left: Distribution of the size of the components in the whole network (log-log scale). Right: Distribution of the size of the clusters after MCL with r = 1.6.
However, the resulting clustering is sometimes questionable, both from a topological and a linguistic point of view, especially for nodes that "lie on the border" between two clusters (see fig. 3).
The problem of finding ambiguities is closely related to evaluating the robustness of the clustering. Some attempts to solve this problem, based on particular clustering algorithms, have been made recently [Wilkinson and Huberman, 2004, Reichardt and Bornholdt, 2004]. We present here a new method based on the introduction of stochastic noise over the edges of the network, and apply it in the framework of MCL. It consists in adding noise to the non-zero entries of the adjacency matrix[5]. Running MCL with noise several times, some nodes switch from one cluster to another (for example node "reprendre;6" in fig. 3). This procedure is now detailed.
Let pij be the probability for the edge between node i and node j to lie inside a cluster. After several runs of the clustering algorithm with noise, a weighted network is obtained in which edges with probability 1 always fall within a cluster and edges with probability close to 0 connect two different clusters. Edges with a probability smaller than a threshold θ are thus considered as "external edges". By removing those edges, one gets a disconnected network[6].
[5] In this study, the noise added to the edge weights (originally equal to 1) is uniformly distributed in [−σ, σ], 0 < σ < 1. With σ close to 0, unstable nodes are not detected, while with σ close to 1 the topology of the network changes dramatically. The results were stable for a broad range of values of σ around 0.5. For example, in the component displayed in fig. 3, the node "reprendre;6" was identified as the only unstable node for 0.35 ≤ σ ≤ 0.8.
[6] For the choice of the parameter θ, we looked at the distribution of the probabilities pij over the whole network. As expected, this distribution has a clear maximum at 1, corresponding to edges that are never cut by MCL, preceded by a region corresponding to edges almost never cut. Since for pij ≤ 0.8 the distribution is almost flat, we chose θ = 0.8.
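The noisy-clustering procedure can be sketched as follows, reusing the mcl function from the previous snippet. The helper names and the symmetric-noise construction are our own scaffolding; the uniform noise in [−σ, σ] follows footnote [5].

```python
import numpy as np

def edge_stability(adjacency, runs=100, r=1.6, sigma=0.5, seed=0):
    """Estimate p_ij as the fraction of noisy MCL runs in which
    the edge (i, j) ends up inside a single cluster."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    edges = np.argwhere(np.triu(adjacency) > 0)      # one row per edge
    inside = np.zeros(len(edges))
    for _ in range(runs):
        noise = np.triu(rng.uniform(-sigma, sigma, (n, n)), 1)
        noise = noise + noise.T                      # keep weights symmetric
        noisy = np.where(adjacency > 0, adjacency + noise, 0.0)
        labels = np.full(n, -1)
        for c, nodes in enumerate(mcl(noisy, r=r)):  # cluster the noisy copy
            labels[nodes] = c
        inside += [labels[i] == labels[j] for i, j in edges]
    return edges, inside / runs                      # estimated p_ij per edge
```

Edges whose estimated pij falls below θ = 0.8 are then removed to obtain the disconnected network of subcomponents.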
Fig. 3. Small sub-network with one unstable node ("reprendre;6"), extracted from a component of 111 nodes. The values over the dashed edges are the probabilities for the edges to be inside a cluster (averaged over 100 realizations of the clustering with r = 1.6 and σ = 0.5). Only probabilities smaller than θ = 0.8 are shown. The shape of the nodes indicates the cluster found without noise.
In this section, we use the word "cluster" for the clusters obtained without noise, and "subcomponent" for the disconnected parts of the network after the removal of the external edges. If the community structure of the network is stable under several repetitions of the clustering with noise, the subcomponents of the disconnected network correspond to the clusters obtained without noise. In the opposite case, a new community structure appears, with some similarity to the initial one.
In order to identify which subcomponents correspond to the initial clusters and which are new, we introduce a notion of similarity between two sets of nodes. If E1 (resp. E2) is the set of clusters (resp. the set of subcomponents), we use the Jaccard index to define the similarity sij between cluster C1j ∈ E1 and subcomponent C2i ∈ E2:

sij = |C2i ∩ C1j| / |C2i ∪ C1j|.

If C1j = C2i, sij = 1, and if C1j ∩ C2i = ∅, sij = 0. For every C1j ∈ E1, we find the subcomponent C2i with maximal similarity and identify it with the cluster C1j (C2i often corresponds to the stable core of the cluster C1j). If there is more than one such subcomponent, none of them is identified with the cluster. In practice, this latter case is extremely rare.
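A sketch of this matching step, representing clusters and subcomponents as Python sets of node indices (the function and variable names are ours):

```python
def match_clusters(clusters, subcomponents):
    """For each noise-free cluster, find the subcomponent with maximal
    Jaccard similarity; leave the cluster unmatched when the maximum
    is attained by several subcomponents (rare in practice)."""
    matches = {}
    for j, c1 in enumerate(clusters):
        sims = [len(c2 & c1) / len(c2 | c1) for c2 in subcomponents]
        best = max(sims)
        if best > 0 and sims.count(best) == 1:   # unique maximum
            matches[j] = sims.index(best)        # stable core of cluster j
    return matches
```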
Nodes belonging to subcomponents that have never been identified with any cluster could be defined as unstable nodes. However, this definition suffers from some drawbacks, since it sometimes happens that a big cluster splits into two subcomponents of comparable size. Considering that almost half of the nodes of the cluster are unstable is not realistic, and a new cluster should be defined instead. In practice, subcomponents of four nodes or more often correspond to a cluster not detected by the algorithm. We therefore define the unstable nodes as the nodes belonging to subcomponents that have not been identified with a cluster and whose size is smaller than 4.
Fig. 4. Zoom over the bottom-right of fig. 1. Five unstable nodes have been found (non-circle nodes). They are split as follows: after removing the edges with pij < 1 − θ (dashed lines), they fall into two groups, {cochon;-3, libertin;-2, paillard;2} (diamonds) and {débauché;1, satyre;-3} (squares). The first group has only one adjacent subcomponent; it is therefore merged into this subcomponent. The second group has two adjacent subcomponents; it is thus duplicated and merged into both.
In the framework of a synonymy network, unstable nodes, which lie on the border between two subcomponents, correspond to polysemic words which have not been clearly identified as such (i.e. one of their senses is not present in the dictionary). We thus decided to split these nodes among their adjacent subcomponents. The adjacent subcomponents are defined as the subcomponents to which the node is connected through at least one edge with a probability higher than a given threshold θ′. Typically we choose θ′ = 1 − θ, where θ is the threshold for defining an edge as external.
If several unstable nodes are connected together, we split them according to the following procedure: first group these nodes, keeping only the edges with pij > θ′; then, for each group, duplicate it and join it to its adjacent subcomponents (see fig. 4).
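A possible sketch of this grouping-and-duplication step, using the NetworkX library for the graph bookkeeping; the data layout (an edge-stability dictionary p, subcomponents as sets of nodes) is our own assumption:

```python
import networkx as nx

def split_unstable_groups(g, p, unstable, subcomponents, theta=0.8):
    """Attach each connected group of unstable nodes to every adjacent
    subcomponent, duplicating the group when there are several."""
    tprime = 1.0 - theta                              # theta' = 1 - theta
    pr = lambda u, v: p.get((u, v), p.get((v, u), 0.0))
    # group the unstable nodes, keeping only mutual edges with p_ij > theta'
    h = nx.Graph()
    h.add_nodes_from(unstable)
    h.add_edges_from((u, v) for u, v in g.edges(unstable)
                     if u in unstable and v in unstable and pr(u, v) > tprime)
    result = []
    for group in nx.connected_components(h):
        adjacent = [s for s in subcomponents
                    if any(v in s and pr(u, v) > tprime
                           for u in group for v in g.neighbors(u))]
        result.append((set(group), adjacent))  # merge a copy of group into each
    return result
```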
Finally, the second problem, where the same word appears with different senses in a single cluster (e.g. rabâcher in fig. 3), has been addressed by simply merging into a single node the nodes that correspond to the same word in the same subcomponent. Indeed, if a word appears twice in a subcomponent, its two senses are actually not different, at least not at the level of granularity used. Such a situation occurs 4,642 times in the whole network (the total number of nodes is 50,913). This number is more than four times smaller than before the MCL clustering (21,261 "duplicates").
4 Discussion and Conclusion
The objective evaluation of the work presented here is not easy: how can one evaluate whether better results are obtained when no reference to compare to is available[7]? One possible evaluation could be to compare the performances obtained in a targeted WSD experiment using the original and the corrected resource; this is however highly dependent on the targeted application. We thus rather chose to evaluate the method presented here by a subjective (i.e. human-centered) validation, achieved by sampling components in the newly obtained network. This evaluation appeared to be very convincing[8].
However, we still wanted to derive objective overall clues from the network, in order to grasp objectively the benefits of the method. For instance, we computed the clustering coefficient of the unstable nodes. The clustering coefficient C of a node is the number N3 of 3-loops passing through the node, divided by the maximal possible number of such 3-loops. When the node has degree k, C = 2N3/(k(k − 1)). If a node lies in the middle of a densely connected cluster, it quite likely has a high clustering coefficient. If the node lies between clusters, it has a small clustering coefficient[9].
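This coefficient is straightforward to compute from the adjacency matrix; a minimal helper of our own, following the formula above:

```python
import numpy as np

def clustering_coefficient(adjacency, i):
    """C = 2*N3 / (k*(k-1)): fraction of pairs of neighbours of node i
    that are themselves connected, i.e. 3-loops passing through i."""
    neigh = np.nonzero(adjacency[i])[0]
    k = len(neigh)
    if k < 2:
        return 0.0                                    # degree-0/1 convention
    n3 = adjacency[np.ix_(neigh, neigh)].sum() / 2    # edges among neighbours
    return 2.0 * n3 / (k * (k - 1))
```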
Using MCL, the assumption is made that a community structure is present in the network. Since MCL may also produce a partition of random networks without any community structure, it is important to validate this assumption. The introduction of the probability pij over the edges provides a way to do so. In the case of a random network, the community structure found by the clustering algorithm is expected to be very sensitive to the noise, whereas in the case of a network with a clear community structure, the clusters are quite stable, except for a few unstable nodes. To characterize these two situations, we introduce the clustering entropy Sc as a measure of the clustering stability:
Sc = −(1/M) Σ_(i,j) [ pij log2 pij + (1 − pij) log2(1 − pij) ],
where M is the total number of edges (i, j) in the network.
Important differences in the clustering entropy are expected between networks with a clear community structure and random networks with no community structure. If the network is totally unstable (i.e. pij = 1/2 for all edges), Sc = 1, while if the edges are perfectly stable (pij = 0 or 1), Sc = 0.
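A direct transcription of this formula; clipping pij away from 0 and 1 is a numerical convenience of ours, since exact 0s and 1s contribute nothing to the sum:

```python
import numpy as np

def clustering_entropy(p_values):
    """S_c = -(1/M) * sum over edges of p*log2(p) + (1-p)*log2(1-p),
    where p_values holds one estimated p_ij per edge (M entries)."""
    p = np.clip(np.asarray(p_values, dtype=float), 1e-12, 1 - 1e-12)
    return float(-np.mean(p * np.log2(p) + (1 - p) * np.log2(1 - p)))
```

For instance, an array of pij all equal to 0.5 gives Sc = 1, and an array of 0s and 1s gives Sc ≈ 0.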
To avoid biasing the results, we compare each component with a randomized version of the same component in which the degree of each node is conserved [Maslov and Sneppen, 2002]. Table 1 shows the comparison for several big components of the network of synonyms.
[7] Had we had one, we would not have needed to develop a method to correct it!
[8] See figs. 1, 2, 3 and 4 for illustrations.
[9] For example, twelve unstable nodes were found in the component displayed in fig. 1. The average clustering coefficient of these nodes is 0.08, while the average clustering coefficient over the whole component is 0.42.
Component Size   Sc (original)   Sc (random)
912              0.25±0.01       0.55±0.01
185              0.19±0.01       0.62±0.02
155              0.27±0.01       0.55±0.03
111              0.21±0.01       0.69±0.02
61               0.20±0.01       0.68±0.04
60               0.19±0.01       0.76±0.04
54               0.21±0.01       0.60±0.07
51               0.21±0.01       0.69±0.05
average          0.21            0.64

Table 1. Comparison for several components. Sc (original) is the clustering entropy of the original components. Sc (random) is the average clustering entropy over 50 randomized versions of the component. We used r = 1.6 and σ = 0.5.
The clustering entropy of the randomly rewired components is at least twice as large as the entropy of the original components. This experimentally shows that the clusters obtained with MCL are not an artifact of the method, but correspond to a real community structure in the network.
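Such a degree-conserving randomization can be sketched with NetworkX's double-edge-swap routine; the number of swaps per edge is our own choice, and [Maslov and Sneppen, 2002] describe the rewiring principle rather than this code:

```python
import networkx as nx

def randomized_copy(g, swaps_per_edge=10, seed=0):
    """Degree-preserving null model: repeatedly swap pairs of edges
    (a-b, c-d) -> (a-d, c-b), which leaves every node's degree intact."""
    r = g.copy()
    m = r.number_of_edges()
    nx.double_edge_swap(r, nswap=swaps_per_edge * m,
                        max_tries=100 * swaps_per_edge * m, seed=seed)
    return r
```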
When the MCL clustering algorithm is applied, the network of synonyms splits into sensible clusters, significantly improving, at least subjectively, the quality of the dictionary. Most of the clusters can be interpreted as groups of synonyms and, at a coarse-grained level of representation, correspond to a general concept of the language. The method introduced to identify nodes which lie between clusters and to check the robustness of the clustering proved fruitful for splitting polysemic nodes. We emphasize that this method does not depend on a particular clustering algorithm and can be applied to any complex network.
References
[Besançon et al., 2001] R. Besançon, J.-C. Chappelier, M. Rajman, and A. Rozenknop. Improving text representations through probabilistic integration of synonymy relations. In Proc. of ASMDA'2001, pages 200–205, 2001.
[Maslov and Sneppen, 2002] S. Maslov and K. Sneppen. Specificity and stability in topology of protein networks. Science, 296:910–913, 2002.
[Reichardt and Bornholdt, 2004] J. Reichardt and S. Bornholdt. Detecting fuzzy community structures in complex networks with a Potts model. Phys. Rev. Lett., 93(218701), 2004.
[Schütze, 1998] H. Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123, 1998.
[Sproat and van Santen, 1998] R. Sproat and J. van Santen. Automatic ambiguity detection. In Proc. of ICSLP'98, 1998.
[Van Dongen, 2000] S. Van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, 2000.
[Wilkinson and Huberman, 2004] D.M. Wilkinson and B.A. Huberman. A method for finding communities of related genes. PNAS, 101:5241–5248, 2004.