Automatic extraction of synonyms in a dictionary

Vincent D. Blondel
Pierre P. Senellart
April 13th 2002
The dictionary graph
Computation (n.) The act or process of computing; calculation; reckoning.
Computation (n.) The result of computation; the amount computed.
Computed (imp. & p. p.) of Compute
Computing (p. pr. & vb. n.) of Compute
Compute (v.t.) To determine by calculation; to reckon; to count.
Compute (n.) Computation.
Computer (n.) One who computes.
[Figure: fragment of the dictionary graph showing the vertices Computed, Computation, Compute, Computer and Computing, connected to the rest of the graph.]
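The slide does not spell out how arcs are drawn, so the following is only a minimal sketch under an assumed rule: there is an arc from a headword to every headword that appears in one of its definitions. The function name build_dictionary_graph and the naive tokenization are illustrative choices, not the authors' extraction procedure.

```python
import re

def build_dictionary_graph(entries):
    """Map each headword to the set of headwords used in its definitions.

    entries is a list of (headword, definition) pairs.
    Assumption: an arc u -> v exists when v occurs in a definition of u.
    """
    headwords = {head.lower() for head, _ in entries}
    graph = {head: set() for head in headwords}
    for head, definition in entries:
        # Naive tokenization: lowercase alphabetic runs only.
        for token in re.findall(r"[a-z]+", definition.lower()):
            if token in headwords:
                graph[head.lower()].add(token)
    return graph

entries = [
    ("Computation", "The act or process of computing; calculation; reckoning."),
    ("Computation", "The result of computation; the amount computed."),
    ("Computed", "of Compute"),
    ("Computing", "of Compute"),
    ("Compute", "To determine by calculation; to reckon; to count."),
    ("Compute", "Computation."),
    ("Computer", "One who computes."),
]
graph = build_dictionary_graph(entries)
for word in sorted(graph):
    print(word, "->", sorted(graph[word]))
```

With this naive matching, "computes" is not linked to "compute", which illustrates why derived forms need special handling when the graph is extracted.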
Looking for near-synonyms
Definition. The neighborhood graph of a node i in a directed graph G is the
subgraph consisting of i, all parents of i and all children of i.
i is some word we want a synonym of.
A will be the adjacency matrix of the neighborhood graph of i in the dictionary
graph.
n is the order of A.
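As an illustration of the definition, here is a small sketch that builds A for a given word. It assumes the neighborhood graph is the induced subgraph (all arcs between the selected vertices are kept) and that the graph is stored as a dict from vertex to set of children. The toy arcs are hypothetical and not taken from the dictionary.

```python
def neighborhood_graph(graph, i):
    """Return (vertices, A) for the neighborhood graph of vertex i.

    graph maps each vertex to the set of its children.
    Assumption: the subgraph is induced, so every arc between
    selected vertices is kept.
    """
    children = set(graph.get(i, ()))
    parents = {u for u, targets in graph.items() if i in targets}
    vertices = sorted({i} | parents | children)
    index = {v: k for k, v in enumerate(vertices)}
    n = len(vertices)
    A = [[0] * n for _ in range(n)]
    for u in vertices:
        for v in graph.get(u, ()):
            if v in index:
                A[index[u]][index[v]] = 1
    return vertices, A

# Hypothetical toy graph, for illustration only.
toy = {
    "likely": {"probable"},
    "probable": {"likely", "belief"},
    "probably": {"likely"},
    "belief": set(),
    "sail": {"belief"},
}
vertices, A = neighborhood_graph(toy, "likely")
print(vertices)
print(A)
```

Here n, the order of A, is simply len(vertices).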
[Figure: subgraph of the neighborhood graph of likely, with the vertices invidious, likely, adapted, giving, truthy, probable, verisimilar, belief and probably.]
Hubs and Authorities on the Web
The Web as a graph:
• vertices=web pages
• edges=links
Hub −→ Authority
A mutually reinforcing relationship: good hubs are pages that point to good
authorities, and good authorities are pages pointed to by good hubs.
Kleinberg’s algorithm
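$x^1_i$: iteratively computed hub weights
$x^2_i$: iteratively computed authority weights
\[
x^1_0 = x^2_0 = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}, \qquad
\begin{pmatrix} x^1 \\ x^2 \end{pmatrix}_{t+1}
= \begin{pmatrix} 0 & A \\ A^T & 0 \end{pmatrix}
\begin{pmatrix} x^1 \\ x^2 \end{pmatrix}_{t}, \qquad t = 0, 1, \ldots
\]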
The principal eigenvectors of $A^T A$ and $A A^T$ give respectively the authority
weights and hub weights of the vertices of the graph.
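The iteration can be sketched in a few lines of Python. This is an illustration rather than a reference implementation: the Euclidean normalization at every step and the fixed iteration count are assumptions added to keep the numbers bounded, and the four-vertex graph is a made-up example.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def hubs_and_authorities(A, iterations=50):
    """Iterate hub = A * authority and authority = A^T * hub."""
    n = len(A)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        new_hubs = [sum(A[i][j] * auths[j] for j in range(n)) for i in range(n)]
        new_auths = [sum(A[i][j] * hubs[i] for i in range(n)) for j in range(n)]
        hubs, auths = normalize(new_hubs), normalize(new_auths)
    return hubs, auths

# Made-up graph: 0 -> 2, 0 -> 3, 1 -> 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
hubs, auths = hubs_and_authorities(A)
print("hubs:", hubs)
print("authorities:", auths)
```

In this example vertex 0 gets the largest hub weight and vertex 2 the largest authority weight, as expected from the arcs.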
An extension of Kleinberg’s algorithm
Let $M$ ($m \times m$) and $N$ ($n \times n$) be the transition matrices of two directed graphs.
Let $C = M \otimes N + M^T \otimes N^T$, where $\otimes$ is the Kronecker (tensor) product.
We assume that the greatest eigenvalue of $C$ is strictly greater than the absolute
value of all other eigenvalues.
Then the normalized principal eigenvector $X$ of $C$ gives the "similarity" between
a vertex of $M$ and a vertex of $N$: $X_{i \times n + j}$ characterizes the similarity between vertex
$i$ of $M$ and vertex $j$ of $N$.
In particular, if $M = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$, the result is that of Kleinberg's algorithm.
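To make the special case concrete, the sketch below builds C with a hand-written Kronecker product and checks that, for the two-vertex graph 1 → 2, it is exactly the block matrix with A and its transpose off the diagonal, as used in Kleinberg's iteration. The helper names (kron, transpose, similarity_operator) and the 3 × 3 matrix A are illustrative assumptions.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def kron(M, N):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [M[i][k] * N[j][l] for k in range(len(M[0])) for l in range(len(N[0]))]
        for i in range(len(M))
        for j in range(len(N))
    ]

def similarity_operator(M, N):
    """C = M (x) N + M^T (x) N^T."""
    P = kron(M, N)
    Q = kron(transpose(M), transpose(N))
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]

M = [[0, 1], [0, 0]]                    # the graph 1 -> 2
A = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]   # made-up second graph
C = similarity_operator(M, A)

# Expected block form: A in the top-right block, A^T in the bottom-left.
zero = [0, 0, 0]
expected = [zero + row for row in A] + [row + zero for row in transpose(A)]
print(C == expected)
```

With zero-based indices, row i * n + j of C corresponds to the pair (vertex i of M, vertex j of N), matching the indexing of X.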
Application to the search for synonyms
1 −→ 2 −→ 3
We are looking for vertices “like 2” in the neighborhood graph of i.


Let $C = M \otimes A + M^T \otimes A^T$, where $M = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}$.
The principal eigenvector of C gives the similarity between a node in G and a
node in the graph 1 −→ 2 −→ 3.
We just select the subvector corresponding to the vertex 2 in order to have
synonymy weights.
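A possible sketch of this computation follows. Instead of forming C explicitly, it applies C blockwise to three score vectors, one per vertex of the path 1 → 2 → 3. Two choices here are assumptions and not stated on the slide: the vector is renormalized at every step, and an even number of steps is used so that the iterates do not alternate between two directions. The four-vertex graph is a made-up example.

```python
import math

def central_scores(A, iterations=100):
    """Similarity of each vertex of A to vertex 2 of the path 1 -> 2 -> 3."""
    if iterations % 2:
        iterations += 1  # assumption: keep an even number of steps
    n = len(A)
    x1, x2, x3 = [1.0] * n, [1.0] * n, [1.0] * n
    for _ in range(iterations):
        # One multiplication by C = M (x) A + M^T (x) A^T, block by block:
        # y1 = A x2,  y2 = A x3 + A^T x1,  y3 = A^T x2.
        y1 = [sum(A[j][l] * x2[l] for l in range(n)) for j in range(n)]
        y2 = [sum(A[j][l] * x3[l] + A[l][j] * x1[l] for l in range(n))
              for j in range(n)]
        y3 = [sum(A[l][j] * x2[l] for l in range(n)) for j in range(n)]
        norm = math.sqrt(sum(v * v for v in y1 + y2 + y3)) or 1.0
        x1 = [v / norm for v in y1]
        x2 = [v / norm for v in y2]
        x3 = [v / norm for v in y3]
    return x2  # subvector attached to vertex 2 of the path

# Made-up graph: a -> b -> c and a -> d -> c (a=0, b=1, c=2, d=3).
A = [
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
scores = central_scores(A)
print(scores)
```

Vertices b and d both sit in the middle of a path of length two, so they receive the same score, and it is higher than the scores of a and c.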
The vectors method
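For each $1 \le j \le n$, $j \ne i$, compute:
\[
\| A_{i,\cdot} - A_{j,\cdot} \| + \| (A_{\cdot,i} - A_{\cdot,j})^T \|
\]
(where $\|\cdot\|$ is some vector norm, and $A_{i,\cdot}$ and $A_{\cdot,i}$ are respectively the $i$th row and the
$i$th column of $A$).
For instance, if we choose the Euclidean norm, we compute:
\[
\left( \sum_{k=1}^{n} (A_{i,k} - A_{j,k})^2 \right)^{1/2}
+ \left( \sum_{k=1}^{n} (A_{k,i} - A_{k,j})^2 \right)^{1/2}
\]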
The lower this value is, the better j is a synonym of i.
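With the Euclidean norm, the computation can be sketched as follows. The function names and the small matrix are illustrative assumptions. Indices are zero-based here, whereas the formula uses 1 to n.

```python
import math

def vector_distance(A, i, j):
    """Row distance plus column distance between vertices i and j."""
    n = len(A)
    rows = math.sqrt(sum((A[i][k] - A[j][k]) ** 2 for k in range(n)))
    cols = math.sqrt(sum((A[k][i] - A[k][j]) ** 2 for k in range(n)))
    return rows + cols

def rank_candidates(A, i):
    """Other vertices sorted from most to least similar to i."""
    others = [j for j in range(len(A)) if j != i]
    return sorted(others, key=lambda j: vector_distance(A, i, j))

# Made-up adjacency matrix: vertices 0 and 1 have the same children.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
print(rank_candidates(A, 0))
print(vector_distance(A, 0, 1))
```

Vertex 1 comes first because its row equals that of vertex 0 and its column differs in a single entry.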
ArcRank
PageRank (Google): stationary distribution of weights over vertices corresponding to the principal eigenvector of the adjacency matrix.
ArcRank:
\[
r_{s,t} = \frac{p_s / |a_s|}{p_t}
\]
$|a_s|$ is the outdegree of $s$.
$p_t$ is the PageRank of $t$.
The best synonyms of i are the other endpoints of the best-ranked arcs arriving
at or leaving i.
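The formula for r can be sketched as below, assuming the PageRank values are already available. The function names, the tiny graph and the PageRank values are all hypothetical. How the values of r are ordered into a final ranking is not specified on the slide, so the sketch stops at computing them.

```python
def arc_ranks(graph, pagerank):
    """Compute r[s, t] = (p_s / outdegree(s)) / p_t for every arc s -> t."""
    ranks = {}
    for s, targets in graph.items():
        for t in targets:
            ranks[(s, t)] = (pagerank[s] / len(targets)) / pagerank[t]
    return ranks

def incident_arcs(ranks, i):
    """Arcs arriving at or leaving vertex i, with their r values."""
    return {arc: r for arc, r in ranks.items() if i in arc}

# Hypothetical graph and PageRank values, for illustration only.
graph = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
pagerank = {"a": 0.2, "b": 0.3, "c": 0.5}

ranks = arc_ranks(graph, pagerank)
print(incident_arcs(ranks, "b"))
```

The candidate synonyms of a word are then the other endpoints of the arcs returned by incident_arcs.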
Extraction of the graph
• Multiwords (e.g. All Saints’, Surinam toad)
• Prefixes and suffixes (e.g. un-, -ous)
• Different meanings of a word
• Derived forms (e.g. daisies, sought)
• Accented characters (e.g. proven/al, cr/che)
• Misspelled words
112,169 vertices, 1,398,424 arcs.
Lexical units
13,396 lexical units not defined in the dictionary:
• Numbers (e.g. 14159265, 14th)
• Mathematical and chemical symbols (e.g. x3, fe3o4)
• Proper nouns (e.g. California, Aaron)
• Misspelled words (e.g. aligator, abudance)
• Undefined words (e.g. snakelike, unwound)
• Abbreviations (e.g. adj, etc)
Too frequent words
Word  | Count
of    | 68187
a     | 47500
the   | 43760
or    | 41496
to    | 31957
in    | 23999
as    | 22529
and   | 16781
an    | 14027
by    | 12468
one   | 12216
with  | 10944
which | 10446
Parts of speech
305 different categories transformed into combinations of:
• noun
• adjective
• adverb
• verb
• other (articles, conjunctions, interjections, ...)
Disappear
         | Vectors | Kleinberg | ArcRank      | Wordnet   | Microsoft Word
1        | vanish  | vanish    | epidemic     | vanish    | vanish
2        | wear    | pass      | disappearing | go away   | cease to exist
3        | die     | die       | port         | end       | fade away
4        | sail    | wear      | dissipate    | finish    | die out
5        | faint   | faint     | cease        | terminate | go
6        | light   | fade      | eat          | cease     | evaporate
7        | port    | sail      | gradually    |           | wane
8        | absorb  | light     | instrumental |           | expire
9        | appear  | dissipate | darkness     |           | withdraw
10       | cease   | cease     | efface       |           | pass away
Mark     | 3.6     | 6.3       | 1.2          | 7.5       | 8.6
Std dev. | 1.8     | 1.7       | 1.2          | 1.4       | 1.3
Table 1: Near-synonyms for disappear
Parallelogram
         | Vectors        | Kleinberg     | ArcRank        | Wordnet       | Microsoft Word
1        | square         | square        | quadrilateral  | quadrilateral | diamond
2        | parallel       | rhomb         | gnomon         | quadrangle    | lozenge
3        | rhomb          | parallel      | right-lined    | tetragon      | rhomb
4        | prism          | figure        | rectangle      |               |
5        | figure         | prism         | consequently   |               |
6        | equal          | equal         | parallelopiped |               |
7        | quadrilateral  | opposite      | parallel       |               |
8        | opposite       | angles        | cylinder       |               |
9        | altitude       | quadrilateral | popular        |               |
10       | parallelopiped | rectangle     | prism          |               |
Mark     | 4.6            | 4.8           | 3.3            | 6.3           | 5.3
Std dev. | 2.7            | 2.5           | 2.2            | 2.5           | 2.6
Table 2: Near-synonyms for parallelogram
Sugar
         | Vectors     | Kleinberg | ArcRank     | Wordnet          | Microsoft Word
1        | juice       | cane      | granulation | sweetening       | darling
2        | starch      | starch    | shrub       | sweetener        | baby
3        | cane        | sucrose   | sucrose     | carbohydrate     | honey
4        | milk        | milk      | preserve    | saccharide       | dear
5        | molasses    | sweet     | honeyed     | organic compound | love
6        | sucrose     | dextrose  | property    | saccharify       | dearest
7        | wax         | molasses  | sorghum     | sweeten          | beloved
8        | root        | juice     | grocer      | dulcify          | precious
9        | crystalline | glucose   | acetate     | edulcorate       | pet
10       | confection  | lactose   | saccharine  | dulcorate        | babe
Mark     | 3.9         | 6.3       | 4.3         | 6.2              | 4.7
Std dev. | 2.0         | 2.4       | 2.3         | 2.9              | 2.7
Table 3: Near-synonyms for sugar
Science
         | Vectors     | Kleinberg | ArcRank       | Wordnet          | Microsoft Word
1        | art         | art       | formulate     | knowledge domain | discipline
2        | branch      | branch    | arithmetic    | knowledge base   | knowledge
3        | nature      | law       | systematize   | discipline       | skill
4        | law         | study     | scientific    | subject          | art
5        | knowledge   | practice  | knowledge     | subject area     |
6        | principle   | natural   | geometry      | subject field    |
7        | life        | knowledge | philosophical | field            |
8        | natural     | learning  | learning      | field of study   |
9        | electricity | theory    | expertness    | ability          |
10       | biology     | principle | mathematics   | power            |
Mark     | 3.6         | 4.4       | 3.2           | 7.1              | 6.5
Std dev. | 2.0         | 2.5       | 2.9           | 2.6              | 2.4
Table 4: Near-synonyms for science
Perspectives
• Extension of the subgraph
• Other dictionaries, other languages
• Other applications of the extension of Kleinberg’s algorithm