Synonyms

Detecting synonyms in social tagging systems
Maarten Clements
Arjen P. de Vries & Marcel J.T. Reinders
Synonymy Detection
If a term has several synonyms, people tend to prefer one of the words to describe their content. Therefore, synonyms are frequently applied to the same content, but by distinct groups of users.
The user similarity and item similarity between the query tag (tq) and the potential synonym (ts) are derived from the aggregated annotation data in the UT and IT matrices.
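The aggregation step can be illustrated with a minimal sketch. It assumes the annotation data D is available as (user, item, tag) index triples and that UT and IT simply count annotations; the poster does not state whether counts or binary indicators are used, and the function name `aggregate` is hypothetical.

```python
import numpy as np

def aggregate(triples, n_users, n_items, n_tags):
    """Collapse (user, item, tag) annotations into two count matrices.

    UT[k, m] counts how often user k applied tag m (summed over items).
    IT[l, m] counts how often item l received tag m (summed over users).
    Counting is an assumption; binary indicators would also be possible.
    """
    UT = np.zeros((n_users, n_tags))
    IT = np.zeros((n_items, n_tags))
    for u, i, t in triples:
        UT[u, t] += 1
        IT[i, t] += 1
    return UT, IT

# Toy data: user 0 tags items 0 and 1 with tag 0,
# user 1 tags the same items with tag 1, user 2 tags item 2 with tag 2.
triples = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1), (2, 2, 2)]
UT, IT = aggregate(triples, n_users=3, n_items=3, n_tags=3)
```

In this toy example tags 0 and 1 label the same items but come from different users, which is the pattern the method looks for.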
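[Figure: the annotation data D(uk, il, tm) is aggregated into the user-tag matrix UT(uk, tm) and the item-tag matrix IT(il, tm). Example: the query tq = ‘Humour’ (British people) and the synonym ts = ‘Humor’ (American people) are applied to the same relevant content by different user groups, giving SU(tq, ts) < 0 and SI(tq, ts) ≥ 0.5.]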
Most query expansion methods enrich the initial query with words that frequently co-occur in the same content. The dissimilarity between the user groups of the synonyms could be used to improve the selection of true synonyms for query expansion.
The Pearson correlation is used as the similarity measure between two tags. For example, the item similarity is computed as:

\[
S_I(t_q, t_s) = \rho(IT_q, IT_s) = \frac{\sum_{l=1}^{L} (IT_{l,q} - \mu_{IT_q})(IT_{l,s} - \mu_{IT_s})}{\sigma_{IT_q}\,\sigma_{IT_s}} \tag{1}
\]

where IT_q denotes the column of the IT matrix belonging to tag t_q.
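The Pearson similarity between two tag columns can be sketched as follows. This is an illustration rather than the authors' implementation: the function name `tag_similarity` is hypothetical, and the sketch uses the standard Pearson coefficient, normalizing by the root of the summed squared deviations so that the result lies in [-1, 1].

```python
import numpy as np

def tag_similarity(M, q, s):
    """Pearson correlation between columns q and s of a matrix M.

    M can be the item-tag matrix IT (giving SI) or the
    user-tag matrix UT (giving SU).
    """
    x = M[:, q].astype(float)
    y = M[:, s].astype(float)
    xc = x - x.mean()  # deviations from the column mean
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0:
        return 0.0  # a constant column carries no correlation signal
    return float((xc * yc).sum() / denom)

# Toy item-tag matrix: tags 0 and 1 label the same items, tag 2 the others.
IT = np.array([[1, 1, 0],
               [1, 1, 0],
               [0, 0, 1],
               [0, 0, 1]])
si_same = tag_similarity(IT, 0, 1)   # identical columns
si_other = tag_similarity(IT, 0, 2)  # complementary columns
```

Applying the same function to UT instead of IT gives the user similarity SU.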
Results
The similarities of all tags to the queries ‘Humour’ and ‘Classic’ are shown in scatter plots. The synonyms (green labels) clearly show the expected high item correlation and low user correlation. A ranking based on item similarity alone would return many false positives.
The tables list the terms that are selected when SI ≥ 0.5 and SU < 0 are used as thresholds. The tables also give the total number of items annotated with each tag (‘Items’) and the number of those items that were not annotated with the original query (‘New’).
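The selection rule and the ‘Items’ and ‘New’ columns can be sketched as below. The thresholds SI ≥ 0.5 and SU < 0 come from the text; the function names, the toy matrices, and the use of count matrices are assumptions made for illustration only.

```python
import numpy as np

def pearson_columns(M, q):
    """Pearson correlation of column q of M with every column of M."""
    C = M - M.mean(axis=0)                  # center each column
    norms = np.sqrt((C ** 2).sum(axis=0))
    norms[norms == 0] = np.inf              # constant columns get similarity 0
    return (C.T @ C[:, q]) / (norms * norms[q])

def select_synonyms(UT, IT, q, si_min=0.5, su_max=0.0):
    """Return (tag, items, new) for tags with SI >= si_min and SU < su_max."""
    SI = pearson_columns(IT, q)
    SU = pearson_columns(UT, q)
    has_q = IT[:, q] > 0
    rows = []
    for s in range(IT.shape[1]):
        if s != q and SI[s] >= si_min and SU[s] < su_max:
            has_s = IT[:, s] > 0
            items = int(has_s.sum())             # items annotated with tag s
            new = int((has_s & ~has_q).sum())    # of those, items lacking the query tag
            rows.append((s, items, new))
    return rows

# Toy data with tags 0 (query), 1 (candidate synonym), 2 (unrelated tag).
IT = np.array([[1, 1, 0],
               [1, 1, 0],
               [0, 1, 1],
               [0, 0, 1]], dtype=float)
UT = np.array([[2, 0, 0],
               [0, 2, 0],
               [0, 1, 1],
               [0, 0, 1]], dtype=float)
selected = select_synonyms(UT, IT, q=0)
```

In the toy data only tag 1 passes both thresholds: it shares most items with the query tag but is applied by different users, and it contributes one item the query tag does not reach.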
[Figure: scatter plots of item similarity SI(tq, ts) against user similarity SU(tq, ts) for all tags, one panel for the query ‘Humour’ and one for the query ‘Classic’.]

Humour (Items: 2527)

Proposed synonym           SI       SU        Items   New
humor                      0.7829   -0.0356   4323    2511
funny                      0.5738   -0.0091   1209    510
humorous                   0.6144   -0.0065   364     132
british humor              0.5614   -0.0057   99      9
discworld series           0.6031   -0.0035   36      1

Classic (Items: 2872)

Proposed synonym           SI       SU        Items   New
classics                   0.9407   -0.043    2094    824
classic literature         0.8742   -0.0164   494     72
19th century literature    0.5811   -0.0112   162     24
classic lit                0.6584   -0.0089   132     18
bbc big read               0.5162   -0.0066   53      6
assigned                   0.5253   -0.0049   70      11
classic fiction            0.8288   -0.0048   430     60

Maarten Clements, M.Sc.
ICT group, Faculty of EEMCS, Delft University of Technology (TUD)
Tel: +31 (0)15 2781845 / Mail: [email protected]