Detecting synonyms in social tagging systems

Maarten Clements, Arjen P. de Vries & Marcel J.T. Reinders

Synonymy Detection

If a term has several synonyms, people tend to prefer one of the words to describe their content. Synonyms are therefore frequently applied to the same content by distinct groups of users. The user similarity and the item similarity between the query tag (tq) and a potential synonym (ts) are derived from the aggregated annotation data in the UT and IT matrices.

[Figure: the annotation data D(uk, il, tm) is aggregated into the user-tag matrix UT(uk, tm) and the item-tag matrix IT(il, tm). Example: the query tq = 'Humour' (British people) and the synonym ts = 'Humor' (American people) are applied to the same relevant content, giving SI(tq, ts) ≥ 0.5 and SU(tq, ts) < 0.]

The Pearson correlation is used as a similarity measure between two tags. For example, the item similarity is computed by:

\[
S_I(t_q, t_s) = \rho(IT_q, IT_s) = \frac{\sum_{l=1}^{L} (IT_{l,q} - \mu_{IT_q})(IT_{l,s} - \mu_{IT_s})}{\sigma_{IT_q}\,\sigma_{IT_s}} \tag{1}
\]

Here ITq and ITs are the columns of IT for the tags tq and ts. The user similarity SU(tq, ts) is obtained in the same way from the columns of UT.

Most query expansion methods enrich the initial query with words that frequently co-occur in the same content. The dissimilarity between the user groups of the synonyms could be used to improve the selection of true synonyms for query expansion.

Results

The similarities of all tags to the queries 'Humour' and 'Classic' are shown in scatter plots. The synonyms (green labels) clearly show the expected high item correlation and low user correlation. A ranking based on item similarity alone would return many false positives. The tables list the terms that are selected when SI ≥ 0.5 and SU < 0 are used as thresholds. The tables also give the total number of items annotated with a tag ('Items') and the number of those items that were not annotated with the original query ('New').
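The selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pearson`, `column`, `propose_synonyms`), the toy matrices and the handling of constant columns are all hypothetical. UT and IT are assumed to be lists of rows (users or items) with one count per tag, and each deviation norm is taken as the square root of the sum of squared deviations, so the result is the standard Pearson correlation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / (norm_x * norm_y)


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def propose_synonyms(UT, IT, tags, query, si_min=0.5, su_max=0.0):
    """Select tags with item similarity >= si_min and user similarity < su_max."""
    q = tags.index(query)
    selected = []
    for s, tag in enumerate(tags):
        if s == q:
            continue
        s_i = pearson(column(IT, q), column(IT, s))  # item similarity SI
        s_u = pearson(column(UT, q), column(UT, s))  # user similarity SU
        if s_i >= si_min and s_u < su_max:
            selected.append((tag, s_i, s_u))
    # Rank the surviving candidates by item similarity, highest first.
    return sorted(selected, key=lambda r: r[1], reverse=True)


# Hypothetical toy data: rows are users (UT) or items (IT), columns are tags.
tags = ["humour", "humor", "funny"]
UT = [[3, 0, 2],
      [2, 0, 2],
      [0, 3, 1],
      [0, 2, 1]]
IT = [[2, 1, 1],
      [1, 2, 1],
      [0, 0, 1],
      [0, 0, 0]]

proposed = propose_synonyms(UT, IT, tags, "humour")
print([tag for tag, _, _ in proposed])
```

On this toy data 'funny' passes the item threshold but is rejected because its user correlation with 'humour' is positive, so only 'humor' is selected. This mirrors the point that item similarity alone would return false positives.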
[Figure: scatter plots for the queries 'Humour' and 'Classic'. Horizontal axis: SU(tq, ts); vertical axis: SI(tq, ts). Proposed synonyms are labelled in green.]

Humour (Items: 2527)

Proposed synonym          SI       SU        Items   New
humor                     0.7829   -0.0356   4323    2511
funny                     0.5738   -0.0091   1209    510
humorous                  0.6144   -0.0065   364     132
british humor             0.5614   -0.0057   99      9
discworld series          0.6031   -0.0035   36      1

Classic (Items: 2872)

Proposed synonym          SI       SU        Items   New
classics                  0.9407   -0.043    2094    824
classic literature        0.8742   -0.0164   494     72
19th century literature   0.5811   -0.0112   162     24
classic lit               0.6584   -0.0089   132     18
bbc big read              0.5162   -0.0066   53      6
assigned                  0.5253   -0.0049   70      11
classic fiction           0.8288   -0.0048   430     60

Maarten Clements, M.Sc.
ICT Group, Faculty of EEMCS, Delft University of Technology (TUD)
Tel: +31 (0)15 2781845