Towards the Automatic Creation of a Wordnet from a Term-based Lexical Network

Hugo Gonçalo Oliveira, Paulo Gomes
(hroliv,pgomes)@dei.uc.pt
Cognitive & Media Systems Group, CISUC, University of Coimbra
Uppsala, July 15, 2010

Gonçalo Oliveira & Gomes (CISUC), TextGraphs-5, Uppsala, July 15, 2010

Outline
1 Introduction
    Lexical ontologies
    Information extraction
    Issues
    Research Goals
2 Approach
    Clustering for synsets
    Merging thesauri
    Assigning terms to synsets
3 Experimentation
    Preparation
    Wordnet establishment
4 Concluding remarks

Introduction: Lexical ontologies

Lexical ontologies, such as Princeton WordNet [Fellbaum 1998]
▶ Ontology + lexicon [Hirst 2004]
▶ Knowledge structured on words and their meanings
▶ Cover the whole language
▶ Not based on a specific domain
Construction and maintenance involve time-consuming human effort!

Introduction: Information extraction

Information extraction from text
From dictionaries:
1 basketball, noun – a game, also known as hoops, played indoors...
    → game HYPERNYM OF basketball
    → basketball SYNONYM OF hoops
2 basketball, noun – the ball used in playing basketball.
    → ball HYPERNYM OF basketball
From textual corpora:
▶ ... team sports, such as basketball, rugby ...
    → team sport HYPERNYM OF basketball
    → team sport HYPERNYM OF rugby

Introduction: Issues

Natural language is ambiguous
Term-based networks are impractical for many applications
In the previous example: is hoops a team sport?
An example extracted from a Portuguese dictionary:
    ruína SYNONYM OF queda ∧ queda SYNONYM OF habilidade → habilidade SYNONYM OF ruína ??
queda can either mean aptitude or downfall!
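
The dictionary example can be reproduced in a few lines. This is only an illustrative sketch: the two synonymy pairs are the ones quoted above, while the helper name `synonym_closure` is hypothetical and not part of the presented system. Treating SYNONYM OF as transitive over plain terms places ruína and habilidade in the same group, which is the unwanted inference.

```python
from collections import defaultdict

def synonym_closure(pairs):
    """Group terms by the naive transitive closure of SYNONYM OF
    (connected components of the term graph)."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            term = stack.pop()
            if term in group:
                continue
            group.add(term)
            stack.extend(graph[term] - group)
        seen |= group
        groups.append(group)
    return groups

# The two term-based synonymy instances from the Portuguese dictionary example.
pairs = [("ruína", "queda"), ("queda", "habilidade")]
groups = synonym_closure(pairs)
print(groups)  # a single group that mixes both senses of "queda"
```

A sense-aware structure would instead keep two synsets that both contain queda, which is what the approach in the following slides aims for.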

Introduction: Research Goals

Onto.PT: automatic construction of a lexical ontology for Portuguese
Extracted from different sources
▶ Manually created thesauri
▶ Language dictionaries/encyclopedias
▶ Corpora
Modelled after Princeton WordNet
▶ Synsets: groups of synonymous words
▶ Synset-based relational triples
WSD (word sense disambiguation) based on the knowledge already extracted, not on the context.

Approach: Information flow

[Figure: information flow]

Approach: Clustering for synsets

Synonymy networks tend to have a clustered structure

Synset discovery (inspired by [Gfeller et al. 2005])
1 Split the original network into sub-networks and calculate the frequency-weighted adjacency matrix F of each sub-network;
2 Fij = Fij + Fij ∗ δ, −0.5 < δ < 0.5;
3 Run MCL [van Dongen 2000], with γ = 1.6, over F 30 times;
4 Use the (hard) clustering from each run to create P, a matrix with the probabilities of each pair of words in F belonging to the same cluster;
5 Remove:
    (a) big clusters, B, if there is a group of clusters C = C1, C2, ..., Cn such that B = C1 ∪ C2 ∪ ... ∪ Cn;
    (b) clusters completely included in other clusters.
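
Steps 2 to 4 can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that the slides do not state: δ is drawn uniformly at random for every matrix entry, self-loops are added before clustering (a common MCL convention), MCL is run for a fixed number of iterations instead of testing convergence, and two nodes are put in the same cluster when they share flow in the final matrix. The function names are hypothetical. Step 5 and the way P is turned into final synsets are not covered.

```python
import random

def normalize_columns(m):
    """Rescale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl_labels(weights, inflation=1.6, iterations=60):
    """Minimal Markov clustering; returns one cluster label per node."""
    n = len(weights)
    # Assumption: add self-loops so that no column is empty.
    m = normalize_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )
    for _ in range(iterations):
        # Expansion: square the matrix.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
        # Inflation: elementwise power, then column renormalisation.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Nodes that still share flow are merged into the same cluster.
    labels = list(range(n))
    for i in range(n):
        for j in range(n):
            if m[i][j] > 1e-6 and labels[i] != labels[j]:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels

def co_clustering_probabilities(f, runs=30, inflation=1.6, seed=0):
    """Build P: the fraction of noisy MCL runs in which each pair is co-clustered."""
    rng = random.Random(seed)
    n = len(f)
    counts = [[0] * n for _ in range(n)]
    for _ in range(runs):
        # Step 2: Fij = Fij + Fij * delta, with delta in (-0.5, 0.5).
        noisy = [[w + w * rng.uniform(-0.5, 0.5) for w in row] for row in f]
        labels = mcl_labels(noisy, inflation)  # step 3
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / runs for c in row] for row in counts]  # step 4

# Toy network (made up for illustration): two disconnected triangles.
f = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
p = co_clustering_probabilities(f)
print(p[0][1], p[0][3])
```

Words from different components can never be co-clustered, so the corresponding entries of P stay at zero; entries inside a component reflect how stable the grouping is under noise.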

Approach: Merging thesauri

Merging synsets from different thesauri
For each synset Ti ∈ T, select the Bj ∈ B with the highest c = |Ti ∩ Bj| / |Ti ∪ Bj| (Jaccard coefficient)
B1 = (diva, beldade, beleza, deidade, deusa, divindade)
B2 = (divindade, deidade, deus, nume)
T1 = (divindade, diva, deusa)
▶ c(T1, B1) = 3/6 = 1/2
▶ c(T1, B2) = 1/6
N = B1 ∪ T1 = (diva, beldade, beleza, deidade, deusa, divindade)
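
The merging rule for thesauri can be sketched as below, using the synsets B1, B2 and T1 from the example. The function names are hypothetical, and two behaviours are assumptions rather than something the slides specify: a synset with no overlapping candidate is added as a new synset, and ties go to the first candidate.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two synsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def merge_thesauri(base, other):
    """Merge every synset of `other` into its most similar synset in `base`."""
    merged = [set(s) for s in base]
    for synset in other:
        best = max(merged, key=lambda candidate: jaccard(synset, candidate), default=None)
        if best is not None and jaccard(synset, best) > 0:
            best |= set(synset)          # N = Bj ∪ Ti
        else:
            merged.append(set(synset))   # assumption: keep unmatched synsets as new ones
    return merged

b1 = ("diva", "beldade", "beleza", "deidade", "deusa", "divindade")
b2 = ("divindade", "deidade", "deus", "nume")
t1 = ("divindade", "diva", "deusa")

c1 = jaccard(t1, b1)   # 3 shared terms out of 6 distinct terms
c2 = jaccard(t1, b2)   # 1 shared term out of 6 distinct terms
merged = merge_thesauri([b1, b2], [t1])
print(c1, c2, merged[0])
```

Because T1 is a subset of B1, the merged synset equals B1, matching N in the example.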

Approach: Assigning terms to synsets

Mapping methods
Input:
▶ Thesaurus T, containing synsets
▶ Term-based semantic network, N, where each edge has a type R
Goal: map a R b ∈ N to A R B, (A, B) ∈ T
Output: semantic network W, whose nodes are synsets, which relate to other synsets by means of semantic relations (wordnet)

Procedure 1
Assignment of a (in a R b) to A:
1 Fix b
2 Sa ⊂ T : Sai ∈ Sa, a ∈ Sai
    ▶ a is not in T? Create synset A = (a), a → A
3 For each Sai ∈ Sa,
    ▶ pai = nai / |Sai|, nai = number of terms tj ∈ Sai : (tj R b)
        ★ Sa1 = (a, c, d, e), pa1 = 3/4
        ★ Sa2 = (a, f, g), pa2 = 2/3
        ★ Sa3 = (a, h, i, j), pa3 = 1/4
    ▶ a → Sa1

Procedure 1 (stage 2)
Only for semi-mapped triples a R B and A R b
Take advantage of established hypernymy links.
Assigning b in A R b

Procedure 1 (stage 2) – examples and additional cleaning
If there is Ci ∈ C with... Ci HYPER OF H ∧ A R H, b → Ci
If all Ci HYPER OF Ii ∧ A R Ii, triples A R Ii can be inferred!
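
The first stage of Procedure 1 (scoring each candidate synset of a by the fraction of its terms that hold the same relation with b) can be sketched as follows. The function name, the data layout (synsets as frozensets, triples as a set of tuples) and the tie-breaking by list order are assumptions made for illustration. The toy triples are chosen so that the scores reproduce the 3/4, 2/3 and 1/4 of the example.

```python
def assign_term(a, b, relation, thesaurus, triples):
    """Choose the synset for term `a` in the term-based triple (a, relation, b)."""
    candidates = [s for s in thesaurus if a in s]
    if not candidates:
        # a is not in T: create the singleton synset (a).
        return frozenset([a])

    def score(synset):
        # n = number of terms t in the synset for which (t relation b) holds.
        n = sum(1 for t in synset if (t, relation, b) in triples)
        return n / len(synset)

    return max(candidates, key=score)

thesaurus = [
    frozenset("acde"),   # Sa1
    frozenset("afg"),    # Sa2
    frozenset("ahij"),   # Sa3
]
# Made-up term-based triples: a, c, d and f are related to b through R.
triples = {("a", "R", "b"), ("c", "R", "b"), ("d", "R", "b"), ("f", "R", "b")}

chosen = assign_term("a", "b", "R", thesaurus, triples)
unknown = assign_term("z", "b", "R", thesaurus, triples)
print(sorted(chosen), sorted(unknown))
```

The same routine would be applied with the roles swapped (fixing a) to assign b to a synset B.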
Example: if H = (dog), I1 = (cat), I2 = (mouse) and Ci = (mammal):
▶ A = (hair) and R = (PART OF)
▶ A = (animal) and R = (HYPER OF)

Alternative mapping procedure
1 M = term-term matrix based on the adjacencies of the lexical network.
2 Collect all the synsets with a, Sa ⊂ T, and all synsets with b, Sb ⊂ T.
3 For each A ∈ Sa and B ∈ Sb, with terms Ai ∈ A and Bj ∈ B:

$$ sim(A, B) = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} \cos(A_i, B_j)}{|A|\,|B|} $$

4 Select the pair of synsets with the highest similarity

Experimentation: Preparation

Resources used (only nouns)
PAPEL lexical network (http://www.linguateca.pt/PAPEL/)
▶ Hypernymy, part-of and member-of triples
▶ Synonymy instances
    ★ Huge synonymy sub-network with 16k nodes!!!
TeP thesaurus (http://www.nilc.icmc.usp.br/tep2/index.htm)
OpenThesaurus.PT (OT) (http://openthesaurus.caixamagica.pt/)
CLIP = clustered PAPEL
TOP = TeP merged with OT, merged with CLIP

Resulting Thesaurus

                   Words                              Synsets
        Quantity  Ambiguous  Most ambiguous   Quantity  Avg. size  Biggest
TeP     17,158    5,867      20               8,254     3.51       21
OT      5,819     442        4                1,872     3.37       14
CLIP    23,741    12,196     47               7,468     12.57      103
TOP     30,554    13,294     21               9,960     6.6        277

Table: (Noun) thesauri in numbers.

Clustered sub-network of PAPEL – example

[Figure: example of a clustered sub-network of PAPEL]

Manual validation

             CLIP       CLIP’      TOP        TOP’
Sample       519 sets   310 sets   480 sets   448 sets
Correct      65.8%      81.1%      83.2%      86.8%
Incorrect    31.7%      16.9%      15.8%      12.3%
N/A          2.5%       2.0%       1.0%       0.9%
Agreement    76.1%      84.2%      82.3%      83.0%

Table: Results of manual synset validation. CLIP’ and TOP’ only consider synsets with 10 or fewer words.
▶ The quality is higher for smaller synsets.

Experimentation: Wordnet establishment

Resulting WordNet

                          Hypernym of   Part of   Member of
Term-based triples        62,591        2,805     5,929
  Mapped 1st              27,750        1,460     3,962
  Same synset             233           5         12
  Already present         3,970         40        167
Semi-mapped triples       7,952         262       357
  Mapped 2nd              88            1         0
  Could be inferred       50            0         0
  Already present         13            0         0
Synset-based triples      23,572        1,416     3,783

Table: Results of triples mapping
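
The counts in the triples-mapping table are consistent with a simple bookkeeping reading. This reading is inferred from the numbers and is not stated on the slides: the synset-based triples equal those mapped in the first stage, minus triples whose two terms fall in the same synset and triples already present, plus those mapped in the second stage, minus the inferable and already-present ones from that stage. A short check, with variable names chosen only for illustration:

```python
# Columns: term-based, mapped 1st, same synset, already present (1st),
#          semi-mapped, mapped 2nd, could be inferred, already present (2nd),
#          synset-based.
table = {
    "Hypernym of": (62591, 27750, 233, 3970, 7952, 88, 50, 13, 23572),
    "Part of": (2805, 1460, 5, 40, 262, 1, 0, 0, 1416),
    "Member of": (5929, 3962, 12, 167, 357, 0, 0, 0, 3783),
}

derived = {}
for relation, row in table.items():
    term_based, mapped1, same, present1, semi, mapped2, inferable, present2, synset_based = row
    # Assumed bookkeeping: what survives stage 1 plus what survives stage 2.
    derived[relation] = (mapped1 - same - present1) + (mapped2 - inferable - present2)
    share = mapped1 / term_based
    print(f"{relation}: derived={derived[relation]}, table={synset_based}, "
          f"mapped in stage 1: {share:.1%}")
```

Under this reading the derived totals match the last row for all three relations, and the first stage maps roughly 44% of the hypernymy triples, 52% of the part-of triples and 67% of the member-of triples.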

Automatic validation
For each triple, A R B
1 Compile a set of textual patterns denoting R, e.g.:
    ▶ (hypo) é um|uma (tipo|forma|variedade|...)* de (hyper)
      [English gloss: (hypo) is a (type|form|variety|...)* of (hyper)]
    ▶ (whole/group) é um (grupo|conjunto|...) de (part/member)
      [English gloss: (whole/group) is a (group|set|...) of (part/member)]
2 Score the triple with the help of Google:

$$ score = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} found(A_i, B_j, R)}{|A| \ast |B|} $$

Relation       Sample size    Validation
Hypernym of    419 synsets    44.1%
Member of      379 synsets    24.3%
Part of        290 synsets    24.8%

Table: Automatic validation of triples
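
The score is the fraction of term pairs (Ai, Bj) for which some pattern of R is found. A minimal sketch follows, assuming `found` returns 1 or 0. Here `found` is a hypothetical stand-in backed by a hand-made evidence set, not a real search-engine call, and the example terms are made up for illustration.

```python
def triple_score(synset_a, synset_b, relation, found):
    """Average of found(Ai, Bj, R) over all term pairs of the two synsets."""
    pairs = [(a, b) for a in synset_a for b in synset_b]
    return sum(found(a, b, relation) for a, b in pairs) / len(pairs)

# Hypothetical evidence: pairs for which a textual pattern was matched.
evidence = {("animal", "HYPERNYM_OF", "cão")}

def found(a, b, relation):
    """Stand-in for the web lookup: 1 if the pair has supporting evidence, else 0."""
    return 1 if (a, relation, b) in evidence else 0

score = triple_score(("animal", "bicho"), ("cão",), "HYPERNYM_OF", found)
print(score)
```

With two terms in A, one term in B and evidence for one pair only, the score is 0.5.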

Concluding remarks

Our way to achieve WSD without a context continues...
▶ Clustering is a suitable alternative for establishing synsets
    ★ What about for networks not extracted from dictionaries?
▶ Rules can be defined to map terms in triples to synsets
    ★ Though some triples remain unmapped...
Future:
▶ Evaluate the alternative mapping method
▶ Exploit other resources: e.g. Wiktionary and Wikipedia

References

Christiane Fellbaum, editor (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.
Graeme Hirst (2004). Ontology and the lexicon. In Steffen Staab and Rudi Studer, editors, Handbook on Ontologies, International Handbooks on Information Systems, pages 209–230. Springer.
S. M. van Dongen (2000). Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht.
David Gfeller, Jean-Cédric Chappelier and Paulo De Los Rios (2005). Synonym Dictionary Improvement through Markov Clustering and Clustering Stability. In Proc. of International Symposium on Applied Stochastic Models and Data Analysis (ASMDA), pages 106–113.

Thank you!