Towards the Automatic Creation of a Wordnet from a
Term-based Lexical Network
Hugo Gonçalo Oliveira, Paulo Gomes
(hroliv,pgomes)@dei.uc.pt
Cognitive & Media Systems Group
CISUC, University of Coimbra
Uppsala, July 15, 2010
Gonçalo Oliveira & Gomes (CISUC)
TextGraphs-5
Uppsala, July 15, 2010
1 / 24
Outline
1 Introduction
    Lexical ontologies
    Information extraction
    Issues
    Research Goals
2 Approach
    Clustering for synsets
    Merging thesauri
    Assigning terms to synsets
3 Experimentation
    Preparation
    Wordnet establishment
4 Concluding remarks
Introduction
Lexical ontologies
Lexical ontologies, such as Princeton WordNet [Fellbaum 1998]
▶ Ontology + lexicon [Hirst 2004]
▶ Knowledge structured on words and their meanings
▶ Cover the whole language
▶ Not based on a specific domain
Construction and maintenance involve time-consuming human effort!
Introduction
Information extraction
Information extraction from text
From dictionaries:
1 basketball, noun – a game, also known as hoops, played indoors...
    → game HYPERNYM OF basketball
    → basketball SYNONYM OF hoops
2 basketball, noun – the ball used in playing basketball.
    → ball HYPERNYM OF basketball
From textual corpora:
▶ ... team sports, such as basketball, rugby ...
    → team sport HYPERNYM OF basketball
    → team sport HYPERNYM OF rugby
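The extraction examples above can be mimicked with a few regular expressions. This is only a toy sketch: the function names, the relation labels with underscores and the three patterns are assumptions made for illustration, not the extraction grammars used by the authors, and the singularisation is deliberately naive.

```python
import re

def triples_from_definition(headword, definition):
    """Toy genus/synonym extraction from a single dictionary definition."""
    triples = []
    # Genus: the first word after an initial article is taken as the hypernym.
    genus = re.match(r"\s*(?:the|an|a)\s+(\w+)", definition)
    if genus:
        triples.append((genus.group(1), "HYPERNYM_OF", headword))
    # Synonymy cue: "also known as <word>".
    for synonym in re.findall(r"also known as (\w+)", definition):
        triples.append((headword, "SYNONYM_OF", synonym))
    return triples

def triples_from_sentence(sentence):
    """Toy 'X, such as A, B' pattern for corpus text (lower-case words only)."""
    match = re.search(
        r"\b((?:[a-z]+ )?[a-z]+), such as ((?:[a-z]+, )*[a-z]+)", sentence)
    if not match:
        return []
    hypernym = match.group(1)
    if hypernym.endswith("s"):  # naive singularisation, English only
        hypernym = hypernym[:-1]
    return [(hypernym, "HYPERNYM_OF", hyponym)
            for hyponym in match.group(2).split(", ")]

print(triples_from_definition(
    "basketball", "a game, also known as hoops, played indoors..."))
print(triples_from_definition(
    "basketball", "the ball used in playing basketball."))
print(triples_from_sentence("... team sports, such as basketball, rugby ..."))
```

Note that both definitions yield triples about the same term, basketball, even though they describe two different senses; this is the problem raised on the next slide.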
Introduction
Issues
Natural language is ambiguous
Term-based networks are impractical for many applications
In the previous example: is hoops a team sport?
An example extracted from a Portuguese dictionary:
ruína SYNONYM OF queda ∧ queda SYNONYM OF habilidade
→ habilidade SYNONYM OF ruína ??
queda can either mean aptitude or downfall!
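A minimal sketch of why this matters: if synonymy between terms is treated as transitive, the two senses of queda collapse into a single group. The function name and the grouping strategy are illustrative assumptions; only the two synonymy pairs come from the slide.

```python
def synonym_components(pairs):
    """Group terms by the transitive closure of term-based synonymy pairs."""
    group_of = {}
    for a, b in pairs:
        # Merge the groups of both terms and point every member at the result.
        merged = group_of.get(a, {a}) | group_of.get(b, {b})
        for term in merged:
            group_of[term] = merged
    return {frozenset(group) for group in group_of.values()}

pairs = [("ruína", "queda"), ("queda", "habilidade")]
groups = synonym_components(pairs)
print(groups)  # a single group that mixes "downfall" and "aptitude"
```

A sense-aware structure needs two synsets here, one with ruína and queda and another with queda and habilidade, which is what the clustering step of the approach aims for.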
Introduction
Research Goals
Onto.PT
Automatic construction of a lexical ontology for Portuguese
Extracted from different sources
▶ Manually created thesauri
▶ Language dictionaries/encyclopedias
▶ Corpora
Modelled after Princeton WordNet
▶ Synsets: groups of synonymous words
▶ Synset-based relational triples
Word sense disambiguation (WSD) based on the knowledge already extracted, not on the context.
Approach
Information flow
Approach
Clustering for synsets
Synonymy networks tend to have a clustered structure
Approach
Clustering for synsets
Synset discovery (inspired by [Gfeller et al. 2005])
1 Split the original network into sub-networks and calculate the frequency-weighted adjacency matrix F of each sub-network;
2 Fij = Fij + Fij ∗ δ, −0.5 < δ < 0.5;
3 Run MCL [van Dongen 2000], with γ = 1.6, over F, 30 times;
4 Use the (hard) clustering from each run to create P, a matrix with the probabilities of each pair of words in F belonging to the same cluster;
5 Remove: (a) big clusters, B, if there is a group of clusters C = C1, C2, ..., Cn such that B = C1 ∪ C2 ∪ ... ∪ Cn; (b) clusters completely included in other clusters.
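Steps 2 to 4 can be sketched in plain Python as follows. This is a simplified illustration under several assumptions that are not in the slides: the function names, self-loops of weight 1.0, a fixed number of MCL iterations instead of a convergence test, symmetric noise, and reading clusters off the final matrix with a union-find over entries above a small threshold. Step 5 and the derivation of the final synsets from P are not covered.

```python
import random

def normalise_columns(m):
    """Scale every column of a square matrix so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl_labels(weights, inflation=1.6, iterations=40, eps=1e-6):
    """Basic Markov Clustering; returns one cluster label per node."""
    n = len(weights)
    # Self-loops (weight 1.0, an assumption) keep every column non-empty.
    m = normalise_columns([[weights[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to the power gamma, then re-normalise.
        m = normalise_columns([[x ** inflation for x in row] for row in m])
    # Nodes that still share flow are put in the same cluster (union-find).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

def same_cluster_probabilities(freq, runs=30, inflation=1.6, seed=0):
    """P[i][j] = fraction of noisy MCL runs in which i and j share a cluster."""
    rng = random.Random(seed)
    n = len(freq)
    together = [[0] * n for _ in range(n)]
    for _ in range(runs):
        noisy = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                delta = rng.uniform(-0.5, 0.5)      # step 2: Fij + Fij * delta
                noisy[i][j] = noisy[j][i] = freq[i][j] * (1 + delta)
        labels = mcl_labels(noisy, inflation)        # step 3: one MCL run
        for i in range(n):
            for j in range(n):
                together[i][j] += labels[i] == labels[j]
    return [[count / runs for count in row] for row in together]

# Hypothetical sub-network: two triangles joined by one weak edge (2-3).
freq = [[0, 2, 2, 0, 0, 0],
        [2, 0, 2, 0, 0, 0],
        [2, 2, 0, 1, 0, 0],
        [0, 0, 1, 0, 2, 2],
        [0, 0, 0, 2, 0, 2],
        [0, 0, 0, 2, 2, 0]]
P = same_cluster_probabilities(freq)
for row in P:
    print(["%.2f" % p for p in row])
```

Pairs whose probability stays close to 1 across the noisy runs are stable candidates for the same synset, while intermediate values point to words sitting between clusters.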
Approach
Merging thesauri
Merging synsets from different thesauri
For each synset Ti ∈ T, select the Bj ∈ B with the highest Jaccard coefficient, $c(T_i, B_j) = |T_i \cap B_j| / |T_i \cup B_j|$
B1 = (diva, beldade, beleza, deidade, deusa, divindade)
B2 = (divindade, deidade, deus, nume)
T1 = (divindade, diva, deusa)
▶ $c(T_1, B_1) = 3/6 = 1/2$
▶ $c(T_1, B_2) = 1/6$
N = B1 ∪ T1 = (diva, beldade, beleza, deidade, deusa, divindade)
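The merging step can be sketched with Python sets. The function names are illustrative, and two details are assumptions the slide does not state: a synset of T is merged only if it overlaps with some synset of B, and a synset with no overlap is kept as a new entry. The sketch also assumes B is non-empty.

```python
def jaccard(x, y):
    """Jaccard coefficient of two non-empty sets."""
    return len(x & y) / len(x | y)

def merge_thesauri(T, B):
    """Merge every synset of T into its most similar synset of B."""
    merged = [set(b) for b in B]
    for t in T:
        scores = [jaccard(set(t), set(b)) for b in B]
        best = max(range(len(B)), key=scores.__getitem__)
        if scores[best] > 0:
            merged[best] |= set(t)
        else:
            merged.append(set(t))  # assumption: unmatched synsets are kept
    return merged

B1 = {"diva", "beldade", "beleza", "deidade", "deusa", "divindade"}
B2 = {"divindade", "deidade", "deus", "nume"}
T1 = {"divindade", "diva", "deusa"}

print(jaccard(T1, B1), jaccard(T1, B2))
print(merge_thesauri([T1], [B1, B2]))
```

With the synsets from the slide, T1 is absorbed by B1, so the merged synset N equals B1.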
Approach
Assigning terms to synsets
Mapping methods
Input:
▶ Thesaurus T, containing synsets
▶ Term-based semantic network, N, where each edge has a type R
Goal: map each term-based triple a R b ∈ N to a synset-based triple A R B, with A, B ∈ T
Output: semantic network W, whose nodes are synsets, which relate to other synsets by means of semantic relations (wordnet)
Approach
Assigning terms to synsets
Procedure 1
Assignment of a (in a R b) to A:
1 Fix b
2 Sa ⊂ T : Sai ∈ Sa, a ∈ Sai
    ▶ a is not in T? create synset A = (a), a → A
3 For each Sai ∈ Sa,
    ▶ $p_{ai} = n_{ai} / |S_{ai}|$, where $n_{ai}$ = number of terms tj ∈ Sai : (tj R b)
        ★ Sa1 = (a, c, d, e), pa1 = 3/4
        ★ Sa2 = (a, f, g), pa2 = 2/3
        ★ Sa3 = (a, h, i, j), pa3 = 1/4
    ▶ a → Sa1
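The first stage of Procedure 1 can be sketched as below. The function name, the representation of the network as a set of (term, relation, term) tuples and the toy data are assumptions; the slide also does not say how ties are handled, so this sketch simply keeps the first candidate with the highest proportion.

```python
from fractions import Fraction

def assign_term(a, relation, b, thesaurus, network):
    """Pick the synset for a in the triple (a, relation, b), with b fixed."""
    candidates = [synset for synset in thesaurus if a in synset]
    if not candidates:
        return frozenset([a])  # a is not in T: create the synset A = (a)
    def proportion(synset):
        # Share of the synset's terms that also hold the relation with b.
        n = sum(1 for term in synset if (term, relation, b) in network)
        return Fraction(n, len(synset))
    return max(candidates, key=proportion)

# Toy data reproducing the proportions 3/4, 2/3 and 1/4 from the slide.
thesaurus = [frozenset("acde"), frozenset("afg"), frozenset("ahij")]
network = {(term, "R", "b") for term in "acdf"}

print(sorted(assign_term("a", "R", "b", thesaurus, network)))
```

Exact fractions are used instead of floats so that equal proportions compare as equal.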
Approach
Assigning terms to synsets
Procedure 1 (stage 2)
Only for semi-mapped triples a R B and A R b
Take advantage of established hypernymy links.
Assigning b in A R b
Approach
Assigning terms to synsets
Procedure 1 (stage 2) – examples and additional cleaning
If there is Ci ∈ C with...
Ci HYPER OF H ∧ A R H, b → Ci
If all Ci HYPER OF Ii ∧ A R Ii, triples A R Ii can be inferred!
If H = (dog), I1 = (cat), I2 = (mouse) and Ci = (mammal):
▶ A = (hair) and R = (PART OF)
▶ A = (animal) and R = (HYPER OF)
Approach
Assigning terms to synsets
Alternative mapping procedure
1 M = term-term matrix based on the adjacencies of the lexical network.
2 Collect all the synsets with a, Sa ⊂ T, and all synsets with b, Sb ⊂ T.
3 For each A ∈ Sa and B ∈ Sb, with terms Ai ∈ A and Bj ∈ B:
    $sim(A, B) = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} \cos(A_i, B_j)}{|A| |B|}$
4 Select the pair of synsets with the highest similarity
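A small sketch of the alternative procedure, under assumptions that go beyond the slide: the rows of M are taken as binary adjacency vectors with the diagonal set to 1, a term missing from the thesaurus falls back to a singleton synset, and the example network and synsets are invented for illustration.

```python
import math
from itertools import product

def adjacency(triples):
    """Binary rows of M: each term maps to itself plus its neighbours."""
    adj = {}
    for a, _relation, b in triples:
        adj.setdefault(a, {a}).add(b)
        adj.setdefault(b, {b}).add(a)
    return adj

def cos(u, v, adj):
    """Cosine between two binary adjacency vectors."""
    nu, nv = adj.get(u, {u}), adj.get(v, {v})
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def sim(A, B, adj):
    """Average pairwise cosine between the terms of two synsets."""
    return sum(cos(x, y, adj) for x, y in product(A, B)) / (len(A) * len(B))

def best_pair(a, b, thesaurus, adj):
    """Pair of synsets (A, B) with a in A and b in B and highest similarity."""
    # Assumption: a term absent from the thesaurus gets a singleton synset.
    Sa = [s for s in thesaurus if a in s] or [frozenset([a])]
    Sb = [s for s in thesaurus if b in s] or [frozenset([b])]
    return max(product(Sa, Sb), key=lambda pair: sim(pair[0], pair[1], adj))

# Hypothetical data: "bank" is ambiguous between two synsets.
triples = [("institution", "HYPERNYM_OF", "bank"),
           ("institution", "HYPERNYM_OF", "lender"),
           ("land", "HYPERNYM_OF", "shore")]
thesaurus = [frozenset({"bank", "shore"}), frozenset({"bank", "lender"})]
adj = adjacency(triples)

A, B = best_pair("institution", "bank", thesaurus, adj)
print(sorted(A), sorted(B))
```

Unlike Procedure 1, both arguments of the triple are disambiguated at once, since the similarity is computed for every candidate pair.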
Experimentation
Preparation
Resources used (only nouns)
PAPEL lexical network (http://www.linguateca.pt/PAPEL/)
▶ Hypernymy, part-of and member-of triples
▶ Synonymy instances
    ★ Huge synonymy sub-network with 16k nodes!!!
TeP thesaurus (http://www.nilc.icmc.usp.br/tep2/index.htm)
OpenThesaurus.PT (OT) (http://openthesaurus.caixamagica.pt/)
CLIP = clustered PAPEL
TOP = TeP merged with OT, merged with CLIP
Experimentation
Preparation
Resulting Thesaurus
        Words                                   Synsets
        Quantity   Ambiguous   Most ambiguous   Quantity   Avg. size   Biggest
TeP     17,158     5,867       20               8,254      3.51        21
OT      5,819      442         4                1,872      3.37        14
CLIP    23,741     12,196      47               7,468      12.57       103
TOP     30,554     13,294      21               9,960      6.6         277
Table: (Noun) thesauri in numbers.
Experimentation
Preparation
Clustered sub-network of PAPEL – example
Experimentation
Preparation
Manual validation
         Sample     Correct   Incorrect   N/A    Agreement
CLIP     519 sets   65.8%     31.7%       2.5%   76.1%
CLIP’    310 sets   81.1%     16.9%       2.0%   84.2%
TOP      480 sets   83.2%     15.8%       1.0%   82.3%
TOP’     448 sets   86.8%     12.3%       0.9%   83.0%
Table: Results of manual synset validation.
CLIP’ and TOP’ only consider synsets with 10 or fewer words.
▶ The quality is higher for smaller synsets.
Experimentation
Wordnet establishment
Resulting WordNet
                                 Hypernym of   Part of   Member of
Term-based triples               62,591        2,805     5,929
1st stage   Mapped               27,750        1,460     3,962
            Same synset          233           5         12
            Already present      3,970         40        167
Semi-mapped triples              7,952         262       357
2nd stage   Mapped               88            1         0
            Could be inferred    50            0         0
            Already present      13            0         0
Synset-based triples             23,572        1,416     3,783
Table: Results of triples mapping
Experimentation
Wordnet establishment
Automatic validation
For each triple, A R B
1 Compile a set of textual patterns denoting R, e.g.:
    ▶ (hypo) é um|uma (tipo|forma|variedade|...)* de (hyper)   ["(hypo) is a (type|form|variety|...) of (hyper)"]
    ▶ (whole/group) é um (grupo|conjunto|...) de (part/member)   ["(whole/group) is a (group|set|...) of (part/member)"]
2 Score the triple with the help of Google:
    $score = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} found(A_i, B_j, R)}{|A| * |B|}$
Relation       Sample size    Validation
Hypernym of    419 synsets    44.1%
Member of      379 synsets    24.3%
Part of        290 synsets    24.8%
Table: Automatic validation of triples
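The scoring formula can be sketched as follows. The web search is replaced here by a search over a tiny in-memory list of sentences, which is an assumption made to keep the example self-contained; the function names, the regular expression (a rendering of the hypernymy pattern only) and the sample sentence are likewise illustrative.

```python
import re
from itertools import product

# Regex rendering of the hypernymy pattern; other relations need their own.
HYPERNYM_PATTERN = (
    r"\b{hypo} é (?:um|uma) (?:(?:tipo|forma|variedade) )*de {hyper}\b")

def found(hyper, hypo, corpus):
    """1 if some sentence matches the pattern for this pair of terms, else 0."""
    pattern = HYPERNYM_PATTERN.format(
        hypo=re.escape(hypo), hyper=re.escape(hyper))
    return int(any(re.search(pattern, text) for text in corpus))

def score(A, B, corpus):
    """Fraction of term pairs (Ai, Bj) supported by the corpus,
    for a triple A HYPERNYM_OF B."""
    hits = sum(found(a, b, corpus) for a, b in product(A, B))
    return hits / (len(A) * len(B))

# Stand-in for the web: one made-up Portuguese sentence.
corpus = ["o basquetebol é um tipo de desporto coletivo"]
A = {"desporto", "esporte"}   # hypernym synset
B = {"basquetebol"}           # hyponym synset

print(score(A, B, corpus))
```

Because the score averages over all term pairs, large synsets with rare words are penalised even when the relation is correct, which may partly explain the low validation figures in the table.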
Concluding remarks
Our path towards WSD without context continues...
▶ Clustering is a suitable alternative for establishing synsets
    ★ What about for networks not extracted from dictionaries?
▶ Rules can be defined to map terms in triples to synsets
    ★ Though some triples remain unmapped...
Future:
▶ Evaluate the alternative mapping method
▶ Exploit other resources: e.g. Wiktionary and Wikipedia
The end
References
Christiane Fellbaum, editor (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.
Graeme Hirst (2004). Ontology and the lexicon. In Steffen Staab and Rudi Studer, editors, Handbook on Ontologies, International Handbooks on Information Systems, pages 209–230. Springer.
S. M. van Dongen (2000). Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht.
David Gfeller, Jean-Cédric Chappelier and Paulo De Los Rios (2005). Synonym Dictionary Improvement through Markov Clustering and Clustering Stability. In Proc. of International Symposium on Applied Stochastic Models and Data Analysis (ASMDA), pages 106–113.
Thank you!