Towards the Automatic Creation of a Wordnet from a
Term-based Lexical Network
Hugo Gonçalo Oliveira, Paulo Gomes
(hroliv,pgomes)@dei.uc.pt
Cognitive & Media Systems Group
CISUC, University of Coimbra
Uppsala, July 15, 2010
Gonçalo Oliveira & Gomes (CISUC)
TextGraphs-5
Uppsala, July 15, 2010
1 / 24
Outline
1 Introduction
  Lexical ontologies
  Information extraction
  Issues
  Research Goals
2 Approach
  Clustering for synsets
  Merging thesauri
  Assigning terms to synsets
3 Experimentation
  Preparation
  Wordnet establishment
4 Concluding remarks
Introduction
Lexical ontologies
Lexical ontologies
Such as Princeton WordNet [Fellbaum 1998]
▶ Ontology + lexicon [Hirst 2004]
▶ Knowledge structured on words and their meanings
▶ Cover the whole language
▶ Not based on a specific domain
Construction and maintenance involve time-consuming human effort!
Introduction
Information extraction
Information extraction from text
From dictionaries:
1 basketball, noun – a game, also known as hoops, played indoors...
  → game HYPERNYM OF basketball
  → basketball SYNONYM OF hoops
2 basketball, noun – the ball used in playing basketball.
  → ball HYPERNYM OF basketball
From textual corpora:
▶ ... team sports, such as basketball, rugby ...
  → team sport HYPERNYM OF basketball
  → team sport HYPERNYM OF rugby
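The corpus example can be reproduced with a single lexical pattern. The sketch below is only an illustration: the regular expression `SUCH_AS`, the naive `singular` helper and the function name `extract_hypernymy` are assumptions made for this example, not the extraction grammar actually used by the authors.

```python
import re

# Hypothetical, deliberately tiny "X, such as Y, Z" pattern (English only).
SUCH_AS = re.compile(r"(?P<hyper>\w+(?: \w+)?),? such as (?P<hypos>\w+(?:, \w+)*)")


def singular(term):
    # Naive singularisation, good enough for the slide example only.
    return term[:-1] if term.endswith("s") else term


def extract_hypernymy(text):
    """Return (hypernym, 'HYPERNYM OF', hyponym) triples found in text."""
    triples = []
    for match in SUCH_AS.finditer(text):
        hyper = singular(match.group("hyper"))
        for hypo in match.group("hypos").split(", "):
            triples.append((hyper, "HYPERNYM OF", hypo))
    return triples


for triple in extract_hypernymy("... team sports, such as basketball, rugby ..."):
    print(triple)
```

Note that the output is term-based: nothing says which sense of each term is meant.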
Introduction
Issues
Natural language is ambiguous
Term-based networks are impractical for many applications
In the previous example: is hoops a team sport?
An example extracted from a Portuguese dictionary:
ruína (ruin) SYNONYM OF queda ∧ queda SYNONYM OF habilidade (skill)
→ habilidade SYNONYM OF ruína ??
queda can mean either aptitude or downfall!
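A minimal sketch of why this matters, assuming synonymy is naively treated as transitive: grouping terms by connected components of the synonymy graph puts all three words in one set. The function name `naive_synsets` is hypothetical and this is not the method proposed in the talk; it only illustrates the problem.

```python
from collections import defaultdict


def naive_synsets(pairs):
    """Group terms by connected components of the synonymy graph."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            term = stack.pop()
            if term in component:
                continue
            component.add(term)
            stack.extend(graph[term] - component)
        seen |= component
        components.append(component)
    return components


# The ambiguous term "queda" bridges two unrelated meanings.
print(naive_synsets([("ruína", "queda"), ("queda", "habilidade")]))
```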
Introduction
Research Goals
Onto.PT
Automatic construction of a lexical ontology for Portuguese
Extracted from different sources
▶ Manually created thesauri
▶ Language dictionaries/encyclopedias
▶ Corpora
Modelled after Princeton WordNet
▶ Synsets: groups of synonymous words
▶ Synset-based relational triples
WSD based on the knowledge already extracted, not on the context.
Approach
Information flow
Approach
Clustering for synsets
Synonymy networks tend to have a clustered structure
Approach
Clustering for synsets
Synset discovery (inspired by [Gfeller et al. 2005])
1 Split the original network into sub-networks and calculate the frequency-weighted adjacency matrix F of each sub-network;
2 $F_{ij} = F_{ij} + F_{ij} \cdot \delta$, with $-0.5 < \delta < 0.5$;
3 Run MCL [van Dongen 2000], with $\gamma = 1.6$, over F 30 times;
4 Use the (hard) clustering from each run to create P, a matrix with the probabilities of each pair of words in F belonging to the same cluster;
5 Remove: (a) big clusters, B, if there is a group of clusters $C = C_1, C_2, \ldots, C_n$ such that $B = C_1 \cup C_2 \cup \ldots \cup C_n$; (b) clusters completely included in other clusters.
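Steps 2 to 4 can be sketched with NumPy as follows. This is a rough illustration under several assumptions that are not on the slide: γ is read as the MCL inflation parameter, self-loops are added before normalisation, MCL runs for a fixed number of iterations instead of testing convergence, δ is drawn uniformly and kept symmetric, and the names `mcl_labels` and `cooccurrence_probabilities` are made up for the example.

```python
import numpy as np


def mcl_labels(F, inflation=1.6, iterations=100):
    """Hard clustering of a weighted adjacency matrix with a basic MCL loop."""
    M = F.astype(float) + np.eye(F.shape[0])  # self-loops (assumed convention)
    M /= M.sum(axis=0, keepdims=True)         # make columns stochastic
    for _ in range(iterations):
        M = M @ M                             # expansion
        M = M ** inflation                    # inflation
        M /= M.sum(axis=0, keepdims=True)     # re-normalise columns
    return M.argmax(axis=0)                   # attractor row of each column


def cooccurrence_probabilities(F, runs=30, inflation=1.6, seed=0):
    """P[i, j] = fraction of noisy MCL runs that put i and j together."""
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    P = np.zeros((n, n))
    for _ in range(runs):
        delta = np.triu(rng.uniform(-0.5, 0.5, size=(n, n)), k=1)
        delta = delta + delta.T               # symmetric noise (assumption)
        labels = mcl_labels(F * (1.0 + delta), inflation)
        P += labels[:, None] == labels[None, :]
    return P / runs


# Toy network: two disconnected triangles of synonyms.
F = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
P = cooccurrence_probabilities(F)
```

Words from different components never share a cluster, so the corresponding entries of P are 0. How P is turned into the final, possibly overlapping, clusters is not detailed on the slide and is left out here.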
Approach
Merging thesauri
Merging synsets from different thesauri
For each synset Ti ∈ T, select the Bj ∈ B with the highest c = |Ti ∩ Bj| / |Ti ∪ Bj| (Jaccard coefficient)
B1 = (diva, beldade, beleza, deidade, deusa, divindade)
B2 = (divindade, deidade, deus, nume)
T1 = (divindade, diva, deusa)
▶ c(T1, B1) = 3/6 = 1/2
▶ c(T1, B2) = 1/6
N = B1 ∪ T1 = (diva, beldade, beleza, deidade, deusa, divindade)
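The selection rule can be checked directly on the example sets. The sketch below is an illustration only: the names `jaccard` and `merge_into` are hypothetical, and keeping a synset of T with no overlap as a new synset is an assumption, since the slide does not say what happens in that case.

```python
from fractions import Fraction


def jaccard(s, t):
    """Jaccard coefficient |s ∩ t| / |s ∪ t| as an exact fraction."""
    s, t = set(s), set(t)
    return Fraction(len(s & t), len(s | t))


def merge_into(T, B):
    """Merge each synset of T into its most similar synset of B."""
    result = [set(b) for b in B]
    for t in T:
        scores = [jaccard(t, b) for b in B]
        best = max(range(len(B)), key=lambda j: scores[j], default=None)
        if best is not None and scores[best] > 0:
            result[best] |= set(t)
        else:
            result.append(set(t))  # assumption: no overlap, keep as new synset
    return result


B1 = ("diva", "beldade", "beleza", "deidade", "deusa", "divindade")
B2 = ("divindade", "deidade", "deus", "nume")
T1 = ("divindade", "diva", "deusa")

print(jaccard(T1, B1), jaccard(T1, B2))  # 1/2 1/6
print(merge_into([T1], [B1, B2])[0] == set(B1))
```

For these sets T1 shares three of six words with B1 and one of six with B2, so T1 is merged into B1, which already contains all of its words.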
Approach
Assigning terms to synsets
Mapping methods
Input:
▶ Thesaurus T, containing synsets
▶ Term-based semantic network, N, where each edge has a type R
Goal: map a R b ∈ N to A R B, with A, B ∈ T
Output: semantic network W, whose nodes are synsets, which relate to other synsets by means of semantic relations (wordnet)
Approach
Assigning terms to synsets
Procedure 1
Assignment of a (in a R b) to A:
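1 Fix b
2 Collect $S_a \subset T$, the synsets containing a: for every $S_{ai} \in S_a$, $a \in S_{ai}$
  ▶ a is not in T? create synset A = (a), a → A
3 For each $S_{ai} \in S_a$,
  ▶ $p_{ai} = n_{ai} / |S_{ai}|$, where $n_{ai}$ = number of terms $t_j \in S_{ai}$ : ($t_j$ R b)
    ★ $S_{a1}$ = (a, c, d, e), $p_{a1} = 3/4$
    ★ $S_{a2}$ = (a, f, g), $p_{a2} = 2/3$
    ★ $S_{a3}$ = (a, h, i, j), $p_{a3} = 1/4$
  ▶ a → $S_{a1}$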
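A minimal sketch of this first stage, on toy data mirroring the three example synsets. The function name `assign_term`, the representation of the network as a set of (term, relation, term) tuples, and the tie-breaking (first best candidate wins) are assumptions made for the illustration.

```python
from fractions import Fraction


def assign_term(a, relation, b, thesaurus, network):
    """Choose a synset for `a` in the term-based triple (a, relation, b)."""
    candidates = [synset for synset in thesaurus if a in synset]
    if not candidates:
        return frozenset({a})  # a is not in T: create the synset (a)

    def proportion(synset):
        # Share of terms in the synset that also hold `relation` with b.
        n = sum((term, relation, b) in network for term in synset)
        return Fraction(n, len(synset))

    return max(candidates, key=proportion)


thesaurus = [frozenset("acde"), frozenset("afg"), frozenset("ahij")]
network = {("a", "R", "b"), ("c", "R", "b"), ("d", "R", "b"), ("f", "R", "b")}

print(sorted(assign_term("a", "R", "b", thesaurus, network)))
```

With this network the proportions are 3/4, 2/3 and 1/4, so a is assigned to the first synset.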
Approach
Assigning terms to synsets
Procedure 1 (stage 2)
Only for semi-mapped triples a R B and A R b
Take advantage of established hypernymy links.
Assigning b in A R b
Approach
Assigning terms to synsets
Procedure 1 (stage 2) – examples and additional cleaning
If there is Ci ∈ C with...
Ci HYPER OF H ∧ A R H, b → Ci
If all Ci HYPER OF Ii ∧ A R Ii , triples A R Ii can be inferred!
If H = (dog), I1 = (cat), I2 = (mouse) and Ci = (mammal):
▶ A = (hair) and R = (PART OF)
▶ A = (animal) and R = (HYPER OF)
Approach
Assigning terms to synsets
Alternative mapping procedure
1 M = term-term matrix based on the adjacencies of the lexical network.
2 Collect all the synsets with a, Sa ⊂ T, and all synsets with b, Sb ⊂ T.
3 For each A ∈ Sa and B ∈ Sb, with terms Ai ∈ A and Bj ∈ B:
  $$\mathrm{sim}(A, B) = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} \cos(A_i, B_j)}{|A|\,|B|}$$
4 Select the pair of synsets with the highest similarity
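The similarity above is the average pairwise cosine between the rows of M for the terms of each synset. A small sketch, assuming M is a NumPy array and `index` maps each term to its row; the names `cosine`, `synset_similarity`, `best_pair` and the toy matrix are all hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine between two row vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def synset_similarity(A, B, M, index):
    """Average cosine over all term pairs (Ai, Bj)."""
    total = sum(cosine(M[index[x]], M[index[y]]) for x in A for y in B)
    return total / (len(A) * len(B))


def best_pair(Sa, Sb, M, index):
    """Pair (A, B) with the highest similarity; the first one wins ties."""
    pairs = [(A, B) for A in Sa for B in Sb]
    return max(pairs, key=lambda p: synset_similarity(p[0], p[1], M, index))


# Toy adjacency rows: a and x share neighbours, b and y share neighbours.
index = {"a": 0, "x": 1, "b": 2, "y": 3}
M = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

print(best_pair([("a", "x")], [("b", "y"), ("x",)], M, index))
```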
Experimentation
Preparation
Resources used (only nouns)
PAPEL lexical network (http://www.linguateca.pt/PAPEL/)
▶ Hypernymy, part-of and member-of triples
▶ Synonymy instances
  ★ Huge synonymy sub-network with 16k nodes!!!
TeP thesaurus (http://www.nilc.icmc.usp.br/tep2/index.htm)
OpenThesaurus.PT (OT) (http://openthesaurus.caixamagica.pt/)
CLIP = clustered PAPEL
TOP = TeP merged with OT, merged with CLIP
Experimentation
Preparation
Resulting Thesaurus
        Words                                   Synsets
        Quantity   Ambiguous   Most ambiguous   Quantity   Avg. size   Biggest
TeP     17,158     5,867       20               8,254      3.51        21
OT      5,819      442         4                1,872      3.37        14
CLIP    23,741     12,196      47               7,468      12.57       103
TOP     30,554     13,294      21               9,960      6.6         277
Table: (Noun) thesauri in numbers.
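One plausible reading of the columns, sketched in code: a word is ambiguous if it occurs in more than one synset, and "most ambiguous" is the largest number of synsets a single word occurs in. This reading and the function name `thesaurus_stats` are assumptions; the slide does not define the columns.

```python
from collections import Counter


def thesaurus_stats(synsets):
    """Summary figures for a thesaurus given as an iterable of synsets."""
    synsets = [set(s) for s in synsets]
    senses = Counter(word for synset in synsets for word in synset)
    sizes = [len(synset) for synset in synsets]
    return {
        "words": len(senses),
        "ambiguous": sum(1 for count in senses.values() if count > 1),
        "most_ambiguous": max(senses.values()),
        "synsets": len(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "biggest": max(sizes),
    }


example = [
    ("diva", "beldade", "beleza", "deidade", "deusa", "divindade"),
    ("divindade", "deidade", "deus", "nume"),
]
print(thesaurus_stats(example))
```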
Experimentation
Preparation
Clustered sub-network of PAPEL – example
Experimentation
Preparation
Manual validation
            CLIP       CLIP’      TOP        TOP’
Sample      519 sets   310 sets   480 sets   448 sets
Correct     65.8%      81.1%      83.2%      86.8%
Incorrect   31.7%      16.9%      15.8%      12.3%
N/A         2.5%       2.0%       1.0%       0.9%
Agreement   76.1%      84.2%      82.3%      83.0%
Table: Results of manual synset validation.
CLIP’ and TOP’ only consider synsets with 10 or fewer words.
▶ The quality is higher for smaller synsets.
Experimentation
Wordnet establishment
Resulting WordNet
                               Hypernym of   Part of   Member of
Term-based triples             62,591        2,805     5,929
1st stage: Mapped              27,750        1,460     3,962
1st stage: Same synset         233           5         12
1st stage: Already present     3,970         40        167
Semi-mapped triples            7,952         262       357
2nd stage: Mapped              88            1         0
2nd stage: Could be inferred   50            0         0
2nd stage: Already present     13            0         0
Synset-based triples           23,572        1,416     3,783
Table: Results of triples mapping
Experimentation
Wordnet establishment
Automatic validation
For each triple, A R B
1 Compile a set of textual patterns denoting R, e.g.:
  ▶ (hypo) é um|uma (tipo|forma|variedade|...)* de (hyper), i.e. "(hypo) is a (type|form|variety|...) of (hyper)"
  ▶ (whole/group) é um (grupo|conjunto|...) de (part/member), i.e. "(whole/group) is a (group|set|...) of (part/member)"
2 Score the triple with the help of Google:
  $$\mathrm{score} = \frac{\sum_{i=1}^{|A|} \sum_{j=1}^{|B|} \mathrm{found}(A_i, B_j, R)}{|A| \cdot |B|}$$

Relation      Sample size   Validation
Hypernym of   419 synsets   44.1%
Member of     379 synsets   24.3%
Part of       290 synsets   24.8%
Table: Automatic validation of triples
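The score is the fraction of term pairs (Ai, Bj) for which at least one pattern is found. In the sketch below `found` is a caller-supplied function returning 1 or 0; the stub `found_in_toy_evidence` replaces the web search with a fixed set of pairs and is purely hypothetical, as is the function name `score_triple`.

```python
def score_triple(A, B, relation, found):
    """Share of term pairs (Ai, Bj) for which found(Ai, Bj, relation) is 1."""
    hits = sum(found(a, b, relation) for a in A for b in B)
    return hits / (len(A) * len(B))


# Stand-in for the search step: pairs assumed to match some pattern.
TOY_EVIDENCE = {("animal", "cão", "HYPERNYM OF")}


def found_in_toy_evidence(a, b, relation):
    return 1 if (a, b, relation) in TOY_EVIDENCE else 0


A = ("animal", "bicho")
B = ("cão",)
print(score_triple(A, B, "HYPERNYM OF", found_in_toy_evidence))  # 0.5
```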
Concluding remarks
Concluding remarks
Our path towards WSD without a context continues...
▶ Clustering is a suitable alternative for establishing synsets
  ★ What about networks not extracted from dictionaries?
▶ Rules can be defined to map terms in triples to synsets
  ★ Though some triples remain unmapped...
Future:
▶ Evaluate the alternative mapping method
▶ Exploit other resources: e.g. Wiktionary and Wikipedia
The end
References
Christiane Fellbaum, editor (1998).
WordNet: An Electronic Lexical Database (Language, Speech, and Communication).
The MIT Press.
Graeme Hirst (2004).
Ontology and the lexicon.
In Steffen Staab and Rudi Studer, editors, Handbook on Ontologies, International
Handbooks on Information Systems, pages 209–230. Springer.
S. M. van Dongen (2000).
Graph Clustering by Flow Simulation.
Ph.D. thesis, University of Utrecht.
David Gfeller, Jean-Cédric Chappelier and Paulo De Los Rios (2005).
Synonym Dictionary Improvement through Markov Clustering and Clustering Stability.
In Proc. of International Symposium on Applied Stochastic Models and Data Analysis
(ASMDA), pages 106–113.
The end
Thank you!