CLARIN-PL
Language Technology for Polish in Practice
Lexical semantics tools
Maciej Piasecki, Paweł Kędzia
Wrocław University of Science and Technology
G4.19 Research Group
[email protected]
2017-01-17
Tools
Samsung R&D
Institute
Invit. Lecture
2017-01-17
CLARIN-PL
§  WoSeDon (Kędzia et al., 2015) (Piasecki et al., 2016)
   §  weakly supervised Word Sense Disambiguation
   §  based on plWordNet
   §  Sense statistics, integrated with Inforex
§  Recognition of semantic relations
   §  Between Named Entities
   §  Inside Noun Phrases (rule-based)
§  Recognition of semantic relations between text fragments
   §  seven types of relations
   §  manually annotated data set
   §  A prototype and ongoing work
§  Keyword generation: extractive and descriptive (plWordNet, SUMO, domain ontologies)
plWordNet – Example
Weakly supervised WSD
§  PageRank (Static)
   \[ P^{(new)} = c \cdot M \cdot P^{(old)} + (1 - c) \cdot v \]
§  G: graph with N nodes n_1, …, n_N
§  d_i: the degree of the node n_i
§  M: N × N matrix:
   \[ M_{ji} = \begin{cases} \dfrac{1}{d_i} & \text{if there is an edge between nodes } n_i \text{ and } n_j \\ 0 & \text{otherwise} \end{cases} \]
§  c: damping factor, c \in [0, 1]
§  v: N × 1 vector:
   \[ v_i = \frac{1}{N} \]
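The update rule above can be sketched as a plain power iteration. This is a minimal illustration, not the WoSeDon implementation: the toy graph, the damping value c = 0.85 and the fixed iteration count are assumptions chosen for the example.

```python
def transition_matrix(n, edges):
    """Build M with M[j][i] = 1/d_i if nodes i and j share an edge, else 0."""
    neighbours = [set() for _ in range(n)]
    for a, b in edges:                      # edges are treated as undirected
        neighbours[a].add(b)
        neighbours[b].add(a)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbours[i]:
            m[j][i] = 1.0 / len(neighbours[i])
    return m


def pagerank(m, v, c=0.85, iterations=50):
    """Iterate P(new) = c * M * P(old) + (1 - c) * v, starting from P = v."""
    n = len(v)
    p = list(v)
    for _ in range(iterations):
        p = [c * sum(m[j][i] * p[i] for i in range(n)) + (1 - c) * v[j]
             for j in range(n)]
    return p


# Hypothetical toy graph: node 0 is a hub linked to nodes 1, 2 and 3.
n = 4
edges = [(0, 1), (0, 2), (0, 3)]
uniform_v = [1.0 / n] * n                   # static variant: v_i = 1/N
scores = pagerank(transition_matrix(n, edges), uniform_v)
```

In the static variant the vector v carries no information about the sentence, so the ranking depends only on the graph structure; here the hub node ends up with the highest score.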
Samsung R&D
Institute
Invit. Lecture
2017-01-17
Static PageRank in WSD: Example
[Figure: fragment of the plWordNet graph for the sentence "Mam zamek w kurtce i garniturze." ("I have a zip in my jacket and my suit."). Every node is initialised with the same value, 1/300. Candidate senses of "zamek": zamek-1 (castle), zamek-2 (lock), zamek-6 (zip). Other nodes shown: mieć (to have), posiadać (to possess), kurtka (jacket), garnitur (suit), zapięcie (fastening), zamknięcie (closure), zatrzask (latch), drzwi (door), furtka (wicket gate), brama (gate), strażnica (watchtower), baszta (tower), budowla obronna (defensive structure), rezydencja (residence).]
Personalized PageRank in WSD
§  PageRank (Personalized)
   \[ P^{(new)} = c \cdot M \cdot P^{(old)} + (1 - c) \cdot v \]
§  G: graph with N nodes n_1, …, n_N
§  d_i: the degree of the node n_i
§  M: N × N matrix:
   \[ M_{ji} = \begin{cases} \dfrac{1}{d_i} & \text{if there is an edge between nodes } n_i \text{ and } n_j \\ 0 & \text{otherwise} \end{cases} \]
§  c: damping factor, c \in [0, 1]
§  v: N × 1 vector:
   \[ v_i = \begin{cases} \dfrac{1}{S} & \text{if } n_i \text{ is a sense of a word from the context} \\ 0 & \text{otherwise} \end{cases} \]
   S: the number of all senses of the words from the context, excluding the senses of disambiguated words
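A minimal sketch of the personalized variant, assuming a small hand-made graph: only the vector v changes with respect to static PageRank. The edges below are invented for illustration and are not taken from plWordNet; c = 0.85 and the iteration count are likewise assumptions.

```python
def build_graph(edges):
    """Undirected graph as a dict: node -> set of neighbouring nodes."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def personalized_pagerank(graph, seeds, c=0.85, iterations=100):
    """Iterate P = c*M*P + (1-c)*v with v_i = 1/S on seed nodes, 0 elsewhere."""
    v = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    p = dict(v)
    for _ in range(iterations):
        # Node j receives the share p[i] / d_i from each neighbour i.
        p = {j: c * sum(p[i] / len(graph[i]) for i in graph[j]) + (1 - c) * v[j]
             for j in graph}
    return p


# Hypothetical edges, invented for this example.
edges = [
    ("zamek-1", "budowla obronna"), ("budowla obronna", "baszta"),
    ("zamek-2", "drzwi"), ("drzwi", "brama"),
    ("zamek-6", "zapięcie"), ("kurtka", "zapięcie"), ("garnitur", "zapięcie"),
    ("mieć", "posiadać"),
]
graph = build_graph(edges)

# Senses of the context words in "Mam zamek w kurtce i garniturze." (S = 6).
seeds = {"zamek-1", "zamek-2", "zamek-6", "kurtka", "garnitur", "mieć"}
scores = personalized_pagerank(graph, seeds)
best = max(["zamek-1", "zamek-2", "zamek-6"], key=scores.get)
```

In this toy graph the zip sense zamek-6 collects the most score, because it shares a neighbour with the seed nodes kurtka and garnitur.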
Personalized PageRank in WSD: Example
[Figure: the plWordNet graph fragment for the sentence "Mam zamek w kurtce i garniturze." ("I have a zip in my jacket and my suit."), with candidate senses zamek-1 (castle), zamek-2 (lock) and zamek-6 (zip). Six nodes are initialised with 1/6; all other nodes are initialised with 0.]
Personalized PageRank Word-to-Word in WSD
§  PageRank (Personalized Word-to-Word)
   \[ P^{(new)} = c \cdot M \cdot P^{(old)} + (1 - c) \cdot v \]
§  G: graph with N nodes n_1, …, n_N
§  d_i: the degree of the node n_i
§  M: N × N matrix:
   \[ M_{ji} = \begin{cases} \dfrac{1}{d_i} & \text{if there is an edge between nodes } n_i \text{ and } n_j \\ 0 & \text{otherwise} \end{cases} \]
§  c: damping factor, c \in [0, 1]
§  v: N × 1 vector:
   \[ v_i = \begin{cases} \dfrac{1}{S} & \text{if } n_i \text{ is a sense of a word from the context} \\ 0 & \text{otherwise} \end{cases} \]
   S: the number of all senses of the words from the context, excluding the senses of disambiguated words
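The word-to-word variant can be sketched as one PageRank run per target word, where the senses of the target itself receive no initial mass. This is a sketch under assumptions: the graph edges, the helper name `disambiguate_w2w` and c = 0.85 are hypothetical and serve only to illustrate the exclusion described in the definition of S.

```python
def build_graph(edges):
    """Undirected graph as a dict: node -> set of neighbouring nodes."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def disambiguate_w2w(graph, context, c=0.85, iterations=100):
    """context: word -> list of its senses. Returns word -> chosen sense."""
    chosen = {}
    for target, senses in context.items():
        # Seeds are the senses of all context words except the target word.
        seeds = {s for word, ss in context.items() if word != target for s in ss}
        v = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
        p = dict(v)
        for _ in range(iterations):
            p = {j: c * sum(p[i] / len(graph[i]) for i in graph[j]) + (1 - c) * v[j]
                 for j in graph}
        chosen[target] = max(senses, key=lambda s: p[s])
    return chosen


# Hypothetical edges, invented for this example.
graph = build_graph([
    ("zamek-1", "budowla obronna"), ("budowla obronna", "baszta"),
    ("zamek-2", "drzwi"), ("drzwi", "brama"),
    ("zamek-6", "zapięcie"), ("kurtka", "zapięcie"), ("garnitur", "zapięcie"),
    ("mieć", "posiadać"),
])
context = {
    "zamek": ["zamek-1", "zamek-2", "zamek-6"],
    "kurtka": ["kurtka"],
    "garnitur": ["garnitur"],
    "mieć": ["mieć"],
}
result = disambiguate_w2w(graph, context)
```

For the target "zamek" the seed set has S = 3 nodes (kurtka, garnitur, mieć), so a sense of "zamek" can only gain score through its links to the other context words.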
Personalized PageRank Word-to-Word in WSD: Example
[Figure: the plWordNet graph fragment for the sentence "Mam zamek w kurtce i garniturze." ("I have a zip in my jacket and my suit."), with candidate senses zamek-1 (castle), zamek-2 (lock) and zamek-6 (zip). Three nodes are initialised with 1/3; all other nodes are initialised with 0.]
Expanded WordNet
§  Best results achieved for WordNet expanded with additional resources
   §  Glosses
   §  Links derived from the SemCor corpus with disambiguated word senses
   §  SUMO
   §  Wikipedia
   §  VerbNet
Exploring plWordNet
§  PageRank-based WSD depends on links between word senses
§  Limited number of glosses and usage examples, and these are not disambiguated
§  Relations
   §  synset relations (> 20 types)
      §  different meanings for sense description, different weights
   §  relations between lexical units (> 20 types)
      §  including cross-categorial ones, e.g. adjective lexical units associated with noun lexical units
      §  mapped onto the synset level
§  Mapping onto SUMO
§  Sense order – numbers for lexical units
Exploring plWordNet: glosses
§  Lesk’s algorithm:
   §  A sense is represented by a bag of words that is compared with a bag of words created from a context of use
§  Description set for a sense
   §  all lemmas from a synset, its glosses and usage examples
   §  description sets from synsets connected to the given one by selected relations
   §  weights depending on the path distance
§  Disambiguation context
   §  A document
   §  A larger text window
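The bag-of-words comparison can be sketched as a weighted overlap score. The description sets and the weights below (1.0 for lemmas of the sense's own synset, 0.5 for lemmas reached through one relation) are hypothetical values chosen for illustration; the slides do not state the actual weighting scheme.

```python
def lesk(context_lemmas, descriptions):
    """Return the sense whose weighted description set best overlaps the context.

    descriptions: sense -> {lemma: weight}, where the weight is assumed to
    shrink with the path distance from the sense's own synset.
    """
    context = set(context_lemmas)

    def score(sense):
        return sum(w for lemma, w in descriptions[sense].items() if lemma in context)

    return max(descriptions, key=score)


# Hypothetical description sets for three senses of "zamek".
descriptions = {
    "zamek-1": {"zamek": 1.0, "budowla": 0.5, "baszta": 0.5},
    "zamek-2": {"zamek": 1.0, "drzwi": 0.5, "klucz": 0.5},
    "zamek-6": {"zamek": 1.0, "suwak": 1.0, "zapięcie": 0.5, "kurtka": 0.5},
}

# Lemmas of "Mam zamek w kurtce i garniturze."
context = ["mieć", "zamek", "w", "kurtka", "i", "garnitur"]
best = lesk(context, descriptions)
```

Here zamek-6 wins because "kurtka" occurs both in the context and in its description set, while the other two senses match only on the lemma "zamek".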
Exploring plWordNet: sense order
§  Partial disambiguation
   §  k=30% of the top scored synsets left non-disambiguated
   §  they are very often closely related
§  No larger corpus of Polish with disambiguated word senses
§  Lexical unit numbers in plWordNet
   §  No guideline for the order
   §  Numbers assigned according to the order of creation
   §  Null hypothesis: the order is random
   §  Our assumption: more frequent or prominent senses were added first
§  Application
   §  The use of LU numbers for WSD post-processing
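One possible reading of this post-processing step is sketched below. It assumes that senses scoring within k of the top score are treated as a near-tie and that the tie is resolved by the lowest lexical unit number; both the interpretation of the threshold and the function name `rerank_by_sense_order` are assumptions, not the documented WoSeDon procedure.

```python
def rerank_by_sense_order(scores, k=0.30):
    """scores: {lexical unit number: WSD score} for one word occurrence."""
    top = max(scores.values())
    # Assumed reading: senses scoring at least (1 - k) * top form a near-tie.
    near_tie = [lu for lu, s in scores.items() if s >= (1.0 - k) * top]
    # Prefer the sense that was added to the wordnet first (lowest LU number).
    return min(near_tie)


# Hypothetical scores for three senses of one word.
scores = {1: 0.050, 2: 0.058, 6: 0.030}
plain_choice = max(scores, key=scores.get)        # sense 2, the top score
reranked_choice = rerank_by_sense_order(scores)   # sense 1, the lowest LU number in the near-tie
```

With k = 0 the near-tie contains only the top-scored sense, so the reranking leaves the original decision unchanged.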
plWordNet merged with SUMO
[Figure: plWordNet nodes linked to the SUMO hierarchy. SUMO concepts shown: Entity, Physical, Object, Collection, Group, GroupOfPeople, FamilyGroup, Artifact, StationaryArtifact, Building, Device, SecurityDevice, Lock, EngineeringComponent. plWordNet nodes shown: zamek-1 (building), zamek-2 (in a door), baszta, strażnica, brama, furtka, drzwi.]
SUMO Expansion:
Generalised description
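[Figure: the sentence "Mam zamek w kurtce i garniturze." ("I have a zip in my jacket and my suit.") over the plWordNet graph merged with SUMO. SUMO concepts shown: Entity, Physical, Object, Collection, Group, GroupOfPeople, FamilyGroup, Artifact, StationaryArtifact, Building, Device, SecurityDevice, Lock, EngineeringComponent. plWordNet nodes shown: zamek-1 (building), zamek-2 (in a door), zamek-6 (zip), garnitur, kurtka, zapięcie, baszta, strażnica, brama, furtka, drzwi.]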
WSD experimental setting:
KPWr corpora
§  Learning corpus:
   Korpus Politechniki Wrocławskiej (KPWr, Corpus of the Wrocław University of Technology), a corpus of Polish
   §  83 different lemmas annotated with senses
      §  representing different polysemy types
   §  188 different word senses
   §  7 979 word sense annotations in total

                     Total   Nouns   Verbs
Tagged words            88      58      30
Tagged instances      6048    3846    2202
WSD experimental setting:
Składnica corpora
§  Testing corpus:
   Składnica (a treebank of Polish)
   §  20000 annotated sentences
   §  8200 lemmas with manually assigned senses
   §  nouns, verbs and adjectives

                     Total      MN      PN      MV      PV
Tagged words          6309    1717    2424     684    1484
Tagged instances     15342    3560    6610    1307    3865

§  MN – Monosemous nouns
§  PN – Polysemous nouns
§  MV – Monosemous verbs
§  PV – Polysemous verbs
Best results (1/2)
          KPWr                      Składnica
          V       N       All       V       N       All
Lesk      16.80   18.80   18.12     39.34   38.56   38.87
C8        32.61   52.22   45.52     49.02   64.02   58.48
C9        42.66   47.91   46.12     47.51   61.67   56.16

§  Row C8: PPR algorithm, plWordNet 2.3 merged with SUMO (only nodes from plWordNet are initialised)
§  Row C9: Static PageRank, plWordNet with weighted relations: synset relations weighted 0.7, lexical unit relations weighted 0.3
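The slides do not say how the relation weights enter the matrix M. One plausible sketch, given here purely as an assumption, is to replace the uniform share 1/d_i with a share proportional to the weight of each edge; the relation labels and the example edges are hypothetical.

```python
# Weights from row C9; the labels "synset" and "lexical_unit" are illustrative names.
WEIGHTS = {"synset": 0.7, "lexical_unit": 0.3}


def weighted_transitions(edges):
    """edges: (node_a, node_b, relation_kind) triples, treated as undirected.

    Returns out[i][j], the share of node i's score passed on to node j,
    proportional to the weight of the edge between i and j.
    """
    out = {}
    for a, b, kind in edges:
        w = WEIGHTS[kind]
        out.setdefault(a, {})
        out.setdefault(b, {})
        out[a][b] = out[a].get(b, 0.0) + w
        out[b][a] = out[b].get(a, 0.0) + w
    for targets in out.values():
        total = sum(targets.values())
        for j in targets:
            targets[j] /= total        # normalise so the shares of each node sum to 1
    return out


# Hypothetical edges: one synset relation and one lexical unit relation.
edges = [
    ("zamek-6", "zapięcie", "synset"),
    ("zamek-6", "suwak", "lexical_unit"),
]
transitions = weighted_transitions(edges)
```

In this example zamek-6 passes 0.7 of its score along the synset relation and 0.3 along the lexical unit relation, whereas the unweighted matrix would split it evenly.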
Best results (2/2)
          KPWr                      Składnica
          V       N       All       V       N       All
C10       38.57   43.20   41.62     48.77   61.74   56.69
C11       39.76   39.30   39.46     49.26   61.12   56.51

§  Row C10: Static PageRank, plWordNet with reranking 30% of the maximal score
§  Row C11: Static PageRank, plWordNet with reranking 40% of the maximal score
Reranking on KPWr
Reranking on Składnica
Bibliography
§  Kędzia, P.; Piasecki, M. & Orlińska, M. (2015)
   Word Sense Disambiguation based on Large Scale Polish CLARIN Heterogeneous Lexical Resources
   Cognitive Studies / Études cognitives, pp. 269-292.
   https://ispan.waw.pl/journals/index.php/cs-ec/article/view/cs.2015.019/1765
§  Piasecki, M.; Kędzia, P. & Orlińska, M. (2016)
   plWordNet in Word Sense Disambiguation task
   In: Mititelu, V. B.; Forăscu, C.; Fellbaum, C. & Vossen, P. (Eds.), Proceedings of the 8th Global Wordnet Conference, Bucharest, 27-30 January 2016, Global Wordnet Association, pp. 280-289.
   http://gwc2016.racai.ro/procedings.pdf
Thank you very much for your attention!
www.clarin-pl.eu
Supported by the Polish Ministry of Science and Higher
Education [CLARIN-PL]