CLARIN-PL
Language Technology for Polish in Practice
Lexical semantics tools
Maciej Piasecki, Paweł Kędzia
Wrocław University of Science and Technology
G4.19 Research Group
[email protected]
Samsung R&D Institute Invit. Lecture 2017-01-17

Tools
§ WoSeDon (Kędzia et al., 2015) (Piasecki et al., 2016)
  § weakly supervised Word Sense Disambiguation
  § based on plWordNet
§ Sense statistics, integrated with Inforex
§ Recognition of semantic relations
  § Between Named Entities
  § Inside Noun Phrases (rule-based)
§ Recognition of semantic relations between text fragments
  § seven types of relations
  § manually annotated data set
  § A prototype and ongoing work
§ Keywords generation: extractive and descriptive (plWordNet, SUMO, domain ontologies)

plWordNet: Example
[Figure: plWordNet example]

Weakly supervised WSD
§ PageRank (Static)
  $$P^{(\mathrm{new})} = c \cdot M \cdot P^{(\mathrm{old})} + (1 - c) \cdot v$$
§ G: graph with N nodes $n_1, \dots, n_N$
§ $d_i$: the degree of the node $n_i$
§ M: matrix of dimension N × N:
  $$M_{ji} = \begin{cases} \dfrac{1}{d_i} & \text{if an edge exists between nodes } n_i \text{ and } n_j \\ 0 & \text{otherwise} \end{cases}$$
§ c: damping factor, $c \in [0, 1]$
§ v: vector of dimension N × 1:
  $$v_i = \frac{1}{N}$$

Static PageRank in WSD: Example
[Figure: fragment of the plWordNet graph for the sentence "Mam zamek w kurtce i garniturze." ("I have a zip in my jacket and suit."). Nodes shown: zamek-1 (castle), zamek-2 (lock), zamek-6 (zip), budowla obronna (defensive structure), rezydencja (residence), baszta (tower), strażnica (watchtower), brama (gate), furtka (wicket gate), drzwi (door), zamknięcie (closure), zatrzask (latch), zapięcie (fastening), kurtka (jacket), garnitur (suit), mieć (to have), posiadać (to possess). Every node carries the same value, 1/300.]
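The static PageRank update above can be sketched in a few lines of plain Python. This is a minimal illustration, not the WoSeDon implementation: the edge list is a hypothetical toy graph built from words in the example (it is not taken from plWordNet), and the damping factor c = 0.85 and the fixed iteration count are assumptions, since the slides only state that c lies in [0, 1].

```python
def static_pagerank(edges, c=0.85, iterations=50):
    """Iterate P_new = c * M * P_old + (1 - c) * v with uniform v_i = 1/N.

    `c` and `iterations` are assumed values, not taken from the slides.
    """
    neighbours = {}
    for a, b in edges:  # edges are treated as undirected
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    v = {node: 1.0 / n for node in neighbours}
    p = dict(v)  # start from the uniform vector
    for _ in range(iterations):
        # (M * P)_j sums P_i / d_i over all neighbours i of node j
        p = {
            j: c * sum(p[i] / len(neighbours[i]) for i in neighbours[j])
            + (1 - c) * v[j]
            for j in neighbours
        }
    return p


# Hypothetical toy edges for illustration only, not real plWordNet relations.
TOY_EDGES = [
    ("zamek-1", "budowla obronna"), ("zamek-1", "baszta"),
    ("zamek-1", "rezydencja"),
    ("zamek-2", "drzwi"), ("zamek-2", "zatrzask"), ("zamek-2", "zamknięcie"),
    ("zamek-6", "kurtka"), ("zamek-6", "garnitur"), ("zamek-6", "zapięcie"),
    ("zapięcie", "zamknięcie"),
    ("mieć", "posiadać"),
]

ranks = static_pagerank(TOY_EDGES)
for sense in ("zamek-1", "zamek-2", "zamek-6"):
    print(sense, round(ranks[sense], 4))
```

Because v is uniform, the resulting scores depend only on the graph structure and not on the sentence being disambiguated.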

Personalized PageRank in WSD
§ PageRank (Personalized)
  $$P^{(\mathrm{new})} = c \cdot M \cdot P^{(\mathrm{old})} + (1 - c) \cdot v$$
§ G: graph with N nodes $n_1, \dots, n_N$
§ $d_i$: the degree of the node $n_i$
§ M: matrix of dimension N × N:
  $$M_{ji} = \begin{cases} \dfrac{1}{d_i} & \text{if an edge exists between nodes } n_i \text{ and } n_j \\ 0 & \text{otherwise} \end{cases}$$
§ c: damping factor, $c \in [0, 1]$
§ v: vector of dimension N × 1:
  $$v_i = \begin{cases} \dfrac{1}{S} & \text{if } n_i \text{ is a sense of a word from the context} \\ 0 & \text{otherwise} \end{cases}$$
§ S: the number of all the words' meanings from the context, excluding the meanings of disambiguated words

Personalized PageRank in WSD: Example
[Figure: the same graph fragment for the sentence "Mam zamek w kurtce i garniturze." ("I have a zip in my jacket and suit."), with nodes zamek-1 (castle), zamek-2 (lock), zamek-6 (zip), budowla obronna (defensive structure), rezydencja (residence), baszta (tower), strażnica (watchtower), brama (gate), furtka (wicket gate), drzwi (door), zamknięcie (closure), zatrzask (latch), zapięcie (fastening), kurtka (jacket), garnitur (suit), mieć (to have), posiadać (to possess). Six nodes carry the value 1/6; all other nodes carry 0.]

Personalized PageRank Word-to-Word in WSD
§ PageRank (Personalized Word-to-Word)
  $$P^{(\mathrm{new})} = c \cdot M \cdot P^{(\mathrm{old})} + (1 - c) \cdot v$$
§ G: graph with N nodes $n_1, \dots, n_N$
§ $d_i$: the degree of the node $n_i$
§ M: matrix of dimension N × N:
  $$M_{ji} = \begin{cases} \dfrac{1}{d_i} & \text{if an edge exists between nodes } n_i \text{ and } n_j \\ 0 & \text{otherwise} \end{cases}$$
§ c: damping factor, $c \in [0, 1]$
§ v: vector of dimension N × 1:
  $$v_i = \begin{cases} \dfrac{1}{S} & \text{if } n_i \text{ is a sense of a word from the context} \\ 0 & \text{otherwise} \end{cases}$$
§ S: the number of all the words' meanings from the context, excluding the meanings of disambiguated words

Personalized PageRank Word-to-Word in WSD: Example
[Figure: the same graph fragment and sentence. Three nodes carry the value 1/3; all other nodes, including the senses of "zamek", carry 0.]
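The two personalized variants differ from static PageRank only in the vector v, which can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the toy edges are hypothetical (not real plWordNet relations), each context word is simplified to a single sense node, and c = 0.85 with 50 iterations is an assumed setting. The 1/6 and 1/3 values in the two example figures suggest that the plain variant also seeds the senses of the target word while the word-to-word variant leaves them out; that reading is an assumption here.

```python
def personalized_pagerank(edges, seeds, c=0.85, iterations=50):
    """Iterate P_new = c * M * P_old + (1 - c) * v, where v_i = 1/S on seeds.

    `seeds` are the sense nodes of the context words; S = len(seeds).
    `c` and `iterations` are assumed values, not taken from the slides.
    """
    neighbours = {}
    for a, b in edges:  # edges are treated as undirected
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    v = {node: (1.0 / len(seeds) if node in seeds else 0.0)
         for node in neighbours}
    p = dict(v)
    for _ in range(iterations):
        p = {
            j: c * sum(p[i] / len(neighbours[i]) for i in neighbours[j])
            + (1 - c) * v[j]
            for j in neighbours
        }
    return p


# Hypothetical toy edges for illustration only, not real plWordNet relations.
TOY_EDGES = [
    ("zamek-1", "budowla obronna"), ("zamek-1", "baszta"),
    ("zamek-1", "rezydencja"),
    ("zamek-2", "drzwi"), ("zamek-2", "zatrzask"), ("zamek-2", "zamknięcie"),
    ("zamek-6", "kurtka"), ("zamek-6", "garnitur"), ("zamek-6", "zapięcie"),
    ("zapięcie", "zamknięcie"),
    ("mieć", "posiadać"),
]
TARGET_SENSES = {"zamek-1", "zamek-2", "zamek-6"}  # senses of "zamek"
CONTEXT_SENSES = {"kurtka", "garnitur", "mieć"}    # one node per context word

# Assumed reading: plain PPR seeds all six nodes, word-to-word only the context.
ppr = personalized_pagerank(TOY_EDGES, TARGET_SENSES | CONTEXT_SENSES)  # S = 6
w2w = personalized_pagerank(TOY_EDGES, CONTEXT_SENSES)                  # S = 3

best_ppr = max(TARGET_SENSES, key=ppr.get)
best_w2w = max(TARGET_SENSES, key=w2w.get)
print(best_ppr, best_w2w)
```

On this toy graph both variants rank the zip sense, zamek-6, highest, because the probability mass injected at kurtka and garnitur flows to their neighbour first.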

Expanded WordNet
§ Best results achieved for WordNet expanded with additional resources
  § Glosses
  § Links derived from the SemCor corpus with disambiguated word senses
  § SUMO
  § Wikipedia
  § VerbNet

Exploring plWordNet
§ PageRank-based WSD depends on links between word senses
§ Limited number of glosses and usage examples, and these are not disambiguated
§ Relations
  § synset relations (> 20 types)
    § different meanings for sense description, different weights
  § relations between lexical units (> 20 types)
    § including cross-categorial, e.g. adjective lexical units associated with noun lexical units
    § mapped on the synset level
§ Mapping onto SUMO
§ Sense order: numbers for lexical units

Exploring plWordNet: glosses
§ Lesk's algorithm:
  § Sense represented by a bag of words that is compared with a bag of words created from a context of use
§ Description set for a sense
  § all lemmas from a synset, glosses and usage examples
  § description sets from synsets connected to the given one by selected relations
  § weights depending on the path distance
§ Disambiguation context
  § A document
  § Larger text window

Exploring plWordNet: sense order
§ Partial disambiguation
  § k=30% of the top scored synsets left non-disambiguated
  § they are very often closely related
§ No larger corpus of Polish with disambiguated word senses
§ Lexical unit (LU) numbers in plWordNet
  § No guideline for the order
  § Numbers assigned according to the order of creation
  § Null hypothesis: the order is random
  § Our assumption: more frequent or prominent senses were added first
§ Application
  § The use of LU numbers for WSD post-processing
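One possible reading of this post-processing step can be sketched as below. The slides do not spell out the exact rule, so everything here is an assumption for illustration: senses whose score lies within 30% of the best score are treated as undecided, and the tie is broken in favour of the lowest LU number. The function name, the `margin` parameter and the scores are hypothetical.

```python
def rerank_by_lu_number(scores, lu_number, margin=0.30):
    """Pick a sense using LU numbers among the near-top candidates.

    Assumed rule (not confirmed by the slides): every sense scoring at
    least (1 - margin) * best_score stays a candidate, and the candidate
    with the lowest LU number wins; higher score breaks remaining ties.
    """
    best_score = max(scores.values())
    threshold = (1.0 - margin) * best_score
    candidates = [s for s, score in scores.items() if score >= threshold]
    return min(candidates, key=lambda s: (lu_number[s], -scores[s]))


# Hypothetical scores for the three senses of "zamek" used in the examples.
LU_NUMBER = {"zamek-1": 1, "zamek-2": 2, "zamek-6": 6}

clear_case = {"zamek-1": 0.09, "zamek-2": 0.10, "zamek-6": 0.20}
close_case = {"zamek-1": 0.30, "zamek-2": 0.33, "zamek-6": 0.10}

print(rerank_by_lu_number(clear_case, LU_NUMBER))  # only zamek-6 is near the top
print(rerank_by_lu_number(close_case, LU_NUMBER))  # zamek-1 and zamek-2 compete
```

In the first case only one sense passes the threshold, so the graph score decides; in the second case two senses are close and the lower LU number is preferred, which matches the stated assumption that more frequent or prominent senses were added first.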

plWordNet merged with SUMO
[Figure: plWordNet synsets linked to SUMO concepts. plWordNet nodes: baszta (tower), strażnica (watchtower), brama (gate), furtka (wicket gate), drzwi (door), zamek-1 (budowla, "building"), zamek-2 (w drzwiach, "in a door"). SUMO concepts: Entity, Physical, Object, Collection, Group, GroupOfPeople, FamilyGroup, Artifact, StationaryArtifact, Building, Device, SecurityDevice, Lock, EngineeringComponent.]

SUMO Expansion: Generalised description
[Figure: the same plWordNet and SUMO graph for the sentence "Mam zamek w kurtce i garniturze." ("I have a zip in my jacket and suit."), with the additional plWordNet nodes garnitur (suit), kurtka (jacket), zamek-6 (suwak, "zip") and zapięcie (fastening).]

WSD experimental setting: KPWr corpus
§ Learning corpus: KPWr, Korpus Politechniki Wrocławskiej (Corpus of the Wrocław University of Technology), a corpus of Polish
  § 83 different lemmas annotated with senses
  § representing different polysemy types
  § 188 different word senses
  § 7 979 word sense annotations in total

                    Total   Nouns   Verbs
Tagged words           88      58      30
Tagged instances     6048    3846    2202

WSD experimental setting: Składnica corpus
§ Testing corpus: Składnica (a treebank of Polish)
  § 20000 annotated sentences
  § 8200 lemmas with manually assigned senses
  § nouns, verbs and adjectives

                    Total     MN     PN     MV     PV
Tagged words         6309   1717   2424    684   1484
Tagged instances    15342   3560   6610   1307   3865

§ MN: Monosemous nouns
§ PN: Polysemous nouns
§ MV: Monosemous verbs
§ PV: Polysemous verbs

Best results (1/2)

                 KPWr                    Składnica
           V       N      All        V       N      All
Lesk     16.80   18.80   18.12     39.34   38.56   38.87
C8       32.61   52.22   45.52     49.02   64.02   58.48
C9       42.66   47.91   46.12     47.51   61.67   56.16

§ Row C8: PPR algorithm, plWordNet 2.3 merged with SUMO (only nodes from plWordNet are initialised)
§ Row C9: Static PageRank, plWordNet with weighted relations: synset relations have weight 0.7, lexical unit relations have weight 0.3

Best results (2/2)

                 KPWr                    Składnica
           V       N      All        V       N      All
C10      38.57   43.20   41.62     48.77   61.74   56.69
C11      39.76   39.30   39.46     49.26   61.12   56.51

§ Row C10: Static PageRank, plWordNet with reranking 30% of the maximal score
§ Row C11: Static PageRank, plWordNet with reranking 40% of the maximal score

Reranking on KPWr
[Figure: reranking results on KPWr]

Reranking on Składnica
[Figure: reranking results on Składnica]

Bibliography
§ Kędzia, P.; Piasecki, M. & Orlińska, M. (2015). Word Sense Disambiguation based on Large Scale Polish CLARIN Heterogeneous Lexical Resources. Cognitive Studies / Études cognitives, pp. 269-292. https://ispan.waw.pl/journals/index.php/cs-ec/article/view/cs.2015.019/1765
§ Piasecki, M.; Kędzia, P. & Orlińska, M. (2016). plWordNet in Word Sense Disambiguation task. In Mititelu, V. B.; Forăscu, C.; Fellbaum, C. & Vossen, P. (Eds.), Proceedings of the 8th Global Wordnet Conference, Bucharest, 27-30 January 2016, Global Wordnet Association, pp. 280-289. http://gwc2016.racai.ro/procedings.pdf

Thank you very much for your attention!
www.clarin-pl.eu
Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]