WO RK IN P Multilingual Wikipedia: Editors of Primary Language Contribute to More Complex Articles RO GR ES S Sungjoon Park, Suin Kim*, Scott A. Hale, Sooyoung Kim, Jeongmin Byun, Alice Oh {sungjoon.park, suin.kim, sooyoungkim, jmbyun}@kaist.ac.kr, [email protected], [email protected] Primary/Non-Primary Users Research Questions • Data Primary Languages of Users Editing this Language Edition Language Edition EN DE 3,832 EN FR 1,350 ES 889 RU IT 801 594 472 DE 4,150 EN ES 787 147 99 91 70 EN FR 996 ES #Editors FR NLRUIT OTHERS 1,631 DE English OTHERS DE PT CA #Article Edit Sessions #Edits 446 OTHERS 675 144 139 92 86 Quantifying Language Complexity History Seen as "Cela Sculptoris" in the lower right of this 1825 star chart from Urania's Mirror Caelum was first introduced in the == History == [[File:Sidney Hall - Urania's Mirror - Canis Major, Lepus, Columba Noachi & Cela Sculptoris.jpg|thumb|left|Seen as "Cela Sculptoris" in the Wiki Markup HTML max(f (di )) Aggregate per User 11,616 3,271 2,506 237,849 120,123 69,557 350,541 160,126 112,099 Lexical Diversity • • Number of characters Number of words Number of unique words Number of sentences Average word length in characters Average sentence length in words • • • history seen as " cela sculptoris " in the lower right of this 1825 star chart from urania ' s mirror caelum was first introduced in the eighteenth century by nicolas louis de lacaille , a french Measure Spanish Basic Features • Plain Text f (di ) di German 374 • • • • Entropy of word frequency Average word rank Average word occurrence frequency Error rate Syntactic Structure • • Tokenized + Lowercase • Entropy of POS frequency Mean phrase length (NP, VP) Mean parse tree depth Topic Clusters Length of Noun Phrases 22 30000 19 22500 LDA+DBSCAN Algorithm • Number of Articles • • 16 15000 13 7500 10 Basic Features Basic Lexical Features Diversity 0 Primary Number of Characters Number of Characters Entropy of Part-of-Speech Unigrams Number of Words Results (per Paragraph)Basic Features Basic Features Basic Features 174.2 232.9 219.5 Number of Words Sentences 42.1 Articles in history cluster are more complex Articles in football cluster are less complex The complexity measures used are consistent Different topics show different complexity Non-Primary 4.99 4.53 of Part-of-Speech Unigrams English German 4.43 Spanish 3.97 3.95Entropy 3.75 49.1 32.3 Average Word Length English German Spanish Lexical Diversity Structure 2.99 Average Occurrence of Words in Reference 2.43 2.23 1.90 1.84Word Length Average 1.49 • Entropy of Part-of-Speech Unigrams English 2.37 Number of Words Number of Sentences 51.2 Average Occurrence of Words in Reference Primary Non-Primary Entropy of Part-of-Speech Unigrams Average Word Length • • • Number of Characters Average WordPrimary Length 139.5 60.3 Entropy of Part-of-Speech Unigrams Lexical Diversity Primary Non-Primary Syntactic Number of Words 56.4 • Non-Primary Number of Characters Lexical Diversity Number of Characters Average Word Length 234.6 Primary Number of Words Average Occurrence of Words in Reference 296.8 Lexical Diversity Non-Primary Fit LDA over entire data with 100 topics • Each document is reduced to 100 dimensions Cluster documents into 20 clusters with DBSCAN Cluster-level rank of language complexity is consistent over measures • Arabic Football(Soccer) Olympics City Animals/Plants Transportation Sports:Teams Music:Song, album Names of Places Entertainment Sports:American Politics TV Music:musician Science Computer Film Military Education History • Is it possible to quantify the complexity of language in Wikipedia articles? Do primary users tend to edit parts of the Wikipedia • articles with higher language complexity than nonprimary users? Contributions Do we observe more natural language in the articles • We tested 20 measures for language complexity and showed they are highly consistent. after primary users’ edits compared to the articles • Multilinguals in Wikipedia show relatively high levels of proficiency in after non-primary users’ edits? their primary languages. Arabic Football(Soccer) Olympics City Animals/Plants Transportation Sports:Teams Music:Song, album Names of Places Entertainment Sports:American Politics TV Music:musician Science Computer Film Military Education History • Multilingual editors have edited different language editions We define a user’s primary language as the language that the user edited the most (not necessarily user’s native language) For each language X edition, we define primary users as those whose primary language is X • • 2.00 2.82German 2.47 2.55 • Spanish 1.98 • Syntactic Structure Average Occurrence of Words in Reference Average Occurrence of Words in Reference 1.63E+07 1.55E+07 6.14E+06 Primary editors edit articles with higher lexical diversity. 5.41E+06 English 5.86E+06 German 4.96E+06 Spanish English German Spanish • Primary editors edit articles with more complex syntactic structure. • English NP length German VP length Spanish Parse Tree Depth NP length VP length Parse Tree Depth Syntactic Structure English German Spanish Syntactic Number of Sentences 22.92 Structure Primary editors edit longer articles. Number ofaverage Sentences word Number of characters, words, Syntactic Structure length, sentences are always higher for the Number of Sentences articles edited by primary users • 18.19 13.86 17.50 • 13.28 11.53 English German Spanish NP length VP length Parse Tree Depth English German Spanish NP length VP length Parse Tree Depth Entropy of n-gram is always higher in primary edits Primary users edit articles of frequent yet diverse set of words Parse-tree based measures • Length of longest noun/verb phrase • Depth of parse tree Articles edited by primary users have complex syntactic structure
© Copyright 2024 Paperzz