Early Adhesion of Structural Inequality in the Formation of Collaborative Knowledge, Wikipedia

Jinhyuk Yun (윤진혁),1, ∗ Sang Hoon Lee (이상훈),2, † and Hawoong Jeong (정하웅)1, 3, 4, ‡
1 Department of Physics, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea
2 School of Physics, Korea Institute for Advanced Study, Seoul 02455, Korea
3 Institute for the BioCentury, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea
4 Asia Pacific Center for Theoretical Physics, Pohang 37673, Korea
(Dated: October 20, 2016)
arXiv:1610.06006v1 [physics.soc-ph] 19 Oct 2016

We perform an in-depth analysis of the inequality in 863 Wikimedia projects. We take the complete editing history of 267 304 095 Wikimedia items until 2016, which not only covers every language edition of Wikipedia, but also embraces the complete versions of Wiktionary, Wikisource, Wikivoyage, etc. Our finding of a common growth pattern, described by the interrelations between four characteristic growth yardsticks, suggests a universal law of communal data formation. In this encyclopedic data set, we observe the interplay between the number of edits and the degree of inequality. In particular, the rapid increase of the Gini coefficient suggests that this entrenched inequality stems from the nature of such open-editing communal data sets, namely the abiogenesis of a supereditors’ cartel. We show that these groups were created at the early stage of these open-editing media and are still alive. Furthermore, our model, which takes both short-term and long-term memories into account, successfully elucidates the underlying mechanism that establishes the oligarchy in Wikipedia. Ultimately, our results forewarn of a rather pessimistic prospect for such communal databases: the inequality will endure for a long time.

I. INTRODUCTION

Throughout untold ages, knowledge was monopolized by privileged classes, and it in turn preserved the status of those classes.
Philosophy and religions were used by sovereigns as convenient instruments to allay public sentiment; thus, the bread-and-circuses policy dominated and public education was absent, which limited public knowledge exclusively to practical matters, e.g., simple labor. Meanwhile, the overlords’ literacy was mainly in Latin. For example, the Bible was allowed to be written only in Latin until the Protestant Reformation. Medieval education was not aimed at the entire public, and thus it consolidated the authority of the clergy [1]. The clergy was in a symbiotic relationship with political sovereigns through the theory of the divine right of kings, which limited the natural rights of the general public. As Francis Bacon said, knowledge was power [2].

The Renaissance, a cultural movement that began in Italy, allowed the revival of intellect. As the appellation suggests, various disciplines that had been suppressed were reborn in this period based on the spirit of humanism; music, literature, art, and science came into bloom. Polymaths, or Renaissance men such as Leonardo da Vinci, who were experts in a number of different subject areas, appeared, and the knowledge they created spread quickly across Europe [3]. However, although a huge wave swept across Europe, its impact was mostly limited to a small fraction of inner circles compared to the entire population, due to socioeconomic inequalities [4].

∗ Present address: Naver Corporation, Seongnam 13561, Korea
† Corresponding author: [email protected]
‡ Corresponding author: [email protected]

The French Revolution ushered in modern society, which introduced public education to develop the potential of the “Estates General” [5]. Almost concurrently, Prussia implemented a modern compulsory education system, which enabled commoners to escape illiteracy [6].
From then on, a vast amount of knowledge has become open to the public, thanks to the technological developments of lower-cost letterpress printing, telegraphic communications, and modern information technology. As a result, the general education level has improved so much that commoners can become acquainted with that knowledge, yet the intelligentsia still remained the creators of knowledge. The emergence of information technology in this century is finally offering environments in which to share up-to-date information generated by commoners; people believe that such a new environment will bring the “democratization” of knowledge [7]. An inconceivable volume of information is produced by everyone every second. Wikipedia, a representative open-editing knowledge market, may be referred to as the department store for commoners who generate information [8]. However, due to the nature of online information, there has been long-lasting doubt about its credibility, i.e., it is sometimes considered unstable, imprecise, and even deceiving. On the other hand, studies have shown that the current accuracy of Wikipedia is remarkable; the accuracy of its contents surpasses that of traditional encyclopedias [9, 10]. At the same time, as yet another twist in the scene, researchers found inequalities in the editing processes, with the contents governed by a cartel of a few supereditors; thus it is still far from the Elysium of communal knowledge that we have dreamed of [11–13].

However, the majority of such studies, including our own, focused on a few language editions, mostly the English edition of Wikipedia, to examine the dynamics and the properties of the communal data set (an editable data set shared in a community to build collective knowledge) [9, 12, 14–16]. Although they successfully warned of potential risks behind the current social structure in the English edition, it is clear that cultural background affects the behaviors of individuals.
For instance, people belonging to different cultural backgrounds tend to use different symbols in their Web usage [17], and the design of Web pages is also affected by their backgrounds [18, 19]. Similarly, the users of Wikipedia must be affected by their social norms or culture. It is also reported that editors in different language editions edit Wikipedia with distinctive patterns [20]. Moreover, the linguistic complexity of the English Wikipedia itself differs from that of the German and Spanish editions [21]. Therefore, the results based on cultural differences seem to deny the generality in establishing such inequality. On the other hand, those studies are based on small samples from data sets of nonidentical sizes, e.g., the number of articles in the English Wikipedia and the numbers of articles in Wikipedias in other languages are of different orders of magnitude; the English one is at least ten times larger than the others. Thus, the result might be caused by their relative sizes, and it is impossible either to accept or to deny the lack of generality. As a result, the discussion is limited to a vague regional generality among English users, and the pan-human-scale behavior remains unanswered.

To investigate the aforementioned generality, we extend our previous analysis from the English Wikipedia [12] to all 863 Wikimedia projects [22], which are composed of various types of communal data sets served by a non-profit organization called the Wikimedia Foundation to encourage worldwide knowledge collecting and sharing. For this purpose, we diagnose the inequality and the supereditors’ cartel in the Wikimedia projects to understand the socio-physiology of open-editing communal data sets. In particular, we take the complete edit history of every Wikimedia project to inspect its growth, mainly focusing on the number of edits, the number of editors, the number of articles, the article size, and the level of inequality.
We demonstrate that there exist typical growth patterns of such open-editing communal data sets that eventually yield drastic inequalities among the editors. In addition, to comprehend the mechanism behind such inequality, we introduce an agent-based model that replicates the interactions between communal data sets and editors. Our model takes into account the competition between the editors’ natural decrement of motivation over time, the editors’ stronger memory of more recent activities, and the psychological attachment to articles. The model indeed reproduces the universal growth patterns, consistent with the real data.

The rest of the paper is organized as follows. In Sec. II, we introduce the Wikimedia project data used in our investigation. In Sec. III, we present the universality of the inequality in the editing processes of communal data sets, which increases with the number of edits and the size. We trace the complete history in chronological order in Sec. IV to present the existence of the supereditors’ cartel at the whole-wiki scale. We present our agent-based model in Sec. V and conclude the paper in Sec. VI.

II. DATA SET

For the analysis, we use the March 2016 dump of the entire Wikimedia projects [23]. This data set not only includes the well-known Wikipedia, but also covers its sibling projects such as Wiktionary, Wikibooks, Wikiquote, Wikisource, Wikinews, Wikiversity, Wikivoyage, etc., in different languages (see Table I for its detailed composition). Basically, each of these open-editing projects has a distinct subject and object. For example, each language edition of Wiktionary aims to describe all words from all of the other languages in the main language of the edition, e.g., the English edition of Wiktionary has descriptions of words of all languages in English.
The difference between the objects may yield a gap between the editing behaviors of editors belonging to each project, caused by the different demographic pools, the accessibility, the degree of interest, etc.

This dump contains the complete copy of Wikimedia articles from the very beginning on January 15, 2001 to March 5, 2016, including the raw text and metadata sources in the Extensible Markup Language (XML) format. In this data set, there are a total of 267 304 095 articles across the entire Wikimedia projects with the complete history of edits. Each article documents either the Wikipedia account identification (ID) or the Internet protocol (IP) address of the editor for each edit, the article size, the timestamp for each edit, etc. A single character takes one byte, except for a few cases such as Korean (two bytes) and Chinese (two or three bytes), so the article size in bytes is a direct measure of article length [24]. Across the data sets, the number of articles ranges from 43 124 816 (Wikimedia Commons: a database of freely usable audiovisual media) to 3 (Wikipedia Login: a database used for administrative purposes), the number of editors ranges from 44 349 908 (English Wikipedia) to 5 (Nostalgia Wikipedia: a read-only copy of the old English Wikipedia), the number of edits ranges from 654 163 757 (English Wikipedia) to 14 (Wikipedia Login), and the total article size ranges from 99 519 138 751 bytes (English Wikipedia) to 1 206 bytes (Wikipedia Login). See Fig. 1 for the distributions of various characteristic measures.

Previous studies, including our own, tend to use a few language editions, commonly restricted to the English edition of Wikipedia, which is the largest [9, 12, 14–16]. In addition, Wikimedia projects other than Wikipedia are not usually under investigation, even though several language editions of Wikipedia were considered [21, 25]. It is true that most other Wikimedia projects are not as large as the English Wikipedia, as shown by the fat-tailed distributions for different Wikimedia projects in Fig. 1. However, properties of everyday phenomena vary with their sizes [26], and thus the characteristics of such communal databases may vary with their size and category. Therefore, to comprehend the omnidirectional properties of communal data sets, the smaller editions should not be neglected merely because they are much smaller than the English edition. Accordingly, we take all 863 editions of the Wikimedia projects for our analysis.

Our main goal is to find the underlying principle in the development of communal data sets. The Wikimedia project, as the representative player among those communal data sets, consists of various types of data sets operated by the Wikimedia Foundation. This massive record of the knowledge market spans 292 different languages and 12 different types of subject (see Table I for details). This variety allows us to explore the innate nature of humankind’s behavior for each written language and purpose of use. We take a single Wikimedia project as one sample of such communal data sets. In order to proceed with the in-depth analysis of the evolution of communal data sets, we stress the fact that most data sets’ ages are around 3.5 × 10^8 seconds (about eleven years; see Fig. 2). Thus, most Wikimedia projects are of similar ages, suggesting that the raw characteristic measures without time-rescaling are appropriate, free from the age effect.

TABLE I. The list of Wikimedia projects. In general, there are different language editions for each project.

Project       Editions   Description/Notes
Wikipedia     292        Encyclopedia articles
Wiktionary    172        Dictionary
Wikibooks     121        Educational textbooks and learning materials
Wikiquote     89         Collection of quotations
Wikisource    65         Library of source documents and translations
Wikinews      33         News source
Wikiversity   16         Educational and research materials and activities
Wikivoyage    17         Travel guide
etc.          58         Deactivated (not editable) ones are included
Total         863

FIG. 1. The rank versus characteristic measures of Wikimedia projects: (a) the number of edits N_e, (b) the number of editors N_p, (c) the number of articles N_a, and (d) the total volume of texts S (in the unit of bytes).

III. UNIVERSALITY AND INEQUALITIES FOR COMMUNAL DATA SETS

In this section, we present the evidence of a universal growth pattern shared by all of the Wikimedia projects, displayed by characteristic measures at the current snapshot of the communal data set, such as the total number of edits N_e, the total number of editors N_p, the total number of articles N_a, and the current size S (in the unit of bytes). Our primary interest is to identify the generality of growth for the communal data set, not for individual articles. Thus, we take the total sum of values over all of the articles in a Wikimedia project, without considering the individual properties of its constituent articles. For example, N_e is the total number of edits for a given edition of a Wikimedia project, i.e., the sum of the edit numbers of the individual articles in that edition.
Our first analysis of the interplay between such measures indicates a regularity regardless of the age, language, and type of the data sets.

FIG. 2. The age distribution of Wikimedia projects (probability density, in units of 10^-8, versus age, in units of 10^8 seconds), where we bin the data in the uniform length of 2 × 10^7 seconds (the resolution of the horizontal axis).

A. Growth scale of communal data sets

We begin our analysis with an inspection of the intercorrelation between N_e, N_a, N_p, and S in the current Wikimedia projects. One may speculate that no general rule exists between the measures because of the excessive heterogeneity of their current status (as shown in Fig. 1) compared to their age distribution (as shown in Fig. 2). As an example of the differences among language editions of Wikipedia, it has been reported that the levels of editors’ language proficiency in the English Wikipedia are qualitatively different from those in the other language editions [21]. Despite such differences, we find common positive correlations among the measures.

First, we observe a clear tendency that the number of editors, the number of articles, and the size of the data set vary gradually as functions of the number of edits. The growth patterns are characterized by a simple sublinear growth of the form y ∼ x^λ, where x is the number of edits and the exponent is λ ≃ 0.70 [as shown in Fig. 3(a)], λ ≃ 0.85 [as shown in Fig. 3(b)], and λ ≃ 0.87 [as shown in Fig. 3(c)], respectively. In other words, the appearance of new editors, the creation of brand new articles, and the increase in the amount of text all slow down as more edits take place; in the sense of editability, larger data sets are less efficient than smaller ones.

To find the reason behind such stagnation in terms of the number of edits, we also track the interrelations between the other measures themselves. The number of editors increases with the number of articles with the exponent λ ≃ 0.78 [as shown in Fig. 3(d)]. Meanwhile, the size of the data set increases roughly linearly with the number of articles and the number of editors, with the exponents λ ≃ 1.02 and λ ≃ 1.06, respectively. In short, the rate of accumulation of text remains almost constant regardless of the number of articles and the number of editors. The result implies that the stagnation is caused by the decelerated appearance of new editors, not by decreased productivity of the existing editors. In addition, in contrast to the interrelations between the measures we report here, the measures are not correlated with the age of the data sets (as shown in Fig. 4), indicating that the raw number of edits, rather than the real time, is a proper measure of time for comparing various data sets. As we observe in Fig. 2, most of the Wikimedia projects are of similar ages, so our analysis implies that the speed of growth per unit time also decreases as the size increases, as we revealed in our previous study on the English Wikipedia [12].

B. Inequality in contributions

The general growth pattern of the characteristic measures N_e, N_a, N_p, and S triggers an interesting proposition: could there also be a universal rule for building up the recently reported structural inequalities [12, 13]? In other words, we wonder if the existence of de facto ownership or monopoly by supereditors [11] is a generic phenomenon for all communal data sets. To examine the validity of the proposition, we employ the Gini coefficient, which is a conventional measure of inequality [28]. In our analysis, the Gini coefficient quantifies how the number of edits is distributed among the different editors who are involved in a certain Wikimedia project of interest, i.e., who edited an article in the project at least once. The Gini coefficient ranges from 0 for the minimal inequality (or the maximal equality, when every editor contributes equally) to 1 for the maximal inequality (when only a single editor contributes everything).
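As a minimal sketch of the measure just described, the Gini coefficient of a list of per-editor contributions can be computed with the standard rank-weighted formula. The function name `gini` and the input format are illustrative assumptions, not taken from the paper. Note that for a finite number n of editors this formula reaches at most (n − 1)/n, which approaches 1 only for large n.

```python
def gini(values):
    """Gini coefficient of non-negative contributions (e.g., edits per editor).

    Hypothetical helper for illustration; uses the rank-weighted form
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n with ranks i = 1..n
    over the values sorted in ascending order.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one editor with a positive contribution")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Every editor contributes equally: minimal inequality.
equal = gini([5, 5, 5, 5])
# One of four editors contributes everything: (n - 1) / n = 0.75.
monopoly = gini([0, 0, 0, 10])
```

In the paper's setting only editors with at least one edit are counted, so zero entries would not occur for the edit-count variable; they appear above only to illustrate the upper end of the range.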
We consider the number of edits and the data size for individual editors as the variable of interest in the Gini coefficient and call it “wealth” unless specified otherwise (as the Gini coefficient is usually used to quantify the inequality in economic wealth). The trend of the Gini coefficient as an increasing function of N_e displayed in Figs. 5(a)–5(d) connotes that the inequality is intensified as the communal data set grows. Larger values of N_e induce more severe inequality not only for the number of edits performed by the editors [as shown in Fig. 5(a)], but also for the total amount of data changes (in bytes) made by the editors [as shown in Fig. 5(b)]. This increasing trend is still valid when we account for addition [as shown in Fig. 5(c)] and subtraction [as shown in Fig. 5(d)] separately. In addition, since the data set age does not severely affect the inequality [as shown in Figs. 5(e) and 5(f)], the behavior of the Gini coefficient is consistent with our observation that the age does not affect the current state of communal data sets. We predict that the inequality will become more severe if a given data set is edited more frequently. To sum up, based on the current snapshot of communal data sets, we observe a universal pattern of inequality that increases with the number of edits.

FIG. 3. The correlations between the number of edits N_e, the number of editors N_p, the number of articles N_a, and the total size of the data set S. Every correlation is characterized by the simple power-law growth form y ∼ x^λ. As functions of the number of edits, the other measures grow sublinearly with an exponent of (a) λ ≃ 0.70 for the number of editors, (b) λ ≃ 0.85 for the number of articles, and (c) λ ≃ 0.87 for the total size of the data set in bytes, respectively. (d) The number of editors also increases sublinearly with the number of articles with an exponent of λ ≃ 0.78, (e) yet the size of the data set increases almost linearly with the number of articles with an exponent of λ ≃ 1.02. Finally, panel (f) displays the nearly linear interrelation between the number of editors and the size of the data set, which in turn indicates that the average productivity of a single editor is maintained.

IV. EVIDENCE FOR THE ESTABLISHMENT OF THE EDITORS’ CARTEL

In Sec. III B, we have shown the current snapshots displaying the high level of inequality and the increase of the Gini coefficient with the number of edits (see Fig. 5). Although the current status of the entire Wikimedia projects seems to follow a specific function of N_e, this snapshot could be coincidental. Thus, we further track the actual history to confirm or to reject the hypothesis of possible coincidence, so that we can judge whether the increasing trend is actually an inherent nature of the formation of communal data sets.

To check the hypothesis, we align the initial times of all 863 data sets [N_e(t = 0) = 0] and record the trajectories of the Gini coefficient as functions of N_e [see Fig. 6(a) for the curve averaged over all of the data sets, with the deviation]. Similar to the conventional usage of the Gini coefficient for wealth distributions, we use the cumulative number of edits up to N_e for each editor as the wealth variable (note that the unit of time is N_e in this case, as discussed in Sec. III A). Note that, technically, the Gini coefficient is undefined when a single editor has edited a data set alone (as we define the set of editors as those who have contributed at least once), but we take the Gini coefficient as 1 in that case because it describes the completely monopolized state well.
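The trajectory described above can be sketched as follows, under the assumption that a project's history is available as a chronologically ordered list of editor identifiers, one entry per edit. The function names, the input format, and the choice of checkpoints are hypothetical; only the use of cumulative edit counts as wealth and the convention of setting the coefficient to 1 for a single editor come from the text.

```python
from collections import Counter


def gini(values):
    """Rank-weighted Gini coefficient of positive contributions."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def wealth_gini_trajectory(edit_log, checkpoints):
    """Map each requested N_e to the Gini coefficient of cumulative edits.

    edit_log: editor IDs in chronological order (assumed format).
    checkpoints: values of N_e at which the coefficient is recorded.
    """
    wanted = set(checkpoints)
    counts = Counter()
    trajectory = {}
    for n_e, editor in enumerate(edit_log, start=1):
        counts[editor] += 1
        if n_e in wanted:
            # Convention from the text: a lone editor is a full monopoly.
            if len(counts) == 1:
                trajectory[n_e] = 1.0
            else:
                trajectory[n_e] = gini(counts.values())
    return trajectory


# Toy history: editor "a" dominates, "b" and "c" edit once each.
toy_log = ["a", "a", "b", "a", "c", "a"]
toy_trajectory = wealth_gini_trajectory(toy_log, checkpoints=[2, 6])
```

For real dumps one would likely choose logarithmically spaced checkpoints, since the trajectories are plotted against N_e on a logarithmic axis; that spacing is a presentational choice, not something the paper specifies.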
Our result shows that the average Gini coefficient is consistent with the current states of the Wikimedia projects [see Fig. 6(a)]; thus the current status of a specific data set can be taken as a certain midpoint of a single master curve described as a function of N_e. For example, the history of the Cebuano Wikipedia clearly follows the typical growth pattern for N_e > 10^4 [see Fig. 6(b)], except for the initial fluctuations at small values of N_e.

Although we employ the Gini coefficient as the inequality measure for accumulated wealth distributions, applying the index to income is also a widely accepted alternative. In economics, income is defined as the value gained within a specific time frame [29]. Accordingly, we also consider the number of edits by individual editors per unit time frame as the “income” variable in the Gini coefficient and call it income unless specified otherwise. In other words, the wealth analyzed before is the accumulated income from the onset of an individual editor’s first activity. In this study, we use a time window of 10^4 edits, but different values of the time frame do not affect the result meaningfully. The Gini coefficient of the income distribution for the communal data set as a function of N_e indicates that larger N_e values induce less severe inequality in contributions [see Fig. 6(c)]. It indeed suggests that the income distribution becomes more homogeneous as time goes by [see Fig. 6(c)], while the inequality in the wealth distribution is maintained [see Fig. 6(a)]. It is doubtful that this phenomenon is caused by a change in the productivity of the editors, given that the average number of edits performed by an editor is maintained even when the N_e values vary [see Fig. 4].
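The income variant can be sketched in the same spirit. The sketch below assumes non-overlapping windows of a fixed number of edits and counts only editors who were active inside each window; both choices are assumptions made for illustration, since the text specifies only the window length of 10^4 edits. The function names and the toy input are hypothetical.

```python
from collections import Counter


def gini(values):
    """Rank-weighted Gini coefficient of positive contributions."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def income_gini_per_window(edit_log, window):
    """Gini coefficient of edits per editor inside each full window.

    edit_log: editor IDs in chronological order (assumed format).
    window: number of edits per time frame (10**4 in the paper).
    Returns a list of (N_e at the end of the window, Gini coefficient).
    """
    results = []
    for start in range(0, len(edit_log) - window + 1, window):
        counts = Counter(edit_log[start:start + window])
        # Assumption: reuse the single-editor convention (Gini = 1).
        value = 1.0 if len(counts) == 1 else gini(counts.values())
        results.append((start + window, value))
    return results


# Toy log with a window of 4 edits instead of 10**4.
toy_log = ["a", "a", "a", "b", "a", "b", "c", "d"]
toy_income = income_gini_per_window(toy_log, window=4)
```

Comparing this per-window series with the cumulative trajectory is what separates the two pictures in the text: the windowed coefficient can fall while the cumulative one stays high.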
FIG. 4. The correlations for (a) the number of edits, (b) the number of editors, (c) the number of articles, and (d) the total size of Wikimedia projects, with the age of Wikimedia projects (in units of 10^8 seconds).

Therefore, the inequalities in the wealth distributions are intensified over time, while the inequalities in the contribution per time frame become less severe for all of the editors as time goes by. To reconcile the two results, we examine in detail how the rich-get-richer effect affects the communal data set. Figures 7(a) and 7(b) suggest that the editors tend not only to keep their short-term social positions but also to maintain their long-term social positions. For instance, 58.1% of the editors remain in the 10th percentile for the next 10^4 edits if they were in the 10th percentile in the time window of 990 000 ≤ N_e < 1 000 000, whereas only 32.6% of the editors in the 90th percentile retain their position [see Fig. 7(a)]. In other words, the editors who edit more often within a specific time window tend to edit more often later as well. Although the exact proportion and the number of edits for each percentile vary over time, the distinction between social classes is preserved. As a result, a hierarchical structure among the editors is gradually becoming concrete.

The trend is even clearer for accumulated edit numbers [see Fig. 7(b)]. At the early stage, only highly ranked editors, whose amount of contribution is much larger than the median, maintain their positions represented by the numbers of edits they have ever performed. Meanwhile, the rest of the lowly ranked editors, whose amount of contribution is much smaller than the median, change their positions more frequently.
For every percentile, the percentile of revisiting editors becomes more strongly associated with the previous class over time, which eventually makes most editors belong to a stratified percentile. Therefore, rather solid classes are formed at a very early stage and adhere until much later.

The cartel of supereditors indeed emerges [12, 13]. We have not only revisited the existence of such a cartel, but also observed how its degree of influence changes as more edits are performed. The territory of that conglomerate spans the entire Wikimedia project level, beyond single articles, and its leverage on Wikipedia is still growing [see Fig. 7(b)]. To comprehend the formation of such a supereditors’ cartel, we further discuss the interrelationship between the numbers of edits in two consecutive edit sequences in various time windows from the onset of the data set [as shown in Fig. 7(c)]. We calculate the Pearson correlation coefficient between the lists of edit numbers in two successive time windows for an editor. Initially, two consecutive sequences of edit numbers are highly correlated across various lengths of time windows, but the short-term correlation values gradually decrease as more edits are performed. In addition, the boundary between the highly correlated domain (correlation ≳ 0.7) and the weakly correlated domain (correlation ≲ 0.7) moves upwards as time goes by, so only the long-term correlation is maintained.

Putting these pieces together, the results shown in Fig. 7 explain the results shown in Figs. 6(a) and 6(c); the inequality in the wealth distributions is preserved by the long-term correlations, while the inequality in the income distribution is steadily resolved due to the diminution of the short-term correlations. Although the short-term activities of editors may vary, the adhesion and monopolization of a few editors are not resolved in the long run because of the cartel constructed at a very early stage of the communal data set.

FIG. 5. The Gini coefficient of Wikimedia projects as functions of the number of edits and the real time. (a) The Gini coefficient for the numbers of edits performed by each editor. (b) The average Gini coefficient for the total sums of the absolute amount of editing (in the unit of bytes) performed by each editor. (c) The average Gini coefficient for the sums of the absolute amounts of incremental change in edits performed by each editor. (d) The average Gini coefficient for the sums of the absolute amounts of decremental change in edits performed by each editor.

FIG. 6. The Gini coefficient of Wikimedia projects as functions of N_e. The blue curve in (a) is the average Gini coefficient and the shaded area corresponds to its standard deviation, averaged over different Wikimedia projects. (b) shows a typical example, the Cebuano Wikipedia. Although it initially does not seem to follow the general trend, the Gini coefficient for the Cebuano Wikipedia starts to follow the trend curve for N_e ≳ 10^4. In panel (c), we consider the number of edits for individual editors per unit time frame (10^4 edits) as the income variable in the Gini coefficient.

FIG. 7. The properties of the revisiting editors are characterized by their activity. The fraction of editors who are in a certain percentile within ∆N_e = 10^4 edits (a) if the editors are also in a certain percentile for the previous 10^4 edits and (b) if the editors are also in a certain percentile for the entire edit activity until the specific point N_e. (c) The Pearson correlation between the two lists of numbers of edits performed by a specific editor in the previous n edits and the next n edits for a given number of edits on the horizontal axis, averaged over the Wikimedia projects. Here, n is the value on the vertical axis. We consider the number of edits in the next (previous) 10^4 edits for an editor as 0 when the editor appeared only in the previous (next) 10^4 edits, respectively. The correlation is undefined if only one editor is in both sequences, and we set the value as 1 by convention in that case, because it corresponds to the complete dominance by that editor.

V. AGENT-BASED MODEL OF INEQUALITY FORMATION

To elucidate the dynamics behind the formation of such a supereditors’ cartel, we introduce an agent-based model by importing different types of editors’ “memory” affecting the motivation for edits. We assume that there are two fundamental and inherent motivations decaying over time, which govern the short-term and long-term behaviors of the editors. Our primary purpose is to examine the separate effects of these two memories governing the current state of Wikimedia projects. Besides these two decaying factors, the editors are also more engaged in certain articles when they have already put more effort into editing those articles [30], which represents the psychological attachment.
In the following, we describe in detail how we implement these socio-psychological effects in our mathematical model.

A. Model description

Our observation is mainly based on the indicators as functions of $N_e$ in Sec. IV, as we have already observed its validity in the real data. Accordingly, we set a single edit as the unit of time $t$. The model begins with a single agent. Each agent represents a single editor who participates in the editing process. We take a single medium representing the communal data set, i.e., a single language edition of a certain Wikimedia project. In our model, we consider the actions of editors to be motivated by their inherent nature, and introduce parameters for the editors to describe their activities. First, for editor $i$, we denote the accumulated number of edits by $N_i(t)$ at time $t$. The time of birth $t_{b;i}$ and the time of the last edit $t_{e;i}$ are also recorded.

The dynamic rules are as follows. In each simulation step, the debut of a new agent and the revisit (or re-edit) of an already existing agent occur in turn. First, with a constant probability $b$, a new agent appears and begins to participate in the editing process. Once a new agent appears in the data set at time $N_e$, the agent edits the data set at the time of inauguration, so that $t_{b;i}$ and $t_{e;i}$ are assigned as $N_e$, and the time $t$ is increased by 1 (in units of the edit number). Second, an editor chosen uniformly at random attempts to edit the data set. There are many factors affecting the motivation for edits, but we take three: the long-term decay of motivation, the short-term motivation of ownership, and the psychological engagement of editors. In general, editors are highly motivated at the beginning of their participation, but their motivation fades steadily [31, 32].
Thus, participants lose their attention as time goes by, which is modeled by the power-law decay factor $(t - t_{b;i})^{-k}$, where $k$ is the characteristic exponent representing the motivation decay; such a decay is observed in many temporally varying systems [33, 34]. In addition, a fat-tailed distribution is observed for the time between consecutive edits [12], which suggests that the editing time scale of Wikipedia shows “bursty” behavior, meaning that there is a short-term stimulation of the edit motivation that depends on the interval between an editor’s latest edit $t_{e;i}$ and the current time $t$ [33, 35]. This short-term stimulation of motivation is modeled as the factor $[1 + e^{-(t - t_{e;i})/\tau}]$, where $\tau$ is the characteristic time of this stimulation. Finally, editors tend to be more engaged when they have already participated more frequently [12, 30]. This engagement is assigned as 1 at the time of the first participation of the editor and is increased by unity every time the agent participates in the editing process, so it is equivalent to the number of edits $N_i(t)$ up to the time point $t$. Taking these factors together, in our model, when agent $i$ is chosen for editing, she participates in editing with the probability

$$P_i[t; N_i(t), t_{b;i}, t_{e;i}] = \min\left\{ 1,\; N_i(t)\,(t - t_{b;i})^{-k}\left[ 1 + e^{-(t - t_{e;i})/\tau} \right] \right\}. \qquad (1)$$

Once she decides to participate, $t_{e;i}$ is newly set as $t + 1$ and $N_i(t + 1) = N_i(t) + 1$. In addition, we also include the possibility for an agent to leave the editing process indefinitely. We consider that this breakup is based on the loss of motivation to edit [31, 32]. In our model, therefore, an agent leaves the system when she chooses not to edit and $P_i[t; N_i(t), t_{b;i}, t_{e;i}] < r$, where $r$ is a preassigned cutoff parameter common to all of the editors. In the following section, we give some evidence that the current inequality stems from the factors above, regardless of the innate nature of individual editors.

B. Model results

In Sec. IV, we have shown that the Gini coefficient increases with the number of edits; in particular, it increases rapidly at the early stage of a data set and stabilizes at a high level (Gini coefficient $\gtrsim 0.8$ for $N_e \gtrsim 10^4$; see Fig. 6). Our model result is consistent with the empirical observations. The Gini coefficient of the model data set increases rapidly until the high level is reached at $N_e \simeq 10^5$ for $k = 0.8$ [see Fig. 8(a); compare with Fig. 6(a)]. Smaller $k$ values yield a slower increase of the Gini coefficient, while the $\tau$ value has little effect. The Gini coefficient does not reach the high level (Gini coefficient $\simeq 1$) if we assign $k \gtrsim 1$, which suggests that a moderate decay of motivation is essential to reproduce the current state of communal data sets. The Gini coefficient of the income distribution from our model also behaves similarly to that from the data. For $k = 0.8$, the Gini coefficient for the income steadily decreases from $N_e \simeq 10^5$ [see Fig. 8(b)], which is also observed in the data for $N_e \gtrsim 10^5$ [see Fig. 6(c)].

In addition to the Gini coefficient, our model also reproduces the trend of reduced short-term correlations for the number of edits between time windows reported in Fig. 7(c). As shown in Fig. 8(c), the interrelationship between the numbers of edits in two consecutive sequences in various time frames from the onset of the data set gives a similar result. Both in the model and in the real data, we observe a large correlation between two consecutive sequences regardless of the length of the sequences. As time goes by, the short-term correlation is steadily reduced, while the long-term correlation is sustained. Similar to the data, the border between the large-correlation (correlation $\gtrsim 0.7$) and small-correlation (correlation $\lesssim 0.7$) domains rises as more edits are performed [see Fig. 8(c)]. The slope of the border differs for different $k$ values, but $\tau$ does not affect the slope.
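The dynamic rules and Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`gini`, `participation_probability`, `simulate`) are hypothetical, the Gini coefficient uses the standard sorted-rank formula, the initial agent is assumed to make one edit at $t = 0$, and the random editor is assumed to be drawn only from agents who have not yet left the system. The demonstration uses a birth probability larger than the $b = 0.0001$ quoted for Fig. 8 purely to keep the run short.

```python
import math
import random


def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def participation_probability(t, n_edits, t_birth, t_last, k, tau):
    """Eq. (1): min{1, N_i (t - t_b)^(-k) [1 + exp(-(t - t_e) / tau)]}."""
    age = t - t_birth
    if age <= 0:
        return 1.0  # assumption: the power law diverges at age 0, so the min is 1
    stimulation = 1.0 + math.exp(-(t - t_last) / tau)
    return min(1.0, n_edits * age ** (-k) * stimulation)


def simulate(n_steps, b=1e-4, k=0.8, tau=math.inf, r=0.01, seed=0):
    """Run n_steps simulation steps; return (edits per agent, total edits t)."""
    rng = random.Random(seed)
    # Per-agent state; assumption: the initial agent makes one edit at t = 0.
    n_edits, t_birth, t_last = [1], [0], [0]
    active = [0]  # indices of agents that have not left the system
    t = 1         # time is measured in edits
    for _ in range(n_steps):
        # Step 1: with probability b, a new agent appears and edits at once.
        if rng.random() < b:
            active.append(len(n_edits))
            n_edits.append(1)
            t_birth.append(t)
            t_last.append(t)
            t += 1
        if not active:
            continue
        # Step 2: a uniformly chosen remaining agent attempts to edit.
        pos = rng.randrange(len(active))
        i = active[pos]
        p = participation_probability(t, n_edits[i], t_birth[i], t_last[i], k, tau)
        if rng.random() < p:
            n_edits[i] += 1
            t_last[i] = t + 1
            t += 1
        elif p < r:
            # The agent declined and her motivation is below the cutoff: leave.
            active[pos] = active[-1]
            active.pop()
    return n_edits, t


if __name__ == "__main__":
    counts, total_edits = simulate(20000, b=0.01, k=0.8, seed=1)
    print(len(counts), total_edits, round(gini(counts), 3))
```

Calling `gini` on the per-agent edit counts at successive values of `t` would trace a curve analogous to Fig. 8(a); repeating the run over several seeds and values of `k` and `tau` is left to the reader, since the averaging procedure here is only assumed to mirror the one described in the caption of Fig. 8.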
In short, although the rapid increase of the wealth inequality at the early stage and the gradual decrease of the income inequality always occur, the parameter $k$ mainly governs the overall dynamics. In other words, the loss of long-term motivation induces the inequality, while the short-term memory does not affect the system much. Therefore, the rich-get-richer effect is mainly driven by the accumulated engagement induced by previous edits, and such long-term engagement makes the supereditors’ cartel formed at the early stage survive. In addition, our model indicates that the supereditors’ cartel can be formed without direct communication between editors, or, in other words, without any direct pressure from the society.

VI. CONCLUSION

In this study, we have examined the common patterns among the communal data sets displayed in all language editions of different types of Wikimedia projects. Although some studies have uncovered general patterns before, they are usually based on partial observations of a specific type or specific language editions of the data sets, which have left many unanswered questions and speculations [9, 12, 14–16]. In contrast, the extensive data set from the entire Wikimedia projects, recording the pan-human scale collaboration of forming collective knowledge, has given us an unprecedented opportunity to explore the true innate nature of humankind. In this data set, we have observed the universal interplay between the numbers of editors, the numbers of articles, the numbers of edits, and the total length of articles, which is characterized by a power-law scaling form with a single set of exponents. The existence of universal growth rules across all 293 languages and 12 types of Wikimedia projects suggests a pan-human scale behavior for collaboration.
These universal patterns are shown not only in the external appearance of the data sets, but also in their inequality quantified by the Gini coefficient; the inequality is formed at a very early stage of the communal data sets and persists. It was widely hoped that communal data sets would bring the democratization of knowledge [7], yet studies reveal that the current Wikimedia projects are just another stratified society under the control of a few authoritative entities [12, 13]. We have demonstrated that the inequalities among the editors can be even more deeply rooted than expected. The existence of the supereditors’ cartel is a universal phenomenon across all communal data sets regardless of their size and activity. We have also observed the universal trend of intensified inequality for all types of data sets, which suggests that the privatization by a few dedicated editors will be intensified further. In addition, we have shown that a social stratum of such communal data sets can be formed at a very early stage and that the polarization of editors has already been set.

Our study is not limited to diagnosing the current state of Wikimedia projects, but provides general insight into the future direction of communal data sets. For instance, our simulation suggests that the inequality can be formed without direct interactions between editors. Indeed, it is also observed that editors tend to obey pre-forged authorities [13]. Considering that the total productivity of the editors decreases with the number of edits, this may result in less productivity and even less accuracy in the future. It was already reported that the growth of Wikipedia has slowed down [36], and our analysis also warns that the inequality will not be easily resolved without active efforts. For a long time, Wikipedia has served as a spearhead of the international open knowledge market.
However, to sustain the abundant playground for worldwide collaborations, strategic actions considering the nature of such a social structure are required. Our findings reveal the abiogenesis of imbalances in the formation of a particular set of communal data, but the results and implications of our study can be applied outside the Wikimedia projects. We expect extensive applications of this type of analysis to understanding the collective behaviors of humankind, which, we hope, could in turn give clues for resolving even larger-scale social inequalities.

FIG. 8. The Gini coefficient from our model as functions of the number of edits. Panel (a) shows the Gini coefficient for the number of edits. Panel (b) shows the Gini coefficient for the time frame of every $10^4$ edits, for a given value on the horizontal axis. For panels (a) and (b), the colors correspond to different values of $\tau$, ranging from $\infty$ (no short-term stimulation) to 0.001: $\tau = 0.01$ (purple), $\tau = 0.001$ (green), $\tau = 0.0001$ (blue), and $\tau \to \infty$ (yellow). For panels (a) and (b), we use the following parameters: $b = 0.0001$ and $r = 0.01$. (c) The Pearson correlation between the lists of numbers of edits performed by an editor in the previous $n$ edits and the next $n$ edits, for the number of edits on the horizontal axis, where $n$ is the value on the vertical axis. We consider the number of edits in the next (previous) $10^4$ edits for an editor as 0 when the editor appears only in the previous (next) $10^4$ edits, respectively. For panel (c), we use the following parameters: $b = 0.0001$, $k = 0.8$, $\tau \to \infty$, and $r = 0.01$. We have checked that other choices of $\tau$ give similar results, but we show the result with $\tau \to \infty$ to emphasize the long-term correlation. For (a) to (c), each quantity is averaged over 1000 independent realizations.

ACKNOWLEDGMENTS

This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government through Grant No. NRF-2015-S1A3A-2046742 (J.Y. and H.J.).

[1] B. Smalley, The Study of the Bible in the Middle Ages (Blackwell, Oxford, 1952).
[2] J. Bartlett, Familiar Quotations, 10th ed. (Little, Brown and Company, Boston, 1919).
[3] P. Burke, The European Renaissance: Centers and Peripheries (Wiley-Blackwell, Oxford, 1998).
[4] K. Marx, Das Kapital: Kritik der Politischen Ökonomie (Verlag Von Otto Meissner, Hamburg, 1867).
[5] H. C. Barnard, Education and the French Revolution, British Journal of Educational Studies 18, 314 (1970).
[6] A. Bott, Prussia and the German System of Education (Albany, New York, 1868).
[7] C. Lemke and E. Coughlin, The Change Agents, Teaching for the 21st Century 67, 54 (2009).
[8] Wikipedia, https://www.wikipedia.org/.
[9] T. Chesney, An Empirical Examination of Wikipedia’s Credibility, First Monday 11 (2006).
[10] J. Giles, Internet Encyclopedias Go Head to Head, Nature 438, 900 (2005).
[11] Y. Gandica, J. Carvalho, and F. Sampaio dos Aidos, Wikipedia Editing Dynamics, Phys. Rev. E 91, 012824 (2015).
[12] J. Yun, S. H. Lee, and H. Jeong, Intellectual Interchanges in the History of the Massive Online Open-editing Encyclopedia, Wikipedia, Phys. Rev. E 93, 012307 (2016).
[13] B. Heaberlin and S. DeDeo, The Evolution of Wikipedia’s Norm Network, Future Internet 8, 14 (2016).
[14] A. Kittur, B. Suh, and Ed H. Chi, Can You Ever Trust a Wiki?: Impacting Perceived Trustworthiness in Wikipedia, CSCW ’08 Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, 477 (2008).
[15] B. T. Adler, K. Chatterjee, L. De Alfaro, M. Faella, I. Pye, and V. Raman, Assigning Trust to Wikipedia Content, WikiSym ’08 Proceedings of the 4th International Symposium on Wikis, Article No. 26 (2008).
[16] T. Yasseri, R. Sumi, A. Rung, A. Kornai, and J. Kertész, Dynamics of Conflicts in Wikipedia, PLOS ONE 7, e38869 (2012).
[17] W. Barber and A. Badre, Culturability: The Merging of Culture and Usability, Proceedings of the Fourth Conference on Human Factors and the Web (1998).
[18] A. Marcus and E. W. Gould, Crosscurrents: Cultural Dimensions and Global Web User-Interface Design, Interactions 7, 32 (2000).
[19] S. Schmid-Isler, The Language of Digital Genres - A Semiotic Investigation of Style and Iconology on the World Wide Web, Proceedings of the 33rd Hawaii International Conference on System Sciences (2000).
[20] U. Pfeil, P. Zaphiris, and C. S. Ang, Cultural Differences in Collaborative Authoring of Wikipedia, J. Comput. Mediat. Commun. 12 (2006).
[21] S. Kim, S. Park, S. A. Hale, S. Kim, J. Byun, and A. H. Oh, Understanding Editing Behaviors in Multilingual Wikipedia, PLOS ONE 11, e0155305 (2016).
[22] Wikimedia Projects, https://wikimediafoundation.org/.
[23] Wikimedia Downloads, https://dumps.wikimedia.org/backup-index.html.
[24] F. Yergeau, UTF-8, a Transformation Format of ISO 10646, STD 63, RFC 3629.
[25] S. A. Hale, Multilinguals and Wikipedia Editing, ACM Web Science Conference 2014 (2014).
[26] W. G. Song, H. P. Zhang, T. Chen, and W. C. Fan, Power-law Distribution of City Fires, Fire Safety Journal 38, 453 (2003).
[27] Ethnologue: Summary by language size, http://www.ethnologue.com/statistics/size, Accessed: 14 May 2016.
[28] C. Gini, Variabilita e Mutabilita (Variability and Mutability) (C. Cuppini, Bologna, 1912).
[29] N. G. Mankiw, Principles of Economics, 7th edition (Cengage Learning, Boston, 2014).
[30] B. P. George, Past Visits and the Intention to Revisit a Destination: Place Attachment as the Mediator and Novelty Seeking as the Moderator, Journal of Tourism Studies 15, 51 (2004).
[31] R. Crane and D. Sornette, Robust Dynamic Classes Revealed by Measuring the Response Function of a Social System, Proc. Natl. Acad. Sci. USA 105, 15649 (2008).
[32] F. Wu and B. A. Huberman, Novelty and Collective Attention, Proc. Natl. Acad. Sci. USA 104, 17599 (2007).
[33] M. Karsai, K. Kaski, A.-L. Barabási, and J. Kertész, Universal Features of Correlated Bursty Behaviour, Sci. Rep. 2, 397 (2012).
[34] M. Karsai, N. Perra, and A. Vespignani, Time Varying Networks and the Weakness of Strong Ties, Sci. Rep. 4, 4001 (2014).
[35] H.-H. Jo, J. I. Perotti, K. Kaski, and J. Kertész, Correlated Bursts and the Role of Memory Range, Phys. Rev. E 92, 022814 (2015).
[36] B. Suh, The Singularity Is Not Near: Slowing Growth of Wikipedia, WikiSym ’09 Proceedings of the 5th International Symposium on Wikis, Article No. 8 (2009).