Early Adhesion of Structural Inequality in the Formation of Collaborative Knowledge, Wikipedia
Jinhyuk Yun (윤진혁),1,∗ Sang Hoon Lee (이상훈),2,† and Hawoong Jeong (정하웅)1,3,4,‡
1 Department of Physics, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea
2 School of Physics, Korea Institute for Advanced Study, Seoul 02455, Korea
3 Institute for the BioCentury, Korea Advanced Institute of Science and Technology, Daejeon 34141, Korea
4 Asia Pacific Center for Theoretical Physics, Pohang 37673, Korea
arXiv:1610.06006v1 [physics.soc-ph] 19 Oct 2016
(Dated: October 20, 2016)
We perform an in-depth analysis of inequality in 863 Wikimedia projects. We take the complete editing history of 267 304 095 Wikimedia items until 2016, which not only covers every language edition of Wikipedia but also includes the complete versions of Wiktionary, Wikisource, Wikivoyage, etc. We find a common growth pattern, described by the interrelations between four characteristic growth yardsticks, which suggests a universal law of communal data formation. In this encyclopedic data set, we observe an interplay between the number of edits and the degree of inequality. In particular, the rapid increase of the Gini coefficient suggests that this entrenched inequality stems from the nature of such open-editing communal data sets, namely the abiogenesis of a supereditors’ cartel. We show that these groups are created at the early stage of these open-editing media and are still active. Furthermore, our model, which takes both short-term and long-term memories into account, successfully elucidates the underlying mechanism that establishes the oligarchy in Wikipedia. Ultimately, our results forewarn a rather pessimistic prospect for such communal databases: the inequality will persist for a long time.
I. INTRODUCTION
For ages, knowledge was monopolized by privileged classes, and it preserved the status of those classes in turn. Sovereigns used philosophy and religion as convenient instruments to allay public sentiment; thus, the policy of bread and circuses dominated and public education was absent, which limited public knowledge to practical matters, e.g., simple labor. Meanwhile, the literacy of the overlords was mainly in Latin. For example, the Bible was allowed to be written only in Latin until the Protestant Reformation. Medieval education did not aim at the entire public, and it thus consolidated the authority of the clergy [1]. The clergy was in a symbiotic relationship with political sovereigns through the theory of the divine right of kings, which limits the natural rights of the general public. As Francis Bacon said, knowledge was power [2].
The Renaissance, a cultural movement that began in Italy, allowed the revival of the intellect. As the name suggests, various disciplines that had been suppressed were reborn in this period in the spirit of humanism; music, literature, art, and science flourished. Polymaths, or Renaissance men such as Leonardo da Vinci, who were experts in a number of different subject areas, appeared, and the knowledge they created spread quickly across Europe [3]. However, although a huge wave swept across Europe, the impact was mostly limited to small inner circles compared to the entire population, owing to socioeconomic inequalities [4].
The French Revolution ushered in modern society, which introduced public education to develop the potential of the “Estates General” [5]. Almost concurrently, Prussia implemented a modern compulsory education system, which enabled commoners to escape from illiteracy [6]. From then on, a vast amount of knowledge has become open to the public, thanks to technological developments such as lower-cost letterpress printing, telegraphic communications, and modern information technology. As a result, the general education level has improved so much that commoners can become acquainted with knowledge, yet the intelligentsia still remained the creators of knowledge.
∗ Present address: Naver Corporation, Seongnam 13561, Korea
† Corresponding author: [email protected]
‡ Corresponding author: [email protected]
The emergence of information technology in this century is finally offering environments in which commoners share up-to-date information that they generate themselves; people believe that such a new environment will bring the “democratization” of knowledge [7]. An inconceivable volume of information is produced by everyone every second. Wikipedia, a representative open-editing knowledge market, may be regarded as the department store for commoners who generate information [8]. However, owing to the nature of online information, there has been long-lasting doubt about its credibility, i.e., it is sometimes considered unstable, imprecise, and even deceiving. On the other hand, studies have shown that the current accuracy of Wikipedia is remarkable; the accuracy of its contents surpasses that of traditional encyclopedias [9, 10]. At the same time, as yet another twist, researchers found inequalities in the editing processes, with the contents governed by a small cartel of supereditors; thus, it is still far from the Elysium of communal knowledge that we have dreamed of [11–13].
However, the majority of such studies, including our own, focused on a few language editions, mostly the English edition of Wikipedia, to examine the dynamics and properties of a communal data set (an editable data set shared in a community to build collective knowledge) [9, 12, 14–16]. Although they successfully warned of potential risks behind the current social structure of the English edition, it is clear that cultural background affects the behavior of individuals. For instance, people from different cultural backgrounds tend to use different symbols in their Web usage [17], and the design of Web pages is also affected by their backgrounds [18, 19]. Similarly, the users of Wikipedia must be affected by their social norms or culture. It is also reported that editors in different language editions edit Wikipedia with distinctive patterns [20]. Moreover, the linguistic complexity of the English Wikipedia differs from that of the German and Spanish editions [21]. Therefore, the results based on cultural differences seem to deny the generality of the way such inequality is established. On the other hand, those studies are based on small samples from data sets of very different sizes, e.g., the number of articles in the English Wikipedia and the numbers of articles in Wikipedias in other languages are of different orders of magnitude; the English one is at least ten times larger than the others. Thus, the result might be caused by their relative sizes, and it is impossible to accept or to deny the lack of generality. As a result, the discussion is limited to a vague regional generality among English users, and the behavior on the scale of all humankind remains unanswered.
To investigate the aforementioned generality, we extend our previous analysis of the English Wikipedia [12] to all 863 Wikimedia projects [22], which comprise various types of communal data sets served by a non-profit organization called the Wikimedia Foundation to encourage worldwide knowledge collecting and sharing. For this purpose, we diagnose the inequality and the supereditors’ cartel in the Wikimedia projects to understand the socio-physiology of open-editing communal data sets. In particular, we take the complete edit history of every Wikimedia project to inspect the growth of these projects, mainly focusing on the number of edits, the number of editors, the number of articles, the article size, and the level of inequality. We demonstrate that there exist typical growth patterns of such open-editing communal data sets that eventually yield drastic inequalities among the editors. In addition, to comprehend the mechanism behind such inequality, we introduce an agent-based model that replicates the interactions between communal data sets and editors. Our model takes into account the competition between the editors’ natural loss of motivation over time, the editors’ stronger memory of more recent activities, and the psychological attachment to articles. The model indeed reproduces the universal growth patterns, consistent with the real data.
The rest of the paper is organized as follows. In Sec. II, we introduce the Wikimedia project data used in our investigation. In Sec. III, we present the universality of the inequality in the editing processes of communal data sets, which increases with the number of edits and the size. We trace the complete history in chronological order in Sec. IV to present the existence of the supereditors’ cartel on the whole-wiki scale. We present our agent-based model in Sec. V and conclude the paper in Sec. VI.
II. DATA SET
For the analysis, we use the March 2016 dump of the entire Wikimedia projects [23]. This data set not only includes the well-known Wikipedia, but also covers its sibling projects such as Wiktionary, Wikibooks, Wikiquote, Wikisource, Wikinews, Wikiversity, Wikivoyage, etc., in different languages (see Table I for its detailed composition). Basically, each of these open-editing projects has a distinct subject and object. For example, each language edition of Wiktionary aims to describe all words from all other languages in the main language of the edition, e.g., the English edition of Wiktionary has descriptions of words of all languages in English. The difference between the objects may yield a gap between the editing behaviors of the editors belonging to each project, caused by the different demographic pools, the accessibility, the degree of interest, etc. This dump contains the complete copy of Wikimedia articles from the very beginning on January 15, 2001 to March 5, 2016, including the raw text and metadata sources in the Extensible Markup Language (XML) format.
In this data set, there are a total of 267 304 095 articles across the entire Wikimedia projects with the complete history of edits. For each edit, an article records either the Wikipedia account identification (ID) or the Internet protocol (IP) address of the editor, the article size, the timestamp, etc. A single character takes one byte, except for a few cases such as Korean (two bytes) and Chinese (two or three bytes), so the article size in bytes is a direct measure of article length [24]. Across the data sets, the number of articles ranges from 43 124 816 (Wikimedia Commons: a database of freely usable audiovisual media) to 3 (Wikipedia Login: a database used for administrative purposes), the number of editors ranges from 44 349 908 (English Wikipedia) to 5 (Nostalgia Wikipedia: a read-only copy of the old English Wikipedia), the number of edits ranges from 654 163 757 (English Wikipedia) to 14 (Wikipedia Login), and the total article size ranges from 99 519 138 751 bytes (English Wikipedia) to 1 206 bytes (Wikipedia Login). See Fig. 1 for the distributions of various characteristic measures.
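As an illustration of how the four characteristic measures could be obtained from an edit history, the following minimal Python sketch aggregates a list of revision records into the number of edits, editors, and articles and the total current size. The `Revision` record layout and the function name are hypothetical and do not reflect the schema of the actual dump; taking the total size as the sum of the latest size of every article is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    """One edit of one article (hypothetical layout, not the dump schema)."""
    article: str    # article title or ID
    editor: str     # account ID or IP address of the editor
    timestamp: int  # time of the edit, in seconds
    size: int       # article size in bytes after this edit


def characteristic_measures(revisions):
    """Return (number of edits, editors, articles, total current size in bytes)."""
    editors = set()
    latest = {}  # article -> (timestamp, size) of its most recent revision
    n_edits = 0
    for rev in revisions:
        n_edits += 1
        editors.add(rev.editor)
        seen = latest.get(rev.article)
        if seen is None or rev.timestamp >= seen[0]:
            latest[rev.article] = (rev.timestamp, rev.size)
    # Assumption: total size = sum of the latest size of every article.
    total_size = sum(size for _, size in latest.values())
    return n_edits, len(editors), len(latest), total_size


history = [
    Revision("A", "alice", 1, 100),
    Revision("A", "10.0.0.1", 2, 150),
    Revision("B", "alice", 3, 40),
]
print(characteristic_measures(history))  # (3, 2, 2, 190)
```

Registered accounts and IP addresses are treated alike as editor identifiers here, following the description of the records above.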
Previous studies, including our own, tend to use a few language editions, commonly restricted to the English edition of Wikipedia, which is the largest [9, 12, 14–16]. In addition, Wikimedia projects other than Wikipedia are not usually investigated, even though several language editions of Wikipedia have been considered [21, 25]. It is true that most other Wikimedia projects are not as large as the English Wikipedia, as shown by the fat-tailed distributions for different Wikimedia projects in Fig. 1. However, properties of everyday phenomena vary with their sizes [26]; thus, the characteristics of such communal databases may vary with their size and category. Therefore, to comprehend the overall properties of communal data sets, those smaller editions should not be neglected just because they are much smaller than the English editions. Accordingly, we take all 863 editions of Wikimedia projects for our analysis.
Our main goal is to find the underlying principle in the development of communal data sets. The Wikimedia project, as the representative player among those communal data sets, consists of various types of data sets operated by the Wikimedia Foundation. This massive record of the knowledge market spans 292 different languages and 12 different types of subject (see Table I for details). This variety allows us to explore the innate nature of humankind’s behavior for each written language and purpose of use. We take a single Wikimedia project as one sample of such communal data sets. To proceed with the in-depth analysis of the evolution of communal data sets, we stress the fact that most data sets’ ages are around $3.5 \times 10^8$ seconds (about eleven years; see Fig. 2). Thus, most Wikimedia projects are of similar ages, suggesting that the raw characteristic measures without time rescaling are appropriate, free from the age effect.
TABLE I. The list of Wikimedia projects. In general, there are different language editions for each project.
Project      Description/Notes                                   Editions
Wikipedia    Encyclopedia articles                               292
Wiktionary   Dictionary                                          172
Wikibooks    Educational textbooks and learning materials        121
Wikiquote    Collection of quotations                            89
Wikisource   Library of source documents and translations        65
Wikinews     News source                                         33
Wikiversity  Educational and research materials and activities   16
Wikivoyage   Travel guide                                        17
etc.         Deactivated (not editable) ones are included        58
Total                                                            863
FIG. 1. The rank versus characteristic measures of Wikimedia projects: (a) the number of edits $N_e$, (b) the number of editors $N_p$, (c) the number of articles $N_a$, and (d) the total volume of texts $S$ (in units of bytes).
III. UNIVERSALITY AND INEQUALITIES FOR COMMUNAL DATA SETS
In this section, we present evidence of a universal growth pattern shared by all of the Wikimedia projects, displayed by characteristic measures at the current snapshot of a communal data set, such as the total number of edits $N_e$, the total number of editors $N_p$, the total number of articles $N_a$, and the current size $S$ (in units of bytes). Our primary interest is to identify the generality of growth for the communal data set, not for individual articles. Thus, we take the total sum of values over all of the articles in a Wikimedia project, without considering the individual properties of its constituent articles. For example, $N_e$ is the total number of edits for a given edition of a Wikimedia project, i.e., the sum of the edit numbers of the individual articles of that edition. Our first analysis of the interplay between such measures indicates regularity regardless of age, language, and the type of data set.
FIG. 2. The age distribution of Wikimedia projects, where we bin the data in uniform intervals of $2 \times 10^7$ seconds (the resolution of the horizontal axis).
A. Growth scale of communal data sets
We begin our analysis by inspecting the intercorrelations between $N_e$, $N_a$, $N_p$, and $S$ in the current Wikimedia projects. One may speculate that no general rule exists between the measures because of the excessive heterogeneity of the current status (as shown in Fig. 1) compared to the age distribution (as shown in Fig. 2). As an example of the differences between language editions of Wikipedia, it has been reported that the levels of editors’ language proficiency in the English Wikipedia are qualitatively different from those in the other language editions [21]. Despite such differences, we find common positive correlations among the measures. First, we observe a clear tendency that the number of editors, the number of articles, and the size of the data set vary gradually as functions of the number of edits. The growth patterns are characterized by a simple sublinear growth of the form $y \sim x^{\lambda}$, where $x$ is the number of edits and the exponent is $\lambda \simeq 0.70$ [as shown in Fig. 3(a)], $\lambda \simeq 0.85$ [as shown in Fig. 3(b)], and $\lambda \simeq 0.87$ [as shown in Fig. 3(c)], respectively. In other words, the appearance of new editors, the appearance of brand new articles, and the increase of the amount of text slow down as more edits take place; in the sense of editability, larger data sets are less efficient than smaller ones.
To find the reason behind such stagnation in terms of the number of edits, we also track the interrelations between the other measures themselves. The number of editors increases with the number of articles with the exponent $\lambda \simeq 0.78$ [as shown in Fig. 3(d)]. Meanwhile, the size of the data set increases roughly linearly with the number of articles and the number of editors, with the exponents $\lambda \simeq 1.02$ and $\lambda \simeq 1.06$, respectively. In short, the rate of accumulation of the text remains almost constant regardless of the number of articles and the number of editors. The result implies that the stagnation is caused by the decelerated appearance of new editors, not by the decreased productivity of the existing editors. In addition, in contrast to the interrelations between the measures reported here, the measures are not correlated with the age of the data sets (as shown in Fig. 4), indicating that the raw number of edits, rather than the real time, is a proper measure of time for comparing various data sets. As we observe in Fig. 2, most of the Wikimedia projects are of similar ages, so our analysis implies that the speed of growth per unit time also decreases as the size increases, as we revealed in our previous study on the English Wikipedia [12].
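A growth exponent of the form $y \sim x^{\lambda}$ can be estimated in several ways; the text does not state which procedure was used. As one possibility, the following minimal Python sketch fits a straight line to $\log_{10} y$ versus $\log_{10} x$ by ordinary least squares, so that the slope plays the role of $\lambda$. The function name and the synthetic data are hypothetical and only serve as a self-check.

```python
import math


def fit_power_law_exponent(xs, ys):
    """Least-squares slope of log10(y) against log10(x), i.e. lambda in y ~ x**lambda.

    Returns (exponent, prefactor). Pairs with non-positive values are skipped
    because their logarithm is undefined.
    """
    pts = [(math.log10(x), math.log10(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive data points")
    mean_u = sum(u for u, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_v - slope * mean_u
    return slope, 10.0 ** intercept


# Synthetic self-check: data generated with exponent 0.70 should give it back.
n_edits = [10.0 ** k for k in range(1, 9)]
n_editors = [3.0 * x ** 0.70 for x in n_edits]
lam, prefactor = fit_power_law_exponent(n_edits, n_editors)
print(round(lam, 3), round(prefactor, 3))  # approximately 0.7 and 3.0
```

With real project data, each point would be one Wikimedia project, with $x = N_e$ and $y$ one of $N_p$, $N_a$, or $S$.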
B. Inequality in contributions
The general growth pattern of the characteristic measures $N_e$, $N_a$, $N_p$, and $S$ raises an interesting question: could there also be a universal rule for building up the recently reported structural inequalities [12, 13]? In other words, we wonder if the de facto ownership or monopoly of supereditors [11] is a generic phenomenon for communal data sets in general. To examine this proposition, we employ the Gini coefficient, which is a conventional measure of inequality [28]. In our analysis, the Gini coefficient quantifies how the number of edits is distributed among the editors who are involved in a certain Wikimedia project of interest, i.e., who have edited an article in the project at least once. The Gini coefficient ranges from 0 for the minimal inequality (or the maximal equality, when every editor contributes equally) to 1 for the maximal inequality (when a single editor contributes everything). We consider the number of edits and the data size for individual editors as the variables of interest in the Gini coefficient and call them “wealth” unless specified otherwise (as the Gini coefficient is usually used to quantify the inequality in economic wealth).
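For concreteness, a minimal Python sketch of the Gini coefficient is given below. It uses the standard rank-based formula $G = \frac{2\sum_i i\,x_{(i)}}{n \sum_i x_i} - \frac{n+1}{n}$ for values $x_{(1)} \le \dots \le x_{(n)}$, which is equivalent to the mean absolute difference divided by twice the mean. The function name is hypothetical, and the text does not specify which estimator was actually used, so this is only one common choice.

```python
def gini(values):
    """Gini coefficient of non-negative values (e.g. edits per editor).

    Rank-based formula on the sorted values; 0 means perfect equality.
    With n contributors the maximum attainable value is (n - 1) / n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


print(gini([5, 5, 5, 5]))             # 0.0: every editor contributes equally
print(round(gini([1, 1, 1, 97]), 2))  # 0.72: one editor dominates
```

Because every counted editor has contributed at least once, the inputs in this setting are positive integers.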
The trend of the Gini coefficient as an increasing function of $N_e$, displayed in Figs. 5(a)–5(d), indicates that the inequality is intensified as the communal data set grows. Larger values of $N_e$ induce more severe inequality not only in the number of edits performed by the editors [as shown in Fig. 5(a)], but also in the total amount of data changes (in bytes) made by the editors [as shown in Fig. 5(b)]. This increasing trend is still valid when we account for additions [as shown in Fig. 5(c)] and subtractions [as shown in Fig. 5(d)] separately. In addition, since the article age does not severely affect the inequality [as shown in Figs. 5(e) and 5(f)], the observation of the Gini coefficient is consistent with our observation that the article age does not affect the current state of communal data sets. We predict that the inequality will become more severe if a given data set is edited more frequently. To sum up, we observe a universal pattern of inequality that increases with the number of edits, based on the current snapshot of communal data sets.
IV. EVIDENCE FOR THE ESTABLISHMENT OF THE EDITORS’ CARTEL
FIG. 3. The correlations between the number of edits $N_e$, the number of editors $N_p$, the number of articles $N_a$, and the total size of the data set $S$. Every correlation is characterized by a simple power-law growth of the form $y \sim x^{\lambda}$. With respect to the number of edits, the other measures grow sublinearly with an exponent of (a) $\lambda \simeq 0.70$ for the number of editors, (b) $\lambda \simeq 0.85$ for the number of articles, and (c) $\lambda \simeq 0.87$ for the total size of the data set in bytes, respectively. (d) The number of editors also increases sublinearly with the number of articles with an exponent of $\lambda \simeq 0.78$, (e) yet the size of the data set increases almost linearly with the number of articles with an exponent of $\lambda \simeq 1.02$. Finally, panel (f) displays the nearly linear interrelation between the number of editors and the size of the data set, which in turn indicates that the average productivity of a single editor is maintained.
In Sec. III B, we have shown that the current snapshots display a high level of inequality and that the Gini coefficient increases with the number of edits (see Fig. 5). Although the current status of the entire Wikimedia projects seems to follow a specific function of $N_e$, this snapshot could be coincidental. Thus, we further track the actual history to confirm or to reject the hypothesis of a possible coincidence, so that we can judge whether the increasing trend is actually inherent in the formation of communal data sets. To check the hypothesis, we set the initial times of all 863 data sets to be the same [$N_e(t = 0) = 0$] and record the trajectories of the Gini coefficient as functions of $N_e$ [see Fig. 6(a) for the curve averaged over all of the data sets with the deviation]. Similar to the conventional usage of the Gini coefficient for wealth distributions, we use the cumulative number of edits up to $N_e$ (note that the unit of time is $N_e$ in this case, as discussed in Sec. III A) for each editor as the wealth variable. Note that, technically, the Gini coefficient is undefined when a single editor has edited a data set alone (as we define the set of editors as those who have contributed at least once), but we take the Gini coefficient as 1 in that case because it describes the completely monopolized state well. Our result shows that the average Gini coefficient coincides with the current states of the Wikimedia projects [see Fig. 6(a)]; thus, the current status of a specific data set can be taken as a certain midpoint of a single master curve described as a function of $N_e$. For example, the history of the Cebuano Wikipedia clearly follows the typical growth pattern for $N_e > 10^4$ [see Fig. 6(b)], except for the initial fluctuations at small values of $N_e$.
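The trajectory of the wealth Gini coefficient as a function of $N_e$ can be sketched as follows. This minimal Python example replays a chronological list of editor identifiers, accumulates each editor's edit count, and records the Gini coefficient at chosen values of $N_e$, taking the value as 1 while only a single editor has contributed (the convention stated above). The function names, the checkpoint mechanism, and the rank-based Gini estimator are assumptions made for illustration.

```python
from collections import Counter


def gini(values):
    """Rank-based Gini coefficient of positive values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def gini_trajectory(edit_log, checkpoints):
    """Return [(N_e, Gini)] of cumulative edits per editor at the given N_e values.

    edit_log: editor identifiers in chronological order (one entry per edit).
    checkpoints: edit counts N_e (>= 1) at which the Gini coefficient is recorded.
    """
    wealth = Counter()  # editor -> cumulative number of edits so far
    targets = sorted({c for c in checkpoints if c >= 1})
    trajectory = []
    idx = 0
    for n_e, editor in enumerate(edit_log, start=1):
        wealth[editor] += 1
        if idx < len(targets) and n_e == targets[idx]:
            # Convention from the text: a lone editor means G = 1.
            g = 1.0 if len(wealth) == 1 else gini(wealth.values())
            trajectory.append((n_e, g))
            idx += 1
    return trajectory


log = ["a", "a", "b", "a", "c", "a"]
for n_e, g in gini_trajectory(log, [1, 2, 4, 6]):
    print(n_e, round(g, 3))
```

Averaging such trajectories over many projects at common values of $N_e$ would give a master curve of the kind described above.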
Although we employ the Gini coefficient as the inequality measure for accumulated wealth distributions, an alternative approach that applies the index to income is also widely accepted. In economics, income is defined as the value gained within a specific time frame [29]. Accordingly, we also consider the number of edits by individual editors per unit time frame as the “income” variable in the Gini coefficient and call it income unless specified otherwise. In other words, the wealth analyzed before is the accumulated income from the onset of an individual editor’s first activity. In this study, we use a time window of $10^4$ edits, but different values of the time frame do not affect the result meaningfully. The Gini coefficient of the income distribution for the communal data set as a function of $N_e$ indicates that larger $N_e$ values induce less severe inequality in contributions [see Fig. 6(c)]. It indeed suggests that the income distribution becomes more homogeneous as time goes by [see Fig. 6(c)], while the inequality in the wealth distribution is maintained [see Fig. 6(a)]. It is doubtful that this phenomenon is caused by a change in the productivity of the editors, because the average number of edits performed by an editor is maintained even when the $N_e$ values vary [see Fig. 4].
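The income version of the measure differs from the wealth version only in what is counted: edits inside one window rather than all edits so far. The following minimal Python sketch splits a chronological edit log into consecutive, non-overlapping windows and computes the Gini coefficient of the per-editor edit counts inside each window. The function names are hypothetical; using non-overlapping windows, dropping an incomplete final window, and reusing the single-editor convention ($G = 1$) for income are assumptions.

```python
from collections import Counter


def gini(values):
    """Rank-based Gini coefficient of positive values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def income_gini_per_window(edit_log, window=10_000):
    """Return [(N_e at window end, Gini of edits per editor inside the window)].

    Only editors active inside a window are counted for that window;
    an incomplete final window is ignored (assumption).
    """
    log = list(edit_log)
    results = []
    for start in range(0, len(log) - window + 1, window):
        income = Counter(log[start:start + window])
        # Assumption: same single-editor convention as for wealth.
        g = 1.0 if len(income) == 1 else gini(income.values())
        results.append((start + window, g))
    return results


log = ["a", "a", "a", "a", "b", "c"]
print(income_gini_per_window(log, window=3))  # first window: one editor; second: equal shares
```

In the toy log, the cumulative (wealth) distribution after six edits is still unequal, while the second income window is perfectly equal, which mirrors the contrast drawn in the text.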
FIG. 4. The correlations of (a) the number of edits, (b) the number of editors, (c) the number of articles, and (d) the total size of Wikimedia projects with the age of the Wikimedia projects.
Therefore, the inequalities in the wealth distributions are intensified over time, while the inequalities in the contribution per time frame become less severe for all of the editors as time goes by. To reconcile the two results, we examine in detail how the rich-get-richer effect affects the communal data set. Figures 7(a) and 7(b) suggest that the editors tend not only to keep their short-term social positions but also to maintain their long-term social positions. For instance, 58.1% of the editors remain in the 10th percentile for the next $10^4$ edits if they were in the 10th percentile in the time window of $990\,000 \le N_e < 1\,000\,000$, whereas only 32.6% of the editors in the 90th percentile retain their position [see Fig. 7(a)]. In other words, the editors who edit more often within a specific time window tend to edit more often later as well. Although the exact proportion and the number of edits for each percentile vary over time, the distinction between social classes is preserved. As a result, a hierarchical structure among the editors is gradually solidifying.
The trend is even clearer for accumulated edit numbers [see Fig. 7(b)]. At the early stage, only highly ranked editors, whose amount of contribution is much larger than the median, maintain their positions as represented by the total numbers of edits they have performed. Meanwhile, the remaining lowly ranked editors, whose amount of contribution is much smaller than the median, change their positions more frequently. For every percentile, the percentile of revisiting editors becomes more strongly associated with their previous class over time, which eventually makes most editors belong to a stratified percentile. Therefore, rather solid classes are formed at a very early stage and persist until much later. The cartel of supereditors indeed emerges [12, 13]. We have not only confirmed the existence of such a cartel, but also observed how its degree of influence changes as more edits are performed. The territory of this conglomerate spans the entire Wikimedia project level beyond single articles, and its leverage on Wikipedia is still growing [see Fig. 7(b)].
To comprehend the formation of such a supereditors’ cartel, we further examine the interrelationship between the numbers of edits in two consecutive edit sequences for various time windows from the onset of the data set [as shown in Fig. 7(c)]. We calculate the Pearson correlation coefficient between the lists of editors’ edit numbers in two successive windows. Initially, two consecutive sequences of edit numbers are highly correlated across various lengths of time windows, but the short-term correlation values gradually decrease as more edits are performed. In addition, the boundary between the highly correlated domain (correlation $\gtrsim 0.7$) and the weakly correlated domain (correlation $\lesssim 0.7$) moves upwards as time goes by, so only the long-term correlation is maintained.
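The correlation between two successive windows can be sketched in a few lines of Python. For a position in the edit log and a window length $n$, the sketch counts each editor's edits in the previous $n$ edits and in the next $n$ edits, assigns 0 to an editor who is absent from one of the two windows, and computes the Pearson correlation between the two lists; when a single editor makes up both windows, the value is set to 1, following the convention given in the caption of Fig. 7. The function name, the argument layout, and the `nan` returned when one list has zero variance are assumptions.

```python
import math
from collections import Counter


def successive_window_correlation(edit_log, position, n):
    """Pearson correlation of per-editor edit counts in the n edits before
    `position` and in the n edits starting at `position`."""
    log = list(edit_log)
    if n < 1 or position - n < 0 or position + n > len(log):
        raise ValueError("both windows must lie inside the edit log")
    prev = Counter(log[position - n:position])
    nxt = Counter(log[position:position + n])
    editors = sorted(set(prev) | set(nxt))
    if len(editors) == 1:
        return 1.0  # convention: complete dominance by one editor
    xs = [prev[e] for e in editors]  # a Counter yields 0 for absent editors
    ys = [nxt[e] for e in editors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return float("nan")  # assumption: undefined when one list is constant
    return sxy / math.sqrt(sxx * syy)


log = ["a", "a", "b", "a", "a", "c"]
print(successive_window_correlation(log, position=3, n=3))  # 0.5
```

Scanning `position` along the log and `n` over a range of window lengths would produce a two-dimensional map of the kind shown in Fig. 7(c).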
Putting these pieces together, the results shown in Fig. 7 explain the results shown in Figs. 6(a) and 6(c); the inequality in the wealth distribution is preserved by the long-term correlations, while the inequality in the income distribution is steadily resolved owing to the diminution of the short-term correlations. Although the short-term activities of editors may vary, the adhesion and monopolization by a few editors are not resolved in the long run because of the cartel constructed at a very early stage of the communal data set.
FIG. 5. The Gini coefficient of Wikimedia projects as functions of the number of edits and the real time. (a) The Gini coefficient for the numbers of edits performed by each editor. (b) The average Gini coefficient for the total sums of the absolute amount of editing (in units of bytes) performed by each editor. (c) The average Gini coefficient for the absolute sums of the incremental changes in edits performed by each editor. (d) The average Gini coefficient for the absolute sums of the decremental changes in edits performed by each editor.
FIG. 6. The Gini coefficient of Wikimedia projects as functions of $N_e$. The blue curve in (a) is the average Gini coefficient and the shaded area corresponds to its standard deviation, averaged over different Wikimedia projects. (b) shows a typical example, the Cebuano Wikipedia. Although it initially does not seem to follow the general trend, the Gini coefficient for the Cebuano Wikipedia starts to follow the trend curve for $N_e \gtrsim 10^4$. In panel (c), we consider the number of edits by individual editors per unit time frame as the income variable in the Gini coefficient.
V. AGENT-BASED MODEL OF INEQUALITY FORMATION
To elucidate the dynamics of the formation of such a supereditors’ cartel, we introduce an agent-based model that incorporates different types of editors’ “memory” affecting the motivation for edits. We assume that there are two fundamental and inherent motivations decaying over time, which govern the short-term and long-term behaviors of the editors. Our primary purpose is to examine the separate effects of these two memories governing the current state of Wikimedia projects. Besides these two decaying factors, editors also become more engaged in certain articles when they have already put more effort into editing them [30], which represents the psychological attachment. In the following, we describe in detail how we implement these socio-psychological effects in our mathematical model.
FIG. 7. The properties of the revisiting editors characterized by their activity. The fraction of editors who are in a certain percentile within $\Delta N_e = 10^4$ edits (a) if the editors are also in a certain percentile for the previous $10^4$ edits and (b) if the editors are also in a certain percentile for the entire edit activity until the specific point $N_e$. (c) The Pearson correlation between the two lists of the numbers of edits performed by a specific editor in the previous $n$ edits and in the next $n$ edits for the given number of edits on the horizontal axis, averaged over the Wikimedia projects. Here, $n$ is the value on the vertical axis. We take the number of edits in the next (previous) $10^4$ edits for an editor as 0 when the editor appeared only in the previous (next) $10^4$ edits. The correlation is undefined if only one editor is in both sequences, and we set the value to 1 by convention in that case, because it corresponds to complete dominance by that editor.
A. Model description
Our observations in Sec. IV are mainly based on indicators expressed as functions of $N_e$, whose validity we have already observed in the real data. Accordingly, we set a single edit as the unit of time $t$. The model begins with a single agent. Each agent represents a single editor who participates in the editing process. We consider a single medium representing the communal data set, i.e., a single language edition of a certain Wikimedia project.
In our model, we consider the actions of editors to be motivated by their inherent nature, and we introduce parameters that describe the editors’ activities. For editor $i$, we denote the accumulated number of edits at time $t$ by $N_i(t)$. We also record the editor’s time of birth $t_{b;i}$ and the time of the latest edit $t_{e;i}$. The dynamic rules are as follows. In each simulation step, the debut of a new agent and the revisit (or re-edit) of an already existing agent occur in turn. First, with a constant probability $b$, a new agent appears and begins to participate in the editing process. Once a new agent appears in the data set at time $N_e$, the agent edits the data set at the time of inauguration, so that both $t_{b;i}$ and $t_{e;i}$ are assigned as $N_e$, and the time $t$ is increased by 1 (in units of the number of edits).
For the second step, an editor chosen uniformly at random attempts to edit the data set. There are many factors affecting the motivation to edit, but we take three: the long-term decay of motivation, the short-term motivation of ownership, and the psychological engagement of editors. In general, editors are highly motivated at the beginning of their participation, but their motivation fades steadily [31, 32]. Thus, participants lose their attention as time goes by, which we model by the power-law decay factor $(t - t_{b;i})^{-k}$, where $k$ is the characteristic exponent of the motivation decay; such power-law decay is observed in many temporally varying systems [33, 34]. In addition, a fat-tailed distribution is observed for the time between consecutive edits [12], which suggests that the editing time scale of Wikipedia shows “bursty” behavior: there is a short-term stimulation of the motivation to edit that depends on the interval between an editor’s latest edit $t_{e;i}$ and the current time $t$ [33, 35]. This short-term stimulation of motivation is modeled as the factor $[1 + e^{-(t - t_{e;i})/\tau}]$, where $\tau$ is the characteristic time of the stimulation. Finally, editors tend to be more engaged when they have already participated more frequently [12, 30]. We model this engagement by a factor that is set to 1 at the editor’s first participation and increased by unity every time the agent participates in the editing process; it is therefore equivalent to the number of edits $N_i(t)$ up to time $t$.
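As a minimal illustration of the three ingredients described above, the following Python sketch evaluates each factor separately. The function names and the sample values are ours and purely illustrative; they are not taken from the original simulation code.

```python
import math


def long_term_decay(t, t_birth, k):
    """Power-law decay of motivation, (t - t_b)^(-k), defined for t > t_b."""
    if t <= t_birth:
        raise ValueError("the decay factor requires t > t_birth")
    return (t - t_birth) ** (-k)


def short_term_stimulation(t, t_last, tau):
    """Bursty short-term factor, 1 + exp(-(t - t_e) / tau)."""
    return 1.0 + math.exp(-(t - t_last) / tau)


def engagement(n_edits):
    """Engagement factor: the editor's accumulated number of edits N_i(t)."""
    return n_edits


if __name__ == "__main__":
    # Hypothetical editor: first edit at t = 100, latest edit at t = 990.
    t, t_birth, t_last, n_edits = 1000, 100, 990, 25
    print(long_term_decay(t, t_birth, k=0.8))
    print(short_term_stimulation(t, t_last, tau=5.0))
    print(engagement(n_edits))
```

Note that the short-term factor always lies between 1 and 2: it equals 2 immediately after an edit and relaxes toward 1 on the time scale $\tau$, while for $\tau \to \infty$ it stays at the constant value 2.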
Taking these factors together, in our model, when agent $i$ is chosen for editing, she participates in editing with the probability

$$P_i[t; N_i(t), t_{b;i}, t_{e;i}] = \min\left\{ 1,\ N_i(t)\,(t - t_{b;i})^{-k} \left[ 1 + e^{-(t - t_{e;i})/\tau} \right] \right\}. \tag{1}$$

Once she decides to participate, $t_{e;i}$ is newly set as $t + 1$ and $N_i(t + 1) = N_i(t) + 1$. In addition, we also include the possibility for an agent to leave the editing process indefinitely. We consider that this breakup is based on the loss of motivation to edit [31, 32]. In our model, therefore, an agent leaves the system when she chooses not to edit and $P_i[t; N_i(t), t_{b;i}, t_{e;i}] < r$, where $r$ is a preassigned cutoff parameter common to all of the editors. In the following section, we give some evidence that the current inequality arises from the factors above, regardless of the innate nature of individual editors.
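The rules of this subsection can be put together in a short simulation. The following Python code is only a sketch under stated assumptions, not the authors’ implementation: the `Editor` class, the `simulate` function, and the seeded random number generator are our own constructions. We further assume that the founding agent makes its first edit at $t = 0$ and that the revisit step is skipped when no active agent remains. The default parameter values follow those quoted in the caption of Fig. 8.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Editor:
    n_edits: int  # N_i(t), accumulated number of edits
    t_birth: int  # t_{b;i}, time of the first edit
    t_last: int   # t_{e;i}, time of the latest edit


def participation_probability(ed, t, k, tau):
    """Eq. (1): min{1, N_i (t - t_b)^(-k) [1 + exp(-(t - t_e)/tau)]}."""
    decay = (t - ed.t_birth) ** (-k)
    boost = 1.0 + math.exp(-(t - ed.t_last) / tau)
    return min(1.0, ed.n_edits * decay * boost)


def simulate(n_steps, b=1e-4, k=0.8, tau=math.inf, r=0.01, seed=0):
    """Run n_steps simulation steps; return (t, active, retired)."""
    rng = random.Random(seed)
    # Assumption: the founding agent edits once at t = 0, so t starts at 1.
    active = [Editor(n_edits=1, t_birth=0, t_last=0)]
    retired = []
    t = 1
    for _ in range(n_steps):
        # Step 1: with probability b a new agent debuts and edits at once.
        if rng.random() < b:
            active.append(Editor(n_edits=1, t_birth=t, t_last=t))
            t += 1
        # Step 2: a uniformly chosen existing agent attempts to edit.
        if not active:
            continue  # assumption: nothing happens if everyone has left
        idx = rng.randrange(len(active))
        ed = active[idx]
        p = participation_probability(ed, t, k, tau)
        if rng.random() < p:
            ed.n_edits += 1
            ed.t_last = t + 1
            t += 1
        elif p < r:
            # The agent declined to edit and fell below the cutoff r.
            active[idx] = active[-1]
            active.pop()
            retired.append(ed)
    return t, active, retired


if __name__ == "__main__":
    t, active, retired = simulate(200_000, b=1e-3)
    counts = sorted((e.n_edits for e in active + retired), reverse=True)
    print("edits:", t, "editors:", len(counts), "top share:", counts[0] / t)
```

Because every edit advances the clock by one, the sum of `n_edits` over all agents equals the final value of `t`, which gives a simple consistency check on any implementation of these rules.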
B. Model results
In Sec. IV, we have shown that the Gini coefficient increases with the number of edits; in particular, it increases rapidly at the early stage of a data set and stabilizes at a high level (Gini coefficient $\gtrsim 0.8$ for $N_e \gtrsim 10^4$, see Fig. 6). Our model result is consistent with the empirical observations. The Gini coefficient of the model data set increases rapidly until the high level is reached at $N_e \simeq 10^5$ for $k = 0.8$ [see Fig. 8(a); compare with Fig. 6(a)]. Smaller $k$ values yield a slower increase of the Gini coefficient, while the value of $\tau$ has little effect. The Gini coefficient does not reach the high level (Gini coefficient $\simeq 1$) if we assign $k \gtrsim 1$, which suggests that a moderate decay of motivation is essential to reproduce the current state of communal data sets. The Gini coefficient of the income distribution in our model also behaves similarly to that in the data. For $k = 0.8$, the Gini coefficient for the income decreases steadily from $N_e \simeq 10^5$ [see Fig. 8(b)], as is observed in the data for $N_e \gtrsim 10^5$ [see Fig. 6(c)].
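To make the two Gini measures concrete, the following Python sketch computes the “wealth” Gini coefficient (cumulative edits per editor) and the “income” Gini coefficient (edits per editor within the latest window of edits) from a chronological edit log. The helper names and the `edit_log` representation (a list of editor identifiers, one per edit) are hypothetical, and we assume that only editors with at least one edit in the considered range enter the distribution. The formula used is the standard expression for the Gini coefficient of a sorted sample.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative values (0 = equal, -> 1 = unequal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # G = 2 * sum_i(i * x_i) / (n * sum_i(x_i)) - (n + 1) / n, x sorted.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def wealth_gini(edit_log, upto):
    """Gini of the cumulative edits per editor over the first `upto` edits."""
    return gini(Counter(edit_log[:upto]).values())


def income_gini(edit_log, upto, window):
    """Gini of the edits per editor within the last `window` edits."""
    start = max(0, upto - window)
    return gini(Counter(edit_log[start:upto]).values())


if __name__ == "__main__":
    # Toy log: editor "a" dominates early, "b" and "c" arrive later.
    log = ["a"] * 6 + ["b", "c"]
    print(wealth_gini(log, upto=8))
    print(income_gini(log, upto=8, window=2))
```

In the toy log the wealth measure stays unequal because of the early dominance of one editor, while the income measure over the latest window is perfectly equal, which mirrors the distinction between the two quantities drawn in the text.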
In addition to the Gini coefficient, our model also reproduces the trend of reduced short-term correlations for the number of edits between time windows reported in Fig. 7(c). As shown in Fig. 8(c), the interrelationship between the numbers of edits in two consecutive sequences in various time frames from the onset of the data set gives a similar result. Both in the model and in the real data, we observe a large correlation between two consecutive sequences regardless of the length of the sequences. As time goes by, the short-term correlation is steadily reduced, while the long-term correlation is sustained. Similar to the data, the border between the large-correlation (correlation $\gtrsim 0.7$) and small-correlation (correlation $\lesssim 0.7$) domains rises as more edits are performed [see Fig. 8(c)]. The slope of the border differs for different $k$ values, but $\tau$ does not affect the slope.
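The window correlation discussed here can be sketched as follows. This Python example is an illustration under our own assumptions, not the authors’ code: `edit_log` is a hypothetical chronological list of editor identifiers, editors missing from one window are assigned 0 edits there, and the value 1 is returned when only one editor appears, following the convention stated in the caption of Fig. 7. Returning NaN when one of the two lists has zero variance is our own choice.

```python
import math
from collections import Counter


def window_correlation(edit_log, center, n):
    """Pearson correlation of per-editor edit counts in the previous n edits
    and the next n edits around position `center` of the edit log."""
    prev = Counter(edit_log[max(0, center - n):center])
    nxt = Counter(edit_log[center:center + n])
    editors = sorted(set(prev) | set(nxt))
    if len(editors) < 2:
        return 1.0  # convention: a single editor dominates completely
    x = [prev.get(e, 0) for e in editors]  # 0 if absent from the window
    y = [nxt.get(e, 0) for e in editors]
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (c - my) for a, c in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((c - my) ** 2 for c in y)
    if var_x == 0.0 or var_y == 0.0:
        return math.nan  # assumption: undefined correlation is reported as NaN
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    stable = ["a", "a", "a", "b"] * 2      # same ranking in both windows
    flipped = ["a", "a", "a", "b", "b", "b", "b", "a"]  # ranking reversed
    print(window_correlation(stable, center=4, n=4))
    print(window_correlation(flipped, center=4, n=4))
```

Scanning `n` for a fixed `center` and locating where the correlation drops below a threshold such as 0.7 would then give a border of the kind described in the text.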
In short, although the rapid increase of wealth inequality at the early stage and the gradual decrease of income inequality always occur, the parameter $k$ mainly governs the overall dynamics. In other words, the loss of long-term motivation induces the inequality, while the short-term memory does not affect the system much. Therefore, the rich-get-richer effect is mainly driven by the accumulated engagement induced by previous edits, and such long-term engagement allows the supereditors’ cartel formed at the early stage to survive. In addition, our model indicates that the supereditors’ cartel can be formed without direct communication between editors or, in other words, without any direct pressure from society.
VI. CONCLUSION
In this study, we have examined the common patterns among the communal data sets displayed in all language editions of different types of Wikimedia projects. Although some studies have uncovered general patterns before, they were usually based on partial observations of specific types of projects or specific language editions, which has left many unanswered questions and speculations [9, 12, 14–16]. However, the extensive data set from the entire Wikimedia projects, recording the pan-human scale collaboration of forming collective knowledge, has given us an unprecedented opportunity to explore the true innate nature of humankind. In this data set, we have observed universal interplays between the numbers of editors, the numbers of articles, the numbers of edits, and the total length of articles, which are characterized by a power-law scaling form with a single set of exponents. The existence of universal growth rules among all 293 languages and 12 types of Wikimedia projects suggests a pan-human scale behavior for collaboration.
These universal patterns appear not only in the external appearance of the data sets, but also in their inequality quantified by the Gini coefficient; the inequality is formed at a very early stage of communal data sets and persists. It was widely hoped that communal data sets would bring democratization of knowledge [7], yet studies reveal that the current Wikimedia projects are just another stratified society under the control of a few authoritative entities [12, 13]. We have demonstrated that the inequalities among the editors can be even more deeply rooted than expected. The existence of the supereditors’ cartel is a universal phenomenon across all communal data sets regardless of their size and activity. We have also observed a universal trend of intensified inequality for all types of data sets, which suggests that the privatization by a few dedicated editors will be intensified further. In addition, we have shown that a social stratum of such communal data sets can be formed at a very early stage and that the polarization of editors has already been set.
Our study is not limited to diagnosing the current state of Wikimedia projects, but also provides general insight into the future direction of communal data sets. For instance, our simulation suggests that the inequality can be formed without direct interactions between editors. Indeed, it has also been observed that editors tend to obey pre-forged authorities [13]. Considering that the total productivity of the editors decreases with the number of edits, this may result in lower productivity and even lower accuracy in the future. It was already reported that the growth of Wikipedia has slowed down [36], and our analysis also warns that the inequality will not be easily resolved without active efforts. For a long time, Wikipedia has served as a spearhead of the international open knowledge market. However, to sustain this abundant playground for worldwide collaboration, strategic actions that consider the nature of such a social structure are required. Our finding displays the abiogenesis of imbalances in the formation of a particular set of communal data, but the results and implications of our study can be applied outside the Wikimedia projects.
There will be extensive applications for understanding collective behaviors of humankind based on this type of analysis, which, we hope, could in turn give clues to solving even larger-scale social inequalities.

[Figure 8: three panels plotted against the number of edits Ne. (a) Gini coefficient of the number of edits and (b) Gini coefficient per 10^4 edits, each with curves for k = 0.8, 0.7, 0.5, and 0.4; (c) correlation length (×10^3) versus Ne (×10^6).]
FIG. 8. The Gini coefficient from our model as a function of the number of edits. Panel (a) shows the Gini coefficient for the number of edits. Panel (b) shows the Gini coefficient for the time frame of every $10^4$ edits, for a given value on the horizontal axis. For panels (a) and (b), the colors correspond to different values of $\tau$, ranging from $\infty$ (no short-term stimulation) to 0.001: $\tau = 0.01$ (purple), $\tau = 0.001$ (green), $\tau = 0.0001$ (blue), and $\tau \to \infty$ (yellow). For panels (a) and (b), we use the following parameters: $b = 0.0001$ and $r = 0.01$. (c) The Pearson correlation between the lists of numbers of edits performed by an editor in the previous $n$ edits and the next $n$ edits, for the number of edits on the horizontal axis, where $n$ is the value on the vertical axis. We take the number of edits in the next (previous) $10^4$ edits for an editor as 0 when the editor appears only in the previous (next) $10^4$ edits, respectively. For panel (c), we use the following parameters: $b = 0.0001$, $k = 0.8$, $\tau \to \infty$, and $r = 0.01$. We have checked that other choices of $\tau$ give similar results, but we show the result with $\tau \to \infty$ to emphasize the long-term correlation. For (a) to (c), each quantity is averaged over 1000 independent realizations.
ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government through Grant No. NRF-2015-S1A3A-2046742 (J.Y. and H.J.).

[1] B. Smalley, The Study of the Bible in the Middle Ages (Blackwell, Oxford, 1952).
[2] J. Bartlett, Familiar Quotations (10th ed., Little, Brown and Company, Boston, 1919).
[3] P. Burke, The European Renaissance: Centers and Peripheries (Wiley-Blackwell, Oxford, 1998).
[4] K. Marx, Das Kapital: Kritik der Politischen Ökonomie (Verlag Von Otto Meissner, Hamburg, 1867).
[5] H. C. Barnard, Education and the French Revolution, British Journal of Educational Studies 18, 314 (1970).
[6] A. Bott, Prussia and the German system of education (Albany, New York, 1868).
[7] C. Lemke and E. Coughlin, The Change Agents, Teaching for the 21st Century 67, 54 (2009).
[8] Wikipedia, https://www.wikipedia.org/.
[9] T. Chesney, An Empirical Examination of Wikipedia’s Credibility, First Monday 11 (2006).
[10] J. Giles, Internet Encyclopedias Go Head to Head, Nature 438, 900 (2005).
[11] Y. Gandica, J. Carvalho, and F. Sampaio dos Aidos, Wikipedia Editing Dynamics, Phys. Rev. E 91, 012824 (2015).
[12] J. Yun, S. H. Lee, and H. Jeong, Intellectual Interchanges in the History of the Massive Online Open-editing Encyclopedia, Wikipedia, Phys. Rev. E 93, 012307 (2016).
[13] B. Heaberlin and S. DeDeo, The Evolution of Wikipedia’s Norm Network, Future Internet 8, 14 (2016).
[14] A. Kittur, B. Suh, and Ed H. Chi, Can You Ever Trust a Wiki?:
Impacting Perceived Trustworthiness in Wikipedia, CSCW ’08
Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, 477 (2008).
[15] B. T. Adler, K. Chatterjee, L. De Alfaro, M. Faella, I. Pye, and
V. Raman, Assigning Trust to Wikipedia Content, WikiSym ’08
Proceedings of the 4th International Symposium on Wikis, Article No. 26 (2008).
[16] T. Yasseri, R. Sumi, A. Rung, A. Kornai, and J. Kertész, Dynamics of Conflicts in Wikipedia, PLOS ONE 7, e38869 (2012).
[17] W. Barber and A. Badre, Culturability: The merging of culture
and usability, Proceedings of The Fourth Conference on Human
Factors and the Web (1998).
[18] A. Marcus and E. W. Gould, Crosscurrents: Cultural dimensions and global web user-interface design, Interactions 7, 32 (2000).
[19] S. Schmid-Isler, The language of digital genres: A semiotic investigation of style and iconology on the World Wide Web, Proceedings of the 33rd Hawaii International Conference on System Sciences (2000).
[20] U. Pfeil, P. Zaphiris, and C. S. Ang, Cultural Differences in Collaborative Authoring of Wikipedia, J. Comput. Mediat. Commun. 12 (2006).
[21] S. Kim, S. Park, S. A. Hale, S. Kim, J. Byun, and A. H. Oh, Understanding Editing Behaviors in Multilingual Wikipedia, PLOS ONE 11, e0155305 (2016).
[22] Wikimedia Projects, https://wikimediafoundation.org/.
[23] Wikimedia Downloads, https://dumps.wikimedia.org/backup-index.html.
[24] F. Yergeau, UTF-8, a Transformation Format of ISO 10646, STD 63, RFC 3629.
[25] S. A. Hale, Multilinguals and Wikipedia Editing, ACM Web Science Conference 2014 (2014).
[26] W. G. Song, H. P. Zhang, T. Chen, and W. C. Fan, Power-law Distribution of City Fires, Fire Safety Journal 38, 453 (2003).
[27] Ethnologue: Summary by language size, http://www.ethnologue.com/statistics/size, Accessed: 14 May 2016.
[28] C. Gini, Variabilita e Mutabilita (Variability and Mutability) (C. Cuppini, Bologna, 1912).
[29] N. G. Mankiw, Principles of Economics, 7th edition (Cengage Learning, Boston, 2014).
[30] B. P. George, Past Visits and the Intention to Revisit a Destination: Place Attachment as the Mediator and Novelty Seeking as
the Moderator, Journal of Tourism Studies 15, 51 (2004).
[31] R. Crane and D. Sornette, Robust Dynamic Classes Revealed
by Measuring the Response Function of a Social System, Proc.
Natl. Acad. Sci. USA 105, 15649 (2008).
[32] F. Wu and B. A. Huberman, Novelty and collective attention,
Proc. Natl. Acad. Sci. USA 104, 17599 (2007).
[33] M. Karsai, K. Kaski, A.-L. Barabási, and J. Kertész, Universal Features of Correlated Bursty Behaviour, Sci. Rep. 2, 397
(2012).
[34] M. Karsai, N. Perra, and A. Vespignani, Time Varying Networks
and the Weakness of Strong Ties, Sci. Rep. 4, 4001 (2014).
[35] H.-H. Jo, J. I. Perotti, K. Kaski, and J. Kertész, Correlated
Bursts and the Role of Memory Range, Phys. Rev. E 92, 022814
(2015).
[36] B. Suh, The Singularity Is Not Near: Slowing Growth of
Wikipedia, WikiSym ’09 Proceedings of the 5th International
Symposium on Wikis, Article No. 8 (2009).