Universals versus historical contingencies in lexical evolution

V. Bochkarev¹, V. Solovyev¹ and S. Wichmann¹,²

¹Kazan Federal University, Kremlevskaya Street 18, 420000 Kazan, Russia
²Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, 04103 Leipzig, Germany

Cite this article: Bochkarev V, Solovyev V, Wichmann S. 2014 Universals versus historical contingencies in lexical evolution. J. R. Soc. Interface 11: 20140841. http://dx.doi.org/10.1098/rsif.2014.0841

Received: 28 July 2014
Accepted: 11 September 2014

Subject areas: mathematical physics

Keywords: language dynamics, lexical change, Kullback–Leibler divergence, Google Books N-grams

Author for correspondence: S. Wichmann, e-mail: [email protected]

Electronic supplementary material is available at http://dx.doi.org/10.1098/rsif.2014.0841 or via http://rsif.royalsocietypublishing.org.
The frequency with which we use different words changes all the time, and
every so often, a new lexical item is invented or another one ceases to be
used. Beyond a small sample of lexical items whose properties are well
studied, little is known about the dynamics of lexical evolution. How do the
lexical inventories of languages, viewed as entire systems, evolve? Is the rate
of evolution of the lexicon contingent upon historical factors or is it driven
by regularities, perhaps to do with universals of cognition and social interaction? We address these questions using the Google Books N-Gram Corpus
as a source of data and relative entropy as a measure of changes in the frequency distributions of words. It turns out that there are both universals
and historical contingencies at work. Across several languages, we observe
similar rates of change, but only at timescales of at least around five decades.
At shorter timescales, the rate of change is highly variable and differs between
languages. Major societal transformations as well as catastrophic events such
as wars lead to increased change in frequency distributions, whereas stability
in society has a dampening effect on lexical evolution.
1. Introduction
In linguistics, discussions of rates of lexical change have revolved around the idea,
introduced by Morris Swadesh, that a small core of the vocabulary, which has
come to be known as ‘the Swadesh list’, has a characteristic degree of stability,
making it possible to date the diversification of groups of related languages by
observing the amount of change within such a list of basic vocabulary [1,2]. This
so-called glottochronological method has had the curious fate of having been
criticized to the extent that it is often routinely dismissed as being ‘discredited’
[3–5] whereas, at the same time, glottochronological dates are frequently cited,
even by the critics of glottochronology [6], and many refinements of the method
have been proposed, some [7] having had a great deal of impact. For many linguists, the fatal wounds inflicted upon glottochronology came from studies
where some languages were shown to have had quite different rates of change
[8]. But such studies raise the question of whether the rate of change might still
be sufficiently constant that it is useful for dating language group divergence
with reasonable accuracy provided that averages over several languages are
used and the time spans measured are great enough. A recent study from the
project known as the Automated Similarity Judgment Program (ASJP) [9] uses a
measure of phonological distance between words as another way of deriving
dates of language group divergence. Glottochronology uses a count of cognates
(related words) as its input. In this approach, word pairs will score 1 for being
related and 0 for being unrelated. In the ASJP approach, unrelated words will
usually have a similarity close to 0 and related words some number between 0
and 1. Thus, the ASJP approach is also sensitive to cognacy and therefore comparable to glottochronology. It was found in reference [9] that for 52 groups of
languages whose divergence dates are approximately known, there is a –0.84
correlation between log similarity and time. Thus, the assumption of glottochronology of an approximately regular rate of change is vindicated in this study,
which is the first to provide a test across many language groups.
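For concreteness, the classical glottochronological dating formula of Swadesh and Lees relates the shared-cognate fraction to the divergence date. The following is an illustrative sketch, not code from the studies cited; the retention rate of 0.805 per millennium is the textbook value for the 200-item list, and the function name is ours:

```python
import math

def glottochronological_age(cognate_fraction: float,
                            retention_rate: float = 0.805) -> float:
    """Classical Swadesh-Lees divergence date, in millennia.

    cognate_fraction: share of core-vocabulary items that two related
    languages still have in common (items scored 1 = related, 0 = not).
    retention_rate: assumed fraction of the core list retained per
    millennium by a single language (0.805 is the textbook value for
    the 200-item list).
    """
    # Two lineages lose cognates independently, hence the factor 2.
    return math.log(cognate_fraction) / (2 * math.log(retention_rate))

# Example: two languages sharing 70% cognates would have diverged
# roughly 0.8 millennia (about 800 years) ago under these assumptions.
print(round(glottochronological_age(0.70), 2))
```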
Studies in glottochronology have been limited to short word lists, the maximum being a standard set of 200 items. The differential stabilities within the
more popular 100-item Swadesh list have been measured [10], and the above-mentioned ASJP study [9] was conducted on a 40-item subset of the 100-item list. Beyond these short lists, little is known in quantitative terms about the dynamics of lexical change, and even within the 100-item list, the factors that contribute to making some items more or less stable than others are not well understood. The possibility that some items are less stable because they tend to be borrowed more often was investigated [10], but no correlation was found between borrowability and stability. However, stability and frequency have been found to be correlated [11], indicating that words that are used often tend to survive better than infrequently used words. These findings were based on the behaviour of a short (200-item) set of words. Similarly, it has been found that the more frequent irregular morphological forms are, the longer they tend to survive without being regularized [12].

The availability of the large Google Books Corpus (https://books.google.com/ngrams/) has dramatically increased the amount of data available for studying lexical dynamics, allowing us to look at thousands of different words. But the time span of half a millennium for which printed books are available is not great enough to study the replacement rates of all lexical items: it is only in some cases that we can observe one item being entirely replaced by another, and if a word survives for 500 years, we cannot know what its likely longevity is beyond those 500 years. Moreover, eight languages, which is what the Google Books Corpus offers, are not enough to establish whether words associated with particular meanings are more or less prone to change across languages.

We can nevertheless still indirectly investigate to what degree the rate of lexical change is constant in words not pertaining to the Swadesh list. The Google Books N-Gram Corpus allows us to study changing frequency distributions of entire word inventories year by year over hundreds of years. Inasmuch as the gain or loss of a lexical item presupposes a change in its frequency, there is a relationship, possibly one of proportionality, between frequency changes and gain/loss. In other words, it is meaningful to use frequency change as a proxy for lexical change in general. By ‘lexical change’, we refer mainly to processes whereby one form covering a certain meaning is gradually replaced by another because of a change in the semantic scope of the original lexical item and/or because of the innovation of new words.

In this paper, we address the question of the constancy of the rate of lexical change beyond the Swadesh list by studying changes in frequency distributions. In so doing, we will observe that the rate of change over short time intervals is anything but constant, leading to additional questions about which factors might cause this variability. Comparisons of British and American English will hint at processes by which innovations arise and spread; observations of changing frequency distributions over time will disclose effects of historical events; and comparisons across languages will help us to discern some universal tendencies cutting across the effects of historical contingencies. It should of course be kept in mind that the data consist of written, mostly literary, language. Written language tends to be more conservative than spoken language, so the changes we observe are likely to be even more pronounced in spoken language.

This is not the first paper to draw on the Google Books N-Gram Corpus. Since the seminal publication introducing this resource as well as the new concept of ‘culturomics’ [13], papers have been published that have analysed patterns of evolution of the usage of different words from a variety of perspectives, such as random fractal theory [14], Zipf’s and Heaps’ laws and their generalization to two scaling regimes [15], the evolution of self-organization of word frequencies over time in written English [16], statistical properties of word growth [17,18] and socio-historical determinants of word length dynamics [19]; other studies have been aimed at more specific issues such as the sociology of changing values resulting from urbanization [20] or changing concepts of happiness [21].

2. Materials

The Google Books N-Gram Corpus offers data on word frequencies from 1520 onwards, based on a corpus of 862 billion words in eight languages. Of these, we will draw upon English, Russian, German, French, Spanish and Italian, with particular attention to English, which conveniently is presented both as a total corpus (‘common English’) and as subcorpora (British, American, fictional). Chinese and Hebrew are not included in the study because the corpora available for these languages are relatively small. We focus on 1-grams, i.e. single words, and these are filtered such that only words containing letters of the basic alphabet of the relevant language plus a single apostrophe are admitted. Among other things, this excludes numbers. We furthermore ignored capitalization. For each language (or variety of a language in the case of English), the basic data consist of a year, the words occurring that year and their raw frequencies.

The version of the Google Books N-Gram Corpus used here is the second version, from 2012, but all analyses were in fact also carried out using the 2009 version. The second version, apart from being larger and including an extra language (Italian), mainly differs by the introduction of part-of-speech (POS) tags [22]. The POS tags are inaccurate to the extent that for the great majority of words there are at least some tokens that are misclassified. For instance, some tokens of the word ‘internet’ are tagged as belonging to the categories adposition, adverb, determiner, numeral, pronoun, particle and ‘other’, that is, to categories where ‘internet’ clearly cannot belong, given that the word is a noun. In each of these clear cases of misclassification, however, only a small proportion of tokens is wrongly assigned. In the case of ‘internet’, the tokens that are put in each of the wrong categories make up from 0.004% (‘internet’ as an adverb) to 0.107% (‘internet’ as a verb). As many as 94.044% of the tokens are correctly classified as nouns. Finally, 5.443% of the tokens are put in the adjective category. This assignment is not obviously wrong, because ‘internet’ in a construction such as ‘internet café’ does share a characteristic with adjectives, namely that of modifying a noun. After manual inspection of the 300 most common English words, we came to the conclusion that a reasonable strategy for filtering out obviously wrong assignments is to ignore cases where a POS tag is assigned to 1% or less of the total number of tokens of a given word.

In the electronic supplementary material, figure S1, we replicate one of our results (see figure 1 of the Results section) on three different versions of the database: the 2009 version, the 2012 version ignoring POS tags altogether and the 2012 version including POS tags (but filtered as just described). There are some differences in behaviour for the 2009 version as opposed to the 2012 version with and without POS tags, especially in recent years. But, as it turns out, it would be possible to ignore POS tags altogether and not get appreciably different results for the 2012 version of the database. Nevertheless, POS tags are included here.
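To make the two filters concrete, the following minimal Python sketch shows how they could be applied. This is an illustration, not the pipeline actually used; it assumes that the 1-gram files have already been parsed into (word, POS tag, token count) triples, and it hard-codes the English alphabet:

```python
import re
from collections import defaultdict

# Basic-alphabet filter: letters of the language's alphabet plus at
# most one apostrophe (the character class below assumes English;
# other languages would extend it). Numbers are thereby excluded.
WORD_RE = re.compile(r"^[a-z]+(?:'[a-z]+)?$")

def filter_1grams(records):
    """records: an iterable of (word, pos_tag, token_count) triples.

    Returns {(word, pos_tag): count}, lower-casing words (capitalization
    is ignored), keeping only words that pass the alphabet filter, and
    dropping POS tags that account for 1% or less of a word's tokens.
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    for word, pos, n in records:
        w = word.lower()
        if WORD_RE.match(w):
            counts[(w, pos)] += n
            totals[w] += n
    return {(w, pos): n for (w, pos), n in counts.items()
            if n > 0.01 * totals[w]}
```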
Thus, when we speak of a ‘word’ or 1-gram in this paper, we are referring to any unique combination of a word form (whether distinct for morphological or purely lexical reasons) and a POS tag. Unless otherwise noted, all results were based on the 100 000 most frequent words in order to filter out most of the noise caused by errors of optical character recognition.

3. Methods

The sets of word frequencies for each year in a given language can be considered vectors, where changes in word usage affect the values of the vector components. The frequency of a word at a given time is here interpreted as the probability of its occurrence in the corpus pertaining to a given year. To estimate changes in the lexical profile, a metric is needed for determining distances between the vectors. In other words, given two probability (frequency) distributions, p_i^A and p_i^B, we need a suitable index for measuring the difference between them. A frequently used class of metrics is shown in equation (3.1). For n = 1, it gives Manhattan distances, and for n = 2, it gives Euclidean distances:

\[
\rho(A,B) = \lVert p^A - p^B \rVert_{L_n} = \Bigl(\sum_i \bigl|p_i^A - p_i^B\bigr|^n\Bigr)^{1/n}. \tag{3.1}
\]

An information-theoretic metric, however, is expected to be more meaningful in our case, and the selection of such a metric is furthermore justified empirically in §4. First, in order to measure information, we use the standard entropy formula in (3.2):

\[
H = -\sum_i p_i \log_2 p_i. \tag{3.2}
\]

In order to calculate distances, we then apply the Kullback–Leibler (K–L) divergence [23] (equation (3.3)). Being asymmetric, this is not a true metric in the technical sense, but it can be symmetrized, as in equation (3.4):

\[
D_{A,B} = \sum_i p_i^A \log_2 p_i^A - \sum_i p_i^A \log_2 p_i^B \tag{3.3}
\]

and

\[
\rho(A,B) = D_{A,B} + D_{B,A} = \sum_i \bigl(p_i^A - p_i^B\bigr) \log_2 \frac{p_i^A}{p_i^B}. \tag{3.4}
\]

The K–L divergence is commonly used in studies related to ours [24,25]. Being a measure of relative entropy, it measures the divergence of two frequency distributions with regard to their informational content. The informational content of a single occurrence of a word is proportional to the logarithm of its probability taken with the opposite sign. Thus, for a rare word and a commonly occurring word, one and the same change in frequency values has different implications for the change in entropy. Information is usually measured in bits, which accounts for why the logarithms are to base 2 in equations (3.2)–(3.4).

Because one cannot take the logarithm of zero, the situation where the frequency of a word in B is zero must be handled in a special way. A possible approach, which was tested, is to simply remove the ‘zero cases’ from the distribution. However, we handled the situation in a more principled way by replacing zero frequencies with imputed frequencies based on the corresponding words in A, as shown in the electronic supplementary material. In practice, however, no appreciable differences were observed in the results when applying the simpler as opposed to the more complicated approach.

The change in entropy D_{A,B} is normalized by the entropy of the changed frequency distribution (equation (3.5)). For the symmetrized version, we normalize by the sum of the two entropies (equation (3.6)):

\[
D^{(\mathrm{norm})}_{A,B} = \frac{D_{A,B}}{H_B} \tag{3.5}
\]

and

\[
\rho^{(\mathrm{norm})}(A,B) = \frac{D_{A,B} + D_{B,A}}{H_A + H_B}. \tag{3.6}
\]

When comparing frequency distributions at different time periods for one and the same language, we use the asymmetric K–L divergence (equations (3.3) and (3.5)). This was used to produce figures 1 and 7–9. When frequency distributions in different languages are compared, we use the symmetrized version (equations (3.4) and (3.6)). This was used to produce figures 2 and 3. When the unit of analysis is the frequency of individual words, neither is used (figures 4–6).

We will sometimes refer to the ‘lexical rate of change’. This is defined as the average normalized K–L divergence per year between two points in time, t and t + T, as expressed in equation (3.7):

\[
V_{\mathrm{norm}}\Bigl(t + \frac{T}{2}\Bigr) = \frac{1}{T}\,\frac{D\bigl(p(t+T),\, p(t)\bigr)}{H(t)}. \tag{3.7}
\]
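To make equations (3.1)–(3.7) concrete, here is a minimal Python sketch of the quantities defined above. It is an illustration only (the analyses themselves were carried out with MATLAB scripts); parsing of the N-gram files is omitted, and a small pseudo-count stands in for the imputation scheme of the electronic supplementary material:

```python
import math

def normalize(counts):
    """Turn raw token counts {word: n} into probabilities {word: p}."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def lp_distance(pa, pb, n=1):
    """Equation (3.1): L_n distance (n=1 Manhattan, n=2 Euclidean)."""
    words = set(pa) | set(pb)
    return sum(abs(pa.get(w, 0.0) - pb.get(w, 0.0)) ** n
               for w in words) ** (1.0 / n)

def entropy(p):
    """Equation (3.2): H = -sum_i p_i log2 p_i."""
    return -sum(pi * math.log2(pi) for pi in p.values() if pi > 0)

def kl_divergence(pa, pb, eps=1e-12):
    """Equation (3.3): D_{A,B} = sum_i p^A_i log2(p^A_i / p^B_i).

    Words absent from B receive a pseudo-count eps here instead of the
    imputed frequencies used in the paper (which are reported to make
    no appreciable difference in practice).
    """
    return sum(a * math.log2(a / max(pb.get(w, 0.0), eps))
               for w, a in pa.items() if a > 0)

def norm_divergence(pa, pb):
    """Equation (3.5): D_{A,B} normalized by the entropy of B."""
    return kl_divergence(pa, pb) / entropy(pb)

def sym_norm_divergence(pa, pb):
    """Equations (3.4) and (3.6): symmetrized, normalized K-L
    divergence, as used for comparing languages or variants."""
    return ((kl_divergence(pa, pb) + kl_divergence(pb, pa))
            / (entropy(pa) + entropy(pb)))

def lexical_rate_of_change(counts_t, counts_tT, T):
    """Equation (3.7): average normalized K-L divergence per year
    between the distributions at times t and t + T."""
    p_t, p_tT = normalize(counts_t), normalize(counts_tT)
    return kl_divergence(p_tT, p_t) / entropy(p_t) / T
```

Applied to the frequency vectors of two points in time for one language, norm_divergence yields the quantity underlying figure 1 (up to the windowing described in §4), while sym_norm_divergence yields the quantity plotted in figures 2 and 3.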
4. Results

Figure 1. Change rates over time in the English lexicon.
Figure 2. Distances between British and American English.
Figure 3. Distances between British and American English for any pair of years (1800–2008). The colour spectrum ranges from blue (small distances) to red (large distances). Yellow and red dots show local minimal distances. The diagonal is marked in black.
Figure 4. The ratio of Americanisms to Briticisms.
Figure 5. Average frequency of each lexical item during 1950–1980 as a function of its average frequency during 1860–1900.
Figure 6. Relative frequency changes for English during 1800–2000 as a function of frequency. Average relative frequency changes (red line) are fitted to a power law f^(−0.31637) (yellow dotted line).

We first look at changes in frequency distributions in the common English corpus, focusing on the period from 1840
onwards, when the amount of available books in good-quality printing becomes sufficient.

Figure 1 shows a graph of normalized K–L distances between neighbouring states of the language year by year using a 10-year moving window, applying formula (3.7); a 10-year moving window means that T = 5. The mean of the normalized distances, taken across the per-year changes (calculated as just explained), is ⟨V⟩ = 1.93 × 10⁻⁴, the standard deviation is s.d.(V) = 2.77 × 10⁻⁵ and s.d.(V)/⟨V⟩ = 14.3%.
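In terms of the Methods sketch above, the series of figure 1 amounts to evaluating equation (3.7) on consecutive windows. An illustrative fragment, where yearly_counts is an assumed stand-in for the parsed corpus (a mapping year -> {word: token count}):

```python
def change_rate_series(yearly_counts, years, T=5):
    """Normalized K-L rate of change centred on year + T/2, reusing
    lexical_rate_of_change from the Methods sketch; T = 5 corresponds
    to the 10-year moving window used for figure 1."""
    return {year + T / 2: lexical_rate_of_change(yearly_counts[year],
                                                 yearly_counts[year + T], T)
            for year in years if year + T in yearly_counts}
```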
The graph shows some peaks corresponding to changes in
lexical profile that clearly must be due to major political
events, in addition to smaller, chaotic oscillations. Above all,
peaks corresponding to the periods of each of the world
wars catch the eye. Growth-rate fluctuations in the English lexicon have similarly been observed to be sensitive to these
events [17]. The turbulent twentieth century stands in contrast
to the period preceding it, where the fluctuations are generally
smaller. This is the Victorian era, which was generally a stable
period for British society. Thus, frequency changes can be provoked by major international historical events and, conversely, the potential for change is dampened by political stability. Over the approximately 150-year period that we are observing, these effects roughly cancel out.

Graphs based on the same data as in figure 1 were produced using the symmetrized version of K–L divergence (equation (3.4)) as well as Manhattan and Euclidean distances (equation (3.1)) for different values of n. For these graphs, see the electronic supplementary material, figures S2 and S3. For the symmetrized K–L divergence, the results are not appreciably different from figure 1; for Manhattan distances, we see a similar picture but with some dampening of the major peaks; and for Euclidean distances, the effects of
minor variation are enhanced to the extent that the effects
of the two world wars become difficult to distinguish from
fluctuations that are not obviously connected to historical
events. This is the empirical motivation for our choice of
K–L divergence as a preferred metric. The differences are most likely owing to Manhattan distances and K–L divergence being more sensitive than Euclidean distances to small, absolute changes in frequency, which mainly occur for rare words. This interpretation is supported by figures 5 and 6, which show that the more infrequent words undergo the greater frequency changes.

Because the Google Books N-Gram Corpus distinguishes between British and American English, it is possible to study differences between the evolution of these two language variants as well as their interrelatedness. The lexicons of the two variants are practically identical in spite of differences in the frequencies of items, so we can again use K–L divergence as a metric for comparing them. In figure 2, the dots show normalized distances for each year within the period 1840–2008, and the line shows the result of smoothing using a median filter with a 10-year window.

The two variants of English start out diverging more and more up to around 1950–1955, after which time they rapidly begin to converge; from around 1975 onwards, they appear to fluctuate around the same level of divergence that they exhibited in 1840. No doubt the introduction of mass media has played a major role in this convergence.

Next, we may ask whether the two variants behave in a similar way. For instance, do they mutually converge towards one another at the same speed or is one of them converging faster? To answer this question, figure 3 shows a heat map where colours represent informational distances between any two points in time for the two variants, using the same data as for figure 2. Figure 3 (also provided in a more detailed three-dimensional version in figure S4 of the electronic supplementary material) reveals that between around 1900 and 2000 the smallest distances tend to lie above the diagonal rather than being scattered around it as before 1900. For instance, the American English data of 1920 are most similar to the British English data of around 1940. In general, for each year throughout the twentieth century, British English is conservative relative to American English, being most similar to the American English of one to two decades earlier. This situation seems to change by the end of the century, although we are not yet far enough into the twenty-first century to be able to determine whether the convergence is a lasting trend. Were we to try to predict the future, however, our bet would be that British English will cease to be conservative relative to American English.

Additional information about the differences between British and American English is presented in figure 4, where words are classified as ‘Briticisms’ versus ‘Americanisms’ for the two periods 1860–1900 and 1950–1980. A Briticism is defined as any word that is more frequent in British English than in American English in a given period, and an Americanism has the opposite definition. The specific periods were chosen for being around a century apart and relatively ‘quiet’ in the sense that they are not affected by either of the two major peaks in changes of frequency distributions that occurred during the world wars. A four-way classification is produced by finding the ratio between American and British English of the frequencies of individual words. For instance, a ratio of r = 50 means that the frequency in American English is 50 times higher.

In the lower left are Briticisms that have stayed as such during both periods (35.5% of the total sample); in the upper right are Americanisms that have stayed as such during both periods (32.1%); in the lower right are Americanisms that have become Briticisms (16.0%); and in the upper left are Briticisms that have become Americanisms (16.4%). There are somewhat more words that change from being Briticisms to Americanisms than the other way around, but the difference is small, and there are more words that remain as Briticisms than words that remain as Americanisms.
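The four-way classification behind figure 4 reduces to two frequency-ratio comparisons. An illustrative sketch, where the four arguments are assumed mappings from words to relative frequencies in the given variant and period:

```python
def classify(word, br_early, am_early, br_late, am_late):
    """Quadrant of figure 4 for one word: 'early' is 1860-1900 and
    'late' is 1950-1980; a word counts as an Americanism in a period
    if its American/British frequency ratio exceeds 1."""
    american_early = am_early[word] / br_early[word] > 1
    american_late = am_late[word] / br_late[word] > 1
    return {(False, False): 'Briticism in both periods',
            (True, True): 'Americanism in both periods',
            (True, False): 'Americanism that became a Briticism',
            (False, True): 'Briticism that became an Americanism'
            }[(american_early, american_late)]
```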
Having treated changes in English from the point of view of contingencies of history and of particulars relating to the interaction between the two major variants, British and American, we now pass on to a question relating more to the universals of language change: which words are most sensitive to changes in frequency distributions? This question is investigated through two plots. First, in figure 5, we average the frequencies of each item within the respective periods 1860–1900 and 1950–1980 and plot the latter as a function of the former. The boundaries of the time intervals were selected such that the sample sizes are approximately equal: the total corpus size for 1860–1900 is 1.59 × 10¹⁰ words, and for 1950–1980 it is 1.65 × 10¹⁰ words; thus, the difference is smaller than 4%.
Unlike the other figures, figure 5 is based on all words, not just the 100 000 most frequent ones. It shows, first of all, a wider range of variation in frequency changes as frequencies decrease. Words with a frequency of around 10⁻³ or higher do not change their frequency radically, leading to dots gathered narrowly around a straight line. These include function words: articles, prepositions and conjunctions. For words with frequencies below 10⁻³, the distribution widens symmetrically, indicating that these words may either increase or decrease their frequencies, and that the relative changes in frequency grow as frequency decreases.
The impression from figure 5 that the range of frequency changes increases as frequency decreases is brought out more clearly by figure 6, which is also based on all words rather than the 100 000 most frequent ones. The horizontal axis shows the frequency f, and the vertical axis shows the relative frequency change Δf/f of words for a 1-year period. The blue dots show the values of Δf/f for the individual words in a given year. There are 6 783 304 different words in the corpus for the 200-year period, so the number of dots could theoretically be 6 783 304 × 200, but because not every word is attested in every pair of consecutive years, the number of dots is less than half of that amount. The red, solid line shows the average relative change in frequency as a function of frequency, averaged over all words whose frequencies do not differ by more than 10% from a certain, fixed value of f. More precisely, in order to produce the red, solid line, we selected 113 points distributed uniformly on a log scale of f in the interval from 4 × 10⁻¹⁰ to 10⁻¹, and for each of these f-values, we took the average of the relative frequency changes within an interval of ±10% around it (resulting in slightly overlapping bins). The curve obtained in this way is fitted (yellow dotted line) by a power law Δf/f ∝ f^(−0.316) over a wide range of values (approx. from 2 × 10⁻⁷ to 10⁻³). To the extent that changes in frequency distributions can be used as a proxy for lexical change, this confirms the finding, based on a much smaller set of words [11], that more frequent words undergo less change. For very low frequencies, we observe some deviations from the power law. The deviation from the power law is observed in the region where sampling error renders the frequency
estimates inaccurate. In short, the deviations from the power law are owing to the limited number of data points.
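The red curve and the power-law fit of figure 6 can be reconstructed along the following lines. This is an illustrative sketch of the binning procedure described above; f and df_over_f are assumed numpy arrays of word frequencies and the absolute relative one-year changes:

```python
import numpy as np

def relative_change_curve(f, df_over_f, n_points=113,
                          lo=4e-10, hi=1e-1, width=0.10):
    """Average relative change within +/-10% bins around frequencies
    spaced uniformly on a log scale, as for the red line in figure 6.
    Bins that catch no words yield nan and are ignored by the fit."""
    grid = np.logspace(np.log10(lo), np.log10(hi), n_points)
    means = np.array([df_over_f[(f > g * (1 - width)) &
                                (f < g * (1 + width))].mean()
                      for g in grid])
    return grid, means

def fit_power_law(grid, means, lo=2e-7, hi=1e-3):
    """Least-squares line in log-log space over the range where the
    text reports power-law behaviour; the slope should come out near
    the exponent -0.316."""
    mask = (grid > lo) & (grid < hi) & np.isfinite(means)
    slope, intercept = np.polyfit(np.log10(grid[mask]),
                                  np.log10(means[mask]), 1)
    return 10 ** intercept, slope  # prefactor and exponent
```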
At this point, we will turn to data from languages other
than English in order to look at historical contingencies
versus universal trends from a cross-linguistic perspective.
Figure 7. Change rates over time in the Russian lexicon.

Figure 7 shows the lexical evolution for Russian, computed in the same way as figure 1 for English.
Describing frequency changes in Russian is somewhat complicated by the orthographical reform of 1917, but we have partially controlled for these effects by eliminating the letter ‘ъ’, which was formerly written at the end of words ending in a consonant.
For Russia, the twentieth century has been an era of
numerous drastic changes in society caused by events such
as World War I, the October Revolution and the Civil War,
World War II and the breakup of the Soviet Union and
disruption of the socialist system. The graph shows a slow
drop in the rate of change during a quiet period of the
second half of the nineteenth century, a huge burst in
the beginning of the twentieth century, a peak during
World War II, and another peak around the time of the
disintegration of the Soviet Union.
In the following, we will compare the behaviour of seven European languages or language variants (henceforth languages): British English, American English, Russian, German, French, Spanish and Italian. For this purpose, we select data as summarized in table 1. Because the data are not equally plentiful for the different languages and periods, the periods from which we sample are not identical. The second column shows the larger periods within which we average over 10-year subintervals. The third column shows the two periods for each language used for computing changes over 50-year intervals. Averaging over time periods lessens the effect of orthographical changes. The fourth column shows the number of most frequent words (counted as types) that together account for 75% of the corpora (counted as tokens). We will be referring to these words as the ‘kernel’ lexicon. Owing especially to inflectional morphology, there is much more diversity within the kernel lexicon for Russian than for the other languages. Russian has the largest number of nominal cases among the languages concerned, six in all, and these induce differences in the forms of nouns. It also has a rich verbal inflectional morphology. Among the remaining languages, German shows the most diversity. This language also has nominal cases, but only four, and these are mainly expressed through articles and only to a minor extent through different forms of the nouns. Somewhat surprisingly, German is closely followed by Italian, which, possibly owing to its richer verbal morphology, has a markedly larger kernel lexicon than its Romance sisters Spanish and French. English, which is one of the morphologically most reduced Germanic languages, comes in last, with fewer than 2400 word forms accounting for 75% of the corpora of the two varieties.

Table 1. Summary data for the calculation of lexical change rates in seven languages.

| language         | averaging periods (10-year intervals) | averaging periods (50-year intervals) | number of word forms (75% level) |
|------------------|---------------------------------------|---------------------------------------|----------------------------------|
| British English  | 1835–2007                             | 1920–1940, 1970–1990                  | 2393                             |
| American English | 1835–2007                             | 1920–1940, 1970–1990                  | 2365                             |
| Russian          | 1860–2007                             | 1930–1950, 1980–2000                  | 21077                            |
| German           | 1855–2007                             | 1910–1930, 1960–1980                  | 5087                             |
| French           | 1850–2007                             | 1920–1940, 1970–1990                  | 2647                             |
| Spanish          | 1920–2007                             | 1920–1940, 1970–1990                  | 3012                             |
| Italian          | 1855–2007                             | 1920–1940, 1970–1990                  | 5028                             |
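Column 4 of table 1 can be reproduced with a simple cumulative cut-off. A minimal sketch, where counts is an assumed word-form-to-token-count mapping for a given corpus:

```python
def kernel_lexicon(counts, coverage=0.75):
    """Smallest set of most frequent word forms (types) that together
    account for `coverage` of all tokens; cf. column 4 of table 1."""
    total = sum(counts.values())
    kernel, covered = [], 0
    for word, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        if covered >= coverage * total:
            break
        kernel.append(word)
        covered += n
    return kernel
```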
Figure 8. Rates of change for British English, American English, German, French, Spanish and Italian.

Figure 8 compares the changes in frequency distributions over time for British English, American English, German, French, Spanish and Italian (Russian is left out because the
scale would otherwise not be suitable for displaying details
in the curves of the other languages). While the world wars
are registered in the curves for all languages, there are also
some differences in various peaks and valleys, as well as
differences in the magnitude of oscillation, German having
high rates of change and Spanish smaller ones, for instance.
Differences in the magnitude of oscillation would mainly be
due to the morphological differences mentioned in the previous paragraph. Factors that must have contributed to two
of the peaks in German that do not have counterparts in
the other languages are the orthographic reforms of 1901
and 1996. But beyond this, it would be too speculative to
comment further on these differences. Our point is also not
to link linguistic changes to historical events in a detailed
manner, but rather simply to point out that some historical
contingencies must be operating, some of which affect all
languages because of their international nature, and some
of which are more specific to particular languages.
Figure 9. Rates of change in seven languages over (a) 10-year versus (b) 50-year intervals.

Next, we try to discern shared, universal tendencies. In figure 9, we compare the rates of change for, respectively, the
kernel (as mentioned, the words that account for 75% of a
given corpus) and the full lexicon across seven languages.
The bar plots, which show changes over 10- and 50-year periods
(using the periods specified in table 1), confirm our earlier finding that more frequent words—here, the kernel lexicon—are
less prone to change than less frequent ones, with proportions
of change in all versus kernel words looking somewhat similar
across languages. The Pearson correlation coefficient between change in the kernel and the full lexicons across languages is r = 0.976 (p = 0.00018) for the 10-year intervals and r = 0.7995 (p = 0.031) for the 50-year intervals. While these proportions, then, are similar across languages for both 10- and 50-year periods, the absolute rates of change only become similar across languages over larger time periods: the standard deviation in the rate of change for all words across languages drops from 1.100 × 10⁻⁴ to 0.566 × 10⁻⁴ when widening the interval from 10 to 50 years, and for the kernel words it drops from 1.122 × 10⁻⁴ to 0.446 × 10⁻⁴. These results
suggest that once we broaden the horizon to wider time intervals, the effects of historically contingent changes cancel out,
and languages begin to behave quite similarly.
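The cross-language comparison just described amounts to the following computation, sketched under the assumption that per-language rate series for a given window size are already available:

```python
import statistics

def cross_language_sd(rates_by_language):
    """s.d. across languages of the mean rate of change for one window
    size; comparing the value for 10-year windows with that for
    50-year windows reproduces the drop reported above
    (roughly 1.1e-4 down to 0.5e-4)."""
    means = [statistics.mean(series)
             for series in rates_by_language.values()]
    return statistics.stdev(means)
```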
5. Discussion and conclusion
The lexicon of a language reflects the world of its speakers.
Accordingly, changes in the lexicon of a language reflect
changes in the environment of its users, understood in the broadest possible sense. The same goes for the frequencies or changes in frequencies of words. Several studies of frequency
changes pertaining to individual words using the Google
N-Gram Corpus confirm this [13,16,19–21]. The specificity
and historically contingent nature of lexical change seemingly
contradict observations regarding regularities underlying
lexical change which cut across languages and time periods,
i.e. universals of lexical change [1,2,9,10,17,18,24]. The contradiction, however, is only apparent, because the points of view
are distinct. It is only in the bird’s eye view, facilitated by
either large amounts of data or large time spans, that
regularities emerge.
Using an information-theoretic measure of frequency
changes, we have found support for both a particularistic
and a universalistic view of lexical change. Looking at individual languages like English, Russian, and other European
languages, we clearly see that frequency distributions are sensitive to changes in society and historical events, even if it is
only possible to discern the relationship for a few very important international events like the two world wars. In order to
pursue this line of research further, it would be necessary
to develop some independent index of information fluctuation in society to be correlated with the changes in lexical
frequencies. This is a fascinating but daunting task which is
beyond the scope of this paper.
Lexical change is not only prompted by changes in the
social and natural environments, but also by the particular linguistic environment, be it other languages or dynamics among
different groups of speakers of one and the same language. In
this paper, we have caught some glimpses of how the developments of American and British English are interrelated. From
the point in time when sufficient data are available to us,
i.e. 1840, to about a century later, we observe a steady divergence, a trend that must already have begun with the colonization of North America. By the mid-twentieth century,
however, the two variants begin to converge, quite likely
owing to the advent of mass media, including radio and
television. Throughout the twentieth century, British English
typically looks lexically like a slightly outdated version of
American English, but in very recent years, the two variants
seem to be moving more in tandem. These results show a situation in which the universal tendency for dialects to diverge
into separate languages as they split up [26] is counteracted
by historical circumstances that lead them to re-converge.
Looking at which words are more responsible for changes
in frequency profiles, we have observed that more frequent
words in English exhibit less variation in their frequencies over time. For English, the normalized change in the frequency of a word over time is a power function of the word's frequency having an exponent of −0.316. There may well be a universal regularity here, and it would be interesting in future research to look at the variability of the exponent across languages, as well as to investigate possible relations between this phenomenon and Zipf's law, which states that the frequency of a word is a power-law function of its frequency rank [24,27–30]. Meanwhile, the English data serve as a confirmation from at least one language that stability is inversely related to frequency, a finding that had previously been made on the basis of a much more restricted set of lexical items [11].

Across languages, it is also the case that the rate of change is smaller for what we have called the ‘kernel lexicon’, namely those words which constitute 75% of the corpus representing a given language. For both the entire set of words and the kernel, the rate of change varies markedly across languages when observed within 10-year windows, but when the window is broadened to 50-year intervals, languages behave more similarly.

Swadesh [1], who first proposed the hypothesis that the rate of change is constant across a selected set of lexical items, measured lexical change over time in units of millennia. If the number of years and languages documented in the Google N-Gram Corpus were an order of magnitude larger, it is likely that we would find support for a version of Swadesh's hypothesis extended to the entire lexicon.

Acknowledgements. We are grateful to Eduardo G. Altmann, Eric W. Holman, Gerhard Jäger, Bernard Comrie and three anonymous referees for excellent comments on an earlier version of this paper, as well as to Natalia Aralova for valuable discussion. All analyses were carried out using our own MATLAB scripts.

Funding statement. S.W.'s research was supported by an ERC advanced grant (MesAndLin(g)k, proj. no. 295918) and by a subsidy of the Russian Government to support the Programme of Competitive Development of Kazan Federal University.
References
1. Swadesh M. 1955 Towards greater accuracy in lexicostatistic dating. Int. J. Am. Ling. 21, 121–137. (doi:10.1086/464321)
2. Lees RB. 1953 The basis of glottochronology. Language 29, 113–127. (doi:10.2307/410164)
3. McMahon A, McMahon R. 2005 Language classification by numbers. Oxford, UK: Oxford University Press.
4. Heggarty P. 2010 Beyond lexicostatistics: how to get more out of ‘word list’ comparisons. Diachronica 27, 301–324. (doi:10.1075/dia.27.2.07heg)
5. Campbell L. 2004 Historical linguistics. An introduction. Edinburgh, UK: Edinburgh University Press.
6. Campbell L. 1997 American Indian languages. The historical linguistics of native America. Oxford, UK: Oxford University Press.
7. Burlak SA, Starostin SA. 2001 Vvedenie v lingvisticheskuju komparativistiku [Introduction to linguistic comparative studies]. Moscow, Russia: Editorial URSS.
8. Bergsland K, Vogt H. 1962 On the validity of glottochronology. Curr. Anthropol. 3, 115–153. (doi:10.1086/200264)
9. Holman EW et al. 2011 Automated dating of the world's language families based on lexical similarity. Curr. Anthropol. 52, 841–875. (doi:10.1086/662127)
10. Holman EW, Wichmann S, Brown CH, Velupillai V, Müller A, Bakker D. 2008 Explorations in automated language classification. Folia Linguist. 42, 331–354. (doi:10.1515/FLIN.2008.331)
11. Pagel M, Atkinson QD, Meade A. 2007 Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449, 717–720. (doi:10.1038/nature06176)
12. Lieberman E, Michel J-B, Jackson J, Tang T, Nowak MA. 2007 Quantifying the evolutionary dynamics of language. Nature 449, 713–716. (doi:10.1038/nature06137)
13. Michel J-B et al. 2011 Quantitative analysis of culture using millions of digitized books. Science 331, 176–182. (doi:10.1126/science.1199644)
14. Gao J, Hu J, Mao X, Perc M. 2012 Culturomics meets random fractal theory: insights into long-range correlations of social and natural phenomena over the past two centuries. J. R. Soc. Interface 9, 1956–1964. (doi:10.1098/rsif.2011.0846)
15. Gerlach M, Altmann EG. 2013 Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3, 021006. (doi:10.1103/PhysRevX.3.021006)
16. Perc M. 2012 Evolution of the most common English words and phrases over the centuries. J. R. Soc. Interface 9, 3323–3328. (doi:10.1098/rsif.2012.0491)
17. Petersen AM, Tenenbaum J, Havlin S, Stanley HE. 2012 Statistical laws governing fluctuations in word use from word birth to word death. Sci. Rep. 2, 313. (doi:10.1038/srep00313)
18. Petersen AM, Tenenbaum JN, Havlin S, Stanley HE, Perc M. 2012 Languages cool as they expand: allometric scaling and the decreasing need for new words. Sci. Rep. 2, 943. (doi:10.1038/srep00943)
19. Bochkarev VV, Shevlakova AV, Solovyev VD. In press. Average word length dynamics as indicator of cultural changes in society. Soc. Evol. Hist.
20. Greenfield PM. 2013 The changing psychology of culture from 1800 through 2000. Psychol. Sci. 24, 1722–1731. (doi:10.1177/0956797613479387)
21. Oishi S, Graham J, Kesebir S, Costa Galinha I. 2013 Concepts of happiness across time and cultures. Pers. Soc. Psychol. Bull. 39, 559–577. (doi:10.1177/0146167213480042)
22. Lin Y, Michel J-B, Lieberman Aiden E, Orwant J, Brockman W, Petrov S. 2012 Syntactic annotations for the Google Books Ngram Corpus. In Proc. 50th Ann. Meeting Ass. Comput. Linguist., pp. 169–174, Jeju, Republic of Korea, 8–14 July 2012.
23. Kullback S, Leibler RA. 1951 On information and sufficiency. Ann. Math. Stat. 22, 79–86. (doi:10.1214/aoms/1177729694)
24. Corominas-Murtra B, Fortuny J, Solé RV. 2011 Emergence of Zipf's law in the evolution of communication. Phys. Rev. E 83, 036115. (doi:10.1103/PhysRevE.83.036115)
25. Hughes JM, Foti NJ, Krakauer DC, Rockmore DN. 2012 Quantitative patterns of stylistic influence in the evolution of literature. Proc. Natl Acad. Sci. USA 109, 7682–7686. (doi:10.1073/pnas.1115407109)
26. de Oliveira PMC, Sousa AO, Wichmann S. 2013 On the disintegration of (proto-)languages. Int. J. Sociol. Lang. 221, 11–19. (doi:10.1515/ijsl-2013-0021)
27. Zipf GK. 1936 The psycho-biology of language. London, UK: Routledge.
28. Baayen RH. 2001 Word-frequency distributions. Dordrecht, The Netherlands: Springer Science+Business Media.
29. Ferrer i Cancho R. 2005 The variation of Zipf's law in human language. Eur. Phys. J. B 44, 249–257. (doi:10.1140/epjb/e2005-00121-8)
30. Ángeles Serrano M, Flammini A, Menczer F. 2009 Modeling statistical properties of written text. PLoS ONE 4, e5372. (doi:10.1371/journal.pone.0005372)