Journal of Quantitative Linguistics, 2016
Vol. 23, No. 2, 177–190, http://dx.doi.org/10.1080/09296174.2016.1142327
Downloaded by [Farmaceuticka Fakulta UK], [Ján Mautek] at 03:53 08 July 2016
A Data-based Classification of Slavic Languages: Indices
of Qualitative Variation Applied to Grapheme Frequencies*
Michaela Koščová¹, Ján Mačutek¹ and Emmerich Kelih²
¹Department of Applied Mathematics and Statistics, Comenius University, Bratislava,
Slovakia; ²Department of Slavonic Studies, University of Vienna, Austria
ABSTRACT
The Ord graph is a simple graphical method for displaying frequency distributions of data or
theoretical distributions in the two-dimensional plane. Its coordinates are proportions of the
first three moments, either empirical or theoretical. A modification of the Ord graph based
on proportions of indices of qualitative variation is presented. Such a modification makes the
graph applicable also to categorical data. In addition, the indices are normalized with values
between 0 and 1, which enables comparison of data files divided into different numbers of
categories. Both the original and the new graph are used to display grapheme frequencies in
eleven Slavic languages. As the original Ord graph requires an assignment of numbers to the
categories, graphemes are ordered by decreasing frequency. Data are taken from parallel corpora; in the present instance these are grapheme frequencies from a Russian novel and its
translations into ten other Slavic languages. Cluster analysis is then applied to the graph
coordinates. While the original graph yields results which are not linguistically interpretable,
its modification reveals meaningful relations among the languages.
1. INTRODUCTION
Ord (1967b) suggested a simple graphical representation of discrete probability
distributions¹ in the two-dimensional plane. However, his idea can be applied directly
to continuous distributions as well. The coordinates of a distribution in the graph are
given as ratios of its first three moments, namely, the mean $\mu$, the variance $\mu_2$
and the third central moment $\mu_3$ (which reflects the skewness). In general, all
distributions can be depicted for which the first three moments exist and the first two
of them are non-zero. Keeping the notation from Ord (1967b), the x- and y-coordinates
will be denoted by $I$ and $S$, respectively, with $I = \mu_2/\mu$ and
$S = \mu_3/\mu_2$. If all possible parameter values of a particular distribution are
considered, one obtains an area (or a curve, a line, a point) which characterizes the
distribution (we note that areas belonging to different distributions can overlap). Some
of them can be seen in Figure 1, which is taken from Ord (1967b).

*Address correspondence to: Ján Mačutek, Department of Applied Mathematics and
Statistics, Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská
dolina, 842 48 Bratislava, Slovakia. Tel: +421 2 60295717. E-mail: [email protected].
¹ In order to avoid confusion, we must mention that the same author also developed another
graphical method for discrete distributions, which was published in the same year, see Ord
(1967a), and also Friendly (2000).
© 2016 Informa UK Limited, trading as Taylor & Francis Group
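As a brief worked illustration of how a distribution family maps to a point or a line in
the $(I, S)$ plane, the following sketch uses the textbook moments of three standard
families. The parameter names $\lambda$, $n$, $p$, $q = 1 - p$ and $r$ are notation
assumed here for illustration and are not taken from the original text.

```latex
\begin{align*}
\text{Poisson}(\lambda):\quad
  & \mu = \mu_2 = \mu_3 = \lambda
  && \Rightarrow\quad I = 1,\quad S = 1, \\
\text{Binomial}(n, p):\quad
  & \mu = np,\ \mu_2 = npq,\ \mu_3 = npq(q - p)
  && \Rightarrow\quad I = q,\quad S = q - p = 2I - 1,\quad 0 < I < 1, \\
\text{Negative binomial}(r, p):\quad
  & \mu = \tfrac{rq}{p},\ \mu_2 = \tfrac{rq}{p^2},\ \mu_3 = \tfrac{rq(1 + q)}{p^3}
  && \Rightarrow\quad I = \tfrac{1}{p},\quad S = \tfrac{1 + q}{p} = 2I - 1,\quad I > 1.
\end{align*}
```

Under these assumptions all three families lie on the line $S = 2I - 1$: the Poisson
distribution is the single point $(1, 1)$, which separates the binomial segment from the
negative binomial half-line.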
Fig. 1. Graphical representation of discrete distributions from Ord (1967b).
If theoretical moments are replaced with empirical ones, the Ord graph
can also be used to display data, and can serve as a preliminary, intuitive
decision criterion for whether the data can be modelled by a particular distribution.
If the point representing the data lies within the area of the distribution, or not too
far away from it, a (relatively) good fit between the data and the model can be
expected. The graph also provides, among other things, the possibility of data
classification or clustering: points representing related data are supposed to be close
to each other.
The Ord graph has been applied to visualization of data not only in linguistics (Stadlober & Djuzelic, 2005; Grzybek & Rusko, 2009), but also in
other branches of research such as biology (Schneider & Duffy, 1985),
transport network modelling (Taylor, 1976; Beguin & Thomas, 1997), and
musicology (Martináková et al., 2009).
However, the Ord graph is not applicable to categorical data; for an overview of graphical methods suitable for such data see Blasius and Greenacre
(1998) and Friendly (2000). Especially in the case of nominal data, that is,
if there is no natural ordering of categories (see, e.g. Agresti, 2013, p. 3),
use of the graph would require an assignment of integers to the categories.
Such an assignment can only be arbitrary, and the arbitrariness leads almost
necessarily to ambiguities.
We will apply both the original Ord graph and a subsequent modification
of it (see Section 3) to grapheme frequencies in Slavic languages (see Section 2 for data description). Grapheme orderings, as they are established in
alphabets or other writing systems specific to particular languages, are the
result of traditions and/or conventions, which are not linguistically substantiated in the vast majority of languages. Slavic languages are not exceptional in this respect. Moreover, two further facts compromise any attempt
to achieve a grapheme ordering common to all Slavic languages. They not
only have different grapheme inventories, but languages from this family
also use writing systems based on two different scripts, namely Latin and
Cyrillic. These two scripts and their modifications follow different traditions
of grapheme orderings, e.g. the grapheme z appears towards the end of Slavic adaptations of the Latin alphabet (Comrie, 1996b), but its Cyrillic counterpart з is positioned around the eighth place (out of roughly 30,
depending on the language, see Section 2) in alphabets based on the Cyrillic script (Comrie, 1996a).
One of the reasonable possibilities left is to work with ranked frequencies,
where the most frequent grapheme is given the rank 1, the second most frequent the
rank 2, etc. The problem of ambiguities mentioned above is thereby solved; this approach
has enjoyed an increased popularity in recent years. There are several studies
available, mainly for Slavic languages (see Grzybek et al., 2009, and references
therein), but also for German (Grzybek, 2007), Irish and Manx (Wilson, 2013), and some
languages from West Africa (Rovenchak & Vydrin, 2010). The negative hypergeometric
distribution (see, e.g., Wimmer & Altmann, 1999, pp. 465–468) is tentatively considered
a general mathematical model that fits well the data from all languages studied so far.
However, its parameters, and hence also its moments, seem to depend on the inventory
size, that is, on the number of graphemes used in particular languages² (henceforth IS).
The dependence within the Slavic language family was demonstrated by Grzybek et al.
(2005) and Grzybek and Kelih (2005). Consequently, the Ord graph, which exploits the
moments, will reflect not only a measure of relatedness among Slavic languages, but will
also be influenced by their inventory sizes. We will show in Section 2 that the graph
constructed from grapheme rank-frequency distributions does not lead to linguistically
interpretable results.
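The ranking step and the empirical Ord coordinates can be sketched as follows. This is
a minimal illustration, not the authors' code: the function name `ord_coordinates` is
hypothetical, the toy string is not data from the paper, and the use of moments with
divisor N (rather than N - 1) is an assumption, since the text does not say which
version was used.

```python
from collections import Counter


def ord_coordinates(frequencies):
    """Return Ord's (I, S) for a rank-frequency distribution.

    The counts are sorted in decreasing order and the r-th largest count
    is assigned the rank r (1, 2, 3, ...), as described in the text.
    """
    ranked = sorted(frequencies, reverse=True)
    n = sum(ranked)
    pairs = list(enumerate(ranked, start=1))  # (rank, frequency)
    # Empirical mean of the rank variable.
    mean = sum(r * f for r, f in pairs) / n
    # Second and third central moments (divisor n: an assumption).
    m2 = sum(f * (r - mean) ** 2 for r, f in pairs) / n
    m3 = sum(f * (r - mean) ** 3 for r, f in pairs) / n
    return m2 / mean, m3 / m2


# Hypothetical toy text, not data from the paper.
counts = Counter("abracadabra")
I, S = ord_coordinates(counts.values())
print(f"I = {I:.4f}, S = {S:.4f}")
```

Because the ranks run from 1 to the inventory size, a larger inventory stretches the
support of the rank variable, which is one way to see why the moments, and hence I and S,
can be sensitive to the inventory size.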
In Section 3 we suggest a modification of the Ord graph in which
moments are replaced with so-called indices of qualitative variation (see
Wilcox, 1973). The new graph reveals a meaningful classification of Slavic
languages.
2. DATA DESCRIPTION
The grapheme frequencies to be analysed were obtained from the Russian
social realist novel Kak zakaljalas’ stal’ (How the Steel Was Tempered) and
its translations into 10 other Slavic languages. The book was written by Nikolai
Ostrovsky in the 1930s. It enjoyed the status of recommended reading, and
was consequently translated into the languages spoken in the countries of the
socialist bloc within a relatively short time period. The linguistic corpus consisting of the Russian (RUS henceforth, IS = 33) original and its translations
into Belarusian, Bulgarian (BUL, IS = 30), Croatian (CRO, IS = 30), Czech
(CZE, IS = 42), Macedonian (MAC, IS = 31), Polish (POL, IS = 32), Serbian
(SRB, IS = 30), Slovene (SLO, IS = 25), Slovak (SVK, IS = 43), Ukrainian
(UKR, IS = 34), and Upper Sorbian (UPS, IS = 37) was described by Kelih
(2009b).
Belarusian was omitted from consideration as its orthography differs substantially
from that of other Slavic languages. Belarusian has an explicit, phonetically
determined orthographic system: letters are used for coding phones and not phonemes
(and partly morphophonemes) as, e.g., in the case of Russian and Ukrainian. This
different coding approach has, among other things, the effect of an extreme
over-exploitation of particular graphemes (for details see Kelih, 2009a).
Rank-frequency distributions of graphemes from eleven³ Slavic languages can be found in
Table 1 and Table 2 (the languages are ordered decreasingly according to their grapheme
inventory sizes); they are displayed on the Ord graph in Figure 2 left.

² The determination of the grapheme inventory size is a complex linguistic issue; some
details specific to Slavic languages can be found in Kelih (2013).
Since the beginning of modern Slavic linguistics and typology in the
mid-19th century, the classification of Slavic languages has been discussed
many times. By now, a simple typology based on the geographical location
of the Slavic standard languages is more or less accepted. It divides the languages into three groups: East Slavic (Belarusian, Russian, Ukrainian), West
Slavic (Czech, Polish, Slovak, Upper and Lower Sorbian), and South Slavic
(Bulgarian, Croatian, Macedonian, Serbian, Slovene).
Cluster analysis⁴ was applied to the I- and S-coordinates from the Ord
graph, with three clusters pre-specified (indicated by ellipses in Figure 2
left). Clustering was performed using functions available in the statistical software R.
Two methods were used, namely, k-means and k-medoids. For Figure 2 left,
both methods yield the same clusters regardless of the choice of the algorithm for
the k-means method (Hartigan-Wong, Lloyd, MacQueen) and of the metric
for the k-medoids method (Euclidean, Manhattan). Figure 2 right presents
clusters resulting from the k-means method; the k-medoids method gives
clusters almost identical to the ones from Figure 2 right (the only difference
is that UPS migrates from the upper cluster to the one in the middle).
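To make the clustering step concrete, the following is a minimal pure-Python sketch of
the Lloyd variant of k-means with three pre-specified clusters. It is not the R code
used in the study: the function name `kmeans`, the sample points and the deterministic
choice of initial centres are all assumptions made for illustration, and the points are
hypothetical rather than the coordinates from Figure 2.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iteration for 2-D points; returns (labels, centers)."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres are too
        labels = new_labels
        # Update step: each centre moves to the mean of its cluster.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster became empty
                centers[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centers


# Hypothetical (I, S)-like points forming three well-separated groups.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.8)]
# Assumed initial centres: one point from each visible group.
labels, centers = kmeans(points, [points[0], points[3], points[6]])
print(labels)
```

Practical implementations usually start from several random initialisations; the fixed
starting centres here only keep the sketch deterministic.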
The results obtained are not linguistically meaningful (e.g. East Slavic
languages form one group with most of South Slavic ones; on the other
hand, Slovene is a single outlier, which is not explicable, since the historical development of its writing system is parallel with that of the other
Slavic languages, etc.). The only clue hinting at a linguistic explanation is
the grapheme inventory size of the languages analysed, as the clusters
coincide with the ones based on the sizes of grapheme inventories (Figure 2
right). Grapheme inventories, however, reflect history, traditions,
³ Two from among currently spoken standard Slavic languages were not included: Belarusian,
as was explained, was omitted because of its peculiar orthography; and Lower Sorbian,
because no suitable texts (i.e. long enough and comparable with analogous texts in other
Slavic languages) could be found (the language has only about 7000 speakers). We do not
intend to discuss here the status of one language/different languages/dialects of, e.g.,
Ukrainian/Rusyn, Bosnian/Croatian/Montenegrin/Serbian, Polish/Cassubian, etc.
⁴ For a (relatively) short overview of cluster analysis see, e.g., Izenman (2008).
Table 1. Grapheme rank-frequency distributions in Slovak, Czech, Upper Sorbian, Ukrainian,
Russian, and Polish.
Rank    SVK    CZE    UPS    UKR    RUS    POL
1     26490  20618  29440  25494  28305  26718
2     23869  20371  27097  22419  23509  25264
3     20564  19595  24691  17958  21205  22229
4     15166  15223  17213  16868  17140  20509
5     13204  14183  16201  15985  16143  18622
6     12842  12586  14719  14123  14868  14275
7     12233  12174  13527  12146  13980  13344
8     12137  11365  12224  11835  13265  12876
9     11959  11312  11500  11566  13103  12627
10    11548  10193  10995  10521  12693  12120
11    10010   9639  10640  10339  10004  11170
12     8981   9147  10113   9926   8396  10120
13     8569   8477   9647   9811   8147   9637
14     8293   8320   8425   8871   7834   9499
15     7389   8252   7725   8327   7733   8933
16     6729   6301   7697   7542   5479   8623
17     6051   5552   7238   6693   5191   8510
18     5496   5338   7182   5640   5045   6564
19     4282   5229   5625   4759   5026   5964
20     4270   5219   5540   4618   4957   5354
21     4267   4719   5341   4215   4498   4613
22     3697   4207   5201   3977   3679   4387
23     3352   4103   4135   3952   3288   4361
24     2772   3290   4024   3038   2859   3714
25     2498   3169   3579   2963   2667   3199
26     2424   2932   3412   2486   2506   2548
27     2358   2650   2888   2101   1556   2052
28     1867   2583   2867   1937   1098   1851
29     1722   2460   2813   1430    971   1220
30     1456   2098   2668   1340    539    416
31     1276   2032   2241    878    312    406
32      642    892    607    282     59    254
33      601    541    505    242      0
34      581    253    276      1
35      366    213      0
36      320    188      0
37      186    182      0
38      100    169
39       94     86
40       30     12
41       10      7
42        6      0
43        0
Table 2. Grapheme rank-frequency distributions in Macedonian, Bulgarian, Croatian, Serbian,
and Slovene.
Rank    MAC    BUL    CRO    SRB    SLO
1     40232  36841  32444  32507  30849
2     30122  24724  25820  25823  29708
3     28420  23098  24952  24709  26129
4     20985  21644  24320  23473  25886
5     20793  19535  13457  13332  17175
6     17111  17133  13215  13168  15921
7     13634  13867  12958  12888  15045
8     13152  13394  12759  12728  14144
9     11613  12224  11581  11453  14139
10    10640  11329  10237   9949  12402
11    10591   9197   9958   9929  11569
12     7815   8542   9885   9661  11412
13     7753   7950   9741   9163  10029
14     7123   7339   9139   8296   9167
15     6327   6197   8384   7958   8753
16     6127   5633   7779   7794   6441
17     5440   5309   5047   5015   5515
18     5219   4770   4688   4732   5336
19     5191   4554   3808   3889   4755
20     4360   4344   3768   3797   4429
21     3203   4035   3075   3004   3054
22     2015   3220   2258   2239   2923
23     1798   2681   2225   2194   1967
24     1540   2197   1810   1832   1893
25      803   1956   1769   1703    230
26      563   1936   1709   1592
27      365   1464   1665   1512
28      303    362    637    649
29      171    336    241    278
30       66    320     55     77
31       35
conventions, etc. of a language (see also Section 1) more than linguistic
laws and relations among languages. Furthermore, they are extremely conservative and highly resistant to changes which, if they occur, more often
than not follow sudden historical/political changes rather than slow, continuous ones.
Given that moments of the grapheme rank-frequency distributions
depend, at least for Slavic languages, on the inventory sizes (Grzybek et al.,
2005; Grzybek & Kelih, 2005), the coincidence of clusters in Figure 2 left
and Figure 2 right is not surprising.
Fig. 2. Original Ord graph for grapheme rank-frequency distributions (left), with cluster
analysis applied to graph coordinates; Slavic languages clustered according to their inventory
sizes (right).
3. MODIFIED ORD GRAPH
Consider $N$ data items divided into $K$ categories and denote by $f_i$ the frequency
of the $i$-th category. Wilcox (1973) discussed several measures of variation
applicable also to nominal data, among them the variance analogue

$$\mathrm{VA} = 1 - \frac{\sum_{i=1}^{K} \left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}, \qquad (1)$$

the standard deviation analogue

$$\mathrm{SDA} = 1 - \sqrt{\frac{\sum_{i=1}^{K} \left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}}, \qquad (2)$$

and the relative entropy

$$\mathrm{RE} = -\frac{\sum_{i=1}^{K} \frac{f_i}{N} \log \frac{f_i}{N}}{\log K}, \qquad (3)$$

where log denotes the natural logarithm.
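The three indices are straightforward to compute from a list of category frequencies.
The sketch below is an illustration under stated assumptions rather than the authors'
implementation: the function name `variation_indices` and the toy counts are
hypothetical, and empty categories (zero counts do occur at the lowest ranks of some
columns in Table 1) are handled with the usual convention 0 · log 0 = 0, which the text
does not spell out.

```python
import math


def variation_indices(freqs):
    """Return (VA, SDA, RE) from equations (1)-(3) for category counts."""
    n = sum(freqs)
    k = len(freqs)
    if k < 2 or n <= 0:
        raise ValueError("need at least two categories and a positive total")
    # Squared deviations from the uniform share N/K, scaled to [0, 1].
    q2 = sum((f - n / k) ** 2 for f in freqs) / (n ** 2 * (k - 1) / k)
    va = 1 - q2
    sda = 1 - math.sqrt(q2)
    # Entropy over non-empty categories only (0 * log 0 treated as 0).
    entropy = -sum((f / n) * math.log(f / n) for f in freqs if f > 0)
    rel_entropy = entropy / math.log(k)
    return va, sda, rel_entropy


# Hypothetical toy counts, not data from the paper.
print(variation_indices([5, 5, 5, 5]))    # uniform: all three indices near 1
print(variation_indices([20, 0, 0, 0]))   # a single category: all near 0
print(variation_indices([10, 6, 3, 1]))   # something in between
```

Since only sums over the frequencies enter the formulas, permuting the list leaves all
three values unchanged up to floating-point rounding.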
In Wilcox (1973) these measures are called indices of qualitative variation. They have at least two properties which distinguish them from the
usual measures of variation like variance, standard deviation, and so on.
Firstly, they are invariant with respect to the ordering of categories, that is,
they depend solely on frequencies. Secondly, all of them are normalized,
with possible values from the interval [0,1]; for all of them, value 0 is
attained if all objects are in one category and other categories are empty,
and value 1 corresponds to the uniform distribution, with all categories having the same frequencies. Thus, if one considers grapheme frequencies in
Slavic languages, indices of qualitative variation can be a response to ambiguities related to the two traditions of grapheme orderings. They also eliminate influences of different inventory sizes.
Given these advantages, we applied the indices (1) to (3) to modify the
Ord graph. The modified coordinates are defined as

$$I_m = \mathrm{SDA}/\mathrm{VA} \qquad (4)$$

and

$$S_m = \mathrm{RE}/\mathrm{SDA}. \qquad (5)$$

It is easy to see that $I_m$ could be simplified to the form

$$I_m = \frac{1}{1 + \sqrt{\frac{\sum_{i=1}^{K} \left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}}}. \qquad (6)$$
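The simplification can be checked in one line. As a sketch, write $q$ (a shorthand
introduced here, not used in the original) for the square root appearing in (2), so that
$\mathrm{SDA} = 1 - q$ and $\mathrm{VA} = 1 - q^2$:

```latex
I_m = \frac{\mathrm{SDA}}{\mathrm{VA}}
    = \frac{1 - q}{1 - q^2}
    = \frac{1 - q}{(1 - q)(1 + q)}
    = \frac{1}{1 + q},
\qquad
q = \sqrt{\frac{\sum_{i=1}^{K} \left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}}.
```

The cancellation requires $q \neq 1$, that is, it excludes the degenerate case in which
all items fall into a single category. In all other cases $0 \le q < 1$, so $I_m$ takes
values in the interval $(1/2, 1]$.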
However, we prefer to keep the form (4) for two reasons, the first being to
highlight an analogy with the original Ord graph (other measures of qualitative
variation can be more useful for analyses of other types of data, see
Section 4), and the second being that, for linguistic data specifically, the
form (4) can be simpler to interpret. Its denominator is, in fact, the normalized
repeat rate

$$RR_{norm} = \frac{K}{K-1} \left(1 - \frac{\sum_{i=1}^{K} f_i^2}{N^2}\right) \qquad (7)$$

(see Gibbs & Poston, 1975), which is one of the standard characteristics in
quantitative linguistics.
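The identity between the variance analogue (1) and the normalized repeat rate (7) is not
spelled out in the text; a short derivation, using only $\sum_{i=1}^{K} f_i = N$, runs
as follows.

```latex
\begin{align*}
\sum_{i=1}^{K} \left(f_i - \frac{N}{K}\right)^2
  &= \sum_{i=1}^{K} f_i^2 - \frac{2N}{K} \sum_{i=1}^{K} f_i + K \cdot \frac{N^2}{K^2}
   = \sum_{i=1}^{K} f_i^2 - \frac{N^2}{K}, \\
\mathrm{VA}
  &= 1 - \frac{K \left(\sum_{i=1}^{K} f_i^2 - \frac{N^2}{K}\right)}{N^2 (K-1)}
   = \frac{N^2 (K-1) - K \sum_{i=1}^{K} f_i^2 + N^2}{N^2 (K-1)} \\
  &= \frac{K}{K-1} \left(1 - \frac{\sum_{i=1}^{K} f_i^2}{N^2}\right)
   = RR_{norm}.
\end{align*}
```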
Figure 3 shows the new graph applied, again, to grapheme frequencies in
Slavic languages; we emphasize that the order of graphemes within a language is
irrelevant in this case. Clusters created from its coordinates $I_m$ and
$S_m$ (ellipses in Figure 3) present a pattern quite different from that in
Figure 2. The proposed classification reveals interesting findings on the
typology of Slavic languages; the resulting clusters are the same, again,
regardless of the method, algorithm or metric used, see Section 2.
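As an illustration of how one point of the modified graph is obtained, the sketch below
computes $(I_m, S_m)$ for the Serbian and Slovene columns of Table 2. The function name
`modified_ord` is hypothetical, zero counts are skipped in the entropy (an assumed
convention), and no output values are quoted here because the paper reports the
coordinates only graphically.

```python
import math

# Ranked grapheme frequencies copied from the SRB and SLO columns of Table 2.
SRB = [32507, 25823, 24709, 23473, 13332, 13168, 12888, 12728, 11453, 9949,
       9929, 9661, 9163, 8296, 7958, 7794, 5015, 4732, 3889, 3797,
       3004, 2239, 2194, 1832, 1703, 1592, 1512, 649, 278, 77]
SLO = [30849, 29708, 26129, 25886, 17175, 15921, 15045, 14144, 14139, 12402,
       11569, 11412, 10029, 9167, 8753, 6441, 5515, 5336, 4755, 4429,
       3054, 2923, 1967, 1893, 230]


def modified_ord(freqs):
    """Return (I_m, S_m) as defined in equations (4) and (5)."""
    n, k = sum(freqs), len(freqs)
    # q is the square root that appears in the standard deviation analogue.
    q = math.sqrt(sum((f - n / k) ** 2 for f in freqs) / (n ** 2 * (k - 1) / k))
    va, sda = 1 - q ** 2, 1 - q
    # Zero counts are skipped, i.e. 0 * log 0 is taken as 0 (an assumption).
    rel_entropy = -sum(f / n * math.log(f / n) for f in freqs if f > 0) / math.log(k)
    return sda / va, rel_entropy / sda


for name, freqs in (("SRB", SRB), ("SLO", SLO)):
    i_m, s_m = modified_ord(freqs)
    print(f"{name}: K = {len(freqs)}, I_m = {i_m:.4f}, S_m = {s_m:.4f}")
```

The resulting pairs are the kind of two-dimensional input to which a clustering method
can then be applied; the order of the values within each list plays no role.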
First of all, there is a group of South Slavic languages which perfectly
fits their geographical location. Bulgarian, Croatian, Macedonian, Serbian,
and Slovene form one homogeneous group. The orthographic systems of
these languages are well organized with respect to the economy of coding
of some specific prosodic features like the pitch accent in Croatian, Serbian,
and Slovene, and to the marking of palatalized consonants in Bulgarian
(marked with a specific vocalic grapheme). Macedonian is one of the
youngest standard languages, codified in 1945, and its orthography is largely based on the same principles as Serbian, that is, one letter for one
sound.
Fig. 3. Modified Ord graph for grapheme frequency distributions, with cluster analysis
applied to graph coordinates.
The second group can be called the basic West Slavic languages, and
includes Czech and Slovak. The two languages are typologically quite similar in general, including their orthographic and phonemic systems, and thus
their location in one group is justified.
In Figure 3, Russian, Ukrainian, Polish, and Upper Sorbian form a group.
If one compares it with the traditional geographical classification, this North
Slavic group seems to be a mixture of East Slavic (Russian and Ukrainian)
and West Slavic languages (Polish and Upper Sorbian). However, if orthographic and phonemic criteria are taken into account, these languages share
some common features, namely, they are characterized by a systematic correlation of the consonantal system palatalization (i.e. consonants tend to
have both “hard” and “soft” versions). Indeed, these characteristics play a
very important role in Russian and Ukrainian, whereas a regression of
palatalization was reported for Polish and especially for Upper Sorbian.
The groups resulting from cluster analysis of the modified Ord graph
coordinates differ slightly from the traditional, area-based typology of
Slavic languages, but they suggest another, linguistically justifiable classification. It corresponds to the approach of Kolomiec & Mel’ničuk (1986),
where a group of North Slavic languages (Russian, Ukrainian, Polish) is
mentioned; they are characterized by a high number of consonants in their
inventories, whereas South Slavic languages mainly enlarged their vowel
inventory (for a detailed discussion of vocalic and consonantal Slavic
languages see Sawicka, 1991).
4. CONCLUSION
Our modification of the Ord graph yields linguistically motivated and interpretable results. Cluster analysis applied to the coordinates of the new graph
reveals groups of Slavic languages which share some common features, as
far as orthography and phonology are concerned. Thus, the application of the
modified Ord graph to grapheme frequencies can be seen as a contribution
to the typology of Slavic languages. When compared with their traditional,
purely geographical classification, this new approach has the advantage of
being based on empirically observed data.
Admittedly, the definition of the modified graph coordinates (4) and (5)
used in this paper – i.e. the choice of indices (1)–(3) – is heuristic only.
Apart from the fact that they yield linguistically relevant results in this
case,⁵ there is no other reason why they should be preferred. It can be
expected that other indices of qualitative variation (see e.g. Gibbs & Poston,
1975, or Wilcox, 1973; a comprehensive overview of variation measures is
provided by Gadrich et al., 2015) will be more reasonable for categorical
data arising in other branches of science. A deeper analysis of the indices
and their ratios, which could possibly lead to more general interpretations,
remains a challenge for mathematicians.⁶
Regardless of the choice of indices, the method is computationally very
simple, and the results it yields are also easy to understand, as they are displayed
in the two-dimensional plane. In addition, it represents categorical
data by two real-valued coordinates, thus enabling application of statistical
classification or clustering methods.

⁵ Preliminary analyses indicate that our modification (i.e. the one which uses ratios of the
variance analogue, the standard deviation analogue, and the relative entropy) of the Ord
graph works well also for other languages (e.g. it can discriminate between the Germanic
and Romance languages). However, more reliable and more complete data must be
investigated before these results can be published.
⁶ The same is true, however, for other relatively new graphical methods; e.g. Cullen and
Frey (1999) suggested another graph which, like the original Ord graph, exploits moments,
with the square of the skewness on the x-axis and the kurtosis on the y-axis.

DISCLOSURE STATEMENT
No potential conflict of interest was reported by the authors.
FUNDING
Supported by the grants VEGA [2/0047/15] (M. Koščová, J. Mačutek); Comenius University
grant [UK/83/2015] (M. Koščová).
REFERENCES
Agresti, A. (2013). Categorical Data Analysis. Chichester: Wiley.
Beguin, H., & Thomas, I. (1997). Morphologie du réseau de communication et localizations
optimales d’activités. Quelle mesure pour exprimer la forme d’un réseau? Cybergeo
European Journal of Geography, article no. 26.
Blasius, J., & Greenacre, M. (1998). Visualization of Categorical Data. San Diego, CA:
Academic Press.
Comrie, B. (1996a). Adaptations of the Cyrillic alphabet. In: P. T. Daniels & W. Bright
(Eds), The World’s Writing Systems (pp. 700–726). Oxford: Oxford University Press.
Comrie, B. (1996b). Languages of Eastern and Southern Europe. In: P. T. Daniels & W.
Bright (Eds), The World’s Writing Systems (pp. 663–688). Oxford: Oxford University
Press.
Cullen, A. C., & Frey, H. C. (1999). Probabilistic Techniques in Exposure Assessment. A
Handbook for Dealing with Variability and Uncertainty in Models and Inputs. New
York, NY: Plenum Press.
Friendly, M. (2000). Visualizing Categorical Data. Cary, NC: SAS Institute.
Gadrich, T., Bashkansky, E., & Zitikis, R. (2015). Assessing variation: a unifying approach
for all scales of measurement. Quality & Quantity, 49, 1145–1167.
Gibbs, J. P., & Poston, D. L. (1975). The division of labor: Conceptualization and related
measures. Social Forces, 53, 468–476.
Grzybek, P. (2007). On the systematic and system-based study of grapheme frequencies: A
re-analysis of German letter frequencies. Glottometrics, 15, 82–91.
Grzybek, P., & Kelih, E. (2005). Towards a general model of grapheme frequencies in Slavic
languages. In: R. Garabík (Ed), Computer Treatment of Slavic and East European
Languages (pp. 73–87). Bratislava: Veda.
Grzybek, P., & Rusko, M. (2009). Letter, grapheme and (allo-)phone frequencies: The case
of Slovak. Glottotheory, 2(1), 30–48.
Grzybek, P., Kelih, E., & Altmann, G. (2005). Graphemhäufigkeiten (am Beispiel des Russischen). Teil III: Die Bedeutung des Inventarumfangs – eine Nebenbemerkung zur
Diskussion um das ‘ë’. Anzeiger für Slavische Philologie, 33, 117–140.
Grzybek, P., Kelih, E., & Stadlober, E. (2009). Slavic letter frequencies: A common discrete
model and regular parameter behavior? In: R. Köhler (Ed), Issues in Quantitative Linguistics (pp. 17–33). Lüdenscheid: RAM-Verlag.
Izenman, A. J. (2008). Modern Multivariate Statistical Techniques. Regression, Classification, and Manifold Learning. Berlin: Springer.
Kelih, E. (2009a). Graphemhäufigkeiten in slawischen Sprachen: Stetige Modelle. Glottometrics, 18, 53–68.
Kelih, E. (2009b). Slawisches Parallel-Textkorpus: Projektvorstellung von “Kak zakaljalas’
stal’ (KZS)”. In: E. Kelih, V. Levickij & G. Altmann (Eds), Methods of Text Analysis
(pp. 106–124). Chernivtsi: ChNU.
Kelih, E. (2013). Grapheme inventory size and repeat rate in Slavic languages. Glottotheory,
4(1), 56–71.
Kolomiec, V. T., & Mel’ničuk, A. S. (1986). Istoričeskaja tipologija slavjanskich jazykov.
Fonetika, slovoobrazovanie, leksika i frazeologija. Kiev: Naukova Dumka.
Martináková, Z., Mačutek, J., Popescu, I.-I., & Altmann, G. (2009). Ord’s criterion in
musical texts. Glottotheory, 2(1), 86–98.
Ord, J. K. (1967a). Graphical methods for a class of discrete distributions. Journal of the
Royal Statistical Society A, 130(2), 232–238.
Ord, J. K. (1967b). On a system of discrete distributions. Biometrika, 54(3/4), 649–656.
Rovenchak, A., & Vydrin, V. (2010). Quantitative properties of the Nko writing system. In
P. Grzybek, E. Kelih & J. Mačutek (Eds), Text and Language. Structures, Functions,
Interrelations, Quantitative Perspectives (pp. 171–181). Wien: Praesens.
Sawicka, I. (1991). Problems of the phonetic typology of the Slavic languages. In: I. Sawicka & A. Holvoet (Eds), Studies in the Phonetic Typology of the Slavic Languages
(pp. 13–35). Warszawa: Omnitech Press.
Schneider, D. C., & Duffy, D. C. (1985). Scale-dependent variability in seabird abundance.
Marine Ecology Progress Series, 25, 211–218.
Stadlober, E., & Djuzelic, M. (2005). Multivariate statistical methods in quantitative text
analyses. In P. Grzybek (Ed.), Contributions to the Science of Text and Language.
Word Length Studies and Related Issues (pp. 259–275). Dordrecht: Springer.
Taylor, Z. (1976). Accessibility of urban transport systems. The case of Poznań city.
Geographia Polonica, 33(2), 121–141.
Wilcox, A. R. (1973). Indices of qualitative variation and political measurement. Western
Political Quarterly, 26(2), 325–343.
Wilson, A. (2013). Probability distributions of grapheme frequencies in Irish and Manx.
Journal of Quantitative Linguistics, 20(3), 169–177.
Wimmer, G., & Altmann, G. (1999). Thesaurus of Univariate Discrete Probability
Distributions. Essen: Stamm.