Journal of Quantitative Linguistics, 2016
Vol. 23, No. 2, 177–190, http://dx.doi.org/10.1080/09296174.2016.1142327

Downloaded by [Farmaceuticka Fakulta UK], [Ján Mautek] at 03:53 08 July 2016

A Data-based Classification of Slavic Languages: Indices of Qualitative Variation Applied to Grapheme Frequencies*

Michaela Koščová¹, Ján Mačutek¹ and Emmerich Kelih²

¹ Department of Applied Mathematics and Statistics, Comenius University, Bratislava, Slovakia; ² Department of Slavonic Studies, University of Vienna, Austria

ABSTRACT

The Ord graph is a simple graphical method for displaying frequency distributions of data or theoretical distributions in the two-dimensional plane. Its coordinates are proportions of the first three moments, either empirical or theoretical. A modification of the Ord graph based on proportions of indices of qualitative variation is presented. Such a modification makes the graph applicable also to categorical data. In addition, the indices are normalized with values between 0 and 1, which enables comparison of data files divided into different numbers of categories. Both the original and the new graph are used to display grapheme frequencies in eleven Slavic languages. As the original Ord graph requires an assignment of numbers to the categories, graphemes are ordered by decreasing frequency. Data are taken from parallel corpora; in the present instance these are grapheme frequencies from a Russian novel and its translations into ten other Slavic languages. Cluster analysis is then applied to the graph coordinates. While the original graph yields results which are not linguistically interpretable, its modification reveals meaningful relations among the languages.

1. INTRODUCTION

Ord (1967b) suggested a simple graphical representation of discrete probability distributions¹ in the two-dimensional plane. However, his idea can directly be applied also to continuous distributions.
The coordinates of a distribution in the graph are given as ratios of its first three moments, namely, the mean $\mu$, the variance $\mu_2$ and the third central moment $\mu_3$ (which reflects the skewness). In general, any distribution can be depicted whose first three moments exist and whose first two moments are non-zero. Keeping the notation from Ord (1967b), the x- and y-coordinates will be denoted by I and S, respectively, with $I = \mu_2/\mu$ and $S = \mu_3/\mu_2$. If all possible parameter values of a particular distribution are considered, one obtains an area (or a curve, a line, a point) which characterizes the distribution (we note that areas belonging to different distributions can overlap). Some of them can be seen in Figure 1, which is taken from Ord (1967b).

Fig. 1. Graphical representation of discrete distributions from Ord (1967b).

* Address correspondence to: Ján Mačutek, Department of Applied Mathematics and Statistics, Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia. Tel: +421 2 60295717. E-mail: [email protected].
¹ In order to avoid confusion, we must mention that the same author also developed another graphical method for discrete distributions, which was published in the same year, see Ord (1967a), and also Friendly (2000).
© 2016 Informa UK Limited, trading as Taylor & Francis Group

If theoretical moments are replaced with empirical ones, the Ord graph can also be used to display data, and can serve as a preliminary, intuitive decision criterion for whether the data can be modelled by a particular distribution. If the point representing the data lies within the area of the distribution, or not too far away from it, a (relatively) good fit between the data
and the model can be expected. The graph also provides, among other things, the possibility of data classification or clustering – points representing related data are supposed to be close to each other. The Ord graph has been applied to visualization of data not only in linguistics (Stadlober & Djuzelic, 2005; Grzybek & Rusko, 2009), but also in other branches of research such as biology (Schneider & Duffy, 1985), transport network modelling (Taylor, 1976; Beguin & Thomas, 1997), and musicology (Martináková et al., 2009).

However, the Ord graph is not applicable to categorical data; for an overview of graphical methods suitable for such data see Blasius and Greenacre (1998) and Friendly (2000). Especially in the case of nominal data, that is, if there is no natural ordering of categories (see, e.g. Agresti, 2013, p. 3), use of the graph would require an assignment of integers to the categories. Such an assignment can only be arbitrary, and the arbitrariness leads almost necessarily to ambiguities.

We will apply both the original Ord graph and a subsequent modification of it (see Section 3) to grapheme frequencies in Slavic languages (see Section 2 for data description). Grapheme orderings, as they are established in alphabets or other writing systems specific to particular languages, are the result of traditions and/or conventions, which are not linguistically substantiated in the vast majority of languages. Slavic languages are not exceptional in this respect. Moreover, two further facts compromise any attempt to achieve a grapheme ordering common to all Slavic languages. They not only have different grapheme inventories, but languages from this family also use writing systems based on two different scripts, namely Latin and Cyrillic. These two scripts and their modifications follow different traditions of grapheme orderings, e.g.
the grapheme z appears towards the end of Slavic adaptations of the Latin alphabet (Comrie, 1996b), but its Cyrillic counterpart з is positioned around the eighth place (out of roughly 30, depending on the language, see Section 2) in alphabets based on the Cyrillic script (Comrie, 1996a).

One of the reasonable possibilities left is to work with ranked frequencies, where the most frequent grapheme is given rank 1, the second most frequent rank 2, etc. The problem of ambiguities mentioned above is thereby solved; this approach has enjoyed increased popularity in recent years. There are several studies available, mainly for Slavic languages (see Grzybek et al., 2009, and references therein), but also for German (Grzybek, 2007), Irish and Manx (Wilson, 2013), and some languages from West Africa (Rovenchak & Vydrin, 2010). The negative hypergeometric distribution (see, e.g., Wimmer & Altmann, 1999, pp. 465–468) is tentatively considered a general mathematical model that fits the data from all languages studied so far well. However, its parameters and hence also its moments seem to depend on the inventory size, that is, on the number of graphemes used in particular languages² (henceforth IS). The dependence within the Slavic language family was demonstrated by Grzybek et al. (2005) and Grzybek and Kelih (2005). Consequently the Ord graph, which exploits the moments, will reflect not only a measure of relatedness among Slavic languages, but will also be influenced by their inventory sizes.

We will show in Section 2 that the graph constructed from grapheme rank-frequency distributions does not lead to linguistically interpretable results. In Section 3 we suggest a modification of the Ord graph in which moments are replaced with so-called indices of qualitative variation (see Wilcox, 1973). The new graph reveals a meaningful classification of Slavic languages.

² The determination of the grapheme inventory size is a complex linguistic issue; some details specific to Slavic languages can be found in Kelih (2013).

2. DATA DESCRIPTION

The grapheme frequencies to be analysed were obtained from the Russian social realist novel Kak zakaljalas’ stal’ (How the Steel Was Tempered) and its translations into ten other Slavic languages. The book was written by Nikolai Ostrovsky in the 1930s. It enjoyed the status of recommended reading, and was consequently translated into the languages spoken in the countries of the socialist bloc within a relatively short time period. The linguistic corpus consisting of the Russian (RUS henceforth, IS = 33) original and its translations into Belarusian, Bulgarian (BUL, IS = 30), Croatian (CRO, IS = 30), Czech (CZE, IS = 42), Macedonian (MAC, IS = 31), Polish (POL, IS = 32), Serbian (SRB, IS = 30), Slovene (SLO, IS = 25), Slovak (SVK, IS = 43), Ukrainian (UKR, IS = 34), and Upper Sorbian (UPS, IS = 37) was described by Kelih (2009b).

Belarusian was omitted from consideration as its orthography differs substantially from that of other Slavic languages. Belarusian has an explicit, phonetically determined orthographic system: letters are used for coding phones and not phonemes (and partly morphophonemes) as, e.g., in the case of Russian and Ukrainian. This different coding approach has, among other things, the effect of an extreme over-exploitation of particular graphemes (for details see Kelih, 2009a).

Rank-frequency distributions of graphemes from eleven³ Slavic languages can be found in Table 1 and Table 2 (the languages are ordered decreasingly according to their grapheme inventory sizes); they are displayed on the Ord graph in Figure 2 left.
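To make the placement of a rank-frequency distribution on the original Ord graph concrete, the following minimal sketch computes $I = \mu_2/\mu$ and $S = \mu_3/\mu_2$ from empirical moments taken over ranks. The function name `ord_coordinates` and the toy counts are illustrative assumptions, not part of the original study. The sketch also assumes moments normalized by the total count N rather than by N − 1.

```python
def ord_coordinates(freqs):
    """Return Ord's (I, S) for category counts ranked by decreasing frequency.

    Assumption: empirical moments are divided by the total count N
    (population-style moments); the paper does not state the convention.
    """
    ranked = sorted(freqs, reverse=True)      # rank 1 = most frequent grapheme
    n = sum(ranked)
    ranks = range(1, len(ranked) + 1)
    mean = sum(r * f for r, f in zip(ranks, ranked)) / n
    m2 = sum(f * (r - mean) ** 2 for r, f in zip(ranks, ranked)) / n  # variance
    m3 = sum(f * (r - mean) ** 3 for r, f in zip(ranks, ranked)) / n  # third central moment
    return m2 / mean, m3 / m2


# Hypothetical toy counts for two categories (not data from the corpus).
I, S = ord_coordinates([3, 1])
print(I, S)
```

Because the counts are sorted before ranks are assigned, the result does not depend on the order in which the graphemes are listed, but it does depend on how many categories there are. The division by the variance fails for a single-category input, in line with the requirement that the first two moments be non-zero.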
Since the beginning of modern Slavic linguistics and typology in the mid-19th century, the classification of Slavic languages has been discussed many times. By now, a simple typology based on the geographical location of the Slavic standard languages is more or less accepted. It divides the languages into three groups: East Slavic (Belarusian, Russian, Ukrainian), West Slavic (Czech, Polish, Slovak, Upper and Lower Sorbian), and South Slavic (Bulgarian, Croatian, Macedonian, Serbian, Slovene).

Cluster analysis⁴ was applied to the I- and S-coordinates from the Ord graph, with three clusters pre-specified (indicated by ellipses in Figure 2 left). Clustering was performed using the statistical software available in R. Two methods were used, namely, k-means and k-medoids. In Figure 2 left, they yield the same clusters regardless of the choice of the algorithm for the k-means method (Hartigan-Wong, Lloyd, MacQueen) and of the metric for the k-medoids method (Euclidean, Manhattan). Figure 2 right presents clusters resulting from the k-means method; the k-medoids method gives clusters almost identical to the ones from Figure 2 right (the only difference is that UPS migrates from the upper cluster to the one in the middle). The results obtained are not linguistically meaningful (e.g. East Slavic languages form one group with most of South Slavic ones; on the other hand, Slovene is a single outlier, which is not explicable, since the historical development of its writing system is parallel with that of the other Slavic languages, etc.). The only clue hinting at a linguistic explanation is the grapheme inventory size of the languages analysed, as the clusters coincide with the ones based on the sizes of grapheme inventories (Figure 2 right).
Grapheme inventories, however, reflect history, traditions, conventions, etc. of a language (see also Section 1) more than linguistic laws and relations among languages. Furthermore, they are extremely conservative and highly resistant to changes which, if they occur, more often than not follow sudden historical/political changes rather than slow, continuous ones. Given that moments of the grapheme rank-frequency distributions depend, at least for Slavic languages, on the inventory sizes (Grzybek et al., 2005; Grzybek & Kelih, 2005), the coincidence of clusters in Figure 2 left and Figure 2 right is not surprising.

³ Two from among currently spoken standard Slavic languages were not included: Belarusian, as was explained, was omitted because of its peculiar orthography; and Lower Sorbian, because no suitable texts (i.e. long enough and comparable with analogous texts in other Slavic languages) could be found (the language has about 7000 speakers only). We do not intend to discuss here the status of one language/different languages/dialects of, e.g., Ukrainian/Rusyn, Bosnian/Croatian/Montenegrin/Serbian, Polish/Cassubian, etc.
⁴ For a (relatively) short overview of the cluster analysis see e.g. Izenman (2008).

Table 1. Grapheme rank-frequency distributions in Slovak, Czech, Upper Sorbian, Ukrainian, Russian, and Polish.

Rank    SVK    CZE    UPS    UKR    RUS    POL
   1  26490  20618  29440  25494  28305  26718
   2  23869  20371  27097  22419  23509  25264
   3  20564  19595  24691  17958  21205  22229
   4  15166  15223  17213  16868  17140  20509
   5  13204  14183  16201  15985  16143  18622
   6  12842  12586  14719  14123  14868  14275
   7  12233  12174  13527  12146  13980  13344
   8  12137  11365  12224  11835  13265  12876
   9  11959  11312  11500  11566  13103  12627
  10  11548  10193  10995  10521  12693  12120
  11  10010   9639  10640  10339  10004  11170
  12   8981   9147  10113   9926   8396  10120
  13   8569   8477   9647   9811   8147   9637
  14   8293   8320   8425   8871   7834   9499
  15   7389   8252   7725   8327   7733   8933
  16   6729   6301   7697   7542   5479   8623
  17   6051   5552   7238   6693   5191   8510
  18   5496   5338   7182   5640   5045   6564
  19   4282   5229   5625   4759   5026   5964
  20   4270   5219   5540   4618   4957   5354
  21   4267   4719   5341   4215   4498   4613
  22   3697   4207   5201   3977   3679   4387
  23   3352   4103   4135   3952   3288   4361
  24   2772   3290   4024   3038   2859   3714
  25   2498   3169   3579   2963   2667   3199
  26   2424   2932   3412   2486   2506   2548
  27   2358   2650   2888   2101   1556   2052
  28   1867   2583   2867   1937   1098   1851
  29   1722   2460   2813   1430    971   1220
  30   1456   2098   2668   1340    539    416
  31   1276   2032   2241    878    312    406
  32    642    892    607    282     59    254
  33    601    541    505    242      0
  34    581    253    276      1
  35    366    213      0
  36    320    188      0
  37    186    182      0
  38    100    169
  39     94     86
  40     30     12
  41     10      7
  42      6      0
  43      0

Table 2. Grapheme rank-frequency distributions in Macedonian, Bulgarian, Croatian, Serbian, and Slovene.

Rank    MAC    BUL    CRO    SRB    SLO
   1  40232  36841  32444  32507  30849
   2  30122  24724  25820  25823  29708
   3  28420  23098  24952  24709  26129
   4  20985  21644  24320  23473  25886
   5  20793  19535  13457  13332  17175
   6  17111  17133  13215  13168  15921
   7  13634  13867  12958  12888  15045
   8  13152  13394  12759  12728  14144
   9  11613  12224  11581  11453  14139
  10  10640  11329  10237   9949  12402
  11  10591   9197   9958   9929  11569
  12   7815   8542   9885   9661  11412
  13   7753   7950   9741   9163  10029
  14   7123   7339   9139   8296   9167
  15   6327   6197   8384   7958   8753
  16   6127   5633   7779   7794   6441
  17   5440   5309   5047   5015   5515
  18   5219   4770   4688   4732   5336
  19   5191   4554   3808   3889   4755
  20   4360   4344   3768   3797   4429
  21   3203   4035   3075   3004   3054
  22   2015   3220   2258   2239   2923
  23   1798   2681   2225   2194   1967
  24   1540   2197   1810   1832   1893
  25    803   1956   1769   1703    230
  26    563   1936   1709   1592
  27    365   1464   1665   1512
  28    303    362    637    649
  29    171    336    241    278
  30     66    320     55     77
  31     35

Fig. 2.
Original Ord graph for grapheme rank-frequency distributions (left), with cluster analysis applied to graph coordinates; Slavic languages clustered according to their inventory sizes (right).

3. MODIFIED ORD GRAPH

Consider N data items divided into K categories and denote by $f_i$ the frequency of the i-th category. Wilcox (1973) discussed several measures of variation applicable also to nominal data, among them the variance analogue

$$\mathrm{VA} = 1 - \frac{\sum_{i=1}^{K}\left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}, \tag{1}$$

the standard deviation analogue

$$\mathrm{SDA} = 1 - \sqrt{\frac{\sum_{i=1}^{K}\left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}}, \tag{2}$$

and the relative entropy

$$\mathrm{RE} = -\frac{\sum_{i=1}^{K}\frac{f_i}{N}\log\frac{f_i}{N}}{\log K}, \tag{3}$$

where log denotes the natural logarithm.

In Wilcox (1973) these measures are called indices of qualitative variation. They have at least two properties which distinguish them from the usual measures of variation like variance, standard deviation, and so on. Firstly, they are invariant with respect to the ordering of categories, that is, they depend solely on frequencies. Secondly, all of them are normalized, with possible values from the interval [0,1]; for all of them, value 0 is attained if all objects are in one category and other categories are empty, and value 1 corresponds to the uniform distribution, with all categories having the same frequencies.

Thus, if one considers grapheme frequencies in Slavic languages, indices of qualitative variation can be a response to ambiguities related to the two traditions of grapheme orderings. They also eliminate influences of different inventory sizes. Given these advantages, we applied the indices (1) to (3) to modify the Ord graph.
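The three indices (1) to (3) can be computed directly from a list of category counts. The sketch below is one possible implementation, not the authors' own code. The function name `qualitative_indices` is an assumption, as is the convention that an empty category contributes 0 to the entropy sum (0 · log 0 = 0). The counts in the usage example are the Slovene column of Table 2.

```python
import math


def qualitative_indices(freqs):
    """Return (VA, SDA, RE) as in equations (1)-(3) for a list of category counts."""
    n, k = sum(freqs), len(freqs)
    # Common ratio under the "1 -" in (1) and under the square root in (2).
    spread = sum((f - n / k) ** 2 for f in freqs) / (n * n * (k - 1) / k)
    va = 1 - spread
    sda = 1 - math.sqrt(spread)
    # Assumption: categories with zero count are skipped (0 * log 0 taken as 0).
    rel_entropy = -sum((f / n) * math.log(f / n) for f in freqs if f > 0) / math.log(k)
    return va, sda, rel_entropy


# Slovene grapheme counts from Table 2 (K = 25).
slovene = [30849, 29708, 26129, 25886, 17175, 15921, 15045, 14144, 14139,
           12402, 11569, 11412, 10029, 9167, 8753, 6441, 5515, 5336, 4755,
           4429, 3054, 2923, 1967, 1893, 230]
va, sda, rel_entropy = qualitative_indices(slovene)
print(f"VA={va:.4f}  SDA={sda:.4f}  RE={rel_entropy:.4f}")
```

Shuffling the input list leaves all three values unchanged (up to floating-point rounding), which is the order-invariance property described above. A uniform list gives (1, 1, 1), and a list with all items in one category gives (0, 0, 0).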
The modified coordinates are defined as

$$I_m = \mathrm{SDA}/\mathrm{VA} \tag{4}$$

and

$$S_m = \mathrm{RE}/\mathrm{SDA}. \tag{5}$$

It is easy to see that $I_m$ could be simplified: writing $x$ for the square root in (2), we have $\mathrm{SDA} = 1 - x$ and $\mathrm{VA} = 1 - x^2 = (1 - x)(1 + x)$, so that

$$I_m = \left(1 + \sqrt{\frac{\sum_{i=1}^{K}\left(f_i - \frac{N}{K}\right)^2}{\frac{N^2 (K-1)}{K}}}\right)^{-1}. \tag{6}$$

However, we prefer to keep the form (4) for two reasons, the first being to highlight an analogy with the original Ord graph (other measures of qualitative variation can be more useful for analyses of other types of data, see Section 4), and the second being that, for linguistic data specifically, the form (4) can be simpler to interpret. Its denominator is, in fact, the normalized repeat rate

$$\mathrm{RR}_{\mathrm{norm}} = \frac{K}{K-1}\left(1 - \frac{\sum_{i=1}^{K} f_i^2}{N^2}\right) \tag{7}$$

(see Gibbs & Poston, 1975), which is one of the standard characteristics in quantitative linguistics.

Figure 3 shows the new graph applied, again, to grapheme frequencies in Slavic languages; we emphasize that the order of graphemes within a language is irrelevant in this case. Clusters created from its coordinates $I_m$ and $S_m$ (ellipses in Figure 3) present a pattern quite different from that in Figure 2. The proposed classification reveals interesting findings on the typology of Slavic languages; the resulting clusters are the same, again, regardless of the method, algorithm or metric used, see Section 2.

First of all, there is a group of South Slavic languages which perfectly fits their geographical location. Bulgarian, Croatian, Macedonian, Serbian, and Slovene form one homogeneous group. The orthographic systems of these languages are well organized with respect to the economy of coding of some specific prosodic features like the pitch accent in Croatian, Serbian, and Slovene, and to the marking of palatalized consonants in Bulgarian (marked with a specific vocalic grapheme).
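Returning briefly to definitions (4), (5) and (7): the relations between them can be checked numerically. The sketch below is only an illustration under stated assumptions. The function names `spread_root`, `modified_ord_coordinates` and `normalized_repeat_rate` are hypothetical, the toy counts are not corpus data, and empty categories are assumed to contribute 0 to the entropy sum.

```python
import math


def spread_root(freqs):
    """Square root appearing in the standard deviation analogue (2)."""
    n, k = sum(freqs), len(freqs)
    return math.sqrt(sum((f - n / k) ** 2 for f in freqs) / (n * n * (k - 1) / k))


def modified_ord_coordinates(freqs):
    """Return (I_m, S_m) = (SDA / VA, RE / SDA) as in (4) and (5)."""
    n, k = sum(freqs), len(freqs)
    x = spread_root(freqs)
    va, sda = 1 - x * x, 1 - x
    # Assumption: zero counts are skipped (0 * log 0 taken as 0).
    rel_entropy = -sum((f / n) * math.log(f / n) for f in freqs if f > 0) / math.log(k)
    return sda / va, rel_entropy / sda


def normalized_repeat_rate(freqs):
    """Normalized repeat rate as in (7)."""
    n, k = sum(freqs), len(freqs)
    return k / (k - 1) * (1 - sum(f * f for f in freqs) / (n * n))


toy = [3, 1]                                  # hypothetical counts
i_m, s_m = modified_ord_coordinates(toy)
x = spread_root(toy)
print(math.isclose(i_m, 1 / (1 + x)))                           # I_m equals 1 / (1 + x)
print(math.isclose(1 - x * x, normalized_repeat_rate(toy)))     # VA equals RR_norm
```

Both ratios are undefined when all items fall into a single category, since VA and SDA are then 0. For a uniform distribution both coordinates equal 1.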
Macedonian is one of the youngest standard languages, codified in 1945, and its orthography is largely based on the same principles as Serbian, that is, one letter for one sound.

Fig. 3. Modified Ord graph for grapheme frequency distributions, with cluster analysis applied to graph coordinates.

The second group can be called the basic West Slavic languages, and includes Czech and Slovak. The two languages are typologically quite similar in general, including their orthographic and phonemic systems, and thus their location in one group is justified.

In Figure 3, Russian, Ukrainian, Polish, and Upper Sorbian form a group. If one compares it with the traditional geographical classification, this North Slavic group seems to be a mixture of East Slavic (Russian and Ukrainian) and West Slavic languages (Polish and Upper Sorbian). However, if orthographic and phonemic criteria are taken into account, these languages share some common features, namely, they are characterized by a systematic palatalization correlation in the consonantal system (i.e. consonants tend to have both “hard” and “soft” versions). Indeed, these characteristics play a very important role in Russian and Ukrainian, whereas a regression of palatalization was reported for Polish and especially for Upper Sorbian.

The groups resulting from cluster analysis of the modified Ord graph coordinates differ slightly from the traditional, area-based typology of Slavic languages, but they suggest another, linguistically justifiable classification.
It corresponds to the approach of Kolomiec & Mel’ničuk (1986), where a group of North Slavic languages (Russian, Ukrainian, Polish) is mentioned; they are characterized by a high number of consonants in their inventories, whereas South Slavic languages mainly enlarged their vowel inventory (for a detailed discussion of vocalic and consonantal Slavic languages see Sawicka, 1991).

4. CONCLUSION

Our modification of the Ord graph yields linguistically motivated and interpretable results. Cluster analysis applied to the coordinates of the new graph reveals groups of Slavic languages which share some common features, as far as orthography and phonology are concerned. Thus, the application of the modified Ord graph to grapheme frequencies can be seen as a contribution to the typology of Slavic languages. When compared with their traditional, purely geographical classification, this new approach has the advantage of being based on empirically observed data.

Admittedly, the definition of the modified graph coordinates (4) and (5) used in this paper – i.e. the choice of indices (1)–(3) – is heuristic only. Apart from the fact that they yield linguistically relevant results in this case,⁵ there is no other reason why they should be preferred. It can be expected that other indices of qualitative variation (see e.g. Gibbs & Poston, 1975, or Wilcox, 1973; a comprehensive overview of variation measures is provided by Gadrich et al., 2015) will be more reasonable for categorical data arising in other branches of science. A deeper analysis of the indices and their ratios, which could possibly lead to more general interpretations, remains a challenge for mathematicians.⁶

Regardless of the choice of indices, the method is computationally very simple, and the results it yields are also easy to understand, as they are displayed in the two-dimensional plane.
In addition, it represents categorical data by two real-valued coordinates, thus enabling application of statistical classification or clustering methods.

⁵ Preliminary analyses indicate that our modification (i.e. the one which uses ratios of the variance analogue, the standard deviation analogue, and the relative entropy) of the Ord graph works well also for other languages (e.g. it can discriminate between the Germanic and Romance languages). However, more reliable and more complete data must be investigated before these results can be published.
⁶ The same is true, however, for other relatively new graphical methods – e.g. Cullen and Frey (1999) suggested another graph which, like the original Ord graph, exploits moments, with the square of the skewness on the x-axis and the kurtosis on the y-axis.

DISCLOSURE STATEMENT

No potential conflict of interest was reported by the authors.

FUNDING

Supported by the grants VEGA [2/0047/15] (M. Koščová, J. Mačutek); Comenius University grant [UK/83/2015] (M. Koščová).

REFERENCES

Agresti, A. (2013). Categorical Data Analysis. Chichester: Wiley.
Beguin, H., & Thomas, I. (1997). Morphologie du réseau de communication et localizations optimales d’activités. Quelle mesure pour exprimer la forme d’un réseau? Cybergeo European Journal of Geography, article no. 26.
Blasius, J., & Greenacre, M. (1998). Visualization of Categorical Data. San Diego, CA: Academic Press.
Comrie, B. (1996a). Adaptations of the Cyrillic alphabet. In: P. T. Daniels & W. Bright (Eds.), The World’s Writing Systems (pp. 700–726). Oxford: Oxford University Press.
Comrie, B. (1996b). Languages of Eastern and Southern Europe. In: P. T. Daniels & W. Bright (Eds.), The World’s Writing Systems (pp. 663–688). Oxford: Oxford University Press.
Cullen, A. C., & Frey, H. C. (1999). Probabilistic Techniques in Exposure Assessment. A Handbook for Dealing with Variability and Uncertainty in Models and Inputs. New York, NY: Plenum Press.
Friendly, M. (2000). Visualizing Categorical Data. Cary, NC: SAS Institute.
Gadrich, T., Bashkansky, E., & Zitikis, R. (2015). Assessing variation: a unifying approach for all scales of measurement. Quality & Quantity, 49, 1145–1167.
Gibbs, J. P., & Poston, D. L. (1975). The division of labor: Conceptualization and related measures. Social Forces, 53, 468–476.
Grzybek, P. (2007). On the systematic and system-based study of grapheme frequencies: A re-analysis of German letter frequencies. Glottometrics, 15, 82–91.
Grzybek, P., & Kelih, E. (2005). Towards a general model of grapheme frequencies in Slavic languages. In: R. Garabík (Ed.), Computer Treatment of Slavic and East European Languages (pp. 73–87). Bratislava: Veda.
Grzybek, P., & Rusko, M. (2009). Letter, grapheme and (allo-)phone frequencies: The case of Slovak. Glottotheory, 2(1), 30–48.
Grzybek, P., Kelih, E., & Altmann, G. (2005). Graphemhäufigkeiten (am Beispiel des Russischen). Teil III: Die Bedeutung des Inventarumfangs – eine Nebenbemerkung zur Diskussion um das ‘ë’. Anzeiger für Slavische Philologie, 33, 117–140.
Grzybek, P., Kelih, E., & Stadlober, E. (2009). Slavic letter frequencies: A common discrete model and regular parameter behavior? In: R. Köhler (Ed.), Issues in Quantitative Linguistics (pp. 17–33). Lüdenscheid: RAM-Verlag.
Izenman, A. J. (2008). Modern Multivariate Statistical Techniques. Regression, Classification, and Manifold Learning. Berlin: Springer.
Kelih, E. (2009a). Graphemhäufigkeiten in slawischen Sprachen: Stetige Modelle. Glottometrics, 18, 53–68.
Kelih, E. (2009b). Slawisches Parallel-Textkorpus: Projektvorstellung von “Kak zakaljalas’ stal’ (KZS)”. In: E. Kelih, V. Levickij & G. Altmann (Eds.), Methods of Text Analysis (pp. 106–124). Chernivtsi: ChNU.
Kelih, E. (2013). Grapheme inventory size and repeat rate in Slavic languages. Glottotheory, 4(1), 56–71.
Kolomiec, V. T. & Mel’ničuk, A. S. (1986). Istoričeskaja tipologija slavjanskich jazykov. Fonetika, slovoobrazovanie, leksika i frazeologija. Kiev: Naukova Dumka.
Martináková, Z., Mačutek, J., Popescu, I.-I., & Altmann, G. (2009). Ord’s criterion in musical texts. Glottotheory, 2(1), 86–98.
Ord, J. K. (1967a). Graphical methods for a class of discrete distributions. Journal of the Royal Statistical Society A, 130(2), 232–238.
Ord, J. K. (1967b). On a system of discrete distributions. Biometrika, 54(3/4), 649–656.
Rovenchak, A., & Vydrin, V. (2010). Quantitative properties of the Nko writing system. In: P. Grzybek, E. Kelih & J. Mačutek (Eds.), Text and Language. Structures, Functions, Interrelations, Quantitative Perspectives (pp. 171–181). Wien: Praesens.
Sawicka, I. (1991). Problems of the phonetic typology of the Slavic languages. In: I. Sawicka & A. Holvoet (Eds.), Studies in the Phonetic Typology of the Slavic Languages (pp. 13–35). Warszawa: Omnitech Press.
Schneider, D. C., & Duffy, D. C. (1985). Scale-dependent variability in seabird abundance. Marine Ecology Progress Series, 25, 211–218.
Stadlober, E., & Djuzelic, M. (2005). Multivariate statistical methods in quantitative text analyses. In: P. Grzybek (Ed.), Contributions to the Science of Text and Language. Word Length Studies and Related Issues (pp. 259–275). Dordrecht: Springer.
Taylor, Z. (1976). Accessibility of urban transport systems. The case of Poznań city. Geographia Polonica, 33(2), 121–141.
Wilcox, A. R. (1973). Indices of qualitative variation and political measurement. Western Political Quarterly, 26(2), 325–343.
Wilson, A. (2013). Probability distributions of grapheme frequencies in Irish and Manx. Journal of Quantitative Linguistics, 20(3), 169–177.
Wimmer, G., & Altmann, G. (1999). Thesaurus of Univariate Discrete Probability Distributions. Essen: Stamm.