Introduction Preliminaries Alice ICLE Conclusion

Information theory meets linguistic complexity
Benedikt Szmrecsanyi and Katharina Ehret
KU Leuven and University of Freiburg
Download these slides at http://www.benszm.net/complexity.pdf

What we would like to do in this talk
• draw inspiration from information theory to conceptualize & measure language complexity: Kolmogorov complexity
• two case studies:
  • crosslinguistic complexity variation: Alice’s Adventures in Wonderland
  • complexities of learner languages: The International Corpus of Learner English

Introduction
• linguistic complexity: currently one of the most hotly debated issues in linguistics (e.g. Sampson et al. 2009; Trudgill 2011; Pallotti 2014, among many others)
• theoretical linguistics: are all languages, or language varieties, equally complex? If not, what are the sociolinguistic factors that condition language complexity?
• applied linguistics: how can we use complexity measures as proxies for tracking learners’ proficiency, and/or for benchmarking development?

Why theoretical linguists are curious about complexity
Trudgill (2011), Sociolinguistic Typology. OUP.
contact, social instability, adult SLA, population growth ⇒ change ⇒ simplification, i.e.:
⇒ increase in morphological transparency
⇒ loss of redundancy
⇒ loss of “historical baggage”
Peter Trudgill

Some popular complexity measures in the theoretical literature
• quantitative complexity: more contrasts, rules, markers, etc. ⇒ complexity
• opacity: for example, allomorphies ⇒ complexity
• ornamental complexity: communicatively dispensable contrasts, rules, markers etc. ⇒ complexity
• L2 acquisition difficulty: contrasts, rules, etc. that are hard to learn for adults ⇒ complexity
(see Szmrecsanyi and Kortmann 2012 for a detailed review)

Why applied linguists worry about complexity
“Second language acquisition researchers use interlanguage complexity measures [...] with at least three main purposes in mind: (a) to gauge proficiency, (b) to describe performance, and (c) to benchmark development.” (Ortega 2012: 128)
Lourdes Ortega

Some popular complexity measures in the applied linguistics literature
• length of units (e.g. mean length of T-unit): long units ⇒ complexity
• density of subordination: subordination ⇒ complexity
• frequency of occurrence of ‘complex’ forms (e.g. passive voice): many complicated forms ⇒ complexity
(see Ortega 2003; Pallotti 2014 for reviews)

Shortcomings
• “theoretical” measures: nicely holistic but not easily amenable to operationalization in usage data (how do you measure, e.g., “historical baggage”?)
• “applied” measures: nicely amenable to operationalization in usage data but suffering from “concept reductionism” (Ortega 2012: 128)

Have cake and eat it too
• Kolmogorov complexity: unsupervised, holistic, text-based
• can be approximated using file compression programs ⇒ text samples that can be compressed efficiently are linguistically simple
• can be combined with various distortion techniques to yield measures of morphological and syntactic complexity
Andrei Kolmogorov (1903–1987)

Road map
1. Introduction
2. Preliminaries
3. Cross-linguistic complexity: Alice
4. Language-internal complexity: ICLE
5. Conclusion

Preliminaries

Information theory
“the science which deals with the concept ‘information’, its measurement and its applications” (Lubbe 1997: 1)
• in information theory, information relates to the unexpectedness and unpredictability (surprisal) of a proposition or of an event (Shannon 1948)
• Kolmogorov complexity measures information in text strings

Defining Kolmogorov complexity
“for any sequence of symbols, the Kolmogorov complexity of the sequence is the length of the shortest algorithm that will exactly generate the sequence [...] the more predictable the sequence, the shorter the algorithm needed and thus the Kolmogorov complexity of the sequence is also lower” (Sadeniemi et al. 2008: 191; see also Li and Vitányi 1997; Li et al. 2004)
Examples:
(1) ababababab (10 characters) ⇒ 5×ab (4 characters)
(2) kl!f7S23y0 (10 characters) ⇒ kl!f7S23y0 (10 characters)

Kolmogorov complexity in linguistics
• pioneered by Patrick Juola, based on parallel corpora (Juola 1998, 2008; see also Ehret 2014, fc; Ehret and Szmrecsanyi 2015)
• increased Kolmogorov complexity ⇒ higher complexity mandated by the language used to encode (constant) propositional content
• interpretation: entirely agnostic about form-meaning relationships etc.; what is measured is text-based linguistic surface complexity/redundancy

How to measure Kolmogorov complexity
• modern file compression programs use adaptive entropy estimation, which approximates Kolmogorov complexity (Ziv and Lempel 1977; Juola 1998)
• feed in corpus texts, note down file sizes before & after compression
• better compression rates ⇒ less Kolmogorov complexity
• gzip (GNU zip) version 1.2.4

What exactly does gzip do?
gzip compresses new text strings on the basis of previously encountered strings:
• the program loads a certain amount of text
• creates a temporary lexicon
• recognises new text (sub)strings on the basis of the lexicon
• text is compressed by eliminating redundancy
(see Ehret in preparation)

Measuring overall complexity
• 2 measurements per text:
  1. file size before compression (in bytes)
  2. file size after compression (in bytes)
• regress out the trivial correlation between the two ⇒ adjusted complexity scores (regression residuals, in bytes)
• bigger adjusted complexity scores ⇒ more Kolmogorov complexity

Addressing morphological complexity
• morphological distortion: random deletion of 10% of all orthographic characters in each file prior to compression ⇒ creation of new word forms
• rationale: morphologically complex languages exhibit a large number of word forms overall; therefore, distortion should not hurt them as badly as morphologically simple languages, in which distortion creates more random noise/complexity
• thus, comparatively bad compression efficiency after distortion ⇒ low morphological complexity

Addressing syntactic complexity
• syntactic distortion: random deletion of 10% of all word tokens in each file prior to compression ⇒ disruption of word order patterns
• rationale: little impact on languages with relatively simple syntax (free word order), as they lack between-word interdependencies that could be compromised. Syntactically complex languages with rigid word order will be badly hurt as word order regularities are distorted
• thus, comparatively bad compression efficiency after distortion ⇒ syntactic complexity (word order rigidity)

Alice: Through the distortion glass
Morphological distortion:
alice was egining to get very tired of sitting by her ist on the bank an of havig nothing to do once or wice she had pped into the book her sister was rading but it had n picures or conversatons in it and what is the se of a boo thought ali without pictures or conversation
Syntactic distortion:
alice was beginning to get very Ø of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into Ø book her sister was Ø but it had no pictures or conversations Ø it and what is Ø use of a book thought alice without pictures or conversation

Calculating complexity scores
• each text file is distorted and compressed repeatedly, with N = 1,000 iterations
• average morphological complexity score: mean quotient of the compressed file sizes after morphological distortion and the undistorted compressed file sizes
• average syntactic complexity score: mean quotient of the compressed file sizes after syntactic distortion and the undistorted compressed file sizes

Cross-linguistic complexity: Alice’s Adventures in Wonderland

Alice’s Adventures in Wonderland
• parallel texts: translational equivalents of the same text in different languages
• popular in cross-linguistic typology because differences in propositional content can be ruled out & still usage-based (see, e.g., Cysouw and Wälchli 2007)
• Alice in 9 European languages: Dutch, English, Finnish, French, German, Hungarian, Italian, Romanian, Spanish
Thanks to Annemarie Verkerk for making the Alice database available to us.
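
The measurement procedure from the Preliminaries (file sizes before and after compression, regression residuals, the two distortions, and the mean size quotients) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it uses Python's gzip module rather than gzip 1.2.4, treats "orthographic characters" as all non-whitespace characters, splits word tokens on whitespace, and fits a simple linear regression for the adjusted scores. All function names are hypothetical.

```python
import gzip
import random


def compressed_size(text: str) -> int:
    """Size in bytes of the gzip-compressed, UTF-8 encoded text."""
    return len(gzip.compress(text.encode("utf-8")))


def adjusted_scores(sizes_before, sizes_after):
    """Residuals of a least-squares line predicting compressed from raw size.

    Assumption: a simple linear regression; needs at least two texts
    of different raw sizes.
    """
    n = len(sizes_before)
    mean_x = sum(sizes_before) / n
    mean_y = sum(sizes_after) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_before)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sizes_before, sizes_after))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x)
            for x, y in zip(sizes_before, sizes_after)]


def distort_morphology(text: str, rng: random.Random, rate: float = 0.10) -> str:
    """Delete a random `rate` share of the non-whitespace characters."""
    positions = [i for i, ch in enumerate(text) if not ch.isspace()]
    drop = set(rng.sample(positions, round(len(positions) * rate)))
    return "".join(ch for i, ch in enumerate(text) if i not in drop)


def distort_syntax(text: str, rng: random.Random, rate: float = 0.10) -> str:
    """Delete a random `rate` share of the whitespace-separated tokens."""
    tokens = text.split()
    drop = set(rng.sample(range(len(tokens)), round(len(tokens) * rate)))
    return " ".join(tok for i, tok in enumerate(tokens) if i not in drop)


def distortion_score(text: str, distort, iterations: int = 1000, seed: int = 0) -> float:
    """Mean quotient: compressed size after distortion / undistorted compressed size."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    baseline = compressed_size(text)
    ratios = [compressed_size(distort(text, rng)) / baseline
              for _ in range(iterations)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    sample = (
        "alice was beginning to get very tired of sitting by her sister on "
        "the bank and of having nothing to do once or twice she had peeped "
        "into the book her sister was reading but it had no pictures or "
        "conversations in it and what is the use of a book thought alice "
        "without pictures or conversation"
    )
    print("morphological score:", distortion_score(sample, distort_morphology, 100))
    print("syntactic score:", distortion_score(sample, distort_syntax, 100))
```

A single sentence is far too short for meaningful scores, because the fixed gzip header dominates the file size; the demo only shows how the pieces fit together. On real data each function would be applied to whole corpus files, and `adjusted_scores` would receive one size pair per text.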

An overall complexity hierarchy of Alice
Adjusted overall complexity scores. Negative residuals indicate below-average complexity, positive residuals indicate above-average complexity (adapted from Ehret and Szmrecsanyi 2015: Fig 4).
[Figure: average adjusted overall complexity score per language, axis from −2000 to 1000; languages as listed in the figure: Hungarian, Romanian, Dutch, Finnish, German, Italian, Spanish, French, English]

Distorting Alice
Morphological complexity by syntactic complexity. Abscissa indexes morphological complexity, ordinate indexes syntactic complexity (fixed word order) (adapted from Ehret and Szmrecsanyi 2015: Fig 7).
[Figure: morphological complexity (abscissa) by syntactic complexity (ordinate), one point per language]

Alice: Interim summary
• approach yields rankings that are in line with expectations and “traditional” research (e.g. Bakker 1998)
• syntactic complexity: English > French > Spanish > Italian > Dutch > German/Romanian > Hungarian > Finnish
• morphological complexity: Finnish > Hungarian > Romanian > German > Dutch > Italian > Spanish > French > English

Assessing learner language

The International Corpus of Learner English (ICLE)
• Version 1.1
• essays by advanced learners of English with different mother tongue backgrounds
• 11 subcorpora: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish, and Swedish
(see Granger et al. 2002 for details)

The dataset
work in progress (Ehret in preparation)
• focus on argumentative essays
• grouping variable: time spent studying English at school/uni ⇒ distinguish between 6 groups:
  Group 5 (most instruction)   7-9 yrs @ school, 4-5 yrs @ uni
  Group 4                      7-9 yrs @ school, 3 yrs @ uni
  Group 3b                     7-9 yrs @ school, 1-2 yrs @ uni
  Group 3a                     4-6 yrs @ school, 4-5 yrs @ uni
  Group 2                      4-6 yrs @ school, 3 yrs @ uni
  Group 1 (least instruction)  4-6 yrs @ school, 1-2 yrs @ uni
• amount of instruction received: proxy for proficiency

An overall complexity hierarchy of learner essays
Average adjusted overall complexity scores. Negative residuals indicate below-average complexity, positive residuals indicate above-average complexity. N = 1000 iterations sampling 10% of sentences in sample (adapted from Ehret in preparation).
Legend: Group 5: most instruction ... Group 1: least instruction
[Figure: average adjusted overall complexity score for groups 5, 4, 3b, 3a, 2 and 1, axis from −200 to 400]

Morphological and syntactic complexity in ICLE
Morphological complexity by syntactic complexity. Abscissa indexes morphological complexity, ordinate indexes syntactic complexity (fixed word order). N = 1000 iterations sampling 10% of sentences in sample (adapted from Ehret in preparation).
Legend: Group 5: most instruction ... Group 1: least instruction
[Figure: morphological complexity (abscissa) by syntactic complexity (ordinate), one point per group]

Complexity rankings and national background
• the relationship between complexity and amount of instruction received still holds when attention is restricted to particular mother tongue backgrounds
• measure complexity in the German national subcorpus (biggest ICLE component)
• again, focus on argumentative essays

German subcorpus: An overall complexity hierarchy
Average adjusted overall complexity scores. Negative residuals indicate below-average complexity, positive residuals indicate above-average complexity. N = 1000 iterations sampling 10% of sentences in sample (adapted from Ehret in preparation).
Legend: Group 5: most instruction ... Group 1: least instruction
[Figure: average adjusted overall complexity score for groups 5, 4, 3b, 3a, 2 and 1, axis from −100 to 50]

German subcorpus: Morphological and syntactic complexity
Morphological complexity by syntactic complexity. Abscissa indexes morphological complexity, ordinate indexes syntactic complexity (fixed word order). N = 1000 iterations sampling 10% of sentences in sample (adapted from Ehret in preparation).
Legend: Group 5: most instruction ... Group 1: least instruction
[Figure: morphological complexity (abscissa) by syntactic complexity (ordinate), one point per group]

ICLE: Interim summary
• more instruction correlates with ...
  • ... more overall complexity
  • ... more morphological complexity
  • ... less syntactic complexity
• the complexity rankings are independent of the learners’ mother tongue background

Conclusion

Summary & outlook
• exploring the frontiers of linguistically responsible complexity research
• key findings:
  • cross-linguistic complexity variation in line with expectations
  • learner language: correlation between instruction received/proficiency and complexity
• advantages & limitations of the construct for L2 research:
  • objective, fairly holistic
  • data sparsity

Future directions
• measure phonetic & phonological complexity
• more advanced ways to measure syntactic variation
• language testing applications?

Thank you!
[email protected] [email protected]

References
Bakker, D. (1998). Flexibility and consistency in word order patterns in the languages of Europe. In A. Siewierska (Ed.), Constituent order in the languages of Europe, pp. 384–419. Berlin: Mouton de Gruyter.
Cysouw, M. and B. Wälchli (2007). Parallel texts: using translational equivalents in linguistic typology. Language Typology and Universals 60 (2), 95–99.
Ehret, K. (2014). Kolmogorov complexity of morphs and constructions in English. Linguistic Issues in Language Technology (LiLT) 11 (2).
Ehret, K. (f.c.). A corpus-based study of information theoretic complexity in World Englishes. PhD dissertation, University of Freiburg.
Ehret, K. and B. Szmrecsanyi (2015). An information-theoretic approach to assess linguistic complexity. In R. Baechler and G. Seiler (Eds.), Complexity and Isolation. Berlin: de Gruyter.
Granger, S., E. Dagneaux, and F. Meunier (Eds.) (2002). The International Corpus of Learner English: Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.
Juola, P. (1998). Measuring linguistic complexity: the morphological tier. Journal of Quantitative Linguistics 5 (3), 206–213.
Juola, P. (2008). Assessing linguistic complexity. In M. Miestamo, K. Sinnemäki, and F. Karlsson (Eds.), Language Complexity: Typology, Contact, Change, pp. 89–108. Amsterdam, Philadelphia: Benjamins.
Li, M., X. Chen, X. Li, B. Ma, and P. M. B. Vitányi (2004). The similarity metric. IEEE Transactions on Information Theory 50 (12), 3250–3264.
Li, M. and P. M. B. Vitányi (1997). An introduction to Kolmogorov complexity and its applications. New York: Springer.
Lubbe, J. C. A. v. d. (1997). Information theory. Cambridge, New York: Cambridge University Press.
Ortega, L. (2003). Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college-level L2 writing. Applied Linguistics 24, 492–518.
Ortega, L. (2012). Interlanguage complexity: A construct in search of theoretical renewal. In B. Kortmann and B. Szmrecsanyi (Eds.), Linguistic Complexity: Second Language Acquisition, Indigenization, Contact. Berlin: De Gruyter.
Pallotti, G. (2014). A simple view of linguistic complexity. Second Language Research.
Sadeniemi, M., K. Kettunen, T. Lindh-Knuutila, and T. Honkela (2008). Complexity of European Union languages: A comparative approach. Journal of Quantitative Linguistics 15 (2), 185–211.
Sampson, G., D. Gil, and P. Trudgill (2009). Language complexity as an evolving variable. Oxford, New York: Oxford University Press.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423.
Szmrecsanyi, B. and B. Kortmann (2009). Between simplification and complexification: non-standard varieties of English around the world. In G. Sampson, D. Gil, and P. Trudgill (Eds.), Language Complexity as an Evolving Variable, pp. 64–79. Oxford: Oxford University Press.
Szmrecsanyi, B. and B. Kortmann (2012). Introduction: Linguistic complexity – second language acquisition, indigenization, contact. In B. Szmrecsanyi and B. Kortmann (Eds.), Linguistic Complexity: Second Language Acquisition, Indigenization, Contact, pp. 6–34. Berlin: De Gruyter.
Trudgill, P. (2011). Sociolinguistic typology: social determinants of linguistic complexity. Oxford, New York: Oxford University Press.
Ziv, J. and A. Lempel (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23 (3), 337–343.

Bonus material

How compression algorithms see the world
• Ehret (in preparation) re-programs gzip to retrieve an inspectable lexicon
• input text: Alice’s Adventures in Wonderland
• length of compressed sequences ranges from 3 to 148 characters
• 85% of all strings have a length of three to ten characters
• captures linguistic structure
alice was beginning to get very tired of sitt [29,4]ing by her [15,3] sist [7,3]er on the bank an [41,5]d of hav [40,4]ing noth [77,7]ing to do [40,3] on [102,3]ce or tw [111,4]ice s [51,3]he had peep [94,3]ed in [37,3]to [71,5]the book [94,12]her sister [151,4]was read [120,5]ing but it [55,5]had no pictures ...
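
The bracketed pairs in the excerpt above read naturally as [distance, length] back-references of the kind produced by Lempel-Ziv (LZ77) matching: the first pair, [29,4], fits the string "ing " of "sitting", which repeats "ing " from "beginning" 29 characters earlier. The toy matcher below illustrates that idea only. It is not gzip and not the re-programmed gzip described on the slide; the function names are hypothetical, and the greedy brute-force search is an assumption chosen for readability. The minimum match length of 3, maximum of 258 and 32 KB window follow the DEFLATE format that gzip uses.

```python
def lz77_tokens(text, window=32768, min_len=3, max_len=258):
    """Greedy LZ77-style parse into literals and (distance, length, match) tuples."""
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Brute-force search for the longest earlier match inside the window.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(text)
                   and text[j + length] == text[i + length]):
                length += 1
            if length >= best_len:  # on ties, prefer the most recent match
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len, text[i:i + best_len]))
            i += best_len
        else:
            tokens.append(text[i])  # too short to be worth a reference
            i += 1
    return tokens


def decode(tokens):
    """Rebuild the text from the tokens using only distances and lengths."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length, _ = tok
            for _ in range(length):
                out.append(out[-dist])  # char-by-char copy handles overlaps
        else:
            out.append(tok)
    return "".join(out)


def render(tokens):
    """Show back-references in the slide's notation: [distance,length]match."""
    return "".join(f"[{t[0]},{t[1]}]{t[2]}" if isinstance(t, tuple) else t
                   for t in tokens)


if __name__ == "__main__":
    sample = "alice was beginning to get very tired of sitting by her sister"
    print(render(lz77_tokens(sample)))
```

The matcher is quadratic in the text length, so it suits short excerpts only. Its later matches need not coincide with those in the excerpt, since real compressors use different search heuristics.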