Information theory meets linguistic complexity

Introduction
Preliminaries
Alice
ICLE
Benedikt Szmrecsanyi and Katharina Ehret
KU Leuven and University of Freiburg
Download these slides at http://www.benszm.net/complexity.pdf
Conclusion
What we would like to do in this talk
• drawing inspiration from information theory to
conceptualize & measure language complexity:
Kolmogorov complexity
• two case studies:
• crosslinguistic complexity variation: Alice’s Adventures in
Wonderland
• complexities of learner languages: The International
Corpus of Learner English
Introduction
• linguistic complexity: one of the currently most hotly
debated issues in linguistics
(e.g. Sampson et al. 2009; Trudgill 2011; Pallotti 2014, among many others)
• theoretical linguistics:
are all languages, or language varieties, equally complex?
If not, what are the sociolinguistic factors that condition
language complexity?
• applied linguistics:
how can we use complexity measures as proxies for
tracking learners’ proficiency, and/or for benchmarking
development?
Why theoretical linguists
are curious about complexity
Trudgill (2011), Sociolinguistic Typology. OUP.
contact, social instability, adult SLA, population growth
→ change
→ simplification, i.e.:
→ increase in morphological transparency
→ loss of redundancy
→ loss of “historical baggage”
Peter Trudgill
Some popular complexity measures
in the theoretical literature
• quantitative complexity:
more contrasts, rules, markers, etc. → complexity
• opacity:
for example, allomorphies → complexity
• ornamental complexity: communicatively dispensable
contrasts, rules, markers etc. → complexity
• L2 acquisition difficulty: contrasts, rules, etc. that are
hard to learn for adults → complexity
(see Szmrecsanyi and Kortmann 2012 for a detailed review)
Why applied linguists
worry about complexity
“Second language acquisition
researchers use interlanguage
complexity measures [. . . ] with at
least three main purposes in mind:
(a) to gauge proficiency, (b) to
describe performance, and (c) to
benchmark development.”
(Ortega 2012: 128)
Lourdes Ortega
Some popular complexity measures in the
applied linguistics literature
• length of units (e.g. mean length of T-unit):
long units → complexity
• density of subordination:
subordination → complexity
• frequency of occurrence of ‘complex’ forms (e.g. passive
voice): many complicated forms → complexity
(see Ortega 2003; Pallotti 2014 for reviews)
Shortcomings
• “theoretical” measures: nicely holistic but not easily
amenable to operationalization in usage data
(how do you measure, e.g., “historical baggage”?)
• “applied” measures: nicely amenable to operationalization
in usage data but suffering from “concept reductionism”
(Ortega 2012: 128)
Have cake and eat it too
• Kolmogorov complexity:
unsupervised, holistic, text-based
• can be approximated using file
compression programs → text samples
that can be compressed efficiently are
linguistically simple
• can be combined with various
distortion techniques to yield measures
of morphological and syntactic
complexity
Andrei Kolmogorov
(1903–1987)
Road map
1. Introduction
2. Preliminaries
3. Cross-linguistic complexity: Alice
4. Language-internal complexity: ICLE
5. Conclusion
Preliminaries
Information theory
“the science which deals with the concept ‘information’, its
measurement and its applications”
(Lubbe 1997: 1)
• in information theory, information relates to the
unexpectedness and unpredictability (surprisal) of a
proposition or of an event
(Shannon 1948)
• Kolmogorov complexity measures information in text
strings
Defining Kolmogorov complexity
“for any sequence of symbols, the Kolmogorov complexity of
the sequence is the length of the shortest algorithm that
will exactly generate the sequence [. . . ] the more predictable
the sequence, the shorter the algorithm needed and thus
the Kolmogorov complexity of the sequence is also lower”
(Sadeniemi et al. 2008: 191; see also Li and Vitányi 1997; Li et al. 2004)
Examples:
(1) ababababab (10 characters) → 5×ab (4 characters)
(2) kl!f7S23y0 (10 characters) → kl!f7S23y0 (10 characters)
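The contrast between examples (1) and (2) can be reproduced with an off-the-shelf compressor. The sketch below is only an illustration, not the authors' setup: it uses Python's `zlib` module (DEFLATE, the algorithm that gzip also uses) as a stand-in, and it scales both strings up to 1,000 characters because the compressor's fixed overhead would swamp any difference between ten-character inputs. The helper name `compressed_size` and the random alphabet are assumptions made for this example.

```python
import random
import string
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


# A fixed seed keeps the "random" string reproducible.
rng = random.Random(0)
alphabet = string.ascii_letters + string.digits + string.punctuation

# Predictable sequence, analogous to example (1).
repetitive = "ab" * 500
# Unpredictable sequence of the same length, analogous to example (2).
noisy = "".join(rng.choice(alphabet) for _ in range(1000))

size_repetitive = compressed_size(repetitive)
size_noisy = compressed_size(noisy)

print("repetitive:", len(repetitive), "->", size_repetitive, "bytes")
print("noisy:     ", len(noisy), "->", size_noisy, "bytes")
```

The predictable string shrinks to a small fraction of its original size, while the random string barely compresses at all. That difference is what the compression-based approximation of Kolmogorov complexity relies on.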
Kolmogorov complexity in linguistics
• pioneered by Patrick Juola, based on parallel corpora
(Juola 1998, 2008; see also Ehret 2014, fc; Ehret and Szmrecsanyi 2015)
• increased Kolmogorov complexity → higher complexity
mandated by the language used to encode (constant)
propositional content
• interpretation: entirely agnostic about form-meaning
relationships etc.; what is measured is
text-based linguistic surface complexity/redundancy
How to measure Kolmogorov complexity
• modern file compression programs use
adaptive entropy estimation, which
approximates Kolmogorov complexity
(Ziv and Lempel, 1977; Juola, 1998)
• feed in corpus texts, note down file
sizes before & after compression
• better compression rates → lower
Kolmogorov complexity
• gzip (GNU zip) version 1.2.4
What exactly does gzip do?
gzip compresses new text strings on the basis of previously
encountered strings:
• the program loads a certain amount of text
• creates a temporary lexicon
• recognises new text (sub)strings on the basis of the
lexicon
• text is compressed by eliminating redundancy
(see Ehret in preparation)
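The "temporary lexicon" idea can be illustrated with a toy sliding-window coder in the spirit of Ziv and Lempel (1977). This is a deliberately simplified sketch and not gzip's actual implementation: gzip's DEFLATE format additionally Huffman-codes its output and searches a 32 KB window far more efficiently. The function names, the 255-character window and the greedy search are assumptions chosen for readability.

```python
def lz77_tokens(text, window=255, min_len=3):
    """Greedy LZ77-style tokenizer.

    Returns a list whose items are either single literal characters or
    (distance, length) back-references to previously seen text.
    """
    tokens = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Search the already-seen text (the "lexicon") for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlapping copy), as in LZ77.
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))  # redundancy replaced by a pointer
            i += best_len
        else:
            tokens.append(text[i])  # nothing useful in the lexicon: emit a literal
            i += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the original text from literals and back-references."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # copy one character at a time
        else:
            out.append(tok)
    return "".join(out)


sample = "alice was beginning to get very tired of sitting by her sister on the bank"
tokens = lz77_tokens(sample)
print(tokens)
```

Recurring substrings such as "ing " are replaced by a (distance, length) pair that points back to their earlier occurrence, which is how redundancy is removed from the text.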
Measuring overall complexity
• 2 measurements per text:
1. file size before compression (in bytes)
2. file size after compression (in bytes)
• regress out the trivial correlation between the two
measurements → adjusted complexity scores
(regression residuals, in bytes)
• bigger adjusted complexity scores → more Kolmogorov
complexity
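The two measurements and the residual step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' pipeline: `zlib` stands in for gzip, the regression is an ordinary least-squares fit of compressed size on uncompressed size, and the three toy texts and all function names are invented for the example.

```python
import zlib


def file_sizes(text):
    """Return (size before compression, size after compression) in bytes."""
    raw = text.encode("utf-8")
    return len(raw), len(zlib.compress(raw, 9))


def adjusted_scores(sizes):
    """Regress compressed size on raw size (OLS) and return the residuals."""
    n = len(sizes)
    mean_x = sum(x for x, _ in sizes) / n
    mean_y = sum(y for _, y in sizes) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sizes)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed compressed size minus the size predicted from length.
    return [y - (intercept + slope * x) for x, y in sizes]


# Hypothetical toy "corpus"; real input would be one text file per language.
texts = {
    "repetitive": "the cat sat on the mat " * 40,
    "varied": " ".join(f"word{i} follows token{i * 7 % 13}" for i in range(40)),
    "mixed": ("the cat sat on the mat " * 20) + " ".join(str(i * i) for i in range(60)),
}

sizes = [file_sizes(t) for t in texts.values()]
scores = dict(zip(texts, adjusted_scores(sizes)))

for name, score in scores.items():
    print(f"{name}: adjusted score = {score:.1f} bytes")
```

A positive residual means the text compresses worse than its length alone would predict, and a negative residual means it compresses better.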
Addressing morphological complexity
• morphological distortion: random deletion of 10% of all
orthographic characters in each file prior to compression
→ creation of new word forms
• rationale: morphologically complex languages exhibit a
large number of word forms overall; distortion should
therefore hurt them less than morphologically simple
languages, in which distortion creates more random
noise/complexity
• thus, comparatively bad compression efficiency after
distortion → low morphological complexity
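A minimal sketch of the character-deletion step is given below. The slides do not say whether whitespace counts as an orthographic character, so this sketch assumes that only non-whitespace characters are eligible for deletion. The function name `distort_characters` and the `seed` parameter are likewise assumptions for illustration.

```python
import random


def distort_characters(text, rate=0.10, seed=None):
    """Delete a random `rate` share of the non-whitespace characters in `text`."""
    rng = random.Random(seed)
    # Assumption: whitespace is kept so that word boundaries stay intact.
    positions = [i for i, ch in enumerate(text) if not ch.isspace()]
    n_delete = round(len(positions) * rate)
    doomed = set(rng.sample(positions, n_delete))
    return "".join(ch for i, ch in enumerate(text) if i not in doomed)


sample = "alice was beginning to get very tired of sitting by her sister on the bank"
distorted = distort_characters(sample, seed=1)
print(distorted)
```

Every call with a different seed yields a different set of damaged word forms, which is why the procedure is repeated many times and averaged.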
Addressing syntactic complexity
• syntactic distortion: random deletion of 10% of all word
tokens in each file prior to compression
→ disruption of word order patterns
• rationale: little impact on languages with relatively simple
syntax (free word order), as they lack between-word
interdependencies that could be compromised.
Syntactically complex languages with rigid word
order will be badly hurt as word order regularities are
distorted
• thus, comparatively bad compression efficiency after
distortion → syntactic complexity (word order rigidity)
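The word-deletion step can be sketched in the same way. The sketch assumes that word tokens are whitespace-delimited strings and that the surviving tokens are rejoined with single spaces. Neither detail is specified on the slides, and the function name `distort_words` is hypothetical.

```python
import random


def distort_words(text, rate=0.10, seed=None):
    """Delete a random `rate` share of the whitespace-delimited tokens in `text`."""
    rng = random.Random(seed)
    tokens = text.split()
    n_delete = round(len(tokens) * rate)
    doomed = set(rng.sample(range(len(tokens)), n_delete))
    # The remaining tokens keep their relative order; only adjacency changes.
    return " ".join(tok for i, tok in enumerate(tokens) if i not in doomed)


sample = (
    "alice was beginning to get very tired of sitting by her sister "
    "on the bank and of having nothing to do"
)
distorted = distort_words(sample, seed=1)
print(distorted)
```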
Alice: Through the distortion glass
Morphological distortion:
alice was egining to get
very tired of sitting by
her ist on the bank an
of havig nothing to do
once or wice she had pped
into the book her sister
was rading but it had n
picures or conversatons
in it and what is the
se of a boo thought
ali without pictures
or conversation

Syntactic distortion:
alice was beginning to
get very Ø of sitting by
her sister on the bank
and of having nothing
to do once or twice she
had peeped into Ø book
her sister was Ø but
it had no pictures or
conversations Ø it and
what is Ø use of a book
thought alice without
pictures or conversation
Calculating complexity scores
• each text file is distorted and compressed repeatedly
(N = 1,000 iterations)
• average morphological complexity score: mean quotient of
the compressed file sizes after morphological distortion
and the undistorted compressed file sizes
• average syntactic complexity score: mean quotient of the
compressed file sizes after syntactic distortion and the
undistorted compressed file sizes
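Putting the pieces together, the mean-quotient scores might be computed along the following lines. This is a sketch under several assumptions, not the authors' code: `zlib` replaces gzip, the morphological distortion spares whitespace, tokens are whitespace-delimited, and the demo runs only 50 iterations on a single sentence to stay fast. All function names are hypothetical.

```python
import random
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def morphological_distortion(text, rng, rate=0.10):
    """Delete `rate` of the non-whitespace characters (assumption: spaces kept)."""
    positions = [i for i, ch in enumerate(text) if not ch.isspace()]
    doomed = set(rng.sample(positions, round(len(positions) * rate)))
    return "".join(ch for i, ch in enumerate(text) if i not in doomed)


def syntactic_distortion(text, rng, rate=0.10):
    """Delete `rate` of the whitespace-delimited word tokens."""
    tokens = text.split()
    doomed = set(rng.sample(range(len(tokens)), round(len(tokens) * rate)))
    return " ".join(tok for i, tok in enumerate(tokens) if i not in doomed)


def complexity_score(text, distort, iterations=1000, seed=0):
    """Mean quotient: compressed size after distortion / undistorted compressed size."""
    rng = random.Random(seed)
    baseline = compressed_size(text)
    ratios = [
        compressed_size(distort(text, rng)) / baseline for _ in range(iterations)
    ]
    return sum(ratios) / len(ratios)


text = (
    "alice was beginning to get very tired of sitting by her sister on the "
    "bank and of having nothing to do once or twice she had peeped into the "
    "book her sister was reading but it had no pictures or conversations in it"
)

morph_score = complexity_score(text, morphological_distortion, iterations=50)
synt_score = complexity_score(text, syntactic_distortion, iterations=50)
print(f"morphological quotient: {morph_score:.3f}")
print(f"syntactic quotient:     {synt_score:.3f}")
```

Averaging over many iterations smooths out the effect of any single random deletion pattern. The resulting quotients are read as described on the distortion slides above.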
Cross-linguistic complexity: Alice’s
Adventures in Wonderland
Alice’s Adventures in Wonderland
• parallel texts: translational equivalents
of the same text in different languages
• popular in cross-linguistic typology
because differences in propositional
content can be ruled out & still
usage-based
(see, e.g., Cysouw and Wälchli 2007)
• Alice in 9 European languages:
Dutch, English, Finnish, French, German,
Hungarian, Italian, Romanian, Spanish
Thanks to Annemarie Verkerk for making the Alice database available to us.
An overall complexity hierarchy of Alice
Adjusted overall complexity scores. Negative residuals indicate
below-average complexity, positive residuals indicate above-average
complexity (adapted from Ehret and Szmrecsanyi 2015: Fig 4).
[Figure: average adjusted overall complexity score (x-axis, approx.
−2000 to 1000) by language, in the order shown on the slide:
Hungarian, Romanian, Dutch, Finnish, German, Italian, Spanish,
French, English]
Distorting Alice
[Figure: Morphological complexity by syntactic complexity. Abscissa
indexes morphological complexity, ordinate indexes syntactic
complexity (fixed word order) (adapted from Ehret and Szmrecsanyi
2015: Fig 7).]
Alice: Interim summary
• approach yields rankings that are in line with expectations
and “traditional” research
(e.g. Bakker 1998)
• syntactic complexity:
English > French > Spanish > Italian > Dutch >
German/Romanian > Hungarian > Finnish
• morphological complexity:
Finnish > Hungarian > Romanian > German > Dutch
> Italian > Spanish > French > English
Assessing learner language
The International Corpus of Learner English
(ICLE)
• Version 1.1
• essays by advanced learners of English with different
mother tongue backgrounds
• 11 subcorpora:
Bulgarian, Czech, Dutch, Finnish, French, German,
Italian, Polish, Russian, Spanish, and Swedish
(see Granger et al. 2002 for details)
The dataset
work in progress (Ehret in preparation)
• focus on argumentative essays
• grouping variable: time spent studying English at school/uni
→ distinguish between 6 groups:

Group                         School     University
Group 5 (most instruction)    7-9 yrs    4-5 yrs
Group 4                       7-9 yrs    3 yrs
Group 3b                      7-9 yrs    1-2 yrs
Group 3a                      4-6 yrs    4-5 yrs
Group 2                       4-6 yrs    3 yrs
Group 1 (least instruction)   4-6 yrs    1-2 yrs
• amount of instruction received: proxy for proficiency
An overall complexity hierarchy of learner essays
Average adjusted overall complexity scores. Negative residuals
indicate below-average complexity, positive residuals indicate
above-average complexity. N = 1000 iterations sampling 10% of
sentences in sample (adapted from Ehret in preparation).
[Figure: average adjusted overall complexity score (x-axis, approx.
−200 to 400) by group, in the order shown on the slide: group 5,
group 4, group 3b, group 3a, group 2, group 1]
Legend: Group 5: most instruction ... Group 1: least instruction
Morphological and syntactic complexity in ICLE
[Figure: Morphological complexity by syntactic complexity. Abscissa
indexes morphological complexity, ordinate indexes syntactic
complexity (fixed word order). N = 1000 iterations sampling 10% of
sentences in sample (adapted from Ehret in preparation).]
Legend: Group 5: most instruction ... Group 1: least instruction
Complexity rankings and national background
• relationship between complexity and amount of
instruction received survives restriction of attention to
particular mother tongue backgrounds
• measure complexity in the German national subcorpus
(biggest ICLE component)
• again, focus on argumentative essays
German subcorpus: An overall complexity hierarchy
Average adjusted overall complexity scores. Negative residuals
indicate below-average complexity, positive residuals indicate
above-average complexity. N = 1000 iterations sampling 10% of
sentences in sample (adapted from Ehret in preparation).
[Figure: average adjusted overall complexity score (x-axis, approx.
−100 to 50) by group, in the order shown on the slide: group 5,
group 4, group 3b, group 3a, group 2, group 1]
Legend: Group 5: most instruction ... Group 1: least instruction
Morphological and syntactic complexity
[Figure: Morphological complexity by syntactic complexity. Abscissa
indexes morphological complexity, ordinate indexes syntactic
complexity (fixed word order). N = 1000 iterations sampling 10% of
sentences in sample (adapted from Ehret in preparation).]
Legend: Group 5: most instruction ... Group 1: least instruction
ICLE: Interim summary
• more instruction correlates with . . .
• . . . more overall complexity
• . . . more morphological complexity
• . . . less syntactic complexity
• the complexity rankings are independent of the learners’
mother tongue background
Conclusion
Summary & outlook
• exploring the frontiers of linguistically responsible
complexity research
• key findings:
• cross-linguistic complexity variation in line with
expectations
• learner language: correlation between instruction
received/proficiency and complexity
• advantages & limitations of the construct for L2 research:
• objective, fairly holistic
• data sparsity
Future directions
• measure phonetic & phonological complexity
• more advanced ways to measure syntactic variation
• language testing applications?
Thank you!
References
Bakker, D. (1998). Flexibility and consistency in word order patterns in the languages
of Europe. In A. Siewierska (Ed.), Constituent order in the languages of Europe,
pp. 384–419. Berlin: Mouton de Gruyter.
Cysouw, M. and B. Wälchli (2007). Parallel texts: using translational equivalents in
linguistic typology. Language Typology and Universals 60 (2), 95–99.
Ehret, K. (2014). Kolmogorov complexity of morphs and constructions in English.
Linguistic Issues in Language Technology (LiLT) 11 (2).
Ehret, K. (f.c.). A corpus-based study of information theoretic complexity in World
Englishes. PhD dissertation, University of Freiburg.
Ehret, K. and B. Szmrecsanyi (2015). An information-theoretic approach to assess
linguistic complexity. In R. Baechler and G. Seiler (Eds.), Complexity and Isolation.
Berlin: de Gruyter.
Granger, S., E. Dagneaux, and F. Meunier (Eds.) (2002). The International Corpus of
Learner English: Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires
de Louvain.
Juola, P. (1998). Measuring linguistic complexity: the morphological tier. Journal of
Quantitative Linguistics 5 (3), 206–213.
Juola, P. (2008). Assessing linguistic complexity. In M. Miestamo, K. Sinnemäki, and
F. Karlsson (Eds.), Language Complexity: Typology, Contact, Change, pp. 89–108.
Amsterdam, Philadelphia: Benjamins.
Li, M., X. Chen, X. Li, B. Ma, and P. M. B. Vitányi (2004). The similarity metric.
IEEE Transactions on Information Theory 50 (12), 3250–3264.
Li, M. and P. M. B. Vitányi (1997). An introduction to Kolmogorov complexity and
its applications. New York: Springer.
Lubbe, J. C. A. v. d. (1997). Information theory. Cambridge, New York: Cambridge
University Press.
Ortega, L. (2003). Syntactic complexity measures and their relationship to L2
proficiency: A research synthesis of college-level L2 writing. Applied Linguistics 24,
492–518.
Ortega, L. (2012). Interlanguage complexity: A construct in search of theoretical
renewal. In B. Kortmann and B. Szmrecsanyi (Eds.), Linguistic Complexity:
Second Language Acquisition, Indigenization, Contact. Berlin: De Gruyter.
Pallotti, G. (2014). A simple view of linguistic complexity. Second Language Research.
Sadeniemi, M., K. Kettunen, T. Lindh-Knuutila, and T. Honkela (2008). Complexity
of European Union languages: A comparative approach. Journal of Quantitative
Linguistics 15 (2), 185–211.
Sampson, G., D. Gil, and P. Trudgill (2009). Language complexity as an evolving
variable. Oxford, New York: Oxford University Press.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System
Technical Journal 27, 379–423.
Szmrecsanyi, B. and B. Kortmann (2009). Between simplification and
complexification: non-standard varieties of English around the world. In
G. Sampson, D. Gil, and P. Trudgill (Eds.), Language Complexity as an Evolving
Variable, pp. 64–79. Oxford: Oxford University Press.
Szmrecsanyi, B. and B. Kortmann (2012). Introduction: Linguistic complexity –
second language acquisition, indigenization, contact. In B. Szmrecsanyi and
B. Kortmann (Eds.), Linguistic Complexity: Second Language Acquisition,
Indigenization, Contact, pp. 6–34. Berlin: De Gruyter.
Trudgill, P. (2011). Sociolinguistic typology: social determinants of linguistic
complexity. Oxford; New York: Oxford University Press.
Ziv, J. and A. Lempel (1977). A universal algorithm for sequential data compression.
IEEE Transactions on Information Theory IT-23 (3), 337–343.
Bonus material
How compression algorithms see the world
• Ehret (in preparation) re-programs gzip to retrieve an
inspectable lexicon
• input text: Alice’s Adventures in Wonderland
• length of compressed sequences ranges from 3 to 148
characters
• 85% of all strings have a length of three to ten characters
• captures linguistic structure
alice was beginning to
get very tired of sitt
[29,4]ing by her
[15,3] sist
[7,3]er on the bank an
[41,5]d of hav
[40,4]ing noth
[77,7]ing to do
[40,3] on
[102,3]ce or tw
[111,4]ice s
[51,3]he had peep
[94,3]ed in
[37,3]to
[71,5]the book
[94,12]her sister
[151,4]was read
[120,5]ing but it
[55,5]had no pictures
...