Finding Productive Morphology in the Wild: A quantitative analysis of novel morpheme use
in Swahili-English Code-Switching on Twitter
Rich cross-linguistic variation in the morphological marking of syntactic (inflectional) and word
internal (derivational) relations poses questions for theories of morphological productivity.
Additionally, the degree to which these processes may be incorporated into a second language
provides an avenue for studying both patterns of morphological borrowing and their lexical
status. The aim of this study is to analyze morphological borrowing in Swahili-English
code-switching in order to evaluate the environments in which morphological borrowing occurs
(Myers-Scotton 2002) and whether such borrowing is affected by frequency via productivity
(Bybee 1995, Hay 2002, Hay & Baayen 2002). The overall question guiding this investigation is
the following: when a language has a large amount of transparent morphological structure, will
the lexicon likewise contain a higher number of productive morphemes that can be incorporated
during code-switching?
To address this question, the derived and underived frequencies of word forms containing 10
inflectional and 10 derivational morphemes will be evaluated using the Helsinki Corpus of
Swahili (HCS), which contains 13.1 million tokens gathered from literature and news sources.
Examples of tokens containing the inflectional and derivational morphemes under consideration
(underlined) are given in (1):
(1)
a. u-pig-an-aji
   CL11-hit-RECIP-AGENT
   ‘rivalry’
b. u-pig-an-o
   CL11-hit-RECIP-INST
   ‘contest’
c. m-chez-aji
   CL1-play-AGENT
   ‘player’
d. m-chez-o
   CL3-play-INST
   ‘game’
(Mohamed 2001, TUKI 2001)
Example (1) exhibits nominal classification morphology that is often semantically categorized
(e.g. m- in (1c) marks ‘human’), verbal derivational morphology (e.g. -an- in (1a-b) marks
‘reciprocal action’), and nominalization (e.g. -aji marks ‘agent nominalization’).
The ratio of these forms with their component morphemes to the base lacking them¹ will be
graphed, and the degree to which each morpheme is predicted to be productive will be assessed
using a regression analysis of its relative frequencies (Hay & Baayen 2002). The distance of the
regression line (fit to the underived and derived frequencies) from the x = y line in each graph
will be used as the predictor of each morpheme’s degree of productivity.
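The regression step can be illustrated with a minimal sketch. It assumes that frequencies are log-transformed before fitting and that the "distance" to the x = y line is taken as the mean vertical offset of the fitted line over the observed base frequencies. Neither choice is specified in this abstract, and the function names and counts below are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def line_vs_identity(pairs):
    """Compare the fitted line with the x = y line for one morpheme.

    pairs: (base_frequency, derived_frequency) counts, all greater than zero,
    one pair per word type containing the morpheme.
    Returns (slope, intercept, mean_offset), where mean_offset is the mean
    signed vertical distance of the fitted line from x = y in log space.
    The offset definition is an assumption made for this sketch.
    """
    xs = [math.log(base) for base, _ in pairs]
    ys = [math.log(derived) for _, derived in pairs]
    slope, intercept = fit_line(xs, ys)
    offsets = [(slope * x + intercept) - x for x in xs]
    return slope, intercept, sum(offsets) / len(offsets)


# Invented counts for illustration only: each derived form is ten times
# rarer than its base, so the fitted line lies below the x = y line.
slope, intercept, offset = line_vs_identity([(100, 10), (1000, 100), (10000, 1000)])
print(slope, intercept, offset)
```

A negative offset means the derived forms tend to be less frequent than their bases. How that offset should be mapped onto a productivity score is left open here.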
¹ Individually. Note that these forms are given as examples; only bi-morphemic forms will be used.
Morphological productivity will be evaluated using the novel technique of searching a corpus
automatically generated from Twitter data. Via supervised machine learning, the corpus has
been created from tweets written by thousands of users from East Africa. The task of deciding
whether code-switching occurs within a tweet is performed by a trained classification algorithm
tuned to determine accurately whether a given tweet should be included in the corpus or omitted
from it. This corpus, with around 5 million tokens, will be used to find the degree to which
morphemes co-occur with English words. Examples of such combinations are given in (2), where the
underlined forms are English forms and the bold forms represent the morphemes with which they
occur:
(2)
a. ‘Enda u-tweet … U-na-niend-suffocate.’
   CL11-tweet … 2SG-PRES-1SG.OBJ-suffocate
   ‘End of the tweet … you are suffocating me.’
b. ‘Ni-me-kumbuka ku-stuck …’
   1SG-PERF-remember INF-stuck
   ‘I remember getting stuck …’
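One way such co-occurrences might be counted is sketched below. It assumes that tweets are written without morpheme boundaries (e.g. "utweet", "kustuck"), that a prefix list and an English word list are available, and that simple string matching is an acceptable first pass. The names `PREFIXES`, `ENGLISH_WORDS`, and `count_prefixed_english` are hypothetical, and the tiny lists are for illustration only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources for illustration only.
PREFIXES = ["ku", "u", "m"]  # orthographic forms of a few Swahili prefixes
ENGLISH_WORDS = {"tweet", "stuck", "suffocate"}


def count_prefixed_english(tweets, prefixes=PREFIXES, english=ENGLISH_WORDS):
    """Count tokens made of a Swahili prefix directly followed by an English word."""
    # Try longer prefixes first so that "ku" is matched before "u".
    ordered = sorted(prefixes, key=len, reverse=True)
    counts = Counter()
    for tweet in tweets:
        for token in re.findall(r"[a-z]+", tweet.lower()):
            for prefix in ordered:
                if token.startswith(prefix) and token[len(prefix):] in english:
                    counts[prefix] += 1
                    break  # count each token at most once
    return counts


counts = count_prefixed_english(["Enda utweet", "Nimekumbuka kustuck"])
print(counts)
```

String matching of this kind cannot distinguish homophonous prefixes (for instance u- as a class 11 marker versus u- as a second person singular subject marker), so the counts would need manual or model-based disambiguation.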
If morphological borrowing is determined by morphological productivity, and likewise if
frequency is the largest factor for productivity, then morphemes predicted to have higher degrees
of productivity should occur in higher numbers in this corpus. However, if the relationship is
non-significant, then the causal chain must break down at some point: either productivity does
not determine morphological borrowing, or other factors, such as phonological and semantic
plausibility, blocking by non-complex forms, and general language change, have a large
influence on morpheme productivity in Swahili.
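The predicted relationship could be checked, for example, with a rank correlation between each morpheme's predicted productivity and its count in the Twitter corpus. The abstract does not name a specific test, so the use of Spearman's coefficient here is an assumption, and the scores and counts are invented. The sketch computes only the coefficient; a significance test would still be needed.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented values: predicted productivity scores and borrowing counts
# for four hypothetical morphemes.
predicted = [0.2, 0.5, 0.7, 0.9]
borrowed = [3, 10, 8, 40]
print(spearman(predicted, borrowed))
```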
Although such data are noisy, their volatility represents an ideal ground for gathering larger
quantities of data on morphological borrowing and productivity. Such an analysis will both add
to the growing number of methods useful to the linguist and crucially inform theories of
morphological borrowing and productivity, either providing further empirical coverage or
challenging their findings.
Sources
Bybee, J. 1995. Regular morphology and the lexicon. Language and Cognitive Processes, 10(5), 425-455.
Hay, J. 2002. From speech perception to morphology: Affix ordering revisited. Language, 78(3), 527-555.
Hay, J., & Baayen, H. 2002. Parsing and productivity. Yearbook of Morphology 2001 (pp. 203-235).
Springer Netherlands.
Mohamed, M. A. 2001. Modern Swahili Grammar. East African Publishers.
Myers-Scotton, C. 2002. Contact linguistics: Bilingual encounters and grammatical outcomes. Oxford
University Press.
Seidl, A., & Dimitriadis, A. 2003. Statives and reciprocal morphology in Swahili. Typologie des langues
d’Afrique et universaux de la grammaire, 1.
TUKI. 2001. TUKI English-Swahili Dictionary: Kamusi ya Kiingereza-Kiswahili. TUKI - The Institute of
Swahili Research at the University of Dar es Salaam. Laurier Books Ltd. /AES; ISBN: 9976911297.