Predicting article agglutination in Mauritian

Predicting article agglutination in Mauritian
Olivier Bonami1
Fabiola Henri2
1 U. Paris-Sorbonne & IUF & LLF
[email protected]
2 U. Paris 7 & LLF
[email protected]
Formal Approaches to Creole Studies
Lisbonne, November 2012
Bonami & Henri (Paris/LLF)
FACS III
1 / 38
Outline
Introduction
A semi-automatic exploration of article agglutination
The dataset
Alternating forms
Factors to consider
Statistical analysis and modeling
Identifying correlations
Statistical modeling
Towards explaining unexpected correlations
Conclusions
References
Bonami & Henri (Paris/LLF)
FACS III
2 / 38
Introduction
The issue
œ
Creole words often agglutinate the phonology of two words from the
lexifier
œ
In French-based Creoles, inherited nouns come from the reanalysis of
the sequence det + noun in the lexifier.
French determiners
la
le
l
dy
d@l
ma
mÕ
ind.-n
plur.-z
œ
©
©
©
©
©
©
©
©
©
©
noun
example
trans.
tabl
tuK
li
amuK
te
o
tÃt
pEK
EspEs
animo
latab
letuÄ
lili
lamuÄ
dite
delo
matÃt
mÕpeÄ
nespes
zanimo
‘table’
‘turn’
‘bed’
‘love’
‘tea’
‘water’
‘aunt’
‘father’
‘species’
‘animal’
Also known as article ‘incorporation’ but is a misnomer
+ We avoid the term incorporation because of its use in morphology and
syntax
Bonami & Henri (Paris/LLF)
FACS III
3 / 38
Introduction
Introduction
œ
Previous work on article agglutination has focused on
œ
Why there is substantial agglutination in Mauritian compared to other
French-based Creoles (Baker, 1984; Grant, 1995)
œ
œ
œ
œ
œ
œ
Different question : what favors agglutination?
However, our approach is compatible with an initial substratic influence
Predicting variation in the form of the agglutinated string (Strandquist,
2003)
Reanalysis of the sequence as a case of interrupted transmission
(McWhorter and Parkvall, 2002)
We examine visible factors in the lexicon of contemporary French
which correlate with article agglutination
Interpretation of this correlation
œ
œ
œ
Substantiation or not of the previous observations
Addition of new factors that we think correlate with the phenomenon
We rely thoroughly on quantitative data both on the lexifier and on the
creole
Bonami & Henri (Paris/LLF)
FACS III
4 / 38
A semi-automatic exploration of article agglutination
The dataset
Outline
Introduction
A semi-automatic exploration of article agglutination
The dataset
Alternating forms
Factors to consider
Statistical analysis and modeling
Identifying correlations
Statistical modeling
Towards explaining unexpected correlations
Conclusions
References
Bonami & Henri (Paris/LLF)
FACS III
5 / 38
A semi-automatic exploration of article agglutination
The dataset
Sources
œ
Database of 8800 nouns compiled on the basis of forms available in
Carpooran (2011)
œ
œ
œ
We coded their etyma and language of origin and
their agglutinated forms together with information regarding alternation
For those 7325 nouns which are undisputably from French origin, we
coded their
œ
œ
œ
œ
frequency of the etyma obtained from Lexique 3 (New et al., 2007)
gender obtained from Lexique 3 (New et al., 2007)
frequency with the definite compiled from the French subtitles corpus
(New and Spinelli, 2012)
attestation dates obtained from the CNRTL
http://www.cnrtl.fr/
Bonami & Henri (Paris/LLF)
FACS III
6 / 38
A semi-automatic exploration of article agglutination
The dataset
Sources
œ
œ
We examine only a subset of the collected data, i.e. those of
undisputably French origin
Etymon
Size
example
trans.
French
7325
Other
1169
Creole innovation
306
latab
lestoma
zugader
larurut
koze
sÃte
‘table’
‘stomach’
‘player’
‘arrow-root’
‘talk’
‘song’
TOTAL
8800
Agglutination of the French article can be erratically found with
inherited nouns of other origin
Bonami & Henri (Paris/LLF)
Etymon
Status
example
trans.
English
Arabic
Hindi
undisputable
disputable
disputable
larurut
larak
lamaÄ
‘arrow-root’
‘arak’
‘rice water’
FACS III
7 / 38
A semi-automatic exploration of article agglutination
The dataset
Phonetizing the subset
œ
œ
We phonetized the Mauritian nouns based on their orthography
With phonetized etymons obtained from flexique, we conducted a
semi-automatic reconstruction of phonetic changes from French to
Mauritian
+ Work in Progress: Semi-automatic reconstruction of phonetic changes
occuring with inherited words
œ
With these informations, we are able to provide a matching between
the phonetic pattern of a noun and that of its etymon
œ
Currently available for 4881 of the remaining 7325 nouns
Bonami & Henri (Paris/LLF)
FACS III
8 / 38
A semi-automatic exploration of article agglutination
The dataset
Focussing on the definite
œ
We further focus on the subset which involves agglutination with the
definite article
+ The other types are either rare or ambiguous
French article
la
le
l
dy
d@l
ma
mÕ
ind.-n
plur.-z
TOTAL
œ
©
©
©
©
©
©
©
©
©
©
noun
example
trans.
size
tabl
tuK
li
amuK
te
o
tÃt
pEK
EspEs
animo
latab
letuÄ
lili
lamuÄ
dite
delo
matÃt
mÕpeÄ
nespes
zanimo
‘table’
‘turn’
‘bed’
‘love’
‘tea’
‘water’
‘aunt’
‘father’
‘species’
‘animal’
457
49
11
723
37
3
3
1
4
62
1350
Final count of dataset examined is 4760
Bonami & Henri (Paris/LLF)
FACS III
9 / 38
A semi-automatic exploration of article agglutination
Alternating forms
Outline
Introduction
A semi-automatic exploration of article agglutination
The dataset
Alternating forms
Factors to consider
Statistical analysis and modeling
Identifying correlations
Statistical modeling
Towards explaining unexpected correlations
Conclusions
References
Bonami & Henri (Paris/LLF)
FACS III
10 / 38
A semi-automatic exploration of article agglutination
Alternating forms
Distribution
œ
Mauritian particular wrt. to agglutination since it has by far
reanalyzed the article-noun sequence as a single noun (Grant, 1995)
œ
(1)
Interestingly, it has also developped a subset of alternating forms
a.
Donn mwa
enn liv
pwason.
give.SF 1SG.STF IND pound fish
‘Give me a pound of fish.’
b.
œ
Komie
ou
dir
laliv?
how_much 2SG.FOR say.SF pound
‘How much do you say the pound?
Among the 1240 nouns with an agglutinated form, 275 are alternating
according to Carpooran (2011)
Bonami & Henri (Paris/LLF)
FACS III
11 / 38
A semi-automatic exploration of article agglutination
Alternating forms
Alternating forms in the count
œ
œ
We don’t know whether both forms appeared simultaneaously
+ Both old and new imports have been found to be alternating
Bare form
aggl. form
age
trans.
koloni
aeropor
Ãtet
dÃs
lakoloni
laeropor
lÃtet
ladÃs
1308
1922
1838
1172
‘colony’
‘airport’
‘heading’
‘dance’
Accounting for the distribution of alternating forms is a complicated
task for various reasons
+ It seems that alternating forms are not randomly distributed but are
however subject to dialectal variation
œ Agglutinated forms may have acquired complex properties that should
be confirmed by corpus and experimental data
œ
We hence limit our study to predicting agglutination for the reasons
mentionned above
œ
Alternating forms are assessed together with agglutinated
nonalternating ones
Bonami & Henri (Paris/LLF)
FACS III
12 / 38
A semi-automatic exploration of article agglutination
Factors to consider
Outline
Introduction
A semi-automatic exploration of article agglutination
The dataset
Alternating forms
Factors to consider
Statistical analysis and modeling
Identifying correlations
Statistical modeling
Towards explaining unexpected correlations
Conclusions
References
Bonami & Henri (Paris/LLF)
FACS III
13 / 38
A semi-automatic exploration of article agglutination
Factors to consider
Factors contributing to agglutination
œ
Previous studies have claimed that a number of factors correlate with
agglutination in Mauritian
(2)
a.
b.
c.
d.
e.
Frequency of collocation
Homophony
Susbtratic influence
Number of syllables
Vowel harmony
+ All these studies are based on handpicked small samples and are in
need of quantitative substantiation
Bonami & Henri (Paris/LLF)
FACS III
14 / 38
A semi-automatic exploration of article agglutination
Factors to consider
Vowel harmony as a factor
œ
Following the idea that article agglutination is modelled on noun class
prefixes from Bantu languages, Strandquist (2003) argue that vowel
harmony, also a characteristic of the same languages, is also
determining.
It crucially has an effect on those nouns that will be agglutinated and
those that won’t
+ If the article’s vowel is in harmony with the noun, it will be
agglutinated. If not then either the vowel harmonizes or agglutination
does not occur
œ
(3)
œ
a.
b.
l@ky > liki (the ass > female genitals)
l@SjẼ > lisjẼ (the dog > dog)
We claim (Contra Strandquist, 2003) that the correlation between
vowel harmony and agglutination (le vs li) is real (Fisher’s exact test,
p < 0.0001), but it is not categorical and epiphenomenal (60
candidates, 1.3% of the data)
+ Nouns like letur, ledwa or lekuyÕ, for instance, do not harmonize
Bonami & Henri (Paris/LLF)
FACS III
15 / 38
A semi-automatic exploration of article agglutination
Factors to consider
Factors contributing to agglutination
œ
We further add to the above, the following factors which are generally
relevant when it comes to word formation (e.g. Plénat, 2000)
(4)
œ
a.
b.
c.
d.
Gender
Dissimilation effects
Phonotactic preferences
Raw frequency
Although there can be other factors to be considered (syntactic,
semantic, . . .), we limit our study to the above
+ Our choice is also limited by the data available
Bonami & Henri (Paris/LLF)
FACS III
16 / 38
Statistical analysis and modeling
Identifying correlations
Outline
Introduction
A semi-automatic exploration of article agglutination
The dataset
Alternating forms
Factors to consider
Statistical analysis and modeling
Identifying correlations
Statistical modeling
Towards explaining unexpected correlations
Conclusions
References
Bonami & Henri (Paris/LLF)
FACS III
17 / 38
Statistical analysis and modeling
Identifying correlations
1.0
2000
Relevant factors: length
0
0.6
0.0
0.2
0.4
proportion of agglutination
1000
500
counts
1500
0.8
aggl.
bare
1
2
3
4
number of syllables
5
6
1
2
3
4
5
6
number of syllables
The difference between monosyllabic and polysyllabic is highly significant (¬2 test
p < 2 £ 10°16 )
œ For polysyllabic words, length is barely significant (¬2 test ,p º 0.04)
œ
Bonami & Henri (Paris/LLF)
FACS III
18 / 38
Statistical analysis and modeling
Identifying correlations
1.0
Relevant factors: gender
0
counts
0.6
0.4
0.0
0.2
500
1000
1500
proportion of agglutination
0.8
2000
aggl.
bare
f
m
NEU
Gender (NEU = ambiguous between f and m)
Bonami & Henri (Paris/LLF)
f
m
NEU
Gender (NEU = ambiguous between f and m)
FACS III
19 / 38
Statistical analysis and modeling
Identifying correlations
0
0.6
0.4
0.0
0.2
proportion of agglutination
2000
1000
counts
3000
aggl.
bare
0.8
1.0
Relevant factors: initial segment type
C
V
Initial segment
Bonami & Henri (Paris/LLF)
C
V
Initial segment
FACS III
20 / 38
Statistical analysis and modeling
Identifying correlations
Irrelevant factors: dissimilation
0.8
0.6
0.4
0.0
0
C
L
Initial segment
œ
0.2
1500
500
counts
2500
aggl.
bare
proportion of agglutination
1.0
We expect a dissimilation effect: etymons beginning in l should
disfavor agglutination.
3500
œ
V
C
L
V
Initial segment
We find no such effect. Small effect to the contrary, but barely
significant (¬2 test p º 0.036)
Bonami & Henri (Paris/LLF)
FACS III
21 / 38
Statistical analysis and modeling
Identifying correlations
Irrelevant factors: homophony
We expect a homophony avoidance effect: existence of a verb
homonym should favor agglutination.
yes
Existence of homonym verb
œ
0.8
0.6
0.4
0.0
0
no
œ
0.2
3000
2000
1000
counts
4000
aggl.
bare
proportion of agglutination
1.0
œ
no
yes
Existence of homonym verb
We do find an effect, but in the opposite direction (¬2 test p º 0.008)
Anyway, not enough relevant data for this to be important.
Bonami & Henri (Paris/LLF)
FACS III
22 / 38
Identifying correlations
Relevant factor: age
0.6
0.4
0.2
0.0
200
400
600
800
proportion of agglutination
aggl.
bare
0.8
1.0
We plot the date of first attestation of an etymon, grouped by century:
1000
œ
0
counts
Statistical analysis and modeling
8 9
11
13
15
17
Appearance of etymon
19
8 9
11
13
15
17
19
Appearance of etymon
Words with more recent French etymons agglutinate less. (logistic regression
likelihood ratio: ¬2 test p < 0.0001)
œ Probable combination of factors: frequency, date of Entry in the Mauritian lexicon.
œ
Bonami & Henri (Paris/LLF)
FACS III
23 / 38
Statistical analysis and modeling
Identifying correlations
Relevant factor: collocation with SG definite article
0.0
0.2
0.4
0.6
0.8
1.0
Agglutination grows when collocation with the definite article grows.
Proportion of agglutination
œ
0%
50%
100%
Frequency of collocation with definite SG article (by quantile)
œ
Highly significant. (logistic regression likelihood ratio:
¬2 test p < 0.0001)
+ Surprisingly so, given that we are estimating on the basis of late 20th
century frequency information.
Bonami & Henri (Paris/LLF)
FACS III
24 / 38
Statistical analysis and modeling
Identifying correlations
Relevant factor: raw frequency of etymon
0.0
0.2
0.4
0.6
0.8
1.0
Agglutination grows when frequency of the etymon.
Proportion of agglutination
œ
0%
50%
100%
Raw frequency, SG+PL (by quantile)
œ
Highly significant. (logistic regression likelihood ratio:
Bonami & Henri (Paris/LLF)
¬2 test p < 0.0001)
FACS III
25 / 38
Statistical analysis and modeling
Identifying correlations
Summing up
œ
We have seen that the following features of the etymon all contribute
individually to predicting article incorporation:
Predictor
Accuracy
Acc. increase
Dx ,y
Monosyllabicity or Polysyllabicity
Initial segment type
Gender
Frequency of coll. with DEF.SG
Raw frequency
Age
0.8252101
0.8382353
0.8252101
0.8252101
0.8254202
0.8252101
0
0.0130252
0
0
0.0002101
0
0.180
0.464
0.274
0.274
0.349
0.277
Baseline (no predictor)
0.8252101
œ
However each is a pretty bad predictor on its own.
œ
A series of simple logistic regressions confirm this: no predictor leads
to a sizable increase in accuracy.
Bonami & Henri (Paris/LLF)
FACS III
26 / 38
Statistical analysis and modeling
Statistical modeling
Outline
Introduction
A semi-automatic exploration of article agglutination
The dataset
Alternating forms
Factors to consider
Statistical analysis and modeling
Identifying correlations
Statistical modeling
Towards explaining unexpected correlations
Conclusions
References
Bonami & Henri (Paris/LLF)
FACS III
27 / 38
Statistical analysis and modeling
Statistical modeling
A multiple logistic regression model
œ
To assess the joint predictive character of the predictors, we run a
multiple logistic regression.
œ
Model:
~~
~)=
P(agglutination|X
Where:
œ
œ
œ
œ
œ
œ
œ
e Æ+ØX
~
1 + e Æ+Ø~X
Æ = °4.4573
Ø1 X1 = 1.9814 £ monosyllabic
Ø2 X2 = 3.8591 £ vowel_initial
Ø3 X3 = 0.6507 £ log frequency
Ø4 X4 = 0.8935 £ def.sg_rel_frequency
Ø5 X5 = 1.2750 £ feminine
Ø6 X6 = °0.3156 £ century_of_attestation
Bonami & Henri (Paris/LLF)
FACS III
28 / 38
Statistical analysis and modeling
Statistical modeling
Assessing the model
œ
Simple but unsubtle measure of the quality of the model:
+ How often does the model correctly assign a probability above 0.5 for
agglutinating forms and below 0.5 for bare forms?
I.e., how often does it make the right prediction?
œ
Answer: 88.5% of the time
Predictors
Accuracy
Acc. increase
Dx ,y
all
0.8798319
0.0546218
0.846
Baseline (no predictor)
0.8252101
+ This is an unsubtle measure because it does not make a difference
between a probability of 0.51 and a probability of 0.99.
Bonami & Henri (Paris/LLF)
FACS III
29 / 38
Statistical analysis and modeling
Statistical modeling
Assessing the model: the gory details
œ
Good likelihood: ¬2 test p < 0.0001
œ
Good predictors: all predictors have a Wald Z value with p < 0.0001.
œ
Good prediction: rank correlation Dx ,y = 0.846
œ
Little overfitting: mean rank correlation for 200 bootstrap samples
Dx ,y = 0.845
+ It is very unlikely that the stated predictors do not jointly explain the
probability of agglutination
+ All predictors do make some contribution to the model
+ The model predicts quite reliably the probability of agglutination for
any combination of predictor values in the dataset
+ The model’s accuracy does not change when it is trained on a slightly
different dataset.
Bonami & Henri (Paris/LLF)
FACS III
30 / 38
Statistical analysis and modeling
Towards explaining unexpected correlations
Outline
Introduction
A semi-automatic exploration of article agglutination
The dataset
Alternating forms
Factors to consider
Statistical analysis and modeling
Identifying correlations
Statistical modeling
Towards explaining unexpected correlations
Conclusions
References
Bonami & Henri (Paris/LLF)
FACS III
31 / 38
Statistical analysis and modeling
Towards explaining unexpected correlations
Are raw frequency and DEF.SG frequency correlated?
œ
There is no obvious explanation for frequency favoring agglutination
One hypothesis to refute: little multicolinearity between raw frequency
and proportion of collocation with the DEF.SG article
0.8
0.6
0.4
0.2
0.0
Proportion of collocation with DEF.SG
1.0
œ
-2
0
2
4
6
Raw frequency (logarithmic scale)
+ Linear regression slope 0.011976, R 2 < 0.02
Bonami & Henri (Paris/LLF)
FACS III
32 / 38
Statistical analysis and modeling
Towards explaining unexpected correlations
Gender and age
œ
The remaining two surprising predictors are gender and age.
œ
Hypothesis: age might be relevant in the sense that words that belong
to the creole from its formation behave with respect to agglutination
differently from words borrowed from French by native creole speakers.
œ
To assess this, we ran separate regressions on newer (post-1800) and
older (pre-1800) nouns.
Bonami & Henri (Paris/LLF)
FACS III
33 / 38
Statistical analysis and modeling
Towards explaining unexpected correlations
Gender and age, continued
predictors
monosyllabic
vowel_initial
frequency
def.sg_rel_freq.
feminine
century
Bonami & Henri (Paris/LLF)
Older nouns
coefficients p values
Newer nouns
coefficients p values
2.0083 < 0.0001
3.8709 < 0.0001
0.6230 < 0.0001
0.8836 < 0.0001
1.3261 < 0.0001
°0.2604 < 0.0001
Dx ,y = 0.839
0.9618
0.2313
3.5815 < 0.0001
0.6997
0.0030
0.9503 < 0.0001
°0.3621
0.4583
°0.3835
0.2311
Dx ,y = 0.890
FACS III
34 / 38
Statistical analysis and modeling
Towards explaining unexpected correlations
Gender and age, continued
œ
œ
Conclusion: in the post-creolization period, monosyllabicity, gender
and age probably stopped playing a role
This might be interpreted as supporting the hypothesis of a substratic
influence (Baker, 1984; Grant, 1995):
œ
œ
The substrate influence hypothesis assumes that the French article was
analogized by creole speakers to a nominal class marker.
The feminine article makes for a much better class marker than the
masculine:
œ
œ
œ
The feminine has a full vowel, the masculine a droppable schwa
Because of schwa drop, the masculine is often identical to the gender
neutral prevocalic l .
Full adoption of the creole leads to the disappearance of Bantu
influence, and hence to the disappearance of a preference for feminine
agglutination.
Bonami & Henri (Paris/LLF)
FACS III
35 / 38
Conclusions
Outline
Introduction
A semi-automatic exploration of article agglutination
The dataset
Alternating forms
Factors to consider
Statistical analysis and modeling
Identifying correlations
Statistical modeling
Towards explaining unexpected correlations
Conclusions
References
Bonami & Henri (Paris/LLF)
FACS III
36 / 38
Conclusions
Conclusions
Using statistical modeling over large datasets, we showed that:
œ
œ
Article agglutination happens throughout the history of Mauritian, up
to this day.
Agglutination comes in two varieties:
œ
œ
Lexically fixed: one form per lexical item, either agglutinated or not.
Variable: two forms, with syntactic/semantic conditioning over their
respective use
+ Further research will determine whether this should be considered plain
inflectional morphology.
œ
Various causes contribute to the distribution of agglutination, in a non
categorical fashion.
œ
The causes may have varied over the course of the history of the
language.
+ Where relevant data is available, creole linguistics has some use for the
statistician (Robinson, 2008).
Bonami & Henri (Paris/LLF)
FACS III
37 / 38
References
Baker, P. (1984). ‘Agglutinated french articles in creole french: Their evolutionary significance’.
TeReo.
Carpooran, A. (2011). Diksioner Morisien. Sainte Croix (Mauritius): Koleksion Text Kreol, 2nd
Edition.
Grant, A. P. (1995). ‘Article agglutination in creole french: a wider perspective’. In P. Baker
and C. L. R. Group (eds.), From contact to Creole and beyond, Westminster Creolistics
series. University of Westminster Press.
McWhorter, J. and Parkvall, M. (2002). ‘Pas tout à fait du français, une étude cr/’eole’. Études
Créoles, XXV-1:179–231.
New, B., Brysbaert, M., Veronis, J., and Pallier, C. (2007). ‘The use of film subtitles to
estimate word frequencies’. Applied Psycholinguistics, 28:661–677.
New, B. and Spinelli, E. (2012). ‘Diphones-fr: A french database of diphones positional
frequency’. In revision.
Plénat, M. (2000). ‘Quelques thèmes de recherche actuels en morphophonologie française’.
Cahiers de lexicologie, 77:27–62.
Robinson, S. (2008). ‘Why pidgin and creole linguistics need the statistician’. Journal of Pidgin
and Creole Languages, 23:141–146.
Strandquist, R. E. (2003). Article Incorporation in Mauritian Creole, Master Thesis. Ph.D.
thesis, University of Victoria.
Bonami & Henri (Paris/LLF)
FACS III
38 / 38