Predicting article agglutination in Mauritian Olivier Bonami1 Fabiola Henri2 1 U. Paris-Sorbonne & IUF & LLF [email protected] 2 U. Paris 7 & LLF [email protected] Formal Approaches to Creole Studies Lisbonne, November 2012 Bonami & Henri (Paris/LLF) FACS III 1 / 38 Outline Introduction A semi-automatic exploration of article agglutination The dataset Alternating forms Factors to consider Statistical analysis and modeling Identifying correlations Statistical modeling Towards explaining unexpected correlations Conclusions References Bonami & Henri (Paris/LLF) FACS III 2 / 38 Introduction The issue œ Creole words often agglutinate the phonology of two words from the lexifier œ In French-based Creoles, inherited nouns come from the reanalysis of the sequence det + noun in the lexifier. French determiners la le l dy d@l ma mÕ ind.-n plur.-z œ © © © © © © © © © © noun example trans. tabl tuK li amuK te o tÃt pEK EspEs animo latab letuÄ lili lamuÄ dite delo matÃt mÕpeÄ nespes zanimo ‘table’ ‘turn’ ‘bed’ ‘love’ ‘tea’ ‘water’ ‘aunt’ ‘father’ ‘species’ ‘animal’ Also known as article ‘incorporation’ but is a misnomer + We avoid the term incorporation because of its use in morphology and syntax Bonami & Henri (Paris/LLF) FACS III 3 / 38 Introduction Introduction œ Previous work on article agglutination has focused on œ Why there is substantial agglutination in Mauritian compared to other French-based Creoles (Baker, 1984; Grant, 1995) œ œ œ œ œ œ Different question : what favors agglutination? However, our approach is compatible with an initial substratic influence Predicting variation in the form of the agglutinated string (Strandquist, 2003) Reanalysis of the sequence as a case of interrupted transmission (McWhorter and Parkvall, 2002) We examine visible factors in the lexicon of contemporary French which correlate with article agglutination Interpretation of this correlation œ œ œ Substantiation or not of the previous observations Addition of new factors that we think correlate with the phenomenon We rely thoroughly on quantitative data both on the lexifier and on the creole Bonami & Henri (Paris/LLF) FACS III 4 / 38 A semi-automatic exploration of article agglutination The dataset Outline Introduction A semi-automatic exploration of article agglutination The dataset Alternating forms Factors to consider Statistical analysis and modeling Identifying correlations Statistical modeling Towards explaining unexpected correlations Conclusions References Bonami & Henri (Paris/LLF) FACS III 5 / 38 A semi-automatic exploration of article agglutination The dataset Sources œ Database of 8800 nouns compiled on the basis of forms available in Carpooran (2011) œ œ œ We coded their etyma and language of origin and their agglutinated forms together with information regarding alternation For those 7325 nouns which are undisputably from French origin, we coded their œ œ œ œ frequency of the etyma obtained from Lexique 3 (New et al., 2007) gender obtained from Lexique 3 (New et al., 2007) frequency with the definite compiled from the French subtitles corpus (New and Spinelli, 2012) attestation dates obtained from the CNRTL http://www.cnrtl.fr/ Bonami & Henri (Paris/LLF) FACS III 6 / 38 A semi-automatic exploration of article agglutination The dataset Sources œ œ We examine only a subset of the collected data, i.e. those of undisputably French origin Etymon Size example trans. French 7325 Other 1169 Creole innovation 306 latab lestoma zugader larurut koze sÃte ‘table’ ‘stomach’ ‘player’ ‘arrow-root’ ‘talk’ ‘song’ TOTAL 8800 Agglutination of the French article can be erratically found with inherited nouns of other origin Bonami & Henri (Paris/LLF) Etymon Status example trans. English Arabic Hindi undisputable disputable disputable larurut larak lamaÄ ‘arrow-root’ ‘arak’ ‘rice water’ FACS III 7 / 38 A semi-automatic exploration of article agglutination The dataset Phonetizing the subset œ œ We phonetized the Mauritian nouns based on their orthography With phonetized etymons obtained from flexique, we conducted a semi-automatic reconstruction of phonetic changes from French to Mauritian + Work in Progress: Semi-automatic reconstruction of phonetic changes occuring with inherited words œ With these informations, we are able to provide a matching between the phonetic pattern of a noun and that of its etymon œ Currently available for 4881 of the remaining 7325 nouns Bonami & Henri (Paris/LLF) FACS III 8 / 38 A semi-automatic exploration of article agglutination The dataset Focussing on the definite œ We further focus on the subset which involves agglutination with the definite article + The other types are either rare or ambiguous French article la le l dy d@l ma mÕ ind.-n plur.-z TOTAL œ © © © © © © © © © © noun example trans. size tabl tuK li amuK te o tÃt pEK EspEs animo latab letuÄ lili lamuÄ dite delo matÃt mÕpeÄ nespes zanimo ‘table’ ‘turn’ ‘bed’ ‘love’ ‘tea’ ‘water’ ‘aunt’ ‘father’ ‘species’ ‘animal’ 457 49 11 723 37 3 3 1 4 62 1350 Final count of dataset examined is 4760 Bonami & Henri (Paris/LLF) FACS III 9 / 38 A semi-automatic exploration of article agglutination Alternating forms Outline Introduction A semi-automatic exploration of article agglutination The dataset Alternating forms Factors to consider Statistical analysis and modeling Identifying correlations Statistical modeling Towards explaining unexpected correlations Conclusions References Bonami & Henri (Paris/LLF) FACS III 10 / 38 A semi-automatic exploration of article agglutination Alternating forms Distribution œ Mauritian particular wrt. to agglutination since it has by far reanalyzed the article-noun sequence as a single noun (Grant, 1995) œ (1) Interestingly, it has also developped a subset of alternating forms a. Donn mwa enn liv pwason. give.SF 1SG.STF IND pound fish ‘Give me a pound of fish.’ b. œ Komie ou dir laliv? how_much 2SG.FOR say.SF pound ‘How much do you say the pound? Among the 1240 nouns with an agglutinated form, 275 are alternating according to Carpooran (2011) Bonami & Henri (Paris/LLF) FACS III 11 / 38 A semi-automatic exploration of article agglutination Alternating forms Alternating forms in the count œ œ We don’t know whether both forms appeared simultaneaously + Both old and new imports have been found to be alternating Bare form aggl. form age trans. koloni aeropor Ãtet dÃs lakoloni laeropor lÃtet ladÃs 1308 1922 1838 1172 ‘colony’ ‘airport’ ‘heading’ ‘dance’ Accounting for the distribution of alternating forms is a complicated task for various reasons + It seems that alternating forms are not randomly distributed but are however subject to dialectal variation œ Agglutinated forms may have acquired complex properties that should be confirmed by corpus and experimental data œ We hence limit our study to predicting agglutination for the reasons mentionned above œ Alternating forms are assessed together with agglutinated nonalternating ones Bonami & Henri (Paris/LLF) FACS III 12 / 38 A semi-automatic exploration of article agglutination Factors to consider Outline Introduction A semi-automatic exploration of article agglutination The dataset Alternating forms Factors to consider Statistical analysis and modeling Identifying correlations Statistical modeling Towards explaining unexpected correlations Conclusions References Bonami & Henri (Paris/LLF) FACS III 13 / 38 A semi-automatic exploration of article agglutination Factors to consider Factors contributing to agglutination œ Previous studies have claimed that a number of factors correlate with agglutination in Mauritian (2) a. b. c. d. e. Frequency of collocation Homophony Susbtratic influence Number of syllables Vowel harmony + All these studies are based on handpicked small samples and are in need of quantitative substantiation Bonami & Henri (Paris/LLF) FACS III 14 / 38 A semi-automatic exploration of article agglutination Factors to consider Vowel harmony as a factor œ Following the idea that article agglutination is modelled on noun class prefixes from Bantu languages, Strandquist (2003) argue that vowel harmony, also a characteristic of the same languages, is also determining. It crucially has an effect on those nouns that will be agglutinated and those that won’t + If the article’s vowel is in harmony with the noun, it will be agglutinated. If not then either the vowel harmonizes or agglutination does not occur œ (3) œ a. b. l@ky > liki (the ass > female genitals) l@SjẼ > lisjẼ (the dog > dog) We claim (Contra Strandquist, 2003) that the correlation between vowel harmony and agglutination (le vs li) is real (Fisher’s exact test, p < 0.0001), but it is not categorical and epiphenomenal (60 candidates, 1.3% of the data) + Nouns like letur, ledwa or lekuyÕ, for instance, do not harmonize Bonami & Henri (Paris/LLF) FACS III 15 / 38 A semi-automatic exploration of article agglutination Factors to consider Factors contributing to agglutination œ We further add to the above, the following factors which are generally relevant when it comes to word formation (e.g. Plénat, 2000) (4) œ a. b. c. d. Gender Dissimilation effects Phonotactic preferences Raw frequency Although there can be other factors to be considered (syntactic, semantic, . . .), we limit our study to the above + Our choice is also limited by the data available Bonami & Henri (Paris/LLF) FACS III 16 / 38 Statistical analysis and modeling Identifying correlations Outline Introduction A semi-automatic exploration of article agglutination The dataset Alternating forms Factors to consider Statistical analysis and modeling Identifying correlations Statistical modeling Towards explaining unexpected correlations Conclusions References Bonami & Henri (Paris/LLF) FACS III 17 / 38 Statistical analysis and modeling Identifying correlations 1.0 2000 Relevant factors: length 0 0.6 0.0 0.2 0.4 proportion of agglutination 1000 500 counts 1500 0.8 aggl. bare 1 2 3 4 number of syllables 5 6 1 2 3 4 5 6 number of syllables The difference between monosyllabic and polysyllabic is highly significant (¬2 test p < 2 £ 10°16 ) œ For polysyllabic words, length is barely significant (¬2 test ,p º 0.04) œ Bonami & Henri (Paris/LLF) FACS III 18 / 38 Statistical analysis and modeling Identifying correlations 1.0 Relevant factors: gender 0 counts 0.6 0.4 0.0 0.2 500 1000 1500 proportion of agglutination 0.8 2000 aggl. bare f m NEU Gender (NEU = ambiguous between f and m) Bonami & Henri (Paris/LLF) f m NEU Gender (NEU = ambiguous between f and m) FACS III 19 / 38 Statistical analysis and modeling Identifying correlations 0 0.6 0.4 0.0 0.2 proportion of agglutination 2000 1000 counts 3000 aggl. bare 0.8 1.0 Relevant factors: initial segment type C V Initial segment Bonami & Henri (Paris/LLF) C V Initial segment FACS III 20 / 38 Statistical analysis and modeling Identifying correlations Irrelevant factors: dissimilation 0.8 0.6 0.4 0.0 0 C L Initial segment œ 0.2 1500 500 counts 2500 aggl. bare proportion of agglutination 1.0 We expect a dissimilation effect: etymons beginning in l should disfavor agglutination. 3500 œ V C L V Initial segment We find no such effect. Small effect to the contrary, but barely significant (¬2 test p º 0.036) Bonami & Henri (Paris/LLF) FACS III 21 / 38 Statistical analysis and modeling Identifying correlations Irrelevant factors: homophony We expect a homophony avoidance effect: existence of a verb homonym should favor agglutination. yes Existence of homonym verb œ 0.8 0.6 0.4 0.0 0 no œ 0.2 3000 2000 1000 counts 4000 aggl. bare proportion of agglutination 1.0 œ no yes Existence of homonym verb We do find an effect, but in the opposite direction (¬2 test p º 0.008) Anyway, not enough relevant data for this to be important. Bonami & Henri (Paris/LLF) FACS III 22 / 38 Identifying correlations Relevant factor: age 0.6 0.4 0.2 0.0 200 400 600 800 proportion of agglutination aggl. bare 0.8 1.0 We plot the date of first attestation of an etymon, grouped by century: 1000 œ 0 counts Statistical analysis and modeling 8 9 11 13 15 17 Appearance of etymon 19 8 9 11 13 15 17 19 Appearance of etymon Words with more recent French etymons agglutinate less. (logistic regression likelihood ratio: ¬2 test p < 0.0001) œ Probable combination of factors: frequency, date of Entry in the Mauritian lexicon. œ Bonami & Henri (Paris/LLF) FACS III 23 / 38 Statistical analysis and modeling Identifying correlations Relevant factor: collocation with SG definite article 0.0 0.2 0.4 0.6 0.8 1.0 Agglutination grows when collocation with the definite article grows. Proportion of agglutination œ 0% 50% 100% Frequency of collocation with definite SG article (by quantile) œ Highly significant. (logistic regression likelihood ratio: ¬2 test p < 0.0001) + Surprisingly so, given that we are estimating on the basis of late 20th century frequency information. Bonami & Henri (Paris/LLF) FACS III 24 / 38 Statistical analysis and modeling Identifying correlations Relevant factor: raw frequency of etymon 0.0 0.2 0.4 0.6 0.8 1.0 Agglutination grows when frequency of the etymon. Proportion of agglutination œ 0% 50% 100% Raw frequency, SG+PL (by quantile) œ Highly significant. (logistic regression likelihood ratio: Bonami & Henri (Paris/LLF) ¬2 test p < 0.0001) FACS III 25 / 38 Statistical analysis and modeling Identifying correlations Summing up œ We have seen that the following features of the etymon all contribute individually to predicting article incorporation: Predictor Accuracy Acc. increase Dx ,y Monosyllabicity or Polysyllabicity Initial segment type Gender Frequency of coll. with DEF.SG Raw frequency Age 0.8252101 0.8382353 0.8252101 0.8252101 0.8254202 0.8252101 0 0.0130252 0 0 0.0002101 0 0.180 0.464 0.274 0.274 0.349 0.277 Baseline (no predictor) 0.8252101 œ However each is a pretty bad predictor on its own. œ A series of simple logistic regressions confirm this: no predictor leads to a sizable increase in accuracy. Bonami & Henri (Paris/LLF) FACS III 26 / 38 Statistical analysis and modeling Statistical modeling Outline Introduction A semi-automatic exploration of article agglutination The dataset Alternating forms Factors to consider Statistical analysis and modeling Identifying correlations Statistical modeling Towards explaining unexpected correlations Conclusions References Bonami & Henri (Paris/LLF) FACS III 27 / 38 Statistical analysis and modeling Statistical modeling A multiple logistic regression model œ To assess the joint predictive character of the predictors, we run a multiple logistic regression. œ Model: ~~ ~)= P(agglutination|X Where: œ œ œ œ œ œ œ e Æ+ØX ~ 1 + e Æ+Ø~X Æ = °4.4573 Ø1 X1 = 1.9814 £ monosyllabic Ø2 X2 = 3.8591 £ vowel_initial Ø3 X3 = 0.6507 £ log frequency Ø4 X4 = 0.8935 £ def.sg_rel_frequency Ø5 X5 = 1.2750 £ feminine Ø6 X6 = °0.3156 £ century_of_attestation Bonami & Henri (Paris/LLF) FACS III 28 / 38 Statistical analysis and modeling Statistical modeling Assessing the model œ Simple but unsubtle measure of the quality of the model: + How often does the model correctly assign a probability above 0.5 for agglutinating forms and below 0.5 for bare forms? I.e., how often does it make the right prediction? œ Answer: 88.5% of the time Predictors Accuracy Acc. increase Dx ,y all 0.8798319 0.0546218 0.846 Baseline (no predictor) 0.8252101 + This is an unsubtle measure because it does not make a difference between a probability of 0.51 and a probability of 0.99. Bonami & Henri (Paris/LLF) FACS III 29 / 38 Statistical analysis and modeling Statistical modeling Assessing the model: the gory details œ Good likelihood: ¬2 test p < 0.0001 œ Good predictors: all predictors have a Wald Z value with p < 0.0001. œ Good prediction: rank correlation Dx ,y = 0.846 œ Little overfitting: mean rank correlation for 200 bootstrap samples Dx ,y = 0.845 + It is very unlikely that the stated predictors do not jointly explain the probability of agglutination + All predictors do make some contribution to the model + The model predicts quite reliably the probability of agglutination for any combination of predictor values in the dataset + The model’s accuracy does not change when it is trained on a slightly different dataset. Bonami & Henri (Paris/LLF) FACS III 30 / 38 Statistical analysis and modeling Towards explaining unexpected correlations Outline Introduction A semi-automatic exploration of article agglutination The dataset Alternating forms Factors to consider Statistical analysis and modeling Identifying correlations Statistical modeling Towards explaining unexpected correlations Conclusions References Bonami & Henri (Paris/LLF) FACS III 31 / 38 Statistical analysis and modeling Towards explaining unexpected correlations Are raw frequency and DEF.SG frequency correlated? œ There is no obvious explanation for frequency favoring agglutination One hypothesis to refute: little multicolinearity between raw frequency and proportion of collocation with the DEF.SG article 0.8 0.6 0.4 0.2 0.0 Proportion of collocation with DEF.SG 1.0 œ -2 0 2 4 6 Raw frequency (logarithmic scale) + Linear regression slope 0.011976, R 2 < 0.02 Bonami & Henri (Paris/LLF) FACS III 32 / 38 Statistical analysis and modeling Towards explaining unexpected correlations Gender and age œ The remaining two surprising predictors are gender and age. œ Hypothesis: age might be relevant in the sense that words that belong to the creole from its formation behave with respect to agglutination differently from words borrowed from French by native creole speakers. œ To assess this, we ran separate regressions on newer (post-1800) and older (pre-1800) nouns. Bonami & Henri (Paris/LLF) FACS III 33 / 38 Statistical analysis and modeling Towards explaining unexpected correlations Gender and age, continued predictors monosyllabic vowel_initial frequency def.sg_rel_freq. feminine century Bonami & Henri (Paris/LLF) Older nouns coefficients p values Newer nouns coefficients p values 2.0083 < 0.0001 3.8709 < 0.0001 0.6230 < 0.0001 0.8836 < 0.0001 1.3261 < 0.0001 °0.2604 < 0.0001 Dx ,y = 0.839 0.9618 0.2313 3.5815 < 0.0001 0.6997 0.0030 0.9503 < 0.0001 °0.3621 0.4583 °0.3835 0.2311 Dx ,y = 0.890 FACS III 34 / 38 Statistical analysis and modeling Towards explaining unexpected correlations Gender and age, continued œ œ Conclusion: in the post-creolization period, monosyllabicity, gender and age probably stopped playing a role This might be interpreted as supporting the hypothesis of a substratic influence (Baker, 1984; Grant, 1995): œ œ The substrate influence hypothesis assumes that the French article was analogized by creole speakers to a nominal class marker. The feminine article makes for a much better class marker than the masculine: œ œ œ The feminine has a full vowel, the masculine a droppable schwa Because of schwa drop, the masculine is often identical to the gender neutral prevocalic l . Full adoption of the creole leads to the disappearance of Bantu influence, and hence to the disappearance of a preference for feminine agglutination. Bonami & Henri (Paris/LLF) FACS III 35 / 38 Conclusions Outline Introduction A semi-automatic exploration of article agglutination The dataset Alternating forms Factors to consider Statistical analysis and modeling Identifying correlations Statistical modeling Towards explaining unexpected correlations Conclusions References Bonami & Henri (Paris/LLF) FACS III 36 / 38 Conclusions Conclusions Using statistical modeling over large datasets, we showed that: œ œ Article agglutination happens throughout the history of Mauritian, up to this day. Agglutination comes in two varieties: œ œ Lexically fixed: one form per lexical item, either agglutinated or not. Variable: two forms, with syntactic/semantic conditioning over their respective use + Further research will determine whether this should be considered plain inflectional morphology. œ Various causes contribute to the distribution of agglutination, in a non categorical fashion. œ The causes may have varied over the course of the history of the language. + Where relevant data is available, creole linguistics has some use for the statistician (Robinson, 2008). Bonami & Henri (Paris/LLF) FACS III 37 / 38 References Baker, P. (1984). ‘Agglutinated french articles in creole french: Their evolutionary significance’. TeReo. Carpooran, A. (2011). Diksioner Morisien. Sainte Croix (Mauritius): Koleksion Text Kreol, 2nd Edition. Grant, A. P. (1995). ‘Article agglutination in creole french: a wider perspective’. In P. Baker and C. L. R. Group (eds.), From contact to Creole and beyond, Westminster Creolistics series. University of Westminster Press. McWhorter, J. and Parkvall, M. (2002). ‘Pas tout à fait du français, une étude cr/’eole’. Études Créoles, XXV-1:179–231. New, B., Brysbaert, M., Veronis, J., and Pallier, C. (2007). ‘The use of film subtitles to estimate word frequencies’. Applied Psycholinguistics, 28:661–677. New, B. and Spinelli, E. (2012). ‘Diphones-fr: A french database of diphones positional frequency’. In revision. Plénat, M. (2000). ‘Quelques thèmes de recherche actuels en morphophonologie française’. Cahiers de lexicologie, 77:27–62. Robinson, S. (2008). ‘Why pidgin and creole linguistics need the statistician’. Journal of Pidgin and Creole Languages, 23:141–146. Strandquist, R. E. (2003). Article Incorporation in Mauritian Creole, Master Thesis. Ph.D. thesis, University of Victoria. Bonami & Henri (Paris/LLF) FACS III 38 / 38
© Copyright 2025 Paperzz