Modeling stress assignment in English noun-noun compounds: a quantitative perspective

Gero Kunter, Ingo Plag, Sabine Lappe & Maria Braun
Universität Siegen
Conference Quantitative Investigations in Theoretical Linguistics 2, 1-2 June 2005, Osnabrück

The problem
• compounds in English are stressed on the left-hand member (e.g. bláckboard, wátchmaker)
• nuclear stress rule vs. compound stress rule (Chomsky and Halle 1968:17)
• many unexplained exceptions, and cross-variety variation (e.g. BrE vs. AmE):
  Boston márathon, summer níght, may flówers, Penny Láne, aluminum fóil, silk tíe
In general:
• claims on compound stress are largely based on anecdotal evidence and introspection
• no systematic large-scale empirical evidence available yet

Three approaches
1. The structural hypothesis (e.g. Giegerich 2004, Bloomfield 1933, Lees 1963, Marchand 1969 or Payne/Huddleston 2002)
• modifier-head structures are regularly stressed on the RIGHT constituent (steel brídge)
• argument-head structures are always LEFT-stressed (ópera singer)
• left stress on modifier-head structures is due to lexicalization (ópera glasses)
2. The semantic hypothesis (e.g. Fudge 1984, Ladd 1984, Liberman and Sproat 1992, Olsen 2000, 2001)
• stress assignment according to semantic categories
3. The analogical hypothesis (e.g. Schmerling 1971, Liberman and Sproat 1992, Plag 2006)
• stress assignment in analogy to similar compounds in the lexicon

Testing the hypotheses
• Plag (2006, experimental study): all three types of factor interact in compound stress assignment in complex ways.
• this paper: corpus study testing the three hypotheses more thoroughly
  - many more different word types
  - many more tokens
  - many more semantic relations
  - computational modeling of analogical effects
• Data
  - Boston University Radio Speech Corpus (Ostendorf et al. 1996) (N = 4410, V = 2476, AmE)
  - CELEX lexical data base (Baayen et al. 1995) (N = 4491, V = N, BrE)

Boston Corpus: Example
The device is attached to a plastic wristband. It looks like a watch. It functions like an electronic probation officer. When a computerized call is made to a former prisoner's home phone, that person answers by plugging in the device. The wristband can be removed only by breaking its clasp, and if that's done the inmate is immediately returned to jail. The description conjures up images of big brother watching. But Jay Ash, deputy superintendent of the Hampton County jail in Springfield, says the surveillance system is not that sinister.

Procedure (cf. Farnetani et al. 1988, Ingram et al. 2003, Plag 2006)
Step 1
Measure the mean fundamental frequency (F0) of the main stressed vowels of the two members, and calculate the difference (left F0 minus right F0, logarithmically transformed into semitones (ST), 'pitch difference')
  wrístband    +5.39 ST
  home phóne   -0.97 ST
Step 2
Look for statistically significant pitch differences between distinct kinds of compound
Example: Left-headed compounds (such as attorney géneral) should have a significantly smaller pitch difference than right-headed compounds (e.g. wrístband)

Illustration of measurements
Right-headed vs. left-headed compounds in Boston Corpus
[Figure: pitch difference in semitones for right-headed (e.g. wrístband) and left-headed (e.g. attorney géneral) compounds; group means shown: 0.052 and 3.332 ST]
t (4408) = 4.91, p < 0.01, Cohen's d = 0.80

Boston Corpus: Structural hypothesis
Argument-head vs. modifier-head compounds
[Figure: pitch difference in semitones for argument-head and modifier-head compounds; group means shown: 3.250 and 3.736 ST]
• significant difference, but large overlap between the two groups
• effect size is very small
t (4089) = 2.36, p < 0.05, Cohen's d = 0.01

Boston Corpus: Structural hypothesis
A closer look at argument-head vs. modifier-head compounds
  morphology of head   argument-head    modifier-head
  -er                  law makers       house speaker
  -ing                 fundraising      spring training
  -ion                 jury selection   health education
  conversion           tax increase     litmus test
(also, with low frequency: -age, -al, -ance, …)

Boston Corpus: Structural hypothesis
Interaction of structure and morphology of head
[Figure: pitch difference in semitones for argument-head vs. modifier-head compounds, one panel per head morphology: ing (N=81), ion (N=198), con (N=572), er (N=349); one annotation reads "not significant"]
F (9, 4062) = 2.89, p < 0.01, R² = 0.015

Boston Corpus: Lexicalization effect?
Two ways of quantifying lexicalization
- Frequency: higher frequency should correlate with a higher degree of lexicalization
- Spelling: lexicalized compounds are more prone to one-word spellings

Boston Corpus: Lexicalization effect?
Pitch difference by Google frequency
[Figure: two panels plotting pitch difference in semitones against log frequency (Google)]
• relation between pitch difference and Google frequency shows an S-shaped distribution
• typical of categorical changes
• only very small tendency for highly frequent compounds to be more left-stressed
  F (1, 4071) = 15.58, p < 0.001, R² = 0.004
• no difference between AH or MH compounds
  F (1, 4069) < 1

Boston Corpus: Lexicalization effect?
Spelling and lexicalization
Assumptions:
• one-word spellings are indicative of lexicalization
• high frequency is indicative of lexicalization
Prediction: compounds spelled as one word should have higher frequency than those spelled as two words
Results:
• expected effect
• large effect size
⇒ spelling is an indicator of lexicalization
[Figure: frequency by spelling, modifier-head compounds; log frequency (Google) for one-word vs. two-word spellings]
t (3388) = 15.58, p < 0.001, Cohen's d = 0.89

Boston Corpus: Lexicalization effect?
Interaction between structure and spelling
Predictions:
• Modifier-Head compounds spelled as one word should be more left-stressed than Modifier-Head compounds spelled as two words
• no effect of that kind with Argument-Head compounds
Results:
• Modifier-Head compounds spelled as one word are indeed more left-stressed
• spelling of Argument-Head compounds does not interact with stress position
• only very weak effect
[Figure: pitch difference in semitones by spelling (one word vs. two words) for Argument-Head and Modifier-Head compounds]
F (3, 4030) = 12.79, p < 0.001, R² = 0.009

Boston Corpus: Structural hypothesis
A summary
• significant effect of argument vs. modifier only with a subset of potential compounds (i.e. those with -er as right-hand head morpheme)
• a measurable lexicalization effect (based on frequency and spelling)
• effect sizes are all very small; a lot of the variation is unaccounted for under this hypothesis
The structural hypothesis is not well supported by the data

Boston Corpus: Semantic hypothesis
Methodological problems
• Semantic categories and semantic relations mentioned in the literature (such as 'N2 is a material', 'N2 is located at N1') are hard to test because they are generally ill-defined
• Items are often ambiguous (i.e. show more than one relation)
• The number of potentially relevant semantic categories and relations is unclear
Our methodology
• We used a set of 18 semantic relations (based mainly on Levi 1978), also widely used in studies on compound interpretation (e.g. Gagné & Shoben 1997, Gagné 2001)
• Semantic classification was done by two independent raters; only those data are analyzed where the two ratings agreed

Boston Corpus: Semantic hypothesis
The literature on rightward stress makes use of either
• categories referring to constituents or the compound as a whole, or
• categories referring to semantic relation

Boston Corpus: Semantic hypothesis
Categories referring to constituents or the compound as a whole
Rightward stress is predicted if...
• N1 refers to a period or point in time (morning edition)
• N2 is a geographical term (Boston area)
• N2 is a type of thoroughfare (Sesame Street)
• N1 and N2 form a proper noun (Tufts University)
(e.g. Fudge 1984: 144ff, Liberman & Sproat 1992)

Boston Corpus: Semantic hypothesis
Categories referring to constituents or the compound as a whole
[Figure: pitch difference in semitones (yes vs. no), one panel per category: N1 refers to POINT OF TIME?, N1 is a PROPER NOUN?, N2 is a THOROUGHFARE?, Compound is a PROPER NOUN?, N2 is a GEOGRAPHICAL TERM?]
F (7, 4130) = 9.19, p < 0.01, R² = 0.0136

Boston Corpus: Semantic hypothesis
Categories referring to semantic relation
Rightward stress is predicted if...
• N2 DURING N1 (summer vacations)
• N2 IS LOCATED AT N1 (Newton residents)
• N2 IS MADE OF N1 (canvas bags)
• N1 MAKES N2 (Weld plan)
(e.g. Fudge 1984: 144ff, Liberman & Sproat 1992)
additional categories (18 in total):
• N1 HAS N2 (wheel chair)
• N2 USES N1 (breath test)
• N2 FOR N1 (adult prisons)
• N2 CAUSES N1 (AIDS virus)
• …

Boston Corpus: Semantic hypothesis
Categories referring to semantic relation
[Figure: pitch difference in semitones (yes vs. no), one panel per relation: N2 DURING N1, N2 LOCATED AT/IN N1, N1 MAKES N2, N2 IS MADE OF N1, N1 HAS N2, N2 USES N1, N2 FOR N1]
F (7, 2036) = 20.53, p < 0.01, R² = 0.063

Boston Corpus: Semantic hypothesis
A summary
• Some predictions are correct
• Some predictions are wrong (i.e. no effect found)
• Some effects are found where no prediction is made
• A lot of the variation is unaccounted for under this hypothesis
The semantic hypothesis is not well supported by the data

Boston Corpus: Analogical hypothesis
Analogical modeling is not yet possible, due to gradient stress measurements
(But see Kunter/Plag (2006) on how this can be done)

CELEX: General overview
Contents
• Oxford Advanced Learner's Dictionary (1974): 41,000 lemmata
• Longman Dict. of Contemp. Engl. (1978): 53,000 lemmata
• COBUILD corpus (92%), 17.9 million word tokens
• overall: 52,446 lemmata representing 160,594 wordforms
Position of stress is given for each entry in the data base
N (NN compounds) = 4491
[Figure: stress position (left vs. right) of the NN compounds; segment labels 90% and 10%]

CELEX: Structural hypothesis
Argument-head vs. modifier-head compounds
[Figure: stress position (left vs. right) for modifier-head and argument-head compounds]
• significant difference is in the direction predicted by the hypothesis (i.e. more left stress with argument-head compounds)
• but: the vast majority of modifier-head compounds is also left-stressed, which goes against the hypothesis
χ² = 8.55, df = 1, p < 0.01, φ = 0.05

CELEX: Structural hypothesis
Interaction of structure and morphology of head
[Figure: stress position (left vs. right) by structure (argument-head, modifier-head) and morphology of head (con, er, ing, ion)]
• same significant interaction as in the Boston corpus (BURSC)
• other interactions are not significant
• significant effect of argument vs. modifier only with a subset of potential compounds (i.e. those with -er as right-hand head morpheme)
logit regression, null dev. = 396.64, df = 680; residual dev. = 354.23, df = 673

CELEX: Lexicalization effect?
Frequency and stress position
Assumptions:
• lexicalized compounds prefer left stress
• lexicalized compounds are more frequent
Prediction: left-stressed compounds should have higher frequency than right-stressed compounds
Results:
• Google log frequencies are not different for left- or right-stressed compounds (t (4470) = 1.097, p = 0.27)
• no interaction of stress position and structure (F (1, 4467) = 2.47, p = 0.12)
⇒ stress position is not related to frequency
[Figure: log Google frequency for left-stressed vs. right-stressed compounds]

CELEX: Lexicalization effect?
Spelling and stress position
[Figure: stress position (left vs. right) by spelling (one word, hyphenated, two words)]
• the more lexicalized (in terms of spelling), the more frequent is left stress
• no difference between argument-head and modifier-head compounds
⇒ evidence for general lexicalization effect on stress
χ² = 512.08, df = 2, p < 0.01

CELEX: Structural hypothesis
A summary
• significant effect of argument vs. modifier only with a subset of potential compounds (i.e. those with -er as right-hand head morpheme)
• BUT: the vast majority of compounds do not behave in accordance with the hypothesis
• measurable general lexicalization effect (only w.r.t. spelling)
The structural hypothesis is not supported by the data

CELEX: Semantic hypothesis
Categories referring to constituents or the compound as a whole
[Figure: stress position (left vs. right) by yes/no, one panel per category: N1 refers to POINT OF TIME?, N1 is a PROPER NOUN?, N2 is a THOROUGHFARE?, Compound is a PROPER NOUN?, N2 is a GEOGRAPHICAL TERM?]
logit regression, null dev. = 2784.0, df = 4125; residual dev. = 2693.1, df = 4120

CELEX: Semantic hypothesis
Categories referring to semantic relation
[Figure: stress position (left vs. right) by yes/no, one panel per relation: N1 IS N2, N2 FOR N1, N2 IS NAMED AFTER N1, N2 IS MADE OF N1, N2 LOCATED AT/IN N1, N2 DURING N1]
logit regression, null dev. = 1149.16, df = 1629; residual dev. = 967.41, df = 1614

CELEX: Semantic hypothesis
Summary
• Some predictions go in the right direction, but leave lots of data unexplained
• Some predictions are wrong (i.e. no effect found)
• Some effects are found where no prediction is made
• A lot of the variation is unaccounted for
The semantic hypothesis is not well supported by the data

CELEX: Analogical hypothesis
Specific hypothesis: Stress in compounds is determined by the stress pattern of the majority of similar instances that are stored in memory.
Example: cárpet beater is assigned left stress because the most similar exemplar stored in memory, éggbeater, also has left stress.

CELEX: Analogical hypothesis
The data
Compounds whose left and right members occur more than once in the corpus, i.e. for which the model has information about constituent families of the members.
N = 2643 (Ntotal = 4491)
The model
Memory-based learner (TiMBL 5.1, Daelemans et al. 2004)

How does TiMBL work?
INPUT
  oil painting: oil, painting, noarg, -ing, semcat1

INSTANCE-BASED MEMORY
  action,  painting, noarg, -ing,   semcat1, stress: left
  finger,  painting, noarg, -ing,   semcat1, stress: left
  wall,    painting, arg,   -ing,   semcat1, stress: left
  country, party,    noarg, nosuff, semcat2, stress: left
  cottage, hospital, noarg, nosuff, semcat3, stress: right
  life,    work,     noarg, nosuff, semcat2, stress: right
  ...

SET OF NEAREST NEIGHBOURS
  action, painting, noarg, -ing, semcat1, stress: left
  finger, painting, noarg, -ing, semcat1, stress: left

Evaluation of input against nearest neighbours: stress left: 2x, stress right: 0x

OUTPUT
  oíl painting: stress left

How does TiMBL perform?
- 94% overall accuracy
- predictive accuracy for right stresses: 20-25%

Which features does TiMBL find useful?
Any given set of features does the same thing as any other set of features (about 94% accuracy): semantic categories, semantic relations, proper-noun status, morphological structure, argument-head status, left and/or right member
⇒ No abstract features needed, left and right member can do the job
Left and right member are in fact better predictors than the other features, because only when we leave out left and/or right member do we find a significant drop in performance (left member: Yates' χ² = 7.42, p < 0.01, right member: Yates' χ² = 4.83, p < 0.05, left and right member: Yates' χ² = 4.83, p < 0.05)

The results of analogical modeling
Accuracy
• pretty good overall predictive accuracy
• better predictive accuracy than under any other hypothesis
Predictors
• none of the grammatical/semantic features proposed in the literature improves predictive accuracy
Theoretical implication
• constituent families (and thus analogy) play an important role in compound stress assignment

Summary and Conclusion
• Corpus data do not confirm the categorical stress assignment rules found in the literature
• Compound stress is much more variable than previously thought
• Argument structure effects are restricted to compounds ending in -er
• There are only small effects of only some of the semantic categories proposed in the literature
• Assignment of rightward stress is problematic for TiMBL. But TiMBL is better than any other "rule" found in the literature
• Analogical effects based on constituent families play an important role in compound stress assignment

Thank you very much for your attention!

Acknowledgements
- Christian Grau, Christina Kellenter, and Taivi Rüüberg for their help with annotating the data
- Harald Baayen and Mark Pluymaekers for statistical training and support, and for critical comments and suggestions
- Heinz Giegerich for critical comments and support
- Deutsche Forschungsgemeinschaft for funding this research (Grant PL151/5-1)

References
Baayen, R. Harald, R. Piepenbrock, and L. Gulikers (1995) The CELEX lexical database (CD-ROM). Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Bloomfield, Leonard (1933) Language. Chicago: Holt.
Chomsky, Noam and Morris Halle (1968) The Sound Pattern of English. New York: Harper and Row.
Daelemans, Walter, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch (2004) TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide. ILK Technical Report 04-02, available from http://ilk.uvt.nl/downloads/pub/papers/ilk0402.pdf.
Farnetani, Edda and Piero Cosi (1988) English compound versus non-compound noun phrases in discourse: An acoustic and perceptual study. Language and Speech 31, 157-180.
Fudge, Eric C. (1984) English word-stress. London: George Allen & Unwin.
Gagné, Christina (2001) Relation and lexical priming during the interpretation of noun-noun combinations. Journal of Experimental Psychology: Learning, Memory and Cognition 27, 236-254.
Gagné, Christina and Edward J. Shoben (1997) Influence of thematic relations on the comprehension of modifier-noun combinations. Journal of Experimental Psychology: Learning, Memory and Cognition 23, 71-87.
Giegerich, Heinz (2004) Compound or phrase? English noun-plus-noun constructions and the stress criterion. English Language and Linguistics 8, 1-24.
Ingram, John, Thi Anh Thu Nguyen and Rob Pensalfini (2003) An acoustic analysis of compound and phrasal stress patterns in Australian English. Submitted for publication.
Kunter, Gero and Ingo Plag (2006) What is compound stress? On the phonetics and phonology of prominence relations in English noun-noun constructs. Paper presented at the University of Edinburgh, May 23, 2006.
Ladd, D. Robert (1984) English compound stress. In Gibbon, Dafydd and Helmut Richter (eds.) Intonation, accent and rhythm. Berlin: Mouton de Gruyter, 253-266.
Lees, Robert B. (1963) The Grammar of English Nominalizations. The Hague: Mouton.
Levi, Judith N. (1978) The syntax and semantics of complex nominals. New York: Academic Press.
Liberman, Mark and Richard Sproat (1992) The stress and structure of modified noun phrases in English. In Sag, Ivan A. and Anna Szabolcsi (eds.) Lexical Matters. Stanford: Center for the Study of Language and Information, 131-181.
Marchand, Hans (1969) The Categories and Types of Present-day English Word-formation, 2nd edition. München: Beck.
Olsen, Susan (2000) Compounding and stress in English: A closer look at the boundary between morphology and syntax. Linguistische Berichte 181, 55-69.
Olsen, Susan (2001) Copulative compounds: a closer look at the interface between syntax and morphology. In Booij, Geert E. and Jaap van Marle (eds.) Yearbook of Morphology 2000. Dordrecht/Boston/London: Kluwer, 279-320.
Ostendorf, Mari, Patti Price, and Stefanie Shattuck-Hufnagel (1996) Boston University Radio Speech Corpus. Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Payne, John and Rodney Huddleston (2002) Nouns and noun phrases. In Huddleston, Rodney and Geoffrey K. Pullum, The Cambridge grammar of the English language. Cambridge: Cambridge University Press, 323-524.
Plag, Ingo (2006) The variability of compound stress in English: structural, semantic and analogical factors. English Language and Linguistics 10.1, 143-172.
Schmerling, Susan F. (1971) A stress mess. Studies in the Linguistic Sciences 1, 52-65.
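
Appendix: sketch of the nearest-neighbour step
The evaluation shown on the "How does TiMBL work?" slide can be illustrated with a few lines of Python. This is a minimal sketch and not TiMBL itself: it assumes an unweighted overlap distance (the number of mismatching features) and a simple majority vote among the instances at the smallest distance. The slides do not state which TiMBL metric or feature weighting was used, so both choices are assumptions, and the function names are hypothetical. The memory contains only the six instances listed on the slide.

```python
from collections import Counter

# Feature order: left member, right member, argument status,
# head suffix, semantic category. The instances and the placeholder
# labels ("semcat1" etc.) are copied from the slide.
MEMORY = [
    (("action", "painting", "noarg", "-ing", "semcat1"), "left"),
    (("finger", "painting", "noarg", "-ing", "semcat1"), "left"),
    (("wall", "painting", "arg", "-ing", "semcat1"), "left"),
    (("country", "party", "noarg", "nosuff", "semcat2"), "left"),
    (("cottage", "hospital", "noarg", "nosuff", "semcat3"), "right"),
    (("life", "work", "noarg", "nosuff", "semcat2"), "right"),
]


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ.

    Assumption: all features weigh the same (no feature weighting).
    """
    return sum(x != y for x, y in zip(a, b))


def predict_stress(features, memory=MEMORY):
    """Return (predicted stress, votes) from the set of nearest neighbours."""
    distances = [(overlap_distance(features, inst), label) for inst, label in memory]
    nearest = min(d for d, _ in distances)
    # All stored instances at the smallest distance vote with their stress label.
    votes = Counter(label for d, label in distances if d == nearest)
    return votes.most_common(1)[0][0], votes


prediction, votes = predict_stress(("oil", "painting", "noarg", "-ing", "semcat1"))
print(prediction, dict(votes))  # left {'left': 2}
```

For the input oil painting, the two nearest neighbours are action painting and finger painting (one mismatching feature each), which gives two votes for left stress and none for right stress, as on the slide. In this sketch a tied vote falls to whichever label was encountered first; a real model needs an explicit tie-breaking rule.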