Lilla Magyar
Generals Paper, MIT

Gemination in Hungarian loanword adaptation

Abstract

An interesting puzzle of Hungarian phonology is gemination in recent loanwords borrowed from English and German (and occasionally from French). A consonant following a short stressed vowel is often geminated in the loanword, even if the doubling has no orthographic reflex in the source word, i.e. the consonant in question is not spelt with a double letter there. Unlike similar phenomena in other languages (such as Japanese, Finnish or Telugu), this is a gradient phenomenon involving considerable inter- and intra-speaker variation, and seemingly nothing in native Hungarian phonotactics requires such a process, as singleton and geminate consonants are equally acceptable in all the contexts where loanword gemination occurs. Gemination depends on context: it never applies when the consonant is preceded by a long vowel, and it is most productive word-finally in monosyllables. Apart from the influence of context, the propensity of consonants to undergo gemination also varies by consonant class: voiceless consonants are more likely to be geminated in loanwords than voiced ones, which reflects universal hierarchies of geminate markedness. In this paper, we will show that loanword gemination may not be as mysterious as it seems at first sight: it is a process regulated by faithfulness (to vowel length in source words) and universal markedness (of geminates, which is an important factor in native Hungarian phonotactics as well). Finally, we will present a Maximum Entropy model to account for this phenomenon.

1 Introduction

Gemination in loanwords is a cross-linguistically widespread phenomenon: a singleton consonant in the source word is geminated in the loanword, even if the doubling does not have an orthographic reflex (that is, the geminated consonant is spelt with a single consonant letter in the source word).
The source languages from which these words are borrowed do not allow phonetic geminates. Some examples of gemination in loanwords are listed below:

• Japanese: [kat:o] ‘cut’ (Kubozono et al. (2008))
• Italian: [fan:] ‘fan’ (Passino (2004))
• Kannada: [kap:u] ‘cup’ (Sridhar (1990))
• Telugu: [ro:d:u] ‘road’ (Krishnamurti and Gwynn (1985))
• Finnish: [pop:i] ‘pop’ (Karvonen (2009))
• Hungarian: [Sok:] ‘shock’ (Nádasdy (1989), Kertész (2006))

In Japanese, Kannada, Telugu and Finnish, the final consonant of the source word is geminated, and a vowel is inserted after it. Loanword gemination in Italian and Hungarian is slightly different. In Italian, the final consonant of monosyllabic words is geminated. In Hungarian, gemination is found in several contexts, but it is most common in word-final position in monosyllabic words.

2 Loanword gemination in Hungarian

In ‘recent’ (from 1750 onwards) Hungarian loanwords borrowed from English, German and French, short consonants following a short (usually stressed) vowel in the source word are regularly geminated in the loanword, even if the source word is not spelt with a double consonant letter (Nádasdy (1989), Kertész (2006)).

I am grateful to Adam Albright, Donca Steriade, Edward Flemming, Michael Kenstowicz, Martin Hackl, Miklós Törkenczy, Szilárd Szentgyörgyi, Sudheer Kolachina, Csaba Oravecz, Katalin Mády, Tekla Etelka Gráczi, audiences at Phonology Circle at MIT and the 22nd Manchester Phonology Meeting, as well as all the participants in my experiments. All remaining errors and inaccuracies are attributed to me alone.

2.1 Gemination by orthographic influence

Gemination in Hungarian loanwords can occur regardless of the influence of orthography. Some of the source words contain a double consonant letter in the spelling, while others are spelt with a single letter.
Some of the following examples were also cited by Nádasdy (1989).¹

2.1.1 Orthographic geminate in the source word

In the words shown in Table 1, the consonants geminated in the loanwords are spelt with a double consonant letter in the source words.²

Source word | Loanword | IPA | Gloss
Hall (G), hall (E) | hall | [hOl:] | ‘hall’
Koffer (G) | koffer | [kof:Er] | ‘suitcase’
hobby (E) | hobbi | [hob:i] | ‘hobby’
hippy (E) | hippi | [hip:i] | ‘hippy’
bluff (E) | blöff | [bløf:] | ‘bluff’
Stopper (G) | stopper | [Stop:Er] | ‘stopwatch’
nett (G) | nett | [nEt:] | ‘neat and tidy’
babysitter (E) | bébiszitter | [be:bisit:Er] | ‘babysitter’
dollar (E) | dollár | [dol:a:r] | ‘dollar’
shopping (E) | shopping | [Sop:iNg] | ‘shopping’
masseur (F) | masszőr | [mOs:ør] | ‘masseur’
roller (E) | roller | [rol:Er] | ‘roller’
rapper (E) | rapper | [rEp:Er] | ‘rapper’
Presser (G) | Presszer | [prEs:Er] | <family name>
Ritter (G) | Ritter | [rit:Er] | <family name>
Popper (G) | Popper | [pop:Er] | <family name>
Gösser (G) | Gösser | [gøs:Er] | <beer brand>
Table 1: Double consonant letters in the source word

¹ A list of loanwords and foreign names borrowed into Hungarian which potentially or actually undergo loan gemination is found in Appendix IV. It is my own collection, which also contains words mentioned in Nádasdy (1989), with examples ranging from older, more established loanwords to very recent loans, foreign names and brands.
² In many cases, the meaning of the source word and that of the loanword are not exactly the same. Here, ‘gloss’ represents the meaning of the loanword as it is used in Hungarian.

2.1.2 No orthographic geminate in the source word

In the words shown in Table 2, the source word is spelt with either a single consonant letter or a digraph. The % sign means that there is inter-speaker variation: both forms (ending in a singleton and in a geminate) are acceptable.

Source word | Loanword | IPA | Gloss
fit (E) | fitt | [fit:] | ‘fit’
set (E) | szett | [sEt:] | ‘outfit’
cheque (F) | csekk | [tSEk:] | ‘cheque’
tip (E) | tipp | [tip:] | ‘tip, idea’
Wecker (G) | vekker | [vEk:Er] | ‘alarm clock’
clip (E) | klip | [klip:] | ‘video clip’
frisch (G) | friss | [friS:] | ‘fresh’
fesch (G) | fess | [fES:] | ‘handsome’
Kitsch (G) | giccs | [git:S] | ‘kitsch’
club (E) / Klub (G) | klub | % [klub:] | ‘club’
Jam (E / G) | dzsem | % [dZEm:] | ‘marmalade’
chip (E) | chip | [tSip:] | ‘chip’
step (E) | sztep | [stEp:] | ‘step dance’
Witz (G) | vicc | [vit:s] | ‘joke’
choc (F) | sokk | [Sok:] | ‘shock’
Putsch (G) | puccs | [put:S] | ‘coup’
chic (F) | sikk | [Sik:] | ‘stylishness’
Table 2: Single consonant letter or digraph spelling in the source word

2.2 Gemination by position in the word

Gemination in loanwords also depends on position within the word. It is very frequent in word-final position in monosyllables, regardless of spelling, and fairly common in intervocalic position in polysyllables, as long as the consonant in question is spelt with a double letter or a digraph.³ Word-final consonants in polysyllabic words are geminated quite frequently when the consonant in question is spelt with a double letter or a digraph, but gemination of consonants spelt with a single letter in the same context is extremely rare. Gemination in loanwords never occurs after long vowels (in the source word or the loanword). These contextual restrictions are shown in Table 3.

Context | Double letter | Digraph | Single letter
Monosyll., VC# | frequent | frequent | frequent
Polysyll., VCV | frequent | less frequent | rare
Polysyll., VC# | frequent | less frequent | rare
After long V | never | never | never
Table 3: Gemination by position

³ Interestingly, in the case of digraphs, gemination happens almost only in words ending in -er. This might be due to a paradigm uniformity effect (stem + -er), as native speakers of Hungarian seem to treat -er as a suffix. However, something other than paradigm uniformity may be at work: -er suffixation is a fairly common recent process among younger speakers of Hungarian, and the consonant preceding the suffix is regularly geminated (e.g. from the word nyugdíjas [ñugdi:jOS] ‘retired person’ it is possible to coin nyugger [ñug:Er] ‘(an obnoxious or opinionated) retired person’).

2.3 Gemination by consonant class

There are further restrictions on gemination in loanwords which interact with consonant classes and individual consonants. Some consonant classes and individual consonants are more likely to undergo gemination than others.

2.3.1 Voiceless stops

Voiceless stops undergo gemination in loanwords regularly. In word-final position in monosyllables, all voiceless stops are geminated very regularly, even if the geminated consonant is spelt with a single letter or a digraph in the source word. The consonants [p] and [t] are geminated in intervocalic position following a stressed vowel primarily when there is a double letter in the source word spelling, whereas [k] is geminated regularly when there is a digraph in the spelling (but only when the word ends in -er). In word-final position in polysyllables, [p] is very rarely geminated, whereas gemination is more frequent in the case of [t] and [k]. This is summarised in Table 4.

Context | [p] | [t] | [k]
Monosyll., VC# | more frequent following some vowels than others: e.g. tip [tip:] vs. top [top] | e.g. set [sEt:], fit [fit:], suite [svit:], Brit [brit:] | e.g. choc [Sok:], chic [Sik:], rock [rok:], Jacques [ZOk:]
Polysyll., VCV | mostly double letter spelling: e.g. hippy [hip:i], rapper [rEp:Er]; one counterexample: doping [dop:i:Ng] (compensatory lengthening) | mostly double letter spelling: e.g. setter [sEt:Er], Betty [bEt:i]; BUT: sweater [svEt:Er] | e.g. Wecker [vEk:Er], rocker [rok:Er], Black and Decker [blEk:EndEk:Er]
Polysyll., VC# | very rare: Galopp [gOlop:] | e.g. cricket [krikEt:], ballet [bOlEt:], toilet [toOlEt:] | e.g. baroque [bOrok:]
Table 4: Gemination of voiceless stops in loanwords

2.3.2 Voiceless affricates

Just like voiceless stops, voiceless affricates are very often geminated in loanwords, both in word-final position in monosyllabic words and in intervocalic position. There are no recent loanwords which contain affricates in word-final position in polysyllabic words. This is summarised in Table 5.

Context | [ts] | [tS]
Monosyll., VC# | e.g. Witz [vit:s], spritz(en) [Sprit:s], Hetz [hEt:s] | e.g. match [mEt:S], touch [tOt:S], Putsch [put:S]
Polysyll., VCV | e.g. Sitzer [zit:sEr], Kratzer [krOt:zEr] | not many data: e.g. Gletscher
Polysyll., VC# | no data | no data
Table 5: Gemination of voiceless affricates in loanwords

2.3.3 Voiceless fricatives

Geminated voiceless fricatives are also common in loanwords. In word-final position in monosyllables, [f] is only geminated when it is spelt with a double consonant letter, while [S], [s] and [x] are frequently geminated even if they are spelt with a single consonant letter or a digraph. In intervocalic position, [f] is only geminated when the source word is spelt with a double consonant letter or when the consonant is preceded by a stressed vowel in a source language which has post-tonic lengthening, and even in these cases, it can be pronounced short. Intervocalic [S] does not seem to undergo gemination in loanwords, but we do not have much data. Intervocalic [s] only seems to geminate when the source word contains a double consonant letter in the spelling, while [x] is always spelt with ch in source words and is very frequently geminated. There are no data on word-final voiceless fricatives in polysyllabic words. This is summarised in Table 6 below.

Context | [f] | [S] | [s] | [x]
Monosyll., VC# | double letter spelling: e.g. Treff [trEf:], bluff [bløf:] | e.g. couche [kuS:], plush [plyS:], Bush [buS:] | e.g. plus [plus:], stressz [StrEs:] | Krach [krOx:], Pech [pEx:], Bach [bOx:]
Polysyll., VCV | double letter or post-tonic lengthening in SL: e.g. Koffer % [kof:Er], mafia % [mOf:iO] | not many data, no gemination: Fischer [fiSEr] | double letter in spelling: dessert [dEs:Ert], Presser [prEs:Er] | only names ending in -er: e.g. Pacher [pOx:Er]
Polysyll., VC# | no data | no data | no data | no data
Table 6: Gemination of voiceless fricatives in loanwords

2.3.4 Voiced stops

Voiced stops hardly ever undergo gemination in loanwords. There is one example of a monosyllable ending in [b] spelt with a single consonant letter in the source word (as well as in the loanword) which is pronounced with a geminate by most speakers. We can find a few examples of gemination in intervocalic position, but in all of these cases, the consonant is spelt with a double consonant letter in the source word. There are no data on word-final voiced stops in polysyllabic words. This is shown in Table 7.

Context | [b] | [d] | [g]
Monosyll., VC# | rare: e.g. club [klub:] | no evidence | no evidence
Polysyll., VCV | rare, only double letter: e.g. hobby [hob:i] | rare, only double letter: Yiddish [jid:iS] | no data
Polysyll., VC# | no data | no data | no data
Table 7: Gemination of voiced stops in loanwords

2.3.5 Voiced fricatives

Table 8 shows that there are hardly any recent loans from English, German and French which have a short vowel followed by a voiced fricative. There is only one example, in which the word-final consonant is geminated and devoiced by most (especially older or more conservative) speakers, but several younger speakers do pronounce it with a long [z].

Context | [v] | [z] | [Z]
Monosyll., VC# | no data | one example: jazz [dZEs:], but for many younger speakers: [dZEz:] | no data
Polysyll., VCV | no data | no data | no data
Polysyll., VC# | no data | no data | no data
Table 8: Gemination of voiced fricatives in loanwords

2.3.6 Nasals

Nasals undergo gemination in loanwords fairly rarely. Occasionally, in word-final position, both [m] and [n] can be pronounced long even if the consonant is spelt with a single letter in the source word.
There are not many examples of intervocalic [m], and it seems to be geminated only when there is an orthographic geminate in the source word. There are no source words in which [n] is spelt with a single consonant letter and is pronounced long in the loanword, and even when there is an orthographic geminate in the source word, [n] can be pronounced short in the loanword. There are no examples which contain word-final [m] and [n] in polysyllabic words. This is summarised in Table 9.

Context | [m] | [n]
Monosyll., VC# | e.g. jam [dZEm:], slam [slEm:] | e.g. gin [dZin:]
Polysyll., VCV | not many examples: e.g. shimmy [Sim:i] | no evidence (kennel [kEn:El] or [kEnEl])
Polysyll., VC# | no data | no data
Table 9: Gemination of nasals in loanwords

2.3.7 Liquids

The consonant [l] undergoes gemination quite often, but all the examples contain orthographic geminates. The consonant [r] does not lengthen in loanwords unless there is a double consonant letter in the source word spelling. This is shown in Table 10.

Context | [l] | [r]
Monosyll., VC# | double letter in spelling: Hall / hall [hOl:] | no evidence
Polysyll., VCV | double letter in spelling: e.g. dollar [dol:a:r] | possible in the case of double letter: Harry [hEr:i]
Polysyll., VC# | e.g. model [modEl:], cartel [kOrtEl:] | no data
Table 10: Gemination of liquids in loanwords

2.4 Gemination in monosyllabic words without orthographic reflex in the source word

In more recent loanwords, gemination is not productive in intervocalic position unless the source word is spelt with a double consonant letter. However, it is still fairly productive in monosyllabic loanwords, even without orthographic reflex. Since gemination in monosyllables spelt with single consonant letters is the most widespread and the most intriguing issue in connection with the whole phenomenon, discussion in this paper will be restricted to these cases.
The following table shows the hierarchy of gemination in monosyllabic loans without orthographic reflex:

Consonant class | Gemination in loans
Voiceless stops | widespread
Voiceless affricates | widespread
Voiceless fricatives | frequent
Nasals | some
Voiced stops | one
Liquids | none
Voiced fricatives | none
Table 11: The extent of gemination in loanwords by consonant class (without orthographic reflex in the source word)

The hierarchy of gemination in loanwords reflects a universal markedness hierarchy: voiceless consonants are more likely to be geminated than voiced ones, obstruents are less marked geminates than sonorants, etc. (see Podesva (2002), Kawahara (2007), and Steriade (2004)).

3 The puzzle and possible hypotheses

3.1 The puzzle

3.1.1 The influence of orthography

At first blush, given the abundance of loans with a double consonant letter in the source word, one might assume that gemination is merely an application of an orthographic rule to pronunciation. Since the spelling of geminates and singletons is almost entirely phonetic in Hungarian, source language spelling rules are often carried over into the borrowing language as rules of pronunciation. This approach, however, would be too simplistic and would not be able to account for many issues associated with this phenomenon. To be able to claim that orthography has a primary role in loanword gemination, we would have to assume that speakers first encountered loanwords in written form. This may be true in the case of technical terms or words related to highbrow culture, but many other words - especially those which have become part of low-colloquial vocabulary - have been presented to native speakers mostly in spoken form.
Furthermore, as was pointed out by Nádasdy (1989), it would be rather strange to assume that native speakers of Hungarian are particularly faithful to the spelling pronunciation of geminates when there is evidence that they are, in fact, fairly conversant with the reading rules of foreign languages. In addition to the facts discussed above, there is another good reason to claim that gemination in loanwords is not solely a reflection of spelling: a considerable number of borrowings which contain geminates have a single consonant letter in the source word spelling, for example dzsem [dZEm:] ‘jam’, szett [sEt:] ‘set, outfit’, szlemm [slEm:] ‘slam’, etc. (Nádasdy (1989)).

3.1.2 Gemination as a rule of Hungarian phonotactics

Hungarian distinguishes between singleton consonants and geminates. In principle, all consonants can be geminated in native Hungarian phonology. However, most geminates are created through morphological (or morphophonological) processes. Both singleton consonants and geminates are equally possible in intervocalic and word-final position following short stressed vowels, which indicates that gemination in loanwords in the same positions is not required by native Hungarian phonotactics. The following examples are taken from Nádasdy (1989):

Word | Gloss | Word | Gloss
beteg [bEtEg] | ‘ill’ | retteg [rEt:Eg] | ‘he is scared’
kosz [kos] | ‘dirt’ | rossz [ros:] | ‘bad’
vice [vitsE] | ‘janitor’ | vicce [vit:sE] | ‘his / her joke’
Table 12: Singletons and geminates in the same context

However, there are phenomena similar to loanword gemination in native Hungarian phonology as well. In West-Hungarian dialects and in colloquial speech, consonants can be geminated in the same contexts as in loanwords (for example, köpeny [køp:Eñ] ‘gown’, szalag [sOl:Og] ‘ribbon’, csat [tSOt:] ‘buckle’) (Nádasdy (1989)).
Furthermore, there are sporadic cases of consonant lengthening which is not reflected in the spelling of old Finno-Ugric words (for example, lesz [lEs:] ‘will be’, kisebb [kiS:Eb:] ‘smaller’, egy [EJ:] ‘one’) (Nádasdy (1989)). Another interesting fact is that a consonant very often lengthened by speakers of such dialects is [l] in intervocalic position (e.g. elem [ElEm] % [El:Em] ‘battery’ - Tekla Etelka Gráczi, p.c.), which is hardly ever geminated in a loanword when the source word is spelt with a single letter.

3.1.3 Speakers’ awareness of universal markedness hierarchies

Since gemination in loanwords appears to reflect some sort of universal markedness hierarchy, the question arises whether native speakers of Hungarian are aware of such markedness constraints on gemination and whether geminate markedness hierarchies are present even in languages which allow all kinds of geminates. For example, are consonants which are cross-linguistically more marked as geminates also less likely to be geminated in Hungarian than those which are less marked geminates in most languages? If so, do native speakers have intuitions about markedness? Can they, for example, decide which nonce word is a more well-formed native Hungarian word or Hungarianised loanword by relying on their intuitions about geminate markedness? Are there any other phenomena in the native language - such as a subphonemic process like post-tonic lengthening - which reflect universal geminate markedness hierarchies?

3.1.4 Gemination as a strategy to preserve source vowel length?

In many languages, vowels are shorter when they are followed by a geminate consonant. However, we do not know much about native speakers’ perception of foreign vowels compared to their Hungarianised counterparts in loanwords. Do speakers hear vowels in source words as shorter than their substitute vowels in loanwords?
If the vowels in the source word preceding the consonant (which is geminated in the loanword) are perceived as shorter than their loan counterparts by native speakers of Hungarian, gemination can be seen as a way of preserving source vowel length in the loanword.

3.2 Possible hypotheses

Based on the discussion above, we can formulate the following hypotheses, which will be tested and discussed later in this paper.

3.2.1 Hypothesis 1

Geminate-to-singleton ratios in the lexicon (that is, the type frequency distribution of geminate and singleton forms of each consonant) line up with the universal markedness hierarchy: cross-linguistically marked geminates are less common even in a language which allows all kinds of geminates, and native speakers’ judgements also reflect those patterns.

3.2.2 Hypothesis 2

There is a native subphonemic process - for example, post-tonic lengthening - which is very similar to gemination in loanwords in that it involves consonant lengthening and occurs following short stressed vowels. This lengthening hierarchy also lines up with universal geminate markedness.

3.2.3 Hypothesis 3

Patterns of universal geminate markedness are learnable from the native Hungarian lexicon.

3.2.4 Hypothesis 4

Native speakers perceive foreign short vowels as shorter in closed syllables than Hungarian vowels in the same context. Gemination is a way of preserving source (foreign) vowel length (shortness).

4 Testing Hypothesis 1: Do patterns of universal geminate markedness show up in the Hungarian lexicon and in native speakers’ judgements?

In order to test whether the distribution of singletons and geminates in the lexicon by consonant class reflects universal hierarchies of geminate markedness and, if so, whether these patterns show up in native speakers’ judgements, we carried out a corpus study and conducted a wug well-formedness judgement experiment.

4.1 Corpus study

All monosyllabic words in which a short stressed vowel is followed by a singleton or a geminate consonant were extracted from the Hungarian Webcorpus on the Szószablya Project Website (Halácsy et al. (2004)). Table 13 contains counts and percentages for the distribution of geminate and singleton consonants following short vowels in monosyllables. For example, there are 66 monosyllabic words ending in a short vowel + short voiceless stop sequence and 63 ending in a short vowel + geminate voiceless stop sequence, which means that at the end of monosyllabic words voiceless stops occur as singletons in 51% and as geminates in 49% of the cases.

Geminate-to-singleton ratios are shown in Table 14 in descending order. The ratio reported for a consonant class is the share of geminate-final forms among all monosyllables ending in a short vowel + consonant of that class (e.g. 63 / (66 + 63) = 0.49 for voiceless stops): the higher the percentage of geminates compared to the percentage of singletons in this context, the higher the geminate to singleton ratio.

Consonant class | Singleton | % | Geminate | %
Voiceless stops | 66 | 51% | 63 | 49%
Voiced stops | 51 | 65% | 28 | 35%
Voiceless fricatives | 35 | 54% | 30 | 46%
Voiced fricatives | 11 | 92% | 1 | 8%
Voiceless affricates | 14 | 35% | 26 | 65%
Nasals | 35 | 76% | 11 | 24%
Liquids | 51 | 72% | 20 | 28%
Table 13: Distribution of singletons and geminates by consonant class in Hungarian words (suffixed forms and widely used loanwords included)

Consonant class | Geminate to singleton ratio
Voiceless affricates | 0.65
Voiceless stops | 0.49
Voiceless fricatives | 0.46
Voiced stops | 0.35
Liquids | 0.28
Nasals | 0.24
Voiced fricatives | 0.08
Table 14: Geminate to singleton ratio by consonant class (descending order)

As the above tables show, voiceless consonants tend to have higher geminate to singleton ratios than their voiced counterparts. Thus, voiceless consonants occur as geminates more frequently than voiced ones, which seems to reflect universal hierarchies of geminate markedness.
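
As a rough illustration of how the figures in Table 14 follow from the counts in Table 13, the sketch below recomputes each class's value as geminate / (singleton + geminate) and sorts the classes in descending order. The counts are copied from Table 13; the names `TABLE_13`, `geminate_share` and `ranked_shares` are made up for this example, and reading the "geminate to singleton ratio" as a share of all forms is an assumption that happens to match the published values.

```python
# Counts of monosyllables ending in short vowel + singleton / geminate,
# copied from Table 13: class -> (singleton count, geminate count).
TABLE_13 = {
    "Voiceless stops": (66, 63),
    "Voiced stops": (51, 28),
    "Voiceless fricatives": (35, 30),
    "Voiced fricatives": (11, 1),
    "Voiceless affricates": (14, 26),
    "Nasals": (35, 11),
    "Liquids": (51, 20),
}


def geminate_share(singleton: int, geminate: int) -> float:
    """Share of geminate-final forms among all forms (assumed definition)."""
    return geminate / (singleton + geminate)


def ranked_shares(counts):
    """Return (class, share rounded to 2 decimals) pairs, highest share first."""
    shares = {
        cls: round(geminate_share(single, gem), 2)
        for cls, (single, gem) in counts.items()
    }
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for cls, share in ranked_shares(TABLE_13):
        print(f"{cls:22s} {share:.2f}")
```

Keeping the raw counts alongside the derived shares makes it easy to check a single class by hand, for instance 26 / (14 + 26) for voiceless affricates.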

4.2 Wug test

4.2.1 Testing material

The test contains 236 words (118 word pairs). All words are nonce monosyllables ending in a short vowel + short consonant or geminate sequence. Word pairs are minimal pairs: a word ending in a particular short vowel + short consonant and another word ending in the same vowel + the geminated form of the same consonant (e.g. mok [mok] - mokk [mok:]). Filler items include nonce word pairs with a short and a long vowel. Test items are listed in Appendix I.

4.2.2 Participants

115 native speakers of Hungarian participated in the experiment. They grew up in Hungary, are currently living there and were recruited through Facebook and various mailing lists. All the participants volunteered to take the test for free. The test was administered online and participants remained anonymous.

4.2.3 Task

Participants were presented with a list of target and filler items (all in word pairs) and were asked to decide which of the two nonce words they considered more well-formed as a native Hungarian word or a Hungarianised loanword.

4.2.4 Results

The summary of results averaged by consonant class is presented in Table 15 and Table 16.

Consonant class | Singleton | % | Geminate | %
Voiceless stops | 1037 | 47% | 1148 | 53%
Voiced stops | 1426 | 65% | 761 | 35%
Voiceless fricatives | 1126 | 47% | 1289 | 53%
Voiced fricatives | 2035 | 85% | 371 | 15%
Voiceless affricates | 703 | 44% | 909 | 56%
Nasals | 636 | 61% | 399 | 39%
Liquids | 767 | 48% | 845 | 52%
Table 15: Well-formedness judgements of forms ending in singletons and geminates

Consonant class | Geminate to singleton ratio
Voiceless affricates | 0.56
Voiceless stops | 0.53
Voiceless fricatives | 0.53
Liquids | 0.52
Nasals | 0.39
Voiced stops | 0.35
Voiced fricatives | 0.15
Table 16: Geminate to singleton ratio by consonant class (descending order)

The first column of Table 15 shows which consonants the nonce words end in.
The second and the third column contain the number and the percentage of responses (respectively) in which the stem ending in a singleton consonant was judged more well-formed than the one ending in a geminate. The number and the percentage of responses preferring forms ending in geminate consonants are shown in columns four and five. The geminate to singleton ratios in Table 16 show to what extent geminates were preferred over singletons. In cases where speakers generally preferred geminated forms to the ones ending in singleton consonants, the geminate to singleton ratio is higher than 0.50. In general, what the table shows is that monosyllabic nonce words ending in voiceless obstruent geminates are more acceptable to native speakers than stems ending in sonorant geminates, except for liquids. Results by each vowel and consonant sequence will be compared with results of the phonotactic learning experiment in the next subsection.

4.3 Correlation between geminate-to-singleton ratios in the corpus and based on native speakers’ judgements

In the previous subsection, we have seen that geminate-to-singleton ratios - calculated from type frequency distributions found in a corpus and from native speakers’ well-formedness judgements on nonce words - line up with hierarchies of universal geminate markedness. Although there are some differences between patterns in the corpus and in people’s judgements, the cross-linguistically most frequent geminates (voiceless obstruents) are ranked equally high and the least frequent ones (voiced fricatives) are ranked lowest in both hierarchies. There is a strong positive correlation (r = 0.8271) between geminate-to-singleton ratios based on type frequency distributions in the corpus and those based on native speakers’ well-formedness judgements, which is illustrated in Figure 1.
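
The correlation between the two hierarchies can be checked directly from the published ratios. The sketch below pairs the seven class-level values of Table 14 (corpus) with those of Table 16 (wug judgements) and computes a plain Pearson coefficient; the dictionary and function names are hypothetical, and treating the seven rounded class values as the input is an assumption about how the coefficient was obtained. With these inputs the coefficient comes out at roughly 0.83.

```python
import math

# Class-level geminate to singleton ratios copied from Table 14 (corpus)
# and Table 16 (wug well-formedness judgements).
CORPUS_RATIO = {
    "Voiceless affricates": 0.65,
    "Voiceless stops": 0.49,
    "Voiceless fricatives": 0.46,
    "Voiced stops": 0.35,
    "Liquids": 0.28,
    "Nasals": 0.24,
    "Voiced fricatives": 0.08,
}
WUG_RATIO = {
    "Voiceless affricates": 0.56,
    "Voiceless stops": 0.53,
    "Voiceless fricatives": 0.53,
    "Liquids": 0.52,
    "Nasals": 0.39,
    "Voiced stops": 0.35,
    "Voiced fricatives": 0.15,
}


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Pair the two tables by consonant class before correlating.
classes = sorted(CORPUS_RATIO)
r = pearson_r(
    [CORPUS_RATIO[c] for c in classes],
    [WUG_RATIO[c] for c in classes],
)

if __name__ == "__main__":
    print(f"Pearson r over {len(classes)} consonant classes: {r:.4f}")
```

Pairing the values by class name rather than by table position matters here, because Tables 14 and 16 list the classes in different orders.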

Figure 1: Correlation between geminate-to-singleton ratios in the corpus and in wug well-formedness judgements

5 Testing Hypothesis 2: Are there any other phenomena which show patterns of universal geminate markedness?

The goal of the studies described in the previous sections was to find out whether universal geminate markedness hierarchies are prevalent in a language which allows all consonants to be geminated and, if so, whether native speakers are aware of such hierarchies and - consciously or unknowingly - apply them when they are asked to give well-formedness judgements on nonce monosyllables ending in short vowel + short consonant / geminate sequences. The tentative answer to the question is yes, since cross-linguistically less marked geminates are more frequent in native Hungarian phonology as well. Furthermore, speakers seem to be able to replicate these hierarchies fairly well in wug well-formedness judgement tasks. In this section, we intend to explore the possibility of other phenomena which might reflect universal markedness hierarchies of consonant lengthening. The particular phenomenon (and the (non-)existence thereof) to be investigated here is a subphonemic process referred to as post-tonic lengthening. The reason why this phenomenon was chosen is that it is found in languages in which loanword gemination works similarly to what is observed in Hungarian loanword phonology (cf. Farnetani and Kori (1986), Passino (2004)).

5.1 Speech material

The speech material consists of 76 native Hungarian words, which can be divided into 19 groups (which is the number of consonants possibly undergoing loanword gemination).
Each group contains an existing Hungarian word with a specific consonant in:

• intervocalic position, following a stressed vowel
• intervocalic position, following an unstressed vowel
• word-final position, following a stressed vowel
• word-final position, following an unstressed vowel

Examples are shown in Table 17:

Position | [p] | [t] | [f]
V(str)CV | kapar ‘to scratch’ | retek ‘radish’ | röfög ‘grunt-3SG’
V(unstr)CV | alapos ‘meticulous’ | szeretek ‘love-1SG’ | leböfög ‘burp at sb.-3SG’
V(str)C# | pap ‘priest’ | vet ‘to sow’ | döf ‘stab-3SG’
V(unstr)C# | alap ‘foundation’ | szeret ‘love-3SG’ | ledöf ‘stab to death-3SG’
Table 17: Examples of target items

Each word was placed in a sentence frame like ‘X is a rather Y Z.’ (for example, ‘The sloth is a rather interesting animal.’). Two lists of sentences were created: one contains the sentences in a pseudo-random order and the other one in the reverse order. Two filler sentences were included at the beginning and at the end of the list, in order to exclude effects of being the first and last sentence on the list (since speakers tend to read out the first and the last sentence on a list with different prosody).

5.2 Subjects

Six (four female and two male) native speakers of Educated Colloquial Hungarian participated in the experiment. They are between the ages of 20 and 48, currently live in Hungary and have spent most of their lives in their home country (none of them has lived abroad for more than two years). They volunteered to participate in the experiment for free and signed a consent form.

5.3 Recording procedure

Participants were asked to read out the two lists of sentences fluently, neither too slowly nor too quickly, with natural intonation. Before the experiment, they had a short training session so that they could familiarise themselves with the reading material. The speech material was recorded in a soundproof room in the building of the University of Pannonia Broadcasting Company in Veszprém, Hungary.
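
The list construction described in Section 5.1 (one pseudo-random order, one reversed order, fillers at both edges) can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_reading_lists`, the seeded shuffle standing in for the pseudo-randomisation, and the placeholder sentences are all hypothetical, and the number of fillers at each edge is left as a parameter because the text does not spell it out.

```python
import random


def build_reading_lists(targets, fillers_start, fillers_end, seed=0):
    """Build two reading lists from the same target sentences.

    List A holds the targets in a seeded pseudo-random order; list B holds
    the same targets in exactly the reverse order. Both lists are wrapped in
    the same filler sentences so that no target is read first or last.
    """
    rng = random.Random(seed)  # fixed seed keeps the order reproducible
    shuffled = list(targets)
    rng.shuffle(shuffled)
    list_a = list(fillers_start) + shuffled + list(fillers_end)
    list_b = list(fillers_start) + shuffled[::-1] + list(fillers_end)
    return list_a, list_b


if __name__ == "__main__":
    # Placeholder sentences; the real carrier sentences are not reproduced here.
    targets = [f"target sentence {i}" for i in range(1, 77)]
    list_a, list_b = build_reading_lists(
        targets,
        fillers_start=["filler A", "filler B"],
        fillers_end=["filler C", "filler D"],
        seed=42,
    )
    print(len(list_a), len(list_b))
    print(list_a[2], "|", list_b[-3])
```

Reversing the shuffled order for the second list means every target occurs once early and once late across the two readings, which balances position-in-list effects.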
5.4 Measurement procedure and criteria

The duration of each consonant was measured in Praat (Boersma and Weenink (2012)), an open-source speech analysis program.

5.5 Analysis

76 x 2 x 6 = 912 data points (76 words, two lists, six speakers) were subject to analysis. Durations of the consonants in the various positions were compared and analysed. Linear mixed effects models were fitted to the duration data using the lme4 package (Bates et al. (2011)) in R (R Core Team (2013)), and some paired t-tests were used to make additional comparisons. Duration was the dependent variable; the fixed effects were place of articulation, position (VCV vs. VC#) and stress (stressed vs. unstressed). By-speaker random effects for place of articulation, position and stress were also included in the model.

5.6 Results for consonants in word-final position

Since ‘genuine’ loanword gemination with no orthographic reflex is found almost exclusively in monosyllables word-finally, following a short vowel, the presentation of the results in this paper is restricted to those cases. Results are plotted for each consonant and grouped by consonant class. Duration data are given in seconds.

5.6.1 Voiceless stops

Voiceless stops show significant effects of post-tonic lengthening (t = -6.88, baseline: stressed), as shown in Figure 2.

Figure 2: Voiceless stops in word-final position

5.6.2 Voiced stops

Voiced stops do not lengthen significantly following a short stressed vowel (t = -1.611, baseline: stressed). This is illustrated in Figure 3.

Figure 3: Voiced stops in word-final position

5.6.3 Voiceless fricatives

Voiceless fricatives are significantly longer in final position following short stressed vowels (t = -4.154, baseline: stressed), as shown in Figure 4.
Figure 4: Voiceless fricatives in word-final position

5.6.4 Voiced fricatives

As we can see in Figure 5, voiced fricatives show significant effects of post-tonic lengthening (t = -7.204, baseline: stressed).

Figure 5: Voiced fricatives in word-final position

5.6.5 Voiceless affricates

Voiceless affricates are considerably longer following short stressed vowels (t = -5.354, baseline: stressed). This is shown in Figure 6.

Figure 6: Voiceless affricates in word-final position

5.6.6 Nasals

As Figure 7 demonstrates, nasals undergo significant post-tonic lengthening in word-final position (t = -3.801, baseline: stressed).

Figure 7: Nasals in word-final position

5.6.7 Liquids

Liquids do not show significant effects of post-tonic lengthening (t = 1.686, baseline: stressed), as we can see in Figure 8 below.

Figure 8: Liquids in word-final position

5.6.8 The hierarchy of post-tonic lengthening compared to hierarchies of gemination

Table 18 shows which consonant classes undergo significant post-tonic lengthening and which do not. It also shows the estimated amount of lengthening (in seconds, like the duration data) and the corresponding t-values.

Significant post-tonic lengthening
  Voiceless affricates    (est.: 0.042427, t=5.354)
  Voiced fricatives       (est.: 0.035669, t=7.204)
  Voiceless fricatives    (est.: 0.028971, t=4.154)
  Voiceless stops         (est.: 0.022527, t=6.88)
  Nasals                  (est.: 0.22288, t=3.801)

No significant post-tonic lengthening
  Voiced stops            (est.: 0.003102, t=1.611)
  Liquids                 (est.: 0.0019, t=1.686)

Table 18: Hierarchy of post-tonic lengthening by consonant class

The table showing the extent of gemination in loanwords by consonant class is repeated below as Table 19.
Consonant class         Gemination in loans
Voiceless stops         widespread
Voiceless affricates    widespread
Voiceless fricatives    frequent
Nasals                  some
Voiced stops            one
Liquids                 none
Voiced fricatives       none

Table 19: The extent of gemination in loanwords by consonant class (without orthographic reflex in the source word)

Geminate-to-singleton ratios established on the basis of type frequency distributions found in a corpus are repeated in Table 20.

Consonant class         Geminate-to-singleton ratio
Voiceless affricates    0.65
Voiceless stops         0.49
Voiceless fricatives    0.46
Voiced stops            0.35
Liquids                 0.28
Nasals                  0.24
Voiced fricatives       0.08

Table 20: Geminate-to-singleton ratios based on type frequencies in the corpus

Geminate-to-singleton ratios based on nonce well-formedness judgements by native speakers are repeated in Table 21.

Consonant class         Geminate-to-singleton ratio
Voiceless affricates    0.56
Voiceless stops         0.53
Voiceless fricatives    0.53
Liquids                 0.52
Nasals                  0.39
Voiced stops            0.35
Voiced fricatives       0.15

Table 21: Geminate-to-singleton ratios based on nonce well-formedness judgements

The hierarchies are very similar, but there is one crucial difference: voiced fricatives undergo post-tonic lengthening, but they are not affected by gemination in loanwords, are very rare as geminates in the native Hungarian lexicon and are judged as very marked geminates by native speakers. Furthermore, they are cross-linguistically marked as geminates. Their post-tonic lengthening therefore does not fit into the observed universal markedness hierarchy, for which there could be various reasons, e.g. cross-linguistic geminate markedness (Steriade (2004)), partial devoicing of word-final fricatives (Barkanyi and Graczi (2012)) or difficulty in the perception of length contrasts (Kawahara (2007)).

So, we can conclude that the phenomena in native Hungarian phonetics and phonology which are related to gemination or consonant lengthening largely line up with patterns of universal markedness.
6 Testing Hypothesis 3: Are patterns of universal markedness learnable from the native Hungarian lexicon?

We have seen in the previous sections that the phenomena (in native as well as in loanword phonology) related to gemination or consonant lengthening are regulated by universal markedness. Moreover, native speakers’ judgements also reflect those patterns. The goal of this section is to see whether the universally observed tendencies - which are observed even in native Hungarian phonology - are learnable without positing any handcrafted universal markedness constraints, that is, to find out whether the learner is able to make phonotactic generalisations based on data found in the corpus and make correct predictions based on them. Before proceeding with the analysis, we provide a short description of the learning model we used.

6.1 Some background on accounting for gradience and free variation

An often cited weakness of classic categorical Optimality Theory (OT) is its inability to account for grammars with gradience and free variation. Unfortunately, most OT learning algorithms (for example, Tesar and Smolensky (1993), Pulleyblank and Turkel (1996), Prince and Tesar (1999)) are not able to account for those phenomena, either. First, they are not designed to learn from noisy training data, or they have convergence problems when presented with such data. Second, they learn a single constraint ranking, therefore they cannot model grammars with gradience or free variation (more than one winner, different probabilities).

There have been several attempts to account for gradience and free variation within the narrow confines of categorical OT. These include floating constraints (Nagy and Reynolds (1997)), free ranking of constraints within strata (Anttila (1997)), strictness bands (Hayes (2000)) and factorial typology applied to a set of unranked constraints (Ringen and Heinämäki (1999), and Magyar (2009)).
The probabilistic model proposed by Boersma (1997) (which moves away from standard OT) uses the Gradual Learning Algorithm (GLA). Although it is successful at learning from noisy data, it is still unable to account for cumulativity effects. This weakness of the GLA was pointed out by Keller and Asudeh (2002). Keller had earlier proposed a model called Linear Optimality Theory (Keller (2000)). This model is able to account for cumulativity effects, but it can only learn from acceptability judgement data rather than from actual linguistic forms.

If we want an algorithm to learn gradient grammars, the ideal algorithm for our purposes should be able to learn from a corpus consisting of real (not idealised and thus potentially noisy) data, handle effects caused by cumulative constraint violations and generalise to novel data (examples not seen during training). Maximum entropy grammars generally meet all of these expectations.

The principle of maximum entropy builds on the notion of entropy introduced to information theory by Shannon (1948). Maxent models are log-linear models widely used in different fields, including natural language processing. The application of maximum entropy models to grammars has been discussed in Berger et al. (1996), Rosenfeld (1996), Della Pietra et al. (1997), Jelinek (1999), Manning and Schütze (1999), Eisner (2000), Eisner (2001), Keller (2002), Klein and Manning (2003), Goldwater and Johnson (2003), Jäger (2004) and Hayes and Wilson (2008), amongst many others. The advantage of maxent models over other frameworks is that they are able to account for both gradience / free variation and categorical phenomena, can handle cumulativity effects and are mathematically simpler than the GLA.

6.2 The model

In this learning experiment, we used the UCLA Phonotactic Learner developed by Bruce Hayes and Colin Wilson (Hayes and Wilson (2008)). The grammar created by the model consists of constraints that are weighted according to the principle of maximum entropy.
Test words are assessed by the grammar on the basis of the weighted sum of their constraint violations. The learning algorithm produces grammars that can capture both categorical and gradient phonotactic patterns. The algorithm is not provided with any constraints in advance, but uses its own resources to form constraints and weight them.

The basic idea in the application of maxent models to phonotactics is that well-formedness can be interpreted as probability. We suppose that there is an infinite set Ω consisting of all universally possible phonological surface forms. The maxent grammar assigns a probability to every member x of this set: this probability P(x) expresses its phonotactic well-formedness. Since Ω is an infinite (or at least very large) set, the probability of any single form will be extremely low. What is important here is the difference between the probabilities, which can be significant and meaningful.

A maxent grammar assigns probabilities by means of a set of constraints. Constraints in the Hayes and Wilson model are all markedness constraints in the sense of standard Optimality Theory. Each constraint has a weight, which is a nonnegative real number. Constraints are situated on a scale: the ones with higher weights are more powerful and can significantly lower the probability of forms that violate them. The score is defined in the following way:

Definition 1: Score
The score of phonological representation x, denoted as h(x), is

$h(x) = \sum_{i=1}^{N} w_i C_i(x)$

where $w_i$ is the weight of the ith constraint, $C_i(x)$ is the number of times that x violates the ith constraint, and $\sum_{i=1}^{N}$ denotes summation over all constraints ($C_1, C_2, ..., C_N$).

The maxent value of x is calculated as follows:

Definition 2: Maxent value
Given a phonological representation x and its score h(x) under a grammar, the maxent value of x, denoted P*(x), is

$P^*(x) = \exp(-h(x))$.
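As a small illustration of Definitions 1 and 2, the sketch below computes h(x) and P*(x) for a few forms. The weights, the form labels and the violation counts are invented for the example; they are not taken from the learner's output.

```python
import math

def score(weights, violations):
    """Definition 1: h(x) = sum over constraints of w_i * C_i(x)."""
    if len(weights) != len(violations):
        raise ValueError("one violation count is needed per constraint")
    return sum(w * c for w, c in zip(weights, violations))

def maxent_value(h):
    """Definition 2: P*(x) = exp(-h(x))."""
    return math.exp(-h)

# Hypothetical grammar with two constraints and three forms.
WEIGHTS = [1.5, 0.5]
FORMS = {
    "form_a": [0, 0],  # violates nothing
    "form_b": [0, 1],  # one violation of the weaker constraint
    "form_c": [1, 2],  # violations of both constraints add up
}

for name, violations in FORMS.items():
    h = score(WEIGHTS, violations)
    print(name, h, maxent_value(h))
```

A form without violations has a score of 0 and a maxent value of 1; every additional violation multiplies the maxent value by exp(-w) for the weight w of the violated constraint, which is how cumulative violations are captured.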
The probability of x is calculated by determining its share in the total maxent value of all possible forms in Ω, a quantity designated as Z. Probability is defined below:

Definition 3: Probability
Given a phonological representation x and its maxent value P*(x), the probability of x, denoted P(x), is

$P(x) = P^*(x) / Z$, where $Z = \sum_{y \in \Omega} P^*(y)$.

We have shown above how well-formedness (probability) is calculated from constraint violations and weights. Now we turn to learning. The core assumption of the Hayes and Wilson model is that the learner has access to a vast amount of data from the target language, which is a representative set of observed forms. However, it is not exposed to negative evidence; in other words, it is not told which forms are illegal. Hopefully, this is a plausible simulation of real language learning. The goal is to find a set of constraint weights which maximises the probability of the observed forms. Since total probability is fixed at 1, maximising the probability of the observed forms will automatically minimise the probability of the unattested forms.

This probabilistic concept of well-formedness relates well to the principle of maximum entropy. Entropy is a measure of the amount of surprise / randomness in the system, given by the following formula:

$-\sum_{x \in \Omega} P(x) \log P(x)$

According to a theorem proven by Della Pietra et al. (1997), if probability is defined as above (Definition 3), maximising entropy is in fact equivalent to maximising the probability of the observed forms, given the constraints.

If we assume that all the observed data are independent and identically distributed, then the probability of the observed data, P(D), is simply the product of the probabilities of the individual observed forms.
This is shown below:

Definition 4: Probability of the observed data under a given set of weights and constraints
Given a maxent grammar and a set D of observed data, the probability of D under the grammar is

$P(D) = \prod_{x \in D} P(x)$

where P(x) is as defined in Definition 3.

Now the learner has to find a set of weights that maximises P(D), which is a search problem. The search always begins by giving every constraint the same initial weight (which is 1 in the Hayes and Wilson model). The weights are then adjusted iteratively until P(D) is maximised. In fact, the search does not find the maximum of P(D) itself, but that of its natural logarithm. However, since the log function is monotonic, the weights that maximise log(P(D)) are the same weights that maximise P(D).

The observed violation count of each constraint is also calculated: this is done by summing the violations of the given constraint over all examples in the learning data. The calculation of the expected violation count, however, is much less straightforward: computing it exactly would require summing over the set of all possible phonological representations x ∈ Ω, which is an infinite set. So, first, we only give the formal definition of the expected violation count:

Definition 5: Expected violation count
Given a grammar that determines maxent values, the expected number of violations of constraint $C_i$ is

$E[C_i] = \sum_{x \in \Omega} P(x) C_i(x)$,

where P(x) is the probability of the representation x, $C_i(x)$ is the number of times that x violates $C_i$, and $\sum_{x \in \Omega}$ represents summation over all x in Ω.

Naturally, it is impossible to calculate the exact expected values. As a more practical solution, we can approximate the values by only looking at strings in Ω which are no longer than the longest string in the learning data D. This is still an extremely large set, but at least it is finite.
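To make Definitions 3 to 5 concrete, here is a minimal sketch on a deliberately tiny, finite stand-in for Ω. The three forms, their violation vectors and the observed data are all hypothetical; the actual learner works over a vastly larger set and does not enumerate it form by form.

```python
import math

# Hypothetical toy universe: each form is mapped to its violation vector
# (C_1(x), C_2(x)). The real Omega is infinite; this one has three members.
OMEGA = {"x1": (0, 0), "x2": (1, 0), "x3": (0, 1)}
# Hypothetical observed data D (forms may repeat).
DATA = ["x1", "x1", "x2"]

def probabilities(weights, omega):
    """Definition 3: P(x) = P*(x) / Z, with Z summed over all of omega."""
    star = {x: math.exp(-sum(w * c for w, c in zip(weights, v)))
            for x, v in omega.items()}
    z = sum(star.values())
    return {x: s / z for x, s in star.items()}

def log_prob_data(weights, omega, data):
    """Natural log of Definition 4: log P(D) = sum of log P(x) over D."""
    p = probabilities(weights, omega)
    return sum(math.log(p[x]) for x in data)

def observed_counts(omega, data):
    """Observed violations of each constraint, summed over the data."""
    n = len(next(iter(omega.values())))
    return [sum(omega[x][i] for x in data) for i in range(n)]

def expected_counts(weights, omega):
    """Definition 5: E[C_i] = sum over omega of P(x) * C_i(x)."""
    p = probabilities(weights, omega)
    n = len(next(iter(omega.values())))
    return [sum(p[x] * omega[x][i] for x in omega) for i in range(n)]

weights = [1.0, 1.0]  # every constraint starts with the same weight
print("log P(D):", log_prob_data(weights, OMEGA, DATA))
print("observed:", observed_counts(OMEGA, DATA))
# Definition 5 is a per-form expectation; scaling it by the number of
# observed forms makes it comparable with the observed totals.
print("expected:", [len(DATA) * e for e in expected_counts(weights, OMEGA)])
```

Working with log P(D) rather than P(D) turns the product of Definition 4 into a sum, which avoids numerical underflow when D is large.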
Hayes and Wilson used methods previously described in Ellison (1994), Eisner (1997), Albro (1998), Albro (2005) and Riggle (2004). This work has shown that properties of a very large set of strings can be computed if the set is represented as a finite state machine. The machine is constructed in the following way: first, each constraint is represented as a weighted finite state acceptor, then the constraints are combined into a single machine, which is a full grammar. Every path in this machine corresponds to a phonological representation together with a vector of its constraint violations. Expected violation counts are then obtained by summing over all paths through the machine with the help of a method also used by Eisner (2001) and Eisner (2002). This sum is rescaled according to the frequency of forms of the given length in the training data.

To sum up the procedure of constraint weighting, we can say that the process is basically an iterated hill-climbing search, which is designed to maximise the probability of the learning data (or rather, the logarithm of its probability - but it amounts to the same thing). At each stage, a local gradient is computed from the difference between the observed and the expected violation count of each constraint. Observed violation counts are calculated by examining the learning data, whereas the number of expected violations is approximated by the aforementioned finite state method.

As mentioned earlier, constraints found and used by this learner are exclusively markedness constraints based on co-occurrence restrictions on different distinctive features (forming natural classes). The entire process of learning alternates between constraint selection and constraint weighting, as follows: a new constraint is selected, and then all the constraints are reweighted, and this process is iterated until the learning has been completed.
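The hill-climbing idea can be sketched as follows. This is only an illustration under simplifying assumptions: the candidate set is a hypothetical three-member universe that can be enumerated directly (so no finite state machinery is needed), and plain gradient ascent with a fixed step size stands in for whatever optimiser the learner actually uses. The names `OMEGA`, `DATA` and `fit_weights` are ours.

```python
import math

# Hypothetical toy universe (form -> violation vector) and observed data.
OMEGA = {"x1": (0, 0), "x2": (1, 0), "x3": (0, 1)}
DATA = ["x1"] * 6 + ["x2"] * 3 + ["x3"]

def probabilities(weights, omega):
    """P(x) = exp(-h(x)) / Z over the enumerated universe."""
    star = {x: math.exp(-sum(w * c for w, c in zip(weights, v)))
            for x, v in omega.items()}
    z = sum(star.values())
    return {x: s / z for x, s in star.items()}

def fit_weights(omega, data, steps=2000, rate=0.5):
    """Climb log P(D) by following (expected - observed) violation counts."""
    n = len(next(iter(omega.values())))
    weights = [1.0] * n  # same initial weight for every constraint
    # Observed violations per constraint, averaged over the data.
    observed = [sum(omega[x][i] for x in data) / len(data) for i in range(n)]
    for _ in range(steps):
        p = probabilities(weights, omega)
        expected = [sum(p[x] * omega[x][i] for x in omega) for i in range(n)]
        # If the grammar expects more violations than are observed, the
        # constraint is too weak and its weight goes up, and vice versa.
        # Weights are kept nonnegative.
        weights = [max(0.0, w + rate * (e - o))
                   for w, e, o in zip(weights, expected, observed)]
    return weights

fitted = fit_weights(OMEGA, DATA)
print(fitted)
print(probabilities(fitted, OMEGA))
```

When the search has converged, the expected violation count of every constraint equals its observed count, so in this toy case the fitted grammar assigns each form a probability equal to its relative frequency in the data.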
The overall algorithm (procedure of learning) is summarised below:

Definition 6: Phonotactic learning algorithm
Input: a set Σ of segments classified by a set F of features, a set D of surface forms drawn from Σ*, an ascending set A of accuracy levels, and a maximum constraint size N

1 begin with an empty grammar G
2 for each accuracy level a in A do
3   do
4     select the most general constraint with accuracy less than a (if one exists) and add it to G
5     train the weights of the constraints in G
6   while a constraint is selected in Step 4

The algorithm terminates when the search in Step 4 fails to generate a new constraint at the least stringent accuracy level.

Now, after discussing the basic mechanisms of the UCLA Phonotactic Learner (based on Hayes and Wilson (2008)), we continue with the specific details of the learning simulations.

6.2.1 Learner input

The learner must be provided with input files containing learning (or training) data, a feature chart, and testing data.

Learning data

Since earlier learning experiments showed that there are no consistent co-occurrence restrictions on short vowel + geminate vs. singleton combinations in native Hungarian phonology (Magyar (2014)), we used only consonants (geminates and singletons) in this simulation. All words (only monosyllables) ending in a short vowel + short consonant or geminate sequence were extracted from the Hungarian National Corpus on the Szószablya Project Website (Halácsy et al. (2004)). Everything but the final consonant was removed from each word, and the word-final consonants (both singletons and geminates) were organised into a list. The data were collapsed across consonant classes: for example, [p], [t] and [k] were transcribed as t or t:, a general segment representing singleton and geminate voiceless stops, respectively.
We used the following shorthands to represent consonant classes:

t      voiceless stop (singleton)
t:     voiceless stop (geminate)
d      voiced stop (singleton)
d:     voiced stop (geminate)
ts     voiceless affricate (singleton)
t:s    voiceless affricate (geminate)
s      voiceless fricative (singleton)
s:     voiceless fricative (geminate)
n      nasal (singleton)
n:     nasal (geminate)
l      liquid (singleton)
l:     liquid (geminate)

This simplification was necessary because gemination in loanwords does not seem to depend on individual consonants (place of articulation), and universal hierarchies of geminate markedness are usually described in terms of consonant classes rather than more fine-grained differences between individual consonants.[4] Native speakers’ judgements suggest the same: there is consistency across consonant classes, but not across individual consonants. Besides, there was no need to specify the context as ‘word-final position in monosyllables’, as the corpus sample contained only monosyllables.

Feature chart

The learner also needs a feature chart, which defines the symbols according to their phonetic properties in the following format: the top row contains the feature names, the leftmost column contains the speech sounds, and the values can be +, - or 0 (unspecified). The chart contains all Hungarian consonant classes (that may participate in loanword gemination) and the features which are shared by all members of a particular consonant class and are sufficient to distinguish the consonant classes from each other. The feature chart is shown in Table 22.

[4] There is work on universal markedness hierarchies reflected in native patterns which refers to more fine-grained distinctions, such as place of articulation (Sano (2014)). Therefore, it would be interesting to explore more carefully the possibility of place of articulation influencing gemination processes.
       long   cons   son   syll   voice   cont   del rel   nasal
t      -      +      -     -      -       -      -         -
d      -      +      -     -      +       -      -         -
ts     -      +      -     -      -       -      +         -
s      -      +      -     -      -       +      -         -
z      -      +      -     -      +       +      -         -
n      -      +      +     -      +       -      -         +
l      -      +      +     -      +       +      -         -
t:     +      +      -     -      -       -      -         -
d:     +      +      -     -      +       -      -         -
t:s    +      +      -     -      -       -      +         -
s:     +      +      -     -      -       +      -         -
z:     +      +      -     -      +       +      -         -
n:     +      +      +     -      +       -      -         +
l:     +      +      +     -      +       +      -         -

Table 22: Feature chart

Testing data

Testing data include a singleton and a geminated version of each consonant class.

6.2.2 Learner settings

Specifying gram size

Gram size is the number of feature matrices that appear in the constraints. For example, the gram size of *[+liquid][-voice] is two. The higher the gram size, the longer it takes the learner to finish finding the constraints. In this case, since the learner is trained on a manageable number of symbols and features, we anticipated that it would be able to terminate the process without our having to impose any restrictions. Therefore, we did not specify the gram size.

Specifying maximum O/E

O/E stands for ‘observed over expected’, which is a measure of constraint effectiveness: the ratio of the number of times a constraint is violated in the learning data to the number of times it would be expected to be violated based on the grammar learnt so far. (For a more detailed discussion, see Hayes and Wilson (2008).) If we want to find the most important and the most powerful constraints, the O/E value has to be low. However, for the reasons mentioned above (in the discussion of specifying gram size), we did not specify the maximum O/E in this experiment.

Specifying the maximum number of constraints to discover

Specifying the maximum number of constraints to discover is another way of ensuring that the process will terminate within a reasonable time span. As the learning procedure was not expected to take very long, we did not specify the maximum number of constraints to discover.

6.2.3 Results

6.2.4 Constraints and weights

The constraints found and weighted by the learner are shown in Table 23 in descending order of weight (and not in the order in which they were discovered).
Constraints                    Weights
*[+del.rel,-long]              1.44
*[-son,+voice,+cont,+long]     1.375
*[+nasal,+long]                1.14
*[+voice,+cont,+long]          1.1
*[-son,+voice,+cont,-long]     0.979
*[+del.rel,+long]              0.831
*[-son,+voice,+long]           0.764
*[-son,+cont,+long]            0.695
*[+nasal]                      0.541
*[-son,+cont,-long]            0.54
*[+voice,+cont,-long]          0.164

Table 23: Constraints and weights

Since only a limited number of features and segments were being tested, the learner found only 11 constraints. The constraints listed above are defined as follows:

*[+del.rel,-long]: Short affricates are not allowed in word-final position.
*[-son,+voice,+cont,+long]: Long voiced fricatives are not allowed in word-final position.
*[+nasal,+long]: Long nasals are not allowed in word-final position.
*[+voice,+cont,+long]: Long voiced sonorants and sibilants are not allowed in word-final position.
*[-son,+voice,+cont,-long]: Short voiced obstruents are not allowed in word-final position.
*[+del.rel,+long]: Long affricates are not allowed in word-final position.
*[-son,+voice,+long]: Long voiced obstruents are not allowed in word-final position.
*[-son,+cont,+long]: Long sibilants are not allowed in word-final position.
*[+nasal]: Nasals are not allowed in word-final position.
*[-son,+cont,-long]: Short sibilants are not allowed in word-final position.
*[+voice,+cont,-long]: Short voiced sibilants are not allowed in word-final position.

As we can see, the learner was able to create constraints based on phonotactic generalisations, and the constraints are basically the same as those of universal markedness. The more a constraint lines up with the restrictions of universal markedness, the higher the weight the learner has assigned to it.

6.2.5 Predicted and observed probabilities

The learner assigns a weighted sum of constraint violations to each form in the testing data: the higher the score, the lower the probability of a given form.
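The way such a sum arises from the grammar can be sketched as follows. The weights are copied from Table 23; the feature values of the consonant classes are our reading of the feature chart (Table 22) and should be treated as an assumption, as should the helper names. A constraint is counted as violated when all of its feature values are found in the segment.

```python
import math

# Constraint weights copied from Table 23.
CONSTRAINTS = [
    ("+del.rel,-long", 1.44),
    ("-son,+voice,+cont,+long", 1.375),
    ("+nasal,+long", 1.14),
    ("+voice,+cont,+long", 1.1),
    ("-son,+voice,+cont,-long", 0.979),
    ("+del.rel,+long", 0.831),
    ("-son,+voice,+long", 0.764),
    ("-son,+cont,+long", 0.695),
    ("+nasal", 0.541),
    ("-son,+cont,-long", 0.54),
    ("+voice,+cont,-long", 0.164),
]

# Assumed feature values per consonant class (only the features that the
# constraints mention); length is added separately below.
CLASSES = {
    "t": "-son -voice -cont -del.rel -nasal",
    "d": "-son +voice -cont -del.rel -nasal",
    "ts": "-son -voice -cont +del.rel -nasal",
    "s": "-son -voice +cont -del.rel -nasal",
    "z": "-son +voice +cont -del.rel -nasal",
    "n": "+son +voice -cont -del.rel +nasal",
    "l": "+son +voice +cont -del.rel -nasal",
}

def feature_set(spec):
    """Turn '+nasal,+long' or '+nasal +long' into {'+nasal', '+long'}."""
    return set(spec.replace(",", " ").split())

def segments():
    """Build the singleton and the geminate version of every class."""
    out = {}
    for name, spec in CLASSES.items():
        feats = feature_set(spec)
        geminate = "t:s" if name == "ts" else name + ":"
        out[name] = feats | {"-long"}
        out[geminate] = feats | {"+long"}
    return out

def violation_sum(feats):
    """Sum the weights of all constraints whose features the segment has."""
    return sum(w for spec, w in CONSTRAINTS if feature_set(spec) <= feats)

for name, feats in segments().items():
    h = violation_sum(feats)
    print(f"{name:4s} {h:.3f} {math.exp(-h):.11f}")
```

For example, a geminate nasal violates both *[+nasal,+long] and *[+nasal], giving 1.14 + 0.541 = 1.681. Under the feature values assumed here, the sketch yields 3.934 for geminate voiced fricatives rather than the 3.916 listed in Table 24; the source of that small difference is not clear from the text.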
We converted the sums of violations into probabilities using the following formula:

$\text{predicted-rating}(x) = P^*(x)^{1/T}$, where $P^*(x) = \exp(-h(x))$ and h(x) is the score output by the model.

Table 24 shows both the sums of violations and the probabilities assigned to each consonant class. Singleton and geminated forms of each consonant class are listed in descending order of probability (well-formedness predicted on the basis of the learning data).

Consonant class                     Sum of violations   Probability
voiceless stops (singleton)         0                   1
voiceless stops (geminate)          0                   1
voiced stops (singleton)            0                   1
liquids (singleton)                 0.164               0.84874202188
voiceless fricatives (singleton)    0.54                0.58274825237
nasals (singleton)                  0.541               0.58216579539
voiceless fricatives (geminate)     0.695               0.49907444798
voiced stops (geminate)             0.764               0.46579949765
voiceless affricates (geminate)     0.831               0.43561345498
liquids (geminate)                  1.1                 0.33287108369
voiceless affricates (singleton)    1.44                0.23692775868
nasals (geminate)                   1.681               0.18618769521
voiced fricatives (singleton)       1.683               0.18581569195
voiced fricatives (geminate)        3.916               0.01992061806

Table 24: Violation sums and probabilities of candidates

6.2.6 Correlation between probabilities predicted by the model and type frequencies in the corpus

In Table 25, predicted probabilities are shown in comparison with observed probabilities (relative type frequencies in a corpus).
Consonant class                     Probability      Type frequency (%)
voiceless stops (singleton)         1                14.93212669683258
voiceless stops (geminate)          1                14.25339366515837
voiced stops (singleton)            1                11.538461538461538
liquids (singleton)                 0.84874202188    11.538461538461538
voiceless fricatives (singleton)    0.58274825237    7.918552036199094
nasals (singleton)                  0.58216579539    7.918552036199094
voiceless fricatives (geminate)     0.49907444798    6.787330316742081
voiced stops (geminate)             0.46579949765    6.334841628959276
voiceless affricates (geminate)     0.43561345498    5.88235294117647
liquids (geminate)                  0.33287108369    4.524886877828054
voiceless affricates (singleton)    0.23692775868    3.167420814479638
nasals (geminate)                   0.18618769521    2.48868778280543
voiced fricatives (singleton)       0.18581569195    2.48868778280543
voiced fricatives (geminate)        0.01992061806    0.22624434389140274

Table 25: Predicted probabilities and type frequencies (%) of candidates

Table 25 clearly shows that the learner was able to predict the probability of both singleton and geminate consonants correctly - the descending order of both singletons and geminates based on type frequency in the lexicon matches up with the probabilities predicted by the learner. The correlation between the probabilities predicted by the model and the relative frequency distributions in the corpus is very strong (r=0.988). It is plotted in Figure 9. The probabilities assigned by the learner are shown on the x axis, while type frequency distributions are on the y axis.

Figure 9: Correlation between probabilities predicted by the model and type frequencies (%) in the corpus

6.3 Comparison with a model using hand-crafted markedness constraints

We have seen previously that the model is able to learn the frequency distribution of geminate and singleton consonants based on phonotactic generalisations drawn from a corpus, and that the constraints it learns roughly correspond to those of universal markedness.
Therefore, we can conclude that there is no specific need to equip the learner with particular information about geminate and singleton markedness. However, it is important to see how well a model provided with such information would perform compared to the one discussed in the previous subsection.

This model was implemented using the Maxent Grammar Tool (Hayes and Wilson (2009)). Just like the learning model, it is based on the principle of maximum entropy. However, it uses handcrafted constraints instead of discovering new constraints based on phonotactic generalisations. It has to be provided with OT tableaus containing the relevant (unranked) constraints and the observed probability of each possible candidate. Based on this information, the model weights the constraints and assigns predicted probabilities to each form, which may or may not match up with the actual observed probabilities.

6.3.1 Input data

OT tableaus and constraints were used as input data. Each tableau contains a singleton and a geminated version of a segment representing a consonant class, violation marks for the constraints and observed probabilities for each possible output candidate. We used the following markedness constraints (for both singletons and geminates):

*zz: Geminated voiced fricatives are forbidden.
*ss: Geminated voiceless fricatives are forbidden.
*tt: Geminated voiceless stops are forbidden.
*dd: Geminated voiced stops are forbidden.
*t:s: Geminated voiceless affricates are forbidden.
*nn: Geminated nasals are forbidden.
*ll: Geminated liquids are forbidden.
*z: Singleton voiced fricatives are forbidden.
*s: Singleton voiceless fricatives are forbidden.
*t: Singleton voiceless stops are forbidden.
*d: Singleton voiced stops are forbidden.
*ts: Singleton voiceless affricates are forbidden.
*n: Singleton nasals are forbidden.
*l: Singleton liquids are forbidden.
These constraints are given unranked to the learner, which weights them based on the observed probabilities and constraint violations of the possible output candidates.

6.3.2 Results

The results of the simulation are shown in Table 26.

Consonant class                     Observed probability   Predicted probability
voiced fricatives (singleton)       0.92                   0.9199511662223586
voiced fricatives (geminate)        0.08                   0.08004883377764128
voiceless fricatives (singleton)    0.54                   0.5399967938614894
voiceless fricatives (geminate)     0.46                   0.4600032061385106
voiceless stops (singleton)         0.51                   0.5099991991080961
voiceless stops (geminate)          0.49                   0.490000800891904
voiced stops (singleton)            0.65                   0.6499876212125152
voiced stops (geminate)             0.35                   0.3500123787874848
voiceless affricates (singleton)    0.35                   0.3500123787874848
voiceless affricates (geminate)     0.65                   0.6499876212125152
nasals (singleton)                  0.76                   0.7599769430688887
nasals (geminate)                   0.24                   0.24002305693111126
liquids (singleton)                 0.72                   0.7199811163658887
liquids (geminate)                  0.28                   0.28001888363411126

Table 26: Observed and predicted probabilities of geminates and singletons by consonant class

The table clearly shows that the predicted probabilities line up with the observed probabilities. There is a perfect positive correlation between predicted and observed probabilities (r=1), which is plotted below in Figure 10.

Figure 10: Correlation between probabilities observed in the corpus and predicted by the model

We can conclude that both models predict the distribution of singletons and geminates almost equally well. However, the model which uses handcrafted constraints and OT tableaus has one advantage over the one which discovers constraints based on phonotactic generalisations drawn from the corpus: it makes a direct comparison possible between the singleton and the geminate form of each consonant class.
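The close fit is less surprising once the shape of the tableaus is taken into account. The sketch below is not the Maxent Grammar Tool's own fitting procedure; it uses a closed-form shortcut, under the assumption that each tableau has exactly two candidates and that each candidate violates exactly one constraint that no other candidate violates. The function names are ours; the observed probabilities are those of Table 26.

```python
import math

# Observed (singleton, geminate) probabilities from Table 26.
OBSERVED = {
    "voiced fricatives": (0.92, 0.08),
    "voiceless fricatives": (0.54, 0.46),
    "voiceless stops": (0.51, 0.49),
    "voiced stops": (0.65, 0.35),
    "voiceless affricates": (0.35, 0.65),
    "nasals": (0.76, 0.24),
    "liquids": (0.72, 0.28),
}

def fit_pair(p_singleton, p_geminate):
    """Nonnegative weights (w_singleton, w_geminate) matching the split.

    In a two-candidate tableau, P(singleton) / P(geminate) equals
    exp(w_geminate - w_singleton), so only the difference matters.
    """
    diff = math.log(p_singleton / p_geminate)
    return (0.0, diff) if diff >= 0 else (-diff, 0.0)

def predict(w_singleton, w_geminate):
    """Maxent probabilities of the two candidates of one tableau."""
    a, b = math.exp(-w_singleton), math.exp(-w_geminate)
    return a / (a + b), b / (a + b)

for cls, (p_s, p_g) in OBSERVED.items():
    w_s, w_g = fit_pair(p_s, p_g)
    print(cls, round(w_s, 3), round(w_g, 3), predict(w_s, w_g))
```

With one dedicated constraint per candidate, any observed split between a singleton and a geminate can be matched by a suitable pair of weights, so a near-perfect correlation is what such a grammar should be expected to deliver.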
7 Testing Hypothesis 4: Perception of vowel length

There is no literature on the length of short vowels before geminates in Hungarian, but in many languages (including Finno-Ugric ones), vowels shorten before geminates[5] (Kawahara (to appear), Doty et al. (2007)). Based on evidence from other languages, we could hypothesise that gemination in Hungarian loanwords is a strategy to preserve source vowel length (shortness), if native speakers of Hungarian perceive the vowel in the source word as shorter than its ‘substitute’ vowel in the loanword. The goal of this section is to test this hypothesis.

[5] There are some languages - for example, Japanese - in which it works the opposite way: vowels lengthen before geminates.

7.1 Adaptation of English vowels

Native speakers substitute foreign vowels with the perceptually closest one in the native vowel inventory, or the one which they consider to be the most similar to the source vowel. Some of the substitute vowels may not seem phonetically ‘close’ to the original; however, Hungarian speakers regard them as the best possible (for want of a better) replacement for the source vowel. Since English, German and Hungarian all have short and long vowels, faithfulness to source-vowel length can easily be satisfied in the process of borrowing words from English and German into Hungarian. However, as Hungarian does not have exactly the same vowel inventory as English and German, faithfulness to vowel quality cannot be achieved as easily as the preservation of vowel length.

At first sight, it looks as if faithfulness to vowel length is ranked higher than faithfulness to vowel quality, but this claim cannot be proven. Faithfulness to vowel quality will practically always be violated, given the different vowel inventories of the source language and the borrowing language, and we cannot find examples which clearly show that both constraints could be satisfied but one of them is more important. Probably the only example that shows this kind of pattern
is the way the word spray [sprej] was borrowed into Hungarian. It is pronounced as [spre:] or [Spre:] as a loanword, which involves a source-word diphthong changing into a monophthong, even though words like éj [e(:)j] or kéj [ke(:)j] are perfectly well-formed in Hungarian. The reason why spray was borrowed with a monophthong is that diphthongs are not found in standard Hungarian: what looks like a diphthong similar to [ej] is a vowel + liquid sequence which is never described as a diphthong in Hungarian phonology. Therefore, even this example cannot prove that there is a clear preference for faithfulness to length over faithfulness to vowel quality without coercion (e.g. the lack of the same vowel in the native Hungarian vowel inventory).
Table 27 shows English vowels and the Hungarian counterparts which native speakers use in loanwords.

SW   [I]   [E] / [e]   [æ]   [O]   [A]   [U]
LW   [i]   [E]         [E]   [o]   [O]   [u]
Table 27: English vowels and their Hungarian loan counterparts

German vowels and their Hungarian substitutes in loanwords are shown in Table 28 below.

SW   [I]   [E / e]   [y]   [œ]   [O]   [U]   [a]
LW   [i]   [E]       [y]   [ø]   [o]   [u]   [O]
Table 28: German vowels and their Hungarian loan counterparts

7.2 Perception experiment 1
We ran a pilot experiment in order to see how native speakers of Hungarian perceive vowel length in English and German words compared to Hungarian words, and whether their perception of vowel length plays a role in loanword gemination.
7.2.1 Participants
Four native Hungarian speakers participated in the experiment. All of them currently reside in Hungary and have not lived in any other country for an extended period. They speak some English or German, but they are not conversant with phonetics and do not have constant exposure to English or German spoken by native speakers.
7.2.2 Speech material
The speech material consists of English, German and Hungarian words recorded by native speakers of English, German and Hungarian.
Word pairs comprise one Hungarian word and one English or German word which form a minimal pair or quasi-minimal pair. In order to avoid considerable differences in duration, we paired up words which were produced by speakers of similar voice quality. All words in the experiment are monosyllabic and end in a short vowel followed by a short voiceless or voiced consonant. None of the words in any word pair is a loanword. In the case of minimal pairs, the only difference is in the vowel: the Hungarian word contains the vowel which is considered to be the most faithful substitute for the vowel in the English or German member of the word pair. Word pairs used as speech material are listed in Appendix III/A and III/B.
7.2.3 Task
Word pairs were played to participants. Each word pair consists of a Hungarian word and an English word, or a Hungarian word and a German word (for example, Hungarian hit [hit] ‘faith’ and English hit [hIt]). Each word pair was played twice, and the order in which the word pairs followed one another was randomised. The two members of each word pair were presented in both orders: once with the Hungarian word first, and once with the English / German word first. Participants were asked to decide which of the two words has the shorter vowel.
7.2.4 Results
English-Hungarian word pairs
Table 29 shows how participants in the experiment perceived English vowels in comparison to their Hungarian counterparts. In addition, actual durations of the vowels are included (in seconds).

Word pair      Perceived as shorter   Perceived as longer   Duration (s)
mit [mit]      by 0 subjects          by 4 subjects         0.127382
hit [hIt]      by 4 subjects          by 0 subjects         0.115009
fut [fut]      by 0 subjects          by 0 subjects         0.101144
foot [fUt]     by 0 subjects          by 0 subjects         0.098444
hat [hOt]      by 0 subjects          by 4 subjects         0.117373
but [hʌt]      by 4 subjects          by 0 subjects         0.078111
vet [vEt]      by 0 subjects          by 4 subjects         0.130203
set [set]      by 4 subjects          by 0 subjects         0.088622
szid [sid]     by 0 subjects          by 4 subjects         0.116010
hid [hId]      by 4 subjects          by 0 subjects         0.083068
tud [tud]      by 0 subjects          by 0 subjects         0.107769
good [gUd]     by 0 subjects          by 0 subjects         0.105599
had [hOd]      by 0 subjects          by 4 subjects         0.141419
bud [bʌd]      by 4 subjects          by 0 subjects         0.087658
szed [sEd]     by 0 subjects          by 4 subjects         0.138829
said [sed]     by 4 subjects          by 0 subjects         0.102780
szed [sEd]     by 4 subjects          by 0 subjects         0.138829
sad [sæd]      by 0 subjects          by 4 subjects         0.149757
Table 29: Vowel length in Hungarian-English word pairs

As shown above, the four Hungarian native speakers perceived most of the English vowels as shorter than their Hungarian counterparts. There are two English vowels, however, that they did not perceive as shorter than their Hungarian counterparts: [U] and [æ]. English [U] and Hungarian [u] were perceived as having the same length, whereas English [æ] was rated as longer than Hungarian [E]. The judgements given by participants line up with the measured durations: where one vowel was consistently judged shorter, its duration is also shorter, and the two pairs judged to be of equal length differ in duration by less than 3 ms.
German-Hungarian word pairs
Table 30 shows how Hungarian-speaking subjects perceived German vowels in comparison to their Hungarian counterparts. Vowel duration data are also included in the table (in seconds).

Word pair      Perceived as shorter   Perceived as longer   Duration (s)
kis [kiS]      by 0 subjects          by 4 subjects         0.123737
Tisch [tIS]    by 4 subjects          by 0 subjects         0.096124
mos [mos]      by 0 subjects          by 0 subjects         0.100956
Bosch [boS]    by 0 subjects          by 0 subjects         0.101296
vas [vOS]      by 0 subjects          by 4 subjects         0.137081
wasch [vaS]    by 4 subjects          by 0 subjects         0.092235
tus [tuS]      by 0 subjects          by 4 subjects         0.119037
Fusch [fuS]    by 4 subjects          by 0 subjects         0.085063
köt [køt]      by 0 subjects          by 4 subjects         0.138472
Schött [Sœt]   by 4 subjects          by 0 subjects         0.080
les [lES]      by 0 subjects          by 4 subjects         0.134030
fesch [feS]    by 4 subjects          by 0 subjects         0.087056
Table 30: Vowel length in Hungarian-German word pairs

Participants in the experiment perceived most German vowels as shorter than their Hungarian counterparts. There is only one German vowel which they did not rate as shorter: [o]. They perceived German [o] as having the same length as Hungarian [o].
7.3 Perception experiment 2
7.3.1 Participants
The participants were the same as the ones in the previous perception experiment.
7.3.2 Speech material
The speech material consists of Hungarian minimal pairs or near minimal pairs. All words used in this experiment are monosyllabic, and the two members of each pair contain the same vowel and differ only in the length of the final consonant: one word ends in a singleton and the other one in a geminate (e.g. hit [hit] ‘faith’ and hitt [hit:] ‘believe-3rd.p.-past’). We only used short vowels and voiceless stops in the experiment. Items are listed in Appendix III/C.
7.3.3 Task
The minimal pairs were played to participants, and participants were asked to decide which word in each minimal pair has a shorter vowel.
All word pairs were played twice, in a random order in which no word pair was directly followed by its repetition. Word pairs were presented in two different orders: (a) geminate - singleton and (b) singleton - geminate.
7.3.4 Results
Table 31 shows how participants in the experiment perceived short vowels followed by singletons and geminates. Actual durations of the vowels (in seconds) in both contexts are also included.

Word pair      Perceived as shorter   Perceived as longer   Duration (s)
hit [hit]      by 0 subjects          by 4 subjects         0.115988
hitt [hit:]    by 4 subjects          by 0 subjects         0.057
vet [vEt]      by 0 subjects          by 4 subjects         0.142483
vett [vEt:]    by 4 subjects          by 0 subjects         0.090191
lap [lOp]      by 0 subjects          by 4 subjects         0.146527
lapp [lOp:]    by 4 subjects          by 0 subjects         0.107326
lop [lop]      by 0 subjects          by 4 subjects         0.138545
hopp [hop:]    by 4 subjects          by 0 subjects         0.076842
luk [luk]      by 0 subjects          by 4 subjects         0.115937
pukk [puk:]    by 4 subjects          by 0 subjects         0.063932
Table 31: Vowel length before singletons and geminates

As shown above, all four subjects perceived vowels followed by geminates as shorter than those followed by singletons. Their judgements line up with the durations of the vowels in both contexts.
7.4 Implications for loanword gemination
Although we have not tested all possible short vowel and geminate / singleton combinations, the previous experiment suggests that native speakers of Hungarian perceive vowels before geminates in closed syllables as shorter than vowels before singleton consonants. Moreover, among the loanwords which undergo gemination, we cannot find any in which the English or German vowel is perceived as longer than, or of the same length as, its Hungarian counterpart and is followed by a singleton consonant in the source word.
For example, speakers will pronounce English fit as [fit:] but not the name Pat as [pEt:], since English [I] is perceived as shorter than Hungarian [i], while English [æ] is heard as longer than Hungarian [E]. In the German data, almost all short vowels were perceived as shorter than their Hungarian counterparts, which may explain why gemination in German loanwords is more widespread than in English loanwords, extending to consonants preceded by almost all kinds of short vowels. This implies that there is a connection between the preservation of source vowel length (shortness) and gemination.
8 Interim summary
The goal of this paper is to explore a cross-linguistically widespread phonological process - gemination in loanwords - which is also found in Hungarian. In connection with this phenomenon, the following hypotheses were put to the test in the previous sections: (1) geminate-to-singleton ratios based on type frequencies in the lexicon line up with patterns of universal geminate markedness, and native speakers have knowledge of these patterns; (2) there may be other processes in native Hungarian phonetics or phonology which are similar to gemination, and universal markedness patterns are reflected in them, too; (3) patterns which correspond to universal markedness are learnable from the native Hungarian lexicon; (4) gemination in loanwords is a strategy to preserve source vowel shortness in closed syllables.
In Section 4, Hypothesis 1 was tested in two studies: a corpus study and a wug test. We found a strong positive correlation between geminate-to-singleton ratios calculated from type frequencies in the corpus and those based on native speakers’ well-formedness judgements on nonce words. The distribution of geminates and singletons by consonant class, based on both type frequencies and human judgements, lines up with patterns of universal geminate markedness.
The goal of Section 5 was to test Hypothesis 2, that is, to find out whether there are other processes in the native phonetics/phonology which are similar to loanword gemination and also line up with patterns of universal markedness. We found one such process, post-tonic lengthening in closed syllables: singleton consonants lengthen following stressed vowels. In general, consonants which were more likely to undergo gemination in loanwords were more likely to lengthen post-tonically, too.
In Section 6, we ran learning simulations to test Hypothesis 3. Results of the experiments suggest that patterns of loanword gemination - which line up with universal geminate markedness - are learnable from the native Hungarian lexicon based on phonotactic generalisations. We have also shown that a learner which is not equipped with special information on segment markedness performs almost as well as a model using hand-crafted markedness constraints.
Hypothesis 4 was tested in Section 7. We ran two perception experiments, which yielded two results: (1) native speakers perceived certain English and German vowels as shorter than their Hungarian counterparts (which are widely used as their substitute vowels in loanwords), and loanword gemination is more likely to happen in the presence of these vowels; (2) the same native speakers perceived vowels followed by geminates as shorter than those followed by singletons. All this suggests that gemination in loanwords can be a strategy to preserve the length (shortness) of the vowel in the source word.
We can conclude that all of our hypotheses have been confirmed, and that gemination in loanwords is motivated by the need to preserve source-vowel length (shortness), but this need is not always satisfied: whether gemination occurs or not is regulated by universal geminate markedness. With all this in mind, we have all the ingredients ready for an analysis.
9 Analysis: A Maximum Entropy model
As mentioned earlier, Hungarian loanword gemination is a gradient phenomenon with considerable inter- and intra-speaker variation. Therefore, we are going to develop a maximum entropy model instead of providing a categorical OT account.
9.1 The method
The method is the same as in Section 6.3, that is, a maxent model using handcrafted constraints and observed probabilities to weight constraints and predict probabilities for the possible output forms. It was implemented with the help of the Maxent Grammar Tool (Hayes and Wilson (2008)).
9.2 Constraints
In Section 6.3, we tested geminate markedness, therefore only markedness constraints were used. In this case, we need both markedness and faithfulness constraints. [6]
Footnote 6: Apart from those listed in this section, there are two additional constraints which could be used. One is a more general Ident constraint, IdSL, which requires faithfulness to the vowel in the source word, that is, the same vowel must be used in the loanword as in the source word. However, this constraint is violated by all winning candidates and the more particular constraint IdSL(length) suffices. The other constraint would be *nnv, which bans all vowels from surfacing which are not native to the Hungarian vowel inventory. However, for the sake of simplicity, this issue will also be ignored in the present analysis.
9.2.1 Faithfulness constraints
IdentSV-LV(length) (abbreviated IdSV-LV(length) in the tableaus and IdSL(l) in the text): Vowel length of the source word must be preserved in the loanword.
MaxOrthGem (abbreviated MaxOG): If the consonant in question is spelt with a double letter in the source word, it must be geminated in the loanword.
9.2.2 Markedness constraints
The markedness constraints below are not yet listed in the hierarchical order of universal geminate markedness. When ranked (as in a categorical OT analysis) or weighted (as in a maxent model), they reflect the hierarchy of universal geminate markedness.
*zz: Geminated voiced fricatives are forbidden.
*ss: Geminated voiceless fricatives are forbidden.
*tt: Geminated voiceless stops are forbidden.
*dd: Geminated voiced stops are forbidden.
*t:s (written *tts in the tableaus): Geminated voiceless affricates are forbidden.
*nn: Geminated nasals are forbidden.
*ll: Geminated liquids are forbidden.
9.3 Input data
9.3.1 Words
We are going to use the following words to train and test the grammar and find constraint weights.
szett [sEt:] ‘set or outfit’
Pat [pEt] <diminutive of Patrick or Patricia>
Ted [tEd] <diminutive of Edward or Theodore>
giccs [git:S] ‘kitsch’
hall [hOl:] ‘hall’
Hal [hæl] <diminutive of Harry>
dzsem [dZEm] or [dZEm:] ‘marmalade’
friss [friS:] ‘fresh’
dzsessz [dZEs:] ‘jazz’
9.3.2 OT tableaus
The input is not an abstract underlying representation in this case, but both the written and the spoken form of the source word. The following tableaus are not classic OT tableaus, in the sense that the constraints are not ranked: they will be weighted by the model. Therefore, fatal violations are not marked, and instead of winners, probabilities are indicated. Since gemination in loanwords exhibits much inter- and intra-speaker variation, oftentimes there is no categorical winner among the candidates. Probabilities are assigned to possible surface forms, ranging from 0 (not attested) to 1 (most likely). Probabilities are based on Nádasdy’s (1989) description, my own judgements, and personal conversations with native linguists (Miklós Törkenczy, Péter Rebrus, Péter Siptár, Etelka Tekla Gráczi) and several native speakers.
Tableau 1 shows a word which does not contain a source-word orthographic geminate and ends in a voiceless stop. It was borrowed from English through German. Most native speakers prefer the surface form containing a low mid vowel [E] followed by a geminated [t].

set /set/     Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. sEt        0.1    *
b. sEt:       0.9                                 *
Tableau 1: szett ‘set/outfit’

Candidate (a) violates IdSL(l), since [E] was perceived as longer than English [e] by native speakers of Hungarian, while candidate (b) violates *tt, since it contains a geminated voiceless stop.
Tableau 2 contains an example of a word ending in a singleton voiceless stop, preceded by [æ] in the source word and by [E] in the loanword. Candidate (a), that is, a surface form ending in a singleton [t], is preferred by most native speakers of Hungarian over candidate (b).

Pat /pæt/     Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. pEt        0.9
b. pEt:       0.1    *                            *
Tableau 2: Pat <name>

Candidate (a) does not violate any constraints. Candidate (b) violates two constraints, IdSL(l) and *tt. It violates IdSL(l) because English [æ] was not perceived as shorter than Hungarian [E] by native Hungarian speakers. If the Hungarian vowel (which is not longer than the English source vowel) is followed by a geminate, it will shorten further and thus differ even more in length from the source vowel. Candidate (b) also violates *tt because it ends in a geminated voiceless stop.
A foreign name well-known to Hungarian speakers and ending in a voiced stop is shown in Tableau 3. Most people prefer candidate (a), a surface form ending in a singleton [d].

Ted /ted/     Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. tEd        0.9    *
b. tEd:       0.1                                      *
Tableau 3: Ted <diminutive of Edward or Theodore>

Candidate (a) violates IdSL(l) because native speakers perceived English [e] as shorter than Hungarian [E]. Candidate (b) violates *dd, since it ends in a geminated voiced stop.
Tableau 4 shows a word which was borrowed into Hungarian with a geminated voiceless affricate. [7] The only acceptable form of this loanword is candidate (b), which ends in [t:S].
Footnote 7: In this paper, we are not going into details concerning why the [k] has changed into a [g] in the loanword.

Kitsch /kItS/ Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. gitS       0.0    *
b. git:S      1.0                                           *
Tableau 4: giccs ‘kitsch, gaudy stuff’

Candidate (a) violates IdSL(l) because native speakers perceived German [I] as shorter than Hungarian [i]. Candidate (b) violates *tts, as it ends in a geminated voiceless affricate.
The following word (shown in Tableau 5) is an example of a borrowing ending in a geminated liquid. The source word is German and contains a double consonant letter in the spelling. The most widespread form of this loanword is candidate (b), that is, the one ending in a geminate [l].

Hall /hal/    Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. hOl        0.1    *                *
b. hOl:       0.9                                                      *
Tableau 5: Hall ‘hall’

Candidate (a) violates two constraints: IdSL(l) because native speakers perceived German [a] as shorter than Hungarian [O], and MaxOG, since the source word is spelt with a double consonant letter. Candidate (b) violates only one constraint, *ll, because it ends in a liquid geminate.
Similarly to Pat, the example in the following tableau (Tableau 6) is not a loanword per se, but it is a common English name that many Hungarians know from films or books. People pronounce it with a short [l].

Hal /hæl/     Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. hEl        1.0
b. hEl:       0.0    *                                                 *
Tableau 6: Hal <name>

Candidate (a) does not violate any constraints. Candidate (b) violates two constraints: IdSL(l) because native speakers did not perceive English [æ] as shorter than Hungarian [E], and *ll for containing a geminated liquid.
An example of a loanword optionally ending in a short or a long nasal is shown in Tableau 7. The source word which the loanword is based on contains a singleton consonant spelt with a single consonant letter. The original source is an English word, but it was borrowed into Hungarian through German. [8]

Jam /dZem/    Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. dZEm       0.5    *
b. dZEm:      0.5                                                 *
Tableau 7: dzsem ‘marmalade’

Candidate (a) violates IdSL(l), as native speakers perceived German [e] as shorter than Hungarian [E], while candidate (b) violates *nn, because it ends in a geminated nasal.
The example in Tableau 8 is a loanword borrowed from German, which ends in a voiceless fricative. The source word does not contain an orthographic geminate, only a trigraph which represents [S] in German spelling. The winner is a form which contains an [i] instead of the source vowel [I], and always ends in a geminate [S].

frisch /frIS/ Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. friS       0.0    *
b. friS:      1.0                            *
Tableau 8: friss ‘fresh’

Candidate (a) violates IdSL(l), as native speakers perceived German [I] as shorter than Hungarian [i]. As candidate (b) contains a long voiceless fricative, it violates *ss.
The following word (shown in Tableau 9) comes from English, has a double consonant letter in the spelling of the source word and a devoiced geminate in the loanword for most native speakers of Hungarian.

jazz /dZæz/   Prob.  IdSV-LV(length)  MaxOG  *ss  *tt  *dd  *tts  *nn  *ll  *zz
a. dZEz       0.05                    *
b. dZEz:      0.05   *                                                      *
c. dZEs:      0.9    *                       *
Tableau 9: dzsessz ‘jazz’

Candidate (a) violates MaxOG because the source word is spelt with a double consonant letter. Both candidate (b) and candidate (c) violate IdSL(l). Apart from that constraint, candidate (b) violates *zz for containing a long voiced fricative, while candidate (c) violates *ss, since it ends in a geminated voiceless fricative.
As mentioned earlier in this subsection, the above tableaus do not present a full OT analysis of the data: instead, they show input forms (both spelling and pronunciation of the source word), possible surface forms and their violation marks for each constraint, and the probability score of each candidate. What they do not show is fatal violations, since the constraints are not ranked.
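To make the link between violation marks and probabilities concrete, the following sketch uses the standard maxent formulation (as in Goldwater and Johnson (2003) and Hayes and Wilson (2008)): each candidate's harmony is the weighted sum of its violations, and its probability is exp(-harmony) divided by the sum of exp(-harmony) over all candidates of the same input. This is an illustration under assumptions, not the Maxent Grammar Tool itself: the function and variable names are ours, the violation profiles are copied from Tableaus 5-7 only, and the weights are the ones reported in Table 32 (Section 9.4.1).

```python
import math

# Constraint weights copied from Table 32 (only those needed for Tableaus 5-7).
WEIGHTS = {
    "IdSL(l)": 2.9424513636090874,
    "MaxOG": 5.8290854593356105,
    "*nn": 2.942214625961538,
    "*ll": 6.573670003683399,
}

# Violation profiles copied from Tableaus 5-7:
# input -> {candidate: list of violated constraints (one mark each)}.
TABLEAUS = {
    "Hall /hal/": {"hOl": ["IdSL(l)", "MaxOG"], "hOl:": ["*ll"]},
    "Hal /hæl/": {"hEl": [], "hEl:": ["IdSL(l)", "*ll"]},
    "Jam /dZem/": {"dZEm": ["IdSL(l)"], "dZEm:": ["*nn"]},
}


def maxent_probabilities(candidates, weights):
    """Map each candidate to exp(-harmony) / Z, where harmony is the
    weighted sum of its violations and Z normalises over the candidates."""
    harmony = {
        cand: sum(weights[c] for c in violated)
        for cand, violated in candidates.items()
    }
    z = sum(math.exp(-h) for h in harmony.values())
    return {cand: math.exp(-h) / z for cand, h in harmony.items()}


for source_word, candidates in TABLEAUS.items():
    for cand, prob in maxent_probabilities(candidates, WEIGHTS).items():
        print(f"{source_word:12s} {cand:6s} {prob:.6f}")
```

For these three tableaus, the values computed this way should come out very close to the corresponding entries in the predicted column of Table 33; the other tableaus are left out of the sketch, which does not attempt to re-run the learning step that produced the weights.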
In the following subsection, we will find out the weights for the constraints and see if the model is able to make predictions that line up with the actual observed probabilities.
Footnote 8: That is the reason why we are using [e] instead of [æ].
9.4 Results
9.4.1 Constraint weights
Based on the input data, the learner has assigned the following weights to the constraints listed in 9.2. Constraints with higher weights are more powerful than the ones with lower weights. Constraints and weights are shown in Table 32.

Constraint   Weight
*zz          8.660745165047155
*ll          6.573670003683399
MaxOG        5.8290854593356105
*dd          5.139753175751042
IdSL(l)      2.9424513636090874
*nn          2.942214625961538
*tt          1.3375742260864934
*t:s         0.01589221942517886
*ss          0.0
Table 32: Constraints and their weights

The weights assigned to markedness constraints correspond quite closely to hierarchies of universal geminate markedness. Constraints banning geminated voiced fricatives, liquids and voiced stops have much higher weights than those banning geminated voiceless stops, fricatives and affricates. However, the relative weighting of the constraints banning geminated voiceless consonants could be made more precise by providing the learner with more input data.
9.4.2 Predicted and observed probabilities of competing candidates
Table 33 shows the observed probability of each possible output form and the predicted scores assigned to the same forms by the model.

Input           Candidate   Observed   Predicted
set /set/       sEt         0.1        0.05009449622784214
                sEt:        0.9        0.9499055037721578
Pat /pæt/       pEt         0.9        0.9499055037721581
                pEt:        0.1        0.05009449622784202
Ted /ted/       tEd         1.0        0.8998973370338739
                tEd:        0.0        0.10010266296612606
Kitsch /kItS/   gitS        0.0        0.050094496227842074
                git:S       1.0        0.9499055037721579
Hall /hal/      hOl         0.1        0.09994221307400378
                hOl:        0.9        0.9000577869259961
Hal /hæl/       hEl         1.0        0.9999263506337052
                hEl:        0.0        0.00007364936629
Jam /dZem/      dZEm        0.5        0.4999408155883891
                dZEm:       0.5        0.5000591844116109
frisch /frIS/   friS        0.0        0.050094496227842074
                friS:       1.0        0.9499055037721579
jazz /dZæz/     dZEz        0.05       0.0501743010704173
                dZEz:       0.05       0.05005760161197584
                dZEs:       0.9        0.8997680973176069
Table 33: Predicted and observed probabilities of candidates

Even without calculating the correlation between predicted and observed probability scores, it is easy to see that the model’s predictions closely correspond to native speakers’ preferences (that is, observed probabilities). The correlation between observed and predicted scores is close to 1 (r = 0.9969941), which is plotted as Figure 11.
Figure 11: Correlation between observed and predicted probabilities
10 Conclusions
This paper discussed a cross-linguistically widespread phenomenon of loanword phonology, namely, gemination in loanwords, which in Hungarian is rather different from the same process in other languages. It has been claimed not to be motivated by native Hungarian phonology and phonotactics, but the patterns of loanword gemination seem to line up with universal hierarchies of geminate markedness. Our goal was to find out what the motivation is for this process and whether it is regulated by universal markedness or native Hungarian phonotactics. We found that type frequency distributions of singleton and geminate consonants in the native Hungarian lexicon line up with universal hierarchies of geminate markedness.
Speakers are also aware of these patterns: their judgements of nonce words also line up with universal geminate markedness. Apart from loanword gemination, frequency distributions and native speakers’ judgements, we discovered another phenomenon which reflects universal patterns of geminate markedness: a subphonemic process referred to as post-tonic lengthening. Furthermore, we have shown by a learning simulation that geminate markedness hierarchies can be learned from the native Hungarian lexicon based on phonotactic generalisations, which disproves the claim that the propensity of some consonants to geminate in loanwords, or the lack of gemination in the case of others, has nothing to do with native Hungarian phonotactics.
We also showed that native speakers of Hungarian perceive vowels followed by geminates to be shorter than those followed by singleton consonants. They also perceived certain English and German vowels as shorter than the Hungarian vowels which are generally used as their substitutes in Hungarianised loanwords, and gemination occurs only in words which contain those vowels. This seems to indicate that gemination in loanwords is a strategy to preserve the shortness of the source word vowel in the loanword.
Finally, we provided a maximum entropy model to account for the non-categorical nature of the phenomenon. The model was able to predict the probability of each possible output form correctly and assigned weights to the markedness constraints in a way that lines up with universal hierarchies.
However, there are additional issues that we will have to address in the future and that are outside the scope of this paper. In the nonce well-formedness judgement task, subjects were asked to choose between two monosyllabic words and decide which one would be a more well-formed Hungarian word or Hungarianised loanword.
Therefore, when they were trying to make a decision, they were looking at full word forms and were influenced by several additional factors, including the similarity of a given string to existing word endings (for example, if a nonce word looks like boz, it is likely to be preferred over bozz not only because of geminate markedness, but also as a result of its similarity to the ending of the existing word doboz ‘box’). In the learning experiment, by contrast, we were training and testing on word endings or rhymes. It would be interesting to see how the learner performs if we try to find phonotactic restrictions for whole monosyllabic word forms.
There is another factor that we did not consider in our analysis, but which might play a role in the gemination of consonants in loan monosyllables. In a data collection based on a reverse-alphabetised dictionary (Papp (1969)), we found that when a source word ends in an unattested vowel + consonant sequence, that is, a rhyme which is missing from the language (for example, there are no Hungarian words ending in -up or -ip, which may or may not be an accidental gap), gemination is more likely to occur in the loanword. However, this generalization is very hard to express in a formal analysis. If a short vowel + short consonant sequence is unattested, but at the same time the geminated form is equally unattested, why would the word be borrowed with a geminate consonant? Is it some sort of a minimal word requirement?
In the perception experiment, we used recordings by native speakers of Hungarian, English and German. The speakers reading out the word pairs to be compared had similar voice qualities, in order to exclude drastic individual differences in vowel length. However, ‘similar voice quality’ is a rather nebulous concept and is hard to verify, therefore it would be more advantageous to ask bilingual people (Hungarian-English and Hungarian-German) to record words in both languages.
In the same experiments, we used recordings by speakers of British English (that is why we are using the symbol [e] instead of [E]). It would be interesting to see which vowels the subjects would perceive as shorter, if we used recordings by American speakers who pronounce [E], which is very similar to what Hungarian speakers use. An interesting observation which may be related to the vowel length issue was made by Törkenczy (1989). The observation is that in loanwords which have complex onsets, gemination is almost completely predictable. It would also be worth exploring why this is the case. Apart from the perception of source-loan vowel length, consonant length is also worth examining. It would be interesting to see - similarly to the perception experiment discussed above - how native speakers of Hungarian perceive consonant length in contexts where loanword gemination applies. If they perceive English and German consonants shorter in monosyllables, preceded by short vowels, this could also be a primary motivation for gemination in loanwords. References D. Albro. Evaluation, implementation, and extension of primitive Optimality Theory. Master’s thesis, UCLA, Los Angeles, CA, 1998. D. Albro. Computational Optimality Theory and the phonological system of Malagasy. PhD dissertation, UCLA, Los Angeles, CA, 2005. A. Anttila. Variation in Finnish phonology and morphology. PhD thesis, Stanford University, 1997. D. Bates, M. Maechler, and B. Bolker. lme4: Linear mixed-effects models using S4 classes. http://CRAN.R-project.org/package=lme4. 2011. URL A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. A Maximum Entropy approach to natural language processing. Computational Linguistics 22:39-71, 1996. 38 Lilla Magyar Generals Paper MIT P. Boersma. How we learn variation, optionality, and probability. Proceedings of the Institute of Phonetic Sciences of the University of Amsterdam 21, 43-58, 1997. P. Boersma and D. Weenink. Praat, Version 5.3.26. 2012. 
URL www.praat.org. S. A. Della Pietra, V. J. Della Pietra, and J. D. Lafferty. Including features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19:380-393, 1997. J. Eisner. Efficient generation in primitive Optimality Theory. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 313-320, 1997. J. Eisner. Review of Kager: ”Optimality Theory”. Computational Linguistics, 26(2):286-290, 2000. J. Eisner. Expectational semirings: Flexible EM for finite-state transducers. In: G. van Noord (ed.), Proceedings of the ESSLLI Workshop on Finite-State Methods in NLP (FSMNLP), 2001. J. Eisner. Parameter estimation for probabilistic finite state transducers. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 1-8, 2002. T. M. Ellison. Phonological derivation in Optimality Theory. Proceedings of the Fifteenth International Conference on Computational Linguistics, 1007-1013, 1994. E. Farnetani and S. Kori. Effects of syllable and word structure on segmental durations in spoken Italian. Speech Communication, 5:17–34, 1986. F. Goldwater and M. Johnson. Learning OT Constraint Rankings Using a Maximum Entropy Grammar. In: J. Spenader, A. Eriksson and Ö. Dahl (eds.), Proceedings of the Stockholm Workshop on Variation within Optimality Theory, 111-120. Stockholm: Stockholm University, Department of Linguistics, 2003. P. Halácsy, A. Kornai, L. Németh, András Rung, I. Szakadát, and V. Trón. Creating open language resources for Hungarians. Proceedings of the 4th International Conference on Language Resources and evaluation, 2004. B. Hayes. Gradient well-formedness in Optimality Theory. In: J. Dekkers, F. van der Leeuw, and J. van de Weijer (eds.), Optimality Theory: Phonology, Syntax and Acquisition. Oxford University Press, Oxford., 2000. B. Hayes and C. Wilson. A Maximum Entropy Model of Phonotactics and Phonotactic Learning. Linguistic Inquiry, 2008. G. C. Jäger. 
Maximum entropy models and stochastic Optimality Theory. Rutgers Optimality Archive ROA-625, 2004.
F. Jelinek. Statistical methods for speech recognition. Cambridge, MA: MIT Press, 1999.
D. Karvonen. The Emergence of the Unmarked in Finnish loanword phonology. Paper presented at the 17th Manchester Phonology Meeting, 2009.
S. Kawahara. Sonorancy and geminacy. University of Massachusetts Occasional Papers in Linguistics 32: Papers in Optimality III, 2007.
F. Keller. Gradience in grammar: Experimental and computational aspects of degrees of grammaticality. PhD thesis, University of Edinburgh, 2000.
F. Keller. Optimality-Theoretic Lexical Functional Grammar. In: P. Merlo and S. Stevenson (eds.), The Lexical Basis of Sentence Processing: Formal, Computational and Experimental Issues, 59-74. John Benjamins, Amsterdam, The Netherlands, 2002.
F. Keller and A. Asudeh. Probabilistic learning algorithms and Optimality Theory. Linguistic Inquiry, 33(2):225-244, 2002.
Zs. Kertész. Approaches to the phonological analysis of loanword adaptation. The Even Yearbook 7, Department of English Linguistics, Eötvös Loránd University, Budapest, 2006.
D. Klein and C. Manning. Maxent models, conditional estimation, and optimization, without the magic. Tutorial presented at NAACL-03 and ACL-03, 2003.
B. Krishnamurti and J. P. L. Gwynn. A Grammar of Modern Telugu. Oxford University Press, 1985.
H. Kubozono, J. Ito, and A. Mester. Consonant gemination in Japanese loanword phonology: A phonological account. Proceedings of the 18th International Congress of Linguists, 2008.
L. Magyar. Hungarian Vacillating Stems: A Statistical and Optimality Theoretic Account. MA thesis, University of Pannonia, 2009.
L. Magyar. Learnability of word-final gemination in loan monosyllables. Squib for 24.981, 2014.
C. Manning and H. Schütze. Foundations of statistical natural language processing. Cambridge, MA: MIT Press, 1999.
Á. Nádasdy.
Consonant length in recent borrowings into Hungarian. Acta Linguistica Hungarica, 39, 1989.
N. Nagy and B. Reynolds. Optimality theory and variable word-final deletion in Faetar. Language Variation and Change, 9:37-55, 1997.
F. Papp. Reverse-Alphabetized Dictionary of the Hungarian Language. Akadémiai Kiadó, Budapest, 1969.
D. Passino. Adaptation of loanwords and licensing strategies in Italian. Paper presented at the 12th Manchester Phonology Meeting, 2004.
R. Podesva. Segmental constraints on geminates and their implications for typology. LSA Annual Meeting, 2002.
A. Prince and B. Tesar. Learning phonotactic distributions. Technical Report TR-54, Rutgers Center for Cognitive Science, Rutgers, ROA-353, 1999.
D. Pulleyblank and W. J. Turkel. Optimality Theory and learning algorithms: The representation of recurrent featural asymmetries. In: J. Durand and B. Laks (eds.), Current trends in phonology: Models and methods, pages 653-684, University of Salford, 1996.
R Core Team. R: A language and environment for statistical computing. 2013. URL http://www.R-project.org.
J. Riggle. Generation, recognition, and learning in finite state Optimality Theory. PhD dissertation, UCLA, Los Angeles, CA, 2004.
C. O. Ringen and O. Heinämäki. Variation in Finnish Vowel Harmony. Natural Language and Linguistic Theory 17:303-337, 1999.
R. Rosenfeld. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language 10:187-228, 1996.
C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal 27(3):379-423, 1948.
S. N. Sridhar. Kannada. New York: Routledge, 1990.
D. Steriade. Sources of markedness and why they matter. GLOW, Markedness Workshop, 2004.
B. Tesar and P. Smolensky. The learnability of Optimality Theory: An algorithm and some complexity results. Ms., Department of Computer Science and Institute of Cognitive Science, University of Colorado, Boulder. Rutgers Optimality Archive ROA-2, 1993.
M. Törkenczy.
Does the onset branch in Hungarian? Acta Linguistica Hungarica, 39, 1989.

Appendix I: Wug test items

The following nonce words were used as items in a forced-choice well-formedness judgement task in which participants were asked to choose the word that they found more acceptable as a Hungarian word or a hungarianised loanword. The items were presented in a randomised order, with either the singleton or the geminate form as the first option. Filler items were nonce word pairs in which one monosyllable ended in a long vowel followed by a short consonant and the other in a short vowel followed by a short consonant (such as nab [nOb] - náb [na:b]). Target items are listed here.

nab [nOb]      nabb [nOb:]
zuc [zuts]     zucc [zut:s]
cen [tsEn]     cenn [tsEn:]
lür [lyr]      lürr [lyr:]
toz [toz]      tozz [toz:]
dem [dEm]      demm [dEm:]
par [pOr]      parr [pOr:]
zot [zot]      zott [zot:]
fub [fub]      fubb [fub:]
zucs [zutS]    zuccs [zut:S]
nüd [nyd]      nüdd [nyd:]
tög [tøg]      tögg [tøg:]
nal [nOl]      nall [nOl:]
pöl [pøl]      pöll [pøl:]
nus [nuS]      nuss [nuS:]
füsz [fys]     füssz [fys:]
zik [zik]      zikk [zik:]
zosz [zos]     zossz [zos:]
küt [kyt]      kütt [kyt:]
nis [niS]      niss [niS:]
tid [tid]      tidd [tid:]
lüf [lyf]      lüff [lyf:]
dic [dits]     dicc [dit:s]
deg [dEg]      degg [dEg:]
zuk [zuk]      zukk [zuk:]
bif [bif]      biff [bif:]
cil [tsil]     cill [tsil:]
zor [zor]      zorr [zor:]
zib [zib]      zibb [zib:]
dum [dum]      dumm [dum:]
det [dEt]      dett [dEt:]
nuz [nuz]      nuzz [nuz:]
gak [gOk]      gakk [gOk:]
szöf [søf]     szöff [søf:]
decs [dEtS]    deccs [dEt:S]
tac [tOts]     tacc [tOt:s]
zocs [zotS]    zoccs [zot:S]
zom [zom]      zomm [zom:]
zop [zop]      zopp [zop:]
pesz [pEs]     pessz [pEs:]
cez [tsEz]     cezz [tsEz:]
död [død]      dödd [død:]
nad [nOd]      nadd [nOd:]
nül [nyl]      nüll [nyl:]
peb [pEb]      pebb [pEb:]
mec [mEts]     mecc [mEt:s]
nacs [nOtS]    naccs [nOt:S]
naf [nOf]      naff [nOf:]
lig [lig]      ligg [lig:]
dök [døk]      dökk [døk:]
nüm [nym]      nümm [nym:]
masz [mOs]     massz [mOs:]
kob [kob]      kobb [kob:]
zöb [zøb]      zöbb [zøb:]
lüb [lyb]      lübb [lyb:]
küz [kyz]      küzz [kyz:]
döz [døz]      dözz [døz:]
döt [døt]      dött [døt:]
noc [nots]     nocc [not:s]
niz [niz]      nizz [niz:]
föc [føts]     föcc [føt:s]
püc [pyts]     pücc [pyt:s]
lics [litS]    liccs [lit:S]
döcs [døtS]    döccs [døt:S]
saz [SOz]      sazz [SOz:]
nut [nut]      nutt [nut:]
dösz [døs]     dössz [døs:]
nücs [nytS]    nüccs [nyt:S]
zed [zEd]      zedd [zEd:]
zod [zod]      zodd [zod:]
lef [lEf]      leff [lEf:]
zud [zud]      zudd [zud:]
zof [zof]      zoff [zof:]
nag [nOg]      nagg [nOg:]
zuf [zuf]      zuff [zuf:]
zog [zog]      zogg [zog:]
fek [fEk]      fekk [fEk:]
nüg [nyg]      nügg [nyg:]
mok [mok]      mokk [mok:]
tug [tug]      tugg [tug:]
nük [nyk]      nükk [nyk:]
pel [pEl]      pell [pEl:]
nam [nOm]      namm [nOm:]
zol [zol]      zoll [zol:]
lim [lim]      limm [lim:]
zul [zul]      zull [zul:]
döm [døm]      dömm [døm:]
san [SOn]      sann [SOn:]
cit [tsit]     citt [tsit:]
cip [tsip]     cipp [tsip:]
cer [tsEr]     cerr [tsEr:]
döp [døp]      döpp [døp:]
ces [tsES]     cess [tsES:]
nup [nup]      nupp [nup:]
nör [nør]      nörr [nør:]
küp [kyp]      küpp [kyp:]
sat [SOt]      satt [SOt:]
cir [tsir]     cirr [tsir:]
nusz [nus]     nussz [nus:]
pur [pur]      purr [pur:]
tisz [tis]     tissz [tis:]
pas [pOS]      pass [pOS:]
zos [zoS]      zoss [zoS:]
müs [myS]      müss [myS:]
dös [døS]      döss [døS:]

Appendix II: Production experiment items

List 1

Szerintem a csiga elég érdekes állat.
‘In my opinion, the snail is a rather interesting animal.’
Szerintem a ‘halad’ elég érdekes szó.
‘In my opinion, ‘to make headway’ is a rather interesting expression.’
Szerintem a rab elég veszélyes.
‘In my opinion, the inmate is rather dangerous.’
Szerintem az ‘ánizsos’ elég érdekesen hangzik.
‘In my opinion, ‘anise-flavoured’ sounds rather interesting.’
Szerintem a mez elég nagy rád.
‘In my opinion, this jersey is rather oversized for you.’
Szerintem a ‘szeretek’ elég érdekes szóalak.
‘In my opinion, ‘I love’ is a rather interesting expression.’
Szerintem a konyakok elég drágák.
‘In my opinion, brandies are rather expensive.’
Szerintem a hal elég drága.
‘In my opinion, fish is rather expensive.’
Szerintem a vogul elég érdekes nyelv.
‘In my opinion, Mansi is a rather interesting language.’
Szerintem a ‘döf’ elég furcsa szó.
‘In my opinion, ‘to stab’ is a rather weird word.’
Szerintem a fababa elég jó ajándék.
‘In my opinion, a wooden doll is a rather nice present.’
Szerintem a malacok elég sokat esznek.
‘In my opinion, pigs eat quite a lot.’
Szerintem Alap elég nagy község.
‘In my opinion, Alap is a rather big village.’
Szerintem az adoma elég hihetetlen.
‘In my opinion, that urban legend is rather hard to believe.’
Szerintem a kas elég nagy.
‘In my opinion, the beehive is rather large.’
Szerintem a rizs elég finom.
‘In my opinion, rice is quite tasty.’
Szerintem az ‘alapos’ elég érdekes szó.
‘In my opinion, ‘meticulous’ is a rather interesting word.’
Szerintem a doh elég elviselhetetlen.
‘In my opinion, a musty smell is rather hard to tolerate.’
Szerintem az arab elég nehéz nyelv.
‘In my opinion, Arabic is a rather difficult language.’
Szerintem a ‘szavadat’ elég érdekes szóalak.
‘In my opinion, ‘your word-ACC’ is a rather interesting phrase.’
Szerintem a pirogok elég finomak lettek.
‘In my opinion, the pirogs have turned out quite well.’
Szerintem a kanalak elég kicsik.
‘In my opinion, the spoons are quite small.’
Szerintem a ‘komoly’ elég furcsa szó.
‘In my opinion, ‘serious’ is a rather strange word.’
Szerintem a ‘Kemenes’ elég érdekes név.
‘In my opinion, ‘Kemenes’ is a rather interesting name.’
Szerintem a kakasok elég vadak.
‘In my opinion, roosters are rather wild.’
Szerintem a ‘bekever’ elég érdekes szó.
‘In my opinion, ‘to mix’ is a rather interesting word.’
Szerintem az orosz elég nehéz nyelv.
‘In my opinion, Russian is a rather difficult language.’
Szerintem a ‘ledöf’ elég kifejező szó.
‘In my opinion, ‘to stab someone to death’ is a rather expressive phrase.’
Szerintem a zsizsik elég kicsi bogár.
‘In my opinion, the wheat weevil is a rather small bug.’
Szerintem a rom elég rossz látvány.
‘In my opinion, the ruin is a rather bad sight.’
Szerintem a ‘hadar’ elég vicces szó.
‘In my opinion, ‘to sputter’ is a rather funny word.’
Szerintem a kukac elég jó csali.
‘In my opinion, the worm is rather good fishing bait.’
Szerintem a potroh elég jellegzetes testrész.
‘In my opinion, the abdomen of insects is a rather prominent body part.’
Szerintem a kacs elég érdekesen néz ki.
‘In my opinion, the tendril looks rather interesting.’
Szerintem a ‘röfög’ elég vicces szó.
‘In my opinion, ‘grunt’ is a rather funny word.’
Szerintem a kabar elég ismert nép.
‘In my opinion, Kabar is quite a famous nation.’
Szerintem a ‘vakar’ elég furcsa szó.
‘In my opinion, ‘to scratch’ is a rather strange word.’
Szerintem a kar elég hosszú.
‘In my opinion, the handle is rather long.’
Szerintem a ‘kapar’ elég furcsa szó.
‘In my opinion, ‘to scratch’ is a rather strange word.’
Szerintem az oroszok elég sokan vannak.
‘In my opinion, there are quite a lot of Russians.’
Szerintem a ‘hamar’ elég érdekes szó.
‘In my opinion, ‘soon’ is a rather interesting word.’
Szerintem az ezer elég sok.
‘In my opinion, a thousand is quite a huge amount.’
Szerintem Lev elég híres író.
‘In my opinion, Lev is quite a famous writer.’
Szerintem a retek elég jó vacsorára.
‘In my opinion, radish is quite good for dinner.’
Szerintem a lemezek elég régiek.
‘In my opinion, the records are quite old.’
Szerintem a ‘facsar’ elég érdekes szó.
‘In my opinion, ‘to extract juice’ is a rather interesting phrase.’
Szerintem a dac elég furcsa reakció.
‘In my opinion, defiance is a rather strange reaction.’
Szerintem a ‘makacs’ elég vicces szó.
‘In my opinion, ‘stubborn’ is a rather funny word.’
Szerintem a fog elég hamar kihullott.
‘In my opinion, the tooth fell out quite soon.’
Szerintem a ‘hasal’ elég érdekes szó.
‘In my opinion, ‘to lie on your stomach’ is a rather interesting phrase.’
Szerintem a nyak elég kényes testrész.
‘In my opinion, the neck is a rather sensitive part of the body.’
Szerintem a ‘ken’ elég rövid szó.
‘In my opinion, ‘to smear’ is a rather short word.’
Szerintem Jemen elég érdekes hely.
‘In my opinion, Yemen is a rather interesting place.’
Szerintem a pap elég komoly ember.
‘In my opinion, the priest is a rather serious man.’
Szerintem a ‘leböfög’ elég vicces szó.
‘In my opinion, ‘to burp at someone’ is a rather funny word.’
Szerintem a kakas elég nagy.
‘In my opinion, the rooster is quite big.’
Szerintem a ‘marad’ elég érdekes szó.
‘In my opinion, ‘to stay’ is a rather interesting word.’
Szerintem a kosz elég kiábrándító.
‘In my opinion, the filth is rather disappointing.’
Szerintem a lemez elég drága.
‘In my opinion, the record is rather expensive.’
Szerintem Kijev elég nagy város.
‘In my opinion, Kiev is quite a big city.’
Szerintem a ‘vet’ elég rövid szó.
‘In my opinion, ‘to sow’ is a rather short word.’
Szerintem a haszon elég fontos.
‘In my opinion, profit is rather important.’
Szerintem a ‘szeret’ elég szép szó.
‘In my opinion, ‘to love’ is a rather beautiful word.’
Szerintem a fadarab elég nagy.
‘In my opinion, that piece of wood is rather big.’
Szerintem a Szenes elég udvariatlan.
‘In my opinion, Szenes is rather impolite.’
Szerintem a ‘kacag’ elég vicces szó.
‘In my opinion, ‘to laugh’ is a rather funny word.’
Szerintem a konyak elég drága.
‘In my opinion, brandy is rather expensive.’
Szerintem a ‘nyafog’ elég vicces szó.
‘In my opinion, ‘to whine’ is a rather funny word.’
Szerintem Párizs elég nagy város.
‘In my opinion, Paris is quite a big city.’
Szerintem a karom elég vastag.
‘In my opinion, my arm is rather thick.’
Szerintem a ‘dohog’ elég furcsa szó.
‘In my opinion, ‘to mumble in a grumpy way’ is a strange phrase.’
Szerintem a vad elég nagy.
‘In my opinion, that wild animal is quite big.’
Szerintem a szavad elég biztosíték.
‘In my opinion, your word is enough guarantee.’
Szerintem az ‘evez’ elég gyakori szó.
‘In my opinion, ‘to row’ is a rather frequent word.’
Szerintem a potrohok elég kicsik.
‘In my opinion, the abdomens of insects are rather small.’
Szerintem a ‘kifacsar’ elég furcsa szó.
‘In my opinion, ‘to extract juice fully’ is a rather strange word.’
Szerintem a sör elég népszerű ital.
‘In my opinion, beer is a rather popular drink.’

List 2

Szerintem a sör elég népszerű ital.
‘In my opinion, beer is a rather popular drink.’
Szerintem a ‘kifacsar’ elég furcsa szó.
‘In my opinion, ‘to extract juice fully’ is a rather strange word.’
Szerintem a potrohok elég kicsik.
‘In my opinion, the abdomens of insects are rather small.’
Szerintem az ‘evez’ elég gyakori szó.
‘In my opinion, ‘to row’ is a rather frequent word.’
Szerintem a szavad elég biztosíték.
‘In my opinion, your word is enough guarantee.’
Szerintem a vad elég nagy.
‘In my opinion, that wild animal is quite big.’
Szerintem a ‘dohog’ elég furcsa szó.
‘In my opinion, ‘to mumble in a grumpy way’ is a strange phrase.’
Szerintem a karom elég vastag.
‘In my opinion, my arm is rather thick.’
Szerintem Párizs elég nagy város.
‘In my opinion, Paris is quite a big city.’
Szerintem a ‘nyafog’ elég vicces szó.
‘In my opinion, ‘to whine’ is a rather funny word.’
Szerintem a konyak elég drága.
‘In my opinion, brandy is rather expensive.’
Szerintem a ‘kacag’ elég vicces szó.
‘In my opinion, ‘to laugh’ is a rather funny word.’
Szerintem a Szenes elég udvariatlan.
‘In my opinion, Szenes is rather impolite.’
Szerintem a fadarab elég nagy.
‘In my opinion, that piece of wood is rather big.’
Szerintem a ‘szeret’ elég szép szó.
‘In my opinion, ‘to love’ is a rather beautiful word.’
Szerintem a haszon elég fontos.
‘In my opinion, profit is rather important.’
Szerintem a ‘vet’ elég rövid szó.
‘In my opinion, ‘to sow’ is a rather short word.’
Szerintem Kijev elég nagy város.
‘In my opinion, Kiev is quite a big city.’
Szerintem a lemez elég drága.
‘In my opinion, the record is rather expensive.’
Szerintem a kosz elég kiábrándító.
‘In my opinion, the filth is rather disappointing.’
Szerintem a ‘marad’ elég érdekes szó.
‘In my opinion, ‘to stay’ is a rather interesting word.’
Szerintem a kakas elég nagy.
‘In my opinion, the rooster is quite big.’
Szerintem a ‘leböfög’ elég vicces szó.
‘In my opinion, ‘to burp at someone’ is a rather funny word.’
Szerintem a pap elég komoly ember.
‘In my opinion, the priest is a rather serious man.’
Szerintem Jemen elég érdekes hely.
‘In my opinion, Yemen is a rather interesting place.’
Szerintem a ‘ken’ elég rövid szó.
‘In my opinion, ‘to smear’ is a rather short word.’
Szerintem a nyak elég kényes testrész.
‘In my opinion, the neck is a rather sensitive part of the body.’
Szerintem a ‘hasal’ elég érdekes szó.
‘In my opinion, ‘to lie on your stomach’ is a rather interesting phrase.’
Szerintem a fog elég hamar kihullott.
‘In my opinion, the tooth fell out quite soon.’
Szerintem a ‘makacs’ elég vicces szó.
‘In my opinion, ‘stubborn’ is a rather funny word.’
Szerintem a dac elég furcsa reakció.
‘In my opinion, defiance is a rather strange reaction.’
Szerintem a ‘facsar’ elég érdekes szó.
‘In my opinion, ‘to extract juice’ is a rather interesting phrase.’
Szerintem a lemezek elég régiek.
‘In my opinion, the records are quite old.’
Szerintem a retek elég jó vacsorára.
‘In my opinion, radish is quite good for dinner.’
Szerintem Lev elég híres író.
‘In my opinion, Lev is quite a famous writer.’
Szerintem az ezer elég sok.
‘In my opinion, a thousand is quite a huge amount.’
Szerintem a ‘hamar’ elég érdekes szó.
‘In my opinion, ‘soon’ is a rather interesting word.’
Szerintem az oroszok elég sokan vannak.
‘In my opinion, there are quite a lot of Russians.’
Szerintem a ‘kapar’ elég furcsa szó.
‘In my opinion, ‘to scratch’ is a rather strange word.’
Szerintem a kar elég hosszú.
‘In my opinion, the handle is rather long.’
Szerintem a ‘vakar’ elég furcsa szó.
‘In my opinion, ‘to scratch’ is a rather strange word.’
Szerintem a kabar elég ismert nép.
‘In my opinion, Kabar is quite a famous nation.’
Szerintem a ‘röfög’ elég vicces szó.
‘In my opinion, ‘grunt’ is a rather funny word.’
Szerintem a kacs elég érdekesen néz ki.
‘In my opinion, the tendril looks rather interesting.’
Szerintem a potroh elég jellegzetes testrész.
‘In my opinion, the abdomen of insects is a rather prominent body part.’
Szerintem a kukac elég jó csali.
‘In my opinion, the worm is rather good fishing bait.’
Szerintem a ‘hadar’ elég vicces szó.
‘In my opinion, ‘to sputter’ is a rather funny word.’
Szerintem a rom elég rossz látvány.
‘In my opinion, the ruin is a rather bad sight.’
Szerintem a zsizsik elég kicsi bogár.
‘In my opinion, the wheat weevil is a rather small bug.’
Szerintem a ‘ledöf’ elég kifejező szó.
‘In my opinion, ‘to stab someone to death’ is a rather expressive phrase.’
Szerintem az orosz elég nehéz nyelv.
‘In my opinion, Russian is a rather difficult language.’
Szerintem a ‘bekever’ elég érdekes szó.
‘In my opinion, ‘to mix’ is a rather interesting word.’
Szerintem a kakasok elég vadak.
‘In my opinion, roosters are rather wild.’
Szerintem a ‘Kemenes’ elég érdekes név.
‘In my opinion, ‘Kemenes’ is a rather interesting name.’
Szerintem a ‘komoly’ elég furcsa szó.
‘In my opinion, ‘serious’ is a rather strange word.’
Szerintem a kanalak elég kicsik.
‘In my opinion, the spoons are quite small.’
Szerintem a pirogok elég finomak lettek.
‘In my opinion, the pirogs have turned out quite well.’
Szerintem a ‘szavadat’ elég érdekes szóalak.
‘In my opinion, ‘your word-ACC’ is a rather interesting phrase.’
Szerintem az arab elég nehéz nyelv.
‘In my opinion, Arabic is a rather difficult language.’
Szerintem a doh elég elviselhetetlen.
‘In my opinion, a musty smell is rather hard to tolerate.’
Szerintem az ‘alapos’ elég érdekes szó.
‘In my opinion, ‘meticulous’ is a rather interesting word.’
Szerintem a rizs elég finom.
‘In my opinion, rice is quite tasty.’
Szerintem a kas elég nagy.
‘In my opinion, the beehive is rather large.’
Szerintem az adoma elég hihetetlen.
‘In my opinion, that urban legend is rather hard to believe.’
Szerintem Alap elég nagy község.
‘In my opinion, Alap is a rather big village.’
Szerintem a malacok elég sokat esznek.
‘In my opinion, pigs eat quite a lot.’
Szerintem a fababa elég jó ajándék.
‘In my opinion, a wooden doll is a rather nice present.’
Szerintem a ‘döf’ elég furcsa szó.
‘In my opinion, ‘to stab’ is a rather weird word.’
Szerintem a vogul elég érdekes nyelv.
‘In my opinion, Mansi is a rather interesting language.’
Szerintem a hal elég drága.
‘In my opinion, fish is rather expensive.’
Szerintem a konyakok elég drágák.
‘In my opinion, brandies are rather expensive.’
Szerintem a ‘szeretek’ elég érdekes szóalak.
‘In my opinion, ‘I love’ is a rather interesting expression.’
Szerintem a mez elég nagy rád.
‘In my opinion, this jersey is rather oversized for you.’
Szerintem az ‘ánizsos’ elég érdekesen hangzik.
‘In my opinion, ‘anise-flavoured’ sounds rather interesting.’
Szerintem a rab elég veszélyes.
‘In my opinion, the inmate is rather dangerous.’
Szerintem a ‘halad’ elég érdekes szó.
‘In my opinion, ‘to make headway’ is a rather interesting expression.’
Szerintem a csiga elég érdekes állat.
‘In my opinion, the snail is a rather interesting animal.’

Appendix III: Perception experiment items

III/A: Hungarian-English word pairs

The following Hungarian and English minimal and quasi-minimal pairs were used in the perception experiment.

Hungarian: mit [mit] ‘what-ACC’    English: hit [hIt]
Hungarian: fut [fut] ‘he/she runs’    English: foot [fUt]
Hungarian: hat [hOt] ‘six’    English: but [b2t]
Hungarian: vet [vEt] ‘sow-3rd. p.
sing.’    English: set [sEt] or [set]
Hungarian: szid [sid] ‘he/she scolds (someone)’    English: hid [hId]
Hungarian: tud [tud] ‘he/she knows’    English: good [gUd]
Hungarian: had [hOd] ‘warfare’    English: bud [b2d]
Hungarian: szed [sEd] ‘he/she collects’    English: said [sEd] or [sed]

III/B: Hungarian-German word pairs

The following Hungarian and German minimal and quasi-minimal pairs were used in the experiment.

Hungarian: kis [kiS] ‘small’    German: Tisch [tIS] ‘table’
Hungarian: mos [moS] ‘he/she washes’    German: Bosch [boS] <brand name>
Hungarian: vas [vOS] ‘iron’    German: wasch [waS] ‘wash-imperative’
Hungarian: tus [tuS] ‘shower or ink’    German: Fusch [fuS] <a village in Austria>
Hungarian: köt [køt] ‘he/she knits’    German: Schött [Sœt] <a name>
Hungarian: les [lES] ‘he/she looks furtively’    German: fesch [fES] ‘handsome’

III/C: Hungarian word pairs

The following Hungarian word pairs were used in the experiment.

Singleton: hit [hit] ‘faith’    Geminate: hitt [hit:] ‘believe-3rd.p.-past’
Singleton: vet [vEt] ‘sow’    Geminate: vett [vEt:] ‘buy-3rd.p.-past’
Singleton: lap [lOp] ‘piece of paper’    Geminate: lapp [lOp:] ‘Sami’
Singleton: lop [lop] ‘steal’    Geminate: hopp [hop:] ‘oops’
Singleton: luk [luk] ‘hole’    Geminate: pukk [puk:] <bursting sound>

Appendix IV: List of monosyllabic loanwords with consonants subject to gemination

Loanword            Gloss                 Orth. gem. in SW  Geminate in LW      Singleton in LW
hall [hOl:]         ‘hall’                yes               yes                 no
blöff [bløf:]       ‘bluff’               yes               yes                 no
nett [nEt:]         ‘neat and tidy’       yes               yes                 no
fitt [fit:]         ‘fit’                 no                yes                 no
szett [sEt:]        ‘set or outfit’       no                yes                 no
csekk [tSEk:]       ‘cheque’              no                yes                 no
klip [klip:]        ‘videoclip’           no                yes                 no
friss [friS:]       ‘fresh’               no                yes                 no
fess [fES:]         ‘handsome’            no                yes                 no
giccs [git:S]       ‘kitsch’              no                yes                 no
klub % [klub:]      ‘club’                no                yes (more common)   yes (less common)
dzsem % [dZEm:]     ‘marmalade’           no                yes                 yes
chip [tSip:]        ‘electronic chip’     no                yes                 no
vicc [vit:s]        ‘joke’                no                yes                 no
sokk [sok:]         ‘shock’               no                yes                 no
puccs [put:S]       ‘coup’                no                yes                 no
sikk [Sik:]         ‘stylishness’         no                yes                 no
drukk [druk:]       ‘anxiety’             no                yes                 no
chat [tSEt]         ‘online chat’         no                no                  yes
net % [nEt:]        ‘internet’            no                yes (less common)   yes (more common)
bit % [bit:]        ‘bit’ (IT)            no                yes (less common)   yes (more common)
flott [flot:]       ‘fast and easy’       yes               yes                 no
spicc [Spit:s]      ‘Spitz’               no                yes                 no
trükk [tryk:]       ‘trick’               no                yes                 no
rock [rok:]         ‘rock music’          no                yes                 no
pop % [pop:]        ‘pop music’           no                yes (less common)   yes (more common)
smukk [Smuk:]       ‘jewellery’           no                yes                 no
dekk [dEk:]         ‘cigarette’           no                yes                 no
dokk [dok:]         ‘dock’                no                yes                 no
pucc [put:s]        ‘poshness’            no                yes                 no
sacc [SOt:s]        ‘guess’               no                yes                 no
hecc [hEt:s]        ‘hoax’                no                yes                 no
Schütz [Syt:s]      <name>                no                yes                 no
fuccs [fut:S]       ‘waste’               no                yes                 no
Bach [bOx:]         <name>                no                yes                 no
pech [pEx:]         ‘bad luck’            no                yes                 no
sah [sOx:]          ‘shah’                no                yes                 no
trapp [trOp:]       ‘trap’                no                yes                 no
nipp [nip:]         ‘figurine’            no                yes                 no
nopp [nop:]         ‘knot’                yes               yes                 no
kepp [kEp:]         ‘hooded gown’         yes               yes                 no
shop % [Sop:]       ‘shop’                no                yes (less common)   yes (more common)
blikk [blik:]       ‘wink’                no                yes                 no
blokk [blok:]       ‘block’               no                yes                 no
stop % [stop:]      ‘stop sign’           no                yes                 yes
meccs [mEt:S]       ‘match’               no                yes                 no
taccs [tOt:S]       ‘touch’               no                yes                 no
dog [dog]           ‘dog’                 no                no                  yes
szmog [smog]        ‘smog’                no                no                  yes
blog [blog]         ‘blog’                no                no                  yes
HIV [hiv]           ‘HIV’                 no                no                  yes
Liv [liv]           <name>                no                no                  yes
bob [bob]           ‘bobsleigh’           no                no                  yes
sznob [snob]        ‘snob’                no                no                  yes
dub [dOb]           ‘dub(step)’           no                no                  yes
pub [pOb]           ‘pub’                 no                no                  yes
Ted [tEd]           <name>                no                no                  yes
Bud [bOd]           <name>                no                no                  yes
Hal [hEl]           <name>                no                no                  yes
tipp [tip:]         ‘tip’                 no                yes                 no
stramm [StrOm:]     ‘tough’               yes               yes                 no
sift [sit:]         ‘debris’              yes               yes                 no
pack [pOk:]         ‘package’             no                yes                 no
kuss [kuS:]         ‘shut up’             no                yes                 no
sztepp [stEp:]      ‘step dance’          no                yes                 no
kit % [kit:]        ‘kit’                 no                yes (less common)   yes (more common)
Pat [pEt]           <name>                no                no                  yes
brit % [brit:]      ‘Brit’                no                yes                 yes
szvit % [svit:]     ‘suite’               no                yes                 yes
asz % [as:]         <musical tone>        no                yes                 yes
pikk [pik:]         ‘spades’              no                yes                 no
skicc [Skit:s]      ‘sketch’              no                yes                 no
treff [trEf:]       ‘clubs’               yes               yes                 no
procc [prot:s]      ‘snobbish’            no                yes                 no
snassz [SnOs:]      ‘mediocre’            yes               yes                 no
klassz [klOs:]      ‘great’               yes               yes                 no
bessz [bEs:]        ‘baisse’              yes               yes                 no
priccs [prit:S]     ‘iron bed’            no                yes                 no
slussz [Slus:]      ‘end’                 yes               yes                 no
plussz [plus:]      ‘plus’                no                yes                 no
tus % [tus:]        ‘ink’                 no                yes                 yes
plüss [plus:]       ‘plush’               no                yes                 no
krach [krOx:]       ‘financial crash’     no                yes                 no
stich [Stix:]       ‘something fishy’     no                yes                 no
Mann [mOn:]         <name>                yes               yes                 no
finn [fin:]         ‘Finnish’             yes               yes                 no
gramm [grOm:]       ‘gram’                no                yes                 no
dzsinn [dZin:]      ‘genie’               no                yes                 no
tüll [tyl:]         ‘tulle’               yes               yes                 no
brill [bril:]       ‘diamond’             yes               yes                 no
Fred [frEd]         <name>                no                no                  yes
LED [lEd]           ‘LED’                 no                no                  yes
Elle [El:]          <a magazine>          yes               yes                 no
gif % [gif]         ‘gif’                 no                yes                 yes
klikk [klik:]       ‘click or clique’     no                yes                 no
ceh [tsex:]         ‘bill’                no                yes                 no
skeccs [skEt:S]     ‘sketch’              no                yes                 no
top % [top:]        ‘top’                 no                yes                 no
web % [vEb]         ‘web’                 no                no                  yes
Webb [vEb:]         <name>                yes               yes                 no
krossz [kros:]      ‘cross’               yes               yes                 no
dressz [drEs:]      ‘dress’               yes               yes                 no
Liz [liz]           <name>                no                no                  yes
Dell [dEl:]         <computer>            yes               yes                 no
Mac [mEk]           <computer>            no                no                  yes
Scholl [Sol:]       <shoes>               yes               yes                 no