Università degli Studi di Salerno
Faculty: Foreign Languages and Literatures
Degree course: Foreign Languages and Cultures
Thesis in: Applied Linguistics

Dividing CLIPS' Phonemic Layer into Syllables
An SPP Based Syllabification Program with Python/NLTK

Supervisor: Prof. Renata Savy
Co-supervisor: Dr. Marina Lops
Candidate: Luca Iacoponi
Student ID: 4310200182

To Andrea, my memory; to Marina, my present

Table of Contents

ABSTRACT
ACKNOWLEDGEMENTS
1 SYLLABLE AND SYLLABIFICATION
  1 Syllable
    1. Syllable Structure
    2. Syllable Weight
  2 Syllabification
    1. Orthographic Syllabification
    2. Sonority Scale
    3. Sonority Distance
    4. Phonotactical Constraints
    5. Internal Evidence
    6. External Evidence
    7. Comparison of Principles
    8. Conclusion
  3 From SPE to Optimality
    1. SPE Rules
    2. The Syllable in SPE
    3. Autosegmental Theory
    4. Autosegmental Syllabification
    5. Metrical Phonology
    6. Foot, mora and P-Word
    7. Optimality Basic Principles
    8. Optimality Procedure
    9. Optimality Formalisation
    10. Syllabification in OT
  4 Conclusion
    1. Definitions of syllable
    2. Which Syllabification?
2 AUTOMATIC SYLLABIFICATION
  1 Input, Model and Purposes
    1. Written or Spoken Language
    2. Transcriptions
    3. Software Purposes
    4. Epistemology
    5. Data Driven - Rule Based
  2 Data Driven Models
    1. Artificial Neural Networks
    2. Calderone's ANN
    3. Look-up Procedure
  3 Rule based Models
    1. Computational OT
    2. Hammond's Algorithms
    3. Other OT Implementations
    4. Cutugno et al. (2001)
  4 Conclusion
3 CLIPS
  1 Transcription
    1. Transcription Principles
    2. Annotated Transcription
    3. Transcription Procedure
    4. Labelling
    5. Phonological Layer
  2 Diatopic, Diamesic and Diaphasic Variation
    1. Northern Italy Cities
    2. Dialogic
    3. Read Speech
    4. Radio and TV
    5. Telephonic
    6. Orthophonic
    7. Corpus structure
4 SYLLABIFICATION PROGRAM
  1 Python and NLTK
    1. Python
    2. NLTK
  2 Implementation
    1. Syllabification
    2. CLIPS's STD
    3. Core SY
    4. Phonological Syllabification
  3 Final Developing
    1. Corpus Reader
    2. SY and NLTK
    3. NLTK and SY
    4. Further studies
5 CONCLUSION
APPENDIX A: SONORITY SCALE
APPENDIX B: SAMPLE SYLLABIFICATION OUTPUT
APPENDIX C: PHONOLOGICAL SYLLABIFICATION
BIBLIOGRAPHY

Illustration Index

Tree representation of a CVC syllable
Tree representation of the Italian syllable structure
Syllable weight representation in moraic theory
Sonority representation of the word 'candle'
Sonority representation of the word 'gatto'
Fake nasal assimilation rule
SPE rule for French [e] and [ɛ] alternation
SPE rule for French [e] and [ɛ] alternation including the syllable
1 to 1 correspondence between melodic and skeletal tier
Example of a 2 to 1 correspondence between melodic and skeletal tier
Example of 1 to 2 correspondence between melodic and skeletal tier
N-Placement
CV Rule
Onset Rule
Coda Rule
Autosegmental syllabification step for the word 'pastrocchio'
Metrical tree for the word 'compitare'
Phonological hierarchy
Simple Artificial Neural Network unit
Artificial Neural Network unit
Artificial Neural Network with three hidden layers
Feedforward Neural Network
Phonotactic and syllabic window
Attraction values for the word 'sillaba'
Attraction values for the word 'pasta'
Hammond's candidate encoding for the word 'apa'
Hammond's second algorithm rule formalisation
DG utterance filename example
Word sì 'yes' labelling on WaveSurfer
Syllable Cumulative Frequency Distribution Plot

Index of Tables

Rhyme, assonance, consonance and alliteration
Sonority Hierarchy
Coursil's Sonority Scale
Syllabification of the French word 'moustique' according to Coursil (1992)
Davis (1990) Sonority Scale for Italian
A comparison of possible CC cluster division strategies
Autosegmental Syllabification Algorithm for Italian
Hypothetical language 1 tableau
Hypothetical language 2 tableau
Hypothetical language 2 tableau
Tableau for the syllabification of 'pasta'
Tableau for the syllabification of the word 'studente'
Tableau for the syllabification of the word 'klok'
Corpora based studies until 1991
Important differences between rationalism and empiricism
Rule based and data driven models
Number of candidates if epenthesis and deletion are considered by Gen
Example of an unparsed Hammond's tableau
Number of evaluations for a 10X5 tableau
Number of evaluations reduction using fatal violations
CLIPS corpus summary (Savy and Cutugno 2009)
Semi-lexical phenomena
Non lexical phenomena
Interjections
Non verbal and non lexical phenomena
Operator comments
Transcript units
Transcript and labelled transcript unit
SAMPA vowel set for CLIPS
SAMPA consonant set for CLIPS
Transcript symbols used in STD
Final location sites with codes
Italian networks audience sharing
Minutes of recording distribution on RD and TV

ABSTRACT

The syllable is among the most controversial phonological units in modern linguistics. Almost ignored by classical generative phonology, it took on decisive importance in autosegmental phonological theory and its later developments (metrical and prosodic phonology, Government Phonology, etc.).
In parallel, in engineering, the syllable has attracted considerable interest since the 1990s, when several studies showed that, at the psycholinguistic and acoustic-phonetic levels, the syllable is an important sub-lexical unit for lexical access and for the segmentation of the speech continuum.

While there is a fair degree of agreement on the definition of syllable structure, the most controversial issue in linguistics concerns identifying the principles that determine where syllable boundaries fall. Theories and principles overlap in a babel in which the ambiguity of the empirical analyses does not allow any of the proposed hypotheses to be endorsed with confidence. The syllable can be defined in terms of phonotactic preferences, on the basis of the intrinsic sonority of the phonemes that compose it, or according to distributional and statistical criteria, each definition implying a different kind of syllabification algorithm, technique or principle. Whether a universal principle and a phonologically determined syllabification exist cannot at present be stated with certainty. A syllabification program such as the one developed in this thesis nevertheless takes the issue into account, starting from the assumption that only some syllabifications are certain, while the others are possible, uncertain, improbable or impossible depending on how far the various principles diverge from a single solution. The final choices, both linguistic and computational, are therefore dictated above all by the ultimate goal of the thesis: the creation of a program for the phonological syllabification of a signal-aligned speech corpus.
Chapter 1 outlines the status of the syllable in the development of modern phonological theories. It first describes the syllable in accordance with the non-linear phonological theories that assign it an internal structure. The second section presents, from a mainly historical standpoint, the approach to the problem from different perspectives, from classical generative phonology to Optimality Theory, showing how the syllable has developed from a simple phonemic feature into a fundamental phonological unit in many contemporary phonological theories. In the third section, different representations and theoretical assumptions lead to an outline of some syllabification methodologies and principles, which seem to confirm the assumption that a deterministically defined core of phonological processes is opposed to a periphery in which the application of phonological phenomena is vaguer and more uncertain.

In Chapter 2, a brief epistemological introduction lays the groundwork for the description of two computational models: a symbolic one, defined by the computational encoding of phonological rules, and a sub-symbolic one, based instead on extracting regularities and mainly phonotactic structures from corpora. The principles and theories presented in the first chapter are related to the computational models analysed, leading to a critical analysis of the compatibility and consistency of the computational models with the phonological theories.

Having reviewed the state of the art in both linguistics and computer science, Chapter 3 describes the CLIPS corpus. The topic deserves a chapter of its own because the principles and aims of the corpus itself, and hence of the data to be analysed, determine the choice of the syllabification principles adopted during the design of the program.
The object of syllabification is the phonemic layer of the corpus, which is time-aligned to the signal. Moreover, as stated in the corpus presentation document, one of the goals of a speech corpus such as CLIPS is “the provision of application tools that serve as a basis for building speech recognition systems and systems producing good-quality synthetic speech.” A semi-acoustic kind of syllabification was therefore preferred. The sonority principle was found to be the only one, among those analysed in the literature, that is reflected in the signal, particularly in terms of energy. Strict application of the principle to a phonemic sequence nevertheless required solving a number of syllabification problems, some widely treated in the literature (sC clusters, geminates), others less discussed (resyllabification, vocoid sequences).

The choices were guided by adherence to fundamental linguistic requirements, reflected in the adoption of a principle widely supported in phonology, and by the purpose of the program, namely the syllabification of a signal-aligned speech corpus. The simplest and most elegant solution was to apply the sonority principle without exceptions and to resort instead to the assumption, widely accepted in the literature, that the sonority scale admits language-specific variation. It was then found that, by changing the sonority values of the phonemes /s/ and /r/, the resulting syllabifications were highly consistent and that, even for the most problematic clusters, the results were very encouraging both linguistically and computationally. This holds even for the syllabification of non-native clusters, although it was decided a priori not to take them into account, both for theoretical reasons of undecidability and for practical reasons, since they are completely absent from the corpus.
While still adhering to this principle, the purposes of the syllabifier led to preferring a tautosyllabic over a heterosyllabic treatment of sC clusters and geminates. Favouring the former yields a smaller variety of syllables, avoids the problem of extrasyllabicity altogether, makes it possible to recognise and distinguish a posteriori syllables with geminates from those with singletons, and reduces the variability of the syllable structures found in the corpus. Although debatable on purely phonological grounds, the adopted solution proved to be the most suitable for the purposes of the syllabifier, which can thus associate as much information as possible with the smallest number of signal portions, without resorting to rules, exceptions or post-lexical resyllabification in order to include extrasyllabic segments. The resulting syllabifications are perfectly suited to automatic signal analysis, thereby meeting one of the fundamental aims of the CLIPS project: making available an important resource for automatic speech processing.

Nevertheless, in order to verify the primarily phonological value of the principles adopted, it was necessary to demonstrate the correct syllabification of geminates and sC clusters, which had been treated as tautosyllabic to better serve the purposes of the program. The sonority principle was therefore considered in its restrictive form, under which segments belong to the same syllable only as long as sonority decreases, thus excluding cases of flat sonority. Under this interpretation of the principle, the resulting syllabification conforms fully to phonological theory, including the heterosyllabicity of sC clusters and geminates, without introducing any exceptions or modifications to the principle and the sonority scale previously proposed.
The results prove even more important from a linguistic point of view: the sonority principle alone predicts a system of syllabifications endorsed by the phonological literature, with no exceptions other than the permitted variations in the sonority scale. There is no need to assume that speakers resort to arithmetic operations to determine the syllabification of any cluster, nor to introduce further principles or contextual conditions. Moreover, keeping the same sonority scale and changing the value of /s/ from 1 to 0 yields the tautosyllabic interpretation of the sC cluster, which again does not result in word-internal extrasyllabic segments, as already described for /e.kstra/. Bertinetto's (1999) hypothesis of a diachronic shift of the sC cluster from heterosyllabic to tautosyllabic could therefore, from this perspective, be justified and explained in terms of a loss of sonority of the phoneme /s/.

The program was developed in Python, together with an ad hoc NLTK-based interface that supports interaction with, and the encoding and analysis of, the data in the corpus. Some issues certainly need further investigation, but the results obtained open the way to many other lines of study and areas of application.

ACKNOWLEDGEMENTS

First and foremost I would like to thank my advisor, Renata Savy, for her patience, support and advice. My profound admiration goes to her for having introduced me to Linguistics. Thanks to Franco Cutugno, for his continuous support and for opening his NLP laboratory to me. Without them this thesis would never have come about. I will never be able to thank all my friends enough, especially Rocco and Gabriele. They have helped me in so many ways that it would double the size of this thesis to thank them as they deserve.
Thanks also to Carmen and Vito, who made sure I graduated, to the whole 'Poznan cool egg' group for giving me the best generative holiday of my life, to my new friends in Pisa and to the old ones in Warsaw. Thanks to my family for supporting me. Thanks to Jerzy Rubach, Piotr Banski, Markus Poechtrager, David Pesetsky and to all the scholars who have given me hints and stimuli. Special thanks also go to Karolina Iwan for the endless discussions of Optimality Theory.

1 SYLLABLE AND SYLLABIFICATION

1 Syllable

The term ‘syllable’ is defined by the Merriam-Webster dictionary as “a unit of spoken language that is next bigger than a speech sound and consists of one or more vowel sounds alone or of a syllabic consonant alone or of either with one or more consonant sounds preceding or following.” The definition of the syllable is as controversial as the concept itself. This definition is circular, since it uses the adjective 'syllabic' to define the term, and it gives no clue as to how to distinguish phonemic sequences from syllables. As I will show in this chapter, the Merriam-Webster definition reflects the two main points of a phonological debate that has not yet been settled: what the syllable is and how its boundaries are to be defined. In this section, I will introduce basic concepts about syllable structure. In the second section, various syllabification principles will be analysed. In the third section, I will show how the concept of the syllable has evolved through some phonological theories.

1. Syllable Structure

While linear phonologies adopted a linear approach to the syllable, for instance in structuralism, SPE and other notable works such as Kahn (1976) and Clements and Keyser (1983), the binary-branching structure in image 1.1 can be considered the one most commonly found in the phonological theories treated in this chapter¹.
It is made of:
➢ the onset, which is one or more consonants preceding the nucleus;
➢ the nucleus, which is obligatory in all languages and constitutes the core of the syllable. It is usually a vowel in the form of a monophthong, diphthong or triphthong; some languages also allow sonorants as nuclei²;
➢ the coda, which is one or more consonants following the nucleus in the syllable;
➢ the rime, which is obligatory and groups together nucleus and coda;
➢ the syllable, which includes onset and rime. It is generally indicated with a σ (sigma).

¹ Other notable descriptions are the moraic one (Hyman 1985, Prince 1986, Hayes 1989) and the ternary-branching one: σ → Onset Nucleus Coda (Hockett 1955, Haugen 1956, Davis 1985).
² For example the word 'little' in RP.

Image 1.1: Tree representation of a CVC syllable³
³ This hierarchical representation of the syllable was proposed by the autosegmental theory.

The non-linear representation of the syllable was inspired by a new approach to phonology⁴ and helped to improve the formalisation of other known phonological processes. Syllable structure, for example, was used to describe how two words echo one another by means of rhyme, assonance, consonance and alliteration. Rhyming words have the same rime in the last syllable, an assonance can be described as two words having the same nucleus in the last syllable, and so forth (see table 1.1).

              Example        Onset      Nucleus    Coda
Rhyme         pill, mill     Different  Same       Same
Assonance     cap, hat       Different  Same       Different
Consonance    silly, Sally   Same       Different  Same
Alliteration  silly, solar   Same       Different  Different

Table 1.1: Rhyme, assonance, consonance and alliteration

To represent the structure of syllables, phonemic segments are usually reduced either to 'C' for consonants or to 'V' for vowels. More specific phonemic properties (such as features) may be used, depending on the theory of reference, to describe phonotactic constraints on syllabic position (see image 1.2).
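The C/V reduction and the comparisons of table 1.1 can be sketched in Python as follows. This is only an illustrative sketch and not part of the syllabification program described later: the names `cv_skeleton`, `split_syllable` and `compare` are hypothetical, and the code assumes that each input is a single syllable written with one character per segment and that a plain five-vowel inventory is sufficient.

```python
# Assumption: a toy vowel inventory; a real phonemic layer would need
# the full symbol set of the transcription in use.
VOWELS = set("aeiou")


def cv_skeleton(segments):
    """Reduce a sequence of segments to its C/V skeleton."""
    return "".join("V" if seg in VOWELS else "C" for seg in segments)


def split_syllable(syllable):
    """Split ONE syllable into (onset, nucleus, coda).

    The nucleus is taken to be the span from the first to the last vowel,
    so sonorant nuclei are not handled by this sketch.
    """
    vowel_positions = [i for i, seg in enumerate(syllable) if seg in VOWELS]
    if not vowel_positions:
        raise ValueError(f"no vowel nucleus found in {syllable!r}")
    first, last = vowel_positions[0], vowel_positions[-1]
    return syllable[:first], syllable[first:last + 1], syllable[last + 1:]


def compare(syllable_a, syllable_b):
    """Return 'Same'/'Different' for onset, nucleus and coda, in that order."""
    parts_a = split_syllable(syllable_a)
    parts_b = split_syllable(syllable_b)
    return tuple("Same" if a == b else "Different"
                 for a, b in zip(parts_a, parts_b))


print(cv_skeleton("gatto"))   # CVCCV
print(compare("pil", "mil"))  # ('Different', 'Same', 'Same'): a rhyme
print(compare("kap", "hat"))  # ('Different', 'Same', 'Different'): an assonance
```

Only monosyllables are compared here; for longer words the relevant syllable (the last one for rhyme and assonance) would first have to be isolated by a syllabification step.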
⁴ I will focus on autosegmental theories later in this chapter.

Image 1.2: Tree representation of the Italian syllable structure⁵

It has been argued that the preferred syllable structures are either CV or V, and the CV structure has even been considered a linguistic universal by Blevins (1995). Recent works in Government Phonology suggest first that some templatic languages are CV only and then, as in Lowenstamm (1996), that “syllable structure universally, i.e., regardless of whether the language is templatic or not, reduces to CV”.

In Italian, French and Spanish the CV structure has a frequency of at least 50% (Vogel, 1993) and it is universally the least marked, so that in some languages no other configuration is allowed. For example, in Boumaa Fijian all syllables are either CV or V, and if a word is borrowed from another language epenthetic segments may be added to reduce it to those syllable structures. This is the case of loanwords such as koloko and aoopolo, from the English cloak and apple (Zec, 1995). On the other hand, some languages allow syllables with complex onsets and codas. For example, English syllables can be sCCVCCC, and word-finally even more consonants can occur in the coda; German syllables can be SCCVCCC, as in springst. However, in most cases codas are severely restricted. In Lardil and Ponapean, syllables are maximally CVC with a restricted coda, and many Chinese languages are CGVC. In Italian, syllables can be maximally sCCVC, with the coda generally limited to sonorants or /s/. However, some Italian words, such as the acronym CAP /kap/ 'post code', or borrowings from Germanic languages (e.g., en. /kart/), may include non-sonorant or complex codas, thus resulting in possible sCCVCC structures.

⁵ Adapted from Nespor, 1993.

2. Syllable Weight

Syllables can furthermore be grouped according to their weight.
A heavy syllable is VV(V)⁶, V: or VC, that is, its rime contains more than one segment. A light syllable instead has no coda and a simple nucleus (i.e., one composed of a short vowel, a sonorant or, generally, a single segment). For example, V and xCV syllables are considered light. In some languages only the nucleus accounts for the weight of the syllable. In such a language, a syllable with a coda, such as CVC, would be considered light as well. Traditionally it was assumed that weight criteria, even if they may differ from language to language, are uniform within the same language (McCarthy and Prince 1986, Hayes 1989). In a more recent study Matthew Gordon (2004) argued that '[…] weight criteria are frequently non-uniform within a given language.' (Hayes, 1989; Goldsmith 1999)

Syllable weight may be represented differently according to the given theory. As an indicative example I will take the representation of syllable weight in moraic theory. Image 1.3 shows three syllables, two heavy (CVVC and CVC) and one light (CV). Mu (Greek: μ) indicates the weight of a segment. The first syllable is heavy because the vowel is long and therefore bimoraic; the second syllable is heavy because its rime has two segments and therefore two moras.

Image 1.3: Syllable weight representation in moraic theory

Syllable weight has played an important role in recent studies, in particular in the description of stress assignment (Waltermire, 2004) and of African tonal languages. It also had a crucial role in classical metrics and has been used for the description of some important Italian phonological phenomena such as Raddoppiamento Sintattico, il/lo allomorphy and vowel lengthening (see paragraph 2).

⁶ Diphthong, triphthong or long vowel.

2 Syllabification

Syllabification can be simply defined as the separation of a word into syllables.
In this paragraph I will start by analysing the theoretical efforts of prescriptive grammarians to define syllabification rules for dividing orthographic words into syllables. I will show that while these principles are mostly non-linguistic in English, where spelling differs greatly from pronunciation, in Italian they are closer to phonological principles and empirical evidence. Descriptive linguists have tried to formulate formal principles to account for the syllable division problem. But while counting the number of syllables in a word is a simple task for any speaker of a language, the description of this ability and the accurate identification of syllable boundaries are still debated problems. I will also show the results of some experiments based on corpora and on speaker competence that are supposed to give psycholinguistic value to syllable division. Finally, I will summarise in a table how each of these approaches divides consonant clusters (on which there is the most disagreement). In paragraph 3 I will analyse how these principles (in particular the descriptive ones) have been formalised in generative theories.

1. Orthographic Syllabification

In English, as an effect of the very weak correspondence between sounds and letters in the spelling, orthographic and natural syllabification are usually fairly different. The word ‘learning’, for example, is syllabified learn-ing instead of lear-ning, despite the fact that in spoken language the word would be syllabified as the latter. Orthographic syllabification is mostly non-phonological and considers not only the phonemic sequence, but also the etymology of the word, its morphological constituents, the ambiguity of possible pronunciations due to spelling idiosyncrasies and so on. These syllabifications are the ones taught at school and used in music scores or in written texts. The same problem arises for Italian.
Most of the dictionaries that display syllabic information are controversial (McCrary, 2004) and adopt the following rules, as indicated in prescriptive grammars or dictionaries (Sabatini and Coletti, 1997; Serianni, 1989; Lesina, 1986)⁷:
➢ CV.CV - if only one consonant precedes the nucleus, the consonant goes in the onset;
➢ VC.CV - geminates are separated: one belongs to the syllable of the preceding nucleus, the other to that of the following one;
➢ V.CCxV⁸ - if the intervocalic consonants are different and CCV can occur word-initially, the cluster belongs to the second syllable;
➢ VC.CxV - if CCV does not occur word-initially, the cluster is divided after the first consonant, which goes in the coda⁹;
➢ Vx - vowels are never divided if they form a diphthong (CVVC). On the other hand, if they form a hiatus they are divided: xV.Vx. Glides always belong to the following vowel syllable (i.e., go in the coda);
➢ x.sCx - 's' before a consonant, if it is not geminated, always belongs to the onset of the following syllable.

These rules have had little importance in formal linguistics. Their formal weakness lies in the fact that they are not justified by any internal or external phonological evidence, but are imposed as a set of rules to be taken as they come.

⁷ These rules are used to divide graphemes. However, digraphs and trigraphs are considered as a single unit and always belong to the same syllable.
⁸ With 'x' I indicate zero or more optional occurrences of the previous symbol. For example Cx indicates that a consonant may be followed by zero, one or more consonants.
⁹ Note that in Italian the coda allows only one consonant. Hence, VCCxV will always be syllabified as V.CCxV or VC.CxV.

2. Sonority Scale

The first formal principle to be found in the literature is probably the Sonority Sequencing Principle (SSP). The SSP is based on the Sonority Hierarchy¹⁰ (SH), which ranks phones by sonority.
In articulatory phonetics the least sonorous phones are the ones produced with a smaller opening of the vocal tract, while in acoustic phonetics they are described as characterised by a lower magnitude. The syllable is then defined as a sequence of speech sounds consisting of a sonority peak and of margins in which sonority decreases. Image 1.4 shows a possible SH and the syllabification of the word 'candle' in accordance with it (Selkirk, 1984; Jespersen, 1904; Sievers, 1876).

A common and long disputed problem in Italian is whether geminates are tautosyllabic or heterosyllabic. Generally, a strict interpretation of this principle would require them to be divided, as the segments would otherwise form a sonority plateau and sonority would therefore not decrease (Image 1.5). But at the acoustic and articulatory level Italian geminates are evidently a single unit, which can hardly be divided. In languages like Italian, geminates¹¹ are realised within the same “chest pulse”, and at the acoustic level the energy keeps decreasing without interruption during the production of the entire sequence.

Image 1.4: Sonority representation of the word 'candle'

¹⁰ The term Sonority Scale is also used.
¹¹ Produced as a long consonantal sound. In other languages each consonant is produced with a single and complete articulation of the sound, i.e., [an.na] instead of Italian [an:a]. In this case, even at the acoustic and articulatory level, geminates are probably heterosyllabic.
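The notion of a sonority peak used in the definition above can be illustrated with a small Python sketch. It assigns each phoneme the rank of its class in the Sonority Hierarchy of table 1.2 and marks as peaks the segments that are more sonorous than both neighbours. The function names, the reduced phoneme inventory and the broad transcription of 'candle' are assumptions made only for this example; they are not part of any of the cited proposals.

```python
# Sonority ranks following table 1.2 (1 = plosives ... 8 = non-high vowels).
# The phoneme inventory is a small illustrative subset (an assumption).
SONORITY = {
    "p": 1, "t": 1, "k": 1, "b": 1, "d": 1, "g": 1,   # plosives
    "f": 3, "v": 3, "s": 3, "z": 3,                    # fricatives
    "m": 4, "n": 4,                                    # nasals
    "l": 5, "r": 5,                                    # liquids
    "j": 6, "w": 6,                                    # approximants
    "i": 7, "u": 7,                                    # high vowels
    "a": 8, "e": 8, "o": 8, "æ": 8,                    # non-high vowels
}


def sonority_profile(phonemes):
    """Return the sonority rank of each phoneme in the sequence."""
    return [SONORITY[p] for p in phonemes]


def sonority_peaks(phonemes):
    """Return the indices of segments more sonorous than both neighbours."""
    ranks = sonority_profile(phonemes)
    peaks = []
    for i, rank in enumerate(ranks):
        left = ranks[i - 1] if i > 0 else 0
        right = ranks[i + 1] if i + 1 < len(ranks) else 0
        if rank > left and rank > right:
            peaks.append(i)
    return peaks


candle = ["k", "æ", "n", "d", "l"]
print(sonority_profile(candle))   # [1, 8, 4, 1, 5]
print(sonority_peaks(candle))     # [1, 4]
```

The two peaks correspond to the vowel and to the final syllabic /l/. For 'gatto' the same profile function returns [1, 8, 1, 1, 8], where the two equal values in the middle are the sonority plateau formed by the geminate.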
Image 1.5: Sonority representation of the word 'gatto'

Sonority | Type | Voiced
1 (lowest) | Plosives | no / yes
2 | Affricates | no / yes
3 | Fricatives | no / yes
4 | Nasals | yes
5 | Liquids | yes
6 | Approximants | yes
7 | High vowels | yes
8 (highest) | Non-high vowels | yes
Table 1.2: Sonority Hierarchy

Sonority | Segments
1 | Occlusives
2 | Fricatives
3 | Nasals
4 | Liquids
5 | Glides
6 | High vowels
7 | Medium vowels
8 | Low vowels
Table 1.3: Coursil's Sonority Scale

Coursil's (1992) syllabification system for French was based on an SH more similar to Saussure's (1914). Table 1.3 shows that the author divided vowels into three groups and used vocal tract aperture to discriminate sonorities. Each segment in the sequence was assigned a binary value (called plosion value), which could be 1 if the sonority decreased or 0 if not. Syllable boundaries were then placed wherever this value changed from 1 to 0. For example the French word moustique 'mosquito' was syllabified 'mus.tik' as in table 1.4.

 | m | u | s | t | i | k
Aperture ranks | 3 | 6 | 2 | 1 | 6 | 1
Plosion values | 0 | 1 | 1 | 0 | 1 | 1
Table 1.4: Syllabification of the French word 'moustique' according to Coursil (1992)

3. Sonority Distance

An SH proposed by Davis (1990) for Italian is shown in table 1.5. According to Davis (1990), in Italian a consonant cluster violates the sonority principle and is heterosyllabic if the sonority distance between its two phonemes is less than four. Otherwise the cluster is tautosyllabic. A VCCxV sequence will therefore be syllabified as VC.CxV if the sonority distance between the first two consonants is less than 4, and as V.CCxV if it is 4 or more. For example the word padre 'father' will be syllabified as pa.dre, because the sonority distance between /d/ and /r/ reaches 4:

/padre/ → Sonority(r) – Sonority(d) ≥ 4 → tautosyllabic → pa.dre

while 'pasta' will be syllabified as pas.ta:

/pasta/ → Sonority(s) – Sonority(t) < 4 → heterosyllabic → pas.ta

A similar principle was also used by Peereman (1998) in his syllabification model for French.
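Going back to Coursil's procedure, the computation summarised in table 1.4 can be sketched in a few lines of Python. The sketch assumes that a segment receives plosion value 1 when the segment that follows it is less sonorous, and that the last segment is also given 1; both choices are inferred from the values printed in table 1.4 and may not match the original formulation in every detail. The function names are hypothetical.

```python
def plosion_values(ranks):
    """Assign 1 to a segment followed by a less sonorous one, else 0.

    The last segment has no follower; it is given 1 here so that the
    output matches table 1.4 (an assumption about the original procedure).
    """
    values = []
    for i, rank in enumerate(ranks):
        is_last = i == len(ranks) - 1
        values.append(1 if is_last or rank > ranks[i + 1] else 0)
    return values


def syllabify(segments, ranks):
    """Insert a boundary wherever the plosion value drops from 1 to 0."""
    values = plosion_values(ranks)
    out = [segments[0]]
    for i in range(1, len(segments)):
        if values[i - 1] == 1 and values[i] == 0:
            out.append(".")
        out.append(segments[i])
    return "".join(out)


segments = list("mustik")
ranks = [3, 6, 2, 1, 6, 1]          # aperture ranks from table 1.4
print(plosion_values(ranks))        # [0, 1, 1, 0, 1, 1]
print(syllabify(segments, ranks))   # mus.tik
```

The aperture ranks are passed in explicitly, so the same functions can be tried with any other sonority scale.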
According to his sonority scale, the distance necessary to have a syllable boundary is three. Principles based on a relative interpretation of the SH are generally called Sonority Distance Principles.

Sonority | Segments | Phones
1 | Voiceless stops | /p, t, k/
2 | Voiced stops | /b, d, g/
3 | Noncoronal fricatives | /f, v/
4 | Coronal fricatives | /s, S/
5 | Coronal nasal | /n/
7 | Noncoronal nasal | /m/
8 | Liquids | /r, l/
9 | Vowels | /a, e, i, o, u/
Table 1.5: Davis (1990) Sonority Scale for Italian

4. Phonotactical Constraints

Other widely accepted syllabification principles are based on phonotactic assumptions. The main point, as expressed by Pulgram (1970), Hooper (1972) and Kahn (1976), is that the only possible codas and onsets are those that are phonotactically possible word-finally and word-initially. This principle is based on two assumptions: firstly, that only medial clusters that can be analysed as a word-final cluster followed by a word-initial one exist in a language, and secondly, that speaker intuition tends to divide words into units that match these phonotactic constraints. The same principle was developed by Kahn into the Maximum Onset Principle (MOP), which regulates the distribution of ambiguous intervocalic consonant clusters. This principle is based on the fact that CV syllables are the preferred ones (i.e., the least marked) in all languages. For example, it accounts for the division V.CV instead of VC.V. Ambiguous intervocalic consonant clusters are also syllabified according to the principle. For instance, in a VCCV sequence the application of the MOP gives V.CCV if CCV is a possible word-initial cluster, and VC.CV otherwise.

5. Internal Evidence

Studies have justified the necessity of the syllable in phonological theory by discussing phenomena whose best description requires this unit to be postulated¹².
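The Maximum Onset Principle presented in section 4 can be sketched for a single intervocalic cluster as follows. The set of word-initial clusters is a tiny hand-picked sample, assumed only for illustration, and the function names are hypothetical; a real implementation would need the full inventory of word-initial clusters of the language.

```python
# A small, hand-picked sample of clusters that can begin an Italian word.
# It is an illustrative assumption, not an exhaustive phonotactic inventory.
WORD_INITIAL = {"pr", "tr", "kr", "dr", "pl", "kl", "st", "sp", "sk", "str"}


def split_cluster(cluster, legal_onsets=WORD_INITIAL):
    """Split an intervocalic consonant cluster into (coda, onset).

    The onset is the longest suffix of the cluster that is either a single
    consonant or a possible word-initial cluster; the rest goes to the coda.
    """
    for start in range(len(cluster)):
        onset = cluster[start:]
        if len(onset) == 1 or onset in legal_onsets:
            return cluster[:start], onset
    return "", ""  # empty cluster: nothing to split


def mop(left, cluster, right):
    """Join the material before and after the cluster around the boundary."""
    coda, onset = split_cluster(cluster)
    return f"{left}{coda}.{onset}{right}"


print(mop("pa", "dr", "e"))   # pa.dre
print(mop("a", "rt", "o"))    # ar.to
print(mop("ka", "s", "a"))    # ka.sa
```

With this sample of onsets, /st/ comes out as tautosyllabic (pa.sta), because st can begin a word; this is the kind of prediction that the experimental data discussed later call into question.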
In Italian literature most work has been based on three phonological processes: Raddoppiamento Sintattico (RS), Vowel Lengthening and il/lo Allomorphy.
➢ Raddoppiamento Sintattico (RS): the gemination of a word-initial consonant if the preceding word meets some conditions, which vary from one variety of Italian to another. RS is syllable-sensitive because, in some theories, only tautosyllabic clusters in the second word seem to undergo RS. For example, metà [s]carpa vs metà [k:]oso. (Vogel, 1982; Chierchia, 1982, 1986; Repetti 1989, 1991)
➢ Vowel Lengthening: the lengthening of a vowel if it is stressed, not word-final and belongs to an open syllable. For example ['ka:.sa], but ['kar.ta] or ['pas.ta]. (Chierchia, 1982, 1986; Nespor and Vogel, 1986; Nespor, 1993; Vogel, 1977, 1982).
➢ il/lo allomorphy: the selection of the definite article allomorph il or lo before various word-initial consonant clusters. lo is claimed to select heterosyllabic clusters, while il prefers tautosyllabic ones. For example, [los.karpone] but [il.korpo] (Davis, 1990; Marotta, 1993).

¹² I will give an example in the next paragraph, where phonological theories will be treated.

The convergence of multiple phonological processes on the same syllable structure is argued to provide evidence for the claimed syllabifications. For example, in a VCCV sequence, assuming that vowel lengthening occurs in open syllables only, the syllabification will be V.CCV¹³ if V is lengthened in that context, and VC.CV otherwise.

6. External Evidence

Various efforts have been made in the literature to give external evidence to language syllabification and structure.
For example, Bertinetto (1999) analysed how the sC cluster is treated by 20 speakers from the University of Pisa using permutation tasks (syllable reduplication and substitution) and found that, despite the prediction of descriptive phonology¹⁴, sC is treated more like a tautosyllabic cluster, V.sC. Caldognetto also obtained contrasting results analysing a corpus of 2500 speech errors. In fact, while consonant substitutions suggested a tautosyllabic sC cluster, in deletion and insertion errors speakers probably detected heterosyllabic clusters.

¹³ Note that vowel lengthening in Italian requires stress to be assigned. For the sake of simplicity, I will not indicate stress for now.
¹⁴ Previous studies analysing the convergence of il/lo allomorphy, RS and vowel lengthening led to a generally agreed heterosyllabicity of sC clusters.

An exhaustive study on consonant cluster syllabification was recently carried out by McCrary (2004). Using various tasks she tested 51 Pisan subjects. The experiments aimed to verify:
➢ how native speakers treat consonant clusters;
➢ whether segment duration (vowel lengthening and RS) and definite article allomorphy (the three phonological processes listed in the previous section) really converge on syllable structure.

According to McCrary (2004) the results obtained show that “the standard syllable-based analyses of consonant cluster divisions, definite article allomorphy and segment duration are not supported by the experimental evidence.” In a previous study Steriade (1999) argued that syllable division experiments are influenced by the phonotactic knowledge of the speaker, in particular for the division of consonant clusters. Word-edge knowledge is claimed to be used to divide words into syllables, so that given a syllable-initial/final segment there is a word of which the first/final segment is that syllable-initial/final segment.
According to the theory, speakers should show uncertainty about the syllabification of Italian /s/ because the phoneme is a possible word-initial and word-final segment. The theory seems to be partially confirmed by McCrary (2004), who states that both the word-based syllable strategy and the phonotactic-constraint satisfaction strategy appear to be adopted by speakers, as emerges from the results of her tests. In fact she states that “[…] ambiguous and contrastive syllabification were given in the case the two principles contrasted”.

7. Comparison of Principles

Table 1.6 shows all the consonant clusters analysed and tested by McCrary (2004), compared with the syllabifications obtained by applying the other principles discussed in this paragraph. I will start with native clusters (i.e., Cl, sC, sL, sN, LC, NC), where there is greater agreement, and then proceed to the more problematic non-native clusters. A brief summary of the proposals follows:
➢ MOP: the Maximum Onset Principle includes the word-initial condition. If a CC cluster can occur word-initially, then the cluster is tautosyllabic. For example, padre is syllabified as pa.dre, as /dr/ is a possible word-initial cluster.
➢ SSP: sonority decreases from the nucleus to the margins. For example, mir.to, as /t/ is less sonorous than /r/, but la.dro, because /r/ is more sonorous than /d/.
➢ SD: if the sonority distance according to Davis' SH is at least 4, the cluster is tautosyllabic. pas.ta is heterosyllabic because the distance between /s/ and /t/ is +3.
➢ Experimental evidence: the first value indicates how many speakers treated the cluster as tautosyllabic, the second as mixed and the third as heterosyllabic. The pattern is tauto/mixed/hetero, as for the cluster /nd/, which is indicated as 12/22/22.
➢ Dictionary: DISC syllabification is also given.

Cluster | Word ex. | Experiment | SSP | SDistance | Garzanti
CL: pl, kl, pr, tr, kr, dr... | aeroplano, padre, litro, ecc. | 40/2/5 tauto pa.dre | yes, tauto pa.dre | 6, tauto pa.dre | tauto pa.dre
sC: sp, st, sk, sb | caspita, pasta, kasko, ecc. | 25/6/16 tauto pa.sta | no, hetero pas.ta | 3, hetero pas.ta | tauto pa.sta
sL: sl, sr | bislakko, dislessiko, israele | 20/11/16 tauto? di.slessia | yes, tauto di.slessia | 3, hetero dis.lessia | tauto di.slessia
sN: sn, sm | nichilismo, masnada, bisnonno | 18/4/25 hetero? bis.nonno | yes, tauto bi.snonno | 1, hetero bis.nonno | tauto bi.snonno
LC: rp, rt, rk, lp, lt, lk | korpo, arto, arko, alto, ecc. | 0/0/47 hetero ar.to | no, hetero ar.to | 6, tauto ar.to | hetero ar.to
Cn: pn, tn, kn | apnea, etnia, teknika | 9/12/26 hetero tek.nika | yes, tauto te.knika | 4, tauto te.knika | hetero tek.nica
CT: pt, kt | sinaptico, ektoplasma, penectomia | 7/16/24 hetero sinap.tiko | no, hetero sinap.tiko | 0, hetero sinap.tiko | hetero sinap.tiko
Cs: ps, ks | micropsia, kapsula, rokstar | 8/11/28 hetero kap.sula | yes, tauto ka.psula | 3, hetero kap.sula | tauto ka.psula
ft | lifta, lifting, nafta | 9/10/28 hetero naf.ta | no, hetero naf.ta | 2, hetero naf.ta | hetero naf.ta
tl | atletico, atlante, genetliako | 26/12/9 tauto a.tlante | yes, tauto a.tlante | 6, tauto a.tlante | tauto a.tlante
GM: dm, gm | kadmio, segmento, dogma | 6/10/31 hetero dog.ma | yes, tauto do.gma | 4, tauto do.gma | hetero dog.ma
bn | abnorme, abnegare, subnukleare | 4/15/28 hetero ab.norme | yes, tauto a.bnorme | 3, hetero ab.norme | hetero ab.norme
fn | afnio | 8/16/23 hetero af.nio | yes, tauto a.fnio | 2, hetero af.nio | hetero af.nio
Table 1.6: A comparison of possible CC cluster division strategies (adapted from McCrary 2004)

8. Conclusion

Unanimous syllabifications are given for CL, LC, NC. The most interesting aspect of this consensus is that these are also the only clusters for which speakers gave a homogeneous and nearly unanimous division. For the CL cluster 40 people treated it as tautosyllabic, only two gave mixed responses and five treated it as heterosyllabic. Other clusters are even clearer, showing 0/2/45 for LC clusters and an impressive 0/0/47 for NC.
The fact that the consensus among phonological theories coincides with the experimental evidence only in these cases demonstrates that there is a real convergence between the principles and that a phonological basis probably exists.

Other interesting cases are the ones which concern the syllabification of the sC cluster. I have already cited studies in which the contrasting treatment of sC arose from the analysis of phonological phenomena, corpora and experiments. Table 1.6 gives further evidence of syllabification discrepancies. It is interesting to note that the only divergent sources are the experimental results and the dictionary. The syllabification of sC as tautosyllabic might then be due to interference from the orthographic syllabification learned at school. Additional evidence for this hypothesis comes from the fact that children tend to split the cluster, from the special place of /s/ in syllable structure (remember that it can go in the coda even though it is not a sonorant) and from the various experiments and studies already cited. In the next paragraph we will also show how an algorithm based on phonotactic context confirms the uncertain behaviour of sC.

Various syllabification techniques have been presented in this paragraph, but while there is convergence on detecting the number of syllables and a core area in which syllabification is easily predictable, some cases are still debated and unresolved.
This uncertain syllabification might be due to various factors:
• the non-deterministic nature of phonology or of linguistics itself, which would lead to a broad theoretical discussion that I wanted to avoid;
• the fact that the speaker does not need to face the problem, that is, the possibility of having rules which do not account for every possible case in a language, simply because such cases are not relevant to that language or do not occur frequently enough; this might be the case of non-native clusters, which do not appear even once in CLIPS;
• the interference or interaction of various components of speakers' competence and knowledge of their language, such as phonotactic ability, the sonority scale principle and orthographic hints, as in my opinion clearly emerges from McCrary's (2004) data;
• a diachronic change in progress in the language, as for the sC cluster, whose ambiguity is argued by Bertinetto (1999) to be due to a diachronic shift of the cluster from heterosyllabicity to tautosyllabicity.

3 From SPE to Optimality

1. SPE Rules

Most of classical generative phonology (based on SPE¹⁵) rested on the analysis of the discrepancies between Mental and Surface representation. The Mental or Underlying Representation (UR) includes unpredictable and contrastive language information, while the Surface Representation includes the systematic and predictable one. The discrepancies between Surface and Mental representations are accounted for in the derivation by a set of rules (whose order of application has to be specified) which, applied to the UR, result in the surface form. Rules are formalised in formulas similar to the following:

A → B / C __ D

which can be paraphrased as: A becomes B if preceded by C and followed by D.
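To make the rule schema concrete, the following Python sketch applies rules of the form A → B / C __ D to plain strings and chains them in a fixed order. It is a deliberate simplification: SPE rules operate on feature bundles, whereas here A, B, C and D are literal strings, and the two toy rules (a nasal becoming /l/ before a morpheme boundary followed by /l/, then boundary erasure) are assumptions introduced only for the example. The function names are hypothetical.

```python
import re


def apply_rule(form, target, change, left="", right=""):
    """Rewrite `target` as `change` when it occurs between `left` and `right`.

    Mirrors the schema A -> B / C __ D with literal strings for A, B, C, D.
    """
    pattern = ""
    if left:
        pattern += f"(?<={re.escape(left)})"   # C: required left context
    pattern += re.escape(target)               # A: the segment to rewrite
    if right:
        pattern += f"(?={re.escape(right)})"   # D: required right context
    return re.sub(pattern, lambda _match: change, form)


def derive(underlying, rules):
    """Apply the rules in the given order and return every derivation step."""
    steps = [underlying]
    for target, change, left, right in rules:
        steps.append(apply_rule(steps[-1], target, change, left, right))
    return steps


print(apply_rule("CAD", "A", "B", left="C", right="D"))   # CBD

rules = [
    ("n", "l", "", "+l"),   # toy assimilation: n -> l / __ +l
    ("+", "", "", ""),      # toy boundary erasure: + -> (nothing)
]
print(derive("in+legale", rules))   # ['in+legale', 'il+legale', 'illegale']
```

Because the rules are applied strictly in list order, swapping the two toy rules would erase the boundary first and leave the nasal unchanged, which shows why the order of application has to be specified.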
Each rule defines a Structural Description, which consists of a class of possible contexts (CAD in this case), and a Structural Change, which is the change to apply when the context is met. Variables are usually expressed in the form of distinctive features or phones, but other symbols are also found (the first two borrowed from Chomsky and Halle's works on syntax):
➢ # indicating a word boundary
➢ + indicating a morpheme boundary
➢ C or V indicating a consonant or a vowel

For example, the nasal assimilation rule for Italian was described by the rule in image 1.6, which can be paraphrased as follows. Total Nasal Assimilation: nasals totally assimilate the traits of the following sonorant before morpheme boundaries.

¹⁵ The Sound Pattern of English (1968), Chomsky and Halle's work on phonology, which stands as a landmark for any generative work on phonology.

Image 1.6: Nasal assimilation rule

Features with Greek letters indicate that the two segments share the same value. For example, if [αant] is positive in the context, the derived segment will have the same value. As the name of the rule suggests, assimilation consists in a segment taking on some traits of an adjacent segment. The derivation of the word /illegale/ would be the following:

//in+legale// UR
/il+legale/ Total Nasal Assimilation
/illegale/ Surface Representation

2. The Syllable in SPE

The application of SPE principles to a wide variety of languages made an unlikely context appear more frequently than expected. For example, the alternation in French of [e] and [ɛ] (e.g., ins[eR]é, ins[ɛR] and ins[ɛR]sion) was described with the rule in image 1.7. But the same rule does not apply to many words which match the CC context, such as [mepRiz]er, [səvr]er and so forth. Moreover, the same context frequently recurred even in typologically different languages.
For instance, English dark and clear /l/ alternate in the same context, as do many phonological phenomena in Turkish, such as epenthesis, final devoicing and vowel shortening (Clements and Keyser, 1983). The recurrence of these two unrelated contexts (a word boundary or a consonant) in different languages cannot be accidental. The adoption of the syllable provided the most elegant solution to the problem. For example, the French rule can be rewritten as in image 1.8, with the dollar sign ($) marking a syllable boundary. Still, the linear representation of the syllable was soon abandoned by most authors in favour of a nonlinear one.

Image 1.7: SPE rule for the French [e] and [ɛ] alternation

Image 1.8: SPE rule for the French [e] and [ɛ] alternation including the syllable

3. Autosegmental Theory

During the '70s, numerous studies on tones and on phonological phenomena which span multiple segments (such as vowel and nasal harmony) led some linguists to rethink Chomsky and Halle's theory. For instance, in SPE tones were usually assigned to a segment (generally a vowel), but many contemporary studies on African tone languages showed that tones can be assigned to a phone, a sequence of phones, a syllable or a phonological word, and that the deletion of a segment does not necessarily lead to the deletion of the tone itself (a property called stability). Rules still apply (in section 1.2.7 I will show that Optimality Theory replaces rules with constraints), but instead of treating the phonetic representation as a single sequence of segments, autosegmental theory treats it as a set of autosegments (where auto stands for autonomous, independent). On this view, phonological representations consist of more than one linear sequence of segments; each linear sequence constitutes a separate tier of autosegments, also called a plane.
The autosegmental theory can be dated back to the framework that John Goldsmith submitted in 1976 at the Massachusetts Institute of Technology. Goldsmith developed a formal account of ideas that had been sketched in earlier work by several linguists, notably by Bloch (1948), Hockett (1955) and Firth (1948). Goldsmith stated that “phonological representations consist of more than one linear sequence of segments; each linear sequence constitutes a separate tier”. The realisation of a segment implies the coordination and simultaneity of the tiers. In other words, each tier has to be associated with, and finally converge on, a chronological linear sequence. This tier is called the skeleton. The skeleton is represented using neutral X-slots, in which no features or articulatory properties change¹⁶. Instead, X-slots organise autosegments into temporal units. For this reason the skeleton is also called the timing tier. Note that at no point do different tiers merge. The planes are instead linked together and organised by association lines, which indicate that different autosegments are simultaneous.

Tiers and association lines are always organised according to hard constraints, which can never be violated. Association lines are drawn according to a series of principles called Well-Formedness Constraints (WFCs) (Clements and Goldsmith, 1984), which are supposed to be universal in their specific domain. We have already stated that WFCs cannot violate hard constraints. However, some soft constraints may be specified. Unlike hard constraints, these can be violated. If a derivation violates a soft constraint it is not marked as ill-formed. Instead a Repair Mechanism is specified. In this way phonological phenomena are described 'in terms of rule that delete and reorganize the various autosegment, through the readjustments of association lines'.
The difference from SPE is that there the derivation was carried out by applying a sequence of rules which directly changed the features of linear segments. We will see an example of autosegmental derivation in the following section.

¹⁶ In some literature, mostly that concerning tones and syllables, the skeletal tier is marked with C or V instead of X-slots. I will use this notation too when necessary.

4. Autosegmental Syllabification

In this section I will start by showing the new autosegmental representation of the syllable (after Goldsmith 1976, 1984; McCarthy 1979, 1981; J. Trommer 2008). Then, I will propose a minimal description of Italian syllabification in an autosegmental fashion. To describe Italian syllabification at least two tiers of representation in addition to the skeleton are necessary:
➢ melody: the articulators described in terms of features;
➢ syllable: organising X-slots into syllable structure.

The melodic tier is linked to the skeleton according to the following WFCs:
➢ Every skeletal node is linked to a melodic node
➢ Every melodic node is linked to a skeletal node
➢ Every melodic node should be associated to at most one skeletal node

The following Soft Constraint determines whether a segment in the melodic tier is correctly associated with an X-slot:
➢ Every skeletal node should be associated to at most one melodic node

A Repair Algorithm accounts for constraint violations:
1. If there are unassociated S-nodes and M-nodes: associate S-nodes and M-nodes 1:1 from left to right
2. Else, if there are unassociated S-nodes: associate every unassociated S-node S to the M-node to which the S-node immediately preceding S is associated

Association lines link the melodic tier directly to the skeletal tier.
In fact, most of the time the realisation of a segment in the melodic tier is represented as a single unit in the skeleton, that is, without violating any constraint. But there are cases in which articulators are not linked one-to-one to the skeleton and soft constraints are violated. Generally, these combinations are possible¹⁷:
➢ one-to-one: this is the commonest case. Each set of traits distinguishes a unit in the skeleton, as in the representation of the word cane 'dog' (Image 1.9).

Image 1.9: 1 to 1 correspondence between melodic and skeletal tier

➢ many-to-one: even if affricates are considered a single phoneme, their articulation is complex, as it implies that the trait [± continuant] changes within the phoneme. For example, the articulation of /ts/ can be described as the sequence of [t] followed by [s]. The trait [± continuant] in fact shifts from a negative to a positive value in the melodic tier. However, since in the Italian phonological system the sound behaves as a single unit, in the skeletal tier the affricate /ts/ will be represented as a single segment. Affricates are represented as in image 1.10.

Image 1.10: Example of a 2 to 1 correspondence between melodic and skeletal tier

¹⁷ Other configurations are possible as well, such as 0 → many and many → 0, but they will not be treated here.

➢ one-to-many: in Italian an open and debated problem regards the syllabification of geminates. Following the traditional autosegmental approach, geminates are considered a single unit in the melodic tier, but are linked to two X-slots. In fact, there is no distinctive feature change during the articulation of geminates, that is, articulators and distinctive features remain the same during the production of the speech sound.
On the other hand, consonantal length in Italian has both a phonological and a phonetic value: it differentiates minimal pairs, and it corresponds to an actually perceived lengthening of the segment compared with the same non-geminate segment in the same context. The melodic segment will therefore be linked to two X-slots, as proposed by Danesi (1985) to resolve the problem of geminate syllabification in Italian. For example, the word 'gatto' is represented as in Image 1.11:
Image 1.11: Example of 1 to 2 correspondence between melodic and skeletal tier
The well-formedness rules governing the association of the syllabic tier to the skeletal tier require a higher degree of complexity. As a principle we will use the Sonority Distance based on Davis (1990). To correctly represent a simplified syllabification of Italian we need at least the following rules, adapted from Rubach's (1990) proposal for Polish. The rules are stated in their order of application in Table 1.7 and an example is given in Image 1.12. Yet the application of these principles may leave some segments unlinked. A repair mechanism is necessary for words which violate the Sonority Scale Principle at word margins, like skala 'ladder', or for syllables with a complex coda (sport)18. For the latter case, extrasyllabicity may be assumed and the segment linked directly to the phonological word; for the former, a recursive Onset Rule applied before the Coda Rule might be hypothesised (in this case we will have resyllabification across word margins as well). For example, the phrase la skala 'the ladder' will be syllabified according to the following rules:
N-Placement > CV Rule > Onset Rule (blocked by SSP) > Coda Rule (adds /s/ to the first syllable) > las.ka.la
If the phoneme /s/ in the word scala is considered extrasyllabic, a rule or the repair algorithm stated above will link it to the phonological word or to the onset of the following syllable.
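The rule ordering used in the derivation of las.ka.la (N-Placement > CV Rule > Onset Rule > Coda Rule, stated in Table 1.7) can be sketched in Python as follows. This is only an illustration under strong assumptions: phonemes are single characters, the sonority values in `SONORITY` are invented for the example and cover only the listed consonants, and neither geminates, affricates nor the Complex Onset Rule are handled. The names `syllabify`, `VOWELS` and `SONORITY` are hypothetical.

```python
VOWELS = set("aeiou")

# Illustrative sonority values (an assumption made for this sketch only).
SONORITY = {}
for segments, value in (("ptkbdg", 1), ("fvsz", 2), ("mn", 3), ("lr", 4)):
    for segment in segments:
        SONORITY[segment] = value


def syllabify(phonemes):
    """Return (syllables, unlinked) for a string of one-character phonemes."""
    # 1. N-Placement: every vowel projects a nucleus.
    nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not nuclei:
        return [], phonemes
    starts = []
    for k, n in enumerate(nuclei):
        floor = nuclei[k - 1] + 1 if k else 0  # first index after the previous nucleus
        start = n
        if start > floor:
            # 2. CV Rule: the consonant next to the nucleus becomes its onset.
            start -= 1
            # 3. Onset Rule: add consonants while sonority rises towards the nucleus.
            while (start > floor
                   and SONORITY[phonemes[start - 1]] < SONORITY[phonemes[start]]):
                start -= 1
        starts.append(start)
    # 4. Coda Rule: consonants left between two syllables join the preceding one.
    ends = starts[1:] + [len(phonemes)]
    syllables = [phonemes[s:e] for s, e in zip(starts, ends)]
    # Word-initial consonants rejected by the Onset Rule stay unlinked.
    return syllables, phonemes[:starts[0]]


print(syllabify("laskala"))  # /s/ ends up in the coda of the first syllable
print(syllabify("skala"))    # /s/ stays unlinked (extrasyllabic)
print(syllabify("padre"))    # /dr/ is accepted as an onset
```

The second value returned makes the unlinked segments explicit, so that a separate repair step (extrasyllabicity or a complex onset rule) could decide what to do with them.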
For example, if /s/ is treated as extrasyllabic: N-Placement > CV Rule > Onset Rule (blocked by SSP) > Coda Rule (no segment, skip) > Complex Onset Rule19 > ska.la
18 According to Rubach (1986), floating segments are typically not present in the phonetic representation.
19 The Complex Onset Rule puts further segments in the onset; a positive constraint may allow specific phonotactic configurations, such as sCCV syllables in Italian, and override the SSP.

Rule
1 N-Placement: for every vowel on the melody tier, place an N20 in the syllabic tier.
2 CV Rule: if there is something to the left of an N, it is included in the onset. In any case, an N'' node is created.
3 Onset Rule (optional): put the remaining consonants in the onset as long as they do not violate the Sonority Scale Principle. It may apply several times.
4 Coda Rule (optional): put the remaining consonants in the coda.
Table 1.7: Autosegmental Syllabification Algorithm for Italian
20 N indicates a node. It is possible to rename the nodes according to the constituents of syllable structure. In this case the correspondence is the following: N → Nucleus, N' → Rime, N'' → Syllable.
An example of the application of the algorithm to the syllabification of 'pastrocchio' follows:
Image 1.12: Autosegmental syllabification steps for the word 'pastrocchio'

5. Metrical Phonology

At first, autosegmental theory was used to describe tonal features. However, by the mid '80s it had become a full theory capable of representing all kinds of phonological features. The success of the theory contributed to the creation of two other nonlinear theories: metrical and prosodic phonology. But while autosegmental phonology began when linguists failed to account for tonal phenomena in some African languages, metrical phonology was introduced when the available instruments for the analysis of stress patterns became insufficient.
Until then, generative phonology had represented prominence as a feature [±accent], assigned by rules during the derivation to individual vowels as a segmental feature21. In some cases, stress was further indicated using a discrete numeric scale, defined by syntactic structures, along a single phonetic dimension (Chomsky and Halle, 1968; Halle and Keyser, 1971). The following digits were used to mark stresses:
➢ 0: unstressed vowels
➢ 1: primary stress
➢ n > 1: other stresses, with higher numbers indicating a weaker stress
For example, the phrase 'black board'22 is described in the following terms: And the compound noun 'blackboard' as follows: The differences reflect the syntactic structure of the two constituents:
blackboard [[black]A [board]N]N
black board [[black]A [board]N]NP
To discuss the intonational system of English, Liberman (1975), in his doctoral dissertation, proposed a new representation of the phonological hierarchy. He organised segments into groups of relative prominence at different levels, and treated stress as a suprasegmental feature. Stress patterns are then described as a sequence of weaker and stronger constituents, which belong to different domains and finally converge on the syllable level. To represent this organisation of phonological constituents, units are displayed in trees similar to those used by autosegmental phonology.
21 Another simplification is made here. In SPE the most important stress assignment rules are in fact at least two: the CSR, which applies to strings dominated by a lexical category, and the NSR, which applies to strings dominated by a phrasal category.
22 In this and in the following examples I will always assume normal stress. Emphatic stress will not be considered, as it would require further investigation into pragmatics and marked stress patterns.
The metrical tree for the Italian verb compitare 'to spell out' (Image 1.13) is an interesting example of how different stress levels are organised in the representation. As emerges from the tree, constituents belong to different levels and are organised into groups of relative prominence within each level. The tree includes three levels at which strong and weak constituents are juxtaposed: the phonological word (P-Word or ω), the foot (f) and the syllable (σ). The syllable dominated only by strong constituents (all the way up the tree) is called the Designated Terminal Element (DTE), shown in red in the tree, and is the one bearing the primary stress. The geometry of the metrical tree is defined by principles which may differ among authors (as may the layout). For example, in Vogel (1986) a rule states that 'trees have an n-ary ramification', but some theories assume a strictly binary representation.
Image 1.13: Metrical tree for the word 'compitare'
To represent different levels of prominence, the metrical grid is also used.
            X
X           X
X     X     X     X
com   pi    ta    re
The more Xs in a column, the more prominent the syllable is. The syllable with the greatest number of Xs is the DTE. The three levels shown above are not the only ones included in the theory. The need for new levels and units was generally justified when a level appeared to be the domain of application of phonological phenomena. Moreover, as the grid makes explicit, in metrical phonology each level defines an additional level of word accent. For a comprehensive list of levels see Image 1.1423 (adapted from Selkirk, 1986; Vogel and Nespor, 1986).
Image 1.14: Phonological hierarchy24
23 Phonologists may disagree on the arrangement and inclusion of units in the hierarchy. As arguing for the existence of any level is of no interest for this thesis, I will assume that the necessity of every level is justified.
24 In this thesis I will not focus on units from the P-Word upwards. Instead, in the next section I will illustrate the mora and the foot levels, which are important for some syllable-based phonological analyses.

6. Foot, mora and P-Word

As said, vowel lengthening and RS are argued to be triggered by the FOOTBIN25 constraint, which states that feet must be binary at either the mora or the syllable level (McCrary 2002; Prince and Smolensky 1993; Vogel 1982). For example, in the case of 'kasa' vowel lengthening occurs to avoid a FOOTBIN violation. A light stressed syllable lengthens to create a bimoraic heavy syllable, [L] → [H]. In the case of 'pasta', instead, there is no vowel lengthening, and /s/ is therefore assumed to be heterosyllabic. In fact, with /s/ in the coda the syllable is already heavy and does not need to lengthen. The same is true for all the other heterosyllabic clusters, as in [kar.ta] and [al.to]. Word-final stressed vowels do not undergo vowel lengthening because of another rule/constraint which forbids word-final long vowels (Vogel 1982; Chierchia 1982, 1986; Davis 1990). Therefore 'papà' is CVCV, while 'papa' is CV:CV. To satisfy the foot binarity principle, then, the consonant following the stressed vowel is lengthened instead of the vowel itself. As far as syllabification is concerned, the same is true for RS. Clusters argued to be heterosyllabic do not trigger RS, as the heavy syllable is obtained by resyllabifying their first consonant into the coda. In 'città morta' RS doubles the first consonant of the second word. The light stressed syllable violates foot binarity but, vowel lengthening being forbidden word-finally, resyllabification puts /m/ in the coda, which results in the form cit.tàm.mor.ta. On the other hand, in 'città sporca' the syllable with stressed /a/ is made heavy by the resyllabification of /s/: cit.[tàs].por.ca.
7.
Optimality Basic Principles
As shown in 1.3.1, linguistic investigation in SPE aimed "to explicate the system of predicates used to analyse inputs – the possible Structural Descriptions of rules – and to define the operations available for transforming inputs – the possible Structural Changes of rules" (Prince and Smolensky, 2004).
25 The description of these phenomena is simplified but mostly complete according to the cited authors. Others have analysed the same phenomena following different theories (e.g. expressing them in terms of rules) and obtaining different results. However, only this solution is reported in the thesis, due to its fundamental importance for syllable division.
However, well-formedness constraints became crucial in many important works, especially in morphology and phonology, for example Bach and Wheeler (1981), Broselow (1982), Goldsmith (1990) and many others. The place of these constraints in the phonological system and their interaction remained obscure, and it was not assumed that constraints in language are highly conflicting. According to Prince and Smolensky (2004), the first necessary step towards a new theory was to abandon two presuppositions: first, that "it is possible for a grammar to narrowly and parochially specify the Structural Description and Structural Change of rules" (Prince and Smolensky, 2004); second, that "constraints are language-particular statements of phonotactical truth" (Prince and Smolensky, 2004). Instead, they support the idea that the grammar should contain these constraints together with the means of resolving their conflicts. In other words, one of the major innovations of the theory in formalising linguistic processes is the systematic use of constraints instead of rules. One of the most ambitious goals of the theory was to create a set of universal constraints.
The task is of course still unaccomplished, but theoretically possible. In SPE, instead, given that each well-formedness constraint had to be surface-true or at least level-true (remember that the application of a rule was compulsory), it was harder to imagine a universal set of constraints lying in UG. In Optimality Theory, UG provides a set of general constraints. Languages would then differ only in the hierarchy in which such constraints are ranked.

8. Optimality Procedure

In Prince and Smolensky (2004) the procedure is schematically represented like this:
a. Gen(In_k) → {Out_1, Out_2, …}
b. H-eval(Out_i, 1 ≤ i ≤ ∞) → {Out_real}
Gen contains representational principles and their relations: for example, according to syllable theory, a σ always dominates the rime. Given an input, Gen(In_k), Gen generates a number of outputs. The input of H-eval is then constituted by the outputs of Gen, in a number between 1 and infinity. H-eval returns the best candidate according to the set of constraints called CON. To paraphrase the procedure illustrated above:
1. Given an input, a set of possible candidates is generated by the GEN function, in accordance with the representational principles of the units.
2. The EVAL function, following a set of hierarchically ranked constraints (CON), evaluates each candidate.
3. The optimal candidate26, that is the most harmonic one, is then chosen in such a way that a violation of a higher-ranked constraint is always worse than a violation of a lower-ranked one.
Two basic assumptions of OT are, first, that Gen generates candidate analyses for a given input by freely applying the basic structural resources of the theory; and, second, that constraints are typically universal and of general formulation, with potential disagreement among them over the well-formedness of analyses (Prince & Smolensky 2002). Both belong to Universal Grammar and both are simple and general.
9.
Optimality Formalisation
An OT procedure is formalised using a 'tableau'. As an example I will assume a hypothetical language in which the UR /ABCD/ surfaces as [ABC]. Optimal candidates are indicated with a right-pointing hand (☞).
Hypothetical Language 1: /ABCD/ ☞ [ABC]
26 'The degree to which a possible analysis satisfies the set of conflicting well-formedness constraints will be referred to as the Harmony of that analysis […] The actual output is the most harmonic analysis of all, the optimal one.'
We assume CON contains two constraints:
DEP: All segments must be underlying
Con1: C must not precede D
The ranking of constraints is represented with this notation and is generally included for reference before the tableau:
High-Ranked >> Mid-Ranked >> Low-Ranked …
The procedure will then be formalised as follows:
DEP >> Con1
/ABCD/ | | DEP | Con1
i. | [ABCD] | !* |
ii. | ☞ [ABC] | | *
Table 1.8: Hypothetical language 1 tableau
In the first column the input /ABCD/ is indicated, in the second the possible candidates, [ABCD] and [ABC]. The third and fourth columns are more interesting: DEP and Con1 indicate the constraints. If a representation violates a constraint, the corresponding cell is marked with an asterisk (*). If the violation is fatal, which means that it concerns the higher-ranked constraint, an exclamation mark (!) is added before the asterisk. Shaded cells are not relevant to the choice of the optimal candidate. The entire Con1 column is shaded because DEP was violated and hence, Con1 being lower ranked, violations of this constraint are not evaluated by EVAL to choose the optimal candidate. As I said, languages differ only in the ranking of CON, so in a hypothetical language 2 with the same constraints but the reversed ranking, the optimal form would be [ABCD].
Con1 >> DEP
/ABCD/ | | Con1 | DEP
i. | [ABC] | !* |
ii. | ☞ [ABCD] | | *
Table 1.9: Hypothetical language 2 tableau
10.
Syllabification in OT
It is worth noting that the theory was proposed by Prince and Smolensky (1993) using syllabification as a working example. We will do the same, but instead of exploring exotic languages I will use examples taken from Italian and see how OT determines the best syllabified output. In paragraph 1.2 I showed that the universally accepted syllabification for Cl and Cn clusters is tautosyllabic (e.g., pa.dre, li.tro). Two general syllable constraints can account for this syllabification.
-COD (NOCODA): Syllables do not have codas.
COMPLEXONSET: Syllables do not have complex onsets.
The ranking of the two constraints determines whether Cl and Cn clusters are treated as tautosyllabic or heterosyllabic in a language. If we rank COMPLEXONSET above NOCODA (i.e., COMPLEXONSET >> NOCODA) we obtain an erroneous heterosyllabic division for Italian, resulting in syllabifications like pad.re and lit.ro. We need instead the reverse ranking (see Table 1.10). For more complex cases we may need to formalise the Sonority Principle according to Davis (1990).
SD+4: Syllables do not contain clusters with a sonority distance < 4.
To allow a heterosyllabic syllabification of clusters like sC (e.g., pas.ta, kas.ta), SD+4 has to be ranked above NOCODA (Table 1.11) (McCrary, 2002).
NOCODA >> COMPLEXONSET
/litro/ | | -COD | COMPLEXONSET
i. | ☞ li.tro | | *
ii. | lit.ro | !* |
Table 1.10: Tableau for the syllabification of 'litro'
SD+4 >> NOCODA
/pasta/ | | SD+4 | -COD
i. | ☞ pas.ta | | *
ii. | pa.sta | !* |
Table 1.11: Tableau for the syllabification of 'pasta'
SD+4, COMPLEXONSET and -COD are called syllable structure constraints. Applying these constraints to the input of Eval allows the optimal syllable structure to be chosen, for example VC.CV instead of V.CCV. But in many languages phenomena like epenthesis and deletion are structural. Gen then has to generate candidates which include epenthesised and deleted segments27.
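The evaluation of the syllable structure constraints in the two tableaux for 'litro' and 'pasta' can be sketched in Python as a lexicographic comparison of violation profiles: a candidate wins if it has fewer violations of the highest-ranked constraint on which the candidates differ. The sketch rests on several assumptions that are not part of the theory as presented here: candidates are given by hand (Gen is not modelled), the sonority values in `SONORITY` are illustrative only, SD+4 is checked on onset clusters only, and all function names are hypothetical.

```python
VOWELS = set("aeiou")

# Illustrative sonority values (assumed for this sketch only).
SONORITY = {"p": 1, "t": 1, "k": 1, "b": 2, "d": 2, "g": 2,
            "f": 3, "s": 4, "n": 5, "m": 6, "l": 7, "r": 7}


def onset(syllable):
    """Return the consonants preceding the first vowel of a syllable."""
    for i, segment in enumerate(syllable):
        if segment in VOWELS:
            return syllable[:i]
    return syllable


def no_coda(candidate):
    """-COD: one violation per syllable ending in a consonant."""
    return sum(s[-1] not in VOWELS for s in candidate.split("."))


def complex_onset(candidate):
    """COMPLEXONSET: one violation per onset with more than one consonant."""
    return sum(len(onset(s)) > 1 for s in candidate.split("."))


def sd4(candidate):
    """SD+4: one violation per onset cluster whose sonority rise is below 4."""
    violations = 0
    for s in candidate.split("."):
        ons = onset(s)
        violations += sum(SONORITY[b] - SONORITY[a] < 4
                          for a, b in zip(ons, ons[1:]))
    return violations


def evaluate(candidates, ranking):
    """EVAL: return the candidate with the lexicographically smallest profile."""
    return min(candidates, key=lambda c: tuple(con(c) for con in ranking))


print(evaluate(["li.tro", "lit.ro"], [no_coda, complex_onset]))  # li.tro
print(evaluate(["li.tro", "lit.ro"], [complex_onset, no_coda]))  # lit.ro
print(evaluate(["pas.ta", "pa.sta"], [sd4, no_coda]))            # pas.ta
```

Comparing tuples of violation counts in ranking order is one simple way of expressing strict domination: no number of violations of a lower-ranked constraint can compensate for a single additional violation of a higher-ranked one.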
These candidates will then be evaluated against a class of constraints which define the correspondence28 between segments in the input and segments in the output. Constraints of this kind are called faithfulness constraints and determine the relation between output structure and input. We assume that in Italian syllable structure does not force segments to be deleted or inserted29. Two faithfulness constraints, PARSE and FILL, are therefore assumed and ranked higher, the former to avoid deletion and the latter to avoid epenthesis (McCrary, 2002).
27 The candidate generation problem of the Gen module will be further discussed in III.3.
28 The concept of correspondence was formalised by McCarthy and Prince (1995) and will be discussed in more depth in the next section.
29 This does not mean that epenthesis and deletion phenomena do not occur at all in Italian.
FILL: Syllable positions must be filled with underlying segments.
PARSE: Underlying segments must be parsed into syllable structure.
PARSE >> FILL >> SD+4 >> -COD
/studente/ | | PARSE | FILL | SD+4 | -COD
i. | tu.den.te | !* | | |
ii. | es.tu.den.te | | !* | |
iii. | ☞ stu.den.te | | | * | *
Table 1.12: Tableau for the syllabification of the word 'studente'
In the first paragraph we have seen that in some languages codas and onsets are severely limited. For example, in Bouma Fijian loanwords are adapted through epenthesis. In this case, unlike in Italian, FILL is ranked below the structural constraints.
-COD >> FILL
/klok/ | | -COD | FILL
i. | klok | !* |
ii. | ☞ koloko | | *
Table 1.13: Tableau for the syllabification of the word 'klok'
Other constraints might be necessary in Italian, like ONS, which states that 'syllables must have onsets', and HNUC (the nuclear harmony constraint), which specifies that a higher-sonority nucleus is more harmonic than one of lower sonority. However, a complete analysis of the syllable constraints of Italian (or of any language) is still an open and complex problem and unfortunately cannot be treated here.

4 Conclusion
1.
Definitions of syllable
So far I have considered the place of the syllable in phonological theories (mostly generative) and how it is identified. It is now possible to set out the problem of syllable definition more clearly. A good literature review can be found in Cutugno et al. (2001). From a phonological perspective I have shown that systematic studies of the syllable began in the '70s, with the investigation of suprasegmental phenomena. In theories such as autosegmental, metrical and prosodic phonology, the syllable is defined as a phonological unit because it is the domain of phonological processes. Blevins and Goldsmith (1995) echo this aspect by saying: "The first argument for the syllable as a phonological constituent derives from the fact that there are phonological processes and/or constraints which take the syllable as their domain of application." Trubeckoj (1958) had already recognised the syllable as the domain of prosodic phenomena, but in metrical phonology it becomes the building block of the rhythm, prosody, poetic meter and stress patterns of languages. For example, Hooper (1972) says that the syllable "always has a prosodic function – i.e., it is the phonological unit that carries the tone, stress or length".
Fudge (1969) argues that the syllable has both a prosodic and a phonotactic function, and Goldsmith refers to syllables in terms of possible words: "[…] the syllable is a linguistic construct tightly linked to the notion of possible word in each natural language, though not, strictly speaking, reducible to it." Finally, a phonological sonority scale serves as a fundamental principle for syllabification in many theories. Since each phoneme has an intrinsic sonority, it is possible to define the phonological syllable as in Bloomfield (1983): "In any succession of phonemes there will thus be an up-and-down of sonority […] evidently some of the phonemes are more sonorous than the phonemes (or the silence) which immediately precede or follow […] any such phoneme is a 'crest of sonority' or a 'syllabic'; the other phonemes are 'nonsyllabic'." Note that sonority has an important place in the definition of the syllable both in acoustic phonetics, as the energy of the sound wave, and in articulatory phonetics, as a result of the aperture of the vocal tract. In these two phonetic fields, in fact, the definition of the syllable is different and considers other aspects of the concrete realisation of the syllable in speech production.
From an articulatory perspective the syllable was defined by Stetson (1951) as consisting of "a single chest pulse usually made audible by the vocal folds." The same principle is found in Pike (1955), who says that "physiologically syllables may also be called chest pulse." The definition of the syllable by Saussure, Grammont and Sommerfelt also considers physiological observations: "[…] en principe il y a une syllabe nouvelle là où l'on passe d'une tension décroissante à une tension croissante ou là où il y a une interruption dans une série de tensions décroissantes ou croissantes" (Saussure, 1922) [in principle there is a new syllable wherever one passes from a decreasing tension to an increasing tension, or wherever there is an interruption in a series of decreasing or increasing tensions]. Malmberg (1971) highlights, from an acoustic perspective, that the elements of a phonetic sequence attract and influence one another with different degrees of strength. To sum up, it is necessary to distinguish between the concrete expression and the abstract representation of the syllable. For the former, the articulatory and the acoustic dimensions are to be considered. In the first case the syllable is defined as a 'chest pulse' or as a continuous 'puff of air'; in the second, the only studied realisation of the syllable is to be found in the energy of the sound wave. At an abstract, phonological level, different principles have been proposed. Some argue that the syllable is influenced by or completely depends on phonotactics, or define it in terms of attraction between phonemes; others assume a phonological scale of sonority, with different variations and exceptions; others argue that all of these have some place, and so on. How syllable and syllabification are integrated in a phonological system, and their role and function, also vary among theories.

2. Which Syllabification?

I have shown how the concept of syllable itself is debated and has evolved fast over three decades. Different approaches, theoretical assumptions and aims may lead to different definitions of the syllable and therefore to different syllabifications.
Any serious approach to the syllable, however, requires at least one of these theories to be taken into account, and different syllabification techniques may be preferred or allowed in accordance with the chosen theory. As we have seen in paragraph 2, there is no convergence on a unique syllabification principle. However, choosing the best syllabification system for a language is not a totally arbitrary task. The system has to be systematic and organic, and each choice has to be justified and in harmony with the rest of the system. Some syllabification principles may not be suited to particular theories, or may simply not fit together with principles of a different nature. An acoustic approach to syllable division will probably require the sonority principle to be considered, while a phonological syllabification will never be based on orthographic rules. Different syllabification principles are hardly comparable, as no gold standard exists. Some scholars, in particular computer scientists, may overlook this conceptual assumption, which leads to conclusions that evidently need to be revised. In the next chapter I will give two examples of this misunderstanding. On the other hand, I will show that the algorithm I will be implementing takes into account the linguistic problems of syllabification and therefore results in a concrete and organic solution to those problems.

2 Automatic Syllabification

1 Input, Model and Purposes

In chapter I, I analysed various syllabification principles, basing my investigation on both phonological and orthographic forms. However, most Natural Language Processing (NLP) studies are based on raw speech recording data. In fact, in NLP the importance of the syllable became evident when syllabic units appeared to give optimal results in automatic speech recognition and in text-to-speech systems (Laurinčiukaitė and Lipeika, 2006; Ostendorf, 1999; King et al., 1998, 1991).
In linguistics, evidence emerged from various psycholinguistic experiments arguing for the importance of the syllable as a sub-lexical unit in lexical access and in speech segmentation (Willem, Levelt et al., 1999; J Segui et al., 1984, 1991; Cutler and Norris, 1986, 1988). However, an acoustic computational approach to syllable division has to face various problems. The interaction between signal recognition and syllable description may lead to ambiguous contexts, and the manipulation of speech recording data adds a great degree of complexity to the system. In linguistics the study of the phonetic representation of the signal is of particular interest, especially in the field of statistical analysis and contrastive description of language varieties (see chapter III). It is possible to discover correlations between the physical characteristics of the signal and their linguistic properties, how production varies among speakers, coarticulation phenomena and so on. In the first chapter I showed that defining a general and unique syllabification principle is impossible. No gold standard exists, and syllabification techniques, principles and representations may vary within the same theory. However, in order to proceed, it is necessary to make some operative decisions, bearing in mind the purposes of the algorithm. Input structure, model and possible uses of the program are the choices that have to be made before starting to implement any system. Note that these three elements are necessarily related. For example, if you want to develop orthographic syllabification software, the input will be orthographic and the model possibly rule-based.

1. Written or Spoken Language

A first choice then has to be made between two different input structures.
In this study, as I prefer to face phonological problems directly related to syllabification, I will not work directly on recordings; the input of the algorithm will be a sequence of strings.
➢ Raw speech recording: the sound wave is analysed and a syllabification algorithm identifies acoustic syllable boundaries. The syllabified output may be used for speech recognition or for prosodic analysis.
➢ Sequence of strings: the input is made of a sequence of strings.
Another important distinction has to be made between orthographic and phonological data. Most word processors include a syllabification function which divides a document or a word into syllables. The resulting syllabification is orthographic and, as said in chapter I, will probably diverge from the phonological one. The syllabification module implemented in the word processor can look words up in a dictionary, as is generally the case for English, or implement a set of rules for languages in which the syllabification can be obtained automatically from the orthographic form (e.g., Italian). Phonological transcripts of spoken language are a more genuine form to work on. As said in chapter I, the syllable is a phonological unit and there is more interest in studying it if the data are transcripts of spoken language. In this case it is possible to analyse segmental and suprasegmental phenomena, to obtain statistical information on actual syllable usage, and to exploit the resulting syllabification in signal analysis applications. Finally, the recent availability of corpora of spoken language makes this new field particularly interesting.
➢ Written text, such as a journal article, a paper, a novel. This will probably be parsed according to orthographic syllabification rules, and most phonological phenomena (such as pauses and resyllabification) will not be accounted for.
➢ Transcription of spoken language:
This is normally considered by descriptive linguistics the genuine data to work on, and the one for which natural syllabification is most relevant.

2. Transcriptions

Different kinds of transcription exist, but only orthographic, phonetic and phonological ones are likely to be used for syllabification. Syllabification being a phonological phenomenon, it is best performed on phones or phonemes as the units to be parsed. Graphemes do not belong to either of these domains, and a syllabification of graphemes would therefore be of little or no interest in descriptive linguistics. However, it is still possible to exploit orthographic texts for phonological analysis. The syllabification system will then include a module which performs a grapheme-to-phoneme conversion, or it will take a transcribed form as input. Given an orthographic transcription, the procedure will be as follows:
1. Convert graphemes to phonemes
2. Syllabify phoneme segments
3. Convert syllabified segments back into graphemes (optional).
Note that a phonological transcription will always be preferred in this case. Phonological transcripts in fact usually include information useful for syllabification (such as stress and pauses), which a converted text cannot provide in any case. Phonological transcripts can include different information. The transcription, for instance, may be broad or narrow. Generally, it is useful to have as much information as possible. This does not mean that every element of the transcription is relevant to syllabification. A study has to be made in order to identify the relevant information and to account for its role in the syllabification procedure (see chapter IV). The phonetic transcription is the most difficult to process, as it presents additional boundary identification problems similar to the ones that speech recognition has to face.
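Going back to orthographic input for a moment, the three-step procedure listed above (grapheme-to-phoneme conversion, syllabification of the phonemes, optional conversion back to graphemes) can be sketched in Python. Everything in the sketch is a toy assumption: `TOY_G2P` is a deliberately tiny and incomplete conversion table, `syllable_starts` is a stand-in syllabifier that gives each vowel a single onset consonant, and all names are hypothetical. The point is only to show how keeping graphemes and phonemes aligned makes step 3 possible.

```python
VOWELS = set("aeiou")

# Toy grapheme-to-phoneme table (hypothetical, far from complete for Italian).
TOY_G2P = {"ch": "k", "gh": "g", "c": "k"}


def g2p(word):
    """Step 1: return aligned (grapheme chunk, phoneme) pairs."""
    pairs, i = [], 0
    while i < len(word):
        # Prefer a two-letter grapheme when the table knows it.
        chunk = word[i:i + 2] if word[i:i + 2] in TOY_G2P else word[i]
        pairs.append((chunk, TOY_G2P.get(chunk, chunk)))
        i += len(chunk)
    return pairs


def syllable_starts(phonemes):
    """Step 2 (stand-in syllabifier): each vowel takes one preceding consonant."""
    starts = []
    for i, p in enumerate(phonemes):
        if p in VOWELS:
            has_onset = i > 0 and phonemes[i - 1] not in VOWELS
            starts.append(i - 1 if has_onset else i)
    if starts:
        starts[0] = 0  # word-initial consonants stay with the first syllable
    return starts or [0]


def syllabify_orthographic(word):
    """Steps 1-3: syllabify on phonemes, then map the result back to graphemes."""
    pairs = g2p(word)
    bounds = syllable_starts([p for _, p in pairs]) + [len(pairs)]
    return ".".join("".join(g for g, _ in pairs[a:b])
                    for a, b in zip(bounds, bounds[1:]))


print(syllabify_orthographic("anche"))  # the digraph 'ch' is never split
print(syllabify_orthographic("cane"))
```

Because boundaries are computed on the phoneme sequence and only afterwards projected onto the aligned grapheme chunks, a digraph that corresponds to one phoneme cannot be split by a syllable boundary.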
Phenomena like epenthesis and deletion will have to be treated, as well as segment assimilation, modification and so on. Moreover, most of the work on syllabification has been done within generative linguistics, which, as said in chapter I, does not account for non-systematic, performance-related phenomena. To sum up, the possible transcripts to be used as input for the syllable division program are:
➢ Orthographic: an orthographic transcript is converted into phonemes, divided into syllables and, if required, converted back to the original orthographic form. A phonological transcript should be used instead if available.
➢ Phonological: the preferred transcript form. It may contain various phonological information in addition to the sequence of phonemes. A preliminary study has to be made to identify the relevant information and determine its role in syllabification.
➢ Phonetic: this is the most complex one, as various phonetic phenomena may make the sequence difficult to syllabify. Moreover, the traditional syllabification literature assumes the syllable to be a phonological unit. It is therefore difficult to bend syllabification to non-phonological principles.

3. Software Purposes

Depending on the future use of the program, different solutions might be more suitable or even necessary during the development of the software. This is a very important point, but it is usually neglected. A common assumption among computer scientists is that a unique and universal syllabification procedure is possible, and that an algorithm can therefore be described as performing a syllable division task better than another (Marchand et al., 2009a, 2009b; Weerasinghe et al., 2005). While a solution can be faster and more precise within a specific domain, a universal procedure is, as we have seen so far, impossible for various reasons. I will illustrate these problems more fully by the end of the chapter.
What is important to state now is that it is necessary to be aware of the final use of the algorithm before developing it. In fact, as we will see in the rest of this chapter, particular syllabification principles, procedures and computational models are not only better suited to specific tasks but, in certain cases, a necessary choice. In general, three major approaches are possible:

➢ Speaker behaviour investigation: this kind of algorithm should simulate the linguistic behaviour of the speaker, possibly simulating his or her idiosyncrasies as well as shedding light on the psycholinguistic aspects of the problem. Data driven methods are the most suitable for this kind of program.
➢ Theory investigation: in this case we try to implement a phonological theory. The manual application of a theory (e.g. Optimality Theory, see section 1.3.9) to a certain amount of data can be tedious and error prone. An automatic approach makes it possible to analyse a larger amount of data better and faster. It can also be helpful for testing particular aspects of a theory in order to clarify or refute them. We can also obtain statistical results by applying the theory to a corpus, or discover new problems during the development of the program.
➢ Engineering goal: this is the case when we need a syllabification program for a specific engineering or linguistic task, for example dividing orthographic words into syllables to obtain an automatic hyphenator, or syllabifying corpora in order to obtain statistically relevant information and data. In this case, it is possible to give limited importance to the most controversial linguistic debates if they are irrelevant to the final application of the program, and ultimately to bend the theory to suit practical needs. This concept will be further explained in chapter IV, as the program I will develop is basically of this type.

4. Epistemology

From an epistemological perspective, it may be interesting to summarise a long-lasting debate about the nature of language, which will allow me to introduce an important distinction between two broad types of computational models and of linguistic theories. The debate has ancient roots, but in modern linguistics it began in 1957 with the publication of Burrhus Skinner's Verbal Behavior. Explicitly against structuralists (Edward Titchener) and functionalists (James R. Angell), Skinner proposed a theory of language based only on explicit behaviour, that is, on the experience and the production of the speaker. All knowledge is supposed to be given by cognitive connections which are strengthened by positive stimuli and weakened by negative feedback.

Noam Chomsky's (1959) review of Verbal Behavior demolished Skinner's thesis and served as the background of his later generative theory (Chomsky, 1965; Chomsky and Halle, 1968). Experience is not the sole source of knowledge, but constitutes a stimulus for the activation of the parameters of an innate and general faculty of language.

"It seems plain that language acquisition is based on the child's discovery of what from a formal point of view is a deep and abstract theory - a generative grammar of his language." (Chomsky, 1965)

For a couple of decades Chomsky's theories weakened any linguistic work directly based on natural data. Nonetheless, many fields of linguistics, such as phonetics and language acquisition, still required work on speaker production, as demonstrated by table 2.1 (McEnery and Wilson, 2001), which highlights how corpus-based studies multiplied during these two decades.

Period       Studies
to 1965      10
1966-1971    20
1971-1975    30
1976-1980    80
1981-1985    160
1985-1991    320

Table 2.1: Corpora based studies until 1991

During the 90's the criticisms of the Chomskian model became numerous and important.
Many pointed to the lack of precision in the definition of fundamental concepts, such as that of the language faculty: what is this faculty made of? Does it have a corresponding biological structure? What kind of knowledge does it involve? But more conceptual shortcomings have also been pointed out. For example, the ability of the speaker to recognise and exploit statistically systematic phenomena has been demonstrated and theorised in many recent studies (Cleeremans et al., 1998, 1993, 1991). Other works, such as Tomasello's (2005, 2003), argue that the hypotheses about language made by children during the acquisition period result from a statistical extrapolation from linguistic data and are not due to any innate ability. Sampson (2002) criticised the generative introspective methodology, which is based on data which are artificial and impossible to observe.

Finally, many studies on first language acquisition seem to invalidate the fundamental Chomskian thesis of the poverty of the stimulus. According to this principle, an innate faculty of language has to be assumed because children are not exposed to an amount of data sufficient for a language to be learned. Yet many studies have demonstrated that even the rarest syntactic structures are heard with a certain frequency by children, and that the majority of children's incorrect linguistic productions are corrected by the parent or receive negative feedback. Correct sentences, on the other hand, are usually recast and extended by the adult speaker (Bohannon and Stanowicz, 1988; Bohannon et al., 1990a, 1990b; Gordon, 1990). (Markie, 2008; Russel, 2008; Griffiths 2009)

5. Data Driven - Rule Based

In the next two paragraphs I will present two computational models, called data driven and rule based.
The main difference between these two approaches reflects the epistemological debate cited in the previous section: a rule-based model consists of a series of rules and procedures hard-coded in the program, while a data driven model acquires knowledge from a set of given data. The syllabification principles I analysed in chapter I were all based on sets of rules or constraints, whose application to a given input led to the syllabified output. The choice was due to the fact that most of the recent work on the syllable and on syllabification has been based on generative theories. However, the recent development of NLP and computational linguistics has made it possible to implement learning machines which show good results even on real-life tasks, such as speech recognition and text-to-speech systems. Connectionist theories have been implemented as Artificial Neural Networks and widely used in several NLP applications (Kasabov et al., 2002; Dale et al., 2001; Amari and Kasabov, 1998; Kasabov, 1996; Goldsmith, 1994, 1992). The main differences which distinguish the two epistemological assumptions are summarised in table 2.2.

Knowledge
  Rationalist (Generative): The faculty of language is innate, universal, equally present among men and, overall, can be neither learned nor forgotten (but is acquired during defined periods of childhood development). A Universal Grammar and a language organ (which refers to a specific cognitive organ) allow the speaker to develop a specific language.
  Empiricist: Human beings have no language-specific organs. Instead, other developed abilities (mainly the ability to share attention with others and to statistically infer patterns and regularities) allowed the exploitation of complex sign systems such as language.

Learning
  Rationalist (Generative): Few stimuli suffice to activate the proper parameters (this term may differ from theory to theory) in the Universal Grammar.
  Empiricist: A great deal of stimuli is necessary in order to infer patterns from a language.

Production
  Rationalist (Generative): Performance, or speaker production, has little or no influence on language description.
  Empiricist: Speaker language production constitutes the basis of the analysis, as it is what determines the speaker's language ability itself.

Table 2.2: Important differences between rationalism and empiricism

I have highlighted three differences which are particularly important for our analysis of algorithmic models. The first row shows that knowledge for a rationalist is innate [30], while for an empiricist it is acquired through experience. Computationally, we can express this concept by saying that knowledge (in our case concerning the syllable and syllabification) is hard-coded a priori in a rule-based program, while in a data driven program patterns are derived from a training set. In other words, a rule-based algorithm will include in its structure all the information required for that knowledge, while a data driven one will infer it. In the case of syllabification, the programmer will have to code the rules and procedures necessary for an input to be syllabified, such as 'if the sonority of a segment x is greater than the sonority of a segment y, put a syllable boundary'. Data driven models will of course have to include some kind of information too but, unlike in rule-based models, this will not directly concern the process or knowledge under study. The coded algorithm will be more general, and the programmer will have to provide only a set of statistical or simply context-dependent rules and a training set. The application of such rules to the data will result in a set of patterns which will ultimately correspond to the hard-coded rules cited above.

30 This particular school of thought is called innatism, and this is the label most often used to describe generative linguistics. However, being an innatist does not strictly imply being a non-empiricist. Many simplifications are assumed in this exposition.
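To make the idea of hard-coded knowledge concrete, the following is a minimal sketch of a rule-based syllabifier in Python. It is only an illustration, not the program developed in this thesis: the sonority values, the toy segment-to-class table and the boundary rule (a boundary is placed before every segment that is a local sonority minimum) are all assumptions chosen for the example, and letters are treated as if they were phonemes.

```python
# Hypothetical sonority scale (higher = more sonorous); the values are illustrative.
SONORITY = {"O": 1, "F": 2, "N": 3, "L": 4, "G": 5, "V": 6}

# Toy segment-to-class table (assumption); a real system needs a full phoneme inventory.
SEGMENT_CLASS = {
    **{s: "V" for s in "aeiou"},
    **{s: "L" for s in "lr"},
    **{s: "N" for s in "mn"},
    **{s: "F" for s in "sfv"},
    **{s: "O" for s in "ptkbdg"},
}


def syllabify(segments):
    """Split a sequence of segments with one hard-coded rule:
    put a boundary before a segment whose sonority is not higher than
    that of the previous segment and lower than that of the next one."""
    son = [SONORITY[SEGMENT_CLASS[s]] for s in segments]
    syllables, start = [], 0
    for i in range(1, len(segments) - 1):
        if son[i - 1] >= son[i] < son[i + 1]:
            syllables.append("".join(segments[start:i]))
            start = i
    syllables.append("".join(segments[start:]))
    return syllables


print(syllabify("kane"))     # two syllables, boundary before /n/
print(syllabify("sillaba"))  # the geminate is split by the rule
```

All the linguistic knowledge of this sketch lives in the two tables and in the single `if` condition; no corpus is consulted at any point, which is exactly what distinguishes it from a data driven model.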
A first drawback is evident: rule-based models do not require any kind of linguistic data to be developed, while data driven models require a corpus or a collection of specific linguistic data. For some models such a corpus may not be available, and data collection may be more time-consuming than the development of the program itself. The application of the knowledge is similar for the two models. Generally an input is required and, according to the knowledge coded in or acquired by the algorithm, an output is given. In the case of data driven methods the output can be used for further training. Table 2.3 summarises the procedure.

Pre-processing data
  Rule Based: Nothing.
  Data Driven: A corpus or a collection of linguistic data.

Imprinting
  Rule Based: A set of rules (parameters or procedures) is specified by the programmer.
  Data Driven: The learning machine infers knowledge and patterns from the given data using a few statistical operations.

Operations
  Rule Based: Rules or constraints apply to the given input. No other knowledge is incorporated.
  Data Driven: The resulting patterns/knowledge apply to the given input. The output may contribute to strengthening existing knowledge or creating new knowledge.

Table 2.3: Rule based and data driven models

2 Data Driven Models

In this paragraph I will briefly present some data driven models. In the first section I will introduce Artificial Neural Networks, on which the model for Italian syllabification proposed by Calderone and Bertinetto (2006) is based. A small literature review follows, in which other possible implementations are cited with no claim of completeness.

1. Artificial Neural Networks

The term Data Driven has been used to indicate a very broad and general category of computational models. Connectionism belongs to this category. The term connectionism is generally used to define an approach to information processing which is based on the design and architecture of the brain.
In this paragraph I will discuss Artificial Neural Networks (ANN) [31]: a computational model which simulates the structures and functions of biological neural networks. ANNs are useful in linguistics in at least three cases:

➢ when we do not have an algorithmic solution. This is the case of syllable division, as shown in chapter I, where multiple possible syllabifications and principles conflict. In this case ANNs can be used to pick out structures from linguistic data;
➢ when we have noisy data and we want to normalise or generalise them (this could be useful to analyse speech recordings);
➢ when we want to simulate the behaviour of the speaker [32] in certain contexts or phenomena. Again, this is the case of syllabification, as shown in 1.2.8, where empirical evidence showed discrepancies in how cluster division is handled by speakers.

The disadvantages of data driven models are:

➢ ANNs need training to operate;
➢ data collection and the tuning of the ANN may be time-consuming.

31 Not all ANNs are connectionist and not all connectionist models are ANNs.
32 Note that biological neural networks are orders of magnitude more complex than any artificial neural network realised so far. The results obtained are just an idealisation of the cognitive processes possibly involved.

An ANN, and generally a connectionist model, consists of four parts (unit, activation, connection and connection weight), each of which corresponds to a particular structure or process in the biological neural network. Units are generally indicated by circles, while connections are represented as arrows, to indicate the direction of information and therefore distinguish input from output (image 2.1). Units and connections are in an ANN what neurons and synapses are in a biological neural network.

Image 2.1: Simple Artificial Neural Network unit

One of the characteristics of natural neural networks is their simplicity, both in the nature of the signals and in their transmission.
To convey information, each neuron receives an input, which is composed of the electric signals of other neurons, and propagates the output to other neurons if it is strong enough to exceed the threshold at the synaptic cleft. Connection weights among neurons may be excitatory or inhibitory, and stronger or weaker, thus affecting in different ways the amount of action potential transmitted. The huge number of neurons permits the complex operations of the animal brain.

Artificial neural networks work in a similar way. Generally, units compute the input received from other units and, given a threshold, forward it to other units. The synaptic weight differentiates between weaker and stronger, inhibitory (negative value) or excitatory (positive value) connections, generally by multiplying each input before the inputs are summed up [33]. The procedure is schematically presented as follows:

1. the unit (neuron) receives input from the connected units: x1, x2, x3 ... xj;
2. each input is multiplied by the synaptic weight (the strength of the connection): wkj xj;
3. the multiplied inputs are summed up.

Image 2.2: Artificial Neural Network unit

So the resulting value vk of unit k is obtained as the sum of each input multiplied by its synaptic weight: vk = wk1 x1 + wk2 x2 + wk3 x3 + ... + wkj xj. In mathematical terms, it is described by the following formula:

$$v_k = \sum_{j} w_{kj}\, x_j$$

Before this value is fired to other neurons, an activation function squashes it into a range which is generally 0 to 1 or -1 to 1. This ensures that the output of a unit never exceeds the limits of the activation range.

33 This kind of model is called the McCulloch and Pitts model (MCP).

Neurons are organised in various layers; for example, the cerebral cortex is organised into 6 layers. In a common ANN model, called feed-forward network, units are organised in three groups or layers [34] (image 2.3):

➢ the input layer, which receives as input the raw information fed into the network (called bias);
➢ the hidden layer, which computes the input and constructs its representation inside the network itself;
➢ the output layer, which sends the resulting information outside the network.

34 Note that more than one hidden layer may be used.

Depending on the way communication occurs among layers, a first major distinction is possible between two types of networks:

➢ Feedforward: the signal is transmitted from one layer to the next. Recursion within the same layer or back to a previous layer is not allowed (image 2.4).
➢ Feedback: the signal can be transmitted to any layer. This kind of network is dynamic because the signal does not follow a path to an end, as in the feed-forward type, but units keep sending it until an equilibrium is found and activation values do not change anymore.

What makes ANNs interesting, especially in linguistics, is their ability to learn. In fact, even if connection strengths can be manually hard-coded for each unit of the network, ANNs are generally trained before being used. Training a network consists in altering the connection weights of the units until the difference between desired and obtained output is minimal. The cost function measures the mismatch between desired and obtained output. The modification of connection weights usually consists in a variant of the Hebbian learning rule. This basically says that if two units are active simultaneously their interconnection is strengthened according to a learning rate value. The error derivative of the weight (EW) indicates how the error changes as each connection weight is modified. Various algorithms are used to calculate this value, but the most common is the back-propagation algorithm (Rumelhart et al., 1986). Basically, at a first stage the error at the output unit is calculated. The error is then back-propagated to the previous hidden layers and their weights are altered accordingly.
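The unit computation and the weight update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the network studied in the next section: the logistic function is assumed as the activation function, the squared error is assumed as the cost function, and the names `unit_output`, `feed_forward` and `train_step` are hypothetical.

```python
import math


def unit_output(inputs, weights):
    """Weighted sum of the inputs, squashed into the range 0..1
    by a logistic activation function (an assumption of this sketch)."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-v))


def feed_forward(inputs, layers):
    """Propagate the signal layer by layer. Each layer is a list of
    weight vectors, one per unit; no signal ever travels backwards."""
    activations = list(inputs)
    for layer in layers:
        activations = [unit_output(activations, weights) for weights in layer]
    return activations


def train_step(inputs, weights, target, rate=0.5):
    """One supervised update of a single unit: gradient descent on the
    squared error between the target and the obtained output."""
    out = unit_output(inputs, weights)
    delta = (target - out) * out * (1.0 - out)
    return [w + rate * delta * x for w, x in zip(weights, inputs)]


x = [1.0, 0.5]
w = [0.0, 0.0]
for _ in range(100):
    w = train_step(x, w, target=1.0)
print(unit_output(x, w))  # moves from 0.5 towards the target 1.0
```

In a multi-layer network the same `delta` term is computed first at the output units and then propagated back through the hidden layers, which is what the back-propagation algorithm does.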
Image 2.3: Artificial Neural Network with three hidden layers

Image 2.4: Feedforward Neural Network

In general, three learning methods are used: supervised learning, in which both the input and the desired output are given; unsupervised learning, in which only the input is given and the network organises itself; and reinforcement learning, which can be considered an intermediate variant of the two. I will focus on the first method, as it is the one used in the network I will study in the next section. In supervised learning a set of pairs (x, y), each containing an input and the desired output, is given by a teacher. Connection weights are then altered in order to obtain the optimal configuration.

2. Calderone's ANN

Many ANN models have been used for syllable division, such as generic neural algorithms (Oudeyer, 2001), dynamic systems (Laks, 1995), recursive networks (Stoianov and Nerbonne, 1998) and so forth. I will concentrate on Calderone (2006), which proposes a feed-forward neural network trained with back-propagation.

A correlation between syllable structure and phonotactic principles has already been argued for in chapter I. The main assumption of the algorithm is that syllabification is governed by the phonotactic competence of the speaker. Segments which tend to occur together are considered more strongly linked than segments which do not. These associations, which are defined by the phonotactic knowledge of the speaker, then determine syllable structure and the syllabification process. Neighbouring segments are described in terms of attraction, and syllabification as the process which separates segments that are distant from each other and gathers together segments that are strongly attracted. For example, suppose that /r/ and /a/ have a linking value of x. /r/ and /t/, which are less phonotactically related, will have a value y < x.
The speaker's knowledge of these values is determined by his or her exposure to linguistic data and ability to draw statistical inferences. Syllabification results from this ability. The algorithm is based on a small corpus of 83 words [35] for 51 syllable configurations. Each segment was described according to the following phonological classes (mostly manner of articulation): V (vowel), G (glide), L (liquid), N (nasal), F (fricative), O (occlusive) and A (which includes dental affricates and palatal geminates). For example, the phonological classes of the segments in the word cane 'dog' would be the following:

k a n e
O V N V

35 An English and a Spanish corpus are also considered in the paper, but I will concentrate only on Italian.

Each segment was represented by a binary array Vi (where i is the number of the segment in the corpus) of seven positions, made up of the focus segment and of the three segments on its right and on its left:

$$V_i = (v_{i-3},\ v_{i-2},\ v_{i-1},\ v_i,\ v_{i+1},\ v_{i+2},\ v_{i+3})$$

If a position is not occupied by any segment, it is filled with a Null value. The focus segment is also saved as Null. Given the example above for the word cane, this is how the vector of the second segment /a/ would appear if the word were the first element of the corpus. Note that there is only one segment, /k/, before the focus segment; the other positions on the left, as well as the focus /a/, are Null.

V1 = (Null, Null, O, Null, N, V, Null)

The arrays necessary to describe the whole word /kane/ would be these four:

k   V0 = (Null, Null, Null, Null, V, N, V)
a   V1 = (Null, Null, O, Null, N, V, Null)
n   V2 = (Null, O, V, Null, V, Null, Null)
e   V3 = (O, V, N, Null, Null, Null, Null)

As said, the network was trained using supervised learning (see 2.2.1). The phonotactic vectors described above were given as input. For the output, the teacher chose syllabification vectors of the same length and nature as the input (image 2.5).
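The construction of the context windows can be sketched as follows. This is only an illustration of the windowing step: the class labels are kept as symbols, whereas the network receives a binary encoding of them, and the names `context_window` and `phonotactic_vectors` are hypothetical. Null is represented by Python's `None`.

```python
NULL = None  # placeholder for empty positions and for the focus itself


def context_window(classes, focus, width=3):
    """Return the classes of the `width` segments to the left and to the
    right of position `focus`; the focus and the positions outside the
    word are filled with NULL."""
    window = []
    for offset in range(-width, width + 1):
        j = focus + offset
        if offset == 0 or j < 0 or j >= len(classes):
            window.append(NULL)
        else:
            window.append(classes[j])
    return tuple(window)


def phonotactic_vectors(classes, width=3):
    """One window per segment of the word."""
    return [context_window(classes, i, width) for i in range(len(classes))]


# Classes of /kane/: O V N V
for segment, vector in zip("kane", phonotactic_vectors(["O", "V", "N", "V"])):
    print(segment, vector)
```

Each window could then be turned into a binary input by encoding every class label as a one-hot vector; how exactly this is done in the original model is not specified here, so that step is left out of the sketch.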
Image 2.5: Phonotactic and syllabic window

The network was trained using a feedforward network with no hidden layers and a back-propagation algorithm to change the pseudo-random initial connection weights of the units. Once the learning machine is trained, the ANN is able to determine the attraction values of an input sequence. The syllable is then composed of groups of segments which are strongly attracted to one another. For example, the word sillaba 'syllable' can be represented as governed by the following attraction values (image 2.6).

Image 2.6: Attraction values for the word 'sillaba'

The system obtained a 99.2% syllabification accuracy on the input contained in the training set but, most importantly, it showed its most interesting results on clusters that were not included in the training set. For example, the learning machine was not directly trained to handle sC clusters. Given such a cluster as input, the output clearly reflects its ambiguous behaviour (see chapter 1) and the reason why it can be treated as either tautosyllabic or heterosyllabic by speakers. There is a very small difference between the two attraction values, and therefore the /s/ can be parsed either in the coda or in the onset of the following syllable. Image 2.7 shows the attraction values of the word pasta:

Image 2.7: Attraction values for the word 'pasta'

3. Look-up Procedure

Not all data driven methods are ANNs. For example, another commonly used method, which derives from automatic pronunciation systems, consists in building a table (a lattice or a database) containing the phonotactic contexts of the words in the training set and then comparing the input with the table in order to have it syllabified. Generally, the form of the units in the table and the matching function distinguish the algorithms of this kind. The three algorithms I will present here are all instances of lazy learning.
In artificial intelligence, lazy learning (Atkeson et al., 1997) is distinguished from eager learning because it defers the processing of the examples until a query is made to the system. One of the outstanding advantages of lazy learning is the ease with which algorithms can be transferred to new tasks. In fact, all three methods studied here were originally designed for automatic pronunciation but are readily modified to perform syllabification.

One of the first models was proposed by Weijters (1991) as an automatic pronunciation system and then adapted for syllabification by Daelemans and van den Bosch (1997, 1992a, 1992b). The training set is constituted by syllabified words, which are stored in a look-up table in the form of N-Grams [36]. Each N-Gram is constituted by a focus character and by its left and right context. The 'N' in N-Gram indicates the length of the gram. For example, to allow each character to be a focus, the six 4-Grams of the word <kidney> would be:

<-kid>, <kidn>, <idne>, <dney>, <ney->, <ey-->

Each N-Gram is stored in a table together with its juncture class, which specifies whether or not there is a syllable boundary on the focus character. To syllabify an input, each entry in the look-up table is compared with the input. A match value is then assigned to each N-Gram depending on how similar its context is to the input. Note that context positions are weighted: this means that not every position in the N-Gram affects the match value in the same way. Generally, the focus character weighs more than the right context, which weighs more than the left context. With the algorithm, 15 sets of weights are given and stored in a table (Daelemans and van den Bosch, 1992).

The following is the algorithm as it appears in Weijters (1991); it describes how two N-Grams, the input (NgramT) and the one in the look-up table (NgramS), are compared. Each character in the input NgramT is compared with the corresponding one in the table entry NgramS.
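The windowing and the weighted comparison just described can be sketched in Python as follows. This is an illustrative sketch, not the original implementation: the weights, the padding character and the boolean juncture flags are assumptions made for the example (here a flag is taken to mean that a boundary follows the focus character), and the function names are hypothetical.

```python
def ngrams(word, left=1, right=2, pad="-"):
    """One N-gram per character of the word: `left` characters of left
    context, the focus character and `right` characters of right context."""
    padded = pad * left + word + pad * right
    size = left + 1 + right
    return [padded[i:i + size] for i in range(len(word))]


def match_value(weights, ngram_t, ngram_s):
    """Sum the weights of the positions in which the two N-grams agree."""
    return sum(w for w, t, s in zip(weights, ngram_t, ngram_s) if t == s)


def best_match(weights, ngram_t, table):
    """Return the (ngram, juncture) entry of the table closest to ngram_t."""
    return max(table, key=lambda entry: match_value(weights, ngram_t, entry[0]))


# Hypothetical weights: the focus weighs more than the right context,
# which weighs more than the left context.
WEIGHTS = [1, 4, 2, 2]

# Toy look-up table built from <kid|ney>; the juncture flags are illustrative.
TABLE = list(zip(ngrams("kidney"), [False, False, True, False, False, False]))

print(ngrams("kidney"))
print(best_match(WEIGHTS, "midn", TABLE))
```

Because nothing is computed until a query arrives, adding new training words only means appending entries to `TABLE`, which is the practical advantage of lazy learning mentioned above.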
If two characters are identical, the MatchValue is increased by the weight of its context position.

FindMatchValue(weights, NgramT, NgramS)
    MatchValue := 0
    for i := 1 to length(weights) do
        if (NgramT[i] = NgramS[i]) then
            MatchValue := MatchValue + weights[i]
        end if
    end for

36 An n-gram is a subsequence of n items from a given sequence.

The N-Gram with the highest MatchValue will then be used to syllabify the input, according to its juncture class. For example, for the word midnight the closest N-Gram will be <kidn>, as it differs from <midn> by only one character, which moreover lies in the leftward context (the position with the lowest weight). As a juncture class is indicated in the look-up table after the <d> in <kidn> (highlighted in the corresponding image), the syllable boundary is placed in the same position in the input, thus resulting in <mid|night>.

The look-up procedure was modified by Daelemans, van den Bosch and Weijters (1997). Basically, the procedure remains the same, but the weights are not pre-defined. Instead, each weight is calculated with a function which determines how much a position contributes to the placement of the syllable boundary.

3 Rule based Models

This section will present some rule-based syllabification systems. I will start by showing how it is possible, within a single theory (OT), to adopt different solutions and obtain different results. Then I will analyse a program which tries to integrate OT and autosegmental theory. I will finally consider an Italian syllabification algorithm based on the SH principle.

1. Computational OT

The main problem an OT-based algorithm has to face is how to implement the OT generation component. Potentially, Gen could create a huge set of candidates if epenthesis and deletion are considered.
For example, if epenthesis is assumed in the generation of the candidates for the three-segment word 'pin', epenthetic segments could go in each of the gaps shown here:

_p_i_n_

For a word of n segments, 2^(n+1) candidates would have to be generated; in our example, 2^(3+1) = 16. The same is true for deleted segments: the candidate set for the same word would be the following: pin, pi, pn, p, in, n, i. But the two phenomena have to be considered together. Table 2.4 (Hammond, 1997) illustrates the size of the candidate set for each number of segments, with the two phenomena considered alone and together.

segments   epenthesis   deletion   both
1          4            1          4
2          8            3          16
3          16           7          52
4          32           15         160
5          64           31         484
6          128          63         1456

Table 2.4: Number of candidates if epenthesis and deletion are considered by Gen

Each candidate will then have to be multiplied by the number of its possible syllabifications, resulting in an enormous candidate set to evaluate. Various solutions to the problem have been adopted in the literature, which will be analysed in the following sections. According to Hammond (1997), either a generator or a parser can be implemented:

'A generator would map input forms onto properly syllabified output forms. A parser would take output forms and map them onto input forms.'

In other words, a generator would produce the set of candidates required by the syllabifier and evaluate them in the same module. A parser will instead take the already generated form as input (epenthesis and deletion do not have to be considered by the parser, which parses an already generated form), avoiding the problem of an oversized candidate set. Implementing the parser is then easier, as its generator component has to generate only syllabified candidates. In the case of the parser, the syllabified output also has to be checked against a smaller number of constraints, as faithfulness constraints become redundant. Nonetheless, even for parsers the problem of big candidate sets has to be resolved.
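The growth of the candidate set can be checked with a short Python sketch. The counting scheme is an assumption of mine that happens to reproduce the figures of Table 2.4: at most one epenthetic slot per gap, at least one input segment kept after deletion, and, for the combined case, every kept subsequence paired with every pattern of filled gaps. The function names are hypothetical.

```python
from math import comb


def epenthesis_candidates(n):
    """Each of the n + 1 gaps around n segments is either filled or empty."""
    return 2 ** (n + 1)


def deletion_candidates(n):
    """Every non-empty selection of the n segments is a candidate."""
    return sum(comb(n, k) for k in range(1, n + 1))


def both_candidates(n):
    """Every non-empty selection of k segments, combined with every
    pattern of filled gaps around those k segments."""
    return sum(comb(n, k) * 2 ** (k + 1) for k in range(1, n + 1))


def syllabifications(n):
    """Each of the n - 1 positions between segments may host a boundary."""
    return 2 ** (n - 1)


for n in range(1, 7):
    print(n, epenthesis_candidates(n), deletion_candidates(n), both_candidates(n))
```

Under these assumptions the combined count simplifies to 2(3^n - 1), so the candidate set grows exponentially even before each candidate is multiplied by its possible syllabifications.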
In the case of four phonemes, there are 8 possible syllabifications: XXXX, X.XXX, XX.XX, XXX.X, X.XX.X, XX.X.X, X.X.XX, X.X.X.X. Supposing that there are 6 syllabification constraints, we will have at least 6 × 8 = 48 evaluations to perform.

2. Hammond's Algorithms

The first program I will analyse is Hammond's (1995). Hammond's program is a parser, so he avoids the faithfulness problem by assuming that it has already been treated by another module (and therefore ignoring it). To further reduce the syllabified candidates, Hammond uses what he calls local programming. The Eval module of his program analyses only one segment at each cycle, evaluating as possible candidates for the segment only four states: o (onset), n (nucleus), c (coda) and u (unsyllabified). For the word 'apa', there would be a set of only 4 × 3 = 12 candidates, as shown in image 2.8.

Image 2.8: Hammond's candidate encoding for the word 'apa'

The program aimed to simulate, in an Optimality approach, the differences between English and French syllabification. This is an important point because, unlike his second algorithm, it did not aim to be universal, even within the two languages. Given an input, each segment is parsed in a linear fashion from the leftmost character to the end. At each step the possible values (candidates) are assigned to the segment, until the syllabification is reached. The relevant constraints indicated by Hammond to highlight cases of different syllabification in French and English are the following:

PARSE     all segments must be syllabified
NOONSET   stressless syllables do not have an onset
ONSET     syllables have an onset

The three are ranked differently in English and French.
The former ranks:

PARSE >> NOONSET >> ONSET

while the latter ranks:

PARSE >> ONSET >> NOONSET

The constraints are coded as follows:

PARSE → &parse eliminates 'u' if other parses are available
ONSET → &onset eliminates 'c' as an option if the current segment is a vowel
NOONSET → &nonoonset eliminates 'o' as an option for the preceding segment if the current segment is a stressless vowel

The algorithm works as follows [37]:

1. A CV skeleton is first generated:

a → V

2. Candidates for each segment are generated:

V: o n c u

which results in the following tableau:

/x/      PARSE   ONSET   NOONSET
i.   o
ii.  n
iii. c
iv.  u

Table 2.5: Example of an unparsed Hammond's tableau

37 For each algorithm presented in this section I will describe only the operations relevant to syllabification. Functions that remove stress, verify correct inputs or allow various parameters to be chosen are ignored.

3. A first set of 'housekeeping' constraints is applied. This includes constraints such as:
a) vowels cannot be onsets or codas
b) consonants cannot be nuclei
c) word-initial consonants cannot be codas
d) word-final consonants cannot be onsets
...

4. The violation of a) eliminates c and o as possible candidates:

V: u n

5. The next step eliminates u if a constraint applies, which is the case here:

V: n

6. Other specific constraints will then apply; we will see how with the next segment, as this one is already syllabified as a nucleus.

7. Finally, the segment is converted into the corresponding phoneme (V → /a/) and the constraint evaluation is re-applied:

a: n

The following segment /p/ is more interesting, as it is an intervocalic consonant and will be syllabified differently in English and French.
Steps 1, 2, 3c and 4 will result in the following segment:

    co
    C

&donoonset cannot apply yet, as it needs to consider the following vowel. We will then have the following sequence:

    co
    p

If we assume that /apa/ is an English word, the algorithm evaluates the candidates according to the constraints shown above for English (5a):
➢ &doparse
➢ &donoonset
Otherwise (5b) will apply:
➢ &doparse
➢ &doonset
➢ &donoonset

In French ONSET is ranked higher (PARSE >> ONSET >> NOONSET); therefore &doonset applies first and 'c' is eliminated, resulting in the following syllabification:

    n o n
    a p a
    a.pa

In the case of an English word, NOONSET is ranked higher (PARSE >> NOONSET >> ONSET); &donoonset applies first and eliminates 'o' as an option. The resulting syllabification is the following:

    n c n
    a p a
    ap.a

Hammond's second algorithm (1997) aimed to describe English syllabification entirely. This made it necessary to overcome some limitations of the previous proposal, since the linear approach was in fact insufficient. The problems were solved by introducing two important changes. First, the programming shifted to a declarative approach: Perl, which is an interpreted language, is replaced by Prolog, which allows a set of relations to be stated and the constraints to apply simultaneously to the input. Second, concerning the candidate set problem, a cyclic CON-EVAL loop does not permit constraints to be evaluated if a higher-ranked constraint is violated first. For example, if we assume a set of 10 candidates to be evaluated against 5 constraints, we will need 50 evaluations to fill the whole tableau:

Table 2.6: Number of evaluations for a 10X5 tableau

In Hammond's program, once a constraint eliminates a candidate (because of a violation of a higher-ranked constraint), the remaining cells are shadowed and not computed.
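The saving obtained by skipping shadowed cells can be illustrated with a small Python sketch. It is not Hammond's Prolog implementation: the three toy constraints, their ranking and the dot-separated candidate strings are assumptions made only for this example. The function counts how many candidate/constraint cells are actually computed when candidates with a fatal violation are dropped before lower-ranked constraints are evaluated.

```python
from typing import Callable, List, Sequence, Tuple

VOWELS = set("aeiou")  # assumption: a toy vowel inventory


def one_nucleus(candidate: str) -> int:
    """Violations: syllables that do not contain exactly one vowel."""
    return sum(1 for syl in candidate.split(".")
               if sum(ch in VOWELS for ch in syl) != 1)


def onset(candidate: str) -> int:
    """Violations: syllables that begin with a vowel."""
    return sum(1 for syl in candidate.split(".") if syl[0] in VOWELS)


def no_coda(candidate: str) -> int:
    """Violations: syllables that end in a consonant."""
    return sum(1 for syl in candidate.split(".") if syl[-1] not in VOWELS)


def evaluate(candidates: Sequence[str],
             ranking: Sequence[Callable[[str], int]]) -> Tuple[List[str], int]:
    """Filter candidates constraint by constraint, highest ranked first.

    Returns the surviving candidates and the number of cells computed.
    """
    survivors = list(candidates)
    cells = 0
    for constraint in ranking:
        scores = [constraint(c) for c in survivors]
        cells += len(survivors)
        best = min(scores)
        # Candidates that do worse than the best incur a fatal violation
        # and are never evaluated against lower-ranked constraints.
        survivors = [c for c, s in zip(survivors, scores) if s == best]
    return survivors, cells


if __name__ == "__main__":
    candidates = ["apa", "a.pa", "ap.a", "a.p.a"]
    ranking = [one_nucleus, onset, no_coda]  # hypothetical ranking
    winners, cells = evaluate(candidates, ranking)
    print(winners, cells, len(candidates) * len(ranking))
```

With the four candidates for 'apa' and three constraints, a full tableau has 12 cells, whereas the filtered evaluation computes 7 and leaves a.pa as the only survivor under this toy ranking.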
Table 2.7: Reduction of the number of evaluations using fatal violations

Constraints are implemented as in the previous algorithm, and their application results in the pruning of possible candidates [oncu]. The rules are represented as formal statements saying that a structural option alpha ([oncu]) is removed from an element of type X [38] (image 2.9):

Image 2.9: Hammond's second algorithm rule formalisation

Once the input is given, it is converted into a Prolog list. For example, the word /apa/ would be converted to the following:

    [a,p,a]

[38] As I said, this algorithm aimed to be universal and to allow for a more exhaustive description of syllabification. Constraints are more numerous than in the previous example and better generalised. Here I give only a contrastive example to show the main differences from Hammond's first solution.

As in the previous algorithm, Gen pairs each element with the candidate set:

    [a/[o,n,c,u],p/[o,n,c,u],a/[o,n,c,u]]

This could be represented as follows:

    a p a
    o o o
    n n n
    c c c
    u u u

Each constraint then prunes away possible candidates. The housekeeping constraints seen above result in the following grid:

    a p a
      o
    n   n
      c

Two constraints may apply now, ONSET and NOCODA. In English the former is higher ranked. Hence 'c' is pruned away and the resulting syllabification grid is:

    a p a
      o
    n   n

3. Other OT Implementations

Other approaches have handled the potentially infinite candidate set problem differently. One of the first attempts to implement OT computationally was made by Black (1993). However, the model did not strictly simulate OT, but was rather inspired by it. The major differences concern the Generator component, which created candidates according to a set of rules. Constraints, then, operate more or less as repair-mechanism triggers.
Ellison (1995) implemented the model using automata, representing the output of Gen and the constraints with regular expressions. The models proposed by Eisner (1997) are similar to Ellison's. Tesar's proposals (1995, 1998) are based on the technique of dynamic programming. To explain this model, suppose that you have to go from point A to point Z. You can divide the path in two, and say that in each half there are three paths. A dynamic programming approach would split the problem: it would first calculate the distance to the three points in the first half, save the results in a table, and then proceed to solve the rest of the problem using the saved results. In Tesar's model the input is thus parsed segment by segment, considering the best candidate for the previous segment and its structural position.

The works I have discussed so far either do not treat autosegmental representations at all (Tesar, 1995a; Hammond, 1995, 1997) or do so in a cursory way (Ellison, 1995; Eisner, 1997). Heiberg (1999) instead developed a program which implements OT tied to the autosegmental representation. An object-oriented approach was used so that, as put by MacLennan (1986) for object-oriented programming, "the code is organized in the way that reflects the natural organization of the problem". This allows the code to reflect the theory and therefore to be easily modified in order to experiment with various kinds of simulations. The model was not designed for syllabification, but an implementation of such a process is possible and extremely interesting.

4. Cutugno et al. (2001)

The previous algorithms constituted attempts to implement a phonological theory computationally, in particular autosegmental theory and OT. However, another approach might be based only on some of the syllabification principles proposed in the first chapter and on their application to the input in a linear fashion. The most interesting algorithm is to be found in Cutugno et al.
(2001), most importantly because its purpose is very similar to that of the program I am trying to realise. The algorithm is based on the SSP and on the same SH I will use (with some modifications) for my program. The algorithm was designed to syllabify a portion of AVIP, a spoken language corpus labelled in a way very similar to CLIPS (see chapter III). The corpus consists of a collection of recordings with time-aligned phonetic and phonemic transcriptions. These layers were syllabified using the algorithm and then compared with an automatic syllabification done on the signal. The pseudocode used to describe the algorithm is the following:

    ASSIGN a sonority value to each phoneme.

    {find least sonorous segments}
    FOR EACH phoneme
        IF (sonority is less than the preceding phoneme sonority) AND
           (sonority is less than or equal to the following phoneme sonority)
        THEN: the phoneme is a least sonorous segment
    END FOR

    FOR EACH least sonorous segment
        {a sonorant not followed by a vowel}
        IF phoneme sonority > 9 AND the following phoneme sonority < 18
        THEN: it is the preceding syllable end
        ELSE: it is the beginning of the following syllable

First it parses the input and gets the least sonorous phones/phonemes of the sequence. Then, for each of these segments, it puts a syllable boundary after the segment in the case of a sonorant not followed by a vowel, or before the segment everywhere else (for examples and discussion see 4.2.3).

4 Conclusion

Several studies state a percentage indicating syllabification accuracy. This is the case of Marchand (1999), who compares different syllabification algorithms to demonstrate that data-driven models are better suited than rule-based syllabification systems. However, as shown so far in this chapter, principles are likely to conflict, and using dictionaries as a gold standard is probably not a relevant parameter for arguing that one syllabification algorithm is better than another.
Data-driven algorithms are trained with the syllabified words that will be used for the comparison, so the fact that they obtain better results is obvious. For example, Marchand (1999) argues that Hammond's algorithms can correctly syllabify only 30% of words. But Hammond's implementation is based on OT, whose application may yield results fairly different from any syllabification given in a dictionary. The same is true when he considers Fisher's implementation of Kahn's theory (Kahn, 1976). The output should instead be compared with the result obtained by a human doing the syllabification by hand and following the same principles (as in Weerasinghe et al., 2005). In this case the performance and accuracy of an algorithmic solution can be tested, but not the accuracy of the syllabification itself. Different algorithms may be used to obtain different results, and therefore the algorithm must be based on the principles which best reflect the primary use of the software.

3 CLIPS

CLIPS (http://clips.unina.it) is the largest corpus of spoken Italian ever collected [39]. It contains more than 100 hours of recording and 1 million words, totalling more than 20GB of storage space (1). It is annotated and balanced to give broad dialectal coverage (Savy and Cutugno, 2009). Unlike many other corpora collected for specific purposes, CLIPS aims to give a general representation of Italian. A detailed socio-economic and sociolinguistic analysis (2) has been made to obtain a corpus representative of Italian, with the full understanding that Italian is notorious for its peculiar diatopic variability (Lepschy, 1977; Bruni, 1992). In fact, the language may differ greatly from region to region, and the standard language is hardly spoken even on national television.
CLIPS is also structured into five sub-corpora, for diamesic and diaphasic variation, and includes time-aligned orthographic, phonemic, phonetic and sub-phonetic labelling of the recordings (Savy and Cutugno, 2009). One of the main purposes of the corpus was to provide a resource which could be used for statistical and probabilistic language analysis, especially in the field of speech processing applications. For this reason, particular attention was given to the phonetic correlation between data representation and acoustic signal.

Sub-corpus     Diatopic   Textual                                      Units         Transcription   Labelling
Dialogic       15 sites   map task, spot the difference                120+120       30%             10%
Read speech    15 sites   read sentences, word list                    90+180        30%             10%
Radio and TV   15 sites   broadcast, talk show, commercials, culture   333RD+240TV   30%             10%
Telephonic     15 sites   Auto, Woz                                    1077+7628     100%            3,5%
Orthophonic    standard   read sentences, word list                    2400+1200     100%            16%

Table 3.1: CLIPS corpus summary (Savy and Cutugno 2009)

[39] In fact, the corpus I refer to is only a part of a larger project whose name is CLIPS. For simplicity, I will keep using this name to indicate only 'the spoken language corpus of the CLIPS project'.

1 Transcription

1. Transcription Principles

The importance of corpus transcription has been argued by various authors (Gibbon et al., 1997; Ide, 1996). A corpus of spoken language which contains only raw speech recordings can be used only for a limited number of applications. Transcribing the recordings in fact drastically increases the possibility of exploiting the corpus for studies that would otherwise be particularly time-consuming or even impossible. On the other hand, the main drawback of corpus transcription lies in the fact that it requires a great amount of human work to be done.
For this reason, it was decided to transcribe only a portion of each corpus, but an amount sufficient for statistically based analysis and for application support, since this kind of approach is possible only when a large amount of structured data is available for comparison.

Transcribing a recording implies an encoding operation, which means giving a permanent representation and an interpretation of the raw data (i.e., speech recordings). Ensuring a unique transcript guarantees that any study made on it will be reproducible, comparable and consistent with other studies made using the same corpus transcripts. Just as the raw data of a corpus has to be kept unaltered, so that a change in the corpus itself does not lead to different results, the same is true for transcriptions. By providing a unique transcription standard, that is, by always using the same set of graphic symbols and procedures to describe a phenomenon, the representation of the data will always be the same. The principle of consistency is to be applied also on technical grounds. It is in fact recommended to refer to a single set of symbols and transcription procedures, to keep the research consistent and organic and to save researchers time. For the same reason, it is recommended to adopt already well-known and widely accepted standards from other corpora. CLIPS was created after the experience of ATIS, SPEECHDAT, POLYPHONE, PHONDAT and VERBMOBIL (Kohler et al., 1995) and of other Italian corpora such as AVIP and API. In particular, VERBMOBIL was used as a basic reference because it is similar to CLIPS in 'purposes, materials and procedures' (Savy, 2007). The CLIPS transcription design was mainly based on Edwards' (1993) principles of category design, readability and computational tractability.
According to Edwards, categories must be:
➢ systematically discriminable: for every case in the data it has to be clear whether a category applies or not;
➢ exhaustive: for each case in the data there must be a category which applies;
➢ systematically contrastive: each category must determine the boundaries of the other categories.

The principle of readability states that a transcription, to be readable, has to satisfy these conditions:
➢ the temporal sequence of the events has to be reflected in the spatial sequence of the text;
➢ similar events are to be kept spatially close to each other, and qualitatively different events visually separate;
➢ prerequisite information for the understanding of an event has to be placed following a logical priority;
➢ categories are encoded in an iconic way, so that a human reader can easily recover their meaning.

Finally, the principle of computational tractability states that the encoding has to be systematic and predictable.

2. Annotated Transcription

So far I have talked about corpus transcription in general. To be more precise, CLIPS is distributed with annotated transcripts of the recordings. This means that the transcripts not only consist of lexical information but also contain labels used to describe semi-lexical, non-lexical and non-vocal phenomena. As said in paragraph 3.1.1, one of the main purposes of the project was to obtain a corpus that could be used for the automatic processing of the acoustic signal. For this reason, among all the phenomena that could be described and annotated, it was decided to focus only on those that altered or interfered with the acoustic signal itself (Savy, 2007). All symbols used in the annotated transcripts are listed in the following tables (3.2 - 3.6). Note that all transcribed words are lowercase, except for acronyms (all capitals) and proper names (first letter capital).
To transcribe a sequence of letters (as in the case of acronyms), if the letters are pronounced as a word (e.g., AVIP pronounced as /'avip/), every letter is transcribed between slashes (/A/ /V/ /I/ /P/). In the case of spelling pronunciation (AVIP pronounced in Italian /a/ /vu/ /i/ /pi/), each letter is transcribed in its spelling form, so AVIP is transcribed A-Vu-I-Pi. Finally, any other comment by the operator concerning an alteration of the acoustic signal is added between square brackets, and its extent is indicated by braces. For example, in the case of a sentence in dialect, the comment [dialect] and the target sequence are included between braces, as in the example:

    {[dialect] ka ditto ?}

If only one element of the transcription has to be described by the comment, the comment is added between square brackets just after the element itself and no braces are required. For example, in the following case only the word guagliò 'guy' is indicated as a dialectal form:

    ho detto guagliò [dialect]

Symbol      Phenomena                                 Example
+           Uncompleted words (disfluencies)          non lo vedo → non lo ve+
_           Word-internal interruptions               mon_tato
*           Lapsus linguae, pronunciation errors      altalenante → altanelante
/           False starts                              ma tu / dove sta la figura? [40]
<unclear>   Unclear word or sequence                  ho <unclear>
?           Interrogative sentence                    vieni ? [41]
!           Exclamation                               vieni !
,           Semantic/syntactic boundary               no , non mi sembra
/LETTER/    Phonetically pronounced acronyms          /A/ /V/ /I/ /P/
-           Spelled acronyms                          A-Vu-I-Pi

Table 3.2: Semi-lexical phenomena

Symbol              Phenomena                          Example
<sp> or <lp>        Short or long pause                vedi <lp> la macchina?
<P>                 Long pause, ends an utterance      ma tu <P> no, vabbè [42]
<eeh> or <ehm>      Full pause and full nasal pause    la <ehm> macchina
<CC> or <VV>        Final segment lengthening          allora<aa>...; con<nn>
<cc>                Word-initial segment lengthening   <ss>sì

Table 3.3: Non lexical phenomena

Symbol                                        Phenomena
<eh>, <ah>, <mh>, <mhmh>, <ahah>, <'mbè>      Assent labels; <'mbè?> is used to ask a question, like English 'so what?'
<oh>                                          End and beginning of a sub-task (DG)
<ah!>, <oh!>, <eh!>                           Exclamation

Table 3.4: Interjections

[40] but you … where is the picture?
[41] Note the blank space between word and symbols (?!,)
[42] but you... no, right

Symbol                                                                            Phenomena
<laugh>, <cough>, <breath>, <inspiration>, <tongue-click>, <clear-throath>       Non verbal phenomena
<vocal>                                                                           Other non verbal phenomena
<i.talkers>                                                                       Background voice noise
<NOISE>                                                                           Non vocal noise
<MUSIC>                                                                           Background music (RD and TV)
#TURN#                                                                            Turn overlapping
{TURN}                                                                            Turn overlapping

Table 3.5: Non verbal and non lexical phenomena

Symbol                    Phenomena
[dialect]                 Dialect sequence
[foreign word]            Foreign word
[screaming] and others    Other comments

Table 3.6: Operator comments

As the temporal continuity of the signal had to be segmented and further labelled with reference to the spectrogram (see paragraph 4), the annotations indicate every audible phenomenon, overlapping speaker utterances included. If one acoustic event overlapped with another, the two were indicated between braces, the overlapping segment on the left and the overlapped sequence on the right.

1. no deve andare verso la sinistra del foglio <sp> cancella e vai verso {<laugh> sinistra}
2. fatto questo<oo> {<NOISE> <lp>} sei arrivata

In 1. the word sinistra 'left' is said while laughing. In 2. a noise is present during a long pause of the speaker.
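Because the conventions in tables 3.2 - 3.6 are systematic, an annotated transcript line can be split into typed tokens with a few regular expressions. The following Python sketch is only an illustration: the category names ('tag', 'comment', 'brace', 'punct', 'truncated', 'interrupted', 'word') and the function name are hypothetical and are not part of the CLIPS specification, and only a subset of the symbols is handled.

```python
import re
from typing import List, Tuple

# Alternatives are tried left to right at each position of the line.
TOKEN_RE = re.compile(
    r"(?P<tag><[^<>\s]+>)"               # <sp>, <lp>, <laugh>, <NOISE>, ...
    r"|(?P<comment>\[[^\]]+\])"          # [dialect], [foreign word], ...
    r"|(?P<brace>[{}])"                  # extent of a comment or overlapping event
    r"|(?P<punct>[?!,/])"                # ? ! , and the false-start slash
    r"|(?P<word>[^\s<>\[\]{}?!,/]+)"     # any other run of characters
)


def tokenize(line: str) -> List[Tuple[str, str]]:
    """Return (category, text) pairs for one annotated transcript line."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind, text = match.lastgroup, match.group()
        if kind == "word":
            if text.endswith("+"):
                kind = "truncated"       # uncompleted word, e.g. ve+
            elif "_" in text:
                kind = "interrupted"     # word-internal interruption, e.g. mon_tato
        tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    for token in tokenize("no , non lo ve+ <sp> {<laugh> sinistra}"):
        print(token)
```

A tag attached to a word, as in allora<aa>, comes out as a word token followed by a tag token; characters that match none of the alternatives are silently skipped in this sketch.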
In the case of turn overlapping (i.e., two speakers talking at the same time), the following notation was used in both speakers' turns:
➢ a hash (#) indicates the beginning of the overlapping portion;
➢ the hash is followed by the turn indicator between angle brackets <> (e.g., <F#8> means follower, turn 8);
➢ the overlapping sequence is transcribed (<lp> sulla);
➢ another hash (#) indicates the end of the overlapping sequence;
➢ the same is done in the transcription of the other speaker's turn.

    p1G#7: #<F#8> <lp> sulla# sinistra <sp> c'è scritto fiume
    p2F#8: #<G#7> no# <lp> non c'è

This example indicates that, while speaker 1 produces a long pause followed by the word sulla, speaker 2 says the word no.

3. Transcription Procedure

The first operation was to identify a segmentation unit which could make the transcription easy to access, describe, codify (annotation and labelling) and consult. Units reflect the characteristics of each sub-corpus item, which means that not all sub-corpus recordings are divided into portions of the same nature (table 3.7). Each transcription is stored in a TXT file (Unicode) [43].

Corpus          Transcription Unit
Dialogic        Dialogue
Radio and TV    Transmission
Telephonic      Call
Read speech     List item (word or sentence)
Orthophonic     List item (word or sentence)

Table 3.7: Transcript units

[43] A complete description of each sub-corpus file name format is to be found in the corresponding sub-corpus section in paragraph 3.3.

Each transcription file begins with a header, which includes all the information about recording, speakers and transcription. The CLIPS header layout conforms to the SAM standard (Gibbon et al., 1997). The header schema is divided into four sections: text information, speaker information, transcription information and recording information. Finally, each candidate recording was transcribed in four phases.

1. Lexical transcription: all lexical elements of the recording are transcribed within turn indicators.
Numbers, acronyms, dialectal and short forms are transcribed according to their pronunciation.
2. Annotation: comments and annotations are added to the transcription.
3. Overlaps: particular attention was given to turn overlapping, and the transcription was partially revised.
4. Revision: transcription revision was carried out on a regular basis by different operators.

4. Labelling

The labelling procedure aimed to give a phonetic, phonemic, sub-phonetic and orthographic time-aligned representation of the signal (Savy and Cutugno, 2009). Transcripts were used as a basis for labelling portions of each sub-corpus. First, transcript files were divided into smaller units according to the characteristics of their corpus (table 3.8).

Corpus          Transcription Unit              Labelled transcript unit
Dialogic        Dialogue                        Turn
Radio and TV    Transmission                    Utterance
Telephonic      Call                            Instruction
Read speech     List item (word or sentence)    Word or sentence
Orthophonic     List item (word or sentence)    Word or sentence

Table 3.8: Transcript and labelled transcript unit

The name of each labelled transcript was then composed of the name of the original transcription followed by a descriptor of the labelled unit. For example, the transcript of the dialogue DgmtA01T.txt is divided into turns and each turn is labelled. Labelled transcription file names consist of the original transcription file name (DgmtA01T) followed by the speaker and turn indicator (_p2G#1), resulting in files such as DgmtA01T_p2G#1. Image 3.1 describes a dialogic utterance file name by its components.

Image 3.1: DG utterance filename example

One of the main purposes of the project was to obtain a corpus that could be used for automatic speech processing. Therefore the whole annotation and labelling procedure focused on the phonetic description. Concerning transcripts, any acoustic event present in the signal was transcribed in its temporal succession.
As far as labelling is concerned, not only was the temporal succession preserved, but all relevant information was time-aligned with the signal. To do this, a modified version of WaveSurfer (software website: http://www.speech.kth.se/wavesurfer/, version used for CLIPS: http://www.clips.unina.it/downloads/wavsxclips.zip) was used to read and label the spectrogram at different levels (Savy, 2007b). In CLIPS, five different layers were labelled:
➢ ACS: sub-phonemic layer used for the description of occlusives and affricates. It contains the beginning of the silence phase, its end and the end of the release phase;
➢ PHN: phonetic transcription enriched with diacritics and annotation of various phonological phenomena;
➢ STD: standard phonological transcription;
➢ WRD: orthographic transcription;
➢ ADD: includes operator comments, turn overlapping and other non-lexical phenomena left out from the other levels (such as <vocal>, <NOISE>).

Image 3.2 shows how labels appear in WaveSurfer. The output of the program is saved by WaveSurfer as text files. File extensions indicate which layer was labelled (for example .phn for the phonetic transcription). In a label file every line is divided into three columns, which contain the TIMIT sample at which the label begins, the sample at which it ends, and the content of the label.

Image 3.2: word sì 'yes' labelling on WaveSurfer

Note that the first label of such a file can be two underscores (__), or two underscores followed by the percentage sign (__%) in case it was impossible to determine the exact beginning of the turn. This is an example of an STD label file:

    0 159 __
    159 9808 ok"Ei
    9808 11734 <sp>
    11734 16008 v"ai
    16008 17382 un
    17382 20189 p0
    20189 27405 dZ"u

At the phonological level, each label basically corresponds to a word. Therefore, each line describes a word, giving the temporal indication of its beginning, its end and its phonological representation. For example, the second line labels the signal from TIMIT sample 159 to 9808 as the word /ok"Ei/.

5. Phonological Layer

STD files contain the phonemic transcription of words [44] (Savy, 2007b). In CLIPS a word is defined as a sequence of letters separated by a blank space. Words separated by an apostrophe (and forms syntactically identical to those) are grouped together as a single unit. The alphabet used for the transcription is SAMPA. The list of symbols used to indicate vowels is given in table 3.9; for consonants see table 3.11 [45]. Most of the tags used in the orthographic transcript were either not transcribed (i.e., the orthographic form is kept unaltered) or not included at all in STD (see table 3.10). For example, turns and turn overlapping were not included at all in STD. On the other hand, false starts were included but not transcribed phonemically. In the sequence no, non ca+ capisco 'No, I don't understand', ca+ is kept in its orthographic form at STD (instead of being transcribed as 'ka+') and the comma is not included. The sequence will then be transcribed as no non ca+ kap"isko.

[44] In this thesis I will focus only on the CLIPS STD layer (see chapter 4 for an explanation).
[45] http://www.phon.ucl.ac.uk/home/sampa

SAMPA   Description                  IPA   Example   Transcription   Translation
i       Front, Close                 i     fino      [f"ino]         thin
e       Front, Close-mid             e     pera      [p"era]         pear
E       Front, Open-mid              ɛ     meta      [m"Eta]         half
a       Front, Open                  a     nata      [n"ata]         born (fem.)
O       Back, Open-mid, rounded      ɔ     nota      [n"Ota]         note
o       Back, Close-mid, rounded     o     voto      [v"oto]         vote
u       Back, Close, rounded         u     unico     ["uniko]        unique

Table 3.9: SAMPA vowel set for CLIPS

Transcription element           STD
Elements between brackets <>    Not transcribed
Dialectal and foreign words     Not transcribed
Truncated words                 Not transcribed
Interrupted words               Not transcribed
False starts                    Not transcribed [46]
Lapsus linguae                  Not transcribed
Punctuation                     Not included
Turn overlapping symbols        Not included

Table 3.10: Transcript symbols used in STD

[46] False starts are followed by a slash '/' in transcriptions, which was not included in STD.
For example, 'non ca+ /' is simply transcribed as 'non ca+'.

SAMPA   Description                        IPA   Example                        Transcription                          English
p       occlusive, labial                  p     palla                          [p"alla]                               ball
b       occlusive, labial, voiced          b     bolla                          [b"olla]                               bubble
t       occlusive, dental                  t     tana                           [t"ana]                                lair
d       occlusive, dental, voiced          d     dado                           [d"ado]                                dice
k       occlusive, velar                   k     cane                           [k"ane]                                dog
g       occlusive, velar, voiced           g     gatto                          [g"atto]                               cat
ts      affricate, dental                  ʦ     zio, azione, lo zio            [ts"io] [atts"ione] [lotts"io]         uncle, action, the uncle
dz      affricate, dental, voiced          ʣ     zolla, mezzo, la zona          [dz"Olla] [m"Eddzo] [laddz"Ona]        clod, half, the zone
tS      affricate, palatal                 ʧ     cena                           [tS"ena]                               dinner
dZ      affricate, palatal, voiced         ʤ     giro                           [dZ"iro]                               turn
f       fricative, labiodental             f     faro                           [f"aro]                                lighthouse
v       fricative, labiodental, voiced     v     vano                           [v"ano]                                vain
s       fricative, alveolar                s     sale                           [s"ale]                                salt
z       fricative, alveolar, voiced        z     sbaglio                        [zb"aLLo]                              mistake
S       fricative, palatal                 ʃ     sciarpa, pesce, è sciolto      [S"arpa] [p"eSSe] [ESS"Olto]           scarf, fish, melted
m       nasal, labial                      m     mamma                          [m"amma]                               mommy
n       nasal, alveolar                    n     nonna                          [n"Onna]                               grandmother
J       nasal, palatal                     ɲ     gnomo, legno, lo gnomo         [J"Omo] [l"eJJo] [loJJ"Omo]            gnome, wood, the gnome
r       vibrant, alveolar                  r     rana                           [r"ana]                                frog
l       lateral, alveolar                  l     lana                           [l"ana]                                wool
L       lateral, palatal                   ʎ     paglia                         [p"aLLa]                               straw
j       semivowel, palatal                 j     ieri                           [j"Eri]                                yesterday
w       semivowel, labial                  w     nuovo                          [nw"Ovo]                               new

Table 3.11: SAMPA consonant set for CLIPS

The following two elements, not present in the transcript, were added in STD: a dash, in the case of words with an apostrophe due to apocope, and the double quotation mark (") to mark lexical stress.

2 Diatopic, Diamesic and Diaphasic Variation

1. Diatopy

CLIPS aims to give a broad dialectal coverage of Italian. For this reason, fifteen cities were chosen according to the results of a detailed sociolinguistic and socio-economic study (Sobrero and Tempesta, 2007).
The study took into account both static economic values (percentage of agriculture, industry and services, and GDP section composition) and dynamic ones (annual GDP increment). Other parameters were also considered, each of them with a different weight in defining the representativeness of candidate locations. The most important indicator was the presence of, and the demand for, economic infrastructures (such as transportation, communication, energy, water management) and social infrastructures (such as education, health, sport, culture). In addition to this, further parameters were considered, such as demographic consistency and dynamism, urban typology and the economic importance of the city at both regional and national level. The 15 highest-ranked cities according to these parameters were: Milano, Bologna, Modena, Parma, Reggio Emilia, Firenze, Brescia, Roma, Vicenza, Torino, Trieste, Ravenna, Bergamo, Verona, Venezia. All these cities lie in northern and central Italy. A further grouping considered four other parameters: GDP, GDP increment during the period 1951-1991, cities with a specific economic vocation (agriculture > 20%, industry > 40%, services > 70%) and low unemployment rate. Cities which shared these characteristics were then grouped together. Each city was further described according to its population size. The 25 lowest-ranked cities resulting from the socio-economic analyses described above all lie in southern Italy. Being the least dynamic cities, they were compared and chosen according to different criteria, such as specific economic vocation and high unemployment rate (higher than the area average). From those cities, a list was chosen to be representative of the geographic, socio-economic and linguistic variation of Italian. This final list was modified in order to further balance the number of cities representative of
The resulting 15 cities, chosen as collection sites, are shown in table 3.12. The table also contains the abbreviated form (code) used in CLIPS file names and headers. Speakers from these collection sites were then chosen so that the samples could be organic and representative of the population analysed. To reduce the influence of uncontrollable variables, the chosen speakers had to fulfill the following requirements: ➢ age: between 18 and 30 years old ➢ social and cultural status: at least middle-high ➢ education: undergraduate or college students ➢ city: born and risen in the target city by parents of that same city. Location Code Linguistic Area Turin T Gallo-Italica Milan M Gallo-Italica Bergamo D Gallo-Italica Venice V Veneta Parma E Gallo-Italica Genova G Gallo-Italica Florence F Toscana Perugia O Mediana Rome R Mediana Naples N Meridionale Bari B Meridionale Lecce L Merid. estrema Catanzaro H Merid. estrema Palermo P Merid. estrema Cagliari C Sarda Table 3.12: Final location sites with codes CLIPS 2. 100 Dialogic CLIPS is also structured into 5 diamesic/diaphasic layers (Savy 2009): dialogic, read speech, radio and TV, telephonic and orthophonic. In the end of each sub-corpus sections I will resume all important information about corpus transcription and labelling notation. The dialogic corpus is composed by 240 dialogues of high quality semi-spontaneous speech recordings. It is important to note that, unlike other corpora of spoken Italian, the project aimed to obtain sufficiently good recordings, so that the acoustic and the phonetic analysis of the signal could be possible (Savy, 2007). To reach this result and obtain at the same time a spontaneous speech, elicitation techniques were used to reduce the observer's paradox argued by Labov (1977): "[...] 
the researcher has to observe how people speak when they are not being observed." A speaker aware of being recorded for linguistic purposes will in fact probably overcontrol his/her speech, thus leading to an artificial linguistic behaviour. On the other hand, a hidden recording, apart from raising privacy and legal issues, will inevitably result in a great loss of quality, probably to such an extent as to make the phonetic analysis of the signal hardly possible. Elicitation techniques consist in shifting the attention of the speaker from the form to the content of what is being said. The elicited dialogue is therefore spontaneous but at the same time of high quality, because it is recorded in a controlled environment. This kind of technique also allows the linguist to have the speaker focus on a particular subject, thus reducing the linguistic complexity (syntactic, pragmatic and lexical) of the speech. The elicitation techniques used in CLIPS are based on two non-linguistic tasks, which require two speakers to achieve a goal by exchanging verbal instructions (also called instruction giving dialogues). Two types of elicitation techniques were used: map task and spot the difference. The map task was introduced by Brown et al. (1984) and developed by the HCRC of Edinburgh for the acquisition of the HCRC map task corpus (Anderson et al. 1992). Each speaker has a map consisting of a collection of objects. A path is drawn on the map of the instruction giver (speaker 1). The instruction follower (speaker 2) then has to follow the giver's instructions in order to draw the same path on his/her map. Some minor differences in the location of the objects allow for more spontaneity and variety. Still, the dialogue is unbalanced. The giver will have longer turns, and the entire dialogue will show a fixed structure and a limited pragmatic variation.
To avoid the balancing problem, CLIPS maps were drawn so that only half of the path is represented on each map. This way each speaker is both follower and giver during the same recording session.

In order to obtain less structured dialogues, a second elicitation technique was used, based on the spot the difference task. Two speakers are given two pictures and have to discover the differences between them. Note that in both cases (map task and spot the difference) the speakers cannot see each other. Thus, only verbal language can be used to communicate.

The pictures for the tasks were chosen according to specific criteria based on a previous work on infant audiometry by Cutugno et al. (2001). The words to be chosen had to be known by 3-year-old children, had to be easy to represent with simple pictures and had to be among the most frequent of the Italian lexicon.

Dialogue transcriptions are characterised by turn indicators and turn overlapping. The former mark the beginning of a turn and indicate speaker and turn number. For the latter see paragraph 3.1.

3. Read Speech

The read speech corpus contains 16 hours of recording (De Masi, 2007). It is divided into two categories: word list reading and sentence reading. The list of sentences was created using the following procedure. A list of lemmas was first obtained by merging four frequency lexicons:
➢ Frequency Lexicon of Spoken Italian (LIP)
➢ Frequency Lexicon of Contemporary Italian (LIF)
➢ Italian Electronic Dictionary (VELI)
➢ Basic Lexicon (LE)

Function words, adverbs (the corresponding adjectives were instead kept) and other ambiguous categories (possessives, indefinites and numerals) were all removed from the list. The 70 remaining words with the highest usage index were then chosen to create the 20 micro-texts of the sentence list. The word list instead simply consists of the names of the objects drawn in the map task and spot the difference pictures.

4.
Radio and TV

To be representative and balanced, the Radio and TV sub-corpus is structured for diamesic, diaphasic and diatopic variance.

To account for the diatopic variance, 20% of the data was taken from national channels and the remaining 80% from regional television. The actual proportion of audience between national and regional television was not respected, in order to obtain a balanced representation of diatopic variation. A faithful representation would have penalised regional networks too much (see table 3.13 for percentages). Even national television in fact shows some minor diatopic variance. In particular, middle Italian and southern Italian traits are dominant on RAI programs, while northern Italian traits are more frequent on the MEDIASET network (A. Sobrero, 2007).

Network            Percentage
Rai                25,00%
Mediaset           35-40%
Syndications       15,00%
Private networks   20-25%
Table 3.13: Italian networks audience sharing

Concerning the diamesic variation, it has been noted (Dardano, 1994; Rivola, 1989) that radiophonic and television language show basically the same properties in Italian. For this reason, it was chosen to collect 50% of the corpus from television and the other 50% from the radio.

Four other categories were introduced to account for the diaphasic variation:
➢ Entertainment: has a very high audience and contains live calls, which are of particular interest due to their spontaneity.
➢ Broadcast: important for its audience and for the language used, which is very close to written text.
➢ Culture: contains less data because of its small audience and because its diaphasic variation is similar to that of the broadcast.
➢ Advertisement: the high audience and the peculiarities of advertisement language were considered a positive factor, but the scarce attention given by the audience, as well as the minor linguistic influence on the audience, led the authors to limit the amount of collected data.
Table 3.14 summarises the variables listed above; 50% of the data comes from television, the other 50% from the radio.

Typology        Local radios, for every node   Local TV, for every node   RAI (total)   Mediaset (total)
Talk show       15'                            15'                        50'           50'
Advertisement   5'                             5'                         15'           15'
Broadcast       2'                             2'                         15'           15'
Culture         3'                             3'                         10'           10'
Total           25'                            25'                        90'           90'
Table 3.14: Minutes of recording distribution on RD and TV

5. Telephonic

The Telephonic corpus contains recordings of calls from simulated tourists to a virtual assistance service. Each speaker from the 15 cities of the corpus received 10 scenarios, which contain information about the request that should be given to stem (Di Carlo and D'anna, 2007). For example, a possible scenario is the following:

“You are at home. You are calling Hotel Excelsior in Paris to book a triple non-smoker room, with view, shower, and a strongbox. You are booking for the week of Christmas, for three friends of yours. Your credit card number is 7497 3792 1801 9340.”

The user would call the assistance system and ask for the service indicated. Two different modalities of interaction were used: automatic and Wizard of Oz (WoZ). The former did not require a human operator to be present and was used when the Wizard of Oz modality was not possible. Once it has received the call, the automatic system performs the following operations:
➢ take the call;
➢ recognise the DTMF tones that the user dials to indicate the scenario;
➢ record the request;
➢ end the call.

The Wizard of Oz mode required the presence of an operator. In addition to the basic operations shown above, the operator was able to send messages to the client and to record information based on the scenario. When a call was received, the operator had to fill in a form with the information received from the client.
For example, for the scenario above, the form would be the following:

Identity:
    Username
Obligatory information:
    Room size
    Check-in date
    Check-out date
    Credit card number
Facultative information:
    Room with view
    Bathroom service
    Strongbox
    Room for smoker
    Arrival time

To ask for facultative information the operator could interactively send recorded messages. Recorded messages could also be used to ask for a request to be repeated (in case of unclear instructions or lack of relevant information) or to conclude the call. Note that any message sent by the operator was recorded using a synthesised voice, in order to keep the client aware that he/she is interacting with an automatic speaker and not with a human. The entire procedure is the following:
➢ the client calls;
➢ the OC receives the call;
➢ the client is asked to dial the number of the scenario, the number is saved in a log file;
➢ stem starts the recording;
➢ the client gives the instructions of the scenario;
➢ if an operator is present (WoZ):
    • he may ask to recast the instruction by sending a recorded message
    • messages from the operator will be saved in a log file
    • he fills in the form of the scenario
    • the client concludes his/her scenario
➢ the recording ends
➢ a good-bye message is sent
➢ the call is terminated

6. Orthophonic

The read speech corpus contains a list of words and sentences read by non-professional speakers of Italian. The orthophonic corpus consists of the same items read by professional speakers. The aim of the orthophonic corpus was to obtain a corpus of high quality recordings, parallel to the read speech, which could be representative of standard Italian. Ten professional speakers were chosen (5 males and 5 females) to read and repeat three times in an anechoic chamber[47] the twenty sentences previously cited.
Since the items of the read speech corpus were chosen according to lexical criteria only, the original corpus was extended with another list of sentences which could provide a phonetic coverage of Italian phonotactic clusters. Thus, the corpus could be used as a basis for the evaluation of verbal communication and codification systems. With reference to the SQEG (Speech Quality Expert Group), the ITU (International Telecommunication Union) and the European expert group of the ETSI (European Telecommunications Standards Institute), another list of 120 short sentences was added.

[47] Istituto Superiore C.T.I., viale America 201 - Rome. For technical information please refer to the original document:

7. Corpus structure

The structure of the corpus can be summarised as follows:
➢ dialogic (DG)
    • map task (mt)
    • spot the difference (td)
➢ sentence read (LF)
➢ map task words reading (LM)
➢ spot the difference words reading (LT)
➢ Radio (RD)
    • culture (dc)
    • entertainment (it)
    • broadcast (is)
    • advertisement (pb)
➢ Television (TV) - as RD (see above)
➢ Telephonic (TL):
    • Automatic (A)
    • Wizard of Oz (M)
➢ sentence reading (LP)
➢ balanced sentence reading (LB)

4 Syllabification Program

1 Python and NLTK

1. Python

Guido van Rossum (2000), the creator and main developer of Python, describes Python as “an interpreted, object-oriented, high-level programming language with dynamic semantics”. Python was chosen as the programming language for this project for various reasons.

First, it is particularly suited to Rapid Application Development. Python programs are typically shorter than equivalent Java or C++ programs and development is consequently faster. Python's built-in high-level data types and dynamic typing do not require variable declarations, allow operators to be overloaded, save typing time and lines of code, and avoid memory allocation bugs (buffer overruns, pointer-aliasing problems, malloc/free memory leaks and so on).
Moreover, since Python is an interpreted programming language, debugging is usually fast and trouble-free. It is possible to edit the code by including print statements, test it and obtain a clear stack trace. A powerful source level debugger written in Python also “allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time and so on”.

Second, Python is known for being elegant. Its very clear syntax and semantics, the use of indentation, transparent operator symbols and many other features make the code reusable and easy to understand, learn and customise. For example, even the debated use of indentation instead of braces reduces the variability of the code (e.g., there are at least three different conventions for the placement of braces in C) and its visual complexity, and therefore improves the overall readability of the program.

Third, Python is portable: according to the documentation it runs on many Unix variants, on the Mac, and on PCs under MS-DOS, Windows, Windows NT, and OS/2, and it is included in most Linux and BSD distributions (such as Debian Lenny and Ubuntu).

Fourth, Python includes an extensive and documented standard library which provides powerful general purpose modules, and a big collection of libraries and packages that can easily be used and included in one's code.

Fifth, Python is widely used in academia and in industrial applications by many major brands (such as Yahoo, Google, Firaxis Games, Rackspace) and in many application domains such as software development, arts, business, education, biology, chemistry and engineering.

Finally, “Python is absolutely free, even for commercial use (including resale)” and its licence is certified as Open Source by the Open Source Initiative[48] (http://www.python.org/psf/license/).
(Rossum and Drake 2003, 2000, 1993; Rossum et al., 1994)

Python is a very powerful language and for computational linguistic tasks it provides all the necessary tools, even for the most advanced applications[49]. For example, ConceptNet (http://web.media.mit.edu/~hugo/conceptnet/) is the largest freely available commonsense knowledge base and is used in dozens of innovative research projects at MIT and elsewhere. (Havasi et al., 2007; Liu and Singh, 2004)

However, as with any other programming language, some structural choices render some functionalities unavailable or discouraged. The one major drawback of Python is run-time speed. Python code generally runs slower than equivalent programs in C++ and Java. This is mainly due to its memory manager and to dynamic typing (cf. above). However, with the exception of some applications where speed optimisation and control of memory usage are a priority (scientific computing, kernel hacking and so on), the loss in speed is in most cases irrelevant given the efficiency of today's machines, and would be negligible in our software. (Raymond, 2000)

[48] All software used to write this thesis (OpenOffice.org 3.1), the scripts (vim) as well as the Operating System (Debian Lenny) and all programs are free or open source.
[49] Hence there is no need to discuss technical details such as the lack of tail recursion (see http://neopythonic.blogspot.com/2009/04/tail-recursion-elimination.html) in this thesis.

2. NLTK

Before implementing SY it was necessary to find a method to manage CLIPS data. It was important to have a program that could store corpus data without losing information, but that at the same time would make it easy to manage. We needed in fact an interface between SY and the corpus that could fetch all necessary information from the corpus and make it available to SY.
TIMIT and transcripts had to be kept available, as well as the possibility to find and read the audio samples associated with each transcript and their recording information. To do this, we could choose between two possibilities: write a parser from scratch or look for a project that had already implemented this kind of corpus reader (henceforth CR). The latter choice was preferred and NLTK's TIMIT corpus reader was used as a basis.

NLTK (Natural Language Toolkit) “is a suite of program modules, datasets, tutorials and exercises, covering symbolic and statistical natural language processing” (Loper and Bird, 2002; Loper, 2004). It is widely used in the most prestigious university NLP courses (including at MIT, UPENN and the University of Edinburgh) and in research systems (Liddy and McCracken, 2005; Sætre et al., 2005). According to Loper (2004), the main requirements for the toolkit design have been ease of use, consistency (in data structures and interfaces), extensibility (the toolkit now includes many third party modules), documentation and simplicity.

In addition to these characteristics, we decided to choose NLTK for various reasons. First of all, it is entirely developed in Python. As said before, this was the programming language we intended to use for the development of SY. Even if it is possible to use Python as a glue language by having it interact with other languages, a fully consistent implementation was preferred. NLTK was also designed with simplicity and extensibility in mind. NLTK is distributed under the GNU General Public licence and all documentation under a Creative Commons non-commercial licence. This makes it possible to customise and extend NLTK to include the corpus reader and SY in a consistent and elegant way. UNIX's divide and conquer philosophy was also favoured.
The UNIX design principle of organising complexity by breaking it into parts makes it possible to have a main application that is actively developed and rich in functions (NLTK), a small syllabification program which can easily be modified and adapted to other languages leaving the other components untouched, and a simple interface between the two (CR).

NLTK will allow us to analyse and interact with the corpus in an autonomous way. In fact, it will be possible to exploit NLTK's structures and functionalities in the same way directly on the corpus or on syllabified data. SY and the CR were designed so that they could be used by NLTK without issues, as SY simply makes a new token available. A token is a linguistic unit used by NLTK as a basic processing unit. In our case it will be a syllable, a phone, a phoneme or a word depending on the case. The main functions will be used on different tokens in an identical way, with no compatibility issues.

2 Implementation

1. Syllabification

In this paragraph I will illustrate the final syllabification procedure, state how the specific problems of Italian syllabification are treated and explain why. In chapter I, I showed various syllabification principles and procedures and demonstrated that different theories may lead to different assumptions on the nature of the syllable. As it is not the purpose of this thesis to draw up a brand new phonological theory, it is necessary to make choices on the basis of available theories and principles. The aim of this thesis is to write a syllabification program: the algorithm I will show in the next paragraph has engineering goals (it does not aim to demonstrate, refute or simulate a linguistic theory, see 2.1) and therefore the main design principle will be practical. This argumentation is controversial, but actually lays out an important theoretical assumption.
As said in chapter II, it is important to make design choices so that they reflect the nature of the problem. The syllabification program will be used to syllabify CLIPS time-aligned transcripts. In the scope of the corpus itself, it was important to find a principle that could rely on the acoustic properties of the signal, so that the phonological syllabification and the syllabification on the signal could be as close as possible. Remember that a syllable in the acoustic domain is defined in terms of its energy. The phonological principle which best reflects this property is then the SSP.

Still, we had to face various problems. The basic design principle I used was to keep the procedure as simple as possible. This is not only for practical reasons (a neat, transparent, bug-free code) but also because the analysis of the signal would not allow for phonological restrictions. For example, in the case of the SD+4 principle, it would be necessary while working on the signal to determine whether the sonority distance between two consonants, in terms of energy, is greater or smaller than four, which is impossible if we do not want to abstract that portion of the signal to its corresponding phoneme. While it is easy, on an abstract phonological level, to obtain the relative sonority distance of two given phonemes, it is almost impossible to get a precise relative distance at the acoustic level, as it would require at least the phonological class to be detected. The main reason to use the syllable in NLP was that it eased speech recognition tasks by providing a unit which was not as hard to recognise and as variable as the phone. Going back to segment recognition to determine the relative sonority of a portion of signal would require a segmental analysis, which is what we wanted to avoid.

Finally, the principle of simplicity means having the least number of segments associated with their corresponding signal portion.
In the case of sC clusters, for example, we want every sCC cluster to be recognised as the same sequence everywhere. That would not be the case with a heterosyllabic division: sCC would be kept together word-initially but split up word-internally, as in Vs.CC. The same holds for geminates. By keeping them together, the opposition between minimal pairs such as pap.pa and pa.pa would be clearer, because the contrasting segment appears in one syllable and this is consistent with its representation in the signal. Moreover, in the case of heterosyllabic geminates, it would have been harder to detect and split the two segments up in the signal. The signal is continuous, and deciding where to put the syllable boundary would have added a lot of arbitrariness and complexity to the algorithm.

Concerning non-native clusters, it was chosen to have them syllabified by the SSP without adding useless complexity to the algorithm. The underlying linguistic debate is probably the most controversial one, and is sometimes considered irrelevant even from a phonological perspective. Moreover, the frequency of these clusters in spoken language (and in the lexicon) is so low that they do not really need to be accounted for (cf. 4.3.4).

Finally, concerning diphthong syllabification, we used the strict definition of hiatus adopted by Fiorelli (1941) and Canepari (2004). According to these authors, in Italian a hiatus is given only by vowel clusters stressed on the second vowel (e.g. /ka”ino/, /be”ato/, /pa”ura/). Again, a concrete rise in energy is visible in the signal only in that case. In the other cases we did not consider it sufficient for the vowel to constitute a nucleus.

2. CLIPS's STD

In chapter III, I have described only the phonemic transcription layer (STD). As a first stage of work, in fact, it was decided that a syllabification of the sub-lexical level (PHN basically) would be too problematic to start with.
A test simulation was made on the PHN level, but the overwhelming presence of deletion, epenthesis and assimilation phenomena which had to be treated showed that the syllables obtained were too far from being accurate for any possible application. Nonetheless, the importance of a phonetic syllabification is still evident. The PHN labels correspond to phones, while the phonemic labelling corresponds to phonological words, which means that at the phonological level we have word boundary TIMITs only. It is then impossible to obtain an automatic correspondence between acoustic syllable boundaries and the syllabification obtained through SY, due to the lack of temporal indications at the lexical level. For example, suppose we want to syllabify the word 'ka.za'. At the phonetic level we would have the TIMIT of all phones, so that it would be possible to trace the syllable boundary between 'a' and 'z' back to the signal. At the phonemic level, however, we would only have word boundary TIMITs. Without any temporal indication on where the phones 'a' and 'z' are, we would not be able to identify the signal portion of the syllable.

The WRD level was also not implemented because, as explained in chapter 2, we did not want to implement an orthographic syllabification program, and there is no reason to use orthographic transcripts if the corresponding phonemic transcription is available for syllabification. One of the reasons is that the phonemic transcription provides phonological information useful for syllabification. In particular, the algorithm relies on transcript labels to determine a syllabification that corresponds to real acoustic syllables in the signal. For example, in the case of pauses between words, re-syllabification does not apply, as there is a void between the two segments, which was considered evidence of a syllable boundary.
Pauses inside words were instead considered disfluencies, labelled with an underscore _ and therefore not transcribed (all the sequences not transcribed in STD were not syllabified, cf. 3.1.5). Lexical stress was used only to distinguish hiatus from diphthongs. If two vowels are separated by a stress label (“) they are considered heterosyllabic (cf. previous section) and divided. Dashes used to indicate apostrophes were stripped off, because word boundaries were not considered during sentence syllabification (simulating natural re-syllabification) and therefore the whole sentence was treated as a single sequence of phonemes ([l-albero] is the same as [lalbero] due to re-syllabification).

3. Core SY

As I will show in the next section, SY contains two functions used to pre-process and eventually purge the input. The core syllabification function, however, was implemented so that it could simply take a sequence as input and give the same sequence with syllables separated by dots as output. The input can be any phonemic sequence (words, phrases, sentences and so on).
    input: 'lakarta' → do_syllabify() → output: 'la.kar.ta'

The original implementation of the algorithm (Cutugno et al., 2001) was divided into two parts. The first one parsed the sequence and stored the indexes of the least sonorous elements:

    def do_syllabify(sequence, verbose=0):
        """ Divide a phonematic sequence in syllables """
        # (excerpt: the loop over the indexes i of the sequence is omitted)
        if son(sequence[i]) < son(sequence[prevchar]) and \
           son(sequence[i]) <= son(sequence[nextchar]):
            less_son.append(i)

Another portion of code then implemented the rest of the algorithm as follows:

    # it is a sonorant
    if config.getint('Sonorities', sequence[ph_index]) > 9:
        # it is not followed by a vowel or semivowel, coda
        if config.getint('Sonorities', sequence[nextchar]) < 15:
            syllable_boundaries.append(nextchar)
        # followed by a vowel or semivowel, onset
        else:
            syllable_boundaries.append(ph_index)
    else:
        # not a sonorant, onset
        syllable_boundaries.append(ph_index)

In my opinion, this solution was not the best one, from a computational and above all from a linguistic point of view. In particular, it treated sonorants differently from other segments and required a lot of code whose behaviour was hard to predict. In fact, three semantic bugs were found.

In the case of sonorant geminates, the algorithm put a syllable boundary between the two segments. In the case of other geminates it did not. For example, the word gallo 'cock' was syllabified as gal.lo while gatto 'cat' was syllabified as ga.tto. This was not acceptable. A solution was to use relative indexes, so that in the case of geminates the algorithm considers them as a single unit:

    nextchar = ph_index + 1
    if sequence[ph_index] == sequence[nextchar]:
        nextchar += 1

Another correction had to be made to include semivowels while evaluating the structural position of sonorants. In fact, V.SGV behaves exactly the same way as V.SV and not as VS.CV. For example, 'karje' had to be syllabified as ka.rje and not as kar.je.
Finally, the sCC cluster was syllabified as s.CC at the beginning of a word, or even as VC.s.CC after a consonant, as in words like /ekstra/ (syllabified [ek.s.tra]). The isolated /s/ was recognised as a syllabic nucleus, or rather as extra-syllabic because it was left out of the adjacent syllables. These clusters had to be purged and further lines of code included.

The resulting code lacked elegance, did not really reflect the nature of the phonological phenomena and required a lot of additional code to handle situations where the original algorithm failed. The original purpose of the algorithm was to implement the SSP, which has the benefit of being a simple and universal principle. But even if the principle is universal, it is widely accepted that some variation in the SH can be found across languages. Instead of modifying the code, the algorithm or the principle itself, what I had to look for was some possible variation in the SH.

The result was that, by slightly changing the sonority value of /l/ so that it is less sonorous than /r/, and by setting the sonority value of /s/ to 1 (see previous section), the principle could be implemented in only three lines of code and gave good results. It required no messy exception handling instructions and no ad hoc restrictions on sonorants, gave the desired syllabification for sC clusters, geminates and hiatus, and had an elegant and transparent design[50]. Table 4.2 shows how some words were syllabified by the two algorithms. The only three lines of code needed are the following:

    if son(sequence[i]) < son(sequence[prevchar]) and \
       son(sequence[i]) <= son(sequence[nextchar]):
        less_son.append(i)

Note that sonorities are specified in a configuration file and parsed using the module ConfigParser:

    def son(phone):
        return config.getint('Sonorities', phone)

[50] It has been objected that this is a programming hack. It is, but it is not only that.
I am not arguing here that /s/ has a sonority value of 1 in phonology, nor that the tautosyllabic syllabification is the correct one. What I am saying is that I assumed no phonological framework but looked for an algorithm that had to work on the signal, and I am fairly confident this is the best way to do it, both in implementation and in design. In addition to this, by keeping it as simple as possible, and by demonstrating that it is possible to obtain a syllabification of Italian by only tuning sonority values, the SSP remains universal and simple.

For the sonority.cfg file see APPENDIX A (divided into two columns to save space). Hashes indicate comments. An integer is assigned to each phoneme used in CLIPS. Typos are also handled, as in the case of 0 (zero) used instead of O (capital <o>), 'c' instead of 'k' and so on. Symbols are also included and assigned a sonority value for compatibility.

More work is necessary to handle CLIPS labels. In the case of stress, if the adjacent segments are vowels in the form V"V, a syllable boundary is placed between the two vowels, V."V:

    if itis(sequence[prevchar], 'V') and itis(sequence[nextchar], 'V'):
        syllable_boundaries.append(i)

The syllabification of the most relevant Italian clusters by the program is given in APPENDIX B.

The opaque interaction with the CR is interesting as well. The CR (cf. next section) can provide four types of data: a list of words, a list of sentences, one word or one sentence. The syllabify function recognises the input and provides the syllabification of the sequence. For example, in the case of a list of sentences the CR returns a list which contains one list for each sentence, [[Sentence1], [Sentence2]]. Each sentence is in turn represented as a list of strings (words), [['W1', 'W2', 'W3'], ['W1', 'W2']]. The list is parsed and every sentence is turned into a string made of the sequence of all the words in the sentence, which is given as argument to do_syllabify().
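The pieces described so far (a sonority table read through a configuration parser, the son() helper and the three-line SSP test) can be put together in a minimal, self-contained sketch. It is not the code of SY: the sonority values below are illustrative assumptions and not the content of sonority.cfg (APPENDIX A), with only /s/ = 1 and /l/ being less sonorous than /r/ taken from the text. The loop around the test, the function name ssp_syllabify and the use of the Python 3 configparser module are likewise assumptions made to obtain a runnable example.

```python
import configparser

# Illustrative sonority values (assumed, NOT the thesis's sonority.cfg).
SONORITY_CFG = """
[Sonorities]
a = 25
e = 24
o = 24
i = 23
u = 23
j = 16
w = 16
r = 13
l = 12
n = 11
m = 11
z = 5
g = 2
d = 2
b = 2
k = 1
t = 1
p = 1
s = 1
"""

config = configparser.ConfigParser()
# Keep option names case-sensitive, so that e.g. 'O' and 'o' stay distinct.
config.optionxform = str
config.read_string(SONORITY_CFG)


def son(phone):
    """Return the sonority value of a phone."""
    return config.getint('Sonorities', phone)


def ssp_syllabify(sequence):
    """Place a dot before every local sonority minimum (broad SSP test)."""
    less_son = []
    # The first and last segments have no neighbour on one side: skip them.
    for i in range(1, len(sequence) - 1):
        if son(sequence[i]) < son(sequence[i - 1]) and \
           son(sequence[i]) <= son(sequence[i + 1]):
            less_son.append(i)
    # Insert dots from right to left so that earlier indexes stay valid.
    for i in reversed(less_son):
        sequence = sequence[:i] + '.' + sequence[i:]
    return sequence


for word in ('lakarta', 'gallo', 'gatto', 'pasta', 'karje'):
    print(word, '->', ssp_syllabify(word))
```

With these assumed values the sketch keeps geminates and sC clusters together (ga.llo, ga.tto, pa.sta) and syllabifies 'karje' as ka.rje, which is the behaviour described above for the simplified algorithm.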
If syllabify is called with the option 'rich', it will print the non-transcribed portions of the sentence (such as disfluencies) between parentheses. Otherwise, it will just ignore them. Comments are always ignored:

    # it is a sentence
    for word in sequence:
        # ignore comments
        if '[' in word or '<' in word:
            continue
        elif '+' in word:
            if not rich:
                continue
            else:
                # save the word and substitute it with a '+'
                oldwords.append(word)
                word = re.sub(r'.*\+', '+', word)
        nsequence.append(word)
    # syllabify the sequence
    tmp_syllabified = do_syllabify(''.join(nsequence), verbose)
    if rich:
        for word in oldwords:
            # put the non transcribed words back between parentheses
            tmp_syllabified = re.sub(r'\+', '.(' + word[:-1] + ')',
                                     tmp_syllabified, 1)
        # if the sentence begins with a '.' strip it off
        if tmp_syllabified[0] == '.':
            tmp_syllabified = tmp_syllabified[1:]

The two additional functions are cvsyll(word, ph_class = 0) and itis(). The former takes a string (syllabified or not) and returns its syllable structure. If ph_class is set to 0 the CV structure of the syllables is given, otherwise the phonological class of each segment. For example, the word 'ka.za' will be returned as 'OV.FV' if ph_class is specified or as 'CV.CV' if it is set to 0. The phonological classes used are: Occlusives, Fricatives... cvsyll uses itis() to determine whether a phone is a consonant or a vowel and, if required, its phonological class. If the argument query specifies a phonological class, itis returns 1 if the phone belongs to that class or 0 if not. To determine the phonological class, the sonority values in sonority.cfg are used. These two functions are particularly useful to abstract statistical analysis, as we will see in the next section.
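As a rough illustration of how a syllable-structure string can be derived from sonority values alone, the sketch below maps each phone to a class by sonority range. It is not the thesis's cvsyll(): the function names cv_structure and phone_class, the sonority values, the class letter 'S' for sonorants and the ranges assumed for fricatives and occlusives are hypothetical. Only the vowel (19-26) and glide (15-18) ranges follow itis(), and counting glides as consonants in the CV string is an assumption.

```python
# Illustrative sonority values (assumed; the real ones are in sonority.cfg).
SONORITY = {
    'a': 25, 'e': 24, 'o': 24, 'i': 23, 'u': 23,  # vowels: 19-26
    'j': 16, 'w': 16,                             # glides: 15-18
    'r': 13, 'l': 12, 'n': 11, 'm': 11,           # sonorants: 10-14
    'z': 6, 'f': 5,                               # fricatives: 5-9 (assumed)
    'k': 1, 't': 1, 'p': 1,                       # occlusives: 1-4 (assumed)
}


def phone_class(phone):
    """Return V, G, S, F or O for a phone, decided by sonority range."""
    value = SONORITY[phone]
    if 27 > value > 18:
        return 'V'
    if 19 > value > 14:
        return 'G'
    if 15 > value > 9:
        return 'S'
    if 10 > value > 4:
        return 'F'
    return 'O'


def cv_structure(word, ph_class=0):
    """Map a (possibly syllabified) string to its CV or class structure."""
    structure = []
    for phone in word:
        if phone == '.':
            structure.append('.')  # keep syllable boundaries as they are
            continue
        label = phone_class(phone)
        if not ph_class:
            # Collapse every non-vowel class into C (glides included).
            label = 'V' if label == 'V' else 'C'
        structure.append(label)
    return ''.join(structure)


print(cv_structure('ka.za'))
print(cv_structure('ka.za', ph_class=1))
```

Under these assumptions the word 'ka.za' gives 'CV.CV' with the default argument and 'OV.FV' when ph_class is set, as in the example above.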
    def itis(phone, query=0):
        """
        query == CLASS   return 1 if a phone belongs to the desired class
        query == 0       return C or V
        query == 1       return ph class of a phone
        """
        try:
            config.getint('Sonorities', phone)
        except Exception:
            return 0
        if son(phone) in [99, 0]:    # symbols
            phoneis = 'X'
        elif 27 > son(phone) > 18:   # vowels
            phoneis = 'V'
        elif 19 > son(phone) > 14:   # glides
            phoneis = 'G'
        elif 15 > son(phone) > 9:

Another useful function is demo_syll(). By simply running demo_syll(), it is possible to get a set of example syllabifications (APPENDIX B). The set is designed so that only representative examples of relevant sequences are syllabified, such as geminates, sC clusters, non-native clusters and so on. This might be particularly useful, especially for linguists. In fact, it is possible to make any change to the algorithm, or simply to the sonority scale, and immediately see how the change has affected the whole syllabification system.

4. Phonological Syllabification

A controversial aspect of the syllabification program was the treatment of geminates. Even if some authors assume the opposite (De Gregorio, 1935; Martinet, 1975 and others), evidence clearly shows that Italian speakers recognise geminates as heterosyllabic segments (for a recent and complete analysis see Loporcaro, 1996). If the syllabifications given so far by the program might be considered phonologically erroneous because of this aspect, the proposed SSP and SH are perfectly able to describe and predict the correct phenomenon. The algorithm described so far puts a syllable boundary where sonority decreases and therefore, in the case of geminates, it keeps the two identical segments together in the onset, as there is no sonority shift in a broad sense.
However, as noted in 1.2.2, a strict or exclusive interpretation of the SSP implies that, because sonority does not decrease throughout syllable margins in the case of a sonority plateau, two identical segments have to belong to different syllables. Strictly applying this interpretation of the principle yields a different syllabification, in particular for geminates, which become heterosyllabic. The resulting syllabification system shows no idiosyncrasies and reflects Italian phonological theory. The output of the demo_syll() function (see APPENDIX C) shows all relevant cluster syllabifications that result from the application of this principle. It is important to note that no resyllabification or exception is required. By keeping the same SH and applying the SSP, two possible syllabifications are obtained: one that prefers tautosyllabicity and seems to be more usable on acoustic-computational grounds, and another that is phonological and results from the strict interpretation of the SSP. The algorithm is identical to the first one, which demonstrates that the core principles have not changed, with the exception that a syllable boundary is placed even if two phonemes have the same sonority (whereas the other version required a sonority shift).

    if son(sequence[i]) <= son(sequence[prevchar]) and \
       son(sequence[i]) < son(sequence[nextchar]):

Even more interesting is the fact that the sC cluster is also treated as heterosyllabic with the strict implementation of the SSP, thus reflecting the hypothesis of most of the literature discussed so far. Most importantly, with this SH and SSP the sC cluster does not cause extrasyllabicity word-internally, as in the word /ekstra/, syllabified as /eks.tra/ (cf. APPENDIX C). In fact, the literature has never reported, especially for Italian, extrasyllabicity phenomena that do not occur at word margins, whereas these would result from applying the SSP with the standard sonority value for /s/.
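The difference between the two interpretations can be tried out with the following Python 3 sketch. It is a simplified reconstruction under stated assumptions, not the thesis code: the function name ssp_syllabify is hypothetical, input is limited to single-character phones without stress marks, and the comparison used for the non-strict variant (strictly lower than the previous phone, lower than or equal to the next one) is an assumption chosen because it reproduces the sample output of APPENDIX B. Only the strict condition is taken from the text above.

```python
# Sonority values copied from APPENDIX A; only single-character phones are listed.
SONORITY = {
    'a': 26, 'E': 24, 'O': 24, 'e': 22, 'o': 22, 'i': 19, 'u': 19,
    'j': 18, 'w': 18, 'L': 14, 'r': 14, 'l': 12, 'm': 11, 'n': 11,
    'v': 9, 'z': 9, 'f': 7, 'S': 7,
    'b': 3, 'd': 3, 'g': 3, 'p': 1, 't': 1, 'k': 1, 's': 1,
}


def ssp_syllabify(word, strict=False):
    """Insert '.' before every sonority minimum found inside the word."""
    son = [SONORITY[phone] for phone in word]
    out = []
    for i, phone in enumerate(word):
        # only word-internal phones have both a left and a right neighbour
        if 0 < i < len(word) - 1:
            if strict:
                # condition quoted in the text: a plateau on the left still
                # counts, so the boundary falls between identical segments
                is_minimum = son[i] <= son[i - 1] and son[i] < son[i + 1]
            else:
                # assumed non-strict variant: a sonority drop is required
                # on the left, so a plateau stays together in the onset
                is_minimum = son[i] < son[i - 1] and son[i] <= son[i + 1]
            if is_minimum:
                out.append('.')
        out.append(phone)
    return ''.join(out)


for word in ['gatto', 'pasta', 'strano', 'kolpa']:
    print(word, ssp_syllabify(word), ssp_syllabify(word, strict=True))
```

For these four words the non-strict variant gives ga.tto, pa.sta, stra.no and kol.pa, and the strict variant gives gat.to, pas.ta, s.tra.no and kol.pa, in agreement with the corresponding entries of APPENDIX B and APPENDIX C.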
By using the exclusive, or phonological, interpretation of the SSP and the same SH used so far, one obtains divided geminates, heterosyllabic sC clusters and no word-internal extrasyllabicity. Further evidence which may justify the special sonority of /s/, and above all avoid the need for other principles, is the fact that by changing the sonority value of /s/ from 1 to 0 one obtains a tautosyllabic syllabification of the cluster and no extrasyllabicity (i.e., /E.kstra/, /stra.no/, /pa.sta/). In this case, it is easy to justify the ambiguous behaviour of the cluster in terms of sonority alone. Davis (1990) proposed a principle based on the assumption that speakers perform an arithmetic operation to determine whether a sonority distance of 4 is reached, and hence whether or not to place a syllable boundary. My hypothesis is that there is no principle that enables speakers to make such a fine-grained distinction between arbitrary sonority distances. The SD principle for Italian was justified by il/lo allomorphy but, as noted by authors such as Bertinetto (1999), the proposed data are quite controversial. Moreover, as demonstrated in 1.2.7 (cf. McCrary, 2004), where the sonority distance is small, cluster syllabification is particularly ambiguous, and it is hard to determine where to put a syllable boundary in a sequence of vocoids, possibly because there is no clear sonority shift. I therefore assume that Davis (1990) itself indirectly provides evidence that Italian speakers are better able to rely on large sonority distances to determine the correct syllabification. The sonority of /s/ might be 1, 0, or might be changing diachronically from 1 to 0. Its ambiguity lies in the fact that this small sonority difference causes the cluster to be heterosyllabic or tautosyllabic.

3 Final Developing

1. Corpus Reader

Instead of reinventing the wheel, and to keep compatibility among programs, it was decided to build a CR based on the NLTK TIMIT one. The main methods and classes are kept and, with a view to a possible future merging of the two CRs, I have tried to keep the code compatible. To do this it was necessary to modify the directory structure of CLIPS so that it could be parsed in the same way as the TIMIT one. The script onlineclips.py downloads the entire corpus from the Internet and prepares it to be processed by the CR. As it is impossible to download the files of the corpus directly via HTTP, onlineclips.py uses GNU wget as a web crawler to get the URLs of the downloadable files of the corpus. Within the CR we have access to all the information contained in the corpus. Most importantly, the CR encodes the data so that it can be processed and manipulated by Python and NLTK. As we will see, this also means that any sequence can be syllabified and processed. First it is necessary to load the CR.

    >>> from nltk.corpus import clips

Now we can operate using the methods of the imported object. clips.utteranceids(corpus) returns a list containing the utterance ids of the specified corpus. Instead of getting all the ids we can choose to get only part of them, say five utterance ids of the dialogic corpus (those at positions 5 to 9):

    >>> item = clips.utteranceids()[5:10]

And print the corresponding file ids to see the content of the list.

    >>> print clips.fileids('txt')[5:10]
    ['DGmtA01L_p1F/115.txt', 'DGmtA01L_p1F/117.txt', 'DGmtA01L_p1F/119.txt', 'DGmtA01L_p1F/121.txt', 'DGmtA01L_p1F/123.txt']

Now it is possible to easily get all information from the chosen items, as shown in the following examples.
Phones:

    >>> print clips.phonemes(item)
    ['akk"anto%', '%a', 'sin"istra', '__%', '%t"utto%', '%"alla', 'sin"istra%', 'margi+', 's"i', 's"i']

Orthographic words (the 'u' before the strings stands for unicode):

    >>> print clips.words(item)
    [u'accanto', u'a', u'sinistra', u'tutto', u'alla', u'sinistra', u'margi+', u'si', u's\xec']

To have non-ASCII characters displayed properly, print the single elements instead of the list representation:

    >>> for word in clips.words(item):
    ...     print word,
    accanto a sinistra tutto alla sinistra margi+ si sì

Orthographic words with TIMIT time alignment:

    >>> print clips.word_times(item)
    [(u'accanto%', 8264, 20419), (u'%a', 20419, 21789), (u'sinistra', 21789, 37007), (u'__%', 0, 694), (u'%tutto%', 694, 6333), (u'%alla', 6333, 10477), (u'sinistra%', 10477, 25731), (u'margi+', 37599, 47610), (u'si', 10016, 16372), (u's\xec', 3013, 8870)]

Phonemes with TIMIT time alignment:

    >>> print clips.phoneme_times(item)
    [('akk"anto%', 8264, 20419), ('%a', 20419, 21789), ('sin"istra', 21789, 37007), ('__%', 0, 694), ('%t"utto%', 694, 6333), ('%"alla', 6333, 10477), ('sin"istra%', 10477, 25731), ('margi+', 37599, 47610), ('s"i', 10016, 16372), ('s"i', 3013, 8870)]

Sentences with TIMIT time alignment:

    >>> print clips.sent_times(item)
    [('akk"anto%', 8264, 20419), ('%a', 20419, 21789), ('sin"istra', 21789, 37007), ('__%', 0, 694), ('%t"utto%', 694, 6333), ('%"alla', 6333, 10477), ('sin"istra%', 10477, 25731), ('margi+', 37599, 47610), ('s"i', 10016, 16372), ('s"i', 3013, 8870)]

Play a sentence:

    >>> clips.play(item)

Play from the first to the third word:

    >>> clips.play(item, clips.ut_start(item,0), clips.ut_end(item,2))

If the start position is left out, the beginning of the sentence is assumed; if the end position is left out, the end of the sentence is assumed.
    >>> clips.play(item, clips.ut_start(item,4))

You can also play one or more phones, in this case from the second to the fifth ([kanto]):

    >>> clips.play(item, clips.ut_start(item,1), clips.ut_end(item,4), phone = 1)

Print a tree containing the orthographic and the phonemic transcription of a sentence:

    >>> for tree in clips.phone_trees(item):
    ...     print tree
    (S
      (__% __%)
      (%quindi %kw"indi)
      (bisogna biz"OJJa)
      (prepararsi prepar"arsi)
      (per per)
      (metter m"etter)
      (le le)
      (piante% pj"ante%)
      (%in %in)
      (condizione kondittsj"one)
      (di di)
      (autodifese autodif"eze))

These methods combined together are extremely powerful. This simple script prints the entire corpus:

    # import clips corpus reader
    from nltk.corpus import clips
    # all the utterances in the corpus
    item = clips.utteranceids()
    # for every sentence
    for it in item:
        print it + ":"
        # print the sentence with timit indicators
        print clips.sent_times(it)
        # for every word in the sentence print phonemic and orthographic
        # transcription with timit
        for word, phone in zip(clips.word_times(it), clips.phoneme_times(it)):
            print "%s -> %s" % (phone, word)

Output:

    DGmtA01L_p1F/203:
    [('o"kay % %tSisj"amo', 0, 35656)]
    ('o"kay', 3543, 15284) -> (u'okay', 3543, 15284)
    ('%tSi', 23086, 27684) -> (u'%ci', 23086, 27684)
    ('sj"amo', 27684, 35656) -> (u'siamo', 27684, 35656)
    DGmtA01L_p1G/1:
    [('all"oram"arko', 0, 31267)]
    ('all"ora', 9865, 17344) -> (u'allora', 9865, 17344)
    ('m"arko', 17344, 31267) -> (u'Marco', 17344, 31267)

2. SY and NLTK

SY can be used from the command line by specifying the sequence to syllabify as an argument:

    $ python syllable.py 'colore'
    co.lo.re

If the argument verbose is specified, the entire syllabification procedure is printed to standard output. The program's messages are in Italian: "Trovato minimo di sonorita" means "sonority minimum found", "Confini di sillaba" means "syllable boundaries" and "Sequenza fonematica da dividere in sillabe" means "phonemic sequence to divide into syllables".

    >>> do_syllabify('lakarta', verbose = 1)
    lakarta
    Trovato minimo di sonorita: k
    Trovato minimo di sonorita: t
    Confini di sillaba: [2, 5]
    la.kar.ta

It is also possible to run the program without arguments. In this case the user is prompted for the word to syllabify and the syllabification procedure is shown.

    $ python syllable.py
    Sequenza fonematica da dividere in sillabe: colore
    colore
    Trovato minimo di sonorita: l
    Trovato minimo di sonorita: r
    Confini di sillaba: [2, 4]
    co.lo.re
    Sequenza fonematica da dividere in sillabe: kasa
    ...

The most interesting use of SY exploits the interaction between SY, NLTK and CLIPS, which makes it possible to syllabify any part of the corpus, query it interactively and obtain statistical and categorical information with ease. In the next sections I will finally show how the three components can be combined, how SY can be exploited to syllabify the corpus and how to use syllabified data with NLTK. First, import two modules: the CR and SY.

    >>> import syllable
    >>> from nltk.corpus import clips

Now it is possible to query the corpus and syllabify the output. As said in the previous section, the CR returns different data types depending on the linguistic unit to parse. However, SY is designed to syllabify any input received from the CR regardless of its nature. First, we define an object item which contains the id of a corpus unit, in this case the utterance at index 5 of the DG sub-corpus.

    >>> item = clips.utteranceids('DG')[5]

You can get each word syllabified by using the method syllabify().
    >>> syllable.syllabify(clips.phonemes(item))
    ['a.kk"an.to', 'a', 'si.n"i.stra']

If a sentence has to be syllabified, it is considered as a sequence of phonemes and syllabification applies without considering word boundaries.

    >>> syllable.syllabify(clips.sents(item))
    ['a.kk"an.toa.si.n"i.stra']

As you can see, the sequence toa is treated as a single unit due to resyllabification. You can also syllabify each word separately.

    >>> for word in clips.phonemes(item):
    ...     print syllable.syllabify(word)
    ...
    ['a.kk"an.to']
    ['a']
    ['si.n"i.stra']

Or syllabify a single word and use the verbose mode.

    >>> syllable.syllabify(clips.phonemes(item)[0], verbose = 1)
    single word: akk"anto
    Confini di sillaba: [1, 6]
    a.kk"an.to
    ['a.kk"an.to']

Finally, it is possible to display the TIMIT time alignment, as well as any other available information, in the desired layout. For example, this simple code displays the entire sentence, its syllabification, the phonological transcription of each word, the orthographic transcription and its time alignment.

    print clips.sent_times(item), ' > ', syllable.syllabify(clips.sents(item))
    for word, phoneme, syll in zip(clips.word_times(item), clips.phoneme_times(item), \
                                   syllable.syllabify(clips.phonemes(item))):
        print word[0], word[1], '-', word[2], ':', phoneme[0], '>', syll

Output:

    [('akk"anto% %asin"istra', 0, 37007)] > 'a.kk"an.to%%a.si.n"i.stra'
    accanto% 8264 - 20419 : akk"anto% > a.kk"an.to
    %a 20419 - 21789 : %a > a
    sinistra 21789 - 37007 : sin"istra > si.n"i.stra

3. NLTK and SY

This paragraph illustrates the potential of NLTK, SY and CLIPS and only serves as a demonstration of what can be done with them [51]. Having NLTK interact with both the corpus and SY makes it possible to exploit NLTK functionalities to analyse CLIPS. In this paragraph I will explore NLTK's statistical processing, a feature of particular interest in corpus linguistics. CLIPS was designed as a support for automatic speech processing applications but also for the statistical analysis of spoken Italian. Note that I will show only some of the functionalities offered by NLTK and use them on syllabified output. It is always possible to do the same kind of processing at the lexical, phonological or phonetic level, as well as to investigate other NLTK features that I will not discuss here (for a complete reference see S. Bird, 2009). For example, a phonotactic study of Italian could be particularly interesting in relation to the syllable. In this case you would use the phonological or the phonetic layer instead of the syllable.

[51] Note that the data given in this paragraph do not prove any linguistic theory. They can be exploited in linguistic analysis, but that would require them to be discussed and analysed. Moreover, I will be using only a sample portion of CLIPS and not the entire corpus.

In paragraph 4.2 we said that non-native clusters in spoken Italian are so rare that they could be ignored. This simple script gives us a clue about it [52].

    import nltk
    from nltk import FreqDist
    from nltk.corpus import clips
    import syllable, re

    item = clips.utteranceids('DG')
    clusters = ['pr', 'sp', 'rt', 'pt', 'ft', 'fn', 'pn']
    for cluster in clips.phonemes(item):
        m = re.search(r'(pr|sp|rt|pt|ft|fn|pn)', cluster)
        if m:
            clusters.append(m.group())
    fdist1 = FreqDist(clusters)
    fdist1.tabulate()

     pr   rt   sp   ft   pt   pn   fn
    866  656  515    1    1    1    1

[52] In order to have all the results shown in the table, the counts had to be initialised to 1. In fact, the real result shows that none of the non-native clusters occurs in the corpus!

Before working on CLIPS syllables it is recommended to add a new layer SYL to the corpus. This way the CR will have direct access to CLIPS syllables and won't have to syllabify and store the entire corpus (which could be particularly time consuming) every time the program is run.
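As a rough illustration of what the content of such a SYL layer could look like, the Python 3 sketch below formats syllabified words as one "start end label" line per unit, the line layout used by TIMIT transcription files. The function name format_syl_layer is hypothetical, and the sketch is a simplification: it aligns whole syllabified words rather than single syllables, since the examples in this section only provide word-level time alignments.

```python
def format_syl_layer(phoneme_times, syllabified):
    """Return the text of a SYL layer: one 'start end label' line per unit."""
    if len(phoneme_times) != len(syllabified):
        raise ValueError('the two sequences must have the same length')
    lines = []
    for (_label, start, end), syll in zip(phoneme_times, syllabified):
        # keep the time alignment, replace the label with its syllabification
        lines.append('%d %d %s' % (start, end, syll))
    return '\n'.join(lines) + '\n'


# Values copied from the examples in this section.
phoneme_times = [('akk"anto%', 8264, 20419), ('%a', 20419, 21789),
                 ('sin"istra', 21789, 37007)]
syllabified = ['a.kk"an.to', 'a', 'si.n"i.stra']
print(format_syl_layer(phoneme_times, syllabified), end='')
```

Writing the returned text to a file next to the other transcription files would let the syllabified data be read back instead of recomputed on every run.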
The new layer will have the characteristics of all other layers and will be saved as text files in TIMIT format. This approach is strongly recommended for research: since the corpus transcripts are permanent and immutable, there is no reason to process them again if the corpus has not changed. Moreover, as said before, this creates another representation of the corpus at the syllabic layer and brings all the benefits argued for the other transcripts. In this paragraph I will nevertheless syllabify the corpus on the fly, keeping SY and the CR separate, because the operation is slightly more complicated than having a syllabic layer integrated with the corpus and because it better shows the functioning of SY itself. The first operation is to syllabify the entire corpus. This can be done by creating a list containing the desired syllabified units (such as phonemes and sentences). To reduce the processing time we will process only the first 1000 units of the DG corpus. Note that the verbose argument indicates that we do not want any message on stdout.
    item = clips.utteranceids('DG')[:1000]
    std_words = [syll for syll in syllable.syllabify(clips.phonemes(item), verbose = -1 )]
    sent_sylls = [syll for syll in syllable.syllabify(clips.sents(item), verbose = -1 )]

The two lists will look like the following:

    >>> std_words[10:20]
    ['s"i', '".io', '"O', 'un', 'p"e.tti.ne', '"a.lla', 'm".ia', 'si.n"i.stra', 'lon.t"a.no', 's"i']
    >>> sent_sylls[1:4]
    ['s".i."E.kko.nO.non.tSe.l".O%%".io.nO',
     's".i."E.kko.nO.non.tSe.l".O%%".io.nO.s"i.s"i.s".i.".io%%".O%%un.p"e.tti.ne%%"a.lla.m".ia.si.n"i.stra.lon.t"a.no',
     's".i."E.kko.nO.non.tSe.l".O%%".io.nO.s"i.s"i.s".i.".io%%".O%%un.p"e.tti.ne%%"a.lla.m".ia.si.n"i.stra.lon.t"a.no.s"i']

It is also possible to have syllabification done on phonological classes or CV structures by using the cvsyll() function:

    classed_words = [syllable.cvsyll(syll,1) for syll in syllable.syllabify(clips.phonemes(item), verbose = -1 )]
    cv_words = [syllable.cvsyll(syll) for syll in syllable.syllabify(clips.phonemes(item),verbose=-1 )]

The content of the lists would then be the following:

    >>> cv_words[:10]
    ['CV', 'V.CCV', 'CV', 'CVC', 'CCV', 'CV', '.VV', 'CV', 'CV', 'CV']
    >>> classed_words[:10]
    ['OV', 'V.OOV', 'SV', 'SVS', 'OOV', 'SV', '.VV', 'SV', 'OV', 'OV']

NLTK's frequency module computes statistics over the elements of a list. In other words, to have the module count syllables (or any other linguistic unit), each element of the list must correspond to a syllable (or to the desired linguistic unit). The following code first joins all the strings of the list (i.e. words or sentences) into a single string using a dot as separator and then splits it at every dot. For example, ['ka.za', 'di', 'lu.ka'] is first merged into 'ka.za.di.lu.ka' and then divided into ['ka', 'za', 'di', 'lu', 'ka'].
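The join-and-split step just described can be tried without the corpus. The following Python 3 sketch uses toy data and collections.Counter from the standard library as a stand-in for NLTK's FreqDist; the function name flatten_syllables is hypothetical, and dropping empty strings (which arise when a unit starts with a dot, as in '.VV') is a design choice of this sketch rather than something the listings in this section do.

```python
from collections import Counter


def flatten_syllables(units):
    """Turn a list of dot-separated units into a flat list of syllables."""
    joined = '.'.join(units)                    # 'ka.za.di.lu.ka'
    # split at every dot and drop the empty pieces left by leading dots
    return [syll for syll in joined.split('.') if syll]


words = ['ka.za', 'di', 'lu.ka']                # toy input, not corpus data
syllables = flatten_syllables(words)
counts = Counter(syllables)

print(syllables)
print(counts.most_common(1))
```

Here the flat list is ['ka', 'za', 'di', 'lu', 'ka'] and the most common syllable is 'ka', with two occurrences.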
    # create a list whose elements are syllables
    w_sylls = [syll for syll in '.'.join(std_words).split('.')]
    c_sylls = [syll for syll in '.'.join(classed_words).split('.')]
    cv_sylls = [syll for syll in '.'.join(cv_words).split('.')]
    s_sylls = [syll for syll in '.'.join(sent_sylls).split('.')]

Now it is possible to have the lists processed by NLTK's FreqDist class. First of all we have to import it and initialise a frequency object.

    from nltk import FreqDist
    fdist1 = FreqDist(w_sylls)

The object's methods give access to frequency information and graphs. The next lines of code obtain generic statistical information, in this case about CLIPS's syllabified phonemic transcripts. Note the frequency of the syllables 'si', 'no' and 'la' in the DG corpus, which is probably lexical.

    print "Computed syllables: ", fdist1.N()
    print "100 most frequent syllables"
    for i in range(10, 101, 10):
        vocabulary1 = fdist1.keys()[i-10:i]
        print i-10, '-', i ,vocabulary1
    print "Most frequent syllable:", fdist1.max()
    print "Recurrences:", fdist1[fdist1.max()]
    print "Frequency:", fdist1.freq(fdist1.max())

And this is the output:

    Computed syllables: 12585
    100 most frequent syllables
    0 - 10 ['a', 's"i', 'o', 'lla', 'la', 'e', 'no', 'na', 'te', 're']
    10 - 20 ['tto', 'le', 'stra', 'di', 'vi', 'il', 'ra', '"E', 'ne', 'to']
    20 - 30 ['k"Ei', 's"o', 'in', 'ta', 'd"e', 'pra', 'si', 'll"o', 's"O', 'nO']
    30 - 40 ['n"i', 'ti', 'ssa', '"u', 'un', 'do', 'ma', 'io', 'ke', 'm"a']
    40 - 50 ['"a', 'tSi', 'p"a', 'ko', '"', 'non', 'da', 'so', 'd"E', 'kki']
    50 - 60 ['va', 'v"Er', 'lle', 'kw"e', 'tra', 'p"Oi', 'mo', 't"or', 'me', 'r"o']
    60 - 70 ['pj"u', 'tu', 'd"a', 'tSe', 'ri', 'sso', 't"i', 'se', 'm"en', 'li']
    70 - 80 ['tta', 'pa', 'po', '"O', 'l"O', 'al', 'del', 've', 'tS"E', 'l"i']
    80 - 90 ['per', 'v"a', 'ro', 'st"E', '"ai', 'vo', 'f"a', 'ka', 'd"', 'ue']
    90 - 100 ['tti', 'kko', 'v"ai', 'tro', 'rri', 'm"E', 'tSo', 'd"o', 'kkj"a', 'Li']
    Most frequent syllable: a
    Recurrences: 379
    Frequency: 0.0301152165276

It is also possible to obtain frequency and cumulative frequency distribution plots (images 4.1 and 4.2).

    print "Syllable Frequency Distribution Plot..."
    fdist1.plot(25)
    print "Syllable Cumulative Frequency Distribution Plot..."
    fdist1.plot(25, cumulative=True)

The argument of the function specifies the number of results shown in the plots. Thanks to Python, you can obtain not only simple frequency information on the object but also more complex and customised information in a clear and compact way. In this example we create a first list containing only syllable lengths, initialise another frequency object and get frequency information. The function tabulate() formats the results in a table.

    len_sylls = [len(syll) for syll in cv_sylls if 0 < len(syll) < 6]
    fdist2 = FreqDist(len_sylls)
    print 'Most frequent syllable lengths'
    fdist2.tabulate()

The result will be the following.

    Most frequent syllable lengths
       2    3    1    4    5
    6660 3893 1344  563   46

Image 4.1: Syllable Frequency Distribution Plot.

Image 4.2: Syllable Cumulative Frequency Distribution Plot.

You may also use the phonological class list to get the structure of the most frequent long syllables (those containing more than two elements). Again, the code is straightforward: you just have to create a list containing only syllables longer than 2 and initialise another FreqDist object.

    long_sylls = [syll for syll in c_sylls if len(syll) > 2]
    fdist4 = FreqDist(long_sylls)
    print '10 Most frequent long syllables (l > 2)', fdist4.keys()[:10]

Output:

    10 Most frequent long syllables (l > 2) ['OOV', 'SSV', 'OVS', 'OSV', 'OGV', 'OVV', 'SVS', 'OOSV', 'FVS', 'OOVS']

Another interesting NLTK feature is the conditional frequency module.
CLIPS is well structured and conditional frequency can be used to highlight differences between linguistic varieties (diatopic, diaphasic and so on). In the next example I will use the function clips.corpinfo() to obtain a list of CLIPS sub-corpora (such as 'DG', 'TV', 'TL'). These corpora will be paired with the CV syllabification of their data. The resulting pairs will be used to create a frequency object with the same characteristics as the previous ones. Note that only the first 200 utterances of each corpus are processed.

    cv_syllsP = [(subc, syllable.cvsyll(syll))
                 for subc in clips.corpinfo('subcorpora')
                 for syll in '.'.join(syllable.syllabify(
                     clips.phonemes(clips.utteranceids(subc)[:200]),
                     verbose = -1 )).split('.')]

The pairs will look like the following:

    >>> for pair in cv_syllsP[-3:]:
    ...     print pair
    ('DG', 'CV')
    ('DG', 'CCV')

This time we have to initialise a conditional FreqDist object.

    cfd = nltk.ConditionalFreqDist(cv_syllsP)

Once initialised, the object makes it possible to query each condition. For example, we may want to obtain the number of occurrences of a particular element in a sub-corpus

    >>> print 'CV occurences in DG:', cfd['DG']['CV']
    CV occurences in DG: 1291

Or get a plot of syllable structure occurrences in the various sub-corpora. We can also specify the conditions argument to select only particular sub-corpora to compare at once.

4. Further studies

The aim of this thesis was to design and implement a robust system for the automatic syllabification of CLIPS. In the last section I showed how this system can be exploited to get statistical information and how to create a SYL layer in a completely automatic fashion. But SY and the CR constitute just the basis for future investigation of the subject. Python and NLTK offer a very wide range of possibilities and great versatility that could be exploited for the linguistic analysis of CLIPS.
Moreover, the phonological syllabification could be used as a reference or as a basis for the syllabification of PHN. A syllabified phonetic transcription of the corpus would allow the segmentation of the signal and therefore the possibility of exploiting it in numerous ways. For example, an ANN could be trained using phonetic syllables and their corresponding signal portions for speech recognition, to describe the acoustic characteristics of the most frequent syllables, or for text-to-speech systems.

5 Conclusion

In the first chapter, I showed that different syllabification principles and definitions have been proposed for Italian. The biggest problem lies in how to divide particular sequences of two segments, in other words in whether some types of clusters are heterosyllabic or tautosyllabic. Syllabification principles basically diverge only on the syllabification of sC clusters, geminates, sequences of vocoids and non-native clusters. Depending on the theory, these conflicts are resolved by means of different interacting constraints (Optimality Theory), converging levels of representation (Autosegmental representation), exceptions or variations of a principle (Sonority Distance principle) and so on. However, because of various interacting factors, such as the phonotactic knowledge of speakers, the few occurrences of non-native clusters, the rules learned at school for orthographic syllabification, a possible grey area of unpredictable phenomena or an ongoing change in the language, it has been difficult to provide experimental or external evidence to support a definitive principle or syllabification.

In the second chapter, two computational models are presented in the light of the everlasting epistemological and linguistic debate of empiricism versus rationalism. The brief literature review shows how different models can be more suited to reflect phonological theories and principles.
However, as a result of the investigation in chapter I, I conclude that in most cases the accuracy results given by authors are fallacious. As no gold standard exists, it is illogical to argue that a given percentage of syllabifications is 'correct'. In my opinion, the only way to test this kind of algorithm is to verify the obtained syllabification against the expected one, that is, to compare it with a corpus of syllabifications resulting from the application of the same principles by a human. In this case the performance and accuracy of an algorithmic solution can be tested, but not the accuracy of the syllabification itself. For example, Marchand (1999), to argue the superiority of data-driven over rule-based models, claims that Hammond's algorithm can correctly syllabify only 30% of words while his own data-driven algorithm does better. But the data-driven algorithms are trained using dictionary syllabifications and then tested against dictionaries. The division between testing data and training data is obviously maintained but, as argued in chapter I, the syllabification given by dictionaries is among the least reliable from a phonological perspective, and there is little value in comparing it with a computational implementation of OT. This is the reason why some authors avoid giving this kind of information or summarise it in the correct terms. In order to determine the best syllabification principles or solutions, the purposes of the research play a fundamental role. A model used to shed light on a phonological theory will differ from an algorithm developed for engineering goals or for the investigation of speaker behaviour.

In chapter III, I describe CLIPS, the largest Italian corpus of spoken language. One of its most relevant characteristics is that it comes with time-aligned phonological, phonetic, subphonetic and lexical transcriptions.
I decided to work on the phonological transcription because, as said in the previous chapter, the orthographic one would be irrelevant to analyse and the phonetic one is too complex at a first stage (because of phenomena such as segment alteration, epenthesis and deletion). One of the main purposes of the corpus was to provide a support for statistical and probabilistic language analysis, especially in the field of speech processing applications. For this reason, and to exploit the time-aligned representation of the signal, it was decided in chapter IV to design an SSP-based model. The SSP was chosen because it was the only phonological principle relying on a property that could be traced back to the signal, in particular to the energy profile. A similar system was also developed by Cutugno et al. (2001) and applied to a portion of an annotated spoken language corpus. However, the application of the principle resulted in evidently erroneous syllabifications; for example, words such as Carlo 'Charles' were syllabified as /ka.rlo/ by the strict application of the SH. To avoid similar problems, a conditional statement treated sonorants as an exception, relying on syllable structure and neighbouring segments, following an approach similar to the normativist syllabification rules for Italian enumerated in chapter I. Again, this approach led to unacceptable syllabifications. Geminates were divided in the case of sonorants and kept together in other cases (/gal.lo/ 'cock' vs /ga.tto/ 'cat'), /s/ remained extrasyllabic and in some cases the algorithm returned syllables without a nucleus. Of particular interest was the extrasyllabicity of /s/ both word-marginally, as in strano 'strange' /s.tra.no/, and word-internally, as in extra 'extra' /ek.s.tra/, which, as shown below, is not suited to this kind of system.
In addition, the four issues of Italian syllabification (i.e., sC clusters, geminates, non-native clusters, sequences of vocoids) were not sufficiently discussed. The solution I adopted was based on two assumptions: the system had to rely on a general phonological principle, applying without exception to all possible cases; and great importance had to be given to the purpose of the software, that is the syllabification of the time-aligned phonological transcription of a corpus created to support speech applications. To avoid the syllabification problems deriving from the strict application of the SSP, as shown in Cutugno et al. (2001), it was noted that it suffices to change the sonority value of /l/ (cf. APPENDIX A) and apply the SSP straightforwardly to this scale to obtain only acceptable syllabifications (cf. APPENDIX B). In fact, while the SSP is considered to be universal, it is widely accepted in the literature that the SH admits intralinguistic variation. By changing the sonority value of a phone, it has been possible to apply the principle without clumsy exceptions. Tautosyllabicity was preferred both for sC clusters and geminates. Concerning the latter, this is the only case of conflict with the phonological theory. I believe in fact that, from a phonological perspective, geminates are heterosyllabic in Italian, but it was necessary for the purpose of the program to keep them in the same syllable. Tautosyllabicity is in fact preferred in automatic signal segmentation systems. First, it is hard to determine the position of the syllable boundary in the production of non-continuous phones, as there is no margin of decreasing sonority. Therefore, it is also impossible to divide the signal into two identical units.
Second, by choosing such a syllabification it is possible to distinguish, without contextual information, which syllables derive from geminates; third, less variability in syllable structure and types is obtained, since the same sequence is always included in the same syllable and therefore always associated with the same portion of the signal. For these reasons, and to avoid discrepancies in the syllabification system, the sC cluster was syllabified as tautosyllabic. By simply changing the sonority value of /s/ it has also been possible to avoid extrasyllabicity both word-internally and word-marginally, resulting in /stra.no/ instead of /s.tra.no/ and in /ek.stra/ instead of /ek.s.tra/, and again to obtain all the advantages stated for the syllabification of geminates. This point is particularly important for speech applications, because otherwise the floating extrasyllabic phone would have had to be reassigned to an adjacent syllable, again by means of exceptions, additional rules or even post-lexical resyllabification, in order to have it correctly analysed.

However, to demonstrate that the principle discussed so far is relevant first of all from a phonological perspective, the theory has to be able to handle geminates and sC clusters correctly, to explain the reason for the ambiguity of the sC cluster and, if possible, to give further evidence for the change made to the SH. As noted in 1.2.7, in the strict interpretation of the SSP a syllable boundary is placed if sonority does not decrease, which includes the case of a sonority plateau. The application of this simple observation leads to a phonological syllabification system, which succeeds in treating geminates and sC clusters as heterosyllabic with no other changes. The SSP and SH are kept unchanged, but the obtained syllabification (APPENDIX C) reflects the one predicted by the phonological theory. The advantages of this solution are even more important for phonological theory.
Only a single principle is used to account for the syllabification system of Italian, and no arithmetic operations on sonorities have to be postulated to justify specific cases (cf. the SD principle). For this reason, the theory may have a legitimate cognitive value, and it may possibly be confirmed by the ambiguous behaviour of sC clusters: by changing the sonority of /s/ from 1 to 0, the tautosyllabic syllabification of sC clusters is in fact obtained, but with no word-internal extrasyllabicity (e.g., /e.kstra/). Within the theory, the possible diachronic shift of sC cluster syllabification from tautosyllabicity to heterosyllabicity argued by Bertinetto (1999) can simply be explained in terms of a sonority loss of the phoneme /s/.

By changing the sonority value of two phonemes it was possible to obtain an organic and effective syllabification system which relies entirely on the SSP, without the need for rules, exceptions or relative sonority values. The resulting syllabification is well suited to phonological analyses and, most importantly, to automatic signal processing, especially for training speech recognition or text-to-speech systems, thus allowing CLIPS to be exploited for its original purpose. The syllabifier and the corpus reader have been developed and tested, and are freely available to be downloaded and used for any kind of research. Further studies are obviously necessary, but this study can constitute an excellent basis for a multitude of possible future works and applications.
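The two syllabifications discussed in this conclusion can be illustrated with a minimal Python sketch. It is not the code of the program described in this work: the function names (`tokenize`, `syllabify`) and the `plateau` parameter are hypothetical, and the two boundary conditions are an assumption inferred from the sample outputs in APPENDIX B and APPENDIX C. The sonority values are those of APPENDIX A, with the stress and boundary symbols left out. In both modes a boundary is placed at a sonority trough; the only difference is whether a plateau is kept whole in the following syllable (`"onset"`) or split so that only its last phone opens the new syllable (`"split"`).

```python
# Sonority values copied from APPENDIX A (symbols such as the stress mark are omitted).
SONORITY = {
    "a": 26, "E": 24, "O": 24, "e": 22, "o": 22, "i": 19, "u": 19, "Q": 19,
    "j": 18, "y": 18, "w": 18,
    "L": 14, "r": 14, "l": 12, "m": 11, "n": 11,
    "v": 9, "z": 9, "Z": 8, "f": 7, "S": 7, "h": 7,
    "dZ": 6, "dz": 6, "tS": 4, "ts": 4,
    "b": 3, "d": 3, "g": 3, "p": 1, "t": 1, "k": 1, "s": 1,
}


def tokenize(word):
    """Split a transcription into phones, preferring two-character symbols."""
    phones, i = [], 0
    while i < len(word):
        size = 2 if word[i:i + 2] in SONORITY else 1
        phones.append(word[i:i + size])
        i += size
    return phones


def syllabify(word, plateau="onset"):
    """Place syllable boundaries at sonority troughs.

    plateau="onset": a plateau at the trough goes entirely to the next syllable.
    plateau="split": only the last phone of the plateau opens the next syllable.
    """
    phones = tokenize(word)
    if not phones:
        return ""
    son = [SONORITY[p] for p in phones]
    syllables, current = [], [phones[0]]
    for i in range(1, len(phones)):
        # The last phone has no successor, so it can never open a syllable.
        nxt = son[i + 1] if i + 1 < len(phones) else -1
        if plateau == "onset":
            boundary = son[i - 1] > son[i] <= nxt
        else:
            boundary = son[i - 1] >= son[i] < nxt
        if boundary:
            syllables.append("".join(current))
            current = []
        current.append(phones[i])
    syllables.append("".join(current))
    return ".".join(syllables)


for w in ("pasta", "strano", "Ekstra", "gatto", "sinaptiko"):
    print(w, syllabify(w), syllabify(w, plateau="split"))
```

Under these assumptions the `"onset"` mode gives the tautosyllabic forms used for signal segmentation, and the `"split"` mode gives the heterosyllabic forms expected by phonological theory, without touching the sonority scale. The greedy tokenizer always reads `ts`, `tS`, `dz` and `dZ` as affricates, which is a simplification.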
APPENDIX A: SONORITY SCALE

[Sonorities]
# Affricates
dZ = 6
dz = 6
tS = 4
ts = 4
# Stops
b = 3
d = 3
g = 3
p = 1
t = 1
k = 1
s = 1
# Symbols
- = 99
% = 99
" = 99
_ = 0
# Vowels
a = 26
E = 24
O = 24
e = 22
o = 22
i = 19
u = 19
Q = 19
# Approximants
j = 18
y = 18
w = 18
# Sonorants
L = 14
r = 14
l = 12
m = 11
n = 11
# Fricatives
v = 9
z = 9
Z = 8
f = 7
S = 7
h = 7

APPENDIX B: SAMPLE SYLLABIFICATION OUTPUT

CL clusters (pl, kr, dr etc.):
['pa.dre'] ['li.tro'] ['ka.pra']
LC clusters (lp, rt, rp etc.):
['kol.pa'] ['ar.to'] ['ar.pa']
sC clusters:
['pa.sta'] ['stra.no'] ['E.kstra']
Geminates:
['ga.tto'] ['ga.llo']
Second vowel stressed, hiatus:
['pa."u.ra'] ['pao.lo']
Non native clusters:
['di.sle.ssia'] ['bi.sno.nno'] ['te.kni.ka'] ['si.na.pti.ko'] ['ka.psu.la'] ['naf.ta'] ['a.tlan.te'] ['do.gma'] ['a.bnor.me'] ['a.fnio']

APPENDIX C: PHONOLOGICAL SYLLABIFICATION

CL clusters (pl, kr, dr etc.):
['pa.dre'] ['li.tro'] ['ka.pra']
LC clusters (lp, rt, rp etc.):
['kol.pa'] ['ar.to'] ['ar.pa']
sC clusters:
['pas.ta'] ['s.tra.no'] ['Eks.tra']
Geminates:
['gat.to'] ['gal.lo']
Second vowel stressed, hiatus:
['pa."u.ra'] ['pao.lo']
Non native clusters:
['di.sles.sia'] ['bi.snon.no'] ['te.kni.ka'] ['si.nap.ti.ko'] ['kap.su.la'] ['naf.ta'] ['a.tlan.te'] ['do.gma'] ['a.bnor.me'] ['a.fnio']

Bibliography

Adsett, C.R., Marchand, Y. & Kešelj, V., 2009. Syllabification rules versus data-driven methods in a language with low syllabic complexity: The case of Italian. Computer Speech & Language.
Amari, S.I. & Kasabov, N., 1998. Brain-like computing and intelligent information systems, Springer-Verlag Singapore Pte. Limited.
Anderson, A.H. et al., 1992. The HCRC map task corpus, Human Communication Research Centre.
Atkeson, C.G., Moore, A.W. & Schaal, S., 1997. Locally weighted learning. Artificial Intelligence Review, 11(1), 11–73.
Bach, E. & Wheeler, D., 1981.
Montague phonology: a first approximation. University of Massachusetts Occasional Papers in Linguistics, 7, 27–45.
Bertinetto, P.M., 1999. La sillabazione dei nessi sC in Italiano: un'eccezione alla tendenza 'universale'. In Fonologia e morfologia dell'italiano e dei dialetti d'Italia: atti del XXXI Congresso della Società di linguistica italiana. pp. 71–96.
Bird, S., 2005. NLTK-Lite: Efficient scripting for natural language processing. In Proceedings of the 4th International Conference on Natural Language Processing (ICON). pp. 11–18.
Bird, S., Klein, E. & Loper, E., 2009. Natural Language Processing with Python, O'Reilly & Associates Inc.
Bird, S. & Loper, E., 2004. NLTK: the natural language toolkit. Proceedings of the ACL demonstration session, 214–217.
Black, H.A., 1993. Constraint-Ranked Derivation: A Serial Approach to Optimization, University of California, Santa Cruz.
Blevins, J. & Goldsmith, J., 1995. The syllable in phonological theory. 1995, 206–244.
Bloch, B., 1948. A set of postulates for phonemic analysis. Language, 3–46.
Bloomfield, L. & Kess, J.F., 1983. An introduction to the study of language, J. Benjamins Pub Co.
Bonomi, A., Falcone, M. & Barone, A., 2007. Definizione e caratterizzazione di un database vocale ortofonico realizzato da parlanti professionisti in camera anecoica. Available at: http://www.clips.unina.it/downloads/8_definizione%20database%20ortofonico.pdf.
Broselow, E., 1982. On predicting the interaction of stress and epenthesis. Glossa, 16(2), 115–132.
Brown, G. et al., 1984. Teaching talk: Strategies for production and assessment, Cambridge: Cambridge University Press.
Bruni, F., 1992. L'italiano nelle regioni: lingua nazionale e identità regionali, Utet.
Calderone, B. & Bertinetto, P.M., 2006. La sillaba come stabilizzatore di forze fonotattiche. Una modellizzazione.
Camilli, A., 1941. Pronuncia e grafia dell’italiano, ed. Piero Fiorelli, Firenze, Sansoni, 3(1965), 1.
Canepari, L., 1999.
Il MaPI, Manuale di pronuncia italiana, Zanichelli.
Cerrato, L., 2007a. Tecniche di elicitazione dialogica. Available at: http://www.clips.unina.it/downloads/2_tecniche%20di%20elicitazione%20dialogica.pdf.
Cerrato, L., 2007b. Sulle tecniche di elicitazione di dialoghi di parlato semi-spontaneo. Available at: http://www.clips.unina.it/downloads/2_tecniche%20di%20elicitazione%20dialogica.pdf.
Chierchia, G., 1986. Length, syllabification and the phonological cycle in Italian. Journal of Italian Linguistics, 8(1), 5–33.
Chomsky, N., 1996. A review of BF Skinner's Verbal Behavior. Readings in language and mind, 413–441.
Chomsky, N., 1965. Aspects of the Theory of Syntax, MIT press.
Chomsky, N., 1959. Review of Skinner's 'Verbal Behaviour'. Language, 35(1).
Chomsky, N., 2002. Syntactic structures, Walter de Gruyter.
Chomsky, N. & Halle, M., 1968. The sound pattern of English.
Clements, G.N. & Goldsmith, J.A., 1984. Autosegmental studies in Bantu tone, Foris Pubns USA.
Clements, G.N. & Keyser, S.J., 1983. CV Phonology. A Generative Theory of the Syllable. Linguistic Inquiry Monographs Cambridge, Mass., (9), 1–191.
Cutler, A. et al., 1986. The syllable's differing role in the segmentation of French and English. Journal of memory and language, 25(4), 385–400.
Cutler, A. & Norris, D., 1988. The role of strong syllables in segmentation for lexical access. Journal of Experimental Psychology: Human perception and performance, 14(1), 113–121.
Cutugno, F., Passaro, G. & Petrillo, M., 2001. Sillabificazione fonologica e sillabificazione fonetica. In Atti del XXXIII Congresso della Società di Linguistica Italiana, Bulzoni, Roma. pp. 205–232.
Cutugno, F., Prosser, S. & Turrini, M., 2000. Audiometria Vocale, Bloomington, MN: GN ReSound.
Cutugno, F., 2006. Criteri per le liste di lettura. Available at: http://www.clips.unina.it/downloads/4_criteri%20per%20le%20liste%20di%20lettura.pdf.
Cutugno, F., 2007a.
Criteri per la definizione delle mappe, esempi di mappe e di vignette per il gioco delle differenze. Available at: http://www.clips.unina.it/downloads/3_definizione%20mappe%20e%20vignette.pdf.
Cutugno, F., 2007b. Criteri per la digitalizzazione del materiale audio CLIPS. Available at: http://www.clips.unina.it/downloads/7_criteri%20per%20la%20digitalizzazione.pdf.
Cutugno, F., 2007c. Specifiche quantitative e indicazioni sulle modalità di registrazione relative alla raccolta di parlato: dialoghi, corpus letto e parlato radiotelevisivo. Available at: http://www.clips.unina.it/downloads/6_modalit%C3%A0%20di%20registrazione%20abc.pdf.
D'Imperio, M. & Rosenthall, S., 1998. Phonetics and Phonology of Italian Main Stress. In Twenty-Eighth Linguistics Symposium on Romance Languages, University Park, Penn.
Daelemans, W. & Van Den Bosch, A., 1992. Generalization performance of backpropagation learning on a syllabification task. In Proceedings of the 3rd Twente Workshop on Language Technology. pp. 27–38.
Daelemans, W. & Van Den Bosch, A., 1997. Language-independent data-oriented grapheme-to-phoneme conversion. Progress in speech synthesis, 77–89.
Daelemans, W. & Van den Bosch, A., 1992. A neural network for hyphenation. Artificial Neural Networks, 2, 1647–1650.
Dale, R., Moisl, H. & Somers, H., 2001. Handbook of natural language processing. Computational Linguistics, 27(4), 602–603.
Danesi, M., 1985. The Italian geminate consonants and recent theories of the syllable. Toronto Working Papers in Linguistics, 6(0).
Dardano, M., 1994. Profilo dell’italiano contemporaneo. Storia della lingua italiana, 343–430.
De Masi, S., 2007. Criteri per la predisposizione delle liste di lettura. Available at: http://www.clips.unina.it/downloads/4_criteri%20per%20le%20liste%20di%20lettura.pdf.
Di Carlo, A. & D'Anna, L., 2007. Definizione del contenuto del corpus telefonico e linee guida per la raccolta.
Available at: http://www.clips.unina.it/downloads/10_definizione%20del%20corpus%20telefonico.pdf.
Edwards, J.A., 1993. Principles and contrasting systems of discourse transcription. Talking data: Transcription and coding in discourse research, 3–31.
Eisner, J., Efficient generation in primitive Optimality Theory.
Ellison, T.M., 1994. Phonological derivation in optimality theory. In Proceedings of the Fifteenth International Conference on Computational Linguistics. pp. 1007–1013.
Falcone, M., Barone, A. & Alessandro, B., 2007. Definizione del database ortofonico in camera anecoica. Available at: http://www.clips.unina.it/downloads/9_descrizione%20del%20corpus%20ortofonico.pdf.
Firth, J.R., 1957. Papers in linguistics, 1934-1951, Oxford University Press.
Firth, J.R., 1948. Sounds and prosodies. Transactions of the Philological Society, 47(1), 127–152.
Fudge, E.C., 1969. Syllables. Journal of Linguistics, 253–286.
Gibbon, D., Moore, R. & Winski, R., 1997. Handbook of standards and resources for spoken language systems, Walter de Gruyter.
Goldsmith, J., 1992. Local modelling in phonology. Connectionism: Theory and Practice, Oxford University Press, Oxford.
Goldsmith, J., 1994. A dynamic computational theory of accent systems. Perspectives in Phonology, 1–28.
Goldsmith, J.A., 1990. Autosegmental and metrical phonology, Basil Blackwell.
Goldsmith, J.A., 1976. Autosegmental phonology, Indiana University Linguistics Club.
Goldsmith, J.A., 1999. Phonological theory: the essential readings, Blackwell Pub.
Gordon, M., 2004. Syllable weight. Phonetically based phonology, 277–312.
Halle, M. & Keyser, S.J., 1971. English stress. Its form, its growth, and its role in verse. New York etc.: Harper & Row.
Hammond, M., 1997. Parsing in OT. Ms., University of Arizona (ROA-222).
Hammond, M., 1995. Syllable parsing in English and French. Arxiv preprint cmp-lg/9506003.
Hayes, B., 1989. Compensatory lengthening in moraic phonology. Linguistic inquiry, 253–306.
Heiberg, A.J., 1999. Features in Optimality Theory: A computational model. The University of Arizona.
Hockett, C.F. & Francis, C., 1955. A manual of phonology, Waverly Press.
Hooper, J.B., 1972. The syllable in phonological theory. Language, 525–540.
Ide, N., Priest-Dorman, G. & Veronis, J., 1996. Corpus encoding standard. URL http://www.cs.vassar.edu/CES, 3.
Jespersen, O., 1913. Lehrbuch der Phonetik, BG Teubner.
Kahn, D., 1976. Syllable-based generalizations in English phonology, Indiana University Linguistics Club.
Kasabov, N.K., 2003. Evolving connectionist systems: Methods and applications in bioinformatics, brain study and intelligent machines, Springer Verlag.
Kasabov, N.K., 1996. Foundations of neural networks, fuzzy systems, and knowledge engineering, The MIT press.
King, S. et al., 1998. Speech recognition via phonetically featured syllables. In Fifth International Conference on Spoken Language Processing.
King, S. & Taylor, P., 2000. Detection of phonological features in continuous speech using neural networks. Computer Speech and Language, 14(4), 333–353.
Kohler, K.J., Pätzold, M. & Simpson, A., 1995. From scenario to segment: the controlled elicitation, transcription, segmentation and labelling of spontaneous speech. Arbeitsberichte Phonetik Kiel, 29, 7.
Labov, W., 1972. Some principles of linguistic methodology. Language in society, 97–120.
Laks, B., 1995. A connectionist account of French syllabification. Lingua, 95(1-3), 51–76.
Laporte, E., Phonetic syllables in French: combinatorics, structure and formal definitions. Acta Linguistica Academiae Scientarum Hungaricae, 41, 175.
Laurinčiukaitė, S. & Lipeika, A., 2006. Syllable-Phoneme based Continuous Speech Recognition. Elektronika ir Elektrotechnika, 6, 70.
Lepsky, A. & Lepsky, G., 1977. The Italian Language Today, London: Hutchinson.
Lesina, R., 1986. Il manuale di stile, Zanichelli.
Levelt, W.J., Roelofs, A. & Meyer, A.S., 1999.
A theory of lexical access in speech production. Behavioral and brain sciences, 22(01), 1–38.
Loper, E. & Bird, S., 2002. NLTK: The natural language toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. pp. 62–69.
Loporcaro, M., 1996. On the analysis of geminates in Standard Italian and Italian dialects. Natural Phonology: The State of the Art, 153–187.
Lowenstamm, J., 1996. CV as the only syllable type. Current trends in Phonology. Models and Methods, edited by Jacques Durand & Bernard Laks, 419–441.
Malmberg, B., 1971. Phonétique générale et romane: Études en allemand, anglais, espagnol et français, Mouton.
Marchand, Y., Adsett, C.R. & Damper, R.I., 2009. Automatic syllabification in English: A comparison of different algorithms. Language and Speech, 52(1), 1.
Marotta, G., 1993. Selezione dell'articolo e sillaba in Italiano: un'interazione totale? Studi di grammatica italiana, 15, 255–296.
McCarthy, J. & Prince, A., 1986. Prosodic Phonology. Ms. University of Massachusetts, Amherst and Brandeis University.
McCarthy, J., 1981. A prosodic theory of nonconcatenative morphology. Linguistic inquiry, 373–418.
McCarthy, J.J., 1979. On stress and syllabification. Linguistic Inquiry, 443–465.
McCarthy, J.J. & Prince, A., 1995. Faithfulness and reduplicative identity. John J. McCarthy, 44.
McCrary, K.M., 2004. Reassessing the Role of the Syllable in Italian Phonology: An Experimental Study of Consonant Cluster Syllabification, Definite Article Allomorphy and Segment Duration. University of California, Los Angeles.
McEnery, T., Wilson, A. & Barnbrook, G., 2001. Corpus linguistics, Edinburgh.
Mehler, J., Dommergues, J.Y. et al., 1981. The syllable's role in speech segmentation. Journal of Verbal Learning & Verbal Behavior, 20(3), 298–305.
Mehler, J., Segui, J. & Frauenfelder, U., 1981. The role of the syllable in language acquisition and perception.
The cognitive representation of speech. Amsterdam: North Holland.
Nespor, M., 1993. Fonologia, Il Mulino.
Nespor, M. & Vogel, I., 1979. Clash avoidance in Italian. Linguistic Inquiry, 467–482.
Nespor, M. & Vogel, I., 1982. Prosodic domains of external sandhi rules. The structure of phonological representations, 1, 225–255.
Ostendorf, M., 1999. Moving beyond the 'beads-on-a-string' model of speech. In Proc. IEEE ASRU Workshop. pp. 79–84.
Oudeyer, P., 2001. The Epigenesis of Syllable Systems: an Operational Model. Language, 167–171.
Prince, A., 1990. Quantitative consequences of rhythmic organization. CLS, 26(2), 355–398.
Prince, A. & Smolensky, P., 2004. Optimality theory, Blackwell.
Prince, A. & Smolensky, P., 1993. Optimality Theory: Constraint interaction in generative grammar.
Pulgram, E., 1970. Syllable, word, nexus, cursus, Mouton.
Raymond, E.S., 2000. Why Python. Linux Journal, 73.
Repetti, L.D., 1989. The bimoraic norm of tonic syllables in Italo-Romance. UCLA.
Rivola, R., 1989. La lingua dei notiziari radiotelevisivi nella Svizzera italiana.
van Rossum, G. & de Boer, J., 1991. Interactively testing remote servers using the Python programming language. CWI Quarterly, 4(4), 283–303.
van Rossum, G. & Drake Jr, F.L., 1993. Python library reference, Technical report, CWI, Amsterdam.
van Rossum, G. & Drake Jr, F.L., 2000. Python reference manual, iUniverse.
van Rossum, G. & Drake, F.L., 2003. Python language reference. Network Theory Ltd.
van Rossum, G. & others, 1994. Python programming language. CWI, Department CST, The Netherlands.
Rubach, J., 1986. Abstract vowels in three dimensional phonology: the yers. The Linguistic Review, 5, 247–280.
Rubach, J. & Booij, G., 1990. Syllable structure assignment in Polish. Phonology, 121–158.
Sabatini, F., 1997. DISC: dizionario italiano Sabatini Coletti, Giunti.
Saussure, F. et al., 1922. Cours de linguistique générale, Payot, Paris.
Savy, R., 2007a.
Specifiche per la trascrizione ortografica annotata dei testi raccolti. Available at: http://www.clips.unina.it/downloads/11_specifiche%20trascrizione%20ortografica.pdf.
Savy, R., 2007b. Specifiche per l'etichettatura dei livelli segmentali. Available at: http://www.clips.unina.it/downloads/12_specifiche%20di%20etichettatura.pdf.
Segui, J., 1984. The syllable: A basic perceptual unit in speech processing. Attention and performance X: Control of language processes, Hillsdale, Erlbaum, 125–149.
Segui, J., Dupoux, E. & Mehler, J., 1991. The role of the syllable in speech segmentation, phoneme identification, and lexical access.
Segui, J., Frauenfelder, U. & Mehler, J., 1981. Phoneme monitoring, syllable monitoring and lexical access. British Journal of Psychology, 72(4), 471–477.
Selkirk, E., 1986. On derived domains in sentence phonology. Phonology Yearbook, 371–405.
Selkirk, E.O., 1980. On prosodic structure and its relation to syntactic structure, Indiana University Linguistics Club.
Selkirk, E.O., 1984. Phonology and syntax: The relation between sound and structure.
Serianni, L. & Castelvecchi, A., 1989. Grammatica italiana, UTET.
Sievers, E., 1876. Grundzüge der Lautphysiologie zur Einführung in das Studium der Lautlehre der indogermanischen Sprachen, Breitkopf und Härtel.
Sjölander, K. & Beskow, J., 2000. Wavesurfer - an open source speech tool. In Sixth International Conference on Spoken Language Processing.
Skinner, B.F. & Frederic, B., 1957. Verbal behavior, Appleton-Century-Crofts New York.
Sobrero, A., 2007. Articolazione diatopica, diamesica e diafasica del corpus Radiotelevisivo. Available at: http://www.clips.unina.it/downloads/5_articolazione%20del%20RTV.pdf.
Sobrero, A. & Tempesta, I., 2007. Definizione delle caratteristiche generali del corpus: informatori, località. Available at: http://www.clips.unina.it/downloads/1_scelta%20informatori%20e%20localit%C3%A0.pdf.
Soetre, R. et al., 2005.
gProt: annotating protein interactions using Google and gene ontology. Lecture notes in computer science, 3683, 1195.
Steriade, D., 1999. Alternatives to syllable-based accounts of consonantal phonotactics. In Proceedings of the 1998 Linguistics and Phonetics Conference. pp. 205–245.
Stetson, R.H., Kelso, J.A. & Munhall, K.G., 1988. RH Stetson's Motor Phonetics, Little Brown and Company.
Stoianov, I., Nerbonne, J. & Bouma, H., 1998a. Modelling the phonotactic structure of natural language words with Simple Recurrent Networks. In Computational Linguistics in the Netherlands, 1997: Proceedings: CLIN Meeting (8th: 1997: Nijmegen, Netherlands). p. 77.
Stoianov, I., Nerbonne, J. & Bouma, H., 1998b. Modelling the phonotactic structure of natural language words with Simple Recurrent Networks. In Computational Linguistics in the Netherlands, 1997: Proceedings: CLIN Meeting (8th: 1997: Nijmegen, Netherlands). p. 77.
Tesar, B., 1995. Computing optimal forms in Optimality Theory: Basic syllabification. Ms., University of Colorado and Rutgers University (ROA-52).
Tesar, B. & Smolensky, P., 1998. Learnability in optimality theory. Linguistic Inquiry, 29, 229–268.
Trommer, J., 2008. Syllable-counting allomorphy by indexed constraints. Talk given at OCP, 5.
Trubeckoj, N.S., 1958. Grundzüge der Phonologie, Vandenhoeck & Ruprecht.
Vennemann, T., 1988. Preference laws for syllable structure and the explanation of sound change: With special reference to German, Germanic, Italian, and Latin, Mouton de Gruyter.
Waltermire, M., 2004. The effect of syllable weight on the determination of spoken stress in Spanish. Laboratory approaches to Spanish phonology, 171–191.
Weerasinghe, R., Wasala, A. & Gamage, K., 2005. A rule based syllabification algorithm for Sinhala. Lecture notes in computer science, 3651, 438.
Weijters, A., 1991. A simple look-up procedure superior to NETtalk.
In Proceedings of the International Conference on Artificial Neural Networks, Espoo, Finland.
Zec, D., 1995. Sonority constraints on syllable structure. Phonology, 85–129.