THE Thesaurus Occitan: A MULTIMEDIA DATABASE DEDICATED TO OCCITAN DIALECTS. PRESENTATION OF ITS MORPHOSYNTAX MODULE Pierre-Aurélien Georges Laboratoire BCL, CNRS UMR 6039 Université Nice Sophia-Antipolis, MSH de Nice (France) Abstract The Module MorphoSyntaxique (abbreviated MMS) is a computer tool especially designed for syntactic and morpho-syntactic analysis of Occitan dialects. It is part of the Thesaurus Occitan multimedia database (of which a general presentation can be found in these proceedings in another article by Guylaine Brun-Trigaud). Following the THESOC’s general guidelines (i.e. localised and oral data only), this module contains both oral texts (including ethnotexts) and single sentences, such as answers to morphosyntactic questionnaires. The “oral data” criteria can be somewhat flexed: even if this module was originally conceived for oral data processing, its part-of-speech tagger and syntactic parser are still able to process written texts so far as they are written in a familiar or popular style, close to oral register. The locations where all these texts and sentences have been harvested are stored in the database, thus enabling on the long term a comparison between different dialects on a morphosyntactical or syntactical basis, thus opening new perspectives for dialectology. Keywords: Comparative syntax; morphosyntax; Occitan dialects; ethnotexts corpus; database 1. General Presentation The THESAURUS OCCITAN (abbreviated Thesoc) is a multimedia database which encompasses oral dialectal data from the whole Occitan domain. Aside from its lexical and 0 Tools for linguistic (anejo ASJU) 107 14/7/10 09:16:09 108 PIERRE-AURÉLIEN GEORGES microtoponymy corpuses, the Thesoc also contains a module dedicated to morphosyntax and syntax analysis of Occitan dialects.1 This Module MorphoSyntaxique (MMS)2 database is aimed at studying oral syntax and morphosyntax in a comparative way. 1.1. Conditions As for the rest of the Thesoc, raw data to be included in this MMS module must match the two following criterions: — Location condition: linguistic data must be precisely located. This constitutes an essential condition for diatopic variation studies. In particular, this condition will enable dynamic and automated generation of linguistic maps on demand in the future. — Orality condition: linguistic data should come from oral sources. Indeed, our philosophy here is to integrate linguistic facts collected under oral form (with IPA transcription), which guarantees the reality of these facts under consideration. Moreover, the Thesoc allows hearing of the audio tracks recorded during the field works, thus giving to the user the possibility to control or to check the proposed transcription. 1.2. Different types of data On the one hand, the database contains a collection of single sentences or isolated sentences: answers to morphosyntactic questionnaires such as PAM,3 sentences found in unpublished survey notebooks of the Atlas linguistiques,4 or sentences published in some of these atlases, such as the ALMC.5 On the other hand, the database also contains a collection of texts, such as ethnotexts collected on the field, or radio broadcastings. Similarly to the other parts of the Thesoc, multimedia documents such as pictures, sound files, or video files, can be attached to a text or a single sentence. 1.3. Text types and orality condition Although this module was originally conceived for treating oral data only, its lemmatiser and syntactic parser presented in sections 2.2. and 2.3. are even capable of treating written texts as long as they are written in a familiar or popular style, close to oral register; thus dimming somewhat the orality condition required in section 1.1. This orality condition may then be eventually reformulated as the following: linguistic data must contain oral syntax or popular/close-to-oral syntax. 1 For a general presentation of all the other aspects of the The soc, please refer to the article by Guylaine Brun-Trigaud also included in these proceedings, and/or (Georges, not yet published, a). 2 It was formerly known as «base Textes» in (Georges 2009) because it originally started has a corpus of texts and ethnotexts, as explained in (Oliviéri 2003). 3 Parler des Alpes Maritimes, supervised by (Dalbera 1994). 4 Atlas Linguistiques de la France par régions, éditions du C.N.R.S, as presented in (Séguy 1973). 5 Atlas Linguistique du Massif Central (Nauton 1957-1963). 0 Tools for linguistic (anejo ASJU) 108 14/7/10 09:16:09 THE THESAURUS OCCITAN: A MULTIMEDIA DATABASE DEDICATED TO OCCITAN... 109 This possibility has allowed us to extend the corpus with other types of texts: some theater plays, articles from popular press, etc. However, each text record from the database contains a field that informs the user about its type/gender. As shown in Figure 1, Ethnotexts are thus easily identifiable by the “genre: Ethnotexte” field content, whereas other types of texts have another tag in this field, such as “Chanson” (song lyrics) or “Presse” (press article) for example. Thanks to this field, it’s always possible to filter out written texts and to focus only on ethnotexts and/or some other types of oral texts: search queries can be configured to show results from the whole corpus or from only certain types of texts specified by the user. This way, linguists can decide whether to stay on strictly oral data, or to also include some type of written texts in order to get a broader corpus. Fig. 1 Example of an ethnotext record 2. Data processing 2.1. Adding new data to the corpus Data can be added to the database through its XML import/export functionalities6 or by typing new records directly within the user interface. The graphical transcription field of a new record can have two different origins, depending on the nature of this record: — If it’s a text (written in a familiar or popular style, close to oral register), for which a graphical transcription is, by definition, already available, it is directly 6 Thus, TEI import / export can also be achieved by using an XSLT filter. 0 Tools for linguistic (anejo ASJU) 109 14/7/10 09:16:09 110 PIERRE-AURÉLIEN GEORGES stored in the database, whatever writing conventions or writing system have been used by its author, since there is no unique official spelling norm (or “orthographe”) for the Occitan dialects as is the case for French or English, but rather several writing systems that competes.7 The database also contains an algorithm which tries to automatically detect which writing system has been used within a text,8 for users’ information. — If it’s based on a sound track recorded on the field, for which the user doesn’t already posses a graphical transcription (as is typically the case with ethnotexts), it’s possible to easily generate a phonological graphical transcription: the user simply presses on a button (button #2 shown in Figure 2 below) to call an automated transcriber which is available in MMS (as in the rest of the Thesoc), that generates a graphical transcription (#3) directly from the IPA transcription (#1), which must therefore be typed or imported in the database first. This automated transcription is based on conversion rules that can be modified by the user and adapted to different phonological systems among dialects, so that the transcriber will automatically use different rule sets to transcribe different dialectal areas, depending on the information given by the locality field of the record to be processed. The result is a phonological graphical transcription close to (Mistral 1979)’s writing system known as “graphie mis- Fig. 2 Processing a text record 7 The three most common writing systems used are “graphie alibertine”, as defined by (Alibert 1966), “graphie mistralienne” as in (Mistral 1979), and in a far smaller proportion, the Eastern part of Occitania sometime uses “graphie italianisante”, which is inspired from Italian’s writing system. 8 This is based on statistical occurrences of some sequences of letters, and these rules are usercustomizable, therefore, any new writing system can be added to this automated recognition feature. Should this algorithm ever fail, on a very short text for example, where the situation is unclear, it is always possible to override it and to manually specify the correct writing system used by the text. 0 Tools for linguistic (anejo ASJU) 110 14/7/10 09:16:10 THE THESAURUS OCCITAN: A MULTIMEDIA DATABASE DEDICATED TO OCCITAN... 111 tralienne”, which provides users a more readable way to access the content of the text, and allows a first level of abstraction that “smoothes” or “hides” phonetic variation.9 In both cases, all further linguistic treatments proposed by the different tools available in MMS are based on this graphical transcription. Thus, while these tools were originally conceived for oral data processing, they are even able to process written texts so far as they are written in a familiar or popular style, close to oral register. This is how it is even technically possible to introduce “oral-style” written texts in the database. Among these tools, the part-of-speech tagger and the syntactical parser automate a great amount of the annotation work, thus simplifying processing of new data, as shown in following sections. 2.2. Lemmatisation process The next step after importing or typing new data in the database is the lemmatisation process (button #4). The part-of-speech tagger identifies each individual lexical element of a sentence or a text by using a reference dictionary embedded in the database. In order to manage efficiently the variation in all its aspects (graphical, dialectal, or inflectional variation), this dictionary is structured into two hierarchical levels: the variantes, which are in fact the lexical occurrences found in the corpus, are grouped under lemmas. This allows performing searches or linguistic treatments either on a particular form (e.g. such inflection, with such graphical transcription, in such dialect), or on all forms associated to a given lemma. Figure 3 below illustrates this two-levels dictionary structure: Lemmas Variantes of a lemma lo determiner «the» fuala fem. sing. Nice li pers. pronoun «him» fuale masc. plur. Nice fòl adjective «crazy» fuali fem. plur. Nice bèl adjective «beautiful» fòlha fem. sing. Toulouse Variantes of a lemma bèl masc. sing. Nice beu masc. sing. Nice bela fem. sing. Nice bella fem. sing. Nice Fig. 3 Database’s dictionary structure 9 Since the objective of MMS is to study syntax and morphosyntax, it’s not a major issue here to disregard phonetic variations in order to simplify the following linguistic treatments performed. 0 Tools for linguistic (anejo ASJU) 111 14/7/10 09:16:11 112 PIERRE-AURÉLIEN GEORGES This schematic illustrates the different types of variation handled: — phonological variation, between bèl (preceding a vowel) et beu (preceding a consonant) in the dialect of Nice, called “Nissart” — dialectal variation, between fòlha in Toulouse’s dialect and fuala in Nissart, — graphical variation, in Nissart, between bela and bella, — and inflectional variation, also in Nissart, between fuala, fuali and fuale. As reference forms, lemmas’ graphical forms are based on entries from (Alibert 1966) when available10 whereas the variantes part of the dictionary is currently populated by entries coming both from the lexical database of the Thesoc and from a set of several paper dictionaries, such as (Eynaudi 1932), that have been digitalized and integrated in the database for this purpose. Moreover, the process of manual lemmatisation of unidentified lexical items sometimes adds new entries to the dictionary, as detailed below in this section, as well as the syntactical-tree annotation of sentences. Thus, dictionary’s content is constantly improved by the lemmatisation process of new texts and/or sentences and new entries coming from Thesoc ’s lexical database. As illustrated in Figure 4 below, the output of the lemmatisation process (#5) shows the following information about each lexical item identified in the text or sentence processed: its lemma, morphosyntactic category, and inflection. Some items may remain unidentified after the lemmatisation process, either because there is no matching entry in database’s dictionary or because there are several potential candidates. Fig. 4 Lemmatisation process 10 When there is no corresponding entries in (Alibert 1966), an asterisked form is proposed in Alibert’s writing system, known as “graphie alibertine”. 0 Tools for linguistic (anejo ASJU) 112 14/7/10 09:16:11 THE THESAURUS OCCITAN: A MULTIMEDIA DATABASE DEDICATED TO OCCITAN... 113 As can be seen in Figure 4, unknown items are tagged “Inconnu” (as is the case of proper noun “Gubèrt”) whereas ambiguous items are tagged “Indéterminé”: the graphical form “e” can match either with an inflected form of the verb èstre “to be” (3rd Pers. Sing.) or with the conjunction coordination “and”; similarly, “a” can match either the definite determiner (Fem. Sing.) or an inflected form or the verb aver “to have” (3rd Pers. Sing.). When a lexical item was not correctly identified or not identified at all by the lemmatisation process, the two buttons labelled #6 in Figure 4 allows to manually identify this lexical item. In the case of ambiguous item, with several potential matching candidates in database’s dictionary, the user can choose the right entry within a list of these potential candidates as shown by #7 in Figure 5. If the unidentified lexical item has no corresponding entry in the dictionary, it’s also possible to add a new entry in the dictionary (#8) and to identify the lexical item with this new entry (these two points are realized at a glance, as one single operation). This unique user interface also shows the unidentified lexical item in context (bottom of Figure 5) to ease its manual identification by the user. Fig. 5 Annotation of ambiguous items After lemmatisation process has been performed, the next step is syntactical-tree tagging, eased by MMS’ syntactical parser. It is not necessary here to lemmatise each single lexical item of a sentence or a text before using MMS’ syntactical parser: if 80% or 90% of the lexical items have been correctly lemmatised, this is enough to switch to next step, as will be explained below. 0 Tools for linguistic (anejo ASJU) 113 14/7/10 09:16:11 114 PIERRE-AURÉLIEN GEORGES 2.3. Syntactical tree tagging MMS’ syntactical parser uses both data generated by the lemmatisation process and the embedded dictionary of the database in order to propose one or several possible syntactic structures for each sentence,11 thus automating in some proportion the syntacticaltree tagging process. The syntactic trees are generated in a generativist theoretical framework, but the syntactic rules on which the parser relies are user-customizable.12 Figure 6 below gives an example of this process. On the left pane, the results of the lemmatisation process are shown. By clicking on the button #9, the user calls the syntactical parser, which analyses the sentence and output some possible syntactic structures in a list shown just under this button. The user must then browse among these candidates and chose the correct hierarchical syntactic structure(s) that really correspond to this sentence: by clicking on a candidate structure within the list, the hierarchical representation of the selected structure is shown in the right pane.13 Fig. 6 Syntactical structure annotation Wether it’s an isolated, single sentence, or a sentence from a text. The only main constraint is that generated syntactical trees must have binary branching nodes. 13 Please note here that the syntactical tree structure (in the right pane) is displayed vertically instead of the traditional horizontal presentation. This is due to some technical limitations we are currently working on. 11 12 0 Tools for linguistic (anejo ASJU) 114 14/7/10 09:16:12 THE THESAURUS OCCITAN: A MULTIMEDIA DATABASE DEDICATED TO OCCITAN... 115 Since the candidates list is sorted by probability (most probable candidates are located on top of the list, whereas least probable ones are at the bottom), users therefore typically only need to look at the two or three first propositions to find the correct one(s). Thanks to the two buttons under #10, false candidate structures are then eliminated in order to keep only the correct one, or the two or three right ones (when there’s an ambiguity in the sentence that can not be resolved, for example if the author of this sentence has deliberately made a play on words). If the appropriate structure(s) is/are not present in the list generated by the syntactical parser, the user can choose any proposed candidate and edit this structure in order to manually build the correct hierarchical tree(s). The use of MMS’ syntactical parser thus simplifies and speeds-up the syntacticaltree tagging task by providing the user a semi-automated way of generating hierarchical structures. Moreover, it also simplifies and speeds-up the lemmatisation process: Figure 6 shows for example that the unknown proper noun “Gubèrt”, that had been neither correctly identified by the automated lemmatiser nor manually tagged by the user, has nevertheless been correctly identified as a proper noun by the syntactical parser. Similarly, the ambiguous elements “e” and “a” have been successfully identified as respectively an inflected verb and a determiner. Therefore, the syntactical parser can automatically “guess” some lexical items that remained unidentified after the lemmatisation process, and disambiguate some ambiguous lexical items, thus also simplifying the lemmatisation task, since it’s no longer mandatory to tag 100% of all the lexical items of a sentence. After these 10 steps have been performed, annotation process is now complete: each lexical item of each sentence has been lemmatized, and each sentence is tagged with one or several hierarchical syntactic structure(s). Data are then ready to be exploited through different work features offered by the database. 3. Work features 3.1. Search engine Once the sentences and texts from the database have been annotated with the help of these tools mentioned in section 2., MMS provides several search options to select data with different criterions. For example, it’s possible to search all occurrences: — of a given variante; — of all variantes associated to a given lemma; — of all variantes with one or several given morphosyntactic categories; — of a given sequence of part-of-speech categories;14 — of a given location;15 — having a graphical similarity with a given lemma or variante. It’s also possible to perform searches based on syntactical tree tagging previously processed: one can search for all sentences containing a particular syntactic struc14 15 Some wildcard options are available, such as “Beginning of sentence only” / “End of sentence only”. This is, all occurrences of all texts associated to a particular given location. 0 Tools for linguistic (anejo ASJU) 115 14/7/10 09:16:12 116 PIERRE-AURÉLIEN GEORGES ture or syntactical tree fragment, such as all sentences containing a DP within another DP for example. When an ambiguous sentence has several syntactic structures associated to it, such a search query will show this sentence in the search results if it matches any of its associated structures. As illustrated in Figure 7, the results are shown in context: the list of matching occurrences is presented in column, and each matching occurrence is displayed within the full original sentence where it is coming from. 3.2. Automated work corpus generation Whichever search query is formulated, a work corpus can then be automatically generated from these search results.16 As presented in Figure 7, it is possible, for example, to search for all occurrences with morphosyntactic category «personal pronoun», then to select some particularly interesting ones in the results list displayed,17 and to automatically generate a work corpus in which these selected occurrences are highlighted in their original context: text title,18 location, full sentence or full text where the term occurs. In this example, we chose to generate a “concise” work corpus, i.e. only matching sentences are being extracted and displayed gathered, but it is also possible to generate a “full” work corpus, which gathers both matching isolated sentences and the full texts that contains at least one sentence containing one occurrence matching the search query. In the same way, one can generate a work corpus from search results based on syntactical tree tagging previously processed. These work corpuses can be exported and saved as RTF or Microsoft Word format for further exploitation. 3.3. User-customizable taggings In 2009, a new feature has been added to MMS: users now have the ability to attribute some user-defined “tags” to some sentences of the database.19 Then, it’s possible to perform cross searches based on these tags within the database. This allows, among other things, to search for presence or absence of a correlation between two or several linguistic parameters, confirming or invalidating user’s hypotheses.20 It’s possible, for example, to search within the database for texts that contain at least one sentence with a given tag but do not contain any sentences with another given tag (evaluating a hypothetical positive correlation between two parameters within the same speaker, therefore the same idiolect). Another use would be to search within the database for locations for which there is at least one sentence with 16 Except if the search query concerns all occurrences of a given location, because this would generate a full list of all texts and isolated sentences coming from this location, and thus would have merely no interest here since it is already possible to get this information from elsewhere in the database. 17 Of course, it’s also possible to select all occurrences from the results list in order to generate an exhaustive work corpus from the database. 18 If the sentence comes from a text. 19 Whether isolated sentences or sentences from a text. 20 For some interesting applications of this feature and concrete cases, cf. (Georges, not yet published, b). 0 Tools for linguistic (anejo ASJU) 116 14/7/10 09:16:13 THE THESAURUS OCCITAN: A MULTIMEDIA DATABASE DEDICATED TO OCCITAN... 117 Fig. 7 Search by syntactic category and work corpus generation features a given parameter (no lexically realised subject pronoun, for example) and also at least one sentence with another given parameter in the same location (evaluating an hypothetical negative correlation between two parameters within a dialect). In a generative theoretical framework, one could for example use this feature to try to check if (Rizzi 1982)’s correlation between pro drop and free inversion of sub- 0 Tools for linguistic (anejo ASJU) 117 14/7/10 09:16:13 118 PIERRE-AURÉLIEN GEORGES ject and verb can be verified (or invalidated) on the field by dialectal data, such as Occitan dialects.21 Conclusion The MMS module of the Thesoc contains flexible tools and features that can be used of several maners and under different theoretical frameworks. Therefore it can fully play its role of a computer tool aiding syntactical and morphosyntactical research on Occitan dialects on a comparative basis. Currently, we are working on adding cartographic functionalities to MMS, that would fosters the use of MMS for diatopic variation study, giving the users the opportunity to visualize variation of some syntactic or morphosyntactic features among the occitan domain and the possibility to generate maps on demand to illustrate and to support their hypothesis. Aside from the diatopic perspective, the data entries in the database also contain a “date” field to indicate for each text and each single sentence, at which date they have been collected on the field. So, if enough data is available, it would be theoretically possible to study variation also from a diachronic point of view. References Alibert, L., 1966, Dictionnaire occitan-français d’après les parlers languedociens, Institut d’Etudes Occitanes, Toulouse. Dalbera, J.-P., 1994, Les parlers des Alpes-Maritimes, étude comparative, essai de reconstruction, Association Internationale d’Etudes Occitanes. Eynaudi, J., 1932, Dictionnaire de la langue niçoise, Imprimerie de «l’Eclaireur de Nice», Nice. Georges, P.-A., 2009, «Présentation de la base Textes associée au thesoc», in Actes du colloque La dialectologie hier et aujourd’hui (1906-2006), Lyon, 2006, B. Horiot (ed.), pp. 81-94. — (not yet published), a, «Le thesoc: bases de données et outil s d’analyse consacrés à l’étude des dialectes occitans», in Actes du colloque Bases de données, Méthodes, Modèles de description: de nouvelles perspectives pour la recherche sur les langues régionales et minoritaires?, Tübingen, 2008, Stauffenburg Verlag (DeLingulis). — (not yet published), b, «Les chaînes de clitiques: l‘outil informatique au service de l‘analyse comparative», in Actes du colloque Mémoires du terrain: enquêtes, matériaux, traitement des données (Lyon, 2009). Mistral, F., 1979, Lou Tresor Dóu Felibrige ou dictionnaire provençal-français, Culture Provençale et Méridionale, Raphèle-lès-Arles. Nauton, P., 1957-1963, Atlas Linguistique et ethnographique du Massif Central, Editions du CNRS, Paris. Oliviéri, M., 2003, Constitution d’une base de textes occitans, 36ème colloque de la Societas Linguistica Europaea: Linguistique et Corpus, Lyon, 4-7 septembre 2003. Rizzi, L., 1982, Issues in Italian Syntax, Foris, Dordrecht. Séguy, J., 1973, «Les atlas linguistiques de la France par régions», Langue Française 18/1, 6590. 21 Dialects from the northern part of Occitania show overt pronoun subjects, whereas the majority of the Occitan dialects are pro drop. Comparing these Occitan dialects, which are all closely related, within a microvariational approach, could therefore shed new light on this hypothetic correlation. 0 Tools for linguistic (anejo ASJU) 118 14/7/10 09:16:14
© Copyright 2026 Paperzz