Corpus-based Research into Sociolinguistic and Dialectal Variation in Slavic Languages
Second Network Meeting: Technical and Linguistic Aspects
Freiburg, 13-14 October 2016

Björn Wiemer (JGU Mainz)

TriMCo: linguistic features

Structure:
I. TriMCo: goals and archive
II. Clusters and continua in diatopic and diastratic variation of East Slavic: preparing corpora for integrated queries (submitted)
III. Features and productivity: problems related to annotation

I. TriMCo: goals and archive

TriMCo (“Triangulation Approach for Modelling Convergence with a High Zoom-In Factor”)
http://www.trimco.uni-mainz.de/

1. Triangulation of reasons leading to convergence
• genealogical inheritance
• areal affiliation (contact)
• universal tendencies

2. Profile of the BSCZ (Baltic-Slavic Contact Zone) against the background of typologically significant areal clines (“matrёški”)

For this purpose we employ tools and achievements of typology, areal and contact linguistics, historical-comparative linguistics, dialect geography and computational methods.

Open question: Can methods used in “macro-areal typology” be applied to micro-areas?
In any case, triangulation in small-scale areas requires a sufficiently fine-grained level of linguistic analysis for sufficiently well-attested phenomena of convergence.
→ delimitation and definition of features (see below)

The Baltic-Slavic Contact Zone (BSCZ)

Recordings from the TriMCo sound archive transcribed in ELAN (present stage)

dialects (affiliation) + region       sum of hours in ELAN   word forms   number of informants
Belarusian in Belarus                 25:40                  162,989      79
Belarusian in Lithuania               13:00                  85,582       59
Latgalian (Lettgalian), Latvia        13:10                  94,107       22
Belarusian in Latvia (Latgalia)       03:00                  18,979       4
Lithuanian (Lithuania)                21:00                  104,481      86
Russian, Pskov region                 11:25                  71,164       24
sum                                   77:15                  537,302      274

II. Clusters and continua in diatopic and diastratic variation of East Slavic: preparing corpora for integrated queries (submitted)

1. Linguistic (dialect geographic, variationist) goal:
• diatopic × diastratic variation → comparable corpora, account of разговорная речь (colloquial speech) and other superregional varieties
• recent and ongoing change (on a micro level) in morphosyntax → account of productivity and range of lexical input to grammatical oppositions (or constructions)
• variable analyses of single features and feature aggregates → different levels of granularity of features
• modelling of micro-areal clusters (in the BSCZ) against the background of larger inner-East Slavic clines and against diastratic variation

Electronic corpora to be exploited (in the first place) from the following projects
• TriMCo, Mainz: http://www.trimco.uni-mainz.de/trimco-dialectal-corpus/
• Rusyn (Ruthenian), Freiburg: www.russinisch.uni-freiburg.de/corpus
• Ustya River Basin, Moscow / Zurich: http://www.slavist.de/Pushkino/login.php
• BRMS (“Trasjanka”), Oldenburg: https://www.uni-oldenburg.de/ok-wrgr/

2. Digital Humanities goal:
• creation of a repository of small corpora of dialectal (or otherwise non-standard) speech
• to be located at Mainz University (JGU, Digital Services at Central Library)
• initially focused on East Slavic (and exclusively on the basis of sound files aligned with transcripts)
• to some extent comparable to Edisyn,
• but with an extendable pool of software that makes integrated queries possible across corpora with partially different transcription and annotation systems
• includes clear criteria on access and use (Good Practice), guarantees copyright etc.

III.
Features and productivity: problems related to annotation

Productivity
• a multi-faceted phenomenon; however,
• roughly, it correlates with the type/token ratio of some feature F (or pattern)
• more specifically: the relation between the sum total of lexical stems and grammatical contexts (on word form, constituent or clause level) to which the given feature (or the underlying process) can be applied (types), plus the frequency with which this feature really occurs in discourse (tokens)
• hapax legomena may be a good indicator; problem: not feasible with small corpora
• productive patterns tend to have a “long tail” of infrequent lexemes (Zeldes 2012)
• a related empirical question: does the distribution of productive patterns show a steep difference between a small group of lexemes with high token frequency vs. many other lexemes with low token frequency? Even smaller corpora might show this difference, although they capture only the beginning of the “tail”.

→ focus on frequent phenomena (features, patterns)
(However, less frequent phenomena can also be meaningfully assessed by other statistical methods.)
→ together with annotations we need a database with the inventory of lexical stems (occurring in the corpus) ⊃ morphological segmentation

Productivity of WHAT? In other words: What is a feature?

abstract categories (oppositions)
  level 1: gender
  level 2: masculine, feminine, neuter
formal realizations
  allomorphy, syncretism, complex or non-concatenative exponence
→ standard problems of annotation

Challenge: features may be vague or notoriously ambiguous.
• vagueness and ambiguity should not be “dissolved” in favor of presumably clear categories
• they are the locus of reanalysis and, thus, of change → they must be retrievable as such

examples (from Pskov region)

(1) tut   mál'ec         byl           pr'íslan-a
    here  boy:NOM.SG.M   be:PST:SG.M   send:PP.?
    ‘here the boy was sent’
    (i) neuter, (ii) masculine, (iii) non-agreeing?

compare with

(2) a to býl-a vot vot tak raspúščen-y
    ‘and this was, well like this, released’

(3) n'iktó n'ičevó n'irъbótyl'-i
    ‘nobody worked anything’

(4) v   rad'ít'il'-ix   u   nas       býla       vós'im      s''es't'ór
    at  parents.GEN?    at  1PL.GEN   be:PST.N   eight:NOM   sister:PL.GEN
    ‘with our parents we were eight sisters’

• {ix} – an analogical extension from adjective inflection?
• If yes, is it supported by case homonymy of locative and genitive plural of adjectives and by the /v – u/ alternation (→ conflates the prepositions v (+ LOC) ‘in’ and u (+ GEN) ‘at’)?

Selected features (for single and aggregate analysis) – with different levels of granularity

F1. Morphology

F1.1. Agglutinated (postfixed) reflexive marker:
• allomorphs: -sja / -s’
F1.2. Loss of neuter: proceeds stepwise (and this can influence the decision on annotation)
• only in the plural (cp. Br. [akn-o.sg] – [ókn-y.pl])
• NP-internal: bol’š-ája sel-ó (morphologized as inherent feature of lexemes)
F1.3. Spread of {à} for NOM.PL of masculine nouns
F1.4. Case syncretisms (e.g., DAT/INS.PL)
F1.5. Fate of all relics of decayed (or fused) declensional classes: {u} for GEN.SG.M, {ù} for LOC.SG.M, maybe even {ov’e} for NOM.PL.M (in the southwest part of East Slavic)?
F1.6. Distribution of GEN.PL markers: -of/af, -ej, ∅ (more?).
F1.7. Fate of relics of the dual (e.g., in numeral constructions; see F2.5).
F1.8. Tendency to avoid uninflected nouns (e.g., v kin-e, xodili v teplyx pol’t-ax).
F1.9. Avoidance of suppletive forms, or rather more suppletion than in the standard language?
F1.10. Suffixes (allomorphs?) used to derive imperfective stems (“secondary imperfectivization”).

F2. Syntax

F2.1. Demonstrative pronoun as topic marker (often referred to as a nascent or pseudo-article):
(i) inflected or unchangeable?
(ii) promiscuous attachment (good for clitics) or selective on syntactic class (affix-like)?
(iii) function: emphatic (or contrastive) topic?
F2.2. Phenomena related to anteriority participles (with inherited suffixes -n/t- and -(v)šy/-v): The diachronic relation between both domains is anything but clear, and fine-grained dialectal data (plus a thorough review of extant work on related topics) may offer a chance of understanding (and reconstructing) these relations better.
F2.3. Oblique Actor-phrases: u+GEN, maybe also mere dative. This issue is tightly connected to predicative possession and all relevant alignment patterns (HAVE vs. BE).
• a problem for syntactic, but not for morphological annotation
F2.4. Marking of base of comparison: particle (stand. Russ. čem), genitive (as in stand. Russ.), other (e.g., za + ACC).
F2.5. Numeral constructions: agreement vs. government (or combination of both, as in stand. Russ.)? By analogy: constructions with quantifiers (vse, ljuboj, každyj, etc.; see F3.1).
F2.6. NOM-INS variation for nominal predicates and zero of the copula
• how to mark the zero copula?
F2.7. Distribution of (simple) clitics (e.g., že, ved’):
(a) position in the clause,
(b) only enclitic, or do proclitics occur as well?
(c) is attachment really promiscuous (as expected for “good” clitics) or more restricted to particular syntactic classes (or groups within the latter ones)?

F3. Semantics and discourse pragmatics

F3.1. Distribution of indefinite pronouns: ljuboj, vsjakij, kakoj ugodno (vs. každyj); kakoj-to, kakoj-libo, kakoj-nibud’, koe-kakoj. Also: xot’ čto and similar.
• discontinuous for koe-
F3.2. (a) Use of modal auxiliaries and (b) tendency of certain forms to lexicalize (into epistemic particles or hedge markers).
• (b): can be difficult to distinguish from “ordinary” finite verbs (e.g., može, mus’i) ⊃ syntactic analysis on clause level (maybe even on a higher level)
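
The notions of productivity from section III (types, tokens, hapax legomena, a stem inventory) and the requirement that ambiguous forms remain retrievable can be illustrated with a minimal sketch. It is not the TriMCo annotation scheme or software. The names Token, productivity_profile and ambiguous are hypothetical, and the sample stems are invented toy data. Only the three candidate readings of pr'íslan-a are taken from example (1); assigning that token to F2.2 is likewise an assumption made for the illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word form (hypothetical structure, not the TriMCo schema)."""
    stem: str        # lexical stem, as listed in the stem inventory
    feature: str     # feature label, e.g. "F1.3"
    readings: tuple  # candidate analyses; more than one means the form is ambiguous


def productivity_profile(tokens, feature):
    """Count types (distinct stems) and tokens for one feature."""
    stems = Counter(t.stem for t in tokens if t.feature == feature)
    n_tokens = sum(stems.values())
    n_types = len(stems)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        # stems attested exactly once with this feature
        "hapaxes": sorted(s for s, c in stems.items() if c == 1),
        # stems by descending frequency: a few frequent heads, then the "tail"
        "ranked": stems.most_common(),
    }


def ambiguous(tokens):
    """Return tokens whose ambiguity was kept in the annotation."""
    return [t for t in tokens if len(t.readings) > 1]


# Invented toy data; only the reading set of the last token follows example (1).
sample = [
    Token("dom", "F1.3", ("NOM.PL.M",)),
    Token("dom", "F1.3", ("NOM.PL.M",)),
    Token("dom", "F1.3", ("NOM.PL.M",)),
    Token("gorod", "F1.3", ("NOM.PL.M",)),
    Token("traktor", "F1.3", ("NOM.PL.M",)),
    Token("prislan", "F2.2", ("neuter", "masculine", "non-agreeing")),
]

profile = productivity_profile(sample, "F1.3")
print(profile["types"], profile["tokens"], profile["hapaxes"])
print([t.stem for t in ambiguous(sample)])
```

Storing all candidate readings on the token, rather than one resolved value, is one way to keep vague or ambiguous forms queryable as such. With a corpus this small the ratio and the hapax list are only indicative, as noted above.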