TriMCo - ParaSol - A Parallel Corpus of Slavic and Other Languages

Corpus-based Research into Sociolinguistic and Dialectal Variation in Slavic Languages
Second Network Meeting: Technical and Linguistic Aspects
Freiburg, 13-14 October 2016
Björn Wiemer (JGU Mainz)
TriMCo: linguistic features
Structure:
I. TriMCo: goals and archive
II. Clusters and continua in diatopic and diastratic variation of East Slavic: preparing corpora for integrated
queries (submitted)
III. Features and productivity: problems related to annotation
I. TriMCo: goals and archive
TriMCo
(„Triangulation Approach for Modelling Convergence with a High Zoom-In Factor“)
http://www.trimco.uni-mainz.de/
1. Triangulation of reasons leading to convergence
• genealogical inheritance
• areal affiliation (contact)
• universal tendencies
2. Profile of the BSCZ (Baltic-Slavic Contact Zone)
on the background of typologically significant areal clines (“matrёški“)
For this purpose we employ tools and achievements of typology, areal and contact
linguistics, historical-comparative linguistics, dialect geography and computational
methods.
Open question:
Can methods used in “macro-areal typology” be applied to micro-areas?
Anyway, for triangulation in small-scale areas we need a sufficiently fine-grained
level of linguistic analysis for sufficiently well-attested phenomena of convergence.
--> delimitation and definition of features (see below)
The Baltic-Slavic Contact Zone (BSCZ)
Recordings from the TriMCo sound archive transcribed in ELAN (present stage)
dialects (affiliation) + region      sum of hours in ELAN   word forms   number of informants
Belarusian in Belarus                25:40                  162,989      79
Belarusian in Lithuania              13:00                  85,582       59
Latgalian (Lettgalian), Latvia       13:10                  94,107       22
Belarusian in Latvia (Letgalia)      03:00                  18,979       4
Lithuanian (Lithuania)               21:00                  104,481      86
Russian, Pskov region                11:25                  71,164       24
total                                77:15                  537,302      274
II. Clusters and continua in diatopic and diastratic variation of East Slavic:
preparing corpora for integrated queries (submitted)
1. Linguistic (dialect geographic, variationist) goal:
• diatopic × diastratic variation
→ comparable corpora, account of разговорная речь (colloquial speech) and other superregional varieties
• recent and ongoing change (on a micro level) in morphosyntax
→ account of productivity and range of lexical input to grammatical oppositions (or constructions)
• variable analyses of single features and feature aggregates
→ different levels of granularity of features
• modelling of micro-areal clusters (in the BSCZ) against the background of larger inner-East Slavic clines and
against diastratic variation
Electronic corpora to be exploited (in the first place) from the following projects:
• TriMCo, Mainz: http://www.trimco.uni-mainz.de/trimco-dialectal-corpus/
• Rusyn (Ruthenian), Freiburg: www.russinisch.uni-freiburg.de/corpus
• Ustya River Basin (Moscow / Zurich): http://www.slavist.de/Pushkino/login.php
• BRMS („Trasjanka“), Oldenburg: https://www.uni-oldenburg.de/ok-wrgr/
2. Digital Humanities goal:
• creation of a repository of small corpora of dialectal (or otherwise non-standard) speech
• to be located at Mainz University (JGU, Digital Services at Central Library)
• initially focused on East Slavic (and exclusively on the basis of sound files aligned with
transcripts)
• to some extent comparable to Edisyn,
• but with an extendable pool of software to make possible integrated queries across corpora with
partially different transcription and annotation systems
• includes clear criteria on access and use (Good Practice), guarantees copyright etc.
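A minimal sketch of how integrated queries across corpora with partially different annotation systems might work: each corpus-specific tag is mapped to a shared tag, and the query runs on the shared tags only. The corpus names, tag labels and word forms below are invented for illustration; they are not the annotation schemes of the projects listed above, and the functions `normalise` and `query` are hypothetical.

```python
# Hypothetical tag mappings; neither the corpus keys nor the tags are
# taken from the actual projects.
TAG_MAPS = {
    "corpus_a": {"N.m.sg.nom": "NOUN:NOM.SG.M", "V.pst.n": "VERB:PST.N"},
    "corpus_b": {"S-NomSgM": "NOUN:NOM.SG.M", "V-PastN": "VERB:PST.N"},
}

def normalise(corpus, token):
    """Return a copy of `token` with its tag mapped to the shared tag set.

    Unmapped tags are kept under 'tag_orig' and the shared tag is None,
    so that nothing is silently lost.
    """
    shared = TAG_MAPS[corpus].get(token["tag"])
    return {"corpus": corpus, "form": token["form"],
            "tag_orig": token["tag"], "tag": shared}

def query(corpora, shared_tag):
    """Yield all tokens, from any corpus, that carry `shared_tag`."""
    for name, tokens in corpora.items():
        for tok in tokens:
            norm = normalise(name, tok)
            if norm["tag"] == shared_tag:
                yield norm

# Invented toy data
corpora = {
    "corpus_a": [{"form": "byla", "tag": "V.pst.n"},
                 {"form": "malec", "tag": "N.m.sg.nom"}],
    "corpus_b": [{"form": "bylo", "tag": "V-PastN"},
                 {"form": "xxx", "tag": "UNKNOWN"}],
}
hits = list(query(corpora, "VERB:PST.N"))
```

Keeping the original tag next to the shared one means that a query result can always be traced back to the annotation system of its source corpus.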
III. Features and productivity: problems related to annotation
Productivity
• a multi-faceted phenomenon, however,
• roughly, correlates with type/token ratio of some feature F (or pattern)
• more specifically: relation between the sum total of lexical stems and grammatical contexts (on word form,
constituent or clause level) to which the given feature (or the underlying process) can be applied (types),
plus the frequency with which this feature really occurs in discourse (tokens)
• hapax legomena may be a good indicator; problem: not feasible with small corpora
• productive patterns tend to have a “long tail“ of infrequent lexemes (Zeldes 2012)
• a related empirical question: does the distribution of productive patterns show a steep difference between
a small group of lexemes with high token-frequency vs. many other lexemes with low token-frequency?
Even smaller corpora might show this difference even when they capture only the beginning of the „tail“.
→ focus on frequent phenomena (features, patterns)
(However, less frequent phenomena can also be meaningfully assessed by other statistical methods.)
→ together with annotations we need a database with the inventory of lexical stems (occurring in the
corpus) ⊃ morphological segmentation
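The measures mentioned above (type/token ratio, hapax legomena, a frequency ranking with a steep head and a long tail) can be sketched as follows, assuming that the stem inventory already gives one lexical stem per attested token of a feature F. The function name `productivity_profile` and the sample stems are invented for illustration; this is a rough sketch, not a full productivity model.

```python
from collections import Counter

def productivity_profile(stems):
    """Summarise the lexical input to one feature F.

    `stems` is a list with one lexical stem per attested token of F.
    """
    freq = Counter(stems)                 # token frequency per stem (type)
    tokens = sum(freq.values())
    types = len(freq)
    hapaxes = sum(1 for n in freq.values() if n == 1)
    return {
        "tokens": tokens,
        "types": types,
        "type_token_ratio": types / tokens if tokens else 0.0,
        "hapax_share": hapaxes / tokens if tokens else 0.0,
        # stems ranked by token frequency, most frequent first
        "ranked": freq.most_common(),
    }

# Invented toy data: stems attested with a hypothetical feature F
sample = ["myt", "myt", "myt", "myt", "brit", "brit", "katat", "smejat"]
profile = productivity_profile(sample)
```

With small corpora the hapax share is unstable, so the ranked list is probably the more useful output: it shows whether a few stems account for most tokens while many others occur only rarely.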
Productivity of WHAT?
In other words: What is a feature?
abstract categories (oppositions)
level 1: gender
level 2: masculine, feminine, neuter
formal realizations
allomorphy, syncretism, complex or non-concatenative exponence
→ standard problems of annotation
Challenge: features may be vague or notoriously ambiguous.
• vagueness and ambiguity should not be „dissolved“ in favor of presumably clear categories
• they are the locus of reanalysis and, thus, of change
→ they must be retrievable as such
examples (from Pskov region)
(1) tut   mál'ec        byl          pr'íslan-a
    here  boy:NOM.SG.M  be:PST:SG.M  send:PP.?
    ‘here the boy was sent‘
    (i) neuter, (ii) masculine, (iii) non-agreeing?
compare with
(2) a to býl-a vot vot tak raspúščen-y
‘and this was, well like this, released‘
(3) n‘iktó n‘ičevó n‘irъbótyl‘-i
‘nobody worked anything’
(4) v   rad'ít'il'-ix  u   nas      býla      vós'im     s''es't'ór
    at  parents.GEN?   at  1PL.GEN  be:PST.N  eight:NOM  sister:PL.GEN
    ‘with our parents we were eight sisters‘
• {ix} – an analogical extension from adjective inflection?
• If yes, supported by case homonymy of locative and genitive plural of adjectives and
by /v – u/ alternation (--> conflates prepositions v (+ LOC) ‘in‘ and u (+ GEN) ‘at‘)?
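One way to keep vague or ambiguous features retrievable as such is to store every admissible reading of a token instead of forcing a single tag. The sketch below uses example (1); the class `Annotation`, the function `find` and the tag labels (`PP.N`, `PP.M`, `PP.non-agreeing`) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    form: str
    # every analysis the annotator considers possible; more than one
    # entry means the token is ambiguous or vague
    readings: list = field(default_factory=list)

    @property
    def ambiguous(self):
        return len(self.readings) > 1

def find(tokens, reading=None, only_ambiguous=False):
    """Return tokens that allow `reading` (if given), optionally only ambiguous ones."""
    return [t for t in tokens
            if (reading is None or reading in t.readings)
            and (t.ambiguous or not only_ambiguous)]

# Example (1): the participle ending -a allows three readings
tokens = [
    Annotation("mál'ec", ["NOM.SG.M"]),
    Annotation("byl", ["PST.SG.M"]),
    Annotation("pr'íslan-a", ["PP.N", "PP.M", "PP.non-agreeing"]),
]
ambiguous_forms = [t.form for t in find(tokens, only_ambiguous=True)]
```

A query for one reading (e.g. `find(tokens, reading="PP.M")`) still returns the ambiguous token, while `only_ambiguous=True` retrieves exactly the cases that may be loci of reanalysis.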
Selected features (for single and aggregate analysis) –
with different levels of granularity
F1. Morphology
F1.1. Agglutinated (postfixed) reflexive marker:
• allomorphs: -sja / -s’
F1.2. Loss of neuter: proceeds stepwise (and this can influence the decision on annotation)
• only in the plural (cp. Br. [akn-o.sg] – [ókn-y.pl])
• NP-internal: bol‘š-ája sel-ó (morphologized as inherent feature of lexemes)
F1.3. Spread of {à} for NOM.PL of masculine nouns
F1.4. case syncretisms (e.g., DAT/INS.PL)
F1.5. Fate of all relics of decayed (or fused) declensional classes: {u} for GEN.SG.M, {ù} for LOC.SG.M, maybe even
{ov’e} for NOM.PL.M (in the southwest part of East Slavic)?
F1.6. Distribution of GEN.PL markers: -of/af, -ej, ∅ (more ?).
F1.7. Fate of relics of the dual (e.g., in numeral constructions; see F2.5).
F1.8. Tendency to avoid uninflected nouns (e.g., v kin-e, xodili v teplyx pol’t-ax).
F1.9. Avoidance of suppletive forms, or rather more suppletion than in the standard language?
F1.10. Suffixes (allomorphs?) used to derive imperfective stems („secondary imperfectivization“).
F2. Syntax
F2.1. Demonstrative pronoun as topic marker (often referred to as a nascent or pseudo-article):
(i) inflected or unchangeable?
(ii) promiscuous attachment (good for clitics) or selective on syntactic class (affix-like)?
(iii) function: emphatic (or contrastive) topic?
F2.2. Phenomena related to anteriority participles (with inherited suffixes -n/t- and -(v)šy/-v): The diachronic
relation between both domains is anything but clear, and fine-grained dialectal data (plus a thorough review of
extant work on related topics) can even give a chance of understanding (and reconstructing) these relations better.
F2.3. Oblique Actor-phrases: u+GEN, maybe also mere dative. This issue is tightly connected to predicative
possession and all relevant alignment patterns (HAVE vs. BE).
• a problem for syntactic, but not for morphological annotation
F2.4. Marking of base of comparison: particle (stand. Russ. čem), genitive (as in stand. Russ.), other (e.g., za + ACC).
F2.5. Numeral constructions: agreement vs. government (or combination of both, as in stand. Russ.)? By analogy:
constructions with quantifiers (vse, ljuboj, každyj, etc.; see F3.1).
F2.6. NOM-INS-variation for nominal predicates and zero of the copula
• how to mark the zero copula?
F2.7. Distribution of (simple) clitics (e.g., že, ved‘):
(a) position in the clause,
(b) only enclitic, or do proclitics occur as well?,
(c) is attachment really promiscuous (as expected for „good“ clitics) or more restricted to particular syntactic
classes (or groups within the latter ones)?
F3. Semantics and discourse pragmatics
F3.1. Distribution of indefinite pronouns: ljuboj, vsjakij, kakoj ugodno (vs. každyj); kakoj-to, kakoj-libo, kakoj-nibud‘,
koe-kakoj. Also: xot‘ čto and similar.
• discontinuous for koe-
F3.2. (a) Use of modal auxiliaries and (b) tendency of certain forms to lexicalize (into epistemic particles or hedge
markers).
• (b): can be difficult to distinguish from “ordinary” finite verbs (e.g., može, mus‘i)
⊃ syntactic analysis on clause level (maybe even on a higher level)