Presentation - Linguistics Prague

1 Documentary linguistics
Compilation and exploitation of corpora
of under-researched languages
Produce and archive documentations of endangered
languages
Ulrike Mosel, ISFAS, Universiät Kiel
• that provide primary data not only for linguistics, but also
for other disciplines of the humanities and social sciences;
• that can be understood without prior knowledge of the
documented language;
• that is accepted by the speech community and can be
used for language maintenance and revitalisation.
linguis[ic]s Prague
27.05.2016
Develop and test new methods of researching, processing
and archiving linguistic and cultural data.
2
1.1 Components of a language documentation
ANNOTATED CORPUS
OF RECORDINGS
audio/video recordings
transcriptions
translations
glossing
comments on form & content
SKETCH GRAMMAR
phonology, orthogaphy
parts of speech
grammatical categories
examples with references
typological profile
LEXICAL DATABASE
head word
part of speech
definition
collocations, idioms
examples with references
illustrations
1.2 Corpora in language documentations
Language
Texts
Size
Compilation
Corpus
builder
INTRODUCTION
Language & speakers
methods
abbreviations
Purpose
3
typical corpora of
European languages
well-researched
digitalised printed
millions of words
selection of existing
texts
team of professional
native speakers
lexicography,
linguistic research
Language documentation
corpora
under-researched
recorded, transcribed
translated
much below one million
production of texts during
fieldwork
linguists assisted by nonprofessional nativespeakers
conservation of cultural
and linguistic heritage,
research
4
1
1.3 The linguists' and the speech community's corpus
linguists
speech community
kind of
language
spontaneous language;
variety of genres and registers
”good” language;
content
“as many an as varied records as
practically feasible”
(Himmelmann 2006)
important genres;
educational materials
format/
media
1.4 The Teop language corpus
Bougainville,
Papua New Guinea
Austronesian,
Oceanic,
6000 speakers
Research on Teop
since 1994
electronic corpus audio and video printed materials
recordings with transcription and
translation, rich annotation,
orthography based on linguistic principles
(phonological)
orthography similar to that
of a/the dominant language
lexicon
(encyclopedic) dictionary
on paper
electronic lexical database
6
5
2 What are corpora?
2.2 Corpus typology
2.1 Basic concepts of corpus linguistics
text
“any artefact containing language usage
(book ..., t-shirt slogan, .... speech, conversation)”
(McEnery & Hardie 2012:250)
corpus
1. Monitor corpus / dynamic corpus
“grows in size over time ... contains a variety of materials”
(McEnery & Hardie 2012:6)
2. Sample text corpus designed in order to be representative
of a particular language variety within a specified sampling
frame (McEnery & Hardie 2012:8, 250)
3. Opportunistic corpora
“represent nothing more or less than the data
that it was possible to gather for a specific task.”
“collection of sampled texts, written or spoken,
in machine readable form
(McEnery & Hardie 2012:11)
which may be annotated with various forms
4. Parallel corpora
“Parallel corpora consist of a souce text and its translation
into one or more languages.”
of linguistic information” (McEnery et al. 2006:4)
7
(Aimer, Karin. 2008:276)
8
2
5. Contrastive corpora
Two corpora or subcorpora that represent two registers, genres
or other varieties of the same language.
(Tognini Bonelli 2010:21-22)
7. Multimedia corpora
have transcripts that are aligned with audio or video recordings.
(Lee 2010:114)
6. Artificial corpora
• “They are constructed with whatever data may be accessible
at the lowest cost,
and essentially regardless of the documents’ content ...
• the material has no social or cultural rationale for being
collected ...
• ad-hoc respositories of language materials.” (Ostler 2008)
8. Multimodal corpora
are multimedia corpora that contain
“digitised collection of language and
communication-related material,
drawing on more than one modality ...
accompanied by transcriptions and annotations or codings
based on the material.“
(Allwood:2008:208)
for example
Recordings of stimulus-based elicitation
Modalities: speech, eye & head movements, body postures,
gestures, facial expressions ... (Wittenburg 2008:664)
2.3 Classification of LD corpora
2.4 Genres and registers
(Biber & Conrad 2009. Register, genre, and style. CUP, Ch. 1 & 2)
Monitor corpus
as long as the corpus is growing
Sample text corpus
-
Opportunistic corpus
all, because the collection of texts is done
during field reaserch
Parallel
unidirectional, e.g. Teop texts with English
translation
Contrastive
texts about the same topics, but produced in
diffferent registers or genres
Artificial
Elicitations - “no social or cultural rationale“
Multimedia
audio/video recordings with transcriptions
Multimodal
audio/video recordings with transcriptions
(and codings for non-verbal comunication)
Different types of texts show
different text structures and
different features of linguistic form,
i.e. phonological, lexical and grammatical features.
Linguistic variation is systematic
Selection of linguistic features depends on non-linguistic factors.
Two kinds of classifying texts types:
Genres:
by structural features:
Registers
by the pervasive use of certain phonological lexical and
grammatical features in particular speech situations.
12
3
2.4.2 Register analysis
2.4.1 Genre
Dear Sir,
....
With kind regards,
Yours sincerely
John Smith
Once upon a time ...
...
lived happily ever after
Register
Type of text
that occurs in certain speech situations and shows
certain phonological, grammatical and lexical features
throughout any text of this type
with a significantly higher frequency as in other text types.
1. Identify the situational characteristics of the texts.
Genre
Type of text
that is used in particular social contexts
and has a particular structure
which is indicated by certain formal features
such as speech formulas at the beginning or end of the text.
2. Identify typical (pervasive) linguistic features.
3.Interprete the relationship between
situational characteristics and pervasive linguistic features.
13
Biber, Douglas 2006. University language.
Situational characteristics of text varieties
(abbrev. from Biber & Conrad 2009:40)
Participants and their social characteristics
Mode (speech /writing), Medium (taped, radio, handwritten ...)
Production circumstances (real time, planned, edited, ...)
Setting (private / public; ...)
Communicative purposes (narrate, report, describe, ...)
Topic
The situational chatacteristics should be documented in
the metadata of each text.
Biber 2006: 48: Content word classes across registers
16
4
3. Content and structure of corpora of
under-researched languages
3.1 Awetí
3.3 Saliba / Logea - Oceanic , Papua New Guinea
3.2 Beaver Archive
Athabaskan, Canada
DoBeS – Project
4 Structure and content of the Teop Language Corpus
(on my computer)
4 Genres:
1. legends
2. personal narratives
3. descriptions
4. unconnected example
sentences
3 Modes
1. sponateously spoken (R)
2. edited transcriptions (E)
3. written texts (W)
contrastive subcorpora: 01-02, 04-05, 07-08
5
4.1 Different modes: spontaneous speech vs. planned writing
Changes in edited legends
1. original recordings with transcriptions
2. edited versions of the transcription with recording readings
Elaboration: addition of linguistic units
words, phrases, clauses
Linkage:
The contrastive subcorpora
paratactic constructions >
> complex sentences
1. show alternative ways of expressing the same content
Compression: more information in a single linguistic unit
2. provide a new type of data for research on
what speakers actually do
Decompression: complex sentences > paratactic constructions
when they put an oral text into writing
(Mosel 2015)
(Mosel 2015)
21
22
Create contrastive narrative and procedural texts
about butchering a chicken
4.2 Narratives vs procedural texts
Narratives
Procedural texts
Paratactic clauses
Coordinate clauses
Sequence of past events
Adverbial clause constructions:
‘when ..., then...’
Regular fixed order of actions
> create a corpus of
contrastive narrative and procedural texts
minimise variables
Choose the very same topic!
23
6
5 Types of data
_____ corpus content _______ corpus exploitation
Make series of photographs and use them as stimuli for
1. the description of how to butcher a chicken
2. the narrative of how the twins helped
their father butchering a chicken told by their mother
procedural text:
narrative text:
40 clauses,
53 clauses,
12 adverbial clause constr.
no adverbial clauses
13 paratactic clauses
25
5.1 Types of data in the Teop Language Corpus
Spoken language
Written languge
raw data
audio recordings
manuscripts
by native speakers
primary data
by native speakers
transcriptions
by native speakers
edited versions of
manuscripts
edited versions of
transcriptions
primary data
by linguists
transcriptions
by a phonetician
translations
5.2 Types of data in ELAN files
Minimal annotation in ELAN
wav file
utterance ID
translations
transcription
translation
structural data
morphological
segmentation and glossing
28
7
Syntactic annotation
1) grammatical units (Erfurt Referentiality Project)
2) glossing
3) GRAID (Grammatical Relations and Animacy In
Discourse, Haig & Schnell)
Phonetic and morphlogical annotation
legend
duration ca. 5 minutes
1.
2.
3.
4.
respect self-correction
mark clause boundaries by #
insert ZERO for zero anaphora, gloss it
annotate argument relations (S/A/P) in GRAID
utterance ID
phonetic transcription
orthography
free translation
morph. segmentation
glosses
The phonetic transcrption done by a phonetician took several weeks.
30
6 Corpus analysis
6.1 Negated VCs – single layer search
a) Distribution of particular word forms and phrases
Which words occur in negated VCs?
In which positions is form X used?
discontinuous negative morpheme: saka/sa ______ haa
b) “Hospitality of constructions“
task:
search for all words that enter the empty slot
between saka or sa and the second component haa
which may have clitics haa=na, haa=ra, haa=ri
Which forms are accommodated in position X?
certain positions in constructions accommodate
a greater variety of
grammatical and semantic lexeme classes than others
regular expression:
(\bsa(ka)?\b .* \bhaa
some even neutralise word class distinctions
(Hopper & Thompson 1984)
31
32
8
Search for the negative contruction frame
\bsa(ka)?\b .* \bhaa
\bsa(ka)?\b .* \bhaa
listen
bird
person
good
finished
person
stay
910 annotations
NEG HEAD (ADV) NEG (=IPFV)
33
34
noun/verb distinction (cont.)
6.2 Noun/verb distinction
(a complex single layer search)
Workflow (cont.)
Workflow
1. Identify the elements constituting the constructional frames1 of
NPs (referential phrases)
VCes (TAM marked predicates):
A constructional frame consists of functional morphs
and empty, syntactically defined head and modifier positions
for content words or stems.
2. Identify the elements that directly precede or follow the
empty head position for the content word
NP: ART (QUANT ADJ etc.) HEAD
VC: (NEG) TAM (ADV) etc. HEAD
3. Construct regular expressions for the head position
4. select a few prototypical frequent action and object words
e.g. ‘do, make‘, ‘say‘, ‘person‘, .‘thing‘, and search
1) cf. the notions of collocational frame works, grammar patterns and
colligates in Stefanowitsch & Gries 2009: 936-937
5. Repeat the procedure with modifier positions
35
36
9
Constructional frames for corpus search in Teop
Distribution of prototypical action words
VC with paku ‘do’ as head:
(\bare\b|\bbe\b|\bkahi\b|\b|\bmepaa\b|\bme\b|\bna\b|
\bore\b|\borepaa\b|
\bpaa\b|\bpahin\b|\bpasi\b|\bpate\b|
\bre\b|\brepaa\b|\bto\b|\btoro\b)\b \bpaku\b
NP with taba ‘thing’ as head:
(\ba\b|\bbona\b|\bo\b|\bbono\b|\bamaa\b|\bmaa\b|\bsi\b|
\bbua\b|\bbuo\b) \btaba\b
VC head
NP head
paku
'do, make'
725
10
asun
'hit, kill'
56
1
mosi
'cut'
70
9
sue
'say'
796
10
nao
'go'
820
5
pita
'walk'
69
3
hua
'paddle'
122
4
Replace paku and taba by the other selected content words
37
Results:
1. action and object words are flexible,
2. action words much more frequent in VC head position
3. object words much more frequent in NP head position
Distribution of prototypical person/thing words
VC head
NP head
aba
'person'
5
243
taba
'thing'
4
461
moon
'woman'
7
665
otei
'man'
-
478
beiko
'child'
1
366
iana
'fish'
-
176
naono
'tree'
-
136
vasu
'stone'
-
48
38
Further research shows that
action and object words form distinct word classes,
i.e. nouns and verbs:
nouns are modified by adjectives, never by deadjectival adverbs.
verbs are modified by adverbs, never by adjectives.
This questions all studies on flexible word classes which do not
consider the modification of words.
39
40
10
Simple multiple layer search
for “comparatives“
6.3. Comparison
(a simple mutiple layer search)
How is comparison expressed in Teop?
'X is bigger/smaller than Y'
There is not morphological comparative in Teop.
Search for the Teop adjectives ‘big‘ and and ‘small‘
and examine more than 1000 tokens?
Search for than on the translation tier.
Search for .* on the transcription tier
41
42
Comparative constructions
Simple multiple layer search for
comparatives
The inalieanble possessive comparative construction
transcription tier: “wild card“
translation tier: “than“
A babarii
a rutaa n=
a barii ....
the young drummer the small 3SG.POSS= the drummer
The exceed comparative construction
(The babarii is the small of the adult barii.)
(The drummer is big exceeding the tang.)
43
44
11
Comparative constructions
7 Corpus compilation and grammatical analysis
The alieanble possessive comparative construction
Focus on a few registers/genres.
The more diversified the corpus is,
the smaller are the subcorpora, and
the smaller the probability that you can
adequately identify regular patterns of language usuage.
Eve a beera te =a kavara ri=
o
goroto
vai
it the big
of =the whole 3PL.POSS=the.PL turtle
this
‘It is the big of the whole of the turtles‘
Recommendations for corpus compilation:
Use ELAN or a similar tool with an implemented powerful query
language.
Extended annotation is very time consuming.
Document your annotation rules and your search methods.
Make your corpus and the metadata accessible.
Aim at scientific research that is replicable and falsifiable.
Possessive comparative constructions are not mentioned in
Stassen 1985, 2013 (WALS)
45
46
12