Corpus Linguistics
DGfS & GLOW Summer School
Micro- and Macrovariation
Stuttgart, 2006
Anke Lüdeling
[email protected]

organisation
• course plan (→ handout)
• all materials will be available online at
  http://www2.huberlin.de/korpling/lehre/SummerSchoolStutt.php/
• mini-projects (in groups): ten-minute presentation plus short written summary
• you can contact me via email any time
Anke Lüdeling, DGfS & GLOW Summer School,
Stuttgart, Aug 2006

corpus linguistics
• is concerned with
  ◦ design
  ◦ processing (architecture and annotation)
  ◦ evaluation
  of corpus data, where a corpus is
  "A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language."
  http://www.ilc.cnr.it/EAGLES96/corpintr/corpintr.html

today
• different kinds of data for different linguistic research questions
• brief history of corpus linguistics

linguistic data
• where do linguists get the data they need to test hypotheses/theories?
• introspection
• psycholinguistic experiments
• neurolinguistic experiments
• questionnaires
• corpora

linguistic data
• the research question and the theoretical framework determine the kind of data that can be used (the mantra ;-)
• in many cases different kinds of data need to be integrated
• methodology and data issues have been discussed a lot in recent years; see http://www.sfb441.unituebingen.de/index-engl.html, several conferences like Linguistic Evidence, QITL, etc., Bod, Hay & Jannedy 2003, Kepser & Reis 2005, …

introspection/intuition
• arm-chair linguistics ;-)
• generative tradition, competence model
• research question: which complex expressions (phrases, sentences, words, …) can be produced by the internal grammar of a native speaker of a language?
• grammaticality judgments, yes/no
• discussion: Schütze 1996, Keller 2001, Featherston 2005

psycholinguistic experiments
• research questions: how are linguistic data stored and accessed?
• storage and processing models, interaction with other cognitive tasks
• elicitation tasks, reaction time experiments, eye tracking, errors, …
→ Lynn Frazier's class

neurolinguistic experiments
• research questions: where are linguistic data stored and accessed?
• storage and processing models, interaction with other cognitive tasks
• EEG, ERP, imaging techniques, …
→ Bornkessel & Schlesewsky's class

questionnaires
• different research questions
  ◦ structuralist/generative fieldwork to find grammatical forms
  ◦ judgment tasks (yes/no, magnitude estimation, …)
  ◦ …
• an experimental technique (issues of representativity, filler items etc.), often used to verify specific hypotheses, often used for quantitative studies

corpora
• research questions: see following slides
• collections of texts
  ◦ nowadays mostly electronic, but see history
  ◦ texts usually produced for independent purposes (authenticity); collection (design) and evaluation methods dependent on the research question

corpora – research questions
• many areas traditionally use corpora
  ◦ historical linguistics
  ◦ sociolinguistics
  ◦ dialectology
  ◦ lexicography
  ◦ language acquisition research
  ◦ (computational linguistics/natural language processing)
• other areas have only recently started using corpora
  ◦ generative/theoretical linguistics

historical linguistics
• (there is no other kind of data ;-)
• question: what was a variety (at least that portion of the language that survived) like at a specific point in time?
  – qualitative and quantitative study of a 'synchronic' corpus (many papers …)
• question: how can language change be described/modelled?
  – comparison of corpus data from different points in time (also short-term/recent change!)
→ Eythorsson's class

historical linguistics – example
• automatic calculation of similarity trees to find out about language relationships between historical varieties of German (Lüdeling 2006)

small case study: Lord's prayer

        AS   AHD  MHD  FNHD  NHD
AS       0    2    3    4     4
AHD      2    0    2    3     3
MHD      3    2    0    2     2
FNHD     4    3    2    0     1
NHD      4    3    2    1     0

experiments
• word lists (Hochmuth 2004)
• syntactic similarity using the TIGER format (Brants et al. 2002)
• visualisation of distances using PHYLIP phylogeny software

word lists
• translated to SAMPA
• manually aligned
• grapheme correspondences, SAMPA correspondences
• several string comparison methods (edit distances)
• phylogenetic methods for clustering
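
The edit distances mentioned in the word-list experiment can be illustrated with the plain Levenshtein distance. The following is a minimal sketch under stated assumptions, not the implementation used in the study: it applies unit costs to raw spellings (the slides mention SAMPA transcriptions and a feature-weighted variant), and the function names `levenshtein` and `normalised_distance` are hypothetical. The example words are taken from the Lord's Prayer lines quoted in these slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalised_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Word pairs from the Old High German, Early New High German and
# New High German versions quoted in the slides (spelling only).
for x, y in [("fater", "vater"), ("himile", "himmel"), ("hymel", "himmel")]:
    print(x, y, levenshtein(x, y), round(normalised_distance(x, y), 2))
```

Normalising by word length keeps long and short words comparable; whether and how the original experiments normalised is not stated in the slides.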

'intuitive' distance
• fader ist usa firio barno, thu bist an them hohen himilo rikie. Old Saxon (AS, 9th c., Heliand)
• fater unser, thu thar bist in himile, Old High German (AHD, 9th c., Tatian)
• Got vater unser, dâ du bist In dem himelrîche gewaltic alles des dir ist, Middle High German (MHD, ca. 1200, Reinmar von Zweter)
• Vnser vater ynn dem hymel. Early New High German (FNHD, 1522, Luther)
• Vater unser, der du bist im Himmel, New High German (NHD)

[Figure: hierarchical clustering, feature-weighted Levenshtein distance]

[Figure: 'intuitive' distances]

syntactic distance
• transformation of TIGER graphs to trees, crossing edges ignored (order of terminal nodes)
• calculation of distances between corresponding sentences using treediff (Shasha, Wang & Zhang)
• sentence distances combined to text distance:
  $d(x, y)^2 = \sum_i d(x_i, y_i)^2$

Old High German and Early New High German
[Figure: syntax trees with node labels S, NP, PP]
  Vnser  vater   Ø    Ø     Ø     ynn  dem  hymel,
  fater  unser,  thu  thar  bist  in   Ø    himile,

'intuitive' distances vs. Lord's prayer

'intuitive' distances
        AS   AHD  MHD  FNHD  NHD
AS       0    2    3    4     4
AHD      2    0    2    3     3
MHD      3    2    0    2     2
FNHD     4    3    2    0     1
NHD      4    3    2    1     0

Lord's prayer
        AS      AHD     MHD     FNHD    NHD
AS       0.00   60.58   91.74   62.95   61.26
AHD     60.58    0.00   72.11   35.19   27.03
MHD     91.74   72.11    0.00   72.98   71.81
FNHD    62.95   35.19   72.98    0.00   25.09
NHD     61.26   27.03   71.81   25.09    0.00

phylogenetic trees
[Figure: trees for the 'intuitive' distances and for the Lord's Prayer]
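
How a tree can be derived from such distance matrices may be sketched with a small agglomerative clustering routine. This is an illustration under stated assumptions only: it uses average linkage (UPGMA-style) as a stand-in, whereas the slides name the PHYLIP package without saying which of its methods was applied. The two matrices are the values shown in the slides; the function name `average_linkage` is hypothetical.

```python
from itertools import combinations

LABELS = ["AS", "AHD", "MHD", "FNHD", "NHD"]

INTUITIVE = [
    [0, 2, 3, 4, 4],
    [2, 0, 2, 3, 3],
    [3, 2, 0, 2, 2],
    [4, 3, 2, 0, 1],
    [4, 3, 2, 1, 0],
]

LORDS_PRAYER = [
    [0.00, 60.58, 91.74, 62.95, 61.26],
    [60.58, 0.00, 72.11, 35.19, 27.03],
    [91.74, 72.11, 0.00, 72.98, 71.81],
    [62.95, 35.19, 72.98, 0.00, 25.09],
    [61.26, 27.03, 71.81, 25.09, 0.00],
]


def average_linkage(labels, matrix):
    """Repeatedly merge the two closest clusters.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of labels.
    """
    index = {label: i for i, label in enumerate(labels)}
    clusters = [(label,) for label in labels]

    def dist(a, b):
        # Mean of all pairwise distances between members of a and b.
        total = sum(matrix[index[x]][index[y]] for x in a for y in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # On ties, the first pair in list order wins.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


for name, matrix in [("intuitive", INTUITIVE), ("Lord's prayer", LORDS_PRAYER)]:
    print(name)
    for a, b, d in average_linkage(LABELS, matrix):
        print("  merge", "+".join(a), "with", "+".join(b), "at", round(d, 2))
```

The 'intuitive' matrix contains ties, so its merge order after the first step depends on the tie-breaking rule; the Lord's Prayer matrix has no ties under this linkage.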

problems
• too little data
• the Lord's Prayer is a translation and formulaic
  ◦ fater unser, thu thar bist in himile, (Old High German)
  ◦ Vater unser, der du bist im Himmel, (New High German)
  vs.
  ◦ Vnser vater ynn dem hymel. (Early New High German)
  ◦ Unser Vater im Himmel

corpus-based methods: limits & opportunities
• corpora (comparability, availability)
• annotation (linguistic levels, tag sets, guidelines, tools)
• similarity measures
• models (trees, nets, ?)
• base the calculation of language relationships on more than just phonology and morphology
• frequencies can be used

sociolinguistics
• question: what characteristics does a given sociolect/register/… have (wrt pronunciation, lexis, syntax, …)?
• question: how do sociolects/registers/… differ?

descriptive grammars
• question/task: description and classification of basic elements and combinations in a language (and perhaps determine frequency information)
• corpus-based grammars

historical linguistics / sociolinguistics / descriptive grammars
• we want:
  ◦ qualitative and quantitative description of a language wrt all linguistic levels
  ◦ perhaps extrapolation of results to a larger body of language (OHG, London Teenager Talk, …)
• we need:
  ◦ reliable corpora with good documentation, reliable annotation with good documentation
  ◦ search and evaluation techniques
  ◦ mathematical models

psycholinguistics
• question/task: find frequency information for words/morphemes/readings etc. in order to design experiments (reaction times are correlated with frequency)
• question/task: find typical contexts for a given word
• we want: large general corpora or corpus-based lexicons with frequency information, good annotation
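
The first task, finding frequency information for words, can be sketched in a few lines. This is a toy illustration under stated assumptions: the tokeniser is deliberately crude (lower-casing plus a regular expression), the two-sentence "corpus" is built from Lord's Prayer lines quoted in these slides, and the names `tokenise` and `frequency_list` are hypothetical. Real experiments would draw on a large annotated corpus or a frequency lexicon.

```python
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def frequency_list(tokens):
    """Return (word, absolute count, count per million tokens), most frequent first."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, count, 1_000_000 * count / total)
            for word, count in counts.most_common()]


toy_corpus = "Vater unser, der du bist im Himmel. Unser Vater im Himmel."
tokens = tokenise(toy_corpus)
for word, count, per_million in frequency_list(tokens):
    print(f"{word:8} {count:3} {per_million:12.1f}")
```

Reporting counts per million tokens makes frequencies comparable across corpora of different sizes.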

lexicography
• question/task: find readings/typical contexts/frequencies of a word, find good examples
• question/task: find collocates to a given word
• we want: large general corpora or special corpora (for special lexicons), good annotation and evaluation methods, mathematical models for collocation detection
• WordSketch (Kilgarriff), Evert 2005

linguistic data
• introspection
  ◦ competence: what is grammatical?
  ◦ (formal) production system that produces all (and only the) grammatical expressions of a language
• experimental data
  ◦ how is language processed?
  ◦ model that describes storage of linguistic units and the way these are addressed and processed, neurological models
• corpus data
  ◦ what occurs?
  ◦ model that describes those linguistic units and combinations and their distribution in a corpus
• qualitative (categorial) vs. qualitative and quantitative (probabilistic)
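
For the lexicography task of finding collocates of a given word, one widely used association measure is pointwise mutual information (PMI). The sketch below is an assumption-laden illustration, not the method of WordSketch or Evert 2005: it counts co-occurrences in a fixed token window, compares the observed count with the count expected if the two words were independent, and ignores window-size scaling. The function name `pmi_scores` and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens, node, window=1):
    """Simplified PMI between `node` and every word within +/- `window` tokens.

    PMI = log2(observed co-occurrences / expected co-occurrences), with
    expected = f(node) * f(word) / N.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Collect the neighbours of this occurrence of the node word.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                pair_freq[tokens[j]] += 1
    scores = {}
    for word, observed in pair_freq.items():
        expected = word_freq[node] * word_freq[word] / n
        scores[word] = math.log2(observed / expected)
    return scores


toy_tokens = "strong tea strong coffee weak tea strong tea".split()
for word, score in sorted(pmi_scores(toy_tokens, "tea").items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{word:8} {score:.3f}")
```

PMI is known to overrate rare words, which is one reason why frequency thresholds or other association measures are often preferred for real corpora.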

history of corpus linguistics
• text collections were already used in the 19th century (and earlier) to
  ◦ describe language change
  ◦ illustrate statements about grammar
  ◦ document language acquisition
  ◦ compile dictionaries
  ◦ compare languages

history of corpus linguistics
"Note: Occasionally an s also appears elsewhere in the compound juncture after a feminine noun without having made its way into the written language, cf. e.g. Gemeindsversammlung Hebel 452, 24, Huldszeichen Heine 2, 111, über Naturs Größe Le. 11, 209, 5, Sprachsverbesserer, Leibniz, Unvorgreifl. Ged. 67,3, Vernunftswahrheiten Le. 12, 434, 32. Further attestations of an s attached to a feminine genitive are: Erdens-Götter Lohenst., Cleop. 2291 ..."
Hermann Paul (1959, vol. V, 13)

history of corpus linguistics
• text collections
  ◦ mostly 'classics', 'high' language
  ◦ for dead languages: everything available
  ◦ (almost) no quantitative studies
  ◦ no 'balance'

structuralism
• synchrony
• spoken language
• American structuralism: Boas, Sapir, Bloomfield, Harris
  ◦ influential at least until the 1960s, terminology and empirical methods still in use
  ◦ corpus-based; corpora often small, systematically collected through elicitation (almost questionnaire studies), no quantitative studies
  ◦ non-European languages (native American languages)

structuralism
• debate: should a grammar describe the collected fragment (the corpus) or is it possible to extrapolate from that to the whole language?
• basic idea: a language is finite; if you collect enough data you could in principle collect everything

generative theory (Chomsky)
• new research goal: people understand and produce infinitely many sentences/complex expressions. It is therefore not interesting to describe a finite sample; the real goal is description of the underlying production system
• competence vs. performance, i-language vs. e-language

generative theory
• a corpus is a collection of performance data which are influenced by all kinds of non-linguistic factors
  → since it is not possible to abstract away from these extra-linguistic factors, a corpus cannot be used to find a competence model
  → introspection is the only way to distinguish between grammatical and ungrammatical utterances

generative theory
• if there are infinitely many possible utterances, any corpus will be skewed
• "Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list." (Chomsky 1962, 159)
• solution: introspection?

grammaticality and corpus data

                             grammatical                        ungrammatical
occurs in a corpus           immer, Wirtschaftskrise,           immmer, letzendlich, unkaputtbar,
                             nach 14 Jahren Kohl, ...           ich habe fertig, ...
does not occur in a corpus   NPs with 27 genitive attributes:
                             das Haus der Großmutter der
                             Schwester des Verwalters der ...

Chomsky on corpus linguistics
• "It doesn't exist" (Chomsky in an interview, answering a question by Bas Aarts "What do you think of corpus linguistics?", 2001)
• "Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite." (Chomsky 1962, 159)

corpus linguistics after the 1950s
• some linguistic areas always used corpora and continued to do so (dialectology, historical linguistics, …)
• computational linguistics and psycholinguistics are interested in machine-readable corpora because they need frequency data
• theoretical linguistics: corpus-based work marginalized
• almost no discussion of empirical questions/standards

early machine-readable corpora
• Roberto Busa: corpus of medieval philosophy texts (project with IBM 1949 – 1967), concordancer
  (this later led to: Thomae Aquinatis Opera Omnia cum hypertextibus in CD-ROM)
• other work on historical texts (Greek Bible and other texts)
• Morton's authorship detection
• Juilland's 'mechanolinguistics'

early machine-readable corpora
"Memories of the early days are all of paper tape. It waved in and out of every machine, it dried and then cracked and split or it got damp when it lay limp and then sagged and stretched. Sometimes it curled round you like a hungry anaconda, at others it lay flat and lifeless and would not wind. Above all it extended to infinity in all directions. A Greek New Testament, half a million characters, ran to a mile of paper tape, and the complete concordance of it ran to seven miles" (Morton 1980, 197).
(cited from http://info.ox.ac.uk/ctitext/history/pioneer.html)

early machine-readable corpora
• English
  ◦ Quirk (1960s): Survey of English Usage
  ◦ Francis & Kucera: Brown Corpus
  ◦ Svartvik (1970s): London-Lund Corpus
  ◦ Leech: Lancaster-Oslo-Bergen Corpus (LOB)
  ◦ Sinclair: COBUILD, Bank of English
• several corpus-linguistic centers in Europe

early machine-readable corpora
• 1 m tokens
  (Francis/Kucera's frequencies still standard for many psycholinguistic experiments)
• development of annotation methods
• development of search methods and concordancers
• discussion of corpus design (sampling method)

second generation corpora
• much larger (> 100 m tokens)
• standardization initiatives
• networks
• much more corpus-based research
• the gap between corpus linguists and theoretical linguists is closing
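
What a concordancer does can be shown with a tiny keyword-in-context (KWIC) routine. This is only a sketch under stated assumptions: it works on a pre-tokenised list, matches the search word case-insensitively, and the function name `kwic` is hypothetical; it does not reproduce any of the historical tools named in these slides. The toy tokens come from the Lord's Prayer lines quoted earlier.

```python
def kwic(tokens, node, width=2):
    """Return (left context, hit, right context) for every occurrence of `node`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


toy_tokens = "Vater unser der du bist im Himmel Unser Vater im Himmel".split()
for left, hit, right in kwic(toy_tokens, "im"):
    # Right-align the left context so the hits line up in one column.
    print(f"{left:>12} [{hit}] {right}")
```

Aligning all hits in one column is what lets a reader scan the typical contexts of a word at a glance.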

references
• Evert, Stefan & Fitschen, Arne (2001) Textkorpora. In: Carstensen et al. (eds.) Computerlinguistik und Sprachtechnologie. Eine Einführung. Spektrum Akademischer Verlag, Heidelberg, 369–376
• Featherston, Sam (2005) The Decathlon Model: Design features for an empirical syntax. In: Kepser, S. & Reis, M. (eds.) Linguistic Evidence: Empirical, Theoretical, and Computational Perspectives. Mouton de Gruyter, Berlin
• Keller, Frank (2001) Experimental Evidence for Constraint Competition in Gapping Constructions. In: Müller, Gereon & Sternefeld, Wolfgang (eds.) Competition in Syntax. Mouton de Gruyter, Berlin, 211–248
• Leech, Geoffrey (1993) Corpus Annotation Schemes. In: Literary and Linguistic Computing 8(4), 275–281
• Lüdeling, Anke (2006) XXX
• Manning, Christopher D. & Schütze, Hinrich (1999) Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, chapter 10
• Schütze, Carson T. (1996) The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology. Chicago University Press, Chicago

"corpus linguistics is not a branch of linguistics, but the route into linguistics".
Michael Hoey, remark at TALC 1998