Corpus based linguistics and translatology - Corpus

Corpus based linguistics and translatology
Corpus Linguistic Examples
Ekaterina Lapshinova-Koltunski
25.10.2012
Outline
1 What Corpus Linguistics is all about
    Sample research questions
    Methodology
2 Use and usage of words
    Frequencies
    Frequency distribution
    Distribution of word senses
    Collocations
    Use of “synonymous” words
    Language variation
3 Example Studies
    Example Study I
    Example II
What Corpus Linguistics is all about
Sample research questions
Simple research questions
the meanings of words
differentiation
“synonymous” words
use and usage of words
frequency
co-occurrence patterns of words (collocations)
language variation
language contrasts
textual studies
what is a text about? keywords, terminology
language variation
examples: deal and great-large-big, taken from: (Biber et al. 1998)
What Corpus Linguistics is all about
Methodology
Concordances and Word Lists
Concordances?
words in their context
Word lists and Counting
abstraction over results
frequency distribution
Concordances: Examples
did n’t know whether a <ghost> so transparent might
explanation . But the <ghost> sat down on the oppos
familiar with one old <ghost> , in a white waistcoa
dles pretends to see a <ghost> in the corner . I hea
come before me like a <ghost> , and haunted happier
, ’ like a reproachful <ghost> ! ’ I was obliged to
Try this on OPUS (open parallel corpus):
http://opus.lingfil.uu.se/bin/opuscqp.pl?corpus=OpenSubtitles;lang=en
Corpus Concordance:
http://www.lextutor.ca/concordancers/concord_e.html
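A keyword-in-context display like the ghost lines above can be sketched in a few lines of Python. This is only an illustration: the function name `kwic`, the whitespace tokenisation, the window size and the toy sentence are all assumptions, and real concordancers such as CQP work on pre-tokenised, indexed corpora.

```python
def kwic(tokens, keyword, window=4):
    """Return one concordance line per occurrence of keyword (case-insensitive)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # context words to the left and right of the hit
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} <{token}> {right}")
    return lines


# invented toy text, not taken from any corpus
toy_tokens = "the old ghost sat down and the young ghost stood up".split()
for line in kwic(toy_tokens, "ghost", window=2):
    print(line)
```

The hit is marked with angle brackets, as in the slide; sorting the lines by the word to the left or right of the hit would be a natural next step.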
Use and usage of words
Meanings of words
deal and its meanings?
KWIC display (from LOB corpus)
line  left context                       sense  node  right context
1     and secret plans prepared to       (2)    deal  with the mass sit-down
2     of companies and put one property  (3)    deal  through each. Mr.
3     . In particular, a good            (4)    deal  of concern has been
4     hangs a tale - and a great         (4)    deal  of money. Neville
5     where his new measures to          (2)    deal  with Britain’s
6     just a matter of working a good    (4)    deal  harder before we really
7     . “I’m mixed up in a               (3)    deal  involving millions
three different meanings:
(2) lines 1 and 5: handling a problem
(3) lines 2 and 7: business transaction
(4) all other lines: amount
Meaning of words
the concordance lines may become too many to handle ...
e.g., in an 8-million-word sample from the Longman-Lancaster corpus: 1,500 entries
what to do?
ranking (frequency)
most concordance programs can generate a frequency list of all the words contained in a corpus
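A frequency list of this kind can be sketched with Python's standard library. The function name `frequency_list`, the simple regular-expression tokeniser and the toy sentence are assumptions for illustration; a real study would rely on the corpus's own tokenisation.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word form, count) pairs, most frequent first."""
    # lower-case the text and keep runs of letters and apostrophes as tokens
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common()


# invented toy text
toy_text = "A deal is a deal. Deals differ."
for form, count in frequency_list(toy_text):
    print(form, count)
```

Note that this counts word forms, so deal and deals are listed separately.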
Frequencies
Frequency of words
frequency list of forms of deal generated by the TACT program (LOB corpus: 1 million words):
deal       182
dealing     52
deals       25
dealt       31
word forms (deal, dealing, deals, dealt) vs. lemma (base form: deal)
the lemma deal in LOB: 290 times (182 + 52 + 25 + 31).
Is that frequent?
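The step from word forms to a lemma count can be sketched as follows. The counts are the TACT figures from the slide; the form-to-lemma mapping is written by hand here and is an assumption, since a real workflow would take lemmas from a lemmatiser or from corpus annotation.

```python
from collections import defaultdict

# word-form counts for deal in LOB, as given on the slide
form_counts = {"deal": 182, "dealing": 52, "deals": 25, "dealt": 31}

# hand-written mapping from word form to lemma (hypothetical, for illustration)
lemma_of = {"deal": "deal", "dealing": "deal", "deals": "deal", "dealt": "deal"}


def lemma_frequencies(form_counts, lemma_of):
    """Sum the counts of all word forms that share a lemma."""
    totals = defaultdict(int)
    for form, count in form_counts.items():
        # forms without an entry in the mapping count as their own lemma
        totals[lemma_of.get(form, form)] += count
    return dict(totals)


print(lemma_frequencies(form_counts, lemma_of))
```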
(1) compared to function words, this is not frequent:
the           2,817
of           35,745
(2) other content words:
sigh             16
make          2,417
approach        185
(3) occurrence of tagged forms of deal (TACT, LOB):
deal_nn         115
deal_vb          66
dealing_vbg      51
deals_vbz        20
dealt_vbd        14
dealt_vbn        17
we can test it with CQP (Corpus Query Processor)
Frequency distribution
Basic statistics
raw frequency vs. normed frequency
raw frequency: can be misleading when samples differ in size
norm the count: normed frequency
normed frequency: convert the number of occurrences of a word to a standard scale (basis of norming), e.g., per 100,000 words or per 1,000,000 words (frequency per million, fpm)
formula:
$$\text{normed frequency} = \frac{\text{raw frequency}}{\text{number of words}} \times \text{basis of norming}$$
example:
$$\frac{14}{88{,}000} \times 100{,}000 = 15.9$$
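The norming formula translates directly into code. The sketch below is a minimal version; the function name `normed_frequency` and its parameter names are assumptions chosen for this illustration.

```python
def normed_frequency(raw_frequency, corpus_size, basis=100_000):
    """Scale a raw count to occurrences per `basis` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_frequency / corpus_size * basis


# the slide's example: 14 occurrences in a sample of 88,000 words
print(round(normed_frequency(14, 88_000), 1))
```

Passing `basis=1_000_000` gives the frequency per million words (fpm) instead.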
Example
2,417 occurrences of make in LOB ⇒ raw frequency
1,162,807 words in LOB (total)
normed frequency of make in LOB (basis of norming: 100,000 words)
formula:
$$\text{normed frequency} = \frac{\text{raw frequency}}{\text{number of words}} \times \text{basis of norming}$$
in our case:
$$\frac{2{,}417}{1{,}162{,}807} \times 100{,}000 \approx 207.9$$
Use and usage of words
Distribution of word senses
Distribution of word senses: across registers
how can we sort and analyse all the information from a
concordance file?
suppose we get 2,000 occurrences of deal in a 10 million word
corpus . . .
a good way to start: collocates (i.e., the words that a target word
commonly co-occurs with) - because there is a strong tendency
for each collocate of a word to be associated with a single sense
or meaning
Collocations
Words that show a tendency to co-occur
statistically salient patterns
(Firth, 1957): You shall know a word by the company it keeps!
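A first look at the collocates of a word can be obtained by counting its immediate left and right neighbours. The sketch below assumes a tokenised text; the function name `collocates` and the toy sentence are invented for illustration. It yields raw co-occurrence counts only, so judging whether a pattern is statistically salient would additionally require an association measure.

```python
from collections import Counter


def collocates(tokens, node, side="left"):
    """Count the words directly to the left or right of each occurrence of node."""
    offset = -1 if side == "left" else 1
    counts = Counter()
    for i, token in enumerate(tokens):
        j = i + offset
        # skip hits at the text boundary, where no neighbour exists
        if token == node and 0 <= j < len(tokens):
            counts[tokens[j]] += 1
    return counts


# invented toy text
toy_tokens = "a great deal of money and a good deal of time".split()
print(collocates(toy_tokens, "deal", side="left"))
print(collocates(toy_tokens, "deal", side="right"))
```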
Collocations
Collocates of deal
as a noun in a 5.7-million-word sample of the Longman-Lancaster corpus

academic prose (2.7-million sample)
                    freq   fpm
left collocates
  great              122    45
  good                63    23
right collocates
  of                 106    39
  more                18     7
  in                   8     3
  to                   8     3

fiction (3-million sample)
                    freq   fpm
left collocates
  great              122    40
  good                84    28
  the                 24     8
  big                 10     3
right collocates
  of                  84    28
  to                  22     7
  about               15     5
  more                10     3
  with                 9     3
The noun deal
the noun deal in academic prose: most likely to refer to either an amount or a business transaction (look back at the full concordances)
the noun deal in fiction:
other common uses compared to academic prose, e.g., agreement, lack of importance
plus one more meaning: deal as a type of wood
we can also compare the distribution over time (1950s vs. 2000s) or across language varieties (British vs. American), e.g. at http://corpus.byu.edu/
Frequency distribution
user-related: historical periods, dialects, sociolects
use-related: register (situation/function)
for example: the noun deal in selected registers

register           approx. no. of words   raw freq.   normed freq. for deal
                   in sample              for deal    (per 100,000 words)
press reportage          88,000              14            15.9
press review             34,000               4            11.8
press editorials         54,000               4             7.4
religion                 34,000               5            14.7
scientific              160,000              16            10.0
Use of “synonymous” words
Usage of “synonymous” words
big - large - great
frequency in a 5.7-million-word sample of the Longman-Lancaster corpus

                     freq     fpm
total sample
  big               1,319     230
  great             2,342     408
  large             2,254     393
academic prose
  big                  84      31
  great             1,641     605
  large               772     284
fiction
  big               1,235     408
  great               701     232
  large             1,482     490
Collocates of big - large - great
In Academic prose (right collocates)

big                  large                  great
collocate    fpm     collocate     fpm      collocate     fpm
enough       2.2     number        48.3     deal          44.6
traders      1.1     numbers       31.3     importance    12.5
                     scale         29.4     number         8.9
                     and           28.0     majority       8.1
                     enough        15.9     variety        7.0
                     proportion    11.8     extent         7.0
                     amounts       10.7     part           4.1
                     quantities    10.3     care           3.3
Collocates of big - large - great
In Fiction (right collocates)

big                  large                  great
collocate    fpm     collocate     fpm      collocate     fpm
man          9.6     and           15.2     deal          40.4
enough       8.9     black          4.3     man            6.6
and          8.3     enough         3.6     burrow         5.6
black        8.3     house          3.0     big            4.6
house        7.6     room           2.7     aunt           4.3
one          7.0     white          2.7     care           4.0
toe          5.0     number         2.3     pleasure       4.0
old          4.6     for            2.3     and            3.0
Language variation
Keywords
Computer Science    Linguistics    Biology
algorithm           language       gene
time                verb           sequence
problem             case           protein
graph               example        cell
set                 word           region
number              form           expression
edge                analysis       figure
node                structure      site
proof               clause         analysis
case                argument       DNA
What are they good for?
Terminology: number of N
Biology       Computer Science    Linguistics
genes         edges               syllables
tRNA          packets             synsets
nucleotide    vertices            tokens
amino         nodes               contact
gene          rounds              words
cells         iterations          borrowing
species       queries             languages
repeats       times               errors
proteins      variables           ways
ESTs          elements            premodifiers
test it for German:
http://opus.lingfil.uu.se/bin/opuscqp.pl?corpus=Europarl3;lang=de
Terminology: number of N of N
Computer Science      Linguistics              Biology
proof of theorem      point of view            number of genes
loss of generality    place of articulation    conflict of interest
proof of lemma        rule of paradigm         amplification of cDNA
number of edges       moment of speech         institutes of health
number of packets     rules of paradigm        origin of replication
number of vertices    state of affairs         number of tRNA
number of nodes       number of syllables      absence of Tc
number of rounds      degree of commitment     expression of genes
set of vertices       part of speech           orders of magnitude
set of edges          varieties of English     levels of expression
test it for German:
http://opus.lingfil.uu.se/bin/opuscqp.pl?corpus=Europarl3;lang=de
Terminology: patterns
pattern               example
Adj N prep N          early stage of development
Adj conj Adj N        upper and lower bounds
Adj Adj N             exponential lower bounds
Adj NN                maximum buffer size
Adj N                 lower bounds
VVP N                 consumed energy
NN                    wind energy
N of N                production of energy
N prep Adj N          introduction of new technology
N prep (N conj N)     emission of sulphur and nitrogen
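Term candidates for such patterns can be collected by sliding a part-of-speech pattern over a tagged text. The sketch below is a simplified illustration: the tag names follow the slide's abbreviations (Adj, N, prep, conj), while the function name `match_pattern` and the hand-tagged toy sequence are assumptions. A real corpus would carry the tags of its own tagset.

```python
def match_pattern(tagged, pattern):
    """Return all word sequences whose tags match the pattern exactly.

    tagged:  list of (word, tag) pairs
    pattern: list of tags, e.g. ["Adj", "N"]
    """
    n = len(pattern)
    hits = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        # compare the tag of each word in the window with the pattern
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            hits.append(" ".join(word for word, _ in window))
    return hits


# hand-tagged toy sequence (invented for illustration)
toy_tagged = [
    ("early", "Adj"), ("stage", "N"), ("of", "prep"), ("development", "N"),
    ("and", "conj"), ("wind", "N"), ("energy", "N"),
]
print(match_pattern(toy_tagged, ["Adj", "N", "prep", "N"]))
print(match_pattern(toy_tagged, ["N", "N"]))
```

Counting the hits per pattern and ranking them by frequency would then give lists like the terminology tables on the preceding slides.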
Terminology: multilingual (Europarl)
DE                   freq       EN                     freq
Kommission         127815       Commission           116134
Union               90788       Mr President          71187
Herr Präsident      69872       report                61381
Parlament           67738       Europe                59368
Bericht             55733       Parliament            59232
Mitgliedstaaten     53960       Council               58620
Rat                 45836       Member States         58066
Europa              41666       countries             50168
Frage               38802       European Union        48936
Maßnahmen           29843       people                48788
Vorschlag           29622       time                  46846
Parlaments          29371       way                   36137
Entwicklung         27784       fact                  33672
EU                  26305       Union                 31556
Menschen            22529       proposal              30607
Kollegen            22347       debate                30226
Rahmen              22148       question              28245
Arbeit              22059       Committee             28162
Zusammenarbeit      21787       years                 27869
Länder              21425       rights                27578
Bürger              21346       issue                 27275
Zeit                20782       work                  26875
Bereich             20233       European Parliament   26855
test it in CQP
Summary
Concordances show the use of a word in context
    co-occurrences
    different meanings
    semantically related words
Frequency lists give information about word frequencies
    topic of the text (most frequent content words)
    collocations (frequent co-occurrences)
    terminology (words specific to a genre/register)
    differences/commonalities between different texts/registers/languages
Concordance Tools
WordSmith (commercial): http://www.lexically.net/wordsmith/version5/index.html
Wconcord (free): http://www.linglit.tu-darmstadt.de/index.php?id=linguistics
in online corpora ...
Example Studies
Example Study I
English Dative Alternation
constructions with double objects (NP NP):
(1) John gave [Mary] [the book]
prepositional dative constructions (NP PP):
(2) John gave [the book] [to Mary]
Two aspects:
1 Causing a change of state (possession) ⇒ V NP NP
  Ex: John gave [Mary] [the book]
2 Causing a change of place (movement to goal) ⇒ V NP [to NP]
  Ex: John gave [the book] [to Mary]
“Meaning-to-Structure Mapping” hypothesis, cf. (Pinker, 1989)
Traditional Analysis
Evidence from idioms which allow one aspect only:
give someone the creeps (’to frighten or unsettle someone’)
→ change of state – no change of place
That movie gave me the creeps.
*That movie gave the creeps to me.
EVIDENCE
1 This life-sized prop will give the creeps to just about anyone! Guess he wasn’t quite dead when we buried him!
  (http://www.frightshop.com/)
2 Some of Andy’s death screens are pretty nasty and the enemies are guaranteed to give the creeps to the smaller set.
  (http://www.ladydragon.com/a-heartofdarkness.html)
3 Stories like these must give the creeps to people whose idea of heaven is a world without religion...
  (http://enquirer.com/editions/2001/09/30/loclordsgym.html)
4 Bioshock pushes it all the way on PS3, though, and Dead Space will give the creeps to all Xbox 360 gamers who love a spot of survival horror.
  (GAMES REVIEWS by The Journal (Newcastle, England), seen at http://legal-dictionary.thefreedictionary.com/give+the+creeps#Browsers)
Factors
givenness (already mentioned or not)
definiteness
form (lexical NP or pronoun)
thematic role
length/weightiness
Jennifer Hay
3263 examples with a double object
Switchboard corpus (spoken)
Wall Street Journal corpus (written)
R package languageR (Harald Baayen)
simple dataset: verbs
full dataset: dative
Dataset: www.blackwellpublishing.com/quantmethods
Syntax: Bresnan et al.’s dative alternation data.
Factors of influence:
1 He dragged [a guest] [a can of beer].
  recipient = a guest: indefinite, unknown, animate
  theme = a can of beer: indefinite, unknown, inanimate
2 “Well... it started like this...” Shinbo explained while Sumomo dragged [him] [a can of beer] and opened it for him...
  recipient = him: definite, mentioned, animate, pronoun
  theme = a can of beer: indefinite, unknown, inanimate
(Collins, 1995) Features of the recipient in the first NP, compared to the second NP (theme):
  more discourse accessible
  more definite
  pronominal
  shorter
(Collins, 1995): accessibility
cf. (Bresnan et al., 2007)
Summary
evaluation of single sentences without context only partially reflects the grammatical variants
’usage data’ show generalisations which are not observable subjectively
double object constructions are more variable than expected
meaning cannot explain the variation
⇒ quantitative, corpus-based analysis!!!
Example II
Language Acquisition
Development of Language Competence
occurrence of expected mistakes
sequence of acquisition of a construction
acquisition of several constructions
Corpus Linguistics:
large amounts of spontaneous speech data
accessible and can be validated statistically
Dataset: CHILDES (Child Language Data Exchange System): http://childes.psy.cmu.edu/
Approach
Syntactic aspects (constructions)?
Theory to test?
Prediction of the theory?
Data: age and background of children?
utterances?
Example: Acquisition of wh-questions
Wh-word + SUBJ + finite VERB
Wh-word + finite AUX + SUBJ (subj-aux inversion)
Hypothesis:
acquisition of wh-questions = inversion
Analysed Dataset:
Adam 3;6 (Brown Corpus): 832 utterances; 178 wh-questions; 4 no inversion:
a. Why you won’t let me fly?
b. Why de tail is gon to break# huh?
c. Why your had is out like that?
d. Why he can excercise it?
relative frequency calculation:
1 no inversion / all utterances:
  4 / 832 ≈ 0.5 %
2 no inversion / all wh-questions:
  4 / 178 ≈ 2.2 %
Are all wh-questions relevant for our analysis?
Further cases:
a. What you eat for dinner? (no auxiliary)
b. What do eat for dinner? (no subject)
c. Do you know what you can eat for dinner? (embedded clauses)
d. Who can stay for dinner? (subject questions)
we are interested in non-subject matrix wh-questions:
What can she eat? vs. What she can eat?
relative frequency calculation:
1 no inversion / non-subject matrix wh-questions:
  4 / 27 ≈ 15 %
2 no inversion / non-subject matrix why-questions:
  4 / 5 = 80 %
⇒ at the age of 3;6, inversion in why-questions is not yet acquired, in contrast to other wh-questions
cf. (Stromswold, 1996)
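The effect of the chosen denominator can be made explicit with a small calculation. The counts below are the ones reported for Adam at 3;6; the function name `relative_frequency` and the dictionary labels are assumptions for this sketch.

```python
def relative_frequency(hits, total):
    """Return hits as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * hits / total


non_inverted = 4  # wh-questions without subject-auxiliary inversion

# candidate reference sets and their sizes, as given on the slides
denominators = {
    "all utterances": 832,
    "all wh-questions": 178,
    "non-subject matrix wh-questions": 27,
    "non-subject matrix why-questions": 5,
}

for label, total in denominators.items():
    print(f"{label}: {relative_frequency(non_inverted, total):.1f} %")
```

The same four examples look negligible against all utterances but dominant against the why-questions, which is why the reference set has to be chosen before counting.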
Students’ Projects
http://fr46.uni-saarland.de/lsteich/WS201213-HS-CL/Former_Projects.html
Thank you!
Assignment
Define:
a. Subject of analysis?
b. Features?
c. Outline the data in a table.
Example 1 illustrates the task. Do the same for Example 2:
Example 1 We want to analyse the average sentence length in ’Lenz’ (Georg Büchner) on the basis of the number of words in a sentence. A word is an orthographic unit that is separated from other words by a space or a punctuation mark.
cf. CL by H.Zinsmeister
Example 1
a. Sentence ID, sentence length
b. Sentence ID: any label: s_1, s_2, also 1, 2, 3 etc.; sentence length: numeric, numbers (1, 2, ...)
c.
sent ID   length   sentence
s_1        7       Den [20. Januar] ging Lenz durch’s Gebirg.
s_2       17       Die Gipfel und hohen Bergflächen im Schnee, die Thäler hinunter graues Gestein, grüne Flächen, Felsen und Tannen.
s_3       14       Es war naßkalt, das Wasser rieselte die Felsen hinunter und sprang über den Weg.
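The table for Example 1 can be reproduced with a short script. It is a sketch under two assumptions: the sentences are segmented by hand (a naive split at full stops would break after “20.”), and a word is whatever is delimited by whitespace, so punctuation stays attached to the preceding word. The names `sentences` and `sentence_length` are chosen for illustration.

```python
# sentences segmented by hand, labelled with the IDs from the table
sentences = {
    "s_1": "Den [20. Januar] ging Lenz durch’s Gebirg.",
    "s_2": ("Die Gipfel und hohen Bergflächen im Schnee, die Thäler hinunter "
            "graues Gestein, grüne Flächen, Felsen und Tannen."),
    "s_3": ("Es war naßkalt, das Wasser rieselte die Felsen hinunter "
            "und sprang über den Weg."),
}


def sentence_length(sentence):
    """Number of whitespace-separated words in a sentence."""
    return len(sentence.split())


lengths = {sent_id: sentence_length(text) for sent_id, text in sentences.items()}
average = sum(lengths.values()) / len(lengths)

for sent_id, length in lengths.items():
    print(sent_id, length)
print("average sentence length:", round(average, 2))
```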
Assignment
a. Subject of analysis?
b. Features?
c. outline the data in a table.
Example 2 We want to analyse language knowledge of L2 learners. For
this, we define 3 levels of sentence complexity:
1 simple: Touristen lieben das Reisen. (main clause with 1 verb)
2 intermediate: Touristen wollen viel erleben (main clause with more
than 1 verb)
3 complex: Touristen meinen, dass das Reisen Spaß machen. (main
and subordinate clauses)
Text for Analysis (Alesko, wdt07_02):
Ist Urlaub die vergebliche Flucht aus dem Alltag? Heutzutage gelangt es
in hohe Konjunktur, einen Urlaub zu machen. Immer mehr Menschen
bevorzugen in den Ferien einen Urlaub aus Abwechslung.
References
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge University Press, New York.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, H. (2007). Predicting the Dative Alternation. In: G. Bouma, I. Kraemer and J. Zwarts (eds.), Cognitive Foundations of Interpretation, pp. 69-94. Royal Netherlands Academy of Arts and Sciences.
Collins, P. (1995). The indirect object construction in English: an informational approach. Linguistics 33, pp. 35-49.
Firth, J. (1957). A synopsis of linguistic theory 1930-55. In: Studies in linguistic analysis, pp. 1-32. The Philological Society, Oxford.
Stromswold, K. (1996). Analyzing children’s spontaneous speech. In: D. McDaniel, C. McKee and H. Cairns (eds.), Methods for assessing children’s syntax, pp. 23-53. Cambridge, MA: MIT Press.