Corpus-based linguistics and translatology
Corpus Linguistic Examples
Ekaterina Lapshinova-Koltunski
25.10.2012

Outline
1 What Corpus Linguistics is all about
  - Sample research questions
  - Methodology
2 Use and usage of words
  - Frequencies
  - Frequency distribution
  - Distribution of word senses
  - Collocations
  - Use of “synonymous” words
  - Language variation
3 Example Studies
  - Example Study I
  - Example II

What Corpus Linguistics is all about: Sample research questions
Simple research questions
- the meanings of words
  - differentiating “synonymous” words
- use and usage of words
  - frequency
  - co-occurrence patterns of words (collocations)
- language variation
  - language contrasts
- textual studies
  - what is a text about?
  - keywords, terminology
  - language variation
Examples: deal and great-large-big, taken from (Biber et al. 1998)

Methodology: Concordances and Word Lists
Concordances
- words in their context
Word lists and counting
- abstraction over results
- frequency distribution

Methodology: Concordances: Examples
  did n’t know whether a  <ghost>  so transparent might
   explanation . But the  <ghost>  sat down on the oppos
   familiar with one old  <ghost>  , in a white waistcoa
  dles pretends to see a  <ghost>  in the corner . I hea
   come before me like a  <ghost>  , and haunted happier
  , ’ like a reproachful  <ghost>  ! ’ I was obliged to
Try this on OPUS (open parallel corpus): http://opus.lingfil.uu.se/bin/opuscqp.pl?corpus=OpenSubtitles;lang=en
Corpus Concordance: http://www.lextutor.ca/concordancers/concord_e.html

Use and usage of words: Meanings of words
deal and its meanings? KWIC display (from the LOB corpus):
  line  left context                       sense  node  right context
  1     and secret plans prepared to       (2)    deal  with the mass sit-down
  2     of companies and put one           (3)    deal  through each. Mr.
  3     property . In particular, a good   (4)    deal  of concern has been
  4     hangs a tale - and a great         (4)    deal  of money. Neville
  5     where his new measures to          (2)    deal  with Britain’s
  6     just a matter of working a good    (4)    deal  harder before we really
  7     . “I’m mixed up in a               (3)    deal  involving millions
Three different meanings:
1 lines 1 + 5: (2) handling a problem
2 lines 2 + 7: (3) business transaction
3 others: (4) amount

Use and usage of words: Meaning of words
- concordance lines may become too many to handle, e.g., 1,500 entries for deal in an 8-million-word sample from the Longman-Lancaster corpus
- what to do? ranking (frequency)
- most concordance programs can generate a frequency list of all the words contained in a corpus

Frequencies: Frequency of words
Frequency list of the forms of deal generated by the TACT program (LOB corpus: 1 million words):
  deal      182
  dealing    52
  deals      25
  dealt      31
- word forms (deal, dealing, deals, dealt) vs. lemma (base form: deal)
- deal in LOB: 290 times. Is that frequent?

Frequencies: Frequency of words (continued)
- compared to function words, this is not frequent:
    the      2,817
    of      35,745
- other content words:
    sigh        16
    make     2,417
    approach   185
- occurrence of tagged forms of deal (TACT, LOB):
    deal_nn      115
    deal_vb       66
    dealing_vbg   51
    deals_vbz     20
    dealt_vbd     14
    dealt_vbn     17
- we can test it with CQP (Corpus Query Processor)

Frequency distribution: Basic statistics
raw frequency vs. normed frequency
- raw frequency: can be misleading
- norm the count: normed frequency
- normed frequency: convert the number of occurrences of a word to a standard scale (basis of norming), e.g., 100,000 or 1,000,000 (frequency per million, fpm)
- formula:
  \[ \text{normed frequency} = \frac{\text{raw frequency}}{\text{number of words}} \times \text{basis of norming} \]
- example:
  \[ \frac{14}{88{,}000} \times 100{,}000 = 15.9 \]

Frequency distribution: Example
- 2,417 occurrences of make in LOB ⇒ raw frequency
- 1,162,807 words in LOB (total)
- normed frequency of make in LOB, using the same formula:
  \[ \frac{2{,}417}{1{,}162{,}807} \times 100{,}000 \approx 207.9 \]

Distribution of word senses: across registers
- how can we sort and analyse all the information from a concordance file?
- suppose we get 2,000 occurrences of deal in a 10-million-word corpus ...
- a good way to start: collocates (i.e., the words that a target word commonly co-occurs with), because there is a strong tendency for each collocate of a word to be associated with a single sense or meaning
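The norming formula from the Basic statistics slide can be checked with a few lines of Python. This is a minimal sketch: the function name `normed_frequency` and its `basis` parameter are illustrative choices, not something the slides prescribe. It recomputes the two worked examples; note that 2,417 occurrences in 1,162,807 words come to roughly 207.9 per 100,000 words (about 2,079 per million).

```python
def normed_frequency(raw_count, corpus_size, basis=100_000):
    """Scale a raw count to occurrences per `basis` words.

    `basis` is the basis of norming (e.g. 100_000, or 1_000_000 for fpm).
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * basis


# deal in press reportage: 14 occurrences in about 88,000 words
deal_press = normed_frequency(14, 88_000)

# make in LOB: 2,417 occurrences in 1,162,807 words
make_lob = normed_frequency(2_417, 1_162_807)

# the same count expressed per million words (fpm)
make_lob_fpm = normed_frequency(2_417, 1_162_807, basis=1_000_000)

print(round(deal_press, 1))
print(round(make_lob, 1))
print(round(make_lob_fpm))
```

Norming to a common basis is what makes counts from samples of different sizes comparable; the choice between 100,000 and 1,000,000 only shifts the decimal point.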
Collocations
- words that show a tendency to co-occur
- statistically salient patterns
- (Firth, 1957): You shall know a word by the company it keeps!

Collocations: Collocates of deal as a noun
in a 5.7-million-word sample of the Longman-Lancaster corpus

academic prose (2.7-million sample)
                     freq   fpm
  left collocates
    great             122    45
    good               63    23
  right collocates
    of                106    39
    more               18     7
    in                  8     3
    to                  8     3

fiction (3-million sample)
                     freq   fpm
  left collocates
    great             122    40
    good               84    28
    the                24     8
    big                10     3
  right collocates
    of                 84    28
    to                 22     7
    about              15     5
    more               10     3
    with                9     3

Collocations: The noun deal
- the noun deal in academic prose: most likely to refer either to an amount or to a business transaction (look back at the full concordances)
- the noun deal in fiction: other common uses compared to academic prose, e.g., agreement, lack of importance
- plus one more meaning: deal as a type of wood
- we can also compare the distribution over time (1950s vs. 2000s) or over language variants (British vs. American), e.g. at http://corpus.byu.edu/

Collocations: Frequency distribution
- user-related: historical periods, dialects, sociolects
- use-related: register (situation/function)
For example, the noun deal in selected registers:
  register          approx. no. of words   raw freq.   normed freq. for deal
                    in sample              for deal    (per 100,000 words)
  press reportage    88,000                14          15.9
  press review       34,000                 4          11.8
  press editorials   54,000                 4           7.4
  religion           34,000                 5          14.7
  scientific        160,000                16          10.0

Use of “synonymous” words: Usage of “synonymous” words
big - large - great: frequency in a 5.7-million-word sample of the Longman-Lancaster corpus
                      freq    fpm
  total sample
    big              1,319    230
    great            2,342    408
    large            2,254    393
  academic prose
    big                 84     31
    great            1,641    605
    large              772    284
  fiction
    big              1,235    408
    great              701    232
    large            1,482    490

Use of “synonymous” words: Collocates of big - large - great
In academic prose (right collocates, fpm):
  big                  large                  great
  enough     2.2       number      48.3       deal         44.6
  traders    1.1       numbers     31.3       importance   12.5
                       scale       29.4       number        8.9
                       and         28.0       majority      8.1
                       enough      15.9       variety       7.0
                       proportion  11.8       extent        7.0
                       amounts     10.7       part          4.1
                       quantities  10.3       care          3.3

In fiction (right collocates, fpm):
  big                  large                  great
  man        9.6       and         15.2       deal         40.4
  enough     8.9       black        4.3       man           6.6
  and        8.3       enough       3.6       burrow        5.6
  black      8.3       house        3.0       big           4.6
  house      7.6       room         2.7       aunt          4.3
  one        7.0       white        2.7       care          4.0
  toe        5.0       number       2.3       pleasure      4.0
  old        4.6       for          2.3       and           3.0

Language variation: Keywords
  Computer Science   Linguistics   Biology
  algorithm          language      gene
  time               verb          sequence
  problem            case          protein
  graph              example       cell
  set                word          region
  number             form          expression
  edge               analysis      figure
  node               structure     site
  proof              clause        analysis
  case               argument      DNA
What are they good for?
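The collocate tables for deal and for big - large - great rest on counting the words that stand immediately to the left and right of a node word. The sketch below illustrates that counting step on the seven KWIC lines for deal quoted from the LOB display earlier in these slides. It is only an illustration under simplifying assumptions: whitespace tokenisation, a window of one word on each side, and a hypothetical helper name `immediate_collocates`; real studies work on full corpora and usually add a statistical association measure.

```python
from collections import Counter

# KWIC lines for the node word "deal": (left context, node, right context),
# copied from the LOB display in the slides.
KWIC = [
    ("and secret plans prepared to", "deal", "with the mass sit-down"),
    ("of companies and put one", "deal", "through each. Mr."),
    ("property . In particular, a good", "deal", "of concern has been"),
    ("hangs a tale - and a great", "deal", "of money. Neville"),
    ("where his new measures to", "deal", "with Britain's"),
    ("just a matter of working a good", "deal", "harder before we really"),
    (". \"I'm mixed up in a", "deal", "involving millions"),
]


def immediate_collocates(kwic_lines):
    """Count the word directly left and directly right of the node."""
    left, right = Counter(), Counter()
    for left_ctx, _node, right_ctx in kwic_lines:
        left_tokens = left_ctx.lower().split()
        right_tokens = right_ctx.lower().split()
        if left_tokens:
            left[left_tokens[-1]] += 1   # last word before the node
        if right_tokens:
            right[right_tokens[0]] += 1  # first word after the node
    return left, right


left, right = immediate_collocates(KWIC)
print("left collocates: ", left.most_common())
print("right collocates:", right.most_common())
```

Even on seven lines the pattern from the tables shows up: good and great appear on the left of the "amount" sense, and of and with dominate on the right.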
Language variation: Terminology: number of N
  Biology      Computer Science   Linguistics
  genes        edges              syllables
  tRNA         packets            synsets
  nucleotide   vertices           tokens
  amino        nodes              contact
  gene         rounds             words
  cells        iterations         borrowing
  species      queries            languages
  repeats      times              errors
  proteins     variables          ways
  ESTs         elements           premodifiers
Test it for German: http://opus.lingfil.uu.se/bin/opuscqp.pl?corpus=Europarl3;lang=de

Language variation: Terminology: N of N
  Computer Science      Linguistics             Biology
  proof of theorem      point of view           number of genes
  loss of generality    place of articulation   conflict of interest
  proof of lemma        rule of paradigm        amplification of cDNA
  number of edges       moment of speech        institutes of health
  number of packets     rules of paradigm       origin of replication
  number of vertices    state of affairs        number of tRNA
  number of nodes       number of syllables     absence of Tc
  number of rounds      degree of commitment    expression of genes
  set of vertices       part of speech          orders of magnitude
  set of edges          varieties of English    levels of expression
Test it for German: http://opus.lingfil.uu.se/bin/opuscqp.pl?corpus=Europarl3;lang=de

Language variation: Terminology: patterns
  pattern               example
  Adj N prep N          early stage of development
  Adj conj Adj N        upper and lower bounds
  Adj Adj N             exponential lower bounds
  Adj NN                maximum buffer size
  Adj N                 lower bounds
  VVP N                 consumed energy
  NN                    wind energy
  N of N                production of energy
  N prep Adj N          introduction of new technology
  N prep (N conj N)     emission of sulphur and nitrogen

Language variation: Terminology: multilingual (Europarl)
DE, terms in descending order of frequency:
  Kommission Union Herr Präsident Parlament Bericht Mitgliedstaaten Rat Europa Frage Maßnahmen Vorschlag Parlaments Entwicklung EU Menschen Kollegen Rahmen Arbeit Zusammenarbeit Länder Bürger Zeit Bereich
DE, frequencies:
  127815 90788 69872 67738 55733 53960 45836 41666 38802 29843 29622 29371 27784 26305 22529 22347 22148 22059 21787 21425 21346 20782 20233
EN, terms in descending order of frequency:
  Commission Mr President report Europe Parliament Council Member States countries European Union people time way fact Union proposal debate question Committee years rights issue work European Parliament
EN, frequencies:
  116134 71187 61381 59368 59232 58620 58066 50168 48936 48788 46846 36137 33672 31556 30607 30226 28245 28162 27869 27578 27275 26875 26855
Test it in CQP.

Language variation: Summary
Concordances show
- the use of a word in context
- co-occurrences
- different meanings
- semantically related words
Frequency lists give information about
- word frequencies
- topic of the text (most frequent content words)
- collocations (frequent co-occurrences)
- terminology (words specific to a genre/register)
- differences/commonalities between different texts/registers/languages

Language variation: Concordance Tools
- WordSmith (commercial): http://www.lexically.net/wordsmith/version5/index.html
- Wconcord (free): http://www.linglit.tu-darmstadt.de/index.php?id=linguistics
- in online corpora ...
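Besides dedicated concordance tools, a basic frequency list of the kind described in the summary can be sketched in a few lines of Python. The example below is a rough illustration, not a replacement for a corpus tool: it assumes a simple regular-expression tokeniser, does no lemmatisation, and the function name `frequency_list` is hypothetical. The sample text is the opening of Büchner's 'Lenz', which these slides use in the assignment.

```python
import re
from collections import Counter

# Sample text: the three 'Lenz' sentences quoted in the assignment.
TEXT = (
    "Den 20. Januar ging Lenz durch's Gebirg. "
    "Die Gipfel und hohen Bergflächen im Schnee, die Thäler hinunter "
    "graues Gestein, grüne Flächen, Felsen und Tannen. "
    "Es war naßkalt, das Wasser rieselte die Felsen hinunter "
    "und sprang über den Weg."
)


def frequency_list(text):
    """Return a Counter of lower-cased word forms.

    Tokens are maximal runs of letters/digits, so punctuation is dropped
    and "durch's" is split into two tokens (a simplification).
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


freq = frequency_list(TEXT)

# Print the five most frequent word forms with their raw counts.
for word, count in freq.most_common(5):
    print(f"{word:<10}{count}")
```

As the slides point out, the top of such a list is dominated by function words (here die and und); the most frequent content words are the ones that hint at the topic of a text.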
Example Studies: Example Study I
English Dative Alternation
- constructions with double objects (NP NP):
  (1) John gave [Mary] [the book]
- prepositional dative constructions (NP PP):
  (2) John gave [the book] [to Mary]

English Dative Alternation: Two aspects
1 Causing a change of state (possession) ⇒ V NP NP
  Ex: John gave [Mary] [the book]
2 Causing a change of place (movement to goal) ⇒ V NP [to NP]
  Ex: John gave [the book] [to Mary]
“Meaning-to-Structure Mapping” hypothesis, cf. (Pinker, 1989)

English Dative Alternation: Traditional Analysis
Evidence from idioms which allow one aspect only:
- give someone the creeps (German: ’jemandem das Fürchten lehren’) → change of state, no change of place
  That movie gave me the creeps.
  *That movie gave the creeps to me.

English Dative Alternation: EVIDENCE
1 This life-sized prop will give the creeps to just about anyone! Guess he wasn’t quite dead when we buried him! (http://www.frightshop.com/)
2 Some of Andy’s death screens are pretty nasty and the enemies are guaranteed to give the creeps to the smaller set. (http://www.ladydragon.com/a-heartofdarkness.html)
3 Stories like these must give the creeps to people whose idea of heaven is a world without religion... (http://enquirer.com/editions/2001/09/30/loclordsgym.html)
4 Bioshock pushes it all the way on PS3, though, and Dead Space will give the creeps to all Xbox 360 gamers who love a spot of survival horror. (GAMES REVIEWS by The Journal (Newcastle, England), seen at http://legal-dictionary.thefreedictionary.com/give+the+creeps#Browsers)

English Dative Alternation: Factors already mentioned
- definiteness
- form (lexical NP or pronoun)
- thematic role
- length/weightiness

English Dative Alternation: Data
- Jennifer Hay
- 3263 examples with a double object
- Switchboard corpus (spoken)
- Wall Street Journal corpus (written)
- R language package (Harald Baayen)
  - simple dataset of verbs
  - full dataset of dative
- Dataset: www.blackwellpublishing.com/quantmethods
- Syntax: Bresnan et al.’s dative alternation data.

English Dative Alternation: Factors of influence
1 He dragged [a guest] [a can of beer].
  recipient = a guest: indefinite, unknown, animate
  theme = a can of beer: indefinite, unknown, inanimate
2 “Well... it started like this...” Shinbo explained while Sumomo dragged [him] [a can of beer] and opened it for him...
  recipient = him: definite, mentioned, animate, pronoun
  theme = a can of beer: indefinite, unknown, inanimate
(Collins, 1995) Features of the recipient in the first NP, compared to the second NP (theme):
- more discourse accessible
- more definite
- pronominal
- shorter

English Dative Alternation: Accessibility
(Collins, 1995): accessibility, cf. (Bresnan et al., 2007)

English Dative Alternation: Summary
- the evaluation of single sentences without context only partially reflects the grammatical variants
- ’usage data’ show generalisations which are not observable subjectively
- double object constructions are more variable than expected
- meaning cannot explain the variation
⇒ quantitative, corpus-based analysis!

Example Studies: Example II
Language Acquisition: Development of Language Competence
- occurrence of expected mistakes
- sequence of acquisition of a construction
- acquisition of several constructions
Corpus linguistics: large amounts of spontaneous speech data are accessible and can be validated statistically
Dataset: CHILDES (Child Language Data Exchange System): http://childes.psy.cmu.edu/

Language Acquisition: Approach
- Syntactic aspects (constructions)?
- Theory to test?
- Prediction of the theory?
- Data: age and background of children? utterances?
Language Acquisition: Example: Acquisition of wh-questions
- Wh-word + SUBJ + finite VERB
- Wh-word + finite AUX + SUBJ (subject-auxiliary inversion)
Hypothesis: acquisition of wh-questions = inversion
Analysed dataset: Adam 3;6 (Brown Corpus): 832 utterances; 178 wh-questions; 4 without inversion:
a. Why you won’t let me fly?
b. Why de tail is gon to break# huh?
c. Why your had is out like that?
d. Why he can exercise it?

Language Acquisition: Example: Acquisition of wh-questions (continued)
Relative frequency calculation:
1 no inversion / all utterances: 4 / 832 ≈ 0.5 %
2 no inversion / all wh-questions: 4 / 178 ≈ 2.2 %
Are all wh-questions relevant for our analysis? Further cases:
a. What you eat for dinner? (no auxiliary)
b. What do eat for dinner? (no subject)
c. Do you know what you can eat for dinner? (embedded clauses)
d. Who can stay for dinner? (subject questions)

Language Acquisition: Example: Acquisition of wh-questions (continued)
We are interested in non-subject matrix wh-questions: What can she eat? vs. What she can eat?
Relative frequency calculation:
1 no inversion / non-subject matrix wh-questions: 4 / 27 ≈ 15 %
2 no inversion / non-subject matrix why-questions: 4 / 5 = 80 %
⇒ at the age of 3;6, why-questions are not yet acquired, in contrast to other wh-questions, cf. (Stromswold, 1996)

Students’ Projects
http://fr46.uni-saarland.de/lsteich/WS201213-HS-CL/Former_Projects.html

Thank you!

Assignment
Define:
a. Subject of analysis?
b. Features?
c. Outline the data in a table.
Example 1 illustrates the task. Do the same for Example 2.

Example 1
We want to analyse the average sentence length in ’Lenz’ (Georg Büchner) on the basis of the number of words in a sentence. A word is an orthographic unit that is separated from other words by a space or a punctuation mark. (cf. CL by H. Zinsmeister)

Assignment: Example 1 (worked out)
a. Sentence ID, sentence length
b. Sentence ID: any label: s_1, s_2, or also 1, 2, 3 etc.; sentence length: numeric (1, 2, ...)
c.
  sent ID   length   sentence
  s_1       7        Den [20. Januar] ging Lenz durch’s Gebirg.
  s_2       17       Die Gipfel und hohen Bergflächen im Schnee, die Thäler hinunter graues Gestein, grüne Flächen, Felsen und Tannen.
  s_3       14       Es war naßkalt, das Wasser rieselte die Felsen hinunter und sprang über den Weg.

Assignment: Example 2
a. Subject of analysis?
b. Features?
c. Outline the data in a table.
We want to analyse the language knowledge of L2 learners. For this, we define 3 levels of sentence complexity:
1 simple: Touristen lieben das Reisen. (main clause with 1 verb)
2 intermediate: Touristen wollen viel erleben. (main clause with more than 1 verb)
3 complex: Touristen meinen, dass das Reisen Spaß macht. (main and subordinate clauses)
Text for analysis (Alesko, wdt07_02):
Ist Urlaub die vergebliche Flucht aus dem Alltag? Heutzutage gelangt es in hohe Konjunktur, einen Urlaub zu machen. Immer mehr Menschen bevorzugen in den Ferien einen Urlaub aus Abwechslung.

References
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge University Press, New York.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, H. (2007). Predicting the Dative Alternation. In: G. Bouma, I. Kraemer and J. Zwarts (eds.) Cognitive Foundations of Interpretation, pp. 69-94. Royal Netherlands Academy of Arts and Sciences.
Collins, P. (1995). The indirect object construction in English: an informational approach. Linguistics 33, pp. 35-49.
Firth, J. (1957). A synopsis of linguistic theory 1930-55. In: Studies in linguistic analysis, pp. 1-32. The Philological Society, Oxford.
Stromswold, K. (1996). Analyzing children’s spontaneous speech. In: D. McDaniel, C. McKee and H. Cairns (eds.) Methods for assessing children’s syntax, pp. 23-53. Cambridge, MA: MIT Press.