Corpus Linguistics
DGfS & GLOW Summer School Micro- and Macrovariation, Stuttgart, 2006
Anke Lüdeling
[email protected]

Anke Lüdeling, DGfS & GLOW Summer School, Stuttgart, Aug 2006

organisation
• course plan (→ handout)
• all materials will be available online at http://www2.huberlin.de/korpling/lehre/SummerSchoolStutt.php/
• mini-projects (in groups): ten-minute presentation plus short written summary
• you can contact me via email any time

corpus linguistics
• is concerned with
  ◦ design
  ◦ processing (architecture and annotation)
  ◦ evaluation
• of corpus data, where a corpus is "A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language." (http://www.ilc.cnr.it/EAGLES96/corpintr/corpintr.html)

today
• different kinds of data for different linguistic research questions
• brief history of corpus linguistics

linguistic data
• where do linguists get the data they need to test hypotheses/theories?
  ◦ introspection
  ◦ psycholinguistic experiments
  ◦ neurolinguistic experiments
  ◦ questionnaires
  ◦ corpora

linguistic data
• the research question and the theoretical framework determine the kind of data that can be used (the mantra ;-)
• in many cases different kinds of data need to be integrated
• methodology and data issues have been discussed a lot in recent years; see http://www.sfb441.unituebingen.de/index-engl.html, several conferences such as Linguistic Evidence and QITL, Bod, Hay & Jannedy 2003, Kepser & Reis 2005, …

introspection/intuition
• arm-chair linguistics ;-)
• generative tradition, competence model
• research question: which complex expressions (phrases, sentences, words, …) can be produced by the internal grammar of a native speaker of a language?
• grammaticality judgments, yes/no
• discussion: Schütze 1996, Keller 2001, Featherston 2005

psycholinguistic experiments
• research questions: how are linguistic data stored and accessed?
• storage and processing models, interaction with other cognitive tasks
• elicitation tasks, reaction time experiments, eye tracking, errors, …
→ Lynn Frazier's class

neurolinguistic experiments
• research questions: where are linguistic data stored and accessed?
• storage and processing models, interaction with other cognitive tasks
• EEG, ERP, imaging techniques, …
→ Bornkessel & Schlesewsky's class

questionnaires
• different research questions
  ◦ structuralist/generative fieldwork to find grammatical forms
  ◦ judgment tasks (yes/no, magnitude estimation, …)
  ◦ …
• an experimental technique (issues of representativity, filler items etc.), often used to verify specific hypotheses, often used for quantitative studies

corpora
• research questions: see following slides
• collections of texts
  ◦ nowadays mostly electronic, but see history
  ◦ texts usually produced for independent purposes (authenticity); collection (design) and evaluation methods depend on the research question

corpora: research questions
• many areas traditionally use corpora
  ◦ historical linguistics
  ◦ sociolinguistics
  ◦ dialectology
  ◦ lexicography
  ◦ language acquisition research
  ◦ (computational linguistics/natural language processing)
• other areas have only recently started using corpora
  ◦ generative/theoretical linguistics

historical linguistics
• (there is no other kind of data ;-)
• question: what was a variety (at least that portion of the language that survived) like at a specific point in time? Qualitative and quantitative study of a 'synchronic' corpus (many papers …)
• question: how can language change be described/modelled? Comparison of corpus data from different points in time (also short-term/recent change!)
→ Eythorsson's class

historical linguistics: example
• automatic calculation of similarity trees to find out about language relationships between historical varieties of German (Lüdeling 2006)

experiments
• word lists (Hochmuth 2004)
• syntactic similarity using the TIGER format (Brants et al. 2002)
• visualisation of distances using PHYLIP phylogeny software

word lists
• translated to SAMPA
• manually aligned
• grapheme correspondences, SAMPA correspondences
• several string comparison methods (edit distances)
• phylogenetic methods for clustering

small case study: Lord's Prayer
• fader ist usa firio barno, thu bist an them hohen himilo rikie. Old Saxon (AS; 9th c., Heliand)
• fater unser, thu thar bist in himile, Old High German (AHD; 9th c., Tatian)
• Got vater unser, dâ du bist In dem himelrîche gewaltic alles des dir ist, Middle High German (MHD; ca. 1200, Reinmar von Zweter)
• Vnser vater ynn dem hymel. Early New High German (FNHD; 1522, Luther)
• Vater unser, der du bist im Himmel, New High German (NHD)

'intuitive' distance
        AS  AHD  MHD  FNHD  NHD
AS       0    2    3     4    4
AHD      2    0    2     3    3
MHD      3    2    0     2    2
FNHD     4    3    2     0    1
NHD      4    3    2     1    0
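
The word-list experiments rest on edit distances between manually aligned forms. The following is only a minimal sketch of that idea, assuming plain unit-cost Levenshtein distance on lower-cased orthographic forms; the study itself worked on SAMPA transcriptions, which are not reproduced here. The function names, the length normalisation, and the five-word alignment are hypothetical choices made for illustration, with the word forms taken from the Lord's Prayer lines of the case study.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    # previous[j] = distance between the already processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalised(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so values lie in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def variety_distance(pairs) -> float:
    """Mean normalised distance over aligned word pairs (an assumed aggregate)."""
    return sum(normalised(a, b) for a, b in pairs) / len(pairs)


# Hypothetical manual alignment: Old High German (Tatian) vs. New High German.
aligned_pairs = [
    ("fater", "vater"),
    ("unser", "unser"),
    ("thu", "du"),
    ("bist", "bist"),
    ("himile", "himmel"),
]

if __name__ == "__main__":
    for old, new in aligned_pairs:
        print(old, new, levenshtein(old, new))
    print("mean normalised distance:", round(variety_distance(aligned_pairs), 3))
```

Computing such a value for every pair of varieties yields a distance matrix of the same shape as the 'intuitive' one, which can then be passed to a clustering or phylogeny tool.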
hierarchical clustering, feature-weighted Levenshtein distance
[figure]

'intuitive' distances
[figure]

syntactic distance
• transformation of TIGER graphs to trees, crossing edges ignored (order of terminal nodes)
• calculation of distances between corresponding sentences using treediff (Shasha, Wang & Zhang)
• sentence distances combined to text distance: $d(x, y)^2 = \sum_i d(x_i, y_i)^2$

Old High German and Early New High German
[figure: syntax trees with node labels S, NP, PP over the aligned sentences]
FNHD: Vnser vater Ø Ø Ø ynn dem hymel,
AHD:  fater unser, thu thar bist in Ø himile,

'intuitive' distances vs. Lord's Prayer

'intuitive' distances
        AS  AHD  MHD  FNHD  NHD
AS       0    2    3     4    4
AHD      2    0    2     3    3
MHD      3    2    0     2    2
FNHD     4    3    2     0    1
NHD      4    3    2     1    0

Lord's Prayer
           AS    AHD    MHD   FNHD    NHD
AS       0.00  60.58  91.74  62.95  61.26
AHD     60.58   0.00  72.11  35.19  27.03
MHD     91.74  72.11   0.00  72.98  71.81
FNHD    62.95  35.19  72.98   0.00  25.09
NHD     61.26  27.03  71.81  25.09   0.00

phylogenetic trees
[figure: two trees, one for the 'intuitive' distances and one for the Lord's Prayer]

problems
• too little data
• the Lord's Prayer is a translation and formulaic
  ◦ fater unser, thu thar bist in himile, (Old High German)
  ◦ Vater unser, der du bist im Himmel, (New High German)
  vs.
  ◦ Vnser vater ynn dem hymel. (Early New High German)
  ◦ Unser Vater im Himmel

corpus-based methods: limits & chances
• corpora (comparability, availability)
• annotation (linguistic levels, tag sets, guidelines, tools)
• similarity measures
• models (trees, nets, ?)
• base the calculation of language relationships on more than just phonology and morphology
• frequencies can be used

sociolinguistics
• question: what characteristics does a given sociolect/register/… have (with respect to pronunciation, lexis, syntax, …)?
• question: how do sociolects/registers/… differ?

descriptive grammars
• question/task: description and classification of basic elements and combinations in a language (and perhaps determine frequency information)
• corpus-based grammars

historical linguistics / sociolinguistics / descriptive grammars
• we want:
  ◦ qualitative and quantitative description of a language with respect to all linguistic levels
  ◦ perhaps extrapolation of results to a larger body of language (OHG, London Teenager Talk, …)
• we need:
  ◦ reliable corpora with good documentation, reliable annotation with good documentation
  ◦ search and evaluation techniques
  ◦ mathematical models

psycholinguistics
• question/task: find frequency information for words/morphemes/readings etc. in order to design experiments (reaction times are correlated with frequency)
• question/task: find typical contexts for a given word
• we want: large general corpora or corpus-based lexicons with frequency information, good annotation

lexicography
• question/task: find readings/typical contexts/frequencies of a word, find good examples
• question/task: find collocates to a given word
• we want: large general corpora or special corpora (for special lexicons), good annotation and evaluation methods, mathematical models for collocation detection
• WordSketch (Kilgarriff), Evert 2005

linguistic data
• introspection
  ◦ competence: what is grammatical?
  ◦ (formal) production system that produces all (and only the) grammatical expressions of a language
  ◦ qualitative (categorial)
• experimental data
  ◦ how is language processed?
  ◦ model that describes storage of linguistic units and the way these are addressed and processed, neurological models
• corpus data
  ◦ what occurs?
  ◦ model that describes those linguistic units and combinations and their distribution in a corpus
  ◦ qualitative and quantitative (probabilistic)

history of corpus linguistics
• text collections were already used in the 19th century (and earlier) to
  ◦ describe language change
  ◦ illustrate statements about grammar
  ◦ document language acquisition
  ◦ compile dictionaries
  ◦ compare languages

history of corpus linguistics
"Note: Occasionally an s also appears elsewhere at the compound boundary after a feminine noun without having made its way into the written language, cf. e.g. Gemeindsversammlung Hebel 452, 24, Huldszeichen Heine 2, 111, über Naturs Größe Le. 11, 209, 5, Sprachsverbesserer, Leibniz, Unvorgreifl. Ged. 67,3, Vernunftswahrheiten Le. 12, 434, 32. Further attestations of an s added to a feminine genitive are: Erdens-Götter Lohenst., Cleop. 2291 ..." Hermann Paul (1959, vol. V, 13)

history of corpus linguistics
• text collections
  ◦ mostly 'classics', 'high' language
  ◦ for dead languages: everything available
  ◦ (almost) no quantitative studies
  ◦ no 'balance'

structuralism
• synchrony
• spoken language
• American structuralism: Boas, Sapir, Bloomfield, Harris
  ◦ influential at least until the 1960s, terminology and empirical methods still in use
  ◦ corpus-based: corpora often small, systematically collected through elicitation (almost questionnaire studies), no quantitative studies
  ◦ non-European languages (native American languages)

structuralism
• debate: should a grammar describe the collected fragment (the corpus) or is it possible to extrapolate from that to the whole language?
• basic idea: a language is finite; if you collect enough data you could in principle collect everything

generative theory (Chomsky)
• new research goal: people understand and produce infinitely many sentences/complex expressions. It is therefore not interesting to describe a finite sample; the real goal is the description of the underlying production system
• competence vs. performance, i-language vs. e-language

generative theory
• a corpus is a collection of performance data which are influenced by all kinds of non-linguistic factors
  → since it is not possible to abstract away from these extra-linguistic factors, a corpus cannot be used to find a competence model
  → introspection is the only way to distinguish between grammatical and ungrammatical utterances

generative theory
• if there are infinitely many possible utterances, any corpus will be skewed
• "Any natural corpus will be skewed. Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list." (Chomsky 1962, 159)
• solution: introspection?

grammaticality and corpus data
• occurs in a corpus
  ◦ grammatical: immer, Wirtschaftskrise, nach 14 Jahren Kohl, ...
  ◦ ungrammatical: immmer, letzendlich, unkaputtbar, ich habe fertig, ...
• does not occur in a corpus
  ◦ grammatical: NPs with 27 genitive attributes: das Haus der Großmutter der Schwester des Verwalters der ...

Chomsky on corpus linguistics
"It doesn't exist" (Chomsky in an interview, answering a question by Bas Aarts "What do you think of corpus linguistics?", 2001)
"Some sentences won't occur because they are obvious, others because they are false, still others because they are impolite."
(Chomsky 1962, 159)

corpus linguistics after the 1950s
• some linguistic areas always used corpora and continued to do so (dialectology, historical linguistics, …)
• computational linguistics and psycholinguistics are interested in machine-readable corpora because they need frequency data
• theoretical linguistics: corpus-based work marginalized
• almost no discussion of empirical questions/standards

early machine-readable corpora
• Roberto Busa: corpus of medieval philosophy texts (project with IBM 1949-1967), concordancer (this later led to: Thomae Aquinatis Opera Omnia cum hypertextibus in CD-ROM)
• other work on historical texts (Greek Bible and other texts)
• Morton's authorship detection
• Juilland's 'mechanolinguistics'

early machine-readable corpora
"Memories of the early days are all of paper tape. It waved in and out of every machine, it dried and then cracked and split or it got damp when it lay limp and then sagged and stretched. Sometimes it curled round you like a hungry anaconda, at others it lay flat and lifeless and would not wind. Above all it extended to infinity in all directions. A Greek New Testament, half a million characters, ran to a mile of paper tape, and the complete concordance of it ran to seven miles" (Morton 1980, 197).
(cited from http://info.ox.ac.uk/ctitext/history/pioneer.html)

early machine-readable corpora
• English
  ◦ Quirk (1960s): Survey of English Usage
  ◦ Francis & Kucera: Brown Corpus
  ◦ Svartvik (1970s): London-Lund Corpus
  ◦ Leech: Lancaster-Oslo-Bergen Corpus (LOB)
  ◦ Sinclair: COBUILD, Bank of English
• several corpus-linguistic centers in Europe

early machine-readable corpora
• 1 m tokens (Francis/Kucera's frequencies are still standard for many psycholinguistic experiments)
• development of annotation methods
• development of search methods and concordancers
• discussion of corpus design (sampling method)

second generation corpora
• much larger (> 100 m tokens)
• standardization initiatives
• networks
• much more corpus-based research
• the gap between corpus linguists and theoretical linguists is closing

references
• Evert, Stefan & Fitschen, Arne (2001) Textkorpora. In: Carstensen et al. (eds.) Computerlinguistik und Sprachtechnologie. Eine Einführung. Heidelberg: Spektrum Akademischer Verlag, 369-376.
• Featherston, Sam (2005) The Decathlon Model: Design features for an empirical syntax. In: Reis, M. & Kepser, S. (eds.) Linguistic Evidence: Empirical, Theoretical, and Computational Perspectives. Berlin: Mouton de Gruyter.
• Keller, Frank (2001) Experimental Evidence for Constraint Competition in Gapping Constructions. In: Müller, Gereon & Sternefeld, Wolfgang (eds.) Competition in Syntax. Berlin: Mouton de Gruyter, 211-248.
• Leech, Geoffrey (1993) Corpus Annotation Schemes. In: Literary and Linguistic Computing 8(4), 275-281.
• Lüdeling, Anke (2006) XXX
• Manning, Christopher D. & Schütze, Hinrich (1999) Foundations of Statistical Natural Language Processing. Cambridge: MIT Press, chapter 10.
• Schütze, Carson T. (1996) The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology. Chicago: Chicago University Press.

"corpus linguistics is not a branch of linguistics, but the route into linguistics". Michael Hoey, remark at TALC 1998