Lecture 1: Introduction to Corpus Linguistics

Introduction

Electronic texts and text analysis tools have opened up a great number of opportunities for higher education and language service providers, but learning to use these resources continues to be a challenge for both scholars and professionals. Computer-based applications such as translation memories (TMs) and machine translation (MT) systems are increasingly based on corpora, and they have become part of everyday professional life. We will divide the work into a series of practical concepts, providing some tasks after each of them. To do the tasks, students will need access to online facilities.

Quantitative linguistics (QL) is a sub-discipline of general linguistics and, more specifically, of mathematical linguistics. QL deals with language learning, language change, and the use as well as the structure of natural languages. It investigates languages using statistical methods; its most ambitious objective is the formulation of language laws and, ultimately, of a general theory of language in the sense of a set of interrelated language laws. Synergetic linguistics was from its very beginning specifically designed for this purpose. QL is empirically based on the results of language statistics, a field which can be interpreted as the statistics of languages or as the statistics of any linguistic object, and which is not necessarily connected to substantial theoretical ambitions. Corpus linguistics and computational linguistics are other fields which contribute important empirical evidence.

Language laws in quantitative linguistics

In QL, the concept of a law is understood as the class of law hypotheses which have been deduced from theoretical assumptions, are mathematically formulated, are interrelated with other laws in the field, and have been sufficiently and successfully tested on empirical data, i.e. which could not be refuted in spite of much effort to do so.
Köhler writes about QL laws: “Moreover, it can be shown that these properties of linguistic elements and of the relations among them abide by universal laws which can be formulated strictly mathematically in the same way as common in the natural sciences. One has to bear in mind in this context that these laws are of stochastic nature; they are not observed in every single case (this would be neither necessary nor possible); they rather determine the probabilities of the events or proportions under study. It is easy to find counterexamples to each of the above-mentioned examples; nevertheless, these cases do not violate the corresponding laws as variations around the statistical mean are not only admissible but even essential; they are themselves quantitatively exactly determined by the corresponding laws. This situation does not differ from that in the natural sciences, which have since long abandoned the old deterministic and causal views of the world and replaced them by statistical/probabilistic models.”

Some linguistic laws

Quite a number of language laws have been proposed, among them:

Law of diversification: If linguistic categories such as parts of speech or inflectional endings appear in various forms, it can be shown that the frequencies of their occurrence in texts are controlled by laws.

Length (or, more generally, complexity) distributions: The investigation of text or dictionary frequencies of units of any kind with regard to their lengths regularly yields a number of distributions, depending on the kind of unit under study.
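As a concrete illustration of the raw data behind such length distributions, the sketch below counts word lengths in characters (a crude stand-in for the syllable or morph counts QL studies typically use); the function name and sample sentence are purely illustrative:

```python
from collections import Counter

def word_length_distribution(text):
    """Map each word length (in characters) to the number of words
    of that length -- the raw data that length laws are fitted to."""
    return Counter(len(word) for word in text.split())

sample = "quantitative linguistics studies the statistical structure of language"
dist = word_length_distribution(sample)
# dist maps each length to a count, e.g. length 11 occurs twice here
# ("linguistics", "statistical")
```

In a real study one would fit a theoretical distribution (e.g. a Poisson-family model) to counts like these rather than inspect them by eye.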
By now, the following units have been studied:
- the distribution of morph lengths;
- the distribution of the lengths of rhythmical units;
- the distribution of sentence lengths;
- the distribution of syllable lengths;
- the distribution of word lengths.
Other linguistic units which also abide by this law are, e.g., letters (characters) of different complexities, the lengths of the so-called hrebs, and the lengths of speech acts. The same holds for the distributions of sounds (phones) of different durations.

Martin's law: This law concerns lexical chains which are obtained by looking up the definition of a word in a dictionary, then looking up the definition of the definition just obtained, and so on. All these definitions form a hierarchy of more and more general meanings, whereby the number of definitions decreases with increasing generality. Among the levels of this kind of hierarchy, there exist a number of lawful relations.

Menzerath's law (in linguistics in particular, the Menzerath-Altmann law): This law states that the sizes of the constituents of a construction decrease with increasing size of the construction under study. The longer, e.g., a sentence (measured in number of clauses), the shorter the clauses (measured in number of words); or, the longer a word (in syllables or morphs), the shorter the syllables or morphs (in sounds).

Rank-frequency laws: Virtually any kind of linguistic unit abides by these relations. Only a few illustrative examples are given here. The words of a text are arranged according to their text frequency and assigned a rank number together with the corresponding frequency. Since George Kingsley Zipf (of the well-known Zipf's law), a large number of mathematical models of the relation between rank and frequency have been proposed. A similar distribution between rank and frequency can be observed for sounds, phonemes, and letters.
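A rank-frequency table of the kind just described can be computed in a few lines. This is a minimal sketch (the function name and toy sentence are my own, not from any standard library):

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, frequency) triples, most frequent first.
    Zipf-type laws model frequency as a decreasing function of rank."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

sample = "the cat sat on the mat and the dog sat on the rug"
table = rank_frequency(sample)
# table[0] is the most frequent word: (1, 'the', 4)
```

On a text of realistic size, plotting log frequency against log rank from such a table is the standard first check of a Zipf-type model.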
Word associations: rank and frequency of the associations with which subjects react to a (word) stimulus.

Law of language change: Growth processes in language, such as vocabulary growth, the dispersion of foreign or loan words, changes in the inflectional system, etc., abide by a law known in QL as the Piotrowski law, which corresponds to growth models in other scientific disciplines. The Piotrowski law is a case of the so-called logistic model (cf. logistic equation). It has been shown to cover language acquisition processes as well (cf. language acquisition law).

Text block law: Linguistic units (e.g. words, letters, syntactic functions and constructions) show a specific frequency distribution in equally large text blocks.

Zipf's law: The frequency of words is inversely proportional to their rank in frequency lists.

What is a Corpus?

In general language, “corpus” means a collection of texts put together according to some criteria. In corpus linguistics, a corpus is by default assumed to be a collection of texts in electronic format which are processed and analyzed using software. The first electronic text corpus dates back to the 1960s, when the Brown Corpus was created at Brown University (USA). Other corpora followed, such as the Lancaster-Oslo/Bergen (LOB) Corpus (1978). But it was only after the release of the first corpus-based dictionary, COBUILD (1987), that corpora became popular in linguistics. In 1995 the 100-million-word British National Corpus (BNC) was published, and national corpora then appeared in many other Western countries. Corpus linguistics thus informed other fields of linguistic research, reviving machine translation and computational linguistics as well as discourse studies. It has also influenced language pedagogy and translator training, multilingual terminology, tools for professional translators, contrastive linguistics, and descriptive translation studies.
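Frequencies drawn from corpora such as the 100-million-word BNC are usually reported as normalised rates per million words, so that counts from corpora of different sizes can be compared. A minimal sketch of that normalisation (the function name is mine):

```python
def per_million(raw_count, corpus_size):
    """Normalised frequency: occurrences per million running words.
    Makes counts comparable across corpora of different sizes."""
    return raw_count / corpus_size * 1_000_000

# A word occurring 5,000 times in a 100-million-word corpus
# (the size of the BNC) has a rate of 50 per million words.
rate = per_million(5_000, 100_000_000)
```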
A corpus is described using criteria such as medium (written, spoken, or both), date, language, author, and the translation status of the texts. It can be synchronic or diachronic, depending on whether the texts were produced within a fixed span of time or across an extended period. While the Brown Corpus was based on the bibliographic descriptions of a 1960s US college library, the BNC was designed according to more democratic criteria. Translation-driven corpora can be monolingual (containing texts in only one language) or bi-/multilingual. Once the corpus has been compiled, it can be subjected to analysis; this is where concordancing is applied. Figure 1 shows the frequency of a word per million words (source: BNC). Figure 2 is an example of a concordance, where the lines are ordered according to the first three words to the left and the first three words to the right of the keyword. Words are coloured according to their part of speech.
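A concordancer of the kind shown in Figure 2 can be sketched as a simple KWIC (keyword-in-context) routine. This toy version (names and sample text are my own) sorts the lines by the right-hand co-text, one of the orderings such tools offer:

```python
def concordance(text, keyword, context=3):
    """KWIC lines: each occurrence of `keyword` with up to `context`
    words of co-text on either side, sorted by the right co-text."""
    tokens = text.lower().split()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append((right, f"{left} [{token}] {right}"))
    return [line for _, line in sorted(lines)]

sample = "the cat sat on the mat while the dog lay by the door"
for line in concordance(sample, "the"):
    print(line)
```

Real concordancers (and the BNC interface) additionally offer sorting on the left co-text, part-of-speech colouring, and regular-expression queries; this sketch shows only the core KWIC idea.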