Language laws in quantitative linguistics

Lecture 1
Introduction to Corpus Linguistics
Introduction
Electronic texts and text analysis tools have opened up a great number of
opportunities for higher education and language service providers, but learning to
use these resources continues to be a challenge for both scholars and professionals.
Nowadays computer-based applications such as translation memories (TMs) and
machine translation (MT) systems are increasingly based on corpora. They have
become part of everyday professional life.
We will divide the work into different practical concepts, providing some tasks
after each of them. To complete the tasks, students will of course need access to
online facilities.
Quantitative linguistics is a sub-discipline of general linguistics and, more
specifically, of mathematical linguistics. Quantitative linguistics (QL) deals with
language learning, language change, and the application as well as the structure of
natural languages. QL investigates languages using statistical methods; its most
demanding objective is the formulation of language laws and, ultimately, of a
general theory of language in the sense of a set of interrelated language laws.
Synergetic linguistics was from its very beginning specifically designed for this
purpose. QL is empirically based on the results of language statistics, a field which
can be interpreted as statistics of languages or as statistics of any linguistic object.
This field is not necessarily connected to substantial theoretical ambitions. Corpus
linguistics and computational linguistics are other fields which contribute
important empirical evidence.
Language laws in quantitative linguistics
In QL, the concept of law is understood as the class of law hypotheses which have
been deduced from theoretical assumptions, are mathematically formulated, are
interrelated with other laws in the field, and have sufficiently and successfully been
tested on empirical data, i.e. which could not be refuted in spite of much effort to
do so. Köhler writes about QL laws: “Moreover, it can be shown that these
properties of linguistic elements and of the relations among them abide by
universal laws which can be formulated strictly mathematically in the same way as
common in the natural sciences. One has to bear in mind in this context that these
laws are of stochastic nature; they are not observed in every single case (this would
be neither necessary nor possible); they rather determine the probabilities of the
events or proportions under study. It is easy to find counterexamples to each of the
above-mentioned examples; nevertheless, these cases do not violate the
corresponding laws as variations around the statistical mean are not only
admissible but even essential; they are themselves quantitatively exactly
determined by the corresponding laws. This situation does not differ from that in
the natural sciences, which have since long abandoned the old deterministic and
causal views of the world and replaced them by statistical/probabilistic models.”
Some linguistic laws
There exist quite a number of proposed language laws; among them are the following:
Law of diversification: If linguistic categories such as parts of speech or
inflectional endings appear in various forms, it can be shown that the frequencies
of their occurrence in texts are controlled by laws.
Length (or, more generally, complexity) distributions: The investigation of text or
dictionary frequencies of units of any kind with regard to their lengths regularly
yields a number of characteristic distributions, depending on the kind of unit
under study. So far, the following units have been studied:
Law of the distribution of morph lengths;
Law of the distribution of the lengths of rhythmical units;
Law of the distribution of sentence lengths;
Law of the distribution of syllable lengths;
Law of the distribution of word lengths.
Other linguistic units which also abide by this law are, e.g., letters (characters)
of different complexity, the lengths of the so-called hrebs, and the lengths of
speech acts. The same holds for the distributions of sounds (phones) of different
durations.
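As a minimal illustration of such a length study (a sketch only: word length is
measured here in characters, since syllable or morph segmentation is
language-specific, and the function name and sample sentence are made up for the
example), the frequencies of word lengths in a text can be counted as follows:

    from collections import Counter

    def length_distribution(text):
        # Count how many word tokens have each length, here measured in
        # characters; syllable or morph counts would need language-specific
        # segmentation, which is beyond this sketch.
        words = text.lower().split()
        return Counter(len(w) for w in words)

    sample = "the quick brown fox jumps over the lazy dog"
    for length, count in sorted(length_distribution(sample).items()):
        print(length, count)

Fitting one of the theoretical length distributions to such counts is the step
that turns the raw data into a test of the law.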
Martin's law: This law concerns lexical chains which are obtained by looking up
the definition of a word in a dictionary, then looking up the definition of the
definition just obtained, etc. All these definitions form a hierarchy of more
and more general meanings, whereby the number of definitions decreases with
increasing generality. Among the levels of this kind of hierarchy, a number of
lawful relations hold. A toy example of such a chain is sketched below.
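The chain-building procedure itself is simple; the following sketch assumes a toy
dictionary in which each word is already mapped to the more general head noun of
its definition (real work would have to parse full dictionary entries; all names
and data here are invented for illustration):

    def definition_chain(word, dictionary, max_steps=10):
        # Follow a word to the more general term that heads its dictionary
        # definition, then repeat, as in Martin's law. Stops on unknown
        # words, circular definitions, or after max_steps levels.
        chain = [word]
        while word in dictionary and len(chain) <= max_steps:
            word = dictionary[word]
            if word in chain:
                break
            chain.append(word)
        return chain

    # Hypothetical toy data: each word mapped to the head of its definition.
    toy_dictionary = {
        "spaniel": "dog",
        "dog": "animal",
        "animal": "organism",
        "organism": "entity",
    }
    print(definition_chain("spaniel", toy_dictionary))
    # -> ['spaniel', 'dog', 'animal', 'organism', 'entity']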
Menzerath's law (also, in particular in linguistics, the Menzerath-Altmann law):
This law states that the sizes of the constituents of a construction decrease with
increasing size of the construction under study: the longer, e.g., a sentence
(measured in terms of the number of clauses), the shorter the clauses (measured in
terms of the number of words); or the longer a word (in syllables or morphs), the
shorter its syllables or morphs (in sounds).
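In the form usually cited in the QL literature (the parameter letters are
conventional and vary between presentations), the law relates the mean
constituent size y to the construct size x as

    \[
      y = A \, x^{b} \, e^{-cx}
    \]

where A, b, and c are parameters fitted to the data; for a word of x syllables,
for instance, y would be the mean syllable length in sounds.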
Rank-frequency laws: Virtually any kind of linguistic unit abides by these
relations. We will give here only a few illustrative examples:
The words of a text are arranged according to their text frequency, and each is
assigned a rank number together with the corresponding frequency. Since George
Kingsley Zipf (of the well-known “Zipf’s Law”) first studied this relation, a
large number of mathematical models of the relation between rank and frequency
have been proposed.
A similar distribution between rank and frequency can be observed for sounds,
phonemes, and letters.
Word associations: the rank and frequency of the associations with which subjects
react to a (word) stimulus. A sketch of the basic rank-frequency computation
follows.
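Computing a rank-frequency table is straightforward; the following sketch
(illustrative names and sample text, with tokenization by simple whitespace
splitting) ranks the word types of a text by descending frequency:

    from collections import Counter

    def rank_frequency(text):
        # Arrange the word types of a text by descending frequency and
        # assign each a rank number, the raw material for Zipf-style plots.
        counts = Counter(text.lower().split())
        return [(rank, word, freq)
                for rank, (word, freq)
                in enumerate(counts.most_common(), start=1)]

    sample = "to be or not to be that is the question"
    for rank, word, freq in rank_frequency(sample):
        print(rank, word, freq)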
Law of language change: Growth processes in language, such as vocabulary
growth, the dispersion of foreign or loan words, changes in the inflectional
system, etc., abide by a law known in QL as the Piotrowski law, which corresponds
to growth models in other scientific disciplines. The Piotrowski law is a case of
the so-called logistic model (cf. logistic equation). It has been shown that it
also covers language acquisition processes (cf. language acquisition law).
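In the logistic form usually given for the Piotrowski law (again, the parameter
letters are conventional and vary between presentations), the share p(t) of the
new or spreading form at time t is

    \[
      p(t) = \frac{c}{1 + a \, e^{-bt}}
    \]

where c is the saturation level (c = 1 when the change runs to completion) and a
and b are parameters fitted to the observed data.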
Text block law: Linguistic units (e.g. words, letters, syntactic functions and
constructions) show a specific frequency distribution in equally large text blocks.
Zipf's law: The frequency of words is inversely proportional to their rank in
frequency lists.
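In its simplest mathematical form, this relation is written as

    \[
      f(r) = \frac{C}{r^{a}}
    \]

where f(r) is the frequency of the word of rank r, C is a constant depending on
the text, and the exponent a is close to 1 in Zipf's classical formulation.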
What is a Corpus?
In general language, “corpus” means a collection of texts put together according to
some criteria. In corpus linguistics, a corpus is by default assumed to be a
collection of texts in electronic format which are processed and analyzed using
software.
The first electronic text corpus dates back to the 1960s, when the Brown Corpus
was created at Brown University (USA). Other corpora followed, such as the
Lancaster-Oslo/Bergen (LOB) Corpus in 1978. But it was only after the release of
the first corpus-based dictionary, COBUILD, in 1987 that corpora became popular
in linguistics.
In 1995 the 100-million-word British National Corpus (BNC) was published, and
national corpora appeared in many other Western countries. Corpus linguistics thus
informed other fields of linguistic research, reviving machine translation and
computational linguistics as well as discourse studies. It has also influenced
language pedagogy and translator training, multilingual terminology, tools for
professional translators, contrastive linguistics, and descriptive translation studies.
A corpus is described using criteria such as medium (written, spoken, or both),
date, language, author, and the translation status of the texts. It can be
synchronic or diachronic, depending on whether the texts were produced within a
fixed span of time or over a longer period.
While the Brown Corpus was based on the bibliographic descriptions of a 1960s US
college library, the BNC was designed according to more democratic criteria.
Translation-driven corpora can be monolingual (containing only texts in one
language) or bi-/multilingual. Once the corpus has been compiled, it can be
subjected to analysis. This is where concordancing comes in; a minimal sketch of
what a concordancer does follows.
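As a rough sketch (the function, its parameters, and the sample sentence are
invented for illustration; real concordancers, such as the tool behind Figure 2,
also sort and colour the context words), a keyword-in-context (KWIC) listing can
be produced like this:

    def concordance(tokens, node, context=3):
        # Print simple KWIC (key word in context) lines: every occurrence
        # of the node word with `context` words of co-text on each side.
        for i, token in enumerate(tokens):
            if token.lower() == node.lower():
                left = " ".join(tokens[max(0, i - context):i])
                right = " ".join(tokens[i + 1:i + 1 + context])
                print(f"{left:>30}  {token}  {right}")

    text = ("a corpus is a collection of texts and corpus linguistics "
            "analyses a corpus with software")
    concordance(text.split(), "corpus")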
The table in Figure 1 shows the frequency of a word per million words (source: BNC).
Figure 1
Figure 2
Figure 2 is an example of a concordance, where the lines are ordered according to
the first three words to the left and the first three words to the right of the
search word. Words are coloured according to their part of speech.