
Corpus Methods for Semantics
Human Cognitive Processing (HCP)
Cognitive Foundations of Language Structure and Use
This book series is a forum for interdisciplinary research on the grammatical structure, semantic organization, and communicative function of language(s), and their
anchoring in human cognitive faculties.
For an overview of all books published in this series, please see
http://benjamins.com/catalog/hcp
Editors
Klaus-Uwe Panther
Nanjing Normal University
& University of Hamburg
Linda L. Thornburg
Nanjing Normal University
Editorial Board
Bogusław Bierwiaczonek
Jan Dlugosz University, Czestochowa, Poland /
Higher School of Labour Safety Management,
Katowice
Mario Brdar
Josip Juraj Strossmayer University, Croatia
Barbara Dancygier
University of British Columbia
N.J. Enfield
Max Planck Institute for Psycholinguistics,
Nijmegen & Radboud University Nijmegen
Elisabeth Engberg-Pedersen
University of Copenhagen
Ad Foolen
Radboud University Nijmegen
Raymond W. Gibbs, Jr.
University of California at Santa Cruz
Rachel Giora
Tel Aviv University
Elżbieta Górska
University of Warsaw
Martin Hilpert
University of Neuchâtel
Zoltán Kövecses
Eötvös Loránd University, Hungary
Teenie Matlock
University of California at Merced
Carita Paradis
Lund University
Günter Radden
University of Hamburg
Francisco José Ruiz de Mendoza Ibáñez
University of La Rioja
Doris Schönefeld
University of Leipzig
Debra Ziegeler
University of Paris III
Volume 43
Corpus Methods for Semantics. Quantitative studies in polysemy and synonymy
Edited by Dylan Glynn and Justyna A. Robinson
Corpus Methods for Semantics
Quantitative studies in polysemy and synonymy
Edited by
Dylan Glynn
University of Paris VIII
Justyna A. Robinson
University of Sussex
John Benjamins Publishing Company
Amsterdam / Philadelphia
The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Library of Congress Cataloging-in-Publication Data
Corpus Methods for Semantics : Quantitative studies in polysemy and synonymy / Edited
by Dylan Glynn and Justyna A. Robinson.
p. cm. (Human Cognitive Processing, ISSN 1387-6724; v. 43)
Includes bibliographical references and index.
1. Semantics. 2. Cognitive grammar. 3. Computational linguistics. 4. Polysemy.
5. Corpora (Linguistics) I. Glynn, Dylan. II. Robinson, Justyna A.
P325.C595 2014
401’.43--dc23
2014004752

ISBN 978 90 272 2397 5 (Hb; alk. paper)
ISBN 978 90 272 7033 7 (Eb)
© 2014 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any
other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 me Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia pa 19118-0519 · usa
Table of contents

Contributors
Outline

Section 1. Polysemy and synonymy

Polysemy and synonymy: Cognitive theory and corpus method
Dylan Glynn

Competing ‘transfer’ constructions in Dutch: The case of ont-verbs
Martine Delorge, Koen Plevoets, and Timothy Colleman

Rethinking constructional polysemy: The case of the English conative construction
Florent Perek

Quantifying polysemy in Cognitive Sociolinguistics
Justyna A. Robinson

The many uses of run: Corpus methods and Socio-Cognitive Semantics
Dylan Glynn

Visualizing distances in a set of near-synonyms: Rather, quite, fairly, and pretty
Guillaume Desagulier

A case for the multifactorial assessment of learner language: The uses of may and can in French-English interlanguage
Sandra C. Deshors and Stefan Th. Gries

Dutch causative constructions: Quantification of meaning and meaning of quantification
Natalia Levshina, Dirk Geeraerts, and Dirk Speelman

The semasiological structure of Polish myśleć ‘to think’: A study in verb-prefix semantics
Małgorzata Fabiszak, Anna Hebda, Iwona Kokorniak, and Karolina Krawczak

A multifactorial corpus analysis of grammatical synonymy: The Estonian adessive and adposition peal ‘on’
Jane Klavan

A diachronic corpus-based multivariate analysis of “I think that” vs. “I think zero”
Christopher Shank, Koen Plevoets, and Hubert Cuyckens

Section 2. Statistical techniques

Techniques and tools: Corpus methods and statistics for semantics
Dylan Glynn

Statistics in R: First steps
Joost van de Weijer and Dylan Glynn

Frequency tables: Tests, effect sizes, and explorations
Stefan Th. Gries

Collostructional analysis: Measuring associations between constructions and lexical elements
Martin Hilpert

Cluster analysis: Finding structure in linguistic data
Dagmar Divjak and Nick Fieller

Correspondence analysis: Exploring data and identifying patterns
Dylan Glynn

Logistic regression: A confirmatory technique for comparisons in corpus linguistics
Dirk Speelman

Name index
Subject index
Contributors
Timothy Colleman
Ghent University
[email protected]
Dylan Glynn
University of Paris VIII
[email protected]
Hubert Cuyckens
University of Leuven
[email protected]
Stefan Th. Gries
University of California, Santa Barbara
[email protected]
Sandra Deshors
New Mexico State University
[email protected]
Anna Hebda
Adam Mickiewicz University, Poznań
[email protected]
Martine Delorge
Ghent University
[email protected]
Martin Hilpert
University of Neuchâtel
[email protected]
Guillaume Desagulier
Université Paris 8
Vincennes-Saint-Denis
Université Paris Ouest
Nanterre La Défense
UMR 7114 MoDyCo
[email protected]
Jane Klavan
University of Tartu
[email protected]
Dagmar Divjak
University of Sheffield
[email protected]
Karolina Krawczak
Adam Mickiewicz University, Poznań
[email protected]
Małgorzata Fabiszak
Adam Mickiewicz University, Poznań
[email protected]
Natalia Levshina
Université catholique de Louvain
[email protected]
Nick Fieller
University of Sheffield
[email protected]
Florent Perek
University of Freiburg
[email protected]
Dirk Geeraerts
University of Leuven
[email protected]
Koen Plevoets
Ghent University
[email protected]
Iwona Kokorniak
Adam Mickiewicz University, Poznań
[email protected]
Justyna A. Robinson
University of Sussex
[email protected]
Dirk Speelman
University of Leuven
[email protected]
Christopher Shank
Bangor University
[email protected]
Joost van de Weijer
Lund University
[email protected]
Outline
1. Aim of the volume
It could be argued that Cognitive Linguistics is undergoing a paradigm shift. Originally, the field sought to show the inadequacies of earlier models of language and the
theories of linguistic structure based upon them. Today, the emphasis has changed
to testing the various theories about how language works (Geeraerts 2006; Gries and
Stefanowitsch 2006; Stefanowitsch and Gries 2006; Gonzalez-Marquez et al. 2008;
Glynn and Fischer 2010). This has brought analytical methods, based on observable
and quantifiable data, to the fore. In the light of these developments, this volume systematises, reviews, and promotes a range of research techniques and theoretical perspectives that currently inform work across the field of linguistics, with a particular
focus on Cognitive Semantics. More precisely, the aim of this book is twofold:
i. Didactic: To broaden the understanding and application of the state-of-the-art corpus linguistic techniques for the study of conceptual structure in Cognitive Semantics.
ii. Scientific: To advance the state-of-the-art of those techniques through a collection of studies applied to the description of the conceptual structures of polysemy and synonymy.
This publication grew out of the belief that there exists a strong desire in the research
community to understand and learn how quantitative corpus methods work and how
to apply them to research questions that are basic to the cognitive project. Instead of a rift
between linguists using corpus data and those using traditional introspective analysis,
constructive communication between the methodologies should be encouraged. Both
the descriptive research and the explanations of the statistical techniques included in
this book seek to promote such communication. The chapters that describe the statistical techniques are written to help linguists using traditional methods understand both how these new methods work and how to apply them. The research chapters, in turn,
showcase the methods described. Their aim is not only to advance corpus-driven quantitative research in Cognitive Semantics, but also to promote the possibilities that these
methodologies offer. Observational data and quantitative corpus-driven methods
cannot inform all research questions. However, it is hoped that this volume will advance the current state-of-the-art in their use as well as promote their application in
the broader linguistic research community.
2. Structure and summary
The book divides into two sections. The first section consists of eleven chapters, arranged according to their object of study. These chapters begin with an overview of
the field in “Polysemy and synonymy: Cognitive theory and corpus method” (Glynn).
This chapter includes the analytical justification for approaching both lexis and morpho-syntax in terms of polysemy and synonymy as well as a justification of extending
the traditional uses of the terms to cover any variation or similarity in use. The analytical chapters begin with morpho-syntactic polysemy, move to lexical polysemy, then
on to lexical synonymy, and finally turn to morpho-syntactic synonymy.
Beginning with research on the polysemy of morpho-syntactic semantics, the first
descriptive chapter, “Competing ‘transfer’ constructions in Dutch” (Delorge, Plevoets,
and Colleman), considers the polysemy of a morpheme-based construction. The ont-prefix in Dutch combines with a range of verbs to express dispossession. Using correspondence analysis, the study seeks to capture the interplay of lexical semantics and morpho-syntax associated with the construction. The next chapter, “Rethinking constructional polysemy” (Perek), also examines a grammatical construction. The syntactically encoded conative construction in English combines with a range of lexemes.
Through the application of collostructional analysis, the author attempts the task of
teasing out and identifying the semantic variation associated with the construction.
Turning to lexical semantics, “Quantifying polysemy in cognitive sociolinguistics” (Robinson) examines the usage of polysemous adjectives in a community of
speakers from South Yorkshire, UK. The study applies cluster analysis, logistic regression, and decision tree analysis in order to examine the extent to which individual conceptualisations are non-random and can be related to the socio-demographic
characteristics of the speaker. “The many uses of run” (Glynn) is a re-analysis of Gries’ (2006) study. Employing a combination of cluster analysis, correspondence analysis, and logistic regression, it confirms Gries’ findings but argues that sociolinguistic dimensions should be included in the study of polysemy.
Remaining with lexical semantics, but focusing on near-synonymy, “Visualizing distances in a set of near-synonyms” (Desagulier) examines rather, quite, fairly,
and pretty in English, combining collostructional analysis and multivariate statistics
such as correspondence analysis and cluster analysis. “The uses of may and can in
French-English interlanguage” (Deshors and Gries) treats a lexical alternation in first
language and second language use. With the use of cluster analysis and logistic regression, the authors seek not only to identify the relationship in use between may and can, but also to compare this with the English of native French speakers.
Moving towards morpho-syntactic semantics, “Dutch causative constructions”
(Levshina, Geeraerts, and Speelman) examines a lexeme-based grammatical construction alternation. Focusing on the expression of causation in Dutch, the study employs
logistic regression analysis to determine both the semantic and extralinguistic factors
that determine the choice and difference in conceptualisation between the two constructions. The next study, “The semasiological structure of Polish myśleć ‘to think’”
(Fabiszak, Hebda, Kokorniak, and Krawczak) continues to move from lexical to syntactic semantics with an analysis of the near-synonymy of a set of prefix-verb combinations. The study combines introspective methods and usage-feature analysis, examined
with cluster analysis, correspondence analysis and logistic regression.
“A multifactorial corpus analysis of grammatical synonymy” (Klavan) studies a lexical-morphological alternation. Employing logistic regression, the study attempts to
determine the conceptual differences that motivate speakers’ choice of a preposition
over a grammatical case to express the spatial relation of on in Estonian. “A diachronic
corpus-based multivariate analysis of ‘I think that’ vs. ‘I think zero’” (Shank, Plevoets,
and Cuyckens) is an analysis of a well-known complementiser alternation in English.
Logistic regression is used to test a wide range of language factors proposed in the
literature to motivate the omission of the complementiser.
The second section consists of seven chapters that explain some of the tools and
methods for quantitative corpus-driven research. It is designed to introduce the application and interpretation of the statistical methods used in the first section for
researchers completely new to the field. It also serves as a ‘cookbook’, or is a quick reference, for intermediate users of the statistical techniques and the programming environment R. The first chapter, “Techniques and tools: Corpus methods and statistics
for semantics” (Glynn), is an overview of the field. It examines two corpus methods
that are commonly used in Cognitive Semantics and summarises many of the statistical
techniques currently used in the field.
The second chapter introduces the statistical environment of R, used throughout
the book (van de Weijer and Glynn). Readers with no experience in R will find this
chapter useful when applying the techniques described in the previous chapters to
data analysis. The following chapter, “Frequency tables: Tests, effect sizes, and explorations” (Gries), covers many of the essential and basic analytical concepts and how
to apply them in R. Building on the statistical basics, the next chapter, “Collostructional analysis: Measuring associations between constructions and lexical elements”
(Hilpert), explains the application of collostructional analysis, one of the most popular quantitative techniques in Cognitive Linguistics. This family of techniques is used to quantify the degree of association between linguistic forms.
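Collostructional methods are typically computed from a 2×2 frequency table that cross-tabulates a lexeme against a construction in a corpus. As a rough illustration of the kind of association score involved, the sketch below computes the log-likelihood ratio (G²), one common measure of this type, in Python; the volume's own tutorials use R, and the counts here are invented for the example.

```python
import math

def g_squared(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table.

    o11: lexeme L in construction C       o12: other lexemes in C
    o21: lexeme L in other constructions  o22: other lexemes elsewhere
    """
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    total = 0.0
    for observed, expected in [
        (o11, row1 * col1 / n),
        (o12, row1 * col2 / n),
        (o21, row2 * col1 / n),
        (o22, row2 * col2 / n),
    ]:
        if observed > 0:  # 0 * log(0) is treated as 0
            total += observed * math.log(observed / expected)
    return 2 * total

# Hypothetical counts: a verb occurring 10 times in a construction
# that occurs 50 times overall, in a 1,000-token sample.
score = g_squared(10, 40, 30, 920)
print(round(score, 2))  # the higher the G2, the stronger the association
```

In actual collostructional work the score is computed for every lexeme attested in the construction and the lexemes are then ranked by attraction; Fisher's exact test is another measure frequently used for the same table.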
The next three chapters each consider a different multivariate statistical technique. The three techniques in question have proven popular in recent Cognitive Linguistic research. The chapter “Cluster analysis: Finding structure in linguistic data”
(Divjak and Fieller), focuses on a method for sorting a given set of phenomena, such
as lexemes, constructions, or senses, into categories of similar and dissimilar, relative
to some other set(s) of linguistic phenomena such as meanings, argument types, case
marking, and so forth. This is followed by the chapter “Correspondence analysis: Exploring data and identifying patterns” (Glynn), which considers a technique similar
to cluster analysis, but one that looks for correlations between different phenomena
rather than categorising them. It is useful for identifying structure in complex multidimensional data. The third chapter on multivariate techniques, “Logistic regression:
A confirmatory technique for variant comparison in corpus linguistics” (Speelman),
considers an advanced form of statistical modelling. Logistic regression is a powerful
and popular tool in the social sciences, including Cognitive Linguistics. As a confirmatory technique, regression analysis represents a level of statistical analysis that is
more complex than the previous techniques covered. This chapter charts the basics of
its application, interpretation and verification.
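At its simplest, such a regression model estimates the probability of one variant over another as a function of usage features. The following self-contained Python sketch fits a binary logistic regression by stochastic gradient descent on invented data, predicting a speaker's choice between two hypothetical near-synonymous variants from a single binary feature (say, subject animacy); the volume's tutorials themselves work in R, and a real study would use an established implementation with many more predictors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit intercept and one coefficient by stochastic gradient descent."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w0 + w1 * x)   # predicted probability of variant B
            w0 += lr * (y - p)         # gradient step on the log-likelihood
            w1 += lr * (y - p) * x
    return w0, w1

# Invented usage data: feature = 1 (animate subject) mostly co-occurs
# with variant B (coded 1); feature = 0 mostly with variant A (coded 0).
xs = [1] * 10 + [0] * 10
ys = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

w0, w1 = fit_logistic(xs, ys)
p_animate = sigmoid(w0 + w1)   # fitted P(variant B | animate), near 8/10
p_inanimate = sigmoid(w0)      # fitted P(variant B | inanimate), near 2/10
```

The fitted coefficient w1 is the log odds ratio associated with the feature, which is the quantity a regression table reports and the analyst interprets.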
Where the first section represents a broad, yet coherent, picture of the cutting
edge in the application of these techniques, the second section seeks to offer an introduction to the different statistical techniques employed in corpus-driven semantics.
Focusing on the study of polysemy and synonymy of both lexical and morpho-syntactic
forms, these empirical analyses represent the vanguard of corpus-driven Cognitive
Linguistics.
The editors
References
Geeraerts, D. (2006). Methodology in Cognitive Linguistics. In G. Kristiansen, M. Achard,
R. Dirven, & F. J. Ruiz de Mendoza Ibáñez (Eds.), Cognitive Linguistics: Current applications and future perspectives (pp. 21–50). Berlin & New York: Mouton de Gruyter.
Glynn, D., & Fischer, K. (Eds.). (2010). Quantitative Cognitive Semantics: Corpus-driven approaches. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226423
Gonzalez-Marquez, M., Mittelberg, I., Coulson, S., & Spivey, M. (Eds.). (2008). Methods in Cognitive Linguistics. Amsterdam & Philadelphia: John Benjamins.
Gries, St. Th. (2006). Corpus-based methods and Cognitive Semantics: The many senses of
to run. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110197709
Gries, St. Th., & Stefanowitsch, A. (Eds.). (2006). Corpora in Cognitive Linguistics: Corpus-based
approaches to syntax and lexis. Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110197709
Stefanowitsch, A., & Gries, St. Th. (Eds.). (2006). Corpus-based approaches to metaphor and
metonymy. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110199895
Section 1
Polysemy and synonymy
Polysemy and synonymy
Cognitive theory and corpus method
Dylan Glynn
University of Paris VIII
This chapter introduces the field of polysemy and synonymy studies from a
Cognitive Linguistic perspective. Firstly, the discussion explains and defines the
object of research, showing that the study of semantic relations, traditionally restricted to the description of lexical semantics, needs to be extended to include
all formal structures, including morpho-syntax. Secondly, given the theoretical
assumptions of Cognitive Linguistics, it is argued that quantitative corpus-driven methods are essential for the description of semantic structures. Lastly, the
chapter charts the development of Cognitive Semantic research in polysemy
and synonymy and demonstrates how the current corpus-driven research in the
field is inherently linked to the traditions of radial network analysis and prototype semantics. It is argued that instead of an empirical revolution (as has been
suggested in recent commentaries), the current trends in the use of observational data are a natural extension of the Cognitive Semantic research tradition.
Keywords: Cognitive Linguistics, corpus linguistics, polysemy, prototype
semantics, quantification, radial network analysis, synonymy
1. Introduction: Theory and method
The idea of ‘corpus semantics’, just like the possibility of ‘quantifying meaning’, is not
self-evident. This introduction to the field of corpus-driven Cognitive Semantics
attempts to explain how semantic analysis can, and indeed should, turn to corpus
methods. It also explains why quantitative techniques are needed in this endeavour.
Assuming the Usage-Based Model (Hopper 1987; Langacker 1987), how can we
identify and explain the semantic structuring of language empirically? Post-Generativist and post-Structuralist approaches to language avoid positing, a priori, analytical constructs to explain the structuring of language, rather treating it holistically
as a dynamic and varied result of use. However, without a structurally independent
langue or an ‘ideal’ speaker’s competence, against what predictive model can we test
our hypotheses or attempt to falsify our claims about language structure? Without
constructs such as langue and ideal competence, linguistic research, whether Functional or Cognitive, must adopt an inductive, sample-based methodology. To these
ends, experimental techniques for the analysis of semantics have been developed, yet
corpus methods remain poorly represented within the field.
Moreover, the theory of Cognitive Linguistics recognises no internal language
modules, such as syntax, lexis, semantics or pragmatics. From a non-modular perspective, the study of meaning must account for the integration of all these components of
language structure and do so simultaneously in a functionally and conceptually plausible manner. Corpus-driven methods, and multivariate statistics more specifically,
are perfectly suited for such a task.
A usage-based semantics, therefore, must take two fundamental steps. Firstly, it
must adopt inductive research methods. Whether elicited through experimentation,
extracted from electronic corpora, or collated from questionnaires and field research,
generalisations based on samples of data present the only possibility for hypothesis
testing. Importantly, acknowledging this fact entails validating sample-based results
through statistical confirmation. Secondly, it must develop corpus-driven semantic
analysis. If we are to account holistically for the integrated complexity of the various
dimensions of language structure, it is essential that we examine natural contextualised language production. Samples of natural language large enough to permit inductively valid claims are what we term corpora. Here again, statistics comes to the
fore, though for a different reason. If we are to identify structure, sensitive to its usage
context, multivariate statistics is a powerful, if not essential, tool due to the sheer
complexity of the data.
The aim of this book is both to introduce quantitative corpus-driven semantic methodology to the broader research community and to advance the state-of-the-art. The methods in focus are those that are especially applicable to the lexical and
constructional semantic relations of similarity and difference. Linguistic forms are
used in different ways, and capturing this semasiological variation is what we term
the study of polysemy. Likewise, speakers choose between different linguistic forms to
express similar concepts. Explaining this onomasiological variation is what we term
the study of synonymy.
The reader should be aware that the term polysemy is not restricted to ‘true’ polysemy, where distinct referents are indicated by a single form. Instead, meaning is
understood from a usage-based perspective, where any systematic variation in use
represents semasiological structure. In the same vein, synonymy is not restricted to
absolute similarity since, from a Cognitive Linguistic perspective, one assumes that
any variation in form is motivated by some variation in use and that ‘true’ synonymy
is rare, if it exists at all. Lastly, it must be added that the term semantics is used to
indicate encyclopaedic semantics and pragmatics, as opposed to linguistic semantics
in its narrow sense.
Moreover, the term corpus methodology should be understood to indicate research that is corpus-driven, as opposed to corpus-exemplified (q.v. Tummers et al.
2005). Where corpus-exemplified research identifies occurrences to explain or support a theory of language structure, corpus-driven research examines large samples of
natural language in order to test theories about language structure (often previously
proposed a priori in corpus-exemplified research). In this, we focus on quantitative
techniques, and, more specifically, statistical methods for the exploration of data and
the falsification or confirmation of quantitatively testable hypotheses.
We begin by developing operational definitions of polysemy and synonymy (Section 2). The discussion then demonstrates, given the above definitions, why we need
corpus methodology (Section 3). Finally, we consider the argument that the quantitative corpus tradition and the prototype-based Cognitive Semantic tradition are not
only analytically compatible, but, in fact, are inherently entwined. It is argued that the
introspection-based research in prototype structuring of linguistic categories and the
results of the radial network tradition should be understood as the theoretical modelling of semantic structure, an essential first step in empirical research (Section 4).
2. Polysemy and synonymy: Definition, object and operationalisation
Given the linguistic heritage of the 20th century, it falls to us to begin with two simple questions. Firstly, what exactly constitutes polysemous and synonymous semantic
relations in usage-based semantics? Secondly, is it theoretically and analytically possible to speak of the polysemy and synonymy of a syntactic construction? This section
answers each question in turn. In doing so, the discussion will offer operational definitions of the concepts ‘polysemy’ and ‘synonymy’ and will justify extending the study
of such relations to morpho-syntax.
With meaning defined as encyclopaedic and with all form being semantically
motivated, how do we understand the notions of polysemy (difference in sense of
a form) and synonymy (similarity in sense between forms)? Given a traditional understanding of the terms, usage-based approaches, such as Cognitive Linguistics or
Functional Linguistics, are not interested in polysemy or synonymy per se. Much of
what cognitivists and functionalists examine would be called ‘semantic vagueness’,
‘word fields’ or ‘syntactic alternation’ in other theoretical paradigms. Moreover, for
Structuralism, which holds a distinction between the langue and its use, our position,
that the meaning of a word can be operationalised as its use, is nonsense. For Generativism, which argues for an autonomous formal system, examining the semantic
structure of syntactic patterns is equally nonsense.
In Structuralist terms, polysemy was identified using the definitional test and/
or the ambiguity test. These tests were designed to distinguish polysemous relations
from vague relations, on the one hand, and from a monosemic form, on the other (for
further discussion, see Geeraerts 1993a). This modular understanding of semantic
structure assumes two theoretical constructs – firstly, the notion of truth conditional
semantics and, secondly, the notion of semantic categories determined by necessary
and sufficient conditions.
The same assumptions determined the definition of synonymy – any two lexemes
were considered synonymous if replacing one lexeme with the other did not change
the ‘truth semantic’ meaning of the phrase (Lyons 1968: 428). Cognitive Linguistics
categorically refutes both assumptions.1 Indeed, cognitive-functional approaches to
meaning do not recognise the notion of necessary and sufficient conditions nor any
strict division between linguistic semantics and context pragmatics. Without such
assumptions, the meaning of the terms polysemy and synonymy can be loosened and
defined as:
Polysemy – different concepts-functions of a form
Synonymy – different forms for a concept-function
Note that these definitions of polysemy and synonymy do not exclude taxonomic relations such as hyponymy and hyperonymy (for polysemy) or basic level, superordinate
and subordinate relations (for synonymy). The definitions need to include these relations because, in a given situation-context, a given form may be used at different levels
of specificity (‘vertical’ polysemy), just as a choice is made between different words
signifying different levels of specificity (‘vertical’ synonymy). Such a broad understanding of semantic relations could be more accurately described as semasiological
and onomasiological variation (Geeraerts 1993b; Grondelaers and Geeraerts 2003).
Nevertheless, we will continue with the terms polysemy and synonymy since they
enjoy wider currency.
1. See Geeraerts (1987, 1993a, 1994) and Tuggy (1993) for a more detailed explanation of
the theoretical questions at stake here. For a summary of the questions that led to the original debates on truth conditional semantics and necessary and sufficient conditions, see
Verschueren (1981) and Lakoff (1982). Footnote 10 lists the principal works that established
Cognitive Linguistics’ position on these questions. The contemporary Anglo-Saxon Structuralist position is summarised in Cruse (2000) and Murphy (2003). The aforementioned debates
largely ignored the contemporary French, German and Russian traditions. Examples of current
French Structuralist research on semantic relations include Rastier’s (1987, 1991, 2011) context
sensitive approach and Victorri and Fuchs’s (1996) construal sensitive framework. The German
Structuralist understanding of semantic relations lies close to the Anglo-Saxon tradition (q.v.
Coșeriu 1980; Kastovsky 1982; Lutzeier 1985; and Lipka 1992). The Leibnizian tradition of
semantic primitives, important to the Moscow school of semantics, is represented by Apresjan
(1974, 2000), Wierzbicka (1985, 1996), and Mel’čuk (1989). From a Leibnizian functional perspective, Wierzbicka (1989, 1990) offers responses to issues raised in Verschueren (1981) and
Lakoff (1982). Geeraerts (1999b) offers, in turn, responses to Wierzbicka.
Can we now justify applying the notions of polysemy (semasiological variation
of a single form) and synonymy (onomasiological variation between more than one
form) to the study of grammar? If we assume that all form is conceptually or functionally motivated and if we agree that, in the study of polysemy and synonymy, we
are effectively studying variation in concept-function relative to form and variation in
form relative to concept-function, then this must necessarily be extended to non-lexical meaning and form. Therefore, in blunt terms, the study of polysemy and synonymy includes the study of schematic forms and their meanings such as those typical
of syntax and prosody just as much as it does the study of words and morphemes. We
can, therefore, identify our object of study in more precise terms:
Polysemy – the functional-conceptual variation of any symbolic form
Synonymy – the functional-conceptual relation between any symbolic forms2
This is a simple statement about an object of study that many linguists would take as
obvious, yet others take as ludicrous. The division is a result of the fact that even if
we accept that all formal structure is motivated, there certainly exist different types of
form just as there exist different types of meaning. Phonological, gestural, syntactic,
morphological and lexical forms all tend to possess different characteristics. There is
no doubt in this – the referential meaning of a lexeme like chair is far from the intersubjective meaning of a request implicature, and these two ‘types’ of meaning differ
from the abstract relational meaning of the Transitive Construction. Notwithstanding the belief that such linguistic devices are inextricably interwoven structurally, the
characteristics of these different types of form and meaning, just as the tools needed
to describe them, differ profoundly.
If we accept that, analytically (as opposed to theoretically), such differences exist,
then we can distinguish different lines of research. Given this, it is still possible to
speak of lexical research or syntactic research as long as this is seen as an analytical
emphasis, not an object of study. In order to avoid conjuring the theoretical modules
of earlier theories of language, we can term the different lines of research schematic and concrete. Such distinctions are abstract enough to avoid leading us into the trap of modularising language structures, while also being transparent enough to help us easily differentiate between lines of research and the tools they necessitate.

2. This definition would also cover antonymy. Although it may seem a little far-fetched to consider antonymy as a synonymy relation, this is merely a result of the terminology. Ideally, onomasiological structure would be a better term than synonymy, but the term is not established in the Anglo-Saxon tradition and would, therefore, add considerable terminological weight to the discussion. Nevertheless, it must be noted that antonymy should not be seen as an antonym (in a non-technical sense) of synonymy. It is well known from the study of lexical fields and, more recently, word space modelling in computational linguistics, that antonyms are, in fact, closely related to each other semantically and, therefore, are relatively synonymous. For example, hate is much closer in meaning to love than to car or run. Generally, antonyms are synonyms semantically opposed by one culturally or perceptually salient feature. Antonymy is, therefore, an ‘antonym’ of synonymy only in this technical sense; in its meaning, the term lies very close to synonymy. See Jones (2002: 51) and Murphy (2003: 37) for examples of the issue at hand.
Therefore, although semantic structure and its relations exist equally for morpho-syntax and lexis, the kind of semantics typically associated with lexical forms is
of a much more concrete nature than that of more schematic formal structures. This
is, of course, a tendency, but an important one because the different kinds of formal
structure and the semantics associated with them may warrant different analytical
techniques. For these practical purposes, let us identify four objects of study:
Polysemy – Concrete meaning
         – Schematic meaning
Synonymy – Concrete form
         – Schematic form
Seen in these terms, Lakoff’s (1987) analysis of over is a study of concrete, or non-schematic, polysemy, but his study of the Deictic Construction is schematic polysemy. On the other side of the coin, Lakoff’s (1987) analysis of anger is concrete (non-schematic) synonymy and the analysis of the Dative alternation by Goldberg (2002) is an
instance of schematic synonymy. Regardless of whether we speak of words or syntax,
the analytical object of the relations between different yet functionally-conceptually
similar forms versus the relations between different functions-concepts of a single
form should be evident.
A few further examples should demonstrate the distinction and its value for
identifying the fundamental objects of study for all usage-based linguistics. By way
of example for schematic polysemy, consider Halliday’s (1967) study of the English
Transitive Construction, Langacker’s (1982) analysis of the English Passive Construction or Bondarko’s (1983) analysis of the Perfective Aspect. For concrete polysemy,
take Culioli’s (1990) analysis of the French lexeme donc ‘so, thus’ or Fillmore’s (2000)
lexeme crawl. For synonymy, the same diversity exists. At the schematic level, we
have Givón’s (1982) evidential markers, Halliday’s (1985) English Grammatical Conjunctions, Talmy’s (1988) Causative Constructions, Culioli’s (1990) English Negation
Constructions, and Langacker’s (1991) English Nominal Constructions.3 Obviously, the lexical research is equally diverse: from Fillmore’s (1977) buy-sell frame, the
lexical field of say-speak by Dirven et al. (1982), and Lehrer’s (1982) study on adjectives for the description of wine, to the full swath of conceptual metaphor and metonymy studies. Indeed, in usage-based linguistics, most descriptive linguistic studies can be classified as one of these four lines of research.
Despite the fact that a wide and varied range of linguistic analyses can be understood as the study of polysemy or synonymy, not all research can be characterised by this typology. Within both the Functional and Cognitive paradigms, there exist objects of study that are not readily characterised in this manner. Such research lies beyond the realm of polysemy and synonymy. Within Cognitive Linguistics, ad hoc, or non-entrenched, categorisation typical of conceptual integration (Fauconnier and Turner 1998), like the entire field of language processing, lies beyond the purview of polysemy and synonymy research. Within Functional Linguistics, the detailed, context-dependent analysis of conversation and the wide-ranging research on the characteristics of genre, register, and stylistics seem to lie outside the realm of semantic relations per se. This is not to say that such research, both the cognitive and the functional, does not inform the study of semantic relations, but it is distinct from this object of study.
Having established the importance and limits of these two objects of study – the conceptual-functional structure of the form and the forms available to express a concept-function – we can now ask how best to operationalise these notions. In other words, how can we define the object of study in a measurable way?4 With no langue or independent syntax against which we can test hypotheses, what exactly are we analysing in the study of semantic relations? How can we falsify or verify results? We need an operationalisation of the object of study that either offers stability to the system or a means of capturing the dynamic nature of that system. The answer lies in Langacker’s (1987: 59–60) theory of the entrenched form-meaning pair.5 The theory of entrenchment can provide a frequency-based operationalisation of grammaticality – the more often a form-meaning pair is used, the more automated its processing becomes and the more ‘grammatically acceptable’ it is according to the speaker’s intuition.6

3. Seen from this perspective, the entire tradition of conceptual analysis (Wierzbicka 1985; Lakoff 1987; Stepanov 1997; Vorkachev 2004; and Bartmiński 2008 inter alia) is based upon synonymy. Such research begins with a concept and examines what words or expressions are available for its linguistic representation. On the grammatical front, a similar notion holds. Understood in these terms, Talmy’s (1988) Force Dynamics or Langacker’s Causative Constructions (1991: 408–411) are essentially onomasiological fields of near-synonymous forms. Of course, regardless of whether it is lexis or syntax, in order to understand the relationship between these different forms, we must investigate each form semasiologically, that is, its polysemy. Bondarko (1991) argues convincingly that the semasiological–onomasiological divide is fundamental to any theory of conceptually or functionally motivated language.
4. For a discussion on the importance of operationalisation in language science, see
Stefanowitsch (2010).
5. See also Givón’s (2005: 48ff.) more detailed investigation into what he terms automated
processing in the attention system. The idea of form-function mapping is extended to include
perceptual and conceptual issues such as prototype structure.
6. The theory of entrenchment is based on the notion of automatisation, a widely accepted theory in psychology. Schneider and Shiffrin (1977) and Shiffrin and Schneider (1977) were the
first to develop the hypothesis.
Generalised across a speech community, that is, the ensemble of individual speakers’
linguistic knowledge, we have an operationalisation of grammar.
If we accept these hypotheses and assume that through repeated contextualised
use, the relation between a concept-function and a form becomes stable, then we have
an identifiable object of study. At a theoretical level, therefore, the study of semantic
relations is the study of variation of entrenched form-meaning pairs:
Polysemy – entrenched functional-conceptual variation of a schematic or
non-schematic form
Synonymy – entrenched functional-conceptual relation between schematic
and non-schematic forms
It is important to note that Langacker’s theory of entrenchment is determined by relative frequency of use. This can be re-stated as an operational definition: the degree
of entrenchment is determined by the frequency of association of a given form and
a given use of that form. Of course, what constitutes ‘a form’ and ‘a use’ is open to
debate, but the notion of entrenchment per se is operationalised in a way that permits quantified analysis of semantic structure – the aim of this volume. The relative
frequency of a form-meaning pair determines its entrenchment in a speaker’s knowledge. This, when extended to an entire speech community, means that the frequency
of a form-meaning pair in language can indicate the degree of its stability in the intersubjective system of language.7 Therefore:
Polysemy and synonymy can be measured in terms of the relative frequency
of association of form and meaning
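The frequency-based operationalisation above can be sketched in a few lines of code. This is a toy illustration only: the sense labels and counts below are invented for the example, not taken from any study discussed in this chapter.

```python
from collections import Counter

# Invented sense annotations for occurrences of a polysemous form.
# Under the frequency-based operationalisation, the most frequent
# sense (relative to the form) counts as the most 'typical' one.
annotations = (
    ["ABOVE"] * 54 + ["COVERING"] * 27 + ["PATH"] * 13 + ["CONTROL"] * 6
)

sense_freq = Counter(annotations)
total = sum(sense_freq.values())

# Relative frequency of each sense; the maximum is the candidate
# prototype under this (and only this) operational definition.
relative = {sense: n / total for sense, n in sense_freq.items()}
prototype = max(relative, key=relative.get)

print(prototype, round(relative[prototype], 2))  # ABOVE 0.54
```

The same computation, run over forms rather than senses (relative to a concept-function), yields the frequency-based candidate for the taxonomically basic form in a synonym set.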
Given this understanding of semantic relations, questions as diverse as prototype
effects and sociolinguistic variation are neatly explained by a single principle. This
claim deserves a brief explanation.
A central question for the study of polysemy is: which ‘senses’ are more ‘central’, or more prototypical, than others? A similar question exists in synonymy studies: which forms are more basic taxonomically (as in basic-level terms, Lakoff 1987)?
Both the concept of prototype meaning and basic-level form can be operationalised
in terms of frequency. Although it is not claimed that frequency alone can explain
prototype or taxonomic structure, it is, nonetheless, one important operationalisation of these phenomena (see Arppe et al. 2010). The operationalised definition is
straightforward: the more frequent a given meaning, the more ‘typical’ it is categorically. This is a frequency-based understanding of prototypicality. The same can be posited, mutatis mutandis, of basic-level categories in taxonomic structure: if, for polysemy, typicality is operationalised as the most frequent concept-function (relative to a form), for synonymy, basicness is operationalised as the most frequent form (relative to a concept-function). Therefore, for synonymy, the more frequent a form, the more basic it is taxonomically.

7. Even if this operationalisation of entrenchment is adequate, which is questionable (see Note 8), this does not, in turn, entail that corpora are representative of frequency of use. Given the dynamic complexity of language, it is a reasonable argument that no corpus will be truly representative for the foreseeable future.
Importantly, this frequency-based understanding of the system integrates the
varied and dynamic nature of language into our model of semantic relations. For a
usage-based understanding of language, the system is emergent and entirely dependent on context – context of situation, context of speaker, context of time, context of
region. A given form-meaning pair will be more frequent in one city than another,
in one register than another, for one gender more than another, and at one period of
time more than another. Relative frequency, at a theoretical level, eloquently incorporates this complexity into the object of study. Therefore, the operationalisation – context-sensitive, frequency-based typicality – captures semantic structure both categorically (prototype effects) and taxonomically (basic-level effects), as well as relative to social variation.
Lastly, it must be stressed that this frequency-based approach to entrenchment is
only an operational definition. Other operationalisations of the relationship between
form and meaning may be equally valid. Despite Langacker’s concern with frequency,
something that holds well for corpus linguists, perceptual and conceptual salience
surely also have a hand in the learning process, and therefore in entrenchment. Langacker’s explanation of entrenchment assumes that all input has the same ‘weight’ in
or ‘impact’ upon the system. This, it would seem, is a simplification. There is no reason to suppose that every occurrence of, or exposure to, language events has the same
value in the process of entrenchment. The implications, especially for prototype and taxonomic structures, are far-reaching. We must suppose, therefore, that the frequency-based account of language cannot give us the full picture. It does, however, offer
an operationalised and quantifiable object of study, one that will permit the testing of
hypotheses, the verification of results, and one that will provide clear benchmarks for
the comparison of results, using other methodologies and other operationalisations.8

8. The possibility of using frequency as an operationalisation is central to a diverse range of discussions in usage-based linguistics. Schmid (2000, 2010), Tomasello (2003), Bybee (2007) and Geeraerts (2010a) are examples of research that consider this possibility.

3. Complexity and sampling: The need for quantification

Why are quantitative corpus methods needed for a cognitive approach to polysemy and synonymy? There are two answers to that question. The first answer takes us back to our object of study and the second, to our model of language.
The first reason we need quantitative techniques can be summarised as a question
of complexity. We defined our object of study as “relative frequency of association of
form and meaning”. The discussion has, thus far, ignored an important issue – what
constitutes a given form and what constitutes a given meaning? We have stated that
a form is any form and that meaning is anything we know of the world. In fact, both
the terms ‘form’ and ‘meaning’ are entirely misleading. All forms exist in a formal
context and are, therefore, composite. Theoretically, one should not speak of a form
but of a composite form. Similarly, meaning is situated contextually and therefore
one should not speak of a meaning as a reified sense, but as an intersubjective result
of communication.9 The importance and implications of these two points cannot be overstated.
Formal structure is complex. Even the simplest utterance is a composite form
at some level. Moreover, dialect and sociolect variation means that even phonetic
components can be indicative of usage variation. Prosody, syntax, morphology, lexis
and even gesture, all come together in effectively every utterance as composite forms.
Since it is a fundamental tenet of Cognitive Linguistics that language must be analysed holistically, we must treat forms as composite structures, always.
If formal structure is complex from a cognitive perspective, meaning is more so.
Even lexical meaning cannot be divided into discrete senses. It is a dynamic, context-dependent, multi-dimensional and intersubjective social phenomenon. Geeraerts’
(1993a) and Kilgarriff ’s (1997) studies on polysemy and word meanings mark important milestones in the study of semantic relations. Their work shows that lexical senses, just like any functionally or conceptually determined category, cannot be assumed
to be discrete or reifiable. Arriving at this point, both theoretically and descriptively,
was a long road. Via theoretical research on prototype structures (Geeraerts 1993a;
Zlatev 2003), on the one hand, and via corpus research in lexicology (Kilgarriff 1997;
Glynn 2010b) on the other, the proposal that a reified and discrete understanding of
semantic structure, including individuated senses, will not produce adequate descriptions of that structure has gained currency.
Encyclopaedic semantics entails that all of sociolinguistic structure is included in
the semantic analysis. The situational context, the gender, the age, the socio-economic
class, the geographical region, and the social status are all dimensions of language use,
dimensions of world knowledge in how to use language, how to communicate successfully. This world knowledge must, therefore, be integrated into semantic analysis.
The result is a complex multidimensional ‘form’ coupled with a complex multidimensional ‘meaning’. It may be possible, using intuition and introspection, to consider
9. It is not the point of this discussion to enter into the debates on sign and communication
theory. Suffice it to say that any sign theory to which a Cognitive Linguist would subscribe
would also see meaning as a result of a communicative act and not an inherent objectifiable
phenomenon.
Table 1. Variation in the results of introspection-based polysemy analysis

Lexeme           Senses identified                         Study
over (English)   6 basic senses, 21 sub-senses             Lakoff (1987)
                 17 senses                                 Taylor (1989a)
                 9 of Lakoff’s sub-senses as 1 sense       Vandeloise (1990)
                 3 basic senses, > 40 sub-senses           Deane (1993a, 2006)
                 6 basic senses, 12 sub-senses             Dewell (1994)
                 3 basic senses                            Kreitzer (1997)
                 1 basic sense, 15 sub-senses              Tyler and Evans (2003)
over (Dutch)     13 senses                                 Cuyckens (1991)
                 3 basic senses, 11 sub-senses             Geeraerts (1992)
über (German)    3 basic senses, 8 sub-senses              Bellavia (1996)
                 3 basic senses, inconclusive sub-senses   Dewell (1996)
                 3 basic senses, 14 sub-senses             Meex (2001)
                 5 basic senses, 8 sub-senses              Liamkina (2007)
all of these dimensions, but to understand how they all interact, or how they could
all interact, is an effectively impossible feat. Quantifying the analysis permits the use
of multivariate statistics, which is designed for modelling and capturing structure in
precisely this kind of complex system.
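The point about multidimensional annotation can be made concrete with a minimal sketch. All forms, sense labels, and register values here are invented for illustration; a real usage-feature analysis would annotate hundreds of occurrences along many more dimensions.

```python
from collections import Counter

# Each usage event is annotated along formal, semantic and
# sociolinguistic dimensions at once (all values invented).
events = [
    {"form": "begin", "sense": "INCHOATIVE", "register": "written"},
    {"form": "start", "sense": "INCHOATIVE", "register": "spoken"},
    {"form": "start", "sense": "CAUSATIVE",  "register": "spoken"},
    {"form": "begin", "sense": "INCHOATIVE", "register": "written"},
    {"form": "start", "sense": "INCHOATIVE", "register": "spoken"},
]

# Cross-tabulating every dimension simultaneously: the number of cells
# grows multiplicatively with each added dimension, which is why such
# data call for multivariate statistics rather than inspection by eye.
crosstab = Counter((e["form"], e["sense"], e["register"]) for e in events)

for cell, count in sorted(crosstab.items()):
    print(cell, count)
```

With three binary dimensions the table already has up to eight cells; adding age, region, and situation type multiplies it further, quickly outstripping what introspection can track.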
The second reason we need to develop quantitative methods for the study of semantic relations lies in the model of language propounded by Cognitive Linguistics.
Both the Structuralist and Generativist traditions assumed models of language that
permitted the falsification of claims made about its structure. Necessary and sufficient
conditions, for establishing semantic categories, and grammatical acceptability tests,
for checking proposed grammatical rules, both allow an analyst to falsify hypotheses. What possibility does Cognitive Linguistics have for falsifying propositions made
about language structure? How can we test for the descriptive or explanatory adequacy of a conceptual metaphor or a reference-point construction? Neither makes any
predictions that can be falsified.
The lexical polysemy studies of early Cognitive Semantics were excellent examples of this problem. Lakoff (1987) proposed 21 senses for the lexeme over, but his
work was challenged (see Table 1). Although this, it would seem, is good scientific
procedure – a study proposes a given number of senses, the results are challenged –
there is, ultimately, no way of resolving the issue because there is no way of disproving
his original analysis. Table 1 lists the different proposals of the number of senses for
over in English, Dutch and German. This debate could effectively continue ad nauseam, since using one’s intuition to determine a category, especially a category that
can have both a fuzzy boundary and better or worse exemplars, has no possibility for
falsification. It is, ultimately, a matter of opinion.
In earlier theoretical paradigms, truth conditional semantics and predictive rules
could be tested using deductive proofs. Usage-based research has no such options. A
counterexample, even many, does not contradict a proposal about relative structure.
From our perspective, linguistic structure is emergent; it is dynamic and varied. However, a quantification of the study of semantic relations permits inductive research.
The only way to test our hypotheses is to take a sample of natural language use or a
sample of elicited language use and make generalisations based on that sample. Once
we speak in terms of samples and populations, then we speak of inductive research,
and, in brief, statistics.
Statistical analysis gives us the probability that a given finding in a given sample is not due to chance; it gives us the possibility of modelling the variation in our data and testing the accuracy of analyses by using these models to predict language use. Quite simply,
moving towards quantification and statistical analysis appears inevitable for all usage-based language research.
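As a minimal illustration of such inductive testing, the sketch below computes a Pearson chi-squared statistic for a 2x2 contingency table of form against register. The counts are invented for the example, and the closed-form p-value used here holds only for one degree of freedom.

```python
import math

# Invented counts: two near-synonymous forms across two registers.
#              spoken  written
table = [[30, 10],   # form A
         [15, 25]]   # form B

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-squared: sum of (observed - expected)^2 / expected,
# where expected counts assume form and register are independent.
chi2 = sum(
    (obs - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i, row in enumerate(table)
    for j, obs in enumerate(row)
)

# For a 2x2 table (one degree of freedom) the p-value has a closed
# form via the complementary error function.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 2), p_value < 0.05)  # 11.43 True
```

A small p-value licenses the inductive inference that, in the population sampled, form choice is associated with register; larger annotated tables call for the multivariate techniques mentioned above.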
4. Modelling meaning: Multidimensional patterns and prototype effects
How does the usage-based approach contribute to the prototype structure and radial
network tradition of polysemy and synonymy in Cognitive Semantics? The study of
both (vague) polysemy and (near) synonymy has a long tradition in Cognitive Linguistics. Indeed, many of the seminal works were devoted to such questions. This
section places quantitative corpus-driven research, such as that presented in the current volume, in Gries and Stefanowitsch (2006) and Glynn and Fischer (2010), in this
‘historical’ context.
At the end of the last century, what was often called the ‘network’ or ‘radial category’ approach to meaning included a wide range of polysemy and synonymy studies
of especially spatial prepositions and grammatical cases. These forms were particularly interesting because they linked the theoretical research on perception-based
construal and image schemata with the culturally determined lexico-grammatical
structure. The aim of this research was to model prototype structure and encyclopaedic semantics. In effect, presented with the boundless considerations of encyclopaedic semantics as well as relative categorisation due to prototype effects on structure,
theoretically and analytically, much of the work can be seen as an attempt to identify
order in what is an immensely varied and complex system. While theoretical models
such as Frame Semantics (Fillmore 1985) and Idealised Cognitive Models (Lakoff
1987) were proposed in an attempt to identify structure in encyclopaedic semantics
and the prototype effects in language, radial network analysis, in its various guises
(Barthélemy 1991; Rice 1993; Geeraerts 1995), was essentially a representational formalism designed to visualise and summarise systematicity in semantic complexity.
This formalism was used, to various extents, by different authors, but the principle
of (i) employing encyclopaedic semantic features (ii) without the notion of necessary
and sufficient conditions for category membership (iii) in order to distinguish senses
and relate forms, underlies all radial network research.
The theoretical research on semantics and categorisation began with Fillmore
(1975, 1977) and Lakoff (1975, 1977). Although theoretical discussion on these topics continues to this day, the field was established by Verschueren (1981), Fillmore
(1985), Vandeloise (1986), Geeraerts (1987), and Lakoff (1987).10 The application of
these theories in descriptive analysis blossomed into what was termed the radial network approach. Early case studies included Lindner (1983), Brugman (1983a, 1984),
Rudzka-Ostyn (1983, 1989), Hawkins (1985), Vandeloise (1986), Janda (1986, 1990),
Norvig and Lakoff (1987), and Taylor (1988). This line of research proved so popular that the mainstay of Cognitive Semantics in the 1990s could be argued to be based
upon it. A string of anthologies largely devoted to the application of the method include Geeraerts (1989), Tsohatzidis (1990), Dubois (1991), Lehrer and Kittay (1992),
Zelinsky-Wibbelt (1993), Schwarz (1994), Dirven and Vanparys (1995), Taylor and
MacLaury (1995), Pütz and Dirven (1996), Ravin and Leacock (2000), Cuyckens and
Zawada (2001), Cuyckens and Radden (2002), Nerlich et al. (2003), Cuyckens et al.
(2003), and Rakova et al. (2007).
At first sight, this highly abstracted and introspective research tradition would
seem distinct from, even at odds with, the bottom-up approach of corpus-driven methodology. However, upon closer inspection, we see that the very origins of corpus-driven, indeed, quantitative corpus-driven, semantic research lie in the radial network
studies. The contemporary methodology directly inherits from and builds upon this
tradition. For both the study of synonymy and polysemy, many of the earliest studies
were entirely empirical. Moreover, as we will see below, the two approaches are theoretically linked. Let us, however, begin with an overview of the radial network research.
For practical reasons of brevity, it is impossible to offer more than a snippet of
this immense field. However, Table 2 offers a selection of studies, chosen to represent
the depth and variation of the radial network approach to semasiological structure.
The object of study, its general part of speech or grammatical category, whether this
is schematic or concrete in form, as well as the method of analysis and reference are
listed. It must be stressed that the distinction between schematic and concrete forms
is only designed to show tendencies and no theoretical distinction is intended. The
decision as to what to include is based on a subjective evaluation of the impact of
the study and the author upon the field, as well as the extent of the study, priority being
given to monographs.
10. Beyond these early studies, the following research represents key points in the discussion
on semantics and categorisation in Cognitive Semantics: Deane (1988), Geeraerts (1988, 1989,
1993a, 1997), Taylor (1989a), Wierzbicka (1989, 1990), Vandeloise (1990), Kleiber (1990, 1999),
Lehrer (1990a, 1990b), Dunbar (1991, 2001), Tuggy (1993, 1999), and Croft (1998). Recent discussion includes Tyler and Evans (2003), Zlatev (2003) and Evans (2005, 2006).
Table 2. Prototype-encyclopaedic analysis of polysemy in Cognitive Semantics

Object                            Form      Schematicity  Method         Reference
lie (English)                     verb      concrete      elicitation    Coleman and Kay (1981)
talk, say, tell, speak (English)  verb      concrete      observation    Dirven et al. (1982)
over (English)                    prep.     concrete      introspection  Brugman (1983a)
on the go Cx. (English)           constr.   schematic     introspection  Brugman (1983b)
up, out (English)                 particle  concrete      introspection  Lindner (1983)
idea (English)                    noun      concrete      introspection  Brugman (1984)
kind of Cx. (English)             constr.   schematic     introspection  Kay (1984)
uit (Dutch), wy (Polish)          prep.     concrete      introspection  Rudzka-Ostyn (1985)
spatial preps. (English)          prep.     concrete      introspection  Herskovits (1986, 1988)
za-, pere-, do-, ot- (Russian)    prefix    concrete      introspection  Janda (1986)
spatial preps. (French)           prep.     concrete      introspection  Vandeloise (1986)
over (English)                    prep.     concrete      introspection  Lakoff (1987)
let alone Cx. (English)           constr.   schematic     introspection  Fillmore et al. (1988)
down (English)                    prep.     concrete      introspection  Schulze (1988)
tall (English)                    adj.      schematic     elicitation    Dirven and Taylor (1988)
ask (English)                     verb      concrete      observation    Rudzka-Ostyn (1989)
Genitive (English)                case      schematic     introspection  Taylor (1989b)
Dative (Czech)                    case      schematic     introspection  Janda (1990, 1993)
vers (Dutch)                      adj.      concrete      introspection  Geeraerts (1990)
Middle Voice Cx. (French)         constr.   schematic     observation    Melis (1990)
Resultative Cx. (English)         constr.   schematic     introspection  Goldberg (1991, 1995)
Dative (Polish)                   case      schematic     introspection  Rudzka-Ostyn (1992, 1996)
over (Dutch)                      prep.     concrete      introspection  Geeraerts (1992)
Ditransitive Cx. (English)        constr.   schematic     introspection  Goldberg (1992, 1995)
in (Dutch)                        prep.     concrete      introspection  Cuyckens (1993)
Instrumental (Russian)            case      schematic     introspection  Janda (1993)
at, on, in (English)              prep.     concrete      introspection  Rice (1993)
‘give’ (Mandarin)                 verb      concrete      introspection  Newman (1993)
at, by, to (English)              prep.     concrete      introspection  Deane (1993b)
(a)round (English)                prep.     concrete      introspection  Schulze (1993)
Genitive (Polish)                 case      schematic     introspection  Rudzka-Ostyn (1994, 2000)
in (English)                      prep.     concrete      introspection  Vandeloise (1994)
Instrumental (Polish)             case      schematic     introspection  Dąbrowska (1994)
off (English)                     prep.     concrete      introspection  Schulze (1994)
op (Dutch)                        prep.     concrete      introspection  Cuyckens (1994)
over (English)                    prep.     concrete      introspection  Dewell (1994)
answer, respond (English)         verb      concrete      observation    Rudzka-Ostyn (1995)
door, langs (Dutch)               prep.     concrete      introspection  Cuyckens (1995)
Caused-Motion Cx. (English)       constr.   schematic     introspection  Goldberg (1995)
at, on, in (English)              prep.     concrete      elicitation    Sandra and Rice (1995)
Dative (Polish)                   case      schematic     introspection  Dąbrowska (1997)
over (English)                    prep.     concrete      introspection  Kreitzer (1997)
figure out (English)              verb      concrete      introspection  Morgan (1997)
Table 2. (continued)

Object                          Form     Schematicity  Method         Reference
at, on, in (English)            prep.    concrete      elicitation    Cuyckens et al. (1997)
Causative Cx. (English)         constr.  schematic     observation    Lemmens (1998)
Dative (Dutch)                  constr.  schematic     introspection  Geeraerts (1998)
straight (English)              adj.     concrete      introspection  Cienki (1998)
on (English)                    prep.    concrete      elicitation    Rice et al. (1999)
to, for (English)               prep.    concrete      observation    Rice (1999)
What’s X doing Y Cx. (Engl.)    constr.  schematic     introspection  Kay and Fillmore (1999)
crawl (English)                 verb     concrete      introspection  Fillmore (2000)
Research in synonymy, though it received less attention and possibly produced less in
terms of quantity, was equally important in the development of Cognitive Semantics.
Studies such as Fillmore (1977), Lehrer (1982), Dirven et al. (1982), Janda (1986), and
later Schmid (1993), Geeraerts et al. (1994), and Rudzka-Ostyn (1995) represent seminal work in the field. Table 3 offers a summary of Cognitive Linguistic case studies
in synonymy, again up until the turn of the century. There is some redundancy with
Table 2 because what may be a set of individual case studies on polysemy was also combined to present a study in near-synonymy. Again, the table is designed to offer an overview and is in no way complete.
It should be noted that although there was less work on synonymy per se, the
conceptual metaphor studies, which were extremely numerous at the time, are, in effect, synonymy studies. Although such research was primarily interested in figurative lexemes, these studies remain, nonetheless, studies of near-synonymy (as noted by Kittay and
Lehrer 1981 at the time). There was much discussion about what constitutes a source
domain and/or target domain and whether certain expressions were in fact examples
of the concept in question. From a lexical semantic point of view, these questions are,
of course, questions of near-synonymy. Expressions such as to have the hots versus to
be head over heels were said to profile different aspects of the target domain, in other
words, they were near-synonyms.
In order to appreciate the trends and heritage of contemporary methods in Cognitive Semantic research, it is helpful to make a quantified summary of the research
output.11 Since, for reasons of space, full coverage of this research history is impossible, only the results of the investigation are presented. A survey of 126 studies was
conducted from roughly the beginning of the Cognitive Linguistic research community with the publication of Paprotté and Dirven’s (1985) anthology The Ubiquity of
11. See Geeraerts (2005; 2006b), Croft (2009) and Glynn (2010c) for other discussions on
methodological trends in Cognitive Linguistics.
Table 3. Prototype-encyclopaedic analysis of synonymy in Cognitive Semantics

Object                          Form     Schematicity  Method         Reference
buy (English)                   verb     concrete      introspection  Fillmore (1977)
speak (English)                 verb     concrete      observation    Dirven et al. (1982)
wine terms (English)            adj.     concrete      elicitation    Lehrer (1982)
trade names (English)           nouns    concrete      observation    Vorlat (1985)
za-, pere-, do-, ot- (Russian)  prefix   schematic     introspection  Janda (1986)
Deictic Cx. (English)           constr.  schematic     introspection  Lakoff (1987)
risk (English)                  verb     concrete      observation    Fillmore and Atkins (1992)
house (English)                 noun     concrete      elicitation    Schmid (1993)
start – begin (English)         verb     concrete      observation    Schmid (1993)
clothing terms (Dutch)          noun     concrete      observation    Geeraerts et al. (1994)
see (English)                   verb     concrete      observation    Atkins (1994)
contact (English)               verb     concrete      introspection  Dirven (1994)
answer (English)                verb     concrete      observation    Rudzka-Ostyn (1995)
up – down (English)             prep.    concrete      introspection  Boers (1996)
front – back (English)          prep.    concrete      introspection  Boers (1996)
Perfectivisers (Polish)         prefix   schematic     introspection  Dąbrowska (1996)
run (English)                   verb     concrete      introspection  Taylor (1996)
por – para (Spanish)            prep.    concrete      introspection  Delbecque (1996)
liegen – stehen (German)        verb     concrete      introspection  Bornetto (1996)
epistemic modifiers (Dutch)     verb     concrete      introspection  Sanders and Spooren (1996)
causatives (French)             constr.  schematic     introspection  Achard (1996)
negation (English)              various  concrete      introspection  Lewandowska-Tomaszczyk (1996)
kill (English)                  verb     concrete      observation    Lemmens (1998)
verb particle Cxs (English)     constr.  schematic     observation    Gries (1999)
beer terms (Dutch, French)      noun     concrete      observation    Geeraerts (1999a)
football terms (Dutch)          noun     concrete      observation    Geeraerts et al. (1999)
anaphoric nouns (English)       noun     concrete      observation    Schmid (2000)
Metaphor to the turn of the century, where the field becomes extremely diverse and
empirical methods begin to become the norm.
In order to operationalise the criterion of what constitutes ‘cognitive’ research
in the field, the survey is restricted to three publication avenues. Only full articles
that addressed polysemy or synonymy in the three following kinds of sources are
considered:
i. The official journal of the International Cognitive Linguistics Association, Cognitive Linguistics, between 1990 and 1999.
ii. Five foundational anthologies within the paradigm, three of which are proceedings of the first three conferences of the International Cognitive Linguistics Society (Rudzka-Ostyn and Geiger 1993; de Stadler and Eyrich 1993; and Casad 1996) and two of which predate the society but, at the time, were constitutive of the community (Paprotté and Dirven 1985 and Rudzka-Ostyn 1988).
iii. Nine anthologies largely devoted to the Cognitive Linguistic research on polysemy
and synonymy, published between 1989 and 1996 (Geeraerts 1989; Tsohatzidis
1990; Dubois 1991; Rauh 1991; Lehrer and Kittay 1992; Zelinsky-Wibbelt 1993;
Schwarz 1994; Taylor and MacLaury 1995; and Pütz and Dirven 1996).
Monographs are not included, both because it is difficult to gauge their impact on the field and because they often contain multiple case studies.
Another issue is how to determine what constitutes an example of the radial network
approach to semantic relations. This is determined using the three-part definition offered above: employing encyclopaedic semantic features, without the notion of necessary and sufficient conditions for category membership, in order to distinguish senses
or relate forms.
Each of the 126 studies is categorised for year of publication, type of publication, and author. The object of each study is also categorised as:
– schematic vs. concrete (lexemes vs. constructions/grammatical categories)
– polysemy vs. synonymy
– linguistic phenomenon (actual form(s) under investigation)
– language
Lastly, the studies are categorised for their method of analysis. Three kinds of method are distinguished: introspection, observation, and elicitation. Two of these are further subdivided. For observational (corpus-based) methods, three approaches are distinguished:
– corpus-driven with statistical verification
– corpus-driven with raw counts
– corpus-illustrated (introspection exemplified with natural data)
For elicited methods, another three methods are distinguished:
– quantified direct elicitation (questionnaires etc.) with raw counts
– quantified direct elicitation with statistical verification
– experimental elicitation with statistical verification
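The classification scheme above can be made concrete with a small sketch. The records, field names, and values below are invented for illustration and are not the survey's actual data:

```python
from collections import Counter

# Invented study records, coded along the survey's dimensions
# (hypothetical examples, not the real 126-study sample).
studies = [
    {"form": "verb",    "schematicity": "concrete",  "relation": "polysemy",
     "method": "introspection"},
    {"form": "prep.",   "schematicity": "concrete",  "relation": "polysemy",
     "method": "introspection"},
    {"form": "constr.", "schematicity": "schematic", "relation": "synonymy",
     "method": "observation"},
    {"form": "noun",    "schematicity": "concrete",  "relation": "synonymy",
     "method": "elicitation"},
]

# Tally the sample along two of the survey's dimensions
by_method = Counter(s["method"] for s in studies)
by_object = Counter((s["schematicity"], s["relation"]) for s in studies)

print(by_method.most_common())
print(by_object.most_common())
```

Once each study is a record of this kind, summaries such as Figures 1 and 2 are simple counts per year and per category.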
Distinguishing introspective, experimental and observational methods is unproblematic, save when a study uses more than one method, as in Dirven and Taylor (1988). In
this case, for instance, the study is categorised as elicited, since these data feature more
prominently than the corpus data in the analysis.
The most striking result is just how balanced and broad the range of studies is.
Although obviously Eurocentric, a surprisingly wide range of languages is considered
(in total 30 different languages, with English making up 36% of the studies). Moreover,
[Figure 1 plots the number of studies per year as four series: concrete, schematic, polysemy, and synonymy.]
Figure 1. Object of study in Cognitive Semantics, 1985–1999
despite the predilection for spatial prepositions, particles, and morphemes, these parts
of speech represent ‘only’ 18 (14%) of the studies. Although 14% represents a sizable
proportion of the research, considering that the approach is often termed preposition
research and considering that two of the anthologies were devoted to prepositions and
another two to spatial representation, this figure is not as overwhelming as one might
expect.
However, the best indicator of the diversity of the research is found if we take a
more coarse-grained perspective and consider schematic vs. concrete and synonymous vs. polysemous objects of study. Figure 1 represents the numbers of such
studies over the 15-year period.
We see here how evenly dispersed the four different objects of study are over the
period. Only in 1993 do we see a divergence, where the number of studies examining
schematic forms such as grammatical constructions and grammatical categories drop
when other objects of study remain steady or increase in number. Indeed, divided in
this manner, not even the analysis of concrete instances of polysemy, such as prepositions, is remotely dominant.
Although this tells us nothing of the methodological heritage, it is important to
note that radial network analysis was not restricted to polysemy studies of spatial
prepositions, and that the diverse range of lexical and grammatical analysis of both
near-synonymy and vague-polysemy visible today is directly descended from this
tradition.
Turning to the methodological trends, we see an important and consistent presence of empirical studies, even if the use of introspection dominates. Figure 2 depicts
these trends. Two levels of granularity are summarised in a single plot: a simple distinction between empirical and introspective as well as a breakdown of empirical into
elicited and observational. Figure 2 shows how empirical methods followed the trends
in introspective studies and were, although less common, far from irrelevant. The
[Figure 2 plots the number of studies per year as four series: empirical, introspection, observation, and elicitation.]
Figure 2. Method of study in Cognitive Semantics, 1985–1999
exception to this is, again, 1993, where we see a large number of introspective studies
and no published observation or elicitation-based research in the sample.
At a more fine-grained level, there is a striking lack of statistical sophistication in
the observation-based research. In the sample of studies, only a single corpus analysis, Gries (1999), showed any sophistication, the entire body of corpus-driven research being restricted to raw counts. Although we know this was not completely the
case since studies outside the narrow range under consideration employed statistical
techniques from the beginning of the research paradigm, the relative trend is clear –
the use of statistical techniques to consider results of corpus-driven research was not
the norm. The use of statistical techniques for elicited data, however, was more common, though far from typical and appearing quite late in the sample, Schulze (1991),
Chaffin (1992), and Myers (1994) being the earliest instances.
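The 'statistical verification' at issue need not be elaborate. As a minimal sketch, with counts invented for the purpose, a Pearson chi-squared test on a two-by-two frequency table already goes beyond raw counts:

```python
# Invented 2x2 table: two near-synonyms across two registers.
#                  spoken  written
observed = [[30, 10],   # lexeme A
            [20, 40]]   # lexeme B

rows = [sum(r) for r in observed]
cols = [sum(c) for c in zip(*observed)]
total = sum(rows)

# Pearson chi-squared: sum of (observed - expected)^2 / expected per cell,
# where expected = row total * column total / grand total.
chi2 = sum(
    (observed[i][j] - rows[i] * cols[j] / total) ** 2
    / (rows[i] * cols[j] / total)
    for i in range(2) for j in range(2)
)

# The critical value for 1 degree of freedom at p = 0.05 is 3.841.
print(round(chi2, 2), chi2 > 3.841)  # 16.67 True
```

Raw counts alone (30 vs. 20 in the spoken register) say nothing about whether the distributional difference could be due to chance; the test statistic does.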
Although based on a limited sample of 126 studies from the official journal and a
selection of anthologies, the survey hopefully demonstrates that the semantic tradition
within Cognitive Linguistics is not restricted to prepositional studies nor is its methodology overwhelmingly introspective. Perhaps the image that Cognitive Linguistics
is exclusively orientated towards introspective methodology results from the most
widely known theoretical works that founded the paradigm. Certainly the seminal
publications of Fillmore (1985), Talmy (1985), Lakoff (1987), and Langacker (1987)
restrict themselves to introspective investigation. The importance of these theoretical
and foundational works notwithstanding, we see above that the research community as a whole always included significant methodological diversity. Quantitative corpus-driven methods are, therefore, not a new turn or a new direction, but the natural
development of an existing tradition.
The rich descriptive heritage summarised in Tables 2 and 3 represents the Cognitive Semantic approach to polysemy and synonymy research. This volume continues
the tradition, but instead of turning to prototype set theory as the analytical framework
to capture the complexity of semantic relations, the corpus-driven research presented
here uses multivariate statistical modelling and collocation association measures. Does
this mean that we have done away with radial network analysis, prototype category
structure and fuzzy set structure? It would be possible to dismiss the introspective radial network analyses of the early period of Cognitive Linguistics as little more than exercises in prototype set theory. After all, it has been shown that the method of analysis
was largely ad hoc (Sandra and Rice 1995) and the last decade has seen few important
studies employing the approach. Indeed, some in the empirical research community
would seek to distance themselves from such a research tradition.12 However, to turn
our back on this tradition or seek to demonstrate the superiority of the current methodology would do injustice to this previous research.
Firstly, it is precisely this research tradition that freed the study of semantic relations from the notions of discrete senses and context-independent semantics. Radial
network studies were the first and essential step towards this realisation – both theoretically and analytically. Theoretically, this tradition set the stage for an understanding of semantics as emergent structure and, analytically, it produced the idea of understanding
meaning as a network of interacting factors. On both fronts, the corpus-driven and
experimental study of semantic relations is a direct heir to radial network research.
Indeed, within the tradition, as early as Geeraerts (1993a: 260), the idea that we need
to move away from a reified and mono-dimensional understanding of meaning was
being overtly mooted.13
Secondly, such studies are an essential step in empirical research. They represent
hypothetical models of language structure, based on careful and systematic introspection-based analysis of language. Rather than ignore, or worse still, dismiss such
theoretical research, empirical analysis needs to treat it as foundational. The results
of introspection-based research are theoretical models, models that are, most likely,
reasonably accurate descriptions of language structure. The new generation of experimental and observational linguistic analysis needs to test the accuracy and explanatory power of those models, modifying them where needed. Seen in this light,
the configuration of the argument-structure of the verbs of ‘buying and selling’ by
12. There is much discussion about an empirical turn/revolution in Cognitive Linguistics
(Geeraerts 2006b, 2010b; Levshina 2011; Klavan 2012 inter alios). Indeed, certain authors have
expressed concerns about “empirical imperialism” (Lampert and Lampert 2010; Schmid 2010),
a term coined by Geeraerts (2006a). Some of the discussion on the topic (Stefanowitsch 2008;
Fischer 2010) represents the ‘turn’ as a break from the past and indeed Gries (forthc.) considers
corpus-driven and experimentation-based research in polysemy as an entirely different discipline from research using prototype set theory. Whether it is a result of the gradual emergence
of empiricism or a radical methodological revolution, there is little doubt that the analytical
landscape of Cognitive Semantics has changed substantially over the last 25 years.
13. This line of thinking is not as radical as one might expect. Lehrer and Lehrer (1994) and
Victorri and Fuchs (1996) represent further examples of early discussions on how a non-reified
understanding of semantic structure needs to be developed.
Fillmore (1977), the schema specifications in Lakoff's (1987) study of over, or the
image-schematic grammatical features of Janda’s (1993) study of the Dative are theoretical models and it is our task to test their descriptive accuracy and improve them.
There are, of course, differences between the corpus-driven quantitative research
and the introspection-based radial network studies. No matter how large a corpus,
found data will always be biased towards what is common rather than what is possible. Introspection is a vital methodology for proposing hypotheses about what is
possible in a language. It follows that a lack of corpus evidence of a given form-meaning pair does not mean it is not possible or does not occur. This is simply because
even the largest corpus in the world is but a microscopic fraction of actual language
use. This difference affects the results profoundly. Corpus-driven research is exclusively frequency-based, and this, in turn, will prioritise typical structures over less
typical structures. It is for this reason that it would be difficult to simply carry out
a corpus-driven study on over and compare the results with Lakoff's (1987) analysis.
A corpus-driven study may or may not confirm parts of the network analysis, but it
is unlikely that it would paint the same picture any more than lack of confirmation
would negate his hypothesis. This is simply because only the most frequent usage patterns, or schema configurations to use Lakoff's terminology, would be found.
Another apparent difference would be the lack of interest in prototype effects. In
the corpus-driven research, what has become of the analytical apparatus – prototype
categorisation and fuzzy sets? Although there is little overt reference to such notions,
the results of multifactorial analysis and collocation analysis are, in fact, structured
as fuzzy-bounded prototype categories. Since the results are based upon relative frequency, they are, therefore, necessarily ‘prototype’ structured (at least if we accept a
frequency-based operationalisation of prototypicality) and are not discrete. Moreover, multifactorial feature analysis identifies ‘meanings’ as tendencies, where a tendency is a multidimensional pattern of use. This, quite literally, produces networks of
different uses – a frequency-based and complex multidimensional network of sense
relations. The radial network analysis produced prototype maps of meaning upon one
‘semantic’ dimension where the ‘nodes’ were discrete reified senses. Today, multifactorial feature analysis produces multidimensional networks of usage patterns that can
be interpreted as emergent language structure. The difference between the two is a
natural progression, not a methodological schism.
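A frequency-based operationalisation of prototypicality of the kind just described can be sketched in a few lines; the sense labels and counts below are invented for illustration:

```python
# Invented corpus counts for the usage patterns ("senses") of a
# hypothetical polysemous item.
sense_counts = {"spatial": 120, "temporal": 45, "causal": 25, "concessive": 10}

total = sum(sense_counts.values())

# Relative frequency as a graded measure of prototypicality: senses are
# ranked centre-to-periphery rather than treated as discrete, reified nodes.
prototypicality = {sense: n / total for sense, n in sense_counts.items()}
ranking = sorted(prototypicality, key=prototypicality.get, reverse=True)

for sense in ranking:
    print(sense, round(prototypicality[sense], 3))
```

The output is a gradient, not a partition: every attested pattern receives a degree of centrality, which is precisely the non-discrete structure described above.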
We can, therefore, deduce that, despite important differences, there is a direct line
of descent in the methodology from the radial network approach described above to
the contemporary corpus-driven research represented by studies such as Gries (2003,
2006), Heylen (2005), Divjak and Gries (2006), Divjak (2006, 2010a, 2010b), Wulff
(2006), Grondelaers et al. (2007, 2008), Wulff et al. (2007), Janda and Solovyev (2009),
Hilpert (2008), Gilquin (2009), Glynn (2009, 2010a, 2010b, 2014a, 2014b, forthc.),
Speelman and Geeraerts (2010), Krawczak and Kokorniak (2012), Krawczak (2014a,
2014b), and the research in this volume. Armed with contemporary corpus-driven
methods, the task ahead of us now is, arguably, to return to the research questions that
gave rise to Cognitive Semantics in an effort to understand the conceptual and functional structures that motivate language. Although contemporary quantitative research
may be gaining descriptive adequacy, our ultimate goal remains explanatory adequacy.
References
Apresjan, J. D. (1974). Лексическая Семантика. Синонимические средства языка [Lexical
Semantics: Synonymous foundations of language]. Moscow: Nauka.
Apresjan, J. D. (2000). Systematic lexicography. Oxford: Oxford University Press.
Arppe, A., Gilquin, G., Glynn, D., Hilpert, M., & Zeschel, A. (2010). Cognitive corpus linguistics: Five points of debate on current theory and methodology. Corpora, 5, 1–27.
DOI: 10.3366/cor.2010.0001
Atkins, B. (1994). Analyzing the verbs of seeing: A frame semantics approach to corpus lexicography. Proceedings of the Twentieth Annual Meeting of the Berkeley Linguistics Society,
42–56.
Barthélemy, J.-P. (1991). Similitude, arbres, et typicalité. In D. Dubois (Ed.), Sémantique et
cognition: catégories, prototypes, typicalité (pp. 205–224). Paris: Centre national de la recherche scientifique.
Bartmiński, J. (2008). Aspects of cognitive ethnolinguistics. London: Equinox.
Bellavia, E. (1996). The German über. In M. Pütz, & R. Dirven (Eds.), The construal of space in
language and thought (pp. 73–107). Berlin & New York: Mouton de Gruyter.
Boers, F. (1996). Spatial prepositions and metaphor: A Cognitive Semantic journey along the up-down and front-back dimensions. Tübingen: Gunter Narr.
Bondarko, A. V. (1983). Принципы функциональной грамматики и вопросы аспектологии
[Principles of functional grammar and questions of aspectology]. Leningrad: Nauka.
Bondarko, A. V. (1991). Functional grammar: A field approach. Amsterdam & Philadelphia:
John Benjamins. DOI: 10.1075/llsee.35
Brugman, C. (1983a). The story of over: Polysemy, semantics, and the structure of the lexicon.
Trier: LAUT.
Brugman, C. (1983b). How to be in the know about on the go. Proceedings of the Chicago Linguistics Society, 19, 64–76.
Brugman, C. (1984). The very idea: A case study in polysemy and cross-lexical generalizations.
Proceedings of the Chicago Linguistics Society, 20, 21–38.
Bybee, J. (2007). Frequency of use and the organization of language. Oxford: Oxford University
Press. DOI: 10.1093/acprof:oso/9780195301571.001.0001
Casad, E. (Ed.). (1996). Cognitive Linguistics in the redwoods. The expansion of a new paradigm.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110811421
Chaffin, R. (1992). The concept of a semantic relation. In A. Lehrer, & E. Kittay (Eds.), Frames,
fields, and contrasts: New essays in semantic and lexical organisation (pp. 253–288).
London: Lawrence Erlbaum.
Cienki, A. (1998). Straight: An image schema and its metaphorical extensions. Cognitive Linguistics, 9, 107–150. DOI: 10.1515/cogl.1998.9.2.107
Coleman, L., & Kay, P. (1981). Prototype semantics: The English word lie. Language, 57, 26–44.
Coșeriu, E. (1980). Textlinguistik. Tübingen: Gunter Narr.
Croft, W. (1998). Linguistic evidence and mental representations. Cognitive Linguistics, 9, 151–
173. DOI: 10.1515/cogl.1998.9.2.151
Croft, W. (2009). Toward a social Cognitive Linguistics. In V. Evans, & S. Pourcel (Eds.),
New directions in Cognitive Linguistics (pp. 395–420). Amsterdam & Philadelphia: John
Benjamins.
Cruse, A. (2000). Aspects of the micro-structure of word meanings. In Y. Ravin, & C. Leacock
(Eds.), Polysemy: Theoretical and computation approaches (pp. 30–51). Oxford: Oxford
University Press.
Culioli, A. (1990). Pour une linguistique de l’énonciation: Opérations et représentations. Paris:
Ophrys.
Cuyckens, H. (1991). The semantics of spatial prepositions in Dutch. Unpublished PhD dissertation, University of Antwerp.
Cuyckens, H. (1993). The Dutch spatial preposition “in”: A cognitive-semantic analysis. In
C. Zelinsky-Wibbelt (Ed.), The semantics of prepositions: From mental processing to natural
language processing (pp. 27–72). Berlin & New York: Mouton de Gruyter.
Cuyckens, H. (1994). Family resemblance in the Dutch spatial preposition op. In M. Schwarz
(Ed.), Kognitive Semantik: Ergebnisse, Probleme, Perspektiven (pp. 179–196). Tübingen:
Gunter Narr.
Cuyckens, H. (1995). Family resemblance in the Dutch spatial prepositions Door and Langs.
Cognitive Linguistics, 6, 183–207. DOI: 10.1515/cogl.1995.6.2-3.183
Cuyckens, H., Sandra, D., & Rice, S. (1997). Towards an empirical lexical semantics. In
B. Smieja, & M. Tasch (Eds.), Human contact through language and linguistics (pp. 35–54).
Frankfurt/Main: Peter Lang.
Cuyckens, H., & Zawada, B. (Eds.). (2001). Polysemy in Cognitive Linguistics. Amsterdam &
Philadelphia: John Benjamins. DOI: 10.1075/cilt.177
Cuyckens, H., & Radden, G. (Eds.). (2002). Perspectives on prepositions. Tübingen: Max
Niemeyer. DOI: 10.1515/9783110924787
Cuyckens, H., Dirven, R., & Taylor, J. (Eds.). (2003). Cognitive approaches to lexical semantics.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110219074
Dąbrowska, E. (1994). Radial categories in grammar: The Polish instrumental case. Linguistica
Silesiana, 15, 83–94.
Dąbrowska, E. (1996). Temporal structuring of events: A study of Polish perfectivizing prefixes.
In R. Dirven, & M. Pütz (Eds.), The construal of space in language and thought (pp. 467–
490). Berlin & New York: Mouton de Gruyter.
Dąbrowska, E. (1997). Cognitive Semantics and the Polish dative. Berlin & New York: Mouton
de Gruyter. DOI: 10.1515/9783110814781
de Stadler, L., & Eyrich, C. (Eds.). (1993). Issues in Cognitive Linguistics. Berlin & New York:
Mouton de Gruyter.
Deane, P. (1988). Polysemy and cognition. Lingua, 75, 325–361.
DOI: 10.1016/0024-3841(88)90009-5
Deane, P. (1993a). Multimodal spatial representation: On the semantic unity of ‘over’ and other
polysemous prepositions. Duisburg: LAUD.
Deane, P. (1993b). At, by, to, and past: A study in multimodal image theory. Proceedings of the
Berkeley Linguistics Society, 19, 112–124.
Deane, P. (2006). Multimodal spatial representation: On the semantic unity of over. In
B. Hampe (Ed.), From perception to meaning: Image schemas in Cognitive Linguistics
(pp. 235–284). Berlin & New York: Mouton de Gruyter.
Delbeque, N. (1996). Towards a cognitive account of the use of the prepositions por and para in
Spanish. In E. Casad (Ed.), Cognitive Linguistics in the Redwoods: The expansion of a new
paradigm in linguistics (pp. 249–318). Berlin & New York: Mouton de Gruyter.
Dewell, R. (1994). Over again: On the role of image–schemas in semantic analysis. Cognitive
Linguistics, 5, 351–380. DOI: 10.1515/cogl.1994.5.4.351
Dewell, R. (1996). The separability of German über: A cognitive approach. In M. Pütz, &
R. Dirven (Eds.), The construal of space in language and thought (pp. 109–133). Berlin &
New York: Mouton de Gruyter.
Dirven, R. (1994). Cognition and semantic structure: The experiential basis of the semantic
structure of verbs of body contact. In M. Schwarz (Ed.), Kognitive Semantik: Ergebnisse,
Probleme, Perspektiven (pp. 131–145). Tübingen: Gunter Narr.
Dirven, R., & Taylor, J. (1988). The conceptualisation of vertical space in English: The case of
tall. In B. Rudzka-Ostyn (Ed.), Topics in Cognitive Linguistics (pp. 379–402). Amsterdam &
Philadelphia: John Benjamins.
Dirven, R., Goossens, L., Putseys, Y., & Vorlat, E. (1982). The scene of linguistic action and its
perspectivization by speak, talk, say, and tell. Amsterdam & Philadelphia: John Benjamins.
DOI: 10.1075/pb.iii.6
Dirven, R., & Vanparys, J. (Eds.). (1995). Current approaches to the lexicon. Frankfurt/Main:
Peter Lang.
Divjak, D. (2006). Ways of intending: A corpus-based Cognitive Linguistic approach to
near-synonyms in Russian. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 19–56). Berlin & New
York: Mouton de Gruyter.
Divjak, D. (2010a). Structuring the lexicon: A clustered model for near-synonymy. Berlin & New
York: Mouton de Gruyter.
Divjak, D. (2010b). Corpus-based evidence for an idiosyncratic aspect-modality relation in
Russian. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven
approaches (pp. 305–331). Berlin & New York: Mouton de Gruyter.
Divjak, D., & Gries, St. Th. (2006). Ways of trying in Russian: Clustering behavioral profiles.
Corpus Linguistics and Linguistic Theory, 2, 23–60. DOI: 10.1515/CLLT.2006.002
Dubois, D. (Ed.). (1991). Sémantique et cognition: Catégories, prototypes, typicalité. Paris:
Centre national de la recherche scientifique.
Dunbar, G. (1991). The cognitive lexicon. Tübingen: Gunter Narr.
Dunbar, G. (2001). Toward a cognitive analysis of polysemy, ambiguity, and vagueness. Cognitive Linguistics, 12, 1–14. DOI: 10.1515/cogl.12.1.1
Evans, V. (2005). The meaning of time: Polysemy, the lexicon and conceptual structure. Journal
of Linguistics, 41, 33–75. DOI: 10.1017/S0022226704003056
Evans, V. (2006). Lexical concepts, cognitive models and meaning-construction. Cognitive Linguistics, 17, 491–534. DOI: 10.1515/COG.2006.016
Fauconnier, G., & Turner, M. (1998). Conceptual integration networks. Cognitive Science, 22,
133–187. DOI: 10.1207/s15516709cog2202_1
Fillmore, C. (1975). An alternative to checklist theories of meaning. Proceedings of the Berkeley
Linguistics Society, 1, 123–131.
Fillmore, C. (1977). Topics in lexical semantics. In P. Cole (Ed.), Current issues in linguistic
theory (pp. 76–138). Bloomington: Indiana University Press.
Fillmore, C. (1985). Frames and the semantics of understanding. Quaderni di Semantica, 6,
222–254.
Fillmore, C. (2000). Describing polysemy: The case of ‘crawl’. In Y. Ravin, & C. Leacock (Eds.),
Polysemy: Theoretical and computation approaches (pp. 91–110). Oxford: Oxford University Press.
Fillmore, C., Kay, P., & O’Connor, M. (1988). Regularity and idiomaticity in grammatical constructions: The case of let alone. Language, 64, 501–538. DOI: 10.2307/414531
Fillmore, C., & Atkins, B. (1992). Toward a frame-based lexicon: The semantics of risk and its
neighbours. In A. Lehrer, & E. Kittay (Eds.), Frames, fields, and contrasts: New essays in
semantic and lexical organisation (pp. 75–102). London: Lawrence Erlbaum.
Fischer, K. (2010). Quantitative methods in cognitive semantics. In D. Glynn, & K. Fischer
(Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 43–61). Berlin &
New York: Mouton de Gruyter. DOI: 10.1515/9783110226423
Geeraerts, D. (1987). On necessary and sufficient conditions. Journal of Semantics, 5, 275–291.
DOI: 10.1093/jos/5.4.275
Geeraerts, D. (1988). Where does prototypicality come from? In B. Rudzka-Ostyn (Ed.), Topics
in Cognitive Linguistics. Amsterdam & Philadelphia: John Benjamins.
Geeraerts, D. (1989). Prospects and problems of prototype theory. Linguistics, 27, 587–612.
DOI: 10.1515/ling.1989.27.4.587
Geeraerts, D. (1990). The lexicographical treatment of prototypical polysemy. In S. Tsohatzidis
(Ed.), Meanings and prototypes: Studies in linguistic categorization (pp. 195–210). London:
Routledge.
Geeraerts, D. (1992). The semantic structure of Dutch over. Leuvense Bijdragen, 81, 205–230.
Geeraerts, D. (1993a). Vagueness’s puzzles, polysemy’s vagaries. Cognitive Linguistics, 4, 223–
272. DOI: 10.1515/cogl.1993.4.3.223
Geeraerts, D. (1993b). Generalised onomasiological salience. In J. Nuyts, & E. Pederson (Eds.),
Perspectives on language and conceptualization (Special edition of the Belgian Journal of
Linguistics, 8) (pp. 43–56). Brussels: Editions de l’Université de Bruxelles.
Geeraerts, D. (1994). Classical definability and the monosemic bias. Rivista di Linguistica, 6,
149–172.
Geeraerts, D. (1995). Representational formats in Cognitive Semantics. Folia Linguistica, 39,
21–41.
Geeraerts, D. (1997). Diachronic prototype semantics: A contribution to historical lexicology.
Oxford: Clarendon Press.
Geeraerts, D. (1998). The semantic structure of the indirect object in Dutch. In W. Van
Langendonck, & W. Van Belle (Eds.), The dative. Vol. 2. Theoretical and contrastive studies
(pp. 185–210). Amsterdam & Philadelphia: John Benjamins.
Geeraerts, D. (1999a). Beer and semantics. In L. De Stadler, & C. Eyrich (Eds.), Issues in Cognitive Linguistics (pp. 35–55). Berlin & New York: Mouton de Gruyter.
Geeraerts, D. (1999b). Idealist and empiricist tendencies in Cognitive Semantics. In
T. Janssen, & G. Redeker (Eds.), Cognitive Linguistics: Foundations, scope, and methodology
(pp. 163–194). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110803464.163
Geeraerts, D. (2005). Lectal data and empirical variation in Cognitive Linguistics. In F. José
Ruiz de Mendoza Ibáñez, & S. Peña Cervel (Eds.), Cognitive Linguistics: Internal dynamics
and interdisciplinary interactions (pp. 163–189). Berlin & New York: Mouton de Gruyter.
Geeraerts, D. (2006a). Words and other wonders: Papers on lexical and semantic topics. Berlin &
New York: Mouton de Gruyter. DOI: 10.1515/9783110219128
Geeraerts, D. (2006b). Methodology in Cognitive Linguistics. In G. Kristiansen, M. Achard,
R. Dirven, & F. J. Ruiz de Mendoza Ibañez (Eds.), Cognitive Linguistics: Current applications and future perspectives (pp. 21–50). Berlin & New York: Mouton de Gruyter.
Geeraerts, D. (2010a). Theories of lexical semantics. Oxford: Oxford University Press.
Geeraerts, D. (2010b). Recontextualizing grammar: Underlying trends in thirty years of Cognitive Linguistics. In E. Tabakowska, M. Choinski, & L. Wiraszka (Eds.), Cognitive Linguistics in action: From theory to application and back (pp. 71–102). Berlin & New York:
Mouton de Gruyter.
Geeraerts, D., Grondelaers, St., & Bakema, P. (1994). The structure of lexical variation: Meaning,
naming, and context. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110873061
Geeraerts, D., Grondelaers, St., & Speelman, D. (1999). Convergentie en divergentie in de Nederlandse woordenschat. Amsterdam: Meertens Instituut.
Geeraerts, D. (Ed.) (1989). Prospects and problems of prototype theory (Special edition of Linguistics, 27). Berlin & New York: Mouton de Gruyter.
Givón, T. (1982). Evidentiality and epistemic space. Studies in Language, 6, 23–39.
DOI: 10.1075/sl.6.1.03giv
Givón, T. (2005). Context as other minds: The pragmatics of sociality, cognition and communication. Amsterdam & Philadelphia: John Benjamins. DOI: 10.1075/z.130
Glynn, D. (2009). Polysemy, syntax, and variation: A usage-based method for Cognitive Semantics. In V. Evans, & S. Pourcel (Eds.), New directions in Cognitive Linguistics (pp. 77–
106). Amsterdam & Philadelphia: John Benjamins.
Glynn, D. (2010a). Synonymy, lexical fields, and grammatical constructions: A study in usage-based Cognitive Semantics. In H.-J. Schmid, & S. Handl (Eds.), Cognitive foundations
of linguistic usage-patterns: Empirical studies (pp. 89–118). Berlin & New York: Mouton de
Gruyter.
Glynn, D. (2010b). Testing the hypothesis: Objectivity and verification in usage-based Cognitive Semantics. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 239–270). Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110226423
Glynn, D. (2010c). Corpus-driven Cognitive Semantics: An overview of the field. In D. Glynn,
& K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 1–
42). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226423.1
Glynn, D. (2014a). The conceptual profile of the lexeme home: A multifactorial diachronic analysis. In J. E. Díaz-Vera (Ed.), Metaphor and metonymy across time and cultures (pp. 265–
293). Berlin & New York: Mouton de Gruyter.
Glynn, D. (2014b). The social nature of anger: Multivariate corpus evidence for context effects
upon conceptual structure. In I. Novakova, P. Blumenthal, & D. Siepmann (Eds.), Emotions in discourse (pp. 69–82). Frankfurt/Main: Peter Lang.
Glynn, D. (Forthcoming). Mapping meaning: Corpus methods for Cognitive Semantics.
Cambridge: Cambridge University Press.
Glynn, D., & Fischer, K. (Eds.) (2010). Quantitative Cognitive Semantics: Corpus-driven approaches. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226423
Goldberg, A. (1991). A semantic account of resultatives. Linguistic Analysis, 21, 66–96.
Goldberg, A. (1992). The inherent semantics of argument structure: The case of the English
ditransitive construction. Cognitive Linguistics, 3, 37–74. DOI: 10.1515/cogl.1992.3.1.37
Goldberg, A. (1995). Constructions: A construction grammar approach to argument structure.
London: University of Chicago Press.
Polysemy and synonymy
Goldberg, A. (2002). Surface generalization: An alternative to alternations. Cognitive Linguistics, 13, 327–356. DOI: 10.1515/cogl.2002.022
Gries, St. Th. (1999). Particle movement: A cognitive and functional approach. Cognitive Linguistics, 10, 105–145. DOI: 10.1515/cogl.1999.005
Gries, St. Th. (2003). Multifactorial analysis in corpus linguistics: A study of particle placement.
London & New York: Continuum Press.
Gries, St. Th. (2006). Corpus-based methods and Cognitive Semantics: The many senses of
to run. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110197709
Gries, St. Th. (Forthcoming). Polysemy. In E. Dąbrowska, & D. Divjak (Eds.), Handbook of
Cognitive Linguistics. Berlin & New York: Mouton de Gruyter.
Gries, St. Th., & Stefanowitsch, A. (Eds.). (2006). Corpora in Cognitive Linguistics: Corpus-based
approaches to syntax and lexis. Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110197709
Grondelaers, St., & Geeraerts, D. (2003). Towards a pragmatic model of cognitive onomasiology. In H. Cuyckens, R. Dirven, & J. Taylor (Eds.), Cognitive approaches to lexical semantics
(pp. 67–92). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110219074.67
Halliday, M. (1967). Notes on transitivity and theme in English. Journal of Linguistics, 3, 37–81.
DOI: 10.1017/S0022226700012949
Halliday, M. (1985). An introduction to Functional Grammar. London: Edward Arnold.
Hawkins, B. (1985). The semantics of English spatial prepositions. Trier: LAUT.
Herskovits, A. (1986). Language and spatial cognition: An interdisciplinary study of the prepositions in English. Cambridge: Cambridge University Press.
Herskovits, A. (1988). Spatial expressions and the plasticity of meaning. In B. Rudzka-Ostyn
(Ed.), Topics in Cognitive Linguistics (pp. 271–297). Amsterdam & Philadelphia: John
Benjamins.
Hopper, P. (1987). Emergent grammar. Berkeley Linguistics Society, 13, 139–157.
Janda, L. (1986). A semantic analysis of the Russian verbal prefixes za-, pere-, do-, and ot-.
Munich: Otto Sagner.
Janda, L. (1990). Radial network of a grammatical category – its genesis and dynamic structure.
Cognitive Linguistics, 1, 269–288. DOI: 10.1515/cogl.1990.1.3.269
Janda, L. (1993). A geography of case semantics: The Czech dative and the Russian instrumental.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110867930
Janda, L., & Solovyev, V. (2009). What constructional profiles reveal about synonymy: A case
study of the Russian words for sadness and happiness. Cognitive Linguistics, 20, 367–393.
DOI: 10.1515/COGL.2009.018
Jones, S. (2002). Antonymy: A corpus-based approach. London: Routledge.
Kastovsky, D. (1982). Wortbildung und Semantik. Düsseldorf: Francke.
Kay, P. (1984). The kind of/sort of construction. Proceedings of the Berkeley Linguistics Society,
10, 128–137.
Kay, P., & Fillmore, C. (1999). Grammatical constructions and linguistic generalizations: The
What’s X doing Y? construction. Language, 75, 1–33. DOI: 10.2307/417472
Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities, 31, 91–113.
DOI: 10.1023/A:1000583911091
Kittay, E., & Lehrer, A. (1981). Semantic fields and the structure of metaphor. Studies in Language, 5, 31–63. DOI: 10.1075/sl.5.1.03kit
Dylan Glynn
Klavan, J. (2012). Converging and diverging evidence: Corpus-linguistic and experimental
methods for studying grammatical synonymy. Unpublished PhD dissertation, University
of Tartu.
Kleiber, G. (1990). Sémantique du prototype: Catégorie et sens lexical. Paris: Presses Universitaires de France.
Kleiber, G. (1999). Problèmes de sémantique: La polysémie en questions. Villeneuve-d’Ascq:
Presses universitaires du Septentrion.
Krawczak, K. (2014a). Epistemic stance predicates in English: A quantitative corpus-driven
study of subjectivity. In D. Glynn, & M. Sjölin (Eds.), Subjectivity and epistemicity: Corpus,
discourse, and literary approaches to stance (pp. 355–386). Lund: Lund University Press.
Krawczak, K. (2014b). Shame and its near-synonyms in English: A multivariate corpus-driven
approach to social emotions. In I. Novakova, P. Blumenthal, & D. Siepmann (Eds.), Emotions in discourse (pp. 84–94). Frankfurt/Main: Peter Lang.
Krawczak, K., & Kokorniak, I. (2012). A corpus-driven quantitative approach to the construal
of Polish think. Poznań Studies in Contemporary Linguistics, 48, 439–472.
DOI: 10.1515/psicl-2012-0021
Kreitzer, A. (1997). Multiple levels of schematization: A study in the conceptualization of space.
Cognitive Linguistics, 8, 291–325. DOI: 10.1515/cogl.1997.8.4.291
Lakoff, G. (1975). Hedges: A study in meaning criteria and the logic of fuzzy concepts. Journal
of Philosophical Logic, 2, 458–508.
Lakoff, G. (1977). Linguistic gestalts. Proceedings of the Chicago Linguistics Society, 13, 236–287.
Lakoff, G. (1982). Categories: An essay in Cognitive Linguistics. In Linguistic Society of Korea
(Ed.), Linguistics in the morning calm (pp. 139–194). Seoul: Hanshin.
Lakoff, G. (1987). Women, fire, and dangerous things: What categories reveal about the mind.
London: University of Chicago Press. DOI: 10.7208/chicago/9780226471013.001.0001
Langacker, R. (1982). Space grammar, analysability, and the English passive. Language, 58,
22–80. DOI: 10.2307/413531
Langacker, R. (1987). Foundations of Cognitive Grammar. Vol. 1. Theoretical prerequisites.
Stanford: Stanford University Press.
Langacker, R. (1991). Foundations of Cognitive Grammar. Vol. 2. Descriptive application.
Stanford: Stanford University Press.
Lehrer, A. (1982). Wine and conversation. Bloomington: Indiana University Press.
Lehrer, A. (1990a). Polysemy, conventionality, and the structure of the lexicon. Cognitive Linguistics, 1, 207–246. DOI: 10.1515/cogl.1990.1.2.207
Lehrer, A. (1990b). Prototype theory and its implication for lexical analyses. In S. Tsohatzidis
(Ed.), Meanings and prototypes: Studies in linguistic categorization (pp. 368–381). London:
Routledge.
Lehrer, K., & Lehrer, A. (1994). Fields, networks, and vectors. In F. Palmer (Ed.), Grammar and
meaning: A festschrift for John Lyons (pp. 26–47). Cambridge: Cambridge University Press.
Lehrer, A., & Kittay, E. (Eds.). (1992). Frames, fields, and contrasts: New essays in semantic and
lexical organization. Hillsdale: Lawrence Erlbaum.
Lemmens, M. (1998). Lexical perspectives on transitivity and ergativity: Causative constructions
in English. Amsterdam & Philadelphia: John Benjamins. DOI: 10.1075/cilt.166
Levshina, N. (2011). A usage-based study of Dutch causative constructions. Unpublished PhD
dissertation, University of Leuven.
Lewandowska-Tomaszczyk, B. (1996). Depth of negation: A cognitive semantic study. Łódź:
Łódź University Press.
Liamkina, O. (2007). Semantic structure of the German spatial particle über. Journal of Germanic Linguistics, 19, 115–160. DOI: 10.1017/S1470542707000050
Lindner, S. (1983). A lexico-semantic analysis of English verb-particle constructions with up and
out. Trier: LAUT.
Lipka, L. (1992). An outline of English lexicology. Tübingen: Max Niemeyer.
Lutzeier, P. (1985). Linguistische Semantik. Stuttgart: J. B. Metzler.
Lyons, J. (1968). Introduction to theoretical linguistics. Cambridge: Cambridge University Press.
DOI: 10.1017/CBO9781139165570
Meex, B. (2001). The spatial and non-spatial sense of the German preposition über. In
H. Cuyckens, & B. Zawada (Eds.), Polysemy in Cognitive Linguistics (pp. 1–36). Amsterdam & Philadelphia: John Benjamins.
Mel’čuk, I. A. (1989). Semantic primitives from the viewpoint of meaning-text linguistic theory.
Quaderni di Semantica, 10, 65–102.
Melis, L. (1990). La voie pronominale: La systématique des tours pronominaux en français moderne. Paris: Duclot.
Morgan, P. (1997). Figuring out figure out: Metaphor and the semantics of the English verb
particle construction. Cognitive Linguistics, 8, 327–358. DOI: 10.1515/cogl.1997.8.4.327
Murphy, L. (2003). Semantic relations and the lexicon: Antonymy, synonymy, and other paradigms. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511486494
Myers, D. (1994). Testing for prototypicality: The Chinese morpheme gong. Cognitive Linguistics, 5, 261–280. DOI: 10.1515/cogl.1994.5.3.261
Nerlich, B., Todd, Z., Herman, V., & Clarke, D. (Eds.). (2003). Polysemy: Flexible patterns of
meaning in mind and language. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110895698
Newman, J. (1993). The semantics of giving in Mandarin. In B. Rudzka-Ostyn (Ed.), Topics in
Cognitive Linguistics (pp. 433–486). Amsterdam & Philadelphia: John Benjamins.
Norvig, P., & Lakoff, G. (1987). Taking: A study in lexical network theory. Proceedings of the
Berkeley Linguistics Society, 13, 195–206.
Paprotté, W., & Dirven, R. (Eds.). (1985). Ubiquity of metaphor: Metaphor in language and
thought. Amsterdam & Philadelphia: John Benjamins. DOI: 10.1075/cilt.29
Pütz, M., & Dirven, R. (Eds.). (1996). The construal of space in language and thought. Berlin &
New York: Mouton de Gruyter. DOI: 10.1515/9783110821611
Rakova, M., Pethő, G., & Răkosi, C. (Eds.). (2007). The cognitive basis of polysemy: New sources
of evidence for theories of word meaning. Frankfurt/Main: Peter Lang.
Rastier, F. (1987). Sémantique interprétative. Paris: Presses universitaires de France.
Rastier, F. (1991). Sémantique et recherches cognitives. Paris: Presses universitaires de France.
Rastier, F. (2011). La mesure et le grain: Sémantique de corpus. Paris: Honoré Champion.
Rauh, G. (Ed.). (1991). Approaches to prepositions. Tübingen: Gunter Narr.
Ravin, Y., & Leacock, C. (Eds.). (2000). Polysemy: Theoretical and computational approaches.
Oxford: Oxford University Press.
Rice, S. (1993). Far afield in the lexical fields: The English prepositions. Tübingen: Gunter Narr.
Rice, S. (1999). Patterns of acquisition in the emerging mental lexicon: The case of to and for in
English. Brain and Language, 68, 268–276. DOI: 10.1006/brln.1999.2105
Rice, S., Sandra, D., & Vanrespaille, M. (1999). Prepositional semantics and the fragile link
between space and time. In M. Hiraga, C. Sinha, & S. Wilcox (Eds.), Cultural typology and
psycholinguistic issues in Cognitive Linguistics (pp. 107–127). Amsterdam & Philadelphia:
John Benjamins.
Rudzka-Ostyn, B. (1983). Cognitive Grammar and the structure of Dutch uit and Polish wy.
Trier: LAUT.
Rudzka-Ostyn, B. (1985). Metaphoric processes in word formation. In W. Paprotté, & R. Dirven
(Eds.), Ubiquity of metaphor: Metaphor in language and thought (pp. 209–241). Amsterdam & Philadelphia: John Benjamins.
Rudzka-Ostyn, B. (1989). Prototypes, schemas, and cross-category correspondences: The case
of ask. In D. Geeraerts (Ed.), Prospects and problems of prototype theory (pp. 613–661).
Berlin & New York: Mouton de Gruyter.
Rudzka-Ostyn, B. (1992). Case relations in Cognitive Grammar: Some reflexive uses of the
Polish dative. Leuvense Bijdragen, 81, 327–373.
Rudzka-Ostyn, B. (1994). The structure of the genitive category in Polish. Proceedings of the
LAUD International Symposium Language and Space, Duisburg. Republished in Rudzka-Ostyn (2000: Chapter 6).
Rudzka-Ostyn, B. (1995). Metaphor, schema, invariance: The case of verbs of answering. In
L. Goossens, P. Pauwels, B. Rudzka-Ostyn, A.-M. Simon-Vandenbergen, & J. Vanparys
(Eds.), By word of mouth: Metaphor, metonymy, and linguistic action from a cognitive perspective (pp. 205–244). Amsterdam & Philadelphia: John Benjamins. DOI: 10.1075/pbns.33
Rudzka-Ostyn, B. (1996). The Polish dative. In W. van Belle, & W. van Langendonck (Eds.),
The dative. Vol. 1. Descriptive studies (pp. 341–394). Amsterdam & Philadelphia: John
Benjamins.
Rudzka-Ostyn, B. (2000). Z rozważań nad kategorią przypadka [Considerations on the category
of case]. Kraków: Universitas.
Rudzka-Ostyn, B. (Ed.). (1988). Topics in Cognitive Linguistics. Berlin & New York: Mouton de
Gruyter. DOI: 10.1075/cilt.50
Rudzka-Ostyn, B., & Geiger, R. (Eds.). (1993). Conceptualizations and mental processing in language. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110857108
Sanders, J., & Spooren, W. (1996). Subjectivity and certainty in epistemic modality: A study of
Dutch epistemic modifiers. Cognitive Linguistics, 7, 241–264.
DOI: 10.1515/cogl.1996.7.3.241
Sandra, D., & Rice, S. (1995). Network analyses of prepositional meaning: Mirroring whose
mind – the linguist’s or the language user’s? Cognitive Linguistics, 6, 89–130.
DOI: 10.1515/cogl.1995.6.1.89
Schmid, H.-J. (1993). Cottage and co., idea, start vs. begin. Die Kategorisierung als Grundprinzip
einer differenzierten Bedeutungsbeschreibung. Tübingen: Max Niemeyer.
DOI: 10.1515/9783111355771
Schmid, H.-J. (2000). English abstract nouns as conceptual shells: From corpus to cognition.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110808704
Schmid, H.-J. (2010). Does frequency in text instantiate entrenchment in the cognitive system? In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven
approaches (pp. 101–135). Berlin & New York: Mouton de Gruyter.
Schneider, W., & Shiffrin, R. (1977). Controlled and automatic human information processing,
I: Detection, search and attention. Psychological Review, 84, 1–66.
DOI: 10.1037/0033-295X.84.1.1
Schulze, R. (1988). A short story of down. In W. Hüllen, & R. Schulze (Eds.), Understanding the
lexicon: Meaning, sense, and world knowledge in lexical semantics (pp. 395–414). Tübingen:
Niemeyer. DOI: 10.1515/9783111355191
Schulze, R. (1991). Getting round to (a)round: Towards the description and analysis of a ‘spatial’ predicate. In G. Rauh (Ed.), Approaches to prepositions (pp. 253–74). Tübingen: Gunter
Narr.
Schulze, R. (1993). The meaning of (a)round: A study of an English preposition. In A. Geiger,
& B. Rudzka-Ostyn (Eds.), Conceptualizations and mental processing in language (pp. 399–
432). Berlin & New York: Mouton de Gruyter.
Schulze, R. (1994). Image schemata and the semantics of off. In M. Schwarz (Ed.), Kognitive
Semantik: Ergebnisse, Probleme, Perspektiven (pp. 197–213). Tübingen: Gunter Narr.
Schwarz, M. (Ed.). (1994). Kognitive Semantik: Ergebnisse, Probleme, Perspektiven. Tübingen:
Gunter Narr.
Shiffrin, R., & Schneider, W. (1977). Controlled and automatic information processing, II:
Perception, learning, automatic attending and a general theory. Psychological Review, 84,
127–190. DOI: 10.1037/0033-295X.84.2.127
Speelman, D., & Geeraerts, D. (2010). Causes for causatives: The case of Dutch ‘doen’ and ‘laten’.
In T. Sanders, & E. Sweetser (Eds.), Causal categories in discourse and cognition (pp. 173–
204). Berlin & New York: Mouton de Gruyter.
Stefanowitsch, A. (2008). Negative entrenchment: A usage-based approach to negative evidence. Cognitive Linguistics, 19, 513–531. DOI: 10.1515/COGL.2008.020
Stefanowitsch, A. (2010). Empirical cognitive semantics: Some thoughts. In D. Glynn, &
K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 355–
380). Berlin & New York: Mouton de Gruyter.
Stepanov, J. S. (1997). Константы: Словарь русской культуры [Constants: A dictionary of
Russian culture]. Moscow: Shkola Jazyki Russkoj Kul’tury.
Talmy, L. (1985). Lexicalization patterns: Semantic structure in lexical forms. In T. Shopen
(Ed.), Language typology and syntactic description (pp. 57–149). Cambridge: Cambridge
University Press.
Talmy, L. (1988). Force dynamics in language and cognition. Cognitive Science, 12, 49–100.
DOI: 10.1207/s15516709cog1201_2
Taylor, J. (1988). Contrasting prepositional categories: English and Italian. In B. Rudzka-Ostyn
(Ed.), Topics in Cognitive Linguistics (pp. 299–326). Amsterdam & Philadelphia: John
Benjamins.
Taylor, J., & MacLaury, R. (1995). Language and the cognitive construal of the world. Berlin &
New York: Mouton de Gruyter. DOI: 10.1515/9783110809305
Taylor, J. (1989a). Linguistic categorization: Prototypes in linguistic theory. Oxford: Clarendon
Press.
Taylor, J. (1989b). Possessive genitives in English. In D. Geeraerts (Ed.), Prospects and problems
of prototype theory (Special edition of Linguistics 27) (pp. 663–686). Berlin & New York:
Mouton de Gruyter.
Taylor, J. (1996). On running and jogging. Cognitive Linguistics, 7, 21–34.
DOI: 10.1515/cogl.1996.7.1.21
Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition.
London & Cambridge, MA: Harvard University Press.
Tsohatzidis, S. (Ed.). (1990). Meanings and prototypes: Studies on linguistic categorization.
London: Routledge.
Tuggy, D. (1993). Ambiguity, polysemy, and vagueness. Cognitive Linguistics, 4, 273–290.
DOI: 10.1515/cogl.1993.4.3.273
Tuggy, D. (1999). Linguistic evidence for polysemy in the mind: A response to William Croft
and Dominiek Sandra. Cognitive Linguistics, 10, 343–368.
Tummers, J., Heylen, K., & Geeraerts, D. (2005). Usage-based approaches in Cognitive Linguistics: A technical state of the art. Corpus Linguistics and Linguistic Theory, 1, 225–261.
DOI: 10.1515/cllt.2005.1.2.225
Tyler, A., & Evans, V. (2003). Reconsidering prepositional polysemy networks: The case of over.
In B. Nerlich, Z. Todd, V. Herman, & D. Clarke (Eds.), Polysemy: Flexible patterns of meaning in mind and language (pp. 95–160). Berlin & New York: Mouton de Gruyter.
Vandeloise, C. (1986). L’espace en français. Paris: Seuil.
Vandeloise, C. (1990). Representation, prototypes, and centrality. In S. Tsohatzidis (Ed.), Meanings and prototypes: Studies on linguistic categorization (pp. 403–437). London: Routledge.
Vandeloise, C. (1994). Methodology and analysis of the preposition in. Cognitive Linguistics, 5,
157–184. DOI: 10.1515/cogl.1994.5.2.157
Verschueren, J. (1981). Problems of lexical semantics. Lingua, 53, 317–351.
DOI: 10.1016/0024-3841(81)90046-2
Victorri, B., & Fuchs, C. (1996). La polysémie: construction dynamique du sens. Paris: Hermès.
Vorkachev, S. G. (2004). Счастье как лингвокультурный концепт [Happiness as a cultural-linguistic concept]. Moscow: Gnozis.
Vorlat, E. (1985). Metaphors and their aptness for trade names in perfumes. In W. Paprotté, &
R. Dirven (Eds.), Ubiquity of metaphor: Metaphor in language and thought (pp. 263–294).
Amsterdam & Philadelphia: John Benjamins.
Wierzbicka, A. (1985). Lexicography and conceptual analysis. Ann Arbor: Karoma.
Wierzbicka, A. (1989). Prototypes in semantics and pragmatics: Explicating attitudinal meanings in terms of prototypes. In D. Geeraerts (Ed.), Prospects and problems of prototype
theory (pp. 731–769). Berlin & New York: Mouton de Gruyter.
Wierzbicka, A. (1990). Prototypes ‘save’: On the uses and abuses of the notion of ‘prototype’ in
linguistics and related fields. In S. Tsohatzidis (Ed.), Meanings and prototypes: Studies on
linguistic categorization (pp. 347–367). London: Routledge.
Wierzbicka, A. (1996). Semantics: Primes and universals. Oxford: Oxford University Press.
Wulff, S., Stefanowitsch, A., & Gries, St. Th. (2007). Brutal Brits and persuasive Americans:
Variety-specific meaning construction in the into-causative. In G. Radden, K.-M. Köpcke,
Th. Berg, & P. Siemund (Eds.), Aspects of meaning construction (pp. 265–281). Amsterdam
& Philadelphia: John Benjamins.
Wulff, S. (2006). Go-V vs. go-and-V in English: A case of constructional synonymy? In St. Th.
Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 101–126). Berlin & New York: Mouton de Gruyter.
Zelinsky-Wibbelt, C. (Ed.). (1993). The semantics of prepositions. Berlin & New York: Mouton
de Gruyter. DOI: 10.1515/9783110872576
Zlatev, J. (2003). Polysemy or generality? Mu. In H. Cuyckens, R. Dirven, & J. Taylor (Eds.),
Cognitive approaches to lexical semantics (pp. 447–494). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110219074.447
Competing ‘transfer’ constructions in Dutch
The case of ont-verbs
Martine Delorge, Koen Plevoets, and Timothy Colleman
Ghent University
This paper examines the semantic relations between the constructions of
“possessional transfer” (i.e. constructions used to encode events of possessional
transfer) in Dutch by zooming in on a specific morphological class of dispossession verbs, viz. verbs with the prefix ont- ‘away’, such as ontnemen ‘take away’,
ontfutselen ‘fish out of’, onttrekken ‘extract, withdraw’, ontheffen ‘relieve’, etc. A
database with several thousand attested ont-examples from various corpora of
present-day written Dutch will serve as the starting point for an investigation of
their constructional possibilities and preferences: the ont-verbs will be shown
to cluster into a number of subclasses in terms of alternation possibilities. In
addition, a comparison of these present-day Dutch results with data from a
diachronic corpus of 19th century Dutch will reveal a number of lexico-grammatical shifts: the use of the double object construction and (especially) of the
aan-dative with ont-verbs is more heavily constrained now than it was in earlier
stages of the language.
Keywords: aan-dative, alternations, dispossession verbs, Dutch, ont-verbs
1. Introduction1
As shown in (1) to (5) below, the grammar of present-day Dutch includes several different argument structure constructions that can be used for the encoding of
three-participant events of ‘possessional transfer’. The constructions in (1) and (2)
have received by far the most linguistic attention, since they constitute the well-known dative alternation: (1) exemplifies the double object construction in which
1. All three authors are associated with the BOF/GOA research project on ‘Meaning in-between structure and the lexicon’ in the Linguistics department at Ghent University. We would
like to thank the editors as well as two anonymous referees for their helpful comments and
suggestions. The usual disclaimers apply.
the verb is combined with a subject and two bare NP objects, (2) exemplifies the so-called aan-dative, in which the theme is coded as a direct object but the recipient is
marked with the preposition aan (cognate with on, but relevantly similar to to). Existing studies of these constructions and their semantic relations include Van Belle and
Van Langendonck (1996), Janssen (1997), Geeraerts (1998), Colleman (2009a) and
Colleman and De Clerck (2009). In addition to the aan-dative, there are a number
of structurally similar constructions formed with other prepositions that can also be
used to encode certain subtypes of ‘possessional transfer’ events. One of these is the
construction in (3), in which the preposition van marks the source of the transfer in
a ‘dispossession’ event. At first sight, the example in (4) – which also denotes an event
of dispossession – exemplifies the very same construction. However, there is a crucial
difference in the linking of argument roles between (3) and (4): in the latter clause,
the direct object codes the possessional source, and van marks the theme. In terms
of the distinction between the basic types of ditransitive alignment put forward in
typological research by Haspelmath (2005) and Malchukov and colleagues (2010), the
constructions in (2) and (3) represent indirective alignment (i.e. the theme is coded
like the monotransitive patient, i.e. as a bare NP object, and the recipient/possessor
is coded differently) but the construction in (4) represents secundative alignment (i.e.
the recipient/possessor argument is coded like the monotransitive patient while the
theme gets special marking). The met-construction in (5) is also secundative, but it
denotes a fairly prototypical event of ‘caused possession’ rather than ‘dispossession’.
(1) De man heeft de vrouw een boek gegeven.
‘The man has given the woman a book.’
(2) De man heeft een boek aan de vrouw gegeven.
‘The man has given a book to the woman.’
(3) De man heeft een boek gestolen van de vrouw.
‘The man has stolen a book from the woman.’
(4) De man heeft de vrouw beroofd van al haar boeken.
‘The man has robbed the woman of all her books.’
(5) De man heeft de vrouw begiftigd met een boek.
‘The man presented the woman with a book.’
In terms of constructional polysemy, the first two constructions cover a wider region
in semantic space than the latter three. While the instances in (1) and (2) denote
events of ‘caused possession’ with agent, theme and recipient participants, both the
double object construction and the aan-dative can also be used to encode events of
dispossession with agent, theme and (human) source participants, albeit with rather
limited lexical possibilities (see Geeraerts 1998 and Colleman & De Clerck 2009 for
examples and further discussion). (6) and (7) present examples with the dispossession
verb ontnemen ‘take away’.
(6) De man heeft de vrouw een boek ontnomen.
‘The man has taken a book from the woman.’ (lit. has away-taken the woman
a book)
(7) De man heeft een boek aan de vrouw ontnomen.
‘The man has taken a book from the woman.’
The van-constructions in (3) and (4), by contrast, are limited to events of dispossession, whereas the met-construction in (5) is limited to a handful of infrequent and formal verbs of giving and cannot be used to encode the reverse orientation of transfer.2
2. Introducing the Dutch ont-verbs
In this paper we will try to shed more light on the synchronic and diachronic semantic relations between the argument structure constructions introduced in section 1
by zooming in on a specific morphological class of dispossession verbs, viz. complex
verbs with the prefix ont- ‘away’. While the ont-verbs are not particularly frequent in
everyday language, they nevertheless constitute an interesting class in that many of
them are found in several of the above constructions. Moreover, even though, semantically, ont-verbs of dispossession seem to form a rather homogeneous class at first
sight, they nevertheless display very different constructional preferences. As such, a
quantitative investigation of the degree of attraction between these ont-verbs and the
argument structure constructions in question can shed more light on the issue of
constructional competition between constructions with partly overlapping semantic
ranges: how do the constructions illustrated in section 1 divide up the semantic domain of ‘dispossession’, which is itself a sub-domain of ‘possessional transfer’? Moreover, since we will not only look into the constructional preferences of the selected
ont-verbs in present-day data, but also in data from a corpus of 19th century Dutch,
it can also be investigated whether the semantic relation between the ‘dispossession’
constructions in question has changed in the course of the last century and a half.
As such, the present study adds to the growing body of diachronic investigations of
constructional semantics (see, e.g., Barðdal 2007 and Colleman & De Clerck 2011 for
earlier studies).
The following 15 ont-verbs were selected for the case study, listed in alphabetical
order: ontdoen ‘strip’, ontfutselen ‘filch, fish out of’, ontheffen ‘release’, ontlenen ‘take,
derive, borrow’, ontladen ‘unload, be released’, ontlasten ‘relieve’, ontlokken ‘elicit
2. There is a single lexical exception in the case of the construction in (4): next to a set of verbs
of dispossession, this pattern also accommodates the giving verb voorzien ‘provide’, as in Ze
voorzagen hem van voedsel (‘They provided him with food’). Most probably, this is a calque of
the French pattern pourvoir quelqu’un de quelque chose.
(from)’, ontnemen ‘take away’, ontroven ‘steal away’, ontrukken ‘snatch (away)’, ontstelen ‘steal away’, onttrekken ‘withdraw, take away, derive’, ontvreemden ‘steal, thieve’,
ontworstelen ‘tear, wrest from’ and ontwringen ‘wrench from’. These verbs were chosen on the basis of (i) their semantic status as (prototypical) dispossession verbs and
(ii) frequency considerations.
Semantically, ditransitive ont-verbs belong to two categories, good examples of
which are ontnemen ‘take away from’ and ontzeggen ‘deny’, respectively. Whereas ontnemen denotes an event of dispossession, as in (6) and (7) above, ontzeggen instead
denotes a blocked ‘caused possession’ event, as in (8). Put differently, ontnemen and
ontzeggen are verbs of taking and not-giving, respectively. In terms of Colleman’s
(2009b) multidimensional analysis of the semantic structure of the Dutch double
object construction, the examples in (6) and (8) represent distinct subsenses of the
construction, departing from the core ‘caused reception’ meaning along the semantic
dimensions of ‘orientation of the transfer’ and ‘polarity of the transfer’, respectively.
(8) Ze hebben vrouwen lange tijd het stemrecht ontzegd.
‘For a long time, they denied women the right to vote.’
In this case study, we focus on ont-verbs of the ‘dispossession’ type. The verbs under
investigation select agent, theme and possessional source participants. Note that this
is not to say that they occur exclusively in three-argument constructions: on the contrary, several of them occur far more often in two-argument constructions without
an expressed source (see below). In terms of Goldbergian construction grammar, the
selected verbs select a possessional source participant, but this participant need not be
lexically profiled – though it is for the most prototypical members of the category, such
as ontnemen (‘take away’) or ontfutselen (‘filch, fish out of’) (for further details on the
notion of lexical profiling, see Goldberg 1995: 43–48).
The formation of dispossession verbs with the prefix ont- is not a productive
word-formation pattern anymore, but dictionaries include quite a lot of examples,
many of which are highly infrequent in present-day language (e.g. ontsjacheren ‘to
barter sth. away from s.o.’ or ontschaken ‘to take s.o. away from s.o. by abduction’, to
give but two examples). In order to avoid the selection of all too infrequent verbs for
the case study, we only selected verbs which are included in the CELEX lexical database of Dutch. In addition, preference was given to verbs with a CELEX frequency of
at least 3 occurrences per one million words of running text.3 However, this frequency
criterion was applied liberally: an exception was made for a number of infrequent
verbs that we wanted to include in the investigation for semantic reasons, such as
ontroven and ontstelen, the verb bases of which are the semantically prototypical and
highly frequent dispossession verbs roven ‘rob’ and stelen ‘steal’.
3. The CELEX frequency was checked using the WordGen tool developed by Wouter Duyck
and colleagues (Duyck et al. 2004).
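The selection criterion just described can be sketched schematically in a few lines of code. This is purely illustrative: the threshold of 3 occurrences per million words and the two semantically motivated exceptions (ontroven, ontstelen) come from the text above, but all frequency values below are invented for the example and are not the actual CELEX figures.

```python
# Sketch of the verb-selection criterion described above. The threshold and
# the exception set come from the text; every frequency value is invented
# for illustration and is NOT an actual CELEX figure.

FREQ_THRESHOLD = 3.0  # occurrences per million words of running text
SEMANTIC_EXCEPTIONS = {"ontroven", "ontstelen"}  # kept for semantic reasons

celex_freq_per_million = {
    "ontnemen": 12.0,      # hypothetical value
    "onttrekken": 9.5,     # hypothetical value
    "ontfutselen": 3.2,    # hypothetical value
    "ontroven": 0.4,       # below threshold, but a semantic exception
    "ontstelen": 0.3,      # below threshold, but a semantic exception
    "ontsjacheren": 0.1,   # below threshold and therefore excluded
}

# A verb is selected if it meets the frequency criterion or is one of the
# semantically motivated exceptions.
selected = sorted(
    verb
    for verb, freq in celex_freq_per_million.items()
    if freq >= FREQ_THRESHOLD or verb in SEMANTIC_EXCEPTIONS
)
print(selected)
```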
3. Methodology of the case study
A database consisting of several thousand attested ont-examples from various
corpora of present-day written Dutch serves as the starting point for an investigation of the constructional flexibility of these verbs: the newspaper component of the
present-day CONDIV-corpus, the 27 and 38 Million Words Corpora of the Dutch
Institute for Lexicology (INL) and the Twente Nieuws Corpus (TNC). The selected
corpora all represent relatively formal genres of written Dutch (newspapers, magazines, fictional and non-fictional prose, etc.). All forms of the 15 test verbs were
automatically retrieved from the corpora, and the results were manually filtered and
labelled according to syntactic construction. We used more than one corpus in order
to be able to retrieve enough examples of the selected verbs to make well-grounded
judgements about their constructional behaviour. After all, some of the verbs from
the selection, as already mentioned, are quite infrequent. The different corpora are
similar enough in their contents to justify combining them. Section 4 presents and discusses the main findings of the synchronic investigation.
In addition, we compared the present-day Dutch results with data from a corpus
of 19th-century Dutch in order to investigate possible lexicogrammatical shifts. The
diachronic part of the study is based on a corpus consisting of 50 volumes from the
periodical De Gids (1850–1899). De Gids (‘The Guide’) was an influential literary and
general cultural journal, and as such represents formal written Dutch. The results of
the diachronic investigation are presented in Section 5.
4. The results of the present-day investigation
4.1
Overall distribution
As was noted in Section 2 above, the overall aim of the investigation is to examine the constructional competition between the various Dutch argument structure
constructions that can be used to encode ‘dispossession’ events, through an identification of the constructional preferences of a sample of ont-verbs in (a) present-day
data and (b) 19th century data. Table 1 presents the results of the synchronic part of
the corpus investigation. We distinguish six different three-argument constructions,
four of which have already been introduced in Section 1: the double object construction (DOC), the aan-dative, and two constructions with van, which are labeled van-I
(“indirective”) and van-S (“secundative”), respectively. The remaining constructions
distinguished in the table are two fairly infrequent indirective patterns in which the
(possessional) source argument is marked with the prepositions bij ‘at’ and uit ‘out
(of)’, respectively. In addition, most of the investigated verbs occur in other constructions, too, i.e. in various two-argument constructions. In five cases, two-participant
44 Martine Delorge, Koen Plevoets, and Timothy Colleman
constructions even account for the large majority of occurrences, viz. with ontvreemden ‘steal, thieve’, ontworstelen ‘wrest from’, onttrekken ‘withdraw, take away’, ontlasten
‘relieve’, and ontladen ‘unload’. In fact, the verb ontladen turned out to display just a
single three-argument instance in the corpus data, from a total of 331 occurrences,
viz. a single instance of the secundative van-construction. Because of this sparsity of
data, this verb is excluded from further discussion. The rather frequent occurrence of
verbs of dispossession in structures without a possessional source object is not very
surprising. Newman (1996: 57) has already observed that “there is no giver necessarily
present in the base of take […] There is only one person necessarily involved in the
characterization of the basic meaning of TAKE”. As shown in Delorge (2010), simplex
verbs of dispossession, such as stelen ‘steal’, roven ‘rob’, etc. typically display the same
characteristic.4
In the remainder of this section, we remove the monotransitive and other
two-argument constructions from consideration and focus on the distribution of the
three-argument constructions. The distribution of the 14 remaining ont-verbs over
the 6 three-argument constructions distinguished in Table 1 is statistically significant
(χ² = 46741.05, df = 65, p < 2.2e-16). Table 4 in the Appendix lists the Pearson residuals. The Cramér’s V effect size is 0.7125667, which suggests a strong association. Note
that effect sizes tend to increase with the number of rows and columns, however, and
that there are quite a lot of cells with an expected frequency of less than 5, which reduces the value of the chi-square test. This is why we turn to an exploratory statistical
tool in the next sub-section.
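As a check on the reported figures, the chi-square statistic and Cramér's V can be recomputed directly from the frequencies in Table 1. The sketch below does this in Python with numpy; the chapter does not say which software the authors used, so this is an illustration of the computation rather than a reproduction of their script:

```python
import numpy as np

# Frequencies of the 14 ont-verbs over the six three-argument constructions
# (columns: DOC, aan, van-I, van-S, uit, bij), as listed in Table 1.
table1 = np.array([
    [0, 0, 0, 2575, 0, 0],        # ontdoen
    [458, 146, 19, 0, 4, 1],      # ontfutselen
    [0, 5, 0, 697, 0, 0],         # ontheffen
    [0, 0, 0, 111, 0, 0],         # ontlasten
    [0, 6087, 2, 0, 17, 4],       # ontlenen
    [587, 342, 23, 0, 2, 53],     # ontlokken
    [3993, 207, 15, 0, 2, 3],     # ontnemen
    [2, 3, 0, 0, 0, 0],           # ontroven
    [6, 189, 0, 0, 7, 0],         # ontrukken
    [104, 3, 0, 0, 0, 0],         # ontstelen
    [2, 2248, 7, 0, 46, 0],       # onttrekken
    [8, 3, 33, 0, 298, 43],       # ontvreemden
    [7, 26, 0, 0, 1, 0],          # ontworstelen
    [11, 11, 0, 0, 0, 0],         # ontwringen
], dtype=float)

n = table1.sum()
# expected cell counts under independence of verb and construction
expected = np.outer(table1.sum(axis=1), table1.sum(axis=0)) / n
chi2 = ((table1 - expected) ** 2 / expected).sum()
df = (table1.shape[0] - 1) * (table1.shape[1] - 1)          # 13 * 5 = 65
# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v = np.sqrt(chi2 / (n * (min(table1.shape) - 1)))
print(f"chi2 = {chi2:.2f}, df = {df}, Cramer's V = {cramers_v:.4f}")
```

The output matches the values reported above (chi-square of about 46741 with 65 degrees of freedom, V of about 0.71).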
4.2
Four clusters of ont-verbs
As a first observation, it can be seen from the frequencies in Table 1 that there is a
nearly complete lexical split between the secundative van-construction on the one
hand and the other five three-argument constructions included in the table on the
other. While many verbs occur in the DOC as well as in several of the indirective constructions in the corpus data, the secundative construction is associated with a number of verbs that occur in this three-argument construction (virtually) exclusively, viz.
ontlasten ‘relieve’, ontdoen ‘strip’, and ontheffen ‘release’ (in addition, ontladen ‘unload’
was also attested in this construction only, see Section 4.1). Corpus examples are listed in (9) to (11). As can be seen from these examples, the theme of ontheffen and
4. It should be noted that constructions with a reflexive pronoun, such as the ontworstelen example in (i) below, were included in the rest category, too, as they do not contain three
arguments.
(i) Een deel van de Otavalo-Indianen probeert zich aan die armoede te ontworstelen.
[TNC]
‘Some of the Otavalo Indians are trying to wrest themselves out of that poverty.’
Table 1. Observed frequencies in the present-day data

Verb            DOC    aan  van-I  van-S   uit   bij  3-arg total  Other cxs  Total
ontdoen           0      0      0   2575     0     0         2575       1888   4463
ontfutselen     458    146     19      0     4     1          628         28    656
ontheffen         0      5      0    697     0     0          702        321   1023
ontlasten         0      0      0    111     0     0          111        956   1067
ontlenen          0   6087      2      0    17     4         6110         65   6175
ontlokken       587    342     23      0     2    53         1007        256   1263
ontnemen       3993    207     15      0     2     3         4220        270   4490
ontroven          2      3      0      0     0     0            5          0      5
ontrukken         6    189      0      0     7     0          202          7    209
ontstelen       104      3      0      0     0     0          107          3    110
onttrekken        2   2248      7      0    46     0         2303       3562   5865
ontvreemden       8      3     33      0   298    43          385        653   1038
ontworstelen      7     26      0      0     1     0           34        859    893
ontwringen       11     11      0      0     0     0           22          1     23
Total          5178   9270     99   3383   377   104        18411       8869  27280

(DOC, aan, van-I, van-S, uit and bij are the three-argument constructions, summed in the ‘3-arg total’ column; ‘Other cxs’ covers the various two-argument constructions.)
ontlasten is typically an abstract entity, such as a task or duty. In the case of ontdoen,
the lexical possibilities are wider. None of these verbs is a very prototypical representative of the class of ‘dispossession’ verbs semantically, unlike some of the other verbs
which will be discussed below. Ontlasten and ontheffen imply a positive effect on the
original possessor, who is relieved of something conceptualized as a burden – needless
to say, in prototypical dispossession events, the transfer has a negative effect on the
original possessor. As for ontdoen, this denotes an event of physical removal or separation rather than actual dispossession: typically, as in (11), the object NP does not
refer to a human participant.5
(9) Inmiddels zijn creatieve oplossingen gevonden om hen te ontlasten van hun
zware taak. [TNC]
‘In the meantime, creative solutions have been found to relieve them of their
heavy task.’
5. The object NP can refer to a human participant, as in (ii) below. However, even such instances seem to denote an event of physical removal rather than of actual dispossession.
(ii) Het Hof van Cassatie achtte het onmogelijk dat de man de 18-jarige van haar broek
had kunnen ontdoen als zij zich daar tegen had verzet. [TNC]
‘The court of cassation judged that it would have been impossible for the man to strip
the 18-year-old of her pants if she had offered resistance.’
(10) De bisschop onthief haar van haar functie. [TNC]
‘The bishop released her from office.’
(11) Met snijbranders ontdoen de slopers de molens van hun wieken. [TNC]
‘With cutting torches, the demolishers are stripping the mills of their sails.’
Returning to the syntactic possibilities, ontheffen is the only one of these verbs that
is also attested in another three-argument construction, viz. the indirective aan-construction, but only very sporadically so (a mere 5 out of 1023 corpus instances).
Conversely, none of the remaining 11 ont-verbs is attested in the secundative van-construction even once. This situation is reminiscent of the lexical split found
with verbs of giving, where the secundative pattern is also limited to a handful of
verbs that are not eligible for use in the DOC or the aan-dative, including begiftigen,
as in (5) above (see Delorge & De Clerck 2007).
Most of the remaining 11 verbs are used in the DOC as well as in several indirective constructions. However, a manual inspection of the frequencies in Table 1 (and
of the Pearson residuals in Table A) suggests that they cluster into a number of classes
with distinct constructional preferences. We used an exploratory statistical technique,
viz. Correspondence Analysis (CA), to visualize such associations in the data: see
the plot in Figure 1, in which the constructions are represented in capitals and the
verbs in lower case (on CA, see Greenacre 2007, as well as the articles by Glynn in the
present volume). The eigenvalues for the first two dimensions are 53.8% and 44.33%,
respectively, indicating that the analysis presented in Figure 1 explains 98.13% of the
variation (inertia). The table used to produce the correspondence analysis includes
many small cells, and so the numerical summary of the analysis is included in the Appendix, Table 5. To assure a reasonable degree of accuracy in the analysis, the quality
score (qlt) should be over 500. A figure of 500 indicates that 50% of the inertia for that
data point lies off the principal axes and therefore that points are less accurately displayed
in the plot (see Glynn, this volume, and Greenacre 2007 for more details). The remainder of this sub-section interprets the visualization in Figure 1.
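The chapter does not specify which CA implementation was used. As an illustration, the classical computation, an SVD of the matrix of standardised residuals, can be sketched in Python with numpy on the sub-table of the 11 non-secundative verbs; the squared singular values are the principal inertias whose shares give the percentages quoted above:

```python
import numpy as np

# Three-argument frequencies from Table 1 for the 11 verbs not restricted
# to the secundative van-construction (columns: DOC, aan, van-I, uit, bij).
verbs = ["ontfutselen", "ontlenen", "ontlokken", "ontnemen", "ontroven",
         "ontrukken", "ontstelen", "onttrekken", "ontvreemden",
         "ontworstelen", "ontwringen"]
N = np.array([
    [458, 146, 19, 4, 1],
    [0, 6087, 2, 17, 4],
    [587, 342, 23, 2, 53],
    [3993, 207, 15, 2, 3],
    [2, 3, 0, 0, 0],
    [6, 189, 0, 7, 0],
    [104, 3, 0, 0, 0],
    [2, 2248, 7, 46, 0],
    [8, 3, 33, 298, 43],
    [7, 26, 0, 1, 0],
    [11, 11, 0, 0, 0],
], dtype=float)

P = N / N.sum()                       # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)   # row and column masses
# standardised residuals: S_ij = (p_ij - r_i * c_j) / sqrt(r_i * c_j)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
inertias = sv ** 2                    # principal inertia per dimension
share = inertias / inertias.sum()     # proportion of inertia explained
# principal coordinates of the verbs on the first two dimensions
row_coords = (U * sv)[:, :2] / np.sqrt(r)[:, None]
print("share of inertia per dimension:", np.round(share, 4))
```

Plotting `row_coords` (and the analogous column coordinates) against the first two dimensions yields a map of the kind shown in Figure 1; the total inertia equals the chi-square of the sub-table divided by its sample size.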
The only outlier among the 11 verbs is ontvreemden ‘steal, thieve’, which is
grouped with the uit-construction on the right-hand side of the plot. Indeed, ontvreemden is the only verb which is combined with a uit-phrase in the majority of its
three-participant uses. Unlike the “secundative” verbs discussed above, ontvreemden,
which is a hyponym of stelen ‘steal’, denotes a prototypical dispossession event involving a human possessor: if something is stolen, it is by definition stolen from someone.
However, ontvreemden differs from ontnemen ‘take away from’ and other verbs to be
discussed below in that this human possessor participant is not lexically profiled: in
the majority of instances, the possessor role is not expressed. In fact, this is a characteristic ontvreemden shares with its hypernym (see Goldberg’s 1995: 45–46 semantic
discussion of English steal and the frequency data for Dutch stelen in Delorge 2010).
The uit-construction differs from the other three-argument constructions in Table 1
in that the PP refers to a locational source rather than to a possessional source. In (12),
[Plot: verbs in lower case and constructions in capitals (DOC, AAN, VAN.I, UIT, BIJ), plotted against Dimension 1 (53.80%) and Dimension 2 (44.33%).]
Figure 1. Correspondence analysis of the present-day data (without the secundative construction)
for instance, the jewellery store is the place where the necklace was stolen, though this
of course refers indirectly to the person from whom it was stolen, viz. the jeweller.
While this is a pattern typically found with ontvreemden, the van-example in (13)
shows that the verb does also occur in three-argument constructions with a genuine
possessional source argument.
(12) Fadiga ontvreemdde het collier, met een waarde van 300 euro, zondag uit een
juwelierszaak. [TNC]
‘On Sunday, Fadiga stole the necklace worth 300 € from a jewelry store.’
(13) Muijtstege had de twee wel vaker financieel geholpen en ze hadden ook al
eerder geld van hem ontvreemd. [TNC]
‘Muijtstege had given financial help to the two [petty criminals] before, and
they had stolen money from him before as well.’
On the left-hand side, we find a kind of cline of verbs positioned in-between the DOC
and the aan-dative. Starting at the top, the two verbs that are most closely associated
with the DOC are ontnemen ‘take away from’ and ontstelen ‘steal away from’. Indeed,
these are found with double object syntax in the large majority of their instances
across the four corpora: the DOC accounts for 88.9% of the ontnemen instances in the
database and for 94.5% of the ontstelen instances. Good examples of this are shown
in (14) and (15).
(14) Verder mag de leiding verslaafden hun paspoort en geld niet meer ontnemen.
[TNC]
‘Furthermore, the leaders are no longer allowed to dispossess addicts of their
passports and money.’ (lit. to away-take addicts their passport and money)
(15) 10000 frank kan de miljoenen niet vervangen die de staat mij ontstolen heeft.
[CONDIV]
‘10,000 francs cannot replace the millions that the state has stolen from me.’
(lit. that the state has me away-stolen)
In both cases, the observed preference for the DOC can be straightforwardly linked
to the verbs’ lexical semantics. The verb bases nemen ‘take’ and stelen ‘steal’ are among
the most prototypical simplex verbs of taking away. Similarly, their prefixed variants
ontnemen and ontstelen denote prototypical dispossession events in which a human
participant is deprived of an item in his/her possession, typically by another human
participant. This item need not be a concrete object – it is in (14), but not, or less so,
in (15) – but, in any event, there was a fairly prototypical possession relation between
the indirect and direct object referents before the event, which is broken by instigation
of the subject referent. Note that, unlike many of the other verbs in Table 1, ontnemen
and ontstelen occur in two-argument constructions in only a small fraction of their
instances (6% and 2.7%, respectively): these are typical three-participant verbs, with a
lexically profiled possessional source participant.
Van Belle and Van Langendonck (1996: 245) have also observed the preference
of ontnemen for the double object construction, albeit on an introspective basis. They
ascribe this to the factor [+/– involvement], which, in their account, is one of the two
major semantic determinants of the dative alternation in Dutch, next to [+/– (material) transfer]. Ontnemen lexicalizes a transfer of possession which has a fundamental
effect on the original possessor, and the same applies to ontstelen: the ‘transfer’ events
denoted in (14) and (15) above (negatively) affect the indirect object referent, i.e. the
loss of the direct object referent has clear consequences for the indirect object and
his/her further actions. Since the DOC is hypothesized to highlight the relationship
between the direct and the indirect object referents, the lexical semantics of verbs
such as ontnemen and ontstelen tallies well with the constructional semantics of the
double object construction (cf. De Schutter 1974:â•›205; Verhagen 1986 and Colleman
2009b; inter alia).
According to Van Belle and Van Langendonck (1996: 245), the only reason why
verbs such as ontnemen can appear in the aan-dative at all, despite their strong lexical
focus on the affectedness of the possessional source participant, is that there is also a
‘caused motion’ event involved: the theme moves from the domain of possession of
the source to that of the agent. An often advanced semantic explanation for the dative
alternation posits that the aan-structure emphasizes the spatial aspects of the denoted
transfer scene, i.e. the movement of the theme along a path from the agent participant towards the recipient participant (or, in this case, from the source participant
Table 2. DOC and aan-frequencies in instances with concrete vs. abstract themes

                          DOC   aan  Total
theme = concrete entity   411    19    430
theme = abstract entity  2786   110   2896
Total                    3197   129   3326
toward the agent participant) (see, e.g. Goldberg 1992 and Langacker 1991 for similar
semantic hypotheses about the English to-dative, and Colleman & De Clerck 2009
for further discussion). This would seem to suggest that material transfers generally
prefer the aan-construction, or, at least, that the proportion of aan-instances to DOC
instances is larger when the theme is a concrete object than when it is some kind of
abstract commodity. The ontnemen data, however, show that no significant difference
can be attested between cases with a concrete theme on the one hand, and cases with
an abstract theme on the other hand; see the distribution in Table 2 (χ² = 0.386, df = 1,
p = 0.5342).6
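The reported value corresponds to a Pearson chi-square on the 2-by-2 table without continuity correction, as the minimal Python sketch below illustrates (with Yates' correction the statistic would come out slightly lower):

```python
import math

# Table 2: ontnemen instances cross-classified by construction (DOC vs.
# aan-dative) and theme type (concrete vs. abstract entity).
a, b = 411, 19      # concrete theme: DOC, aan
c, d = 2786, 110    # abstract theme: DOC, aan

n = a + b + c + d
# Pearson chi-square for a 2x2 table, without Yates' continuity correction
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
# two-sided p-value for df = 1 via the chi-square/normal relationship
p = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

This reproduces the chi-square of about 0.386 and p of about 0.534 given above: no evidence that theme type affects the DOC/aan choice with ontnemen.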
This warns us against an overly literal interpretation of the ‘caused motion’ hypothesis: even when the theme is a concrete entity which undergoes a spatial transfer as it changes ownership, the DOC is still the preferred construction in the large
majority of cases. The general semantic hypothesis put forward in Colleman (2009b)
is that the aan-dative highlights the changing agent-theme relationship, whereas the
DOC highlights all three participants and their interrelations, including the recipient
participant. We leave it to future research to test the validity of this hypothesis in a
more systematic way for verbs with a possessional source rather than a recipient, but,
in any event, ontnemen and ontstelen fit the picture, with their strong lexical focus on
the affectedness of the possessor participant.
As we move from the top to the bottom of the left-hand side in Figure 1, the
lexical preferences shift towards the aan-dative. Ontfutselen ‘filch, fish out of ’ and
ontlokken ‘elicit from’ are the only two remaining verbs for which the DOC is the most
frequently attested construction in the database (accounting for 69.8% and 46.5% of
all occurrences, respectively). There is a twofold reason why they are positioned lower
in the plot than ontnemen and ontstelen: (i) because their proportion of aan-examples
is relatively higher (22% and 27%, respectively, as opposed to less than 5% for both
ontnemen and ontstelen) and (ii) because they are the only verbs, next to ontvreemden
‘steal’, which are found in the indirective van-construction (in the case of ontfutselen)
or in the bij-construction (in the case of ontlokken) somewhat more than sporadically,
though it should be stressed that such uses still account for only a small fraction of
6. The frequencies in Table 2 do not add up to the overall frequencies of the DOC and aan-dative with ontnemen listed in Table 1, as a good number of cases were not straightforwardly
classifiable into either concrete or abstract categories but represented various in-between uses.
their total number of occurrences. We will briefly return to the constructions with van
and bij at the end of this sub-section.
Still further down, in the lower-left corner, are those verbs which occur with aan
in the majority of their (three-argument) uses. The verbs associated most closely with
the aan-dative are ontlenen ‘take, derive, borrow’, onttrekken ‘withdraw from, derive’,
ontrukken ‘snatch away, pull away’ and ontworstelen ‘wrest from’: this construction
accounts for 99.6%, 97.6%, 93.6% and 76.5% of the overall number of three-argument
instances, respectively. The first thing to observe is that these verbs cover a much
wider region in semantic space than just ‘possessional transfer’: typical examples are
shown in (16) to (19) below, none of which involves a human possessor participant
that is ‘dispossessed’ of something in the strict sense of the word: rather, these instances denote various kinds of (metaphorical) ‘separation’ or ‘withdrawal’ events.
(16) Of Jeroen Bosch inspiratie ontleende aan al dat water in de stad, is niet
bekend. [TNC]
‘Whether Hieronymus Bosch derived inspiration from all that water in the
town is unknown.’
(17) Tegenwoordig is [dat] de meest gebruikte methode om geur aan natuurlijke
producten te onttrekken. [TNC]
‘Presently, that is the most usual method for extracting the odour from natural
products.’
(18) Bourguiba mag zijn land hebben ontrukt aan de Franse overheersing, hij was
niettemin een francofiel. [TNC]
‘Bourguiba may have wrenched his country from French rule, he was a francophile nonetheless.’
(19) Dat spoort met een tendens om de VS te ontworstelen aan internationale
afspraken. [CONDIV]
‘That confirms a tendency to wrestle the USA out of international agreements.’
Many occurrences of onttrekken and ontrukken, especially, are more or less fixed expressions, such as iets aan de vergetelheid onttrekken/ontrukken ‘to save something
from oblivion’ or iets aan het zicht onttrekken ‘to hide something from sight’. This is
the case in more than 50% of the ontrukken examples and 26.6% of the onttrekken examples. Again, such examples are not prototypical instances of dispossession events.
Still, the aan-construction can be used to express events with a human deprivee, too:
see (20) and (21) for examples of ontrukken and onttrekken.
(20) In het holst van de nacht ontrukten zwaarbewapende Amerikaanse immigratieagenten een verschrikt Cubaans jongetje van zes aan luid lamenterende
verwanten. [TNC]
‘In the middle of the night, heavily armed American immigration agents
snatched a startled Cuban boy of six from loudly lamenting family members.’
(21) De partij vraagt voorts aan procureur-generaal Van Oudenhove om alle
Vlaams Blok-dossiers aan Dejemeppe te onttrekken. [CONDIV]
‘Furthermore, the party asks attorney general Van Oudenhove to withdraw
all cases related to the Vlaams Blok from Dejemeppe.’
Only in such cases is the DOC a (marked) option as well, hence the low frequencies
in the DOC column in Table 1. (22) is one of the few DOC clauses attested with ontworstelen, for instance: in this case, the verb is used in its literal, compositional sense
‘wrest from’, and the relation between the object referents is one of fairly prototypical
possession.
(22) Ik dacht aan de talloze keren dat ik in de supermarkt had moeten wachten
omdat moeders vooraan in de rij hun gillende broedsel vergeefs een reep
chocola probeerden te ontworstelen. [TNC]
‘I thought of the many occasions when I had found myself waiting in the
supermarket because mothers at the front of the queue fruitlessly tried to
wrest away a chocolate bar from their yelling brood.’
For a final observation, the constructions with bij and van can be seen to occupy an
isolated position in the middle of Figure 1. Indeed, none of the verbs under investigation is particularly attracted to these constructions. The verbs with the relatively
highest frequencies of indirective van-instances are ontvreemden ‘steal’ and ontfutselen ‘filch, fish out of ’, but even in these cases, the van-construction only accounts for
about 3% of the occurrences. See (13) above for an example with ontvreemden and
(23) below for one with ontfutselen.
(23) Ze moeten 80.000 frank boete betalen omdat ze geld en andere goederen
hebben ontfutseld van enkele gasten. [CONDIV]
‘They have to pay an 80,000 francs fine because they have filched money and
other goods from a number of guests.’
In this regard, the ont-verbs differ from the pattern attested with simplex verbs of
dispossession, for with stelen ‘steal’, nemen ‘take’, etc., the human possessor participant is marked with van in present-day Dutch (as in (3) above, see Delorge 2010 for
details). It is sometimes suggested that it would be natural for ont-verbs of dispossession to substitute the “default” source preposition van for aan (e.g. Schermer-Vermeer
1991:â•›216–217). However, our data show that, so far, the van-construction has not
really caught on with these verbs, at least not in formal registers of written Dutch.
The “top verbs” of the bij-construction are ontvreemden ‘steal’, again, and ontlokken ‘elicit (from)’, but, similarly to the van-construction, this construction with
bij accounts for only a small fraction of occurrences in both cases (in-between 4 and
5%). An ontlokken example is shown in (24). Again, such examples do not denote
prototypical dispossession.
(24) Alleen het bootje met daarop Sint en zijn pieten die sinterklaasliedjes zongen,
ontlokte applaus bij het publiek. [TNC]
‘Only the boat on which the saint and his servants were singing St-Nicholas
songs drew applause from the public.’
To summarize, we can distinguish four clusters of ont-verbs in the present-day data:
i. a set of verbs that are (virtually) exclusively used in the secundative van-construction: ontlasten ‘relieve’, ontdoen ‘strip’, ontheffen ‘release’ (and ontladen ‘unload’);
ii. the verb ontvreemden ‘steal’, which marks the source participant with uit, bij, or van, if it is expressed at all;
iii. a set of verbs with a strong lexical preference for the double object construction, most notably ontnemen ‘take away from’ and ontstelen ‘steal away from’;
iv. a set of verbs with a strong lexical preference for the aan-construction, most notably ontlenen ‘take, derive, borrow’, onttrekken ‘withdraw, derive’, ontrukken ‘snatch away’ and ontworstelen ‘wrest, wrench away’.
There is a cline from (iii) to (iv), with verbs such as ontfutselen ‘filch, fish out of ’, ontroven ‘rob from’ and ontwringen ‘wrench from’ occupying intermediate positions. The
next section explores the constructional preferences of ont-verbs in an older sub-stage
of the Dutch language.
5. A diachronic perspective
An important trend in construction grammar is the growth of interest in issues of diachronic and synchronic language variation in the syntax and semantics of schematic
constructions – see Colleman and De Clerck (2011) for references and discussion. In
this section, we will explore a number of differences and similarities in the constructional behaviour of the selected ont-verbs between present-day Dutch and 19th-century Dutch, as represented by a 50-year sample of the corpus De Gids (1850–1899).
First, Table 3 presents the distribution of the 14 verbs under investigation over the six
three-argument constructions in the 19th-century data – ontladen ‘unload’ was left
out again. This distribution is statistically significant (χ² = 5659.828, df = 65, p-value <
2.2e-16). Table 6 in the Appendix lists the Pearson residuals, and the Cramér’s V effect
size is 0.5195202.
The data show that the virtually complete lexical split between the secundative
construction on the one hand and the other three-argument constructions on the
other was already present in the 19th century: the three verbs attested in the secundative van-construction are not attested in any of the other three-argument constructions, with the exception of ontheffen ‘release’, which has a small number of aan and
DOC examples. Conversely, the other 11 verbs do not enter into the secundative
construction.
Table 3. Observed frequencies in the 19th-century data

Verb            DOC    aan  van-I  van-S   uit   bij  3-arg total  Other cxs  Total
ontdoen           0      0      0     89     0     0           89         87    176
ontfutselen      30     11      0      0     0     0           41          3     44
ontheffen         2      4      0    304     0     0          310         23    333
ontlasten         0      0      0     30     0     0           30         67     97
ontlenen          9    873     34      0    74     0          990         12   1002
ontlokken       179    144      0      0     2     2          327         17    344
ontnemen       1023    697      0      0     0     0         1720         57   1777
ontroven         29     11      0      0     0     0           40          0     40
ontrukken       113    287      0      0     0     0          400         32    432
ontstelen        78     13      0      0     0     0           91          2     93
onttrekken        7     39      0      0     0     0           46         52     98
ontvreemden       4      5      4      0     1     0           14         17     31
ontworstelen      4      6      0      0     0     0           10         87     97
ontwringen       43     41      2      0     0     0           86         16    102
Total          1521   2131     40    423    77     2         4194        472   4666
Figure 2, below, shows the plot from a Correspondence Analysis of the observed
frequencies of these remaining 11 verbs in the DOC and the four indirective constructions. The two-dimensional analysis accounts for 96.65% of the inertia (Dim. 1:
72.95%, Dim. 2: 23.7%). Again, the numerical output is supplied in the Appendix,
Table 7. We will not explore this distribution to the same degree of detail as in the previous section, but we will focus on the relation between the DOC and the aan-dative,
the two constructions involved in the dative alternation. A difference with the visualization in Figure 1 is that, in Figure 2, the DOC and the aan-construction occupy less
extreme positions in the cluster of points on the left-hand side of the plot: the distance
between the two constructions is smaller than in Figure 1. This suggests that in the
19th-century data, there is more overlap in their distributions over the 11 verbs than
in the present-day data.
In order to further test this, we conducted two-by-two comparisons of the observed DOC and aan-dative frequencies of the 11 non-secundative verbs in 19th-century vs. present-day language. In six cases, this revealed a significant shift. Three
verbs display a significantly stronger preference for the DOC in the present-day data
compared to the 19th-century data: ontnemen (χ² = 1194.934, df = 1, p < 2.2e-16;
OR = 0.07611464), ontstelen (χ² = 8.7284, df = 1, p = 0.003133; OR = 0.1744933) and
ontlokken (χ² = 6.0898, df = 1, p = 0.0136; OR = 0.7244433). The odds ratios (ORs)
are included as a measure of the effect size: the effect is stronger for ontnemen and
ontstelen than for ontlokken (the more the OR differs from 1, the stronger the effect).
Three other verbs display the reverse tendency, i.e. they have a significantly stronger
[Plot: verbs in lower case and constructions in capitals (DOC, AAN, VAN, UIT, BIJ), plotted against Dimension 1 (87.61%) and Dimension 2 (8.948%).]
Figure 2. Correspondence analysis of the 19th C data (without secundative construction)
preference for the aan-dative in the present-day data compared to the older data: ontlenen (Fisher Exact p = 8.038e-09; OR = Infinite), onttrekken (Fisher Exact p = 2.828e-11; OR = 198.7147) and ontrukken (χ² = 51.9195, df = 1, p-value = 5.782e-13; OR =
12.36263).7 In this case, the odds ratios are larger than 1 because the effect is reversed:
the odds for the aan-dative are larger in the present-day data than in the old data.
The remaining five verbs do not display a significant diachronic shift: the distribution
attested with ontwringen, for instance, is remarkably constant, with a proportion of
DOC to aan-dative instances of (about) 1:1 in both periods.
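As an illustration of these two-by-two comparisons, the sketch below recomputes the ontnemen test from the DOC and aan frequencies in Tables 1 and 3. The sample odds ratio it yields differs marginally from the reported 0.07611464, which was presumably obtained with a different estimator (e.g. the conditional estimate returned by R's fisher.test), but the interpretation is the same:

```python
import math

# ontnemen: DOC vs. aan-dative frequencies in the 19th-century data
# (Table 3) and the present-day data (Table 1).
a, b = 1023, 697    # 19th century: DOC, aan
c, d = 3993, 207    # present-day: DOC, aan

n = a + b + c + d
# uncorrected Pearson chi-square for the 2x2 table
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
# sample odds ratio: DOC odds in the 19th-century data relative to the
# present-day data; a value well below 1 means the DOC has gained ground
odds_ratio = (a * d) / (b * c)
print(f"chi2 = {chi2:.2f}, odds ratio = {odds_ratio:.5f}")
```

Substituting the corresponding cell counts for the other ten verbs reproduces the remaining comparisons (with Fisher's exact test instead of the chi-square where expected frequencies fall below 5, as in footnote 7).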
The diachronic shifts observed in individual verbs can be seen as indicators of a
tendency towards polarization or constructional specialization: in those verbs which
already displayed a strong preference for one of the two alternating constructions in
the 19th-century data, this lexical preference has become even stronger in the present-day data. That is, ontnemen, ontstelen and (to a somewhat lesser extent) ontlokken
have become even more closely associated with the DOC, whereas ontlenen, onttrekken, and ontrukken have become even more closely associated with the aan-construction. All in all, there seem to be few verbs left in this morphological class which can
be said to alternate more or less freely. An interesting follow-up question for further
7. In the case of ontlenen and onttrekken, we used the Fisher Exact test rather than the Pearson
chi-square test because at least one of the cells in the 2-by-2 table had an expected frequency of
less than 5.
diachronic research into the Dutch dative alternation is whether this polarization tendency observed for ont-verbs is also found in other ditransitive verb classes.
6. Conclusion
Whereas the large majority of existing studies on the grammatical encoding of ‘possessional transfer’ events in Dutch – or English, for that matter – deal with verbs of
giving, either primarily or exclusively, the present study has focused on verbs lexicalizing the reverse direction of transfer, i.e. verbs of dispossession. More specifically, we have presented a corpus-based case study of a specific morphological class of
dispossession verbs, viz. prefixed verbs with ont- ‘away’. Starting with the frequency
data included in a database with over 18,000 three-argument examples from four corpora of present-day written Dutch, we have distinguished four clusters of ont-verbs
according to their constructional preferences. These are (i) verbs which are (virtually)
exclusively found in the secundative van-construction, (ii) verbs with a preference
for indirective constructions with uit, bij, or van, or for two-argument constructions
without a source participant, (iii) verbs with a strong lexical preference for the double
object construction, and (iv) verbs with a strong lexical preference for the aan-dative.
These clusters provide valuable information about the semantic relations between the
argument structure constructions at stake. As for the distinction between clusters (iii)
and (iv), for instance, we have observed that, compared to the verbs strongly attracted
to the aan-dative in cluster (iv), the verbs with a strong attraction to the DOC in cluster (iii) denote more prototypical events of dispossession in which a human original
possessor is (strongly) negatively affected by the transfer. This suggests that the DOC
is the preferred construction for the encoding of scenes in which the source of the
transfer is a prototypical human deprivee – a finding which is in line with a general
hypothesis known from the literature on the dative alternation with verbs denoting a
possessional transfer in the canonical direction (i.e., verbs of giving and the like), viz.
that compared to the aan-dative, the DOC highlights the affectedness of the indirect
object referent. We have not found corroborating evidence, however, for another well-known hypothesis about verbs of giving, viz. that the use of the aan-dative highlights
the spatial aspects of the scene: in clauses with ont-verbs, it does not seem to matter
for the choice between the DOC and the aan-dative whether or not the theme actually
moves along a spatiotemporal path as it changes ownership. Furthermore, we have
also observed that there is a cline from verbs with a strong DOC preference to verbs
with a strong aan-preference, with a number of verbs occupying intermediate positions. This is different for the secundative van-construction, in that there is a nearly
complete lexical split between this construction on the one hand and the other five
constructions included in the table on the other: the secundative construction is frequently found with a number of ont-verbs which hardly occur in the DOC or in any
of the investigated indirective constructions. Semantically, these verbs depart from
prototypical ‘dispossession’ in that they imply a positive rather than negative effect
on the original possessor, who is relieved of something conceptualized as a burden,
or denote an event of physical removal or separation. The secundative construction,
it turns out, is hardly an option for the encoding of more prototypical ‘dispossession’
events with a human possessional source maleficially affected by the subject’s action.
The indirective constructions with van, bij, and uit, finally, account for a small fraction
of the ont-examples only; ontvreemden ‘steal’ is the only verb in the data with a preference for one of these constructions, namely for the locative construction with uit – it
remains to be investigated how exactly simplex verbs of dispossession such as stelen
‘steal’ or roven ‘rob’ behave with respect to these various indirective constructions.
The most interesting finding from the diachronic part of the investigation is that
the relation between the DOC and the aan-dative seems to be characterized by a tendency of polarization or semantic specialization: the lexical preferences
displayed in the 19th-century data tend to become even stronger in the present-day
language.
References
Barðdal, J. (2007). The semantic and lexical range of the ditransitive construction in the history
of (North) Germanic. Functions of Language, 14, 9–30. DOI: 10.1075/fol.14.1.03bar
Colleman, T. (2006). De Nederlandse datiefalternantie: Een constructioneel en corpusgebaseerd onderzoek [The Dutch dative alternation: A constructionist and corpus-based investigation]. Unpublished Ph.D. dissertation, Ghent University.
Colleman, T. (2009a). The semantic range of the Dutch double object construction: A collostructional perspective. Constructions and Frames, 1, 190–220. DOI: 10.1075/cf.1.2.02col
Colleman, T. (2009b). Verb disposition in argument structure alternations: A corpus study of
the Dutch dative alternation. Language Sciences, 31, 593–611.
DOI: 10.1016/j.langsci.2008.01.001
Colleman, T., & De Clerck, B. (2009). Caused motion? The semantics of the English to-dative
and the Dutch aan-dative. Cognitive Linguistics, 20, 5–42. DOI: 10.1515/COGL.2009.002
Colleman, T., & De Clerck, B. (2011). Constructional semantics on the move: On semantic specialization in the English double object construction. Cognitive Linguistics, 22, 183–210.
DOI: 10.1515/cogl.2011.008
Delorge, M., & De Clerck, B. (2007). A contrastive and corpus-based study of English and
Dutch provide-verbs. Phrasis, 48, 121–142.
Delorge, M. (2010). De relatie tussen betekenis en structuur bij privatieve en receptieve werkwoorden in het Nederlands [The relation between meaning and structure in verbs of dispossession and reception in Dutch]. Unpublished Ph.D. dissertation, Ghent University.
De Schutter, G. (1974). De Nederlandse zin: Poging tot beschrijving van zijn structuur [The
Dutch clause: An attempt at describing its structure]. Brugge: De Tempel.
Duyck, W., Desmet, T., Verbeke, L., & Brysbaert, M. (2004). WordGen: A tool for word selection and nonword generation in Dutch, English, German, and French. Behavior Research
Methods, Instruments, & Computers, 36, 488–499. DOI: 10.3758/BF03195595
Geeraerts, D. (1998). The semantic structure of the indirect object in Dutch. In W. Van
Langendonck & W. Van Belle (Eds.), The dative. Volume 2: Theoretical and contrastive studies (pp. 185–210). Amsterdam & Philadelphia: John Benjamins.
Goldberg, A. E. (1992). The inherent semantics of argument structure: The case of the English
ditransitive. Cognitive Linguistics, 3, 37–74. DOI: 10.1515/cogl.1992.3.1.37
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure.
Chicago: University of Chicago Press.
Greenacre, M. (2007). Correspondence analysis in practice (2nd ed.). Boca Raton: Chapman &
Hall/CRC. DOI: 10.1201/9781420011234
Haspelmath, M. (2005). Argument marking in ditransitive alignment types. Linguistic Discovery, 3, 1–21. DOI: 10.1349/PS1.1537-0852.A.280
Janssen, T. (1997). Giving in Dutch: An intra-lexematical and inter-lexematical description.
In J. Newman (Ed.), The Linguistics of Giving (pp. 267–306). Amsterdam & Philadelphia:
John Benjamins.
Langacker, R. W. (1991). Concept, image, and symbol: The cognitive basis of grammar. Berlin &
New York: Mouton de Gruyter. DOI: 10.1515/9783110857733
Malchukov, A., Haspelmath, M., & Comrie, B. (2010). Studies in ditransitive constructions: A
comparative handbook. Berlin & New York: Mouton de Gruyter.
Newman, J. (1996). Give: A cognitive linguistic study. Berlin: Mouton de Gruyter.
Schermer-Vermeer, I. (1991). Substantiële versus formele taalbeschrijving: Het indirect object in
het Nederlands [Substantial versus formal language analysis: The indirect object in Dutch].
Amsterdam: University of Amsterdam, Dutch Department.
Van Belle, W., & Van Langendonck, W. (1996). The indirect object in Dutch. In W. Van Belle
& W. Van Langendonck (Eds.), The dative. Volume I: Descriptive studies (pp. 217–250).
Amsterdam & Philadelphia: John Benjamins. DOI: 10.1075/cagral.2
Verhagen, A. (1986). Linguistic theory and the function of word order in Dutch. Dordrecht: Foris.
Martine Delorge, Koen Plevoets, and Timothy Colleman
Appendix
Table 4. Pearson’s residuals of the χ² of Table 1

                    DOC        aan       van-I      van-S      uit        bij
ontdoen        –26.9111   –36.0072    –3.72107   96.62736   –7.2614    –3.81388
ontfutselen     21.17231   –9.57147    8.501756 –10.7422    –2.47057   –1.35253
ontheffen      –14.0511   –18.5346    –1.94289   50.01195   –3.79141   –1.99135
ontlasten       –5.58732   –7.47589   –0.77257   20.06194   –1.50763   –0.79184
ontlenen       –41.4537    54.27883   –5.38299  –33.5068    –9.66559   –5.19401
ontlokken       18.05141   –7.32894    7.557048 –13.6028    –4.10051   19.83698
ontnemen        81.45388  –41.6047    –1.61472  –27.8464    –9.08068   –4.26796
ontroven         0.50072    0.304086  –0.16397   –0.95851   –0.31998   –0.16806
ontrukken       –6.7413     8.65564   –1.04221   –6.09239    1.40804   –1.0682
ontstelen       13.47256   –6.93123   –0.75853   –4.43409   –1.48021   –0.77745
onttrekken     –25.3715    31.96341   –1.52988  –20.5712    –0.16867   –3.60682
ontvreemden     –9.63693  –13.7075    21.49648   –8.4109   103.3261    27.68344
ontworstelen    –0.82861    2.146425  –0.42758   –2.49949    0.364079  –0.43825
ontwringen       1.93476   –0.02316   –0.34395   –2.01059   –0.67119   –0.35252
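The residuals in Table 4 are standard Pearson residuals, (observed − expected)/√expected, computed cell by cell from the verb-by-construction frequency table. A minimal illustrative sketch (the toy counts are invented, not the corpus figures):

```python
from math import sqrt

def pearson_residuals(table):
    """Pearson residuals (O - E) / sqrt(E) for a two-way frequency table,
    where E is the expected count under independence of rows and columns."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[(table[i][j] - row_tot[i] * col_tot[j] / n)
             / sqrt(row_tot[i] * col_tot[j] / n)
             for j in range(len(col_tot))]
            for i in range(len(row_tot))]

# toy verb-by-construction counts (invented for illustration)
toy = [[80, 20],   # a DOC-attracted verb
       [10, 90]]   # an aan-attracted verb
```

As in Table 4, a large positive residual marks attraction of a verb to a construction, a large negative one marks repulsion.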
Table 5. Numerical output of the correspondence analysis of the present-day data

Principal inertias (eigenvalues):

 dim      value       %   cum%   scree plot
  1     0.828914   53.8   53.8   *************************
  2     0.683026   44.3   98.1   *********************
  3     0.025129    1.6   99.8   *
  4     0.003653    0.2  100.0
        --------  -----
Total   1.540722  100.0

Rows:
         name    mass   qlt  inr |   k=1  cor ctr |   k=2  cor ctr |
  1  |  ontf  |    42   896   21 |  -829  891  35 |    59    5   0 |
  2  |  ontln |   407  1000  161 |   760  946 283 |   181   53  19 |
  3  |  ontlk |    67   492   28 |  -563  486  26 |   -60    5   0 |
  4  |  ontn  |   281   995  292 | -1252  978 531 |   163   17  11 |
  5  |  ontrv |     0   917    0 |   -87  158   0 |   191  759   0 |
  6  |  ontrk |    13   982    4 |   679  981   7 |   -10    0   0 |
  7  |  onts  |     7   990    8 | -1301  972  15 |   177   18   0 |
  8  |  ontt  |   153   996   56 |   745  987 103 |    71    9   1 |
  9  |  ontv  |    26  1000  429 |    38    0   0 | -5078 1000 968 |
 10  |  ontwrs|     2   901    0 |   308  898   0 |    17    3   0 |
 11  |  ontwrn|     1   957    0 |  -299  686   0 |   188  271   0 |

Columns:
         name    mass   qlt  inr |   k=1  cor ctr |   k=2  cor ctr |
  1  |  DOC   |   345  1000  348 | -1239  986 638 |   146   14  11 |
  2  |  AAN   |   617  1000  204 |   694  946 358 |   166   54  25 |
  3  |  VANI  |     7   824   22 |  -438   37   2 | -2011  787  39 |
  4  |  UIT   |    25   995  383 |   165    1   1 | -4835  994 859 |
  5  |  BIJ   |     7   698   43 |  -314   10   1 | -2562  687  67 |
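The “Total” line of Table 5 (1.540722) is the total inertia of the table, i.e. the χ² statistic divided by the grand total of observations; the principal inertias (eigenvalues) listed above it are the shares of this quantity captured by the successive dimensions. A minimal sketch of the total-inertia computation (toy counts, not the corpus data):

```python
def total_inertia(table):
    """Total inertia of a two-way frequency table: chi-squared / grand total.
    In correspondence analysis this equals the sum of the principal inertias
    (the 'Total' line in the numerical output)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(len(row_tot))
               for j in range(len(col_tot)))
    return chi2 / n
```

For a perfectly associated 2-by-2 table the inertia reaches its maximum of 1; for a table with fully independent rows and columns it is 0.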
Table 6. Pearson residuals of the χ² of Table 3

                    DOC        aan       van-I      van-S      uit        bij
ontdoen         –5.68127   –6.72469   –0.92132   26.70959   –1.27828   –0.20601
ontfutselen      3.923941  –2.15422   –0.62533   –2.03352   –0.86761   –0.13983
ontheffen      –10.4144   –12.2317    –1.71948   48.77557   –2.38568   –0.38449
ontlasten       –3.29846   –3.90425   –0.5349    15.50718   –0.74215   –0.11961
ontlenen       –18.4732    16.49591    7.992057  –9.99249   13.094     –0.6871
ontlokken        5.547325  –1.71846   –1.766     –5.74288   –1.63397    4.66983
ontnemen        15.98458   –5.9854    –4.05023  –13.171     –5.61947   –0.90566
ontroven         3.805351  –2.06827   –0.61765   –2.00857   –0.85696   –0.13811
ontrukken       –2.66221    5.875097  –1.9532    –6.35164   –2.70995   –0.43675
ontstelen        7.832866  –4.88802   –0.93162   –3.02954   –1.29256   –0.20832
onttrekken      –2.37058    3.232374  –0.66236   –2.15395   –0.91899   –0.14811
ontvreemden     –0.47808   –0.79243   10.58121   –1.18828    1.46546   –0.08171
ontworstelen     0.196071   0.407667  –0.30883   –1.00428   –0.42848   –0.06906
ontwringen       2.114915  –0.40802    1.302676  –2.94514   –1.25655   –0.20251
Table 7. Numerical output of the correspondence analysis of the 19th-century data

Principal inertias (eigenvalues):

 dim      value       %   cum%   scree plot
  1     0.319872   87.6   87.6   *************************
  2     0.032668    8.9   96.6   **
  3     0.007285    2.0   98.6
  4     0.005248    1.4  100.0
        --------  -----
Total   0.365073  100.0

Rows:
         name    mass  qlt  inr |  k=1  cor ctr |   k=2  cor ctr |
  1  |  ontf  |   11  981   14 | -658  952  15 |  -116   29   4 |
  2  |  onth  |    2  693    0 |   69   77   0 |   195  616   2 |
  3  |  ontln |  263  996  547 |  870  995 622 |   -26    1   6 |
  4  |  ontlk |   87  647   38 | -320  646  28 |    15    1   1 |
  5  |  ontn  |  456  993  209 | -408  993 237 |    -9    0   1 |
  6  |  ontrv |   11  982   13 | -646  954  14 |  -110   28   4 |
  7  |  ontrk |  106  749   32 |  162  241   9 |   235  508 179 |
  8  |  onts  |   24  970   57 | -887  917  59 |  -214   53  34 |
  9  |  ontt  |   12  830   11 |  400  485   6 |   337  344  42 |
 10  |  ontv  |    4  941   75 |  828   93   8 | -2501  848 711 |
 11  |  ontwrs|    3  691    0 |  -52   82   0 |   143  609   2 |
 12  |  ontwrn|   23  789    5 | -192  503   3 |  -144  285  15 |

Columns:
         name    mass  qlt  inr |   k=1  cor ctr |   k=2  cor ctr |
  1  |  DOC   |  403  999  470 |  -649  991 531 |   -59    8  43 |
  2  |  AAN   |  565  995  239 |   383  951 259 |    82   44 117 |
  3  |  VAN   |   11  979  132 |  1437  454  69 | -1547  525 777 |
  4  |  UIT   |   20  895  144 |  1483  856 140 |  -317   39  63 |
  5  |  BIJ   |    1   31   15 |  -566   30   1 |    82    1   0 |
Rethinking constructional polysemy
The case of the English conative construction
Florent Perek
Freiburg Institute for Advanced Studies and Université Lille 3
This chapter examines the conative construction, e.g., I kicked at the ball, using
collexeme analysis. Previous studies report that strong collexemes of a construction provide an indication of its central meaning, from which polysemic extensions are derived. However, the conative construction does not seem to attract
a particular kind of verb that could be used to characterize its central meaning. To address this problem, a variant of collexeme analysis is suggested that
consists in splitting the verbal distribution into semantic classes and considering
“verb-class-specific” constructions independently. For the three classes tested,
the most significant collexemes are found to be verbs whose inherent meaning
contains the semantic contribution of the construction in that class. Hence, the
most attracted collexemes do provide an indication of the constructional meaning, albeit specific to each verb class.
Keywords: collexeme analysis, semantic classes, verb-class-specific
constructions
1. Introduction
In constructional approaches to grammar, argument structures are taken to be symbolic pairings of a syntactic structure with a schematic meaning independent of the
verbs instantiating them (cf. Goldberg 1995, 2006). For example, the ditransitive
construction (e.g., John built the children a new merry-go-round) is a pairing of the
double-object syntactic pattern with a core meaning of ‘caused possession’. An increasingly
large body of evidence from experiments (Goldberg et al. 2004) and corpus studies
(Stefanowitsch and Gries 2003) suggests that there is a close relation between
constructional meaning and constructional usage, in that the meaning of a construction
closely corresponds to the meaning of the elements that typically occur in it. In
the case of argument structure constructions, this means that the meaning of verbs
occurring in a given syntactic pattern determines to a large extent the meaning that
will be associated with this syntactic pattern.

1. This chapter is based on material presented at the 4th International Conference of the
German Cognitive Linguistics Association on October 8th 2010 in Bremen. I would like to
thank the audience of my talk for their interest and comments. I am also indebted to
Dylan Glynn, Adele Goldberg and Martin Hilpert for their comments on earlier versions of
this chapter.
Along the same lines, previous corpus-based studies on the interaction of syntax
and lexis using the method of collostructional analysis show that “strong collexemes
of a construction provide a good indicator of its meaning” (Stefanowitsch and Gries
2003: 227); for example, the ditransitive is biased towards verbs lexicalizing its core
meaning of caused possession, such as give. Collexeme analysis is thus considered
as a valid approach to the analysis of constructional meaning. This chapter presents
an attempt to use collostructional analysis to describe the meaning of the conative
construction, in which a typically transitive verb is followed not by a direct object,
but by a prepositional phrase headed by at (e.g., The waiter wiped at the counter). As
shown by the literature review presented in Section 2, previous research indicates that
the meaning of the conative construction is difficult to grasp with a single semantic
generalization that would be both accurate and maximally general, which points to a
polysemy analysis. Along the lines of Stefanowitsch and Gries (2003), Section 3 considers whether collostructional analysis can inform a polysemy analysis of the conative construction by identifying its central meaning(s), from which other meanings
could be derived. However, a collexeme analysis of the construction reveals that no
single verb type clearly stands out as prototypical, as is the case with previously studied constructions. These results challenge the claim that collexeme analysis is a good
way to characterize the meaning of the construction from the verbs that most prominently occur in it.
In Section 4, a solution to this problem is presented that restores the relation
between constructional meaning and verbal use. Drawing on an earlier proposal by
Croft (2003) that constructional polysemy is better viewed as generalizations over
several semantic classes of verbs rather than extensions from a prototype, a slightly
different implementation of collexeme analysis is suggested, whose basic idea is to
split the verbal distribution into semantic classes and consider each of these thus-defined “verb-class-specific” constructions independently. The method is applied to
three classes of verbs: verbs of striking, verbs of cutting and verbs of pulling. In each
class tested, the most significant collexemes are verbs whose meaning inherently contains precisely those aspects of meaning that are arguably contributed by the construction when it is used with other verbs. Hence, the most attracted collexemes do
provide an indication of the constructional meaning, albeit specific to each verb class.
The conclusion of this study is two-fold. At the theoretical level, it shows that
the polysemy of the conative construction is better seen not as a unified network,
but rather as a conglomerate that can be explained by local lexical generalizations
over classes of verbs. Such clusters of low-level generalizations are arguably, at least in
this case, a more psychologically valid mental representation of constructional meaning than general schemata deriving from prototypical verbs. At the methodological
level, this study shows that looking at the level of verb classes is a useful adaptation
of collexeme analysis that can appropriately deal with cases which would otherwise
yield results that are difficult to interpret. It allows us to see more clearly what the
semantic contribution of a grammatical construction is, albeit for each semantic class
separately.
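The class-splitting step described above amounts to partitioning the construction’s verbal distribution before any association measure is applied. A minimal sketch (the verbs, counts, and class labels are purely illustrative, not the chapter’s data):

```python
from collections import defaultdict

def split_by_class(verb_counts, verb_class):
    """Partition a construction's verbal distribution into semantic classes,
    yielding one 'verb-class-specific' distribution per class; each of these
    can then be submitted to a collexeme analysis on its own."""
    by_class = defaultdict(dict)
    for verb, count in verb_counts.items():
        by_class[verb_class[verb]][verb] = count
    return dict(by_class)

# invented conative frequencies and class assignments, for illustration only
conative = {"kick": 12, "hit": 30, "cut": 8, "slash": 5, "pull": 20, "tug": 25}
classes = {"kick": "striking", "hit": "striking",
           "cut": "cutting", "slash": "cutting",
           "pull": "pulling", "tug": "pulling"}
```

Running the association measure within each resulting sub-distribution, rather than over the pooled list, is what yields the verb-class-specific collexeme rankings discussed in Section 4.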
2. The conative construction
The conative construction is most naturally discussed with reference to the conative
alternation, whereby the direct object of a transitive verb is realized as a prepositional
phrase headed by the preposition at, as in John shot at the burglar. As we will see in
this section, the meaning contributed by this syntactic construction is highly variable, which makes a maximally general semantic characterization of the construction
challenging, if possible at all.
One of the most cited semantic characterizations of the construction is that of
Levin (1993), who suggests that the construction “describes an ‘attempted’ action
without specifying whether the action was actually carried out”. Pinker’s (1989: 104)
description, viz. “the subject is trying to affect the oblique object but may or may not
be succeeding”, basically refers to the same idea while further specifying the origin
of the “attempted action” interpretation, namely that the conative variant lacks the
entailment that the referent of the at-phrase is affected by whatever activity the agent
is engaged in. Finally, Goldberg (1995: 63–64) formulates a construction grammar account of the construction based on these earlier observations, in which she posits that
the central meaning contributed by the construction is roughly ‘x directs action
at y’, and accounts for the “attempted action” interpretation reported by Levin and
Pinker by stipulating that in such cases “the verb designates the intended result of the
act denoted by the construction”. The common idea behind all three analyses is the
notion that the conative counterpart leaves the affectedness of the at-phrase referent
unspecified, whereas it is strongly (if not necessarily) implied by the transitive variant.
More recent work on the conative construction shows that this characterization
is not by itself sufficient to account for the interpretation of all conative sentences. As
Van der Leek (1996: 367) notes, “the conative does not, in its own right, guarantee
an intended result reading when featuring otherwise transitive verbs”. Both Van der
Leek (1996) and Broccias (2001) note that many conative sentences do entail that the
patient is affected, albeit to a lesser extent than the transitive counterpart. Verbs of
ingestion provide a good example thereof. Indeed, such expressions as James Bond
sipped at his Martini do entail that at least some of the designated substance was
ingested; what they prevent is a holistic interpretation where the whole substance
would be consumed. Non-affectedness thus cannot be a relevant reading for verbs of
ingestion, which rather involve a ‘bit-by-bit’ interpretation. Van der Leek (1996: 367)
also notes that “usage of verbs of ingestion in the conative often seems to be motivated
by a desire to signal that no real attempt is (or even can be) made to carry out the action to completion”. Example (1) (taken from Van der Leek 1996:â•›367, originally from
the Longman Dictionary of Contemporary English) exemplifies such a case:
(1) [Sandy was] sipping at her drink just to be polite
Sentence (1) explicitly specifies the actual goal of the sipping (to be polite), and thus
entails that Sandy has no real intention to consume the whole drink. In other words,
the conative construction can be found in cases where there is apparently no intention
on the part of the agent to affect the target, and hence where actual affectedness is not
only unlikely but also (and more importantly) irrelevant.
Subscribing to a constructional approach whereby clausal meaning results from
the fusion of a verb’s meaning with an abstract schema conveyed by the syntactic
construction, Broccias (2001) presents a new analysis of the conative construction.
To account for instances not covered by the “attempted action” generalization and
to tackle several other issues with previous studies, Broccias argues that the conative
construction conveys either one of three schemas: the allative schema, the ablative
schema, and the allative/ablative schema, which combines aspects of the first two.
The allative schema is described in purely locative terms as involving translational
motion towards a target with which contact is not necessarily made, which more or
less corresponds to the aforementioned analyses in terms of “attempted action”; note,
for example, that Pinker (1989) describes the output of his conative lexical rule in a
similar locative fashion as ‘X goes towards X acting-on Y’. This is also reminiscent of
Goldberg’s description of the construction’s central meaning in terms of “directed-action”. Broccias’ ablative schema, contrary to the former one, does imply that contact
is made but does not bring about the intended effect and is open to repetition; this
schema is involved, for example, with verbs of ingestion, as mentioned earlier.
It should be clear from the previous discussion that the semantic contribution
of the conative construction is highly variable, and is, if anything, difficult to grasp
with a single generalization. What could stand as the common motivation behind all
these uses is the very abstract notion that the conative construction moves the focus
to what the agent is doing, regardless of whatever effect this action brings about. This
proposal echoes the analysis of Dixon (1991: 280), who notes that “the emphasis is not on
the effect of the activity on some specific object […] but rather on the subject’s engaging in the activity”. While this account seems reasonable at first blush, such an abstract
characterization must still go a long way towards the actual semantic contribution
with individual verbs, leaving a heavy burden to processes of meaning construction.
In addition, such a general meaning could not account for why some verbs (such as
break and bend) cannot occur in this construction, since a priori any verb meaning
involving an agent subject could, in theory, undergo a focus on the agent’s activity.
Thus, the syntactic frame [NP V at NP] more likely corresponds to several different abstract schemas. Whether or not these schemas can be related in a polysemic
network is a matter of debate, but it seems to be a reasonable position. Indeed, the
various semantic contributions sketched above can be shown to share family resemblances, which gives credence to a polysemy analysis. For example, both the ‘intended-result’ and ‘bit-by-bit’ readings share the notion that, whatever else is going on in
the sentence, there is in both cases some goal which is not reached by the agent: bringing about a result on the second entity for the former, and leading an incrementally
unfolding event to its completion in the latter.
In the next section, the polysemy of the conative construction is examined on the
basis of corpus data. Specifically, it is proposed that the central meaning (or meanings) of the construction can be identified from an examination of its verbal distribution, using the method of collexeme analysis.
3. A collexeme analysis of the conative construction
Previous discussions of constructional polysemy consider that a construction gains
additional meanings through semantic extensions from a central meaning. For example, the central meaning of the ditransitive construction is ‘actual change of possession’, as instantiated by, e.g., the verb give. Several semantic extensions are derived
from this central meaning, such as ‘enabled change of possession’ (as with, e.g., allow)
or ‘intended change of possession’ (as with many verbs of creation, e.g. bake). All these
meanings are related in that they all share the notion of some change of possession,
but ‘actual transfer’ is the prototypical meaning since it is both concrete and “basic
to human experience”, according to Goldberg’s (1995: 39) scene encoding hypothesis.
How do we identify the central meaning of a construction? In quantitative corpus
linguistics, it has been proposed that the verbal distribution of a construction reveals
a great deal about its meaning. More precisely, the most frequent verbs occurring in
a construction would be those instantiating its central meaning. This section presents
an attempt to identify the central meaning of the conative construction on the basis of its verbal usage, using the method of collexeme analysis. Collexeme analysis is
one of the specific implementations of the more general method of collostructional
analysis suited to the identification of the central meaning of a construction. This
section starts with an outline of what the method consists in (cf. Hilpert’s contribution (this volume, 391–404) for a more thorough introduction). Drawing on previous
research, it is then shown how this method is useful for the study of grammatical constructions. The remainder of this section presents a collexeme analysis of the conative
construction.
Table 1. Contingency table for collexeme analysis

                   Construction C        Other constructions
Lexeme L           F(L in C)             F(L in other C)
Other lexemes      F(other L in C)       F(other L in other C)

3.1
Collexeme analysis
Collexeme analysis was first introduced by Stefanowitsch and Gries (2003) as “an extension of collocational analysis specifically geared to investigating the interaction of
lexemes and the grammatical structures associated with them” (ibid.: 209). Collexeme
analysis is concerned with the words occurring in a given slot of a chosen construction, and more particularly with “determining the degree to which particular slots in
a grammatical structure prefer, or are restricted to, a particular set or semantic class
of lexical items” (ibid.: 211).
The method starts with the identification of a particular construction in a corpus,
and of a particular slot of that construction that can be filled with different lexical
items. For each lexeme occurring in the slot, the following contingency table must be
calculated, as in Table 1.
This contingency table is then submitted to a distributional statistic (often the
Fisher exact test) to calculate the collostruction strength of the lexeme. This value
gives an index of the degree of statistical association between the lexeme and the
construction, given their frequency of co-occurrence, the frequency of the lexeme
elsewhere, and the frequency of other lexemes in the construction. The verbs in the
distribution are then ranked according to their collostruction strength.
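The two steps just described (an exact test on each lexeme’s contingency table, then ranking) can be sketched as follows. This is an illustrative reimplementation: the attraction p-value is the upper tail of the hypergeometric distribution (the one-tailed Fisher exact test), −log10(p) serves as the collostruction-strength index, and all frequencies in the example are invented:

```python
from math import comb, log10

def attraction_p(k, n_constr, n_lex, n_corpus):
    """One-tailed Fisher exact p-value that a lexeme occurs k times or more
    in a construction, given n_constr construction tokens, n_lex lexeme
    tokens, and n_corpus verb tokens overall (hypergeometric upper tail)."""
    hi = min(n_constr, n_lex)
    denom = comb(n_corpus, n_constr)
    return sum(comb(n_lex, x) * comb(n_corpus - n_lex, n_constr - x)
               for x in range(k, hi + 1)) / denom

def rank_collexemes(in_constr, lexeme_totals, n_constr, n_corpus):
    """Rank lexemes by collostruction strength, here -log10 of the p-value:
    the higher the value, the more strongly attracted the collexeme."""
    scored = [(lex, -log10(attraction_p(k, n_constr,
                                        lexeme_totals[lex], n_corpus)))
              for lex, k in in_constr.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A rare verb concentrated in the construction thus outranks a frequent verb that merely happens to occur in it a few times, which is exactly the asymmetry the raw-frequency ranking misses.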
The final step (interpretation) consists in using this ordered list of collexemes to
inform a description of the meaning of the grammatical construction, which is essentially guided by the theoretical assumptions of the constructional approach. In construction grammar, the occurrence of a lexeme in a construction is to a large extent
determined by the degree of semantic compatibility (cf. Goldberg 1995) between the
meaning of the lexeme and that of the construction (or more precisely, the meaning
assigned by the construction to the particular slot under study). In collexeme analysis,
collostruction strength is assumed to correlate with semantic compatibility: lexemes
are more attracted to some constructional slot (i.e. occur in that slot more often than
expected) if they are more semantically compatible with the slot. It thus follows that
the strongest collexemes of a construction, as the most semantically compatible lexemes, are a potential source of information about the meaning of the construction.
2. Despite the wide range of available distributional statistics, Stefanowitsch and Gries
(2003: 218) argue that the Fisher exact test is a perfect choice for collostructional analysis: it
“neither makes any distributional assumptions, nor does it require any particular sample size”.
The task of the analyst is thus to track down the origin of semantic compatibility from
the lexical semantics of these collexemes, so as to deduce a characterization of the
constructional meaning.
Stefanowitsch and Gries (2003) illustrate their claims with a few case studies
showing the usefulness of the method for the description of grammatical constructions. Two of these are of particular interest for us here: the into-causative construction (Subj V Obj into V-ing) and the famous ditransitive construction (Subj V Obj1
Obj2).
For the into-causative, Stefanowitsch and Gries (2003) looked at the first verb slot
of the construction and found that the top collexemes are verbs “instantiating the two
major sub-senses of the construction, namely ‘trickery’ (as exemplified by trick/fool
[…]) and ‘force’ (as exemplified by coerce/force […])” (p. 226), while verbs instantiating senses of the construction that are intuitively less central (such as ‘verbal coercion’
and ‘persuasion through a positive or negative stimulus’) appear much further down
the list.
As to the ditransitive construction, the verb give turns out to be by far its strongest collexeme, which is to be expected given the principle of semantic compatibility:
among the many ways in which a verb can be compatible with a construction, give and
the ditransitive exemplify the optimal case where there is semantic identity. In other
words, since the verb give is maximally compatible with the ditransitive construction,
it comes as no surprise that it is its strongest collexeme. Yet, the authors argue that,
contrary to what happens with the into-causative construction, the basic ‘transfer’
sense of the ditransitive is not overwhelmingly dominant in the collexemes of the
construction, in that there are relatively few significant collexemes instantiating the
central sense in the whole list (6 out of 30, 10 including metaphorical uses such as tell,
show and teach). Rather, the high diversity of verbs provides, according to Stefanowitsch and Gries, evidence for the polysemy analysis of the construction put forward
by Goldberg. It is indeed true that instances of the central sense are a minority among
the collexemes in terms of the number of types, but these few types are clearly clustered towards the top of the list: at least four of them (eight including the metaphorical uses) are among the top ten collexemes.
Thus, for both constructions, there seems to be a strong tendency for the top
collexemes to instantiate the most central meaning(s). Both case studies thus present
evidence that collexeme analysis is a valid quantitative method to profile the meaning
of constructions from their prominent verbal collocates. As Stefanowitsch and Gries
(2003: 227) conclude, “strong collexemes of a construction provide a good indicator
of its meaning”. Therefore, the method should be helpful in identifying the elusive
meaning of the conative construction.
Florent Perek
3.2 Data collection
The verbal distribution of the conative construction was extracted from the prose
fiction part of the BNC, containing about 16 million words in 431 texts primarily
drawn from novels. The choice of this corpus was neither arbitrary nor unmotivated.
Intuitively, the conative construction seems to convey a complex descriptive function
which makes it more at home in narrative genres, and probably not to be found so
frequently in spontaneous spoken language. The latter intuition is actually borne out
by an earlier attempt at finding conative sentences in the conversation part of the corpus, revealing that the construction is extremely rare in that register (only 17 tokens
in 4 million words).
The corpus was queried for all verbs followed by the preposition at (with an optional intervening adverb) in the same sentence, with the exclusion of frequent verbs
that cannot support a conative reading and for which at can only be used in a purely
locative sense (e.g. be, stay, live, arrive, etc.).3 The resulting set of sentences was manually annotated to select only conative sentences, which were defined according to two
criteria: (1) the verb has to be transitive, and (2) the interpretation of the sentences
has to correspond broadly to one of the readings described in the previous section. Sentences
with coordinated verbs were duplicated in the dataset (one duplicate per verb). This
yielded a final set of 2,563 instances, distributed over 159 verb types.
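The extraction and filtering step described above can be sketched as follows. This is a simplified stand-in for the actual CQP query, not the chapter's code: the token format, the POS tag names and the stop list shown here are illustrative assumptions.

```python
# Keep verbs followed by "at", with an optional intervening adverb,
# skipping frequent verbs that only allow a purely locative "at".
# Token format (word, POS) and the tag names are illustrative assumptions.
STOP_VERBS = {"be", "stay", "live", "arrive"}  # purely locative with "at"

def candidate_conatives(tagged_sentence):
    """Return the verbs of a POS-tagged sentence that head a candidate
    conative pattern: VERB (ADV)? "at"."""
    hits = []
    for i, (word, pos) in enumerate(tagged_sentence):
        if pos != "VERB" or word in STOP_VERBS:
            continue
        nxt = tagged_sentence[i + 1:i + 3]
        if nxt and nxt[0][0] == "at":
            hits.append(word)  # verb directly followed by "at"
        elif len(nxt) == 2 and nxt[0][1] == "ADV" and nxt[1][0] == "at":
            hits.append(word)  # one intervening adverb
    return hits
```

The sentences retained this way would still need the manual pass described above (transitivity and interpretation checks) before counting.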
3.3 Results
The collostruction strength of each verb in the construction was computed by
Coll.analysis 3, an R program written and kindly provided by Stefan Gries, with the
Fisher exact test as a distributional statistic.4 Following Stefanowitsch and Gries (2005),
Coll.analysis applies a log transformation to the p-values yielded by the Fisher exact
test, and changes the sign to a plus if the association is one of attraction (i.e. the actual
verb’s frequency exceeds the expected frequency) and to a minus in case of repulsion
(i.e. the actual verb’s frequency is below the expected frequency). This gives a more
readable value than the raw p-values, which are often expressed in powers of ten. A collostruction strength above 1.301 (i.e. –log10(0.05)) means that the verb is significantly attracted to the construction; a collostruction strength below –1.301 means that the verb is significantly repelled by
3. The Corpus Query Processor program, part of the Corpus Workbench suite developed at
the University of Stuttgart (http://cwb.sourceforge.net/), was used to query the corpus. The
corpus was assembled from the XML version of the BNC with a script that parsed all texts of
the corpus and copied only those with the “prose-fiction” genre attribute. Another script then
converted the corpus into a format readable by CQP.
4. Available at: http://www.linguistics.ucsb.edu/faculty/stgries/teaching/groningen/.
the construction. As noted above, the verbs at the top of the distribution ordered by collostruction strength provide an indication of the constructional meaning.

Table 2. The thirty strongest collexemes of the conative construction in BNC-prose-fiction

Rank  Verb     f(conative:all)  coll.strength    Rank  Verb      f(conative:all)  coll.strength
  1   tug          226:661        209.92          16   hammer        29:263         12.87
  2   clutch       179:823        127.13          17   snatch        43:567         12.86
  3   dab           72:166         75.74          18   jab           24:180         12.58
  4   claw          53:156         49.14          19   scrabble      18:112         11
  5   gnaw          43:97          46.02          20   paw           13:56          10.23
  6   sniff         73:643         32.05          21   scratch       35:524          9.13
  7   nibble        36:121         31.26          22   slash         17:149          8.07
  8   sip           71:689         28.56          23   swipe          9:32           8.07
  9   peck          29:87          26.95          24   niggle         8:26           7.58
 10   nag           31:107         26.62          25   poke          26:364          7.55
 11   pluck         44:300         24.13          26   suck          35:656          6.7
 12   tear          91:1363        22.51          27   prod          17:190          6.52
 13   stab          36:291         17.41          28   kick          51:1186         6.44
 14   grab          76:1217        17.29          29   lap           11:112          4.82
 15   hack          22:140         13.08          30   strain        23:466          4.13
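The signed log-transform just described can be sketched as follows. This is a plain stdlib reimplementation in the spirit of Coll.analysis, not Gries's program itself: it computes a one-tailed Fisher exact (hypergeometric) p-value from the four frequencies and returns its signed log10.

```python
# Collostruction strength as a signed, log10-transformed one-tailed
# Fisher exact p-value, as described above. Pure standard library.
from math import comb, log10

def hypergeom_pmf(i, n_cx, f_verb, n_corpus):
    # P(the verb occurs i times in the construction), given the margins:
    # n_cx = construction tokens, f_verb = verb tokens, n_corpus = corpus size
    return comb(f_verb, i) * comb(n_corpus - f_verb, n_cx - i) / comb(n_corpus, n_cx)

def coll_strength(f_both, n_cx, f_verb, n_corpus):
    expected = n_cx * f_verb / n_corpus
    if f_both >= expected:
        # attraction: p = P(X >= observed), sign is plus
        p = sum(hypergeom_pmf(i, n_cx, f_verb, n_corpus)
                for i in range(f_both, min(f_verb, n_cx) + 1))
        return -log10(p)
    # repulsion: p = P(X <= observed), sign is minus
    p = sum(hypergeom_pmf(i, n_cx, f_verb, n_corpus) for i in range(f_both + 1))
    return log10(p)
```

With these conventions, values beyond ±1.301 (i.e. ±(–log10 0.05)) mark significant attraction or repulsion. For instance, a verb occurring 10 times in a 100-token construction, against 100 occurrences in a 10,000-token corpus, comes out as strongly attracted (these frequencies are illustrative, not the chapter's data).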
The thirty strongest collexemes of the conative construction are reported in Table 2. As it turns out, the construction attracts a great variety of verbs. Almost all verb
classes allowed in the construction are represented in that list: verbs of pulling (tug,
pluck), verbs of seizing and holding (clutch, claw, grab, snatch), verbs of hitting and
touching (dab, claw, peck, stab, hammer, jab, paw, swipe, poke, prod, kick), verbs of
ingestion (gnaw, nibble, sip, peck, suck, lap), verbs of cutting (tear, hack, slash), etc.
This result is not surprising in itself, as constructions are often associated with several
related senses, and therefore several classes of verbs. Again, this points to a polysemy
analysis, as indeed the collexemes presented in Table 2 arguably instantiate different
senses of the construction. For example, assuming Broccias’ (2001) distinctions (cf.
Section 2), clutch, stab and kick mostly instantiate the allative schema, while nibble,
hack and suck rather instantiate the ablative schema.
While it is, a priori, not problematic that the construction attracts different classes
of verbs, the list of collexemes is, however, not particularly helpful in characterizing
the construction’s meaning. Moreover, contrary to what happens in the case studies
reviewed above, there does not seem to be a class of verbs that the construction attracts in particular. The list alternates between very different types of verbs, and
no particular class seems to be more strongly attracted than the others. For example,
the five most attracted collexemes exemplify precisely five different verb classes: tug
(verb of pulling), clutch (verb of seizing/holding), dab (verb of touching/hitting), claw
(verb of hitting or seizing/touching) and gnaw (verb of eating/chewing). Moreover,
these verbs exemplify various semantic aspects of the construction: tug at entails no
change of location and an inherent repetition of the attempt, clutch at and claw at
entail either missed contact or prolonged exertion of a force, dab at entails little or no
affectedness, gnaw at entails no completion.
Thus, contrary to what Stefanowitsch and Gries (2003) found with the ditransitive construction, collexeme analysis is not helpful in identifying one (or more) particular sense of the conative construction which would be central and from which the
other senses would be derived. As a matter of fact, there need not be an identifiable
verb class corresponding to each constructional sense: since the senses of the conative
construction are so highly abstract, they are liable to be combined with a great variety of verbs from different semantic classes. Hence, it is not particularly surprising
that the collexeme list of the conative construction (or probably of any abstract construction) is not as easily interpretable as that of the ditransitive construction. As one
reviewer suggests, this might be because the semantics of the conative construction
is less directly related to basic bodily experience than that of the ditransitive or of the
caused-motion construction; as such, it is less likely to correspond to patterns of lexicalization in the language in general. This means that collexeme analysis in its present
form would not be able to identify the senses of many constructions, at least not as
neatly as those of the ditransitive construction (for example).
3.4 Towards a solution
In the face of such results, this chapter suggests another approach based on a refinement of collexeme analysis, which might be more informative in the case of the conative construction, and probably many other constructions. This approach is motivated
by an earlier proposal by Croft (2003), who criticizes the concept of constructional
polysemy, and thus the related notion of a “central” meaning.
According to Croft, the very concept of constructional polysemy is problematic
in several respects. The main problem can be roughly summarized as follows: how can
a construction be considered truly polysemous if its meaning in context only depends
on the verb it is instantiated with? In the case of the ditransitive construction, Croft
(ibid.: 55) notes that “each semantic class is associated with only one sense of the ditransitive construction”. This seems to be in part semantically motivated: for example,
the fact that the modal extension (‘conditions of satisfaction imply that X causes Y to
have Z’) is the only one occurring with promise (for instance) is expected since it is
the only extension whose specifications do not conflict with the meaning of the verb.
However, why the extension ‘X intends that Y have Z’ is the only one compatible with
verbs of creation appears to be completely arbitrary, since there is nothing in the verb’s
meaning that blatantly conflicts with a number of the other extensions, whose instantiation with the verb would thus make perfect sense.
A polysemic analysis of the conative construction runs into exactly the same problem: while there can be several different readings of a single conative sentence, not all
interpretations are equally available in all instances. For example, in no case would
conative sentences with verbs of ingestion mean ‘X moves towards Y in order to ingest
Y’. Conversely, verbs of rubbing could never be used in the conative construction to
convey the meaning ‘X rubs a part of Y and goes towards having Y totally rubbed’,
let alone an allative interpretation (i.e. ‘X goes towards Y to rub Y’).5 Sometimes the
unavailability of some readings is straightforwardly explained by intrinsic properties
of the verbs themselves: for example, the impossibility of an incremental reading with
semelfactives such as hit and kick can be explained by the aspectual properties of these
verbs and more particularly the absence of an incremental theme. However, there are
still perfectly sensible combinations that, nonetheless, are disallowed, which would
not be the case if the construction was truly polysemous.
Croft suggests that such cases are more appropriately accounted for not by
considering the construction as genuinely polysemous, but by treating it as several
“verb-class-specific constructions”, i.e. lower-level generalizations of a constructional
meaning over a clearly delimited semantic verb class, instantiated only with verbs of
that class. The remainder of this chapter presents evidence that this view might also be
more appropriate for the conative construction. As observed in Table 2, no particular
meaning stands out in the whole distribution of the construction. However, if we look
again at Table 2 by focusing on verbs from a specific semantic field, a clearer picture
emerges. A class that is fairly easy to delimit is that of verbs of eating. Table 3 reports the distribution of verbs of eating in the conative construction (collexemes with a collostruction strength beyond ±1.301 are significantly attracted or repelled).
Table 3. Verbs of eating in the conative construction

Verb      f(conative:all)   coll.strength
nibble        36:121            31.26
peck          29:87             26.95
suck          35:656             6.7
lick          20:488             2.68
gulp           9:267             1.07
gobble         1:60             –0.18
munch          1:84             –0.3
pick          79:4678           –1.1
eat           12:4089          –21.53
5. Conative uses of rub and other similar verbs (wipe, brush, …) do receive a form of “non-affectedness” interpretation which is not ‘X tries to rub Y’ but rather corresponds to a scenario in
which some entity remains unaffected; this entity might be mentioned (as in rub at the stain) or
might remain implicit or unspecified (as in rub at the counter, which most likely entails that the
agent’s goal is to clean the counter and that this goal is not achieved).
The most strongly attracted verb in that class, nibble, denotes an event of eating
where only a small amount of some substance is ingested, and is therefore inherently
compatible with the “bit-by-bit” reading supported by the construction. In fact, this
verb is similar to give in the ditransitive construction: assuming a more specific eating-conative construction instantiated by verbs of eating only and whose meaning
would be ‘eat in a bit-by-bit fashion’, the meaning of nibble is identical to the meaning
of that construction, which largely motivates the prominent occurrence of that verb
in the construction.
The other significantly attracted collexemes also support the ‘bit-by-bit’ interpretation. Peck typically refers to how birds eat, by moving their beak forward repeatedly;
in the conative construction, it is also frequently used to refer to people eating only
a small amount of their meal. Suck and lick are not purely verbs of eating but rather
describe a kind of action that an agent performs on another entity; when they are
used to describe events of eating (as they very often are in the corpus), both typically
refer to a slow and gradual means of ingestion through the progressive dissolution of
a substance. Finally, the sole collexeme repelled by the construction is eat; this again
reflects the semantic preferences of the construction, as eat is a maximally neutral
verb of ingestion which is more commonly used to denote total consumption and
lends itself less easily to a ‘bit-by-bit’ interpretation.6
This simple example shows that focusing on a particular class of verbs clearly captures what the semantic contribution of the construction is for this particular class.
Thus, a collexeme analysis at the level of individual verb classes seems to be a promising approach. The next section elaborates on this proposal and presents a version of
collexeme analysis based on semantic classes.
4. A collexeme analysis of verb-class-specific constructions
In the previous section, it was found that a collexeme analysis performed on the
whole distribution of the conative construction is not very helpful in characterizing
its constructional meaning and does not clearly support a polysemy analysis either.
6. As Dylan Glynn notes in a review of an earlier version of this chapter, it is somewhat unexpected that pick does not appear among the attracted collexemes of the construction, let
alone that it almost reaches the threshold of repulsion, since pick at indeed seems to be a prime
example of the ‘bit-by-bit’ reading induced by the construction. This result is explained by the
fact that the verb pick is highly polysemous and at the same time highly frequent, and that it
is not primarily a verb of eating: in fact, it probably occurs in this sense in the conative construction only. This asymmetry in the semantic distribution of pick thus appears to obscure its
contribution to our understanding of the meaning of the construction. The general issue of the
relation between frequency of verb forms and frequency of verb senses is taken up again in
Section 4.1.2.
It was observed that a clearer picture emerges if we look only at verbs from a specific
semantic class (in that case, verbs of eating): the meaning of the strongest collexemes
clearly reflects the semantic contribution of the construction for this semantic class.
This section outlines a more principled and systematic formulation of this approach
and then presents its application to three classes of verbs: verbs of cutting, verbs of
pulling and verbs of striking.
4.1 Method
This section first explains how verbs in this study were classified into semantic classes.
It then turns to some statistical issues posed by the present approach.
4.1.1 Determining verb classes
The present approach first requires that the verbs from the distribution of the conative construction are sorted into several classes. Of course, a given verb form can
correspond to several meanings, and these meanings can belong to different semantic
classes. For example, in Table 2, peck and pick can function as verbs of eating but also
as verbs of striking (albeit more rarely). However, the frequencies obtained from the
corpus are frequencies of verb forms, not of verb meanings, and thus some of these
frequencies may actually be distributed over several semantic classes. The instances of a verb form can neither be assigned wholesale to a single class nor counted in several classes simultaneously: it must be determined, for each token, which semantic class it belongs to.
For the example of verb-class-specific collexeme analysis presented in the last
section, the field of verbs of eating was relatively easy to select from the whole distribution. However, it might not be so easy to identify, on the sole basis of intuition, the
verb classes found in the distribution and the semantic class each verb token belongs
to. To facilitate this process, an external lexicographic source was relied on: WordNet (Fellbaum 1998), a lexical database of the English language which was created
and is being maintained at the Cognitive Science Laboratory of Princeton University.
It groups English words into sets of synonyms (called synsets) and provides lists of
the various meanings of each word form that can be looked up to perform semantic
annotation. Starting with an established list of sense distinctions, instead of building
it during the annotation process, is not only convenient; it also secures a crucial feature of empirical studies of meaning: overt operationalization (cf.
Glynn 2010), in the sense that the analytical criteria are overtly identified. This makes
the analysis falsifiable, since it enables it to be repeated on the same data or on another
dataset (e.g. for the purpose of comparison).
The list of verb senses could be drawn from any dictionary, but WordNet presents
another useful feature for this approach: it records relations between synsets such as
hyponymy, hyperonymy, part-whole relations, entailments, etc. Of particular interest
to us, the relations of hyponymy (and conversely, hyperonymy) connect the synsets
into a type hierarchy, which can be used to define verb classes: a verb class includes
the verbs of a given synset and all of its hyponyms, i.e. verbs whose meaning includes
(and often, elaborates) the meaning of the synset. Hence, co-hyponyms belong to the
same class. In sum, WordNet can be used both to annotate for verb senses and to
define verb classes on the basis of the annotated data and hyponymy/hyperonymy
relations between senses recorded in the database.
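The class-building logic can be illustrated with a toy sketch: a class is a synset together with the closure of its hyponyms, so co-hyponyms land in the same class. The mini-hierarchy below is hand-made for illustration and is not real WordNet data.

```python
# Toy fragment of a hyponymy hierarchy: each key lists the direct
# hyponyms of a verb (hand-made illustration, not actual WordNet data).
TOY_HYPONYMS = {
    "pull": ["tug", "pluck", "drag"],
    "cut": ["hack", "saw", "chip"],
}

def verb_class(root, hyponyms):
    """Return the root verb plus all of its direct and indirect hyponyms."""
    members, stack = set(), [root]
    while stack:
        verb = stack.pop()
        if verb not in members:
            members.add(verb)
            stack.extend(hyponyms.get(verb, []))
    return members
```

Here verb_class("pull", TOY_HYPONYMS) groups tug, pluck and drag with their hyperonym pull, mirroring the relation between drag and pull discussed below.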
It has been noted elsewhere that WordNet sense distinctions are somewhat arbitrary and sometimes so fine-grained that it is practically impossible to apply the
classification to naturally occurring examples (not to mention the theoretical vacuity
and actual impracticability of the very notion of sharp sense boundaries, cf. Kilgarriff
1997; Glynn 2010). While this is true in many cases, in the context of this study it is
often unproblematic to ignore some sense distinctions as long as they do not extend
over different verb classes. For example, drag has two senses in WordNet that may apply to conative uses of the verb: (i) ‘pull, as against a resistance’ and (ii) ‘draw slowly or
heavily’. It is not clear from the glosses what the semantic difference is supposed to be,
and if anything it is very subtle and therefore not easily applicable to the annotation of
examples in context. This distinction can, however, be ignored, since both senses have
pull as their direct hyperonym: they can thus be conflated into a single entry, drag,
subsumed by the class of verbs of pulling. Even though the fine-grained sense distinctions posited in WordNet might not always be well-grounded, the coarser-grained
distinctions imposed by verb classes are more reliable and more easily noticeable. This
strategy thus avoids the pitfalls of drawing strict sense boundaries.
The original dataset was manually annotated for WordNet senses with the help
of an interactive program.7 As it turns out, while some verbs are highly polysemic according to WordNet’s classification, the conative construction is usually restricted to
one or two senses of these verbs, and most verbs can belong to only one semantic class
when they occur in the construction. The verb sense distribution was built by calculating the frequency of each word sense in the construction. Each verb sense in this
distribution was then annotated with the synset ID of its direct hyperonym, or with
its own synset ID if the verb sense is a hyperonym of other verbs in the distribution.
This ID identifies both the class to which the verb belongs, and the most general verb
(i.e. hyperonym) of that class. In the case of classes subsumed by another class, which
can be diagnosed by the hyperonym of one class being a member of another class, the
lower class was merged into the higher one. As a last step, in each class, senses of the
7. This tool was written in Java and uses the JWNL API to read the WordNet 3.0 files (http://
sourceforge.net/projects/jwordnet/), downloaded from the website (http://wordnet.princeton.
edu/wordnet/download/).
same verb form were collapsed into one cell summing all frequencies of the verb form.
With this method, maximally large and distinctive verb classes were obtained.
4.1.2 Statistical matters
In the collexeme analysis of verbs of ingestion in Section 3.4, verbs were just filtered
out on the basis of their belonging to the semantic class under study. However, if the
verb-class-specific constructions hypothesis is taken seriously, a collexeme analysis of a specific semantic class only makes sense if the collostruct under consideration is not the general construction but a more specific one, taking only verbs of this semantic class. Since such constructions have a lower frequency than the more general one, the actual collostruction strength values could differ slightly, changing the significance of some collexemes and possibly the order of the collexeme list. The
frequency of a verb-class-specific construction is obtained by summing the frequency
of all verb senses in the class.
There is, however, still one missing set of frequencies: the frequency of each verb
sense in other constructions. Unfortunately, except with a semantically annotated
corpus, there is no easy way to determine this frequency, as it is practically intractable
to manually annotate the whole corpus for verb senses. It must be acknowledged that
this is an inherent weakness of this approach. However, as serious as it might be, this
problem can be attenuated using two methods. First, in each verb class, only those
verb senses that were by far the most frequent instance of their verb form are kept in
the analysis. For example, catch occurs only seven times as a verb of striking in the
conative construction versus fifty times in other senses (mainly as a verb of seizing);
it was thus removed from the list of verbs of striking and does not appear in Table 5.
The rationale behind this decision is that a verb form occurring clearly less prominently in a given verb-class-specific construction than in the other ones should be a
weak collexeme of the construction anyway and is not likely to tell us much about the
constructional meaning.8 Second, the overall frequency of the verb form was used
for each verb sense, which makes the assumption that every occurrence of each verb
form in the corpus has the meaning that the verb has in the conative construction.
This is, of course, surely false for polysemous verbs, though not overly problematic for
this study since it will merely downplay the collostruction strength of verbs. Indeed,
the frequency of a verb sense is at most as high as the frequency of the verb form, and
for polysemous forms it is a priori lower. The approximate collostruction strength
calculated with the frequency of the verb form will thus be lower than the theoretical
collostruction strength that would be calculated with the frequency of the verb sense,
thus probably narrowing the range of significant collexemes. As it turns out, this
8. The deleted verbs include: scrape, scratch and slash for the cutting-conative construction,
catch, pick, tweak and twitch for the pulling-conative construction, and catch, jab, peck, pick and
poke for the striking-conative construction.
possible downplaying of the attraction of the verbs to the construction does not prevent the identification of a number of interesting collexemes in each class.
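The bookkeeping described in this section can be sketched as follows: the frequency of a verb-class-specific construction is the summed construction frequency of its member verbs, and each verb sense's corpus frequency is approximated by its verb form frequency. A simple observed/expected ratio is used here as a stand-in for the Fisher-based score; the three frequency pairs follow Table 3.

```python
# Class-internal attraction sketch. Frequencies for three verbs of eating
# follow Table 3; the observed/expected ratio is a simplified stand-in
# for the Fisher-based collostruction strength.
EATING = {              # verb: (freq in conative, freq of verb form in corpus)
    "nibble": (36, 121),
    "peck":   (29, 87),
    "eat":    (12, 4089),
}

def class_attraction(class_freqs):
    n_class = sum(f_cx for f_cx, _ in class_freqs.values())    # class-specific construction frequency
    n_total = sum(f_all for _, f_all in class_freqs.values())  # summed verb form frequencies
    ratios = {}
    for verb, (f_cx, f_all) in class_freqs.items():
        expected = n_class * f_all / n_total
        ratios[verb] = f_cx / expected   # > 1: attracted; < 1: repelled
    return ratios
```

On these figures, nibble and peck come out well above 1 and eat well below, in line with the eating-conative pattern described in Section 3.4.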
4.2 Results
This section reports the collexeme analysis performed on the cutting-conative, pulling-conative and striking-conative constructions, defined as elaborations of the conative construction instantiated, respectively, by verbs of cutting, verbs of pulling and
verbs of striking.
4.2.1 Verbs of cutting
Events of cutting involve an agent moving a suitable instrument over the surface of an
object, and causing a rupture in the physical integrity of that object as a result. With
verbs of cutting, the conative construction does not support the allative interpretation
(or at least not literally): contact is necessarily made between some instrument and
the referent of the at-phrase, but this contact does not bring about the effect that the
transitive use of the verb would entail: the cutting either fails entirely, or is too minimal for one to consider that the object is indeed cut. Hence, conative uses of verbs
of cutting often convey the implicature that the action performed to do the cutting is
repeated.
Table 4 presents the collexemes of the cutting-conative construction. The analysis reveals three significantly attracted collexemes: hack, saw and chip. All three collexemes are particularly suited to the semantic contribution of the cutting-conative
construction.
The lexemes hack and saw are inherently repetitive: an event of hacking or sawing
always consists of several identical actions. Moreover, a single movement (a stroke of
a hacking tool or of a saw) generally does not by itself bring about the intended effect
on the patient, e.g. cutting something to bits or sawing a piece of wood apart; the
movement must be repeated until the desired effect is obtained. Hence hack and saw naturally support the semantic contribution of the cutting-conative construction in their conceptual semantics, i.e. both ‘no-significant-effect’ and ‘repetition’.

Table 4. Collexemes of the cutting-conative construction

Verb      f(conative:all)   coll.strength   WordNet gloss
hack          22:140            19.76       cut with a hacking tool
saw            6:74              3.69       cut with a saw
chip           4:93              1.63       break a small piece off from
chisel         2:39              1.11       carve with a chisel
snip           2:54              0.87       sever or remove by pinching or snipping
chop           3:174             0.47       cut into pieces
slice          3:237             0.27       make a clean cut through
nick           2:163             0.23       cut a nick into
cut            4:3075          –22.71       separate with or as if with an instrument
The item chip inherently features only one of these two aspects. In any event of
chipping, only a small piece of the patient is broken off, and chip does not in any case
support a truly holistic interpretation, i.e. an object that is chipped is only minimally
affected and keeps its overall physical integrity, compared to what happens with true
verbs of change of state like break. Events of chipping must be repeated if the patient
is to be considered significantly affected.
The only significantly repelled collexeme in the list is cut. Its repulsion can be
explained by its status as a maximally neutral verb of cutting (and indeed the hyperonym of the whole class), which thus does not carry any semantic elaboration that
would promote its use in the conative construction. In addition, cut lends itself to a
holistic interpretation to a much larger extent than the attracted collexemes.
4.2.2 Verbs of pulling
Events of pulling consist in an agent exerting a force on a patient, usually in order to
move the patient towards self or to affect it in some other way (e.g. open a door). The
effect on the patient is not an inherent feature of these verbs, but is rather a frequent
implicature of their transitive use. The conative construction prevents this implicature
of change of location/state, thus bringing the interpretation towards an ‘attempted
action’ reading. Such uses also easily allow an interpretation of repeated actions, since
a single iteration of pulling does not bring about a significant effect.
Table 5 lists the collexemes of the pulling-conative construction. The construction has two significantly attracted collexemes: tug and pluck. According to the Oxford
English Dictionary, tug applies to events where the puller puts a lot of energy in the
pulling, or exerts a force during an extended period. Hence, tug focuses on the effort
the agent puts into the act of pulling, and not so much on the dynamics of the event
itself, i.e. whether the patient is set in motion or not.
Table 5. Collexemes of the pulling-conative construction

Verb      f(conative:all)   coll.strength   WordNet gloss
tug          226:661          153.73        pull hard
pluck         42:300           10.31        pull or pull out sharply
wrench        12:314           –0.49        twist or pull violently or suddenly
yank           1:122           –1.64        pull, or move with a sudden movement
haul           5:411           –3.9         draw slowly or heavily
jerk           8:717           –7.02        pull, or move with a sudden movement
drag          25:1528         –10.49        draw slowly or heavily
pull         138:6024         –38.41        apply force so as to cause motion towards the source of the motion
Pluck as a verb of pulling is often used to refer to the removal of some object from
where it grows, e.g. fruit, plants, hair, or feathers. To overcome the inherent resistance
of the ground to which the object is attached (e.g. skin, branch, earth), acts of plucking frequently involve a sharp and sudden pull so as to abruptly separate the object
from its ground (as alluded to by WordNet’s gloss). The more general use of this verb
to refer to other kinds of pulling keeps this ‘sharp and sudden’ aspect. Due to their
short duration, acts of plucking are particularly prone to repetition.
As indicated earlier, the repelled collexemes may have slightly overestimated repulsion scores; thus, the values of the five repelled collexemes (yank, haul, jerk, drag
and pull) have to be interpreted with caution. However, the last two (drag and pull)
provide some interesting insight into the construction's meaning. Drag is more appropriately described as a verb of accompanied motion (i.e. one where both agent and theme move along the same path, like bring) than as a pure verb of pulling: it strongly presupposes the motion of the patient, which puts it at odds with the conative construction. Pull is, of course, the hyperonym of the semantic class, i.e., it is arguably the
most neutral verb of pulling. Since it has no inherent semantic traits that particularly
favor the conative reading(s), its appearance as a repelled collexeme is expected.
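The coll.strength values in Tables 5 and 6 derive from Stefanowitsch and Gries' (2003) collexeme analysis, which tests each verb's observed frequency in the construction against its expected frequency with the Fisher exact test; the reported figures are consistent with log-transformed p-values, positive for attraction and negative for repulsion. The following is a minimal sketch of that computation, assuming a one-tailed hypergeometric test over a verb-class-internal contingency table (the exact totals and transformation used by the author are not spelled out here):

```python
from math import comb, log10

def coll_strength(verb_in_cx, verb_total, cx_total, class_total):
    """Signed collostruction strength for one verb.

    Returns -log10 of the one-tailed Fisher-exact (hypergeometric)
    p-value, with a negative sign when the verb occurs in the
    construction less often than expected by chance (repulsion).
    """
    denom = comb(class_total, verb_total)

    def hyper(k):
        # P(exactly k of the verb's tokens fall inside the construction)
        return comb(cx_total, k) * comb(class_total - cx_total, verb_total - k) / denom

    expected = verb_total * cx_total / class_total
    if verb_in_cx >= expected:
        # attracted collexeme: upper-tail probability P(X >= observed)
        p = sum(hyper(k) for k in range(verb_in_cx, min(verb_total, cx_total) + 1))
        return -log10(p)
    # repelled collexeme: lower-tail probability P(X <= observed)
    p = sum(hyper(k) for k in range(verb_in_cx + 1))
    return log10(p)
```

For tug, plugging in marginals summed from Table 5 (226 conative tokens out of 661 for tug, against 457 conative tokens out of 10,077 for the pulling class as a whole) yields a large positive value; whether it matches the reported 153.73 exactly depends on the totals and transformation the author actually used.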
4.2.3 Verbs of striking
Verbs of striking represent the largest of the three semantic classes under study. This class comprises verbs that have either hit or strike as their hyperonym in WordNet. Events
of striking consist in an agent performing some movement in the direction of a patient, aiming at forceful contact with the patient, usually with the intention of affecting it in some way (doing it harm or damage). In the conative construction, verbs of
striking typically assume an allative interpretation: some effort is directed towards a
goal (here, bringing about an effect on the patient) that is not reached.
Table 6 lists the collexemes of the striking-conative construction. The significantly attracted collexemes include dab, hammer, swipe, buffet, kick, pummel and swat,
and all of these verbs feature one or more particular semantic traits favored by the
construction.9
The verb dab, by far the strongest collexeme, is categorized by WordNet as a verb
of striking, though it is a very peculiar one. Contrary to more typical members, dabbing involves little energy and is normally not aimed at affecting the target, or at least
not negatively. Rather, typical instances of dabbing include using a cloth to gather and
remove a substance (like blood or tears) from a surface, or gently applying a substance
9. Table 6 also lists buffet as a significant collexeme. However, it is a very rare verb in our reasonably large corpus, occurring only twice, yet each time in the conative construction, which
probably explains why it reaches the significance threshold. Since its rarity makes it a poor
candidate as a relevant and telling collexeme of the construction, it was removed from the
discussion.
Rethinking constructional polysemy
Table 6. Collexemes of the striking-conative construction

Verb      f(conative:all)   coll.strength   WordNet gloss
dab            71:166             66.44     hit lightly
hammer         29:263              9.56     beat with or as if with a hammer
swipe           9:32               6.81     strike with a swiping motion
buffet          2:2                3.1      strike against forcefully
kick           51:1186             2.89     strike with the foot
pummel          4:31               1.98     strike, usually with the fist
swat            3:27               1.41     hit swiftly with a violent blow
batter          7:161              0.78     strike against forcefully
slap           16:510              0.44     hit with something flat, like a paddle or the open hand
tap            24:802              0.4      strike lightly
lash            8:265              0.33     strike as if by whipping
whack           1:37              –0.14     hit hard
scuff           1:44              –0.19     poke at with the foot or toe
whip            9:350             –0.32     strike as if by whipping
bat             1:71              –0.39     strike with, or as if with a bat
bash            1:85              –0.51     hit hard
punch           5:278             –0.69     deliver a quick blow to
pound           4:245             –0.75     hit hard with the hand, fist, or some heavy instrument
thump           4:322             –1.31     hit hard with the hand, fist, or some heavy instrument
hook            2:228             –1.37     hit with a hook
beat           27:1372            –1.62     hit repeatedly
bang            8:602             –1.96     strike violently
smash           4:421             –2.14     hit hard
pat             6:545             –2.3      hit lightly
strike         34:1990            –3.39     deliver a sharp blow, as with the hand, fist, or weapon
hit             7:2007           –17.96     deal a blow to, either with the hand or with an instrument
on a surface (e.g. for medical or cosmetic purposes). This typical lack of affectedness
of the patient in an act of dabbing is in line with the meaning of ‘non-effective action’
that the conative construction is often claimed to convey.
The verb hammer originally refers to an act of hitting involving a hammer or a
similar tool as instrument; in that restricted use, typical things that can be hammered
include nails, metal sheets and other metallic goods. If anything, this use of hammer typically entails repetition, i.e., just as with hack in Section 4.2.1, any event of
hammering normally involves multiple blows on the patient, since a single blow does
not suffice to affect the patient in the intended way. For example, nails are rarely
properly hammered into a wall with a single blow, but rather inserted only partly, and
the hammering must be repeated as many times as necessary. Similarly, a sheet of
metal can never be shaped into any appropriate form with a single blow; it has to be
worked until the intended shape is arrived at. Of course, the verb in its modern use is not restricted exclusively to acts of striking with a hammer, but the aspects of 'minimal effect' and 'repetition' found in the original meaning of the verb arguably persist (as confirmed by modern dictionaries), and the instrumental component is echoed by the notion of forceful and violent striking usually accompanied by loud noise (which many dictionaries gloss as 'as if with a hammer').
The verbs swipe, kick and swat are similar cases in that they refer to a precisely
defined shape of motion in space. In other words, what makes an event of swiping,
kicking or swatting, is, above all, a particular movement performed by the agent, respectively a swinging blow10 (of the arm or of an instrument), an outward motion of
the foot, and the motion of a flat surface (an open hand or an instrument with the
appropriate shape) through the air so that the surface hits a target (often an insect,
crushing it). This makes these verbs agent-centered, i.e., they focus on describing what
the agent is doing rather than the effects that its action may have. In addition, kick
specifies the body part involved (a leg), further reinforcing its agent-centered character. Strikingly, far fewer verbs with a focus on the shape of motion turn up among the other (i.e. non-attracted) collexemes. Possible candidates include lash, whip, slap and possibly punch; however, the shape evoked by the former two is due to the kind of instrument used rather than to the action itself, and the latter two less obviously refer to a fully described shape. The other verbs focus instead on the manner of impact or on its effects. It thus seems that this semantic property (‘precisely
defined shape’) is highly correlated with the striking-conative construction.
Finally, pummel combines aspects of hammer and of the agent-centered verbs. It
is slightly agent-focused since it refers to a particular body part (the fists). But more
importantly, it is inherently repetitive, as all consulted dictionaries indicate: pummeling consists of a succession of small blows, most often dealt with the fists.
As for the repelled collexemes, the usual cautionary remarks apply. Let us note, however, that, just as with the cutting- and pulling-conative constructions, the maximally neutral verbs hit and strike are, as expected, the most repelled collexemes of the
striking-conative construction.
10. As a confirmation of this analysis, the OED notes that swipe is chiefly used in the context of
cricket.
4.3 Discussion
As should be clear from the preceding discussion, Stefanowitsch and Gries’ (2003)
claims about the relation between the collexemes attracted to a construction and that
construction’s meaning are clearly borne out for these three verb-class-specific instantiations of the conative construction. Namely, the attracted collexemes all prominently
profile in their inherent semantics one or more semantic trait(s) that the construction
contributes by itself when it occurs with other verbs. The semantic generalizations
that each collexeme supports are reported in Table 7.
The collexeme list clearly exemplifies the principle of semantic compatibility and
how this principle bears on usage; namely, verbs with a meaning that lends itself particularly well to the interpretation sanctioned by the construction are “attracted” by it:
they are much more frequent in that construction than chance would predict. Conversely, the hyperonym of the semantic class is the most repelled collexeme in each
case, which the principle of semantic compatibility also predicts since such verbs are
supposedly the most neutral verbs in their class, and thus do not profile any particular
semantic trait that would attract them to the construction.
In conclusion, it seems possible to characterize the meaning of the conative
construction, or more precisely, the meaning the construction contributes when it
Table 7. Semantic generalizations supported by the collexemes of verb-class-specific constructions

Verb-class-specific construction   Collexemes          Semantic generalization(s)
cutting-conative                   hack, saw           event consisting of several identical movements with a
                                                       minimal individual effect; hence it is inherently
                                                       unbounded and repetitive
                                   chip                minimal effect; no holistic interpretation
pulling-conative                   tug                 focus on the effort (energy and duration) that the agent
                                                       puts into the action rather than its effects
                                   pluck               idem, plus a short duration which makes it prone to
                                                       repetition
striking-conative                  dab                 lowly energetic; patient often not directly affected
                                   hammer              inherently consists of several repeated blows; a single
                                                       blow does not produce a sufficient effect
                                   swipe, kick, swat   agent-centered: they profile a precisely defined motion
                                                       that the agent performs, as well as information on the
                                                       entity set in motion
                                   pummel              profiles a body part (fists), inherently repetitive
combines with verbs of each semantic class under study, simply by attending to the salient semantic properties of the collexemes in each class. Of course, these collexemes
do not lexicalize one of the meanings of the conative construction per se, as is the case
with give and the ditransitive construction. But there is still arguably some abstract
semantic quality shared between the collexemes and the constructional meaning as
it occurs with other verbs. Such a semantic characterization would be much more
difficult (if possible at all) to arrive at by looking at the entire distribution, i.e. at the
level of the general construction vs. the more specific verb-class-specific constructions. The methodological and theoretical implications of this finding are elaborated
on in the concluding words of the next section.
5. Conclusion
As the first large-scale corpus-based investigation of the conative construction, this
study contributes to the documentation of the construction’s usage. Its initial goal
was to see what the verbs most frequently used with the construction could tell us
about its meaning, drawing on the method of collexeme analysis. As it turns out, a
collexeme analysis of the construction based on data from the prose-fiction part of
the BNC fails to highlight its central meaning(s), since there does not seem to be
a particular kind of verb that the construction attracts. Hence, while the collexeme
list is not totally at odds with the meaning of the construction as it has been characterized introspectively, in this case collexeme analysis does not seem to be helpful
in characterizing it precisely. To solve this problem, a different kind of analysis was
proposed. Instead of considering the conative construction as a whole, the focus was
shifted to verb-class-specific constructions, i.e. elaborations of a construction instantiated by verbs from a specific semantic class. A collexeme analysis was performed on
three verb-class-specific constructions, respectively instantiated by verbs of cutting,
verbs of pulling and verbs of striking, identified on the basis of the lexical database
WordNet.
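The class-membership criterion used here (verbs whose WordNet hyperonym chain passes through hit or strike) amounts to a simple walk up a hypernym table. The mini-hierarchy below is invented for illustration; the actual study queried the WordNet database itself (Fellbaum 1998):

```python
# Toy stand-in for WordNet's verb hierarchy: each verb maps to its
# direct hypernym (None for roots). Entries are illustrative only,
# not real WordNet data.
hypernym = {
    "hit": None, "strike": None, "pull": None,
    "hammer": "hit", "swat": "hit", "kick": "strike",
    "tug": "pull", "pluck": "pull",
}

def has_hypernym(verb, targets):
    """Walk up the hypernym chain; True if any ancestor is in targets."""
    v = hypernym.get(verb)
    while v is not None:
        if v in targets:
            return True
        v = hypernym.get(v)
    return False

# Select verbs of striking (the class hyperonyms themselves would also
# be added in the real study)
striking = sorted(v for v in hypernym if has_hypernym(v, {"hit", "strike"}))
```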
The collexemes of each of these lower-level constructions feature in their inherent
meaning the semantic traits that are characteristic of verbs of that class when they
occur in the construction. In other words, collexeme analysis profiles the constructional meaning much better at the level of each verb class than at the most general
level. Of course, this does not mean that collexeme analysis is ineffective for the conative
construction taken as a whole; it is just not particularly telling. The collexemes found
for the overarching construction are attracted because they are more compatible with
the constructional meaning. But the conative construction is so multifaceted when
taken at the most general level that it is much easier to understand why these verbs
are collexemes and what this tells us about the meaning of the construction if we go
down to the level of verb classes.
On the theoretical side, these results shed some light on the nature of constructional generalizations. Namely, a long-standing debate in constructional approaches
to grammar is concerned with which level of generalization best reflects speakers’
knowledge of constructions. In the case of argument structure, earlier constructional
approaches (cf. Fillmore and Kay ms.; Goldberg 1995) sought to cast the broadest
generalizations possible by positing one single very abstract meaning accounting for
all instances of the construction, either directly or through an extension of the constructional meaning. However, more recent research questions this commitment and
emphasizes the importance of lower levels of generalizations to appropriately account
for the distribution and meaning of constructions; see, for example, Boas’ (2003) concept of “mini-constructions” to account for English resultatives, and of course Croft’s
(2003) proposal for “verb-class-specific constructions” (cf. also Fillmore 2001; Glynn
2004). Of course, the debate “general vs. local” might appear null and void in a truly
constructional account, in which both abstract schemas and their various elaborations can be stored at any level of generality. But if a number of local generalizations
alone account for what appears at first sight to be a single general construction, this
raises the question of whether the overarching construction is needed at all, all the
more so if the local generalizations provide a better account in terms of accuracy
and coverage. This is precisely what happens with the conative construction: to the
extent that speakers attend to frequently occurring verbs in some syntactic context,
and use that information to “get a ‘fix’ on the construction’s meaning” (Goldberg
2006: 92), they can usefully exploit this lexical semantic information only at the level of verb-class-specific constructions. Under this view, a verb appears to be a collexeme
of the general construction only because it is, first and foremost, a collexeme of a
verb-class-specific construction. In sum, the results of this study suggest a different
view of the polysemy of the conative construction, which can plausibly be extended
to other constructions. The various meanings of the conative construction are better
seen not as a network of related senses, but as a cluster of low-level generalizations
over similar verb meanings, in line with Croft’s (2003) proposal. As a reviewer points
out, it thus would seem as if we are actually dealing with a case of constructional
homonymy, i.e. several constructions sharing the same form but conveying different
meanings. However, the possibility that these verb-class-specific constructions might
be, at least to some extent, unified under a higher-level generalization should not be
entirely rejected. The fact that low-level generalizations can determine the semantic
contribution of the syntactic pattern for verbs of the semantic class does not exclude
the possibility of cross-generalizations between different classes. First, if several distinct verb classes receive the same semantic contribution (which is plausible, since the
conative construction conveys a wide yet still limited range of meanings), they could
form a single higher generalization, which in turn could be used to produce new
combinations. Second, patterns of analogy between different classes might well play a
major role in determining the distribution and in helping speakers get at the correct
interpretation, forming generalizations of intermediate scope. The generalizations accounting for the conative construction could well be centered on a few classes first,
from which an abstract meaning could be extracted and applied to other verbs and
classes. Such a scenario is probably necessary to explain the inclusion of “orphans”, i.e.
verbs whose semantic class does not have any other representative in the distribution.
Obviously, there is still much to learn about the workings of constructional generalizations. I hope, however, to have presented in this chapter a promising application
of collexeme analysis to understand the mechanisms of constructional abstraction
and the possible underlying representations on the basis of corpus data.
References
Boas, H. (2003). A constructional approach to resultatives. Stanford: CSLI Publications.
Broccias, C. (2001). Allative and ablative at-constructions. In M. Andronis, C. Ball, H. Elston,
& S. Neuvel (Eds.), CLS 37: The main session: Papers from the 37th meeting of the Chicago
linguistic society. Volume 1 (pp. 67–82). Chicago: Chicago Linguistic Society.
Croft, W. (2003). Lexical rules vs. constructions: A false dichotomy. In H. Cuyckens, T. Berg,
R. Dirven, & K. Panther (Eds.), Motivation in language: Studies in honour of Günter
Radden (pp. 49–68). Amsterdam: John Benjamins.
Dixon, R. (1991). A new approach to English grammar: On semantic principles. Oxford:
Clarendon Press.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Fillmore, C., & Kay, P. (MS). Construction Grammar (course reader). University of California,
Berkeley.
Fillmore, C. (2001). Mini-grammars of some time-when expressions in English. In J. Bybee, &
M. Noonan (Eds.), Complex sentences in grammar and discourse: Essays in honor of Sandra
A. Thompson (pp. 31–60). Amsterdam: John Benjamins.
Glynn, D. (2004). Constructions at the crossroads: The place of construction grammar between
field and frame. Annual Review of Cognitive Linguistics, 2, 197–233.
DOI: 10.1075/arcl.2.07gly
Glynn, D. (2010). Testing the hypothesis: Objectivity and verification in usage-based cognitive
semantics. In D. Glynn, & K. Fischer (Eds.), Quantitative cognitive semantics: Corpus-driven approaches (pp. 239–270). Berlin: Mouton de Gruyter. DOI: 10.1515/9783110226423
Goldberg, A. (1995). Constructions: A construction grammar approach to argument structure.
Chicago: University of Chicago Press.
Goldberg, A. (2006). Constructions at work: The nature of generalization in language. Oxford:
Oxford University Press.
Goldberg, A., Casenhiser, D., & Sethuraman, N. (2004). Learning argument structure generalizations. Cognitive Linguistics, 15(3), 289–316. DOI: 10.1515/cogl.2004.011
Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities, 31(2), 91–
113. DOI: 10.1023/A:1000583911091
Levin, B. (1993). English verb classes and alternations. Chicago: University of Chicago Press.
Pinker, S. (1989). Learnability and cognition: The acquisition of argument structure. Cambridge,
MA: MIT Press.
Stefanowitsch, A., & Gries, St. Th. (2003). Collostructions: Investigating the interaction between words and constructions. International Journal of Corpus Linguistics, 8(2), 209–243.
DOI: 10.1075/ijcl.8.2.03ste
Stefanowitsch, A., & Gries, St. Th. (2005). Covarying collexemes. Corpus Linguistics and Linguistic Theory, 1(1), 1–43. DOI: 10.1515/cllt.2005.1.1.1
Van der Leek, F. (1996). The English conative construction: A compositional account. In
L. Dobrin, K. Singer, & L. McNair (Eds.), CLS 32: The main session: Papers from the 32nd
meeting of the Chicago linguistic society (pp. 363–378). Chicago: Chicago Linguistic Society.
Quantifying polysemy in Cognitive
Sociolinguistics
Justyna A. Robinson
University of Sussex
This chapter uses various statistical techniques to explore the extralinguistic
grounding of individual conceptualisations of polysemous adjectives in English,
such as awesome, gay, wicked. It considers the extent to which individual conceptualisations are non-random and can be related to the socio-demographic
characteristics of the speaker. The experimental survey data collected from 72
speakers is analysed via hierarchical agglomerative clustering, decision tree
analysis, and logistic regression analysis. The results reveal that not only individual adjectives, as indicated in Robinson (2010a), but whole groups of polysemous adjectives currently undergoing semantic change form usage patterns
that can be explained by a very similar sociolinguistic distribution. This study
demonstrates that employing a socio-cognitive perspective when researching
polysemy is hugely advantageous.
Keywords: adjectives, decision tree analyses, hierarchical agglomerative
clustering, logistic regression, semantic variation, semantic change
1. Polysemy
Polysemy, which is usually defined as one form that has several related1 yet distinct2
meanings, has been widely explored in various areas of linguistics (for an overview,
1. Polysemous meanings are related historically (as opposed to homonymous meanings
which are not). From the methodological point of view there are different approaches as to
whether a particular sense is a valid reading of a polysemous category. While certain studies
include historically-related polysemous meanings in their dataset (e.g. Sweetser 1990), other
studies consider only senses that are conceptually related in a given point in time (e.g. Beretta
et al. 2005; Klein and Murphy 2002). Some discussion of diachronically-related senses that can
be perceived as unrelated is available in Geeraerts (1997) and Blank (2003).
2. Challenges and potential solutions for determining the number and boundaries between
polysemous readings are discussed in Dunbar (2001), Geeraerts (1993), Gries (2006), Hanks
Justyna A. Robinson
see Allan and Robinson 2012; Cuyckens and Zawada 2001; Geeraerts and Cuyckens
2007; Lewandowska-Tomaszczyk 2007; Nerlich et al. 2003; Rakova et al. 2007; Ravin
and Leacock 2000; Vanhove 2008). Although polysemy was traditionally associated
with the lexicon, much of the research has shown that polysemy also emerges when
syntactic, morphological, and phonological usage is considered (see e.g. Brugman
1981; Taylor 1995: 142). It has become apparent that polysemy is not just a feature of certain words, but that it is a form of categorisation (see e.g. Taylor 1995: 99; Lakoff 1987: 12). Therefore, much of the research on polysemy has since focused on learning more about patterns of human categorisation. Some of the key observations of polysemous usage indicate that knowledge is categorised in terms of family resemblance models, with less central meanings clustered around a prototype (see e.g. Geeraerts 1989, 1993; Janda 1990; Lakoff 1987: 379).
Since the majority of these observations are drawn from the analysis of intralinguistic data only, little is known about the extent to which the categorisation of specific linguistic events varies between speakers in the same community.
For instance, can we assume that the prototypical centre of a polysemous category
will be the same for each speaker in a given community? Much research on language variation indicates that linguistic usage does indeed differ between speakers
in a community (Chambers et al. 2002; Coulmas 1997; Fought 2004; Labov 2001).
These studies demonstrate that the way people speak may be predicted, for example,
from their profession, the area in which they live, their social networks (Milroy 1980,
1987), practices they engage in (Eckert 2000) or identities they construct and adopt
(Bucholtz 2011). Theoretically, Cognitive Linguistics agrees with the social grounding
of linguistic variation by arguing for the experiential and perspectival nature of meaning (Geeraerts 1993: 60). However, little research has been done to relate linguistic
usage patterns directly to extralinguistic categories and to account for the socio-cultural grounding of categorisation. Notable exceptions are represented in the literature
by Geeraerts et al. (1994), Geeraerts et al. (2010), Kristiansen and Dirven (2008),
Pütz et al. (2012a), Pütz et al. (2012b), Pütz et al. (2014), Reif et al. (2013), Robinson
(2012a), and Robinson (2012b).
2. Scope of the study
The current chapter contributes to the discussion of the extralinguistic grounding
of individual conceptualisations of polysemous categories. It considers the extent to
which individual conceptualisations are non-random and can be related to the socio-demographic characteristics of the speaker. The current chapter also demonstrates
(2000), Kilgariff (1997), Krishnamurthy and Nicholls (2000), Lehrer (1990), and summarised
in Lewandowska-Tomaszczyk (2007), and Ravin and Leacock (2000).
how various statistical techniques (i.e. hierarchical agglomerative clustering, logistic
regression, and decision tree analyses) may be employed in order to enhance the way
investigations of polysemy are carried out.
The current chapter elaborates on my earlier study (Robinson 2010a) that
demonstrated the benefits of implementing a sociolinguistic perspective in cognitive research on polysemy. In that study, I found that the usage of innovative or
conservative senses of the adjective awesome can be predicted from the socio-demographic characteristics of a speaker. For example, the use of awesome ‘terrible’ is predictable for speakers aged 60 or older. Not only do these findings
support cognitive linguistic understanding of polysemy but they also indicate that
significant differences exist in the extent to which the different meanings of the same
semantic category are salient for different speakers. Although these findings provide
compelling evidence for the existence of socio-semantic usage patterns, the conclusions regarding systematic, socially-grounded usage can only be generalised as far as
the adjective awesome is concerned. What remains to be verified is whether speakers’
usage of other similar words (e.g. other adjectives currently undergoing change) is
structured in a similar way.
The current chapter addresses this issue by investigating the usage of eight polysemous adjectives that are presently undergoing change.3 Firstly, I establish whether any
meaningful patterns can be detected when the usage of several polysemous adjectives
is considered simultaneously. Provided that meaningful usage patterns emerge, I then
determine whether these are non-randomly related to the socio-demographic characteristics of the speakers who use these categories. In order to achieve the aims of the
research, various exploratory and confirmatory statistical techniques are implemented. Thus, after introducing the data (Section 3), I summarise the aims of the hierarchical agglomerative cluster analysis (Section 4) and apply this exploratory method to
analyse the dataset (Section 5). This analysis is supplemented with confirmatory analyses involving logistic regression (Section 6) and a decision tree analysis (Section 7)
before the conclusions are drawn (Section 8).
3. This chapter is based on my doctoral research (Robinson 2010b). I would like to thank
the University of Sheffield for funding this research project and Joan Beal, Ewa Dąbrowska,
Philip Durkin, and Susan Fitzmaurice for generous advice and guidance on many aspects of
this research. I would also like to thank Christopher S. Butler, Dagmar Divjak, and anonymous
reviewers for comments on the earlier version of the current chapter. All other shortcomings
are mine.
3. Data and method
Initial steps in the research follow those presented in Robinson (2010a). Eight polysemous adjectives currently undergoing change (Table 1) and five controlling adjectives have been chosen for the study. Available corpora (the British National Corpus
(henceforth, BNC) and the Oxford English Corpus (henceforth, OEC)) and dictionaries (the Oxford English Dictionary (henceforth, OED)) indicate that the investigated
adjectives have recently developed a distinctive meaning in British English. For a few
of them, a potentially disappearing meaning has also been identified (see Table 1).
In order to determine the usage patterns of these adjectives in a speech community, I carried out interviews with 72 speakers from South Yorkshire, UK. The speaker sample was equally representative of both men and women, different age groups
(11–94 years old), and socio-economic backgrounds. Each of the speakers was asked
a series of questions aimed at eliciting the most salient usage of polysemous adjectives. These questions followed a schema of asking for a referent that could be best
described with the adjective in question, as shown in the following example:
(1) Interviewer: Who or what is wicked?
Participant: My mum. (referent)
Interviewer: Why is your mum wicked?
Participant: Because she lets me play my music as loudly as I want to.
(justification for use)
Each participant provided, on average, three instances of use of the investigated adjectives, which yielded more than seventeen hundred cases for analysis (excluding those for the controlling adjectives).
The information obtained on both the referents and justification for the use of
each adjective allowed me to put individual responses into groups of similar usage.
Each of these usage groups was then given a sense label. This sense label was mainly
Table 1. The summary of potentially disappearing and emerging senses of the investigated adjectives

Adjective   Incoming meaning    Potentially disappearing meaning
awesome     great               terrible
chilled     good                –
cool        good/trendy         –
fit         attractive          –
gay         lame                happy
wicked      good                mean
solid       hard, tough         –
skinny      ‘latte’/low fat     –
Table 2. Example of the database structure

Participant   wicked ‘good’   wicked ‘evil’   awesome ‘great’   awesome ‘terrible’
Speaker A           2               1                3                  0
Speaker B           3               0                2                  0
Speaker C           0               2                0                  2
Speaker D           0               3                1                  3
derived by matching the elicited usage against the sense citations given in the dictionaries. For instance, Example (1) above would be generalised with the
meaning wicked ‘good’. Another sense group that emerged for the adjective wicked
was wicked ‘evil’. Occasionally, speakers indicated that they were aware of a certain
use of an adjective but they clearly distanced themselves from using this sense. In
such cases, a category of a ‘reported’ sense was introduced.4 A category labelled ‘N/A’
was introduced in order to account for overlapping senses that could not be reliably
assigned to any of the above groups or for other problematic answers.5 The raw frequency of use of each sense for each participant was recorded in the database.
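The construction of such a speaker-by-sense frequency database can be sketched as follows. All speaker names, adjectives, and counts here are invented for illustration; the real database covers 72 speakers and every elicited sense:

```python
from collections import Counter

# Hypothetical elicited responses, one (speaker, adjective, sense) per use
responses = [
    ("Speaker A", "wicked", "good"), ("Speaker A", "wicked", "good"),
    ("Speaker A", "awesome", "great"),
    ("Speaker C", "wicked", "evil"), ("Speaker C", "awesome", "terrible"),
]

# Raw frequency of each (speaker, sense-label) pair
counts = Counter((spk, f"{adj} '{sense}'") for spk, adj, sense in responses)

# One row per speaker, one column per sense label, zero if unused
sense_cols = sorted({sense for _, sense in counts})
table = {spk: [counts[(spk, s)] for s in sense_cols]
         for spk in sorted({spk for spk, _ in counts})}
```

Each row of `table` is then a frequency vector of the kind shown in Table 2, ready to be fed to a clustering procedure.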
The next step in the analysis was to verify if any common usage patterns emerge
across all senses used by participants. Usage patterns can be determined by identifying clusters in which two or more senses are used similarly (frequently or infrequently) by a number of speakers. Let us examine Table 2 to illustrate this. Table 2 presents
a mini database that is structured in a similar way to the one used in the current study.
One can observe that speakers A and B use awesome ‘great’ and wicked ‘good’
more frequently than other senses and that speakers C and D use awesome ‘terrible’
and wicked ‘evil’ more frequently than other senses of the adjectives. Thus, one may
conclude that two distinct usage patterns emerge for two different groups of speakers:
one for speakers A and B and another one for speakers C and D. In order to establish
usage patterns in a large database, like the one used in the current study, I used the
exploratory technique of hierarchical agglomerative cluster analysis (hereafter, HAC).
Once semantic usage patterns are established, I consider the question of why
these senses cluster together.6 Two variants are considered. Since the adjectives used
4. This separation of ‘reported’ senses from ‘non-reported’ senses is made for practical
reasons, as it can potentially show where in a community constraints on usage emerge. It is
not meant to suggest that these are two different senses.
5. For instance, when the adjective gay was used as the female name Gay.
6. Before one starts to analyse the visual output of a cluster analysis, it must be stressed that
every HAC will always cluster elements together even if the clustering makes no sense. In other
words, HAC will impose a structure on any data even if such a structure does not exist (also see
Divjak and Fieller, this volume).
Justyna A. Robinson
in this study are undergoing change, one option is to assess whether senses that cluster
together are similar in terms of the historical information we have about them. Do
recently developed senses group separately from historically older senses? Another
possibility is to consider whether speakers who share the same socio-demographic
characteristics (in terms of age, gender, education, etc.) use the same combinations or
clusters of senses. The findings of this analysis help to answer the question of whether there are any systematic semantic usage patterns and whether these are socially
grounded. In order to find out more information about the users of the senses, I employed two confirmatory techniques: logistic regression and answer tree analysis.
4. Hierarchical agglomerative clustering
There are a number of exploratory techniques used by social scientists that can help to
find groups in data, such as principal component analysis, factor analysis, and different types of cluster analysis. These techniques differ in respect to their aims. Principal
component analysis and factor analysis are methods for reducing the dimensionality
of data (summarising the information in a complete set of variables using fewer variables), whereas cluster analysis aims to organise observations in meaningful structures.
I use the exploratory technique of HAC in the current study because I am not interested in data reduction (as the dataset contains a comparatively small number of dimensions) and I intend to investigate groups in the data which are non-randomly similar.
HAC belongs to a family of multivariate exploratory statistical methods (i.e.
non-hypothesis-testing) for finding groups in data based on measured characteristics. HAC starts with each case in a separate cluster and then combines the clusters
sequentially, reducing the number of clusters in each step until only one cluster is left.
This hierarchical clustering process can be represented as a dendrogram, where each
step in the clustering process is illustrated by a fork in the tree diagram. A detailed
discussion of HAC as well as other types of cluster analysis is presented in Divjak and
Fieller (this volume).
The dendrogram in Figure 1 is an example of how senses from Table 1 cluster7
when their usage evidence for awesome ‘great’ (meaning 1), awesome ‘terrible’ (meaning 2), wicked ‘good’ (meaning 3), and wicked ‘evil’ (meaning 4) is inputted into the
cluster analysis.
The numbers visible at the junctions of the dendrogram represent the distances at which clusters fuse together in the hierarchical cluster analysis. In this example, meanings 1 and 3 are combined at a fusion value of 2,
whereas meanings 2 and 4 are combined at a fusion value of 1. The elements that are
clustered earlier (represented by lower fusion values) are more closely related than
7. Distance measure used: phi-square; amalgamation strategy used: Ward.
Quantifying polysemy in Cognitive Sociolinguistics
Figure 1. HAC of the senses presented in Table 1
[Dendrogram: awesome ‘great’ and wicked ‘good’ fuse at a value of 2; awesome ‘terrible’ and wicked ‘evil’ fuse at a value of 1.]
elements that are clustered later (represented by higher fusion values). The cluster
analysis groups senses that are used in a similar way by different people. This dendrogram indicates that different people in the research sample are mostly using the same
combination of senses 1 and 3 (1+3) and 2 and 4 (2+4), rather than other combinations such as (1+4), (2+3), or (1+2+3).
5. Hierarchical agglomerative cluster analysis of collected data
Having briefly outlined what constitutes HAC, I move on to discuss details of the
computational steps of performing HAC on the current dataset. The analysis was performed using software called ClustanGraphics 7.05 (hereafter, Clustan).
5.1 Selection of polysemous adjectives
Table 3 presents the eight adjectives that are included in the HAC together with relevant meanings (a total of thirty-five meanings). Milligan and Cooper (1986, cited
in Everitt et al. 2001: 179) suggest that only variables that are expected to form
clusters should be included in the analysis. Irrelevant or masking variables should be
excluded, if possible. Therefore, controlling adjectives and meanings grouped in the
category ‘N/A’ are excluded from the HAC.
The raw frequencies of the use of different senses are recorded in an SPSS database (following the format of the example in Table 2). Certain types of variables need
transforming or standardising before the HAC is run (e.g. standardising to z-scores).
However, there is no need to standardise the variables in the current study as their
scales do not differ. For more information on standardising variables for cluster analysis, see Divjak and Fieller (this volume).
5.2 Dissimilarity matrix
The next step involves generating a dissimilarity matrix which shows the distances
between items. This procedure calculates either the similarities or dissimilarities (also
Table 3. Adjectives and their meanings explored in the HAC

Adjective   Meanings
awesome     ‘great’; ‘terrible’; ‘impressive’
chilled     ‘relaxed’; ‘calm, collected’; ‘cold’; reported ‘relaxed’
cool        ‘good, trendy’; ‘calm, collected’; ‘cold’; reported
fit         ‘attractive’; ‘athletic’; ‘healthy’; ‘suitable’; reported ‘attractive’
gay         ‘lame’; ‘unmanly’; ‘homosexual’; ‘happy’; reported ‘lame’; reported ‘homosexual’; reported ‘happy’
skinny      ‘thin’; ‘showing skin’; ‘tight fitting’; latte ‘low fat’; ‘mean’
solid       ‘hard’ (person); ‘hard’ (object); ‘of one substance’; ‘dependable’
wicked      ‘good’; ‘evil’; reported ‘good’
referred to as distances or proximities), either between pairs of variables or between
pairs of cases. Between-variable dissimilarities are generated using SPSS 18 and they
are copied into Clustan. Distances can be measured by using different metrics according to different data types (see Divjak and Fieller, this volume). For instance, proximities in count data in the current dataset are measured with phi-square.
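As a rough sketch of how such a dissimilarity could be computed, the function below follows the usual definition of SPSS's phi-square (PH2) measure: the chi-square statistic of the 2 × k table formed by two frequency vectors, normalised by their combined frequency. The exact formula is not restated in this chapter, so treat this as an assumption about the measure, not a reproduction of the study's computation.

```python
import numpy as np

def phi_square(x, y):
    """Phi-square dissimilarity between two frequency vectors (PH2 sketch)."""
    # Stack the two vectors as a 2 x k contingency table and drop senses
    # that neither vector attests (their expected counts would be zero).
    table = np.array([x, y], dtype=float)
    table = table[:, table.sum(axis=0) > 0]
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    # Normalising chi-square by the combined frequency and taking the
    # square root mirrors the PH2 measure as usually defined.
    return np.sqrt(chi2 / n)

# Identical usage profiles lie at distance 0; fully disjoint ones at 1.
print(phi_square([2, 1, 3, 0], [2, 1, 3, 0]))  # -> 0.0
print(phi_square([1, 0], [0, 1]))              # -> 1.0
```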
5.3 Amalgamation strategy
Once distances between items are calculated, an amalgamation strategy is applied.
This procedure involves using one of a few available algorithms (for a summary, see
Divjak and Fieller, this volume) that define how separate elements are to be clustered.
In the current dataset, the algorithm Increase in Sum of Squares (sometimes called
Ward)8 is chosen. This method merges the two elements whose merging least increases their sum of squared deviations from their mean. The Ward algorithm is considered to be a robust method for data classification, although it is sensitive to outliers (Gries 2007). This method also appears to have affinity with other linguistic research, e.g. Beitel et al. (2001), Divjak and Gries (2006), Gries and Stefanowitsch (2006).

8. Other algorithms have also been considered (single linkage, complete linkage, and average linkage). The basic operation of such methods is similar, as they fuse individuals or groups of individuals that are closest or most similar. Differences arise because the various methods define the distance between individuals or groups of individuals in different ways (see Everitt et al. 2001: Chapter 3). Scholars agree that there is no single best amalgamation method (Everitt et al. 2001: Chapter 8; Moisl 2009; Tan et al. 2006: 639–642).
There is one more step to complete before we can start analysing the dendrogram.
The order of the original data can also influence the amalgamation of values. In order
to obtain the optimal ordering of the cases, I follow the ‘serialize procedure’. This
procedure yields the best order of variables that can be obtained from the current
proximity matrix.
5.4 Dendrogram
An HAC of the meanings of the adjectives resulted in the dendrogram presented in
Figure 2. The vertical layout of the dendrogram in Figure 2 means that we analyse it
from left to right. The HAC of the current dataset clustered senses that are used in
similar ways by the same people.
For example, the fact that wicked ‘good’ and awesome ‘great’ clustered together
means that a number of people who used wicked ‘good’ in the interview were also likely to use awesome ‘great’. Skinny ‘thin’ and awesome ‘great’ are fused later and belong to
different clusters. This does not mean that no participant exhibited the use of these two
senses. Instead, it just means that people using skinny ‘thin’ were less likely to also use
awesome ‘great’ in the same interview.
5.5 Best-cut
After generating a dendrogram, one needs to decide which subclusters are meaningful and should therefore be highlighted for analysis. There are many “rules of thumb”
for analysing cluster levels, but I will briefly discuss two approaches to this process. First of all, one may delimit the borders of individual clusters
based on the structure of the dendrogram and perceived (dis)similarities between
variables. This procedure involves considering whether various subclusters make
sense from the point of the intuitive (dis)similarities between data and the scope of
the investigation. Another possibility is to employ statistical measures to determine
the best number and size of clusters in the dendrogram (also called best cut). Ideally,
conclusions from introspection and statistical analysis should overlap (although this
is not always the case). The last scenario is that none of the possible divisions of the
dendrogram make sense in the context of a given research question (statistically and
through introspection). This is always a possibility since “cluster analysis can create as
well as reveal structure” (Breckenridge 2000: 261) (cf. Divjak and Fieller, this volume).
Initial inspection of the dendrogram in Figure 2 indicates that three large clusters
could be delimited (see highlighted clusters in Figure 3), the top one being more independent from the remaining two. This three-cluster solution seems sensible in the
Figure 2. Dendrogram of clustered senses (produced with Clustan™)
[Dendrogram leaves, top to bottom: TokensGayLame, TokensWickedGood, TokensAwesomeGreat, TokensCoolGoodTrendy, TokensChilledRelaxed, TokensFitAttractive, TokensGayReportedLame, TokensSolidHardPerson, TokensFitReportedAttractive, TokensSolidOfOneSubstance, TokensAwesomeImpressive, TokensGayUnmanly, TokensSkinnyShowingSkin, TokensSkinnyTightFitting, TokensFitAthletic, TokensSkinnyThin, TokensGayHomosexual, TokensSolidHardObject, TokensWickedEvil, TokensGayReportedHappy, TokensChilledCalmCollected, TokensWickedReportedGood, TokensCoolReported, TokensSolidDependable, TokensSkinnyLatte, TokensFitSuitable, TokensCoolCold, TokensGayHappy, TokensCoolCalmCollected, TokensFitHealthy, TokensGayReportedHomosexual, TokensSkinnyMean, TokensAwesomeTerrible, TokensChilledCold, TokensChilledReportedRelaxed.]
Figure 3. Dendrogram with three- and seven-cluster divisions (produced with Clustan™)
[The same dendrogram as in Figure 2, with the three-cluster and seven-cluster partitions highlighted.]
light of the current research project. Taking into consideration diachronic information on the usage of individual meanings (cf. Table 1), one can notice that each of the
three clusters groups senses that are of different historical depth. Thus, the top cluster
includes novel senses, the bottom cluster largely includes senses that are considered
to be disappearing, and the middle cluster mostly represents diachronically ‘middle’
senses (neither recent innovations, nor necessarily disappearing senses). Moreover,
the most recent and the oldest senses are grouped into two clusters positioned at
the extreme ends of the dendrogram. These visual characteristics indicate that these
two clusters are substantially different from each other in terms of usage/people who
use them.
The statistical delineation of clusters was carried out by following the best-cut
procedure. Best cut in the data was established by using a significance test (upper tail
rule, cf. Mojena 1977) on the fusion values at every stage in which the clusters join
together in the dendrogram. Best cut indicates the level at which the change in fusion
values is significant for most groups. This is then displayed by highlighting partitions
on the dendrogram (these clusters are delimited with squares in Figure 3). The chosen partition corresponds to the largest number of clusters that is significant at the 5% level.
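A minimal sketch of an upper-tail stopping rule in the spirit of Mojena (1977) is given below. The toy data, the constant k = 1.25 (one commonly cited choice), and the use of scipy are all illustrative assumptions, not the Clustan implementation of the best-cut test.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: two tight groups of points, so the final fusion is a big jump.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
heights = linkage(data, method="ward")[:, 2]   # fusion values, ascending

# Upper tail rule: flag fusion values that exceed the mean fusion value
# by more than k standard deviations.
k = 1.25
threshold = heights.mean() + k * heights.std(ddof=1)
significant = np.nonzero(heights > threshold)[0]

# Cut the tree just before the first flagged fusion: the number of
# clusters is the number of leaves minus the merges completed so far.
n_clusters = len(data) - significant[0]
print(n_clusters)  # -> 2 for this toy data
```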
The best-cut procedure indicates that the seven-cluster solution turns out to be
statistically significant. The seven-cluster solution breaks Cluster 2 down into two
further clusters and Cluster 3 into four further clusters, whereas Cluster 1 remains
unchanged. At first sight, it is less apparent why these sub-clusters would be separated.
One potential explanation could involve historical information on the usage of some
of these subclusters. Therefore, one could suggest that the cluster (skinny ‘mean’ and
awesome ‘terrible’) contains disappearing senses.
At this point, a decision needs to be made as to the level (seven or three clusters)
at which to carry out further analysis. This initial exploratory analysis shows that
the three-cluster solution already seems to exhibit interesting sense groupings. Going
deeper into subclusters (seven clusters) might yield more detailed, but not necessarily
as relevant, information (from the point of view of the current research project) on
groups of variables. Besides, one rule of thumb says that one should distinguish as few
clusters as possible. In the current chapter, I present the analysis at the level of three
clusters only.9
9. The analysis of clusters at the level that a researcher intuitively considers appropriate (three
clusters) may lead to ignoring interesting nuances in use that can be revealed at a more detailed
‘best-cut’ level of seven clusters. Therefore, the analysis of the seven-cluster solution of the current data is presented in Robinson (2010b).
5.6 Validation of clusters
The validation of a given clustering involves a series of procedures that determine the
robustness of the present solution for making predictions. Different studies suggest various ways of assessing the validity of a given clustering (see Duda et al. 2001: 557–559;
Everitt et al. 2001: Chapter 8; Moisl and Jones 2005; and Tan et al. 2006: 532–555). In
the current study, I examine both the internal stability and the external validity of the
present cluster solution following suggestions presented by Clatworthy et al. (2005).
5.6.1 Confirmatory analysis: Internal stability
This first step in the validation procedure aims at answering the question of whether
a given cluster solution can be replicated on new data. In order to validate the clustering in Figure 2, I run a tree validation procedure (also called bootstrap validation),
which involves a series of random trials on randomised proximities. Each trial generates a different dendrogram for the given data and the series of trials provide a mean
dendrogram and confidence intervals. The validation is achieved by comparing the
initial clustering with the clustering in the randomised dendrogram. The validation
procedure confirms that the obtained dendrogram and the best-cut division of the
dendrogram can be replicated even on randomised data.
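The idea behind such randomisation trials can be sketched as follows. This is not Clustan's tree-validation procedure, only a simplified permutation test on invented data: the real clustering is compared with clusterings of data shuffled within each column.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Invented data with a clear two-group structure.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.2, (6, 3)), rng.normal(4, 0.2, (6, 3))])
observed_top = linkage(data, method="ward")[-1, 2]   # last fusion height

# Re-cluster data whose values are shuffled within each column; this
# destroys any row structure while keeping the marginal distributions.
random_tops = []
for _ in range(200):
    shuffled = data.copy()
    for col in range(shuffled.shape[1]):
        shuffled[:, col] = rng.permutation(shuffled[:, col])
    random_tops.append(linkage(shuffled, method="ward")[-1, 2])

# If the observed top fusion exceeds nearly all randomised ones, the
# clustering is unlikely to be an artefact of HAC imposing structure.
p = np.mean(np.array(random_tops) >= observed_top)
print(p)
```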
5.6.2 Confirmatory analysis: External validity (sociolinguistic analysis)
As Clatworthy et al. (2005: 333) point out, the internal stability of clusters is not sufficient evidence to determine the value of a cluster solution. External validity procedures are employed to determine whether a set of external predictors (i.e. variables
which were not included in the clustering process) can be associated with the obtained cluster solution and whether the same cluster solution can be replicated from a
new independent sample. This approach is considered to be one of the better ways to
validate cluster solutions (Aldenderfer and Blashfield 1984: 66).
In Sections 6 and 7, the external validity of the cluster solution is examined. I verify whether any information about the speakers (their age, gender, etc.) who provided
the usage data for this study could explain the way in which the senses in Figure 2
are grouped. This practically means employing other statistical measures to validate
the use of clusters (linguistic variables) against external variables (socio-demographic
variables). In order to perform external validation, multivariate techniques (logistic
regression modelling and decision tree analysis) are employed.
From a statistical point of view, verifying the external stability is a necessary element of cluster analysis. From the viewpoint of the current study, this procedure can
be considered as a way of testing whether any meaningful semantic usage patterns
emerge from analysing sociolinguistic variation. This naturally leads to gaining more
insights as to whether conceptualisations of polysemous categories are non-randomly
grounded in socio-cognitive contexts.
5.6.3 Summary of cluster-variables
Each of the three main clusters represents a usage pattern which is treated as a linguistic variable. To suit further statistical analyses, each of these conceptual
pattern-variables needs to be recoded as a binary variable. In order to do this, the counts of senses belonging to a given cluster are added
up. Following this, by means of the Visual Bander10 (SPSS 18), a new variable is created that reflects the characteristics of a whole cluster. This is done by collapsing
a large number of ordinal categories into a smaller set of categories, representing low
and high usage of a given cluster of senses. Table 4 presents the summary of the dependent variables used, along with associated coding and category information.
Table 4. Linguistic variables (dependent variables) used to investigate their association with socio-demographic variables (independent variables)

Dependent variable   Coded as   Categories
Cluster 1            1, 0       High use / Low use
Cluster 2            1, 0       High use / Low use
Cluster 3            1, 0       High use / Low use
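The summing and banding steps can be sketched in pandas. The speaker counts, the cluster membership, and the median cut-point below are all hypothetical: SPSS's Visual Bander leaves the choice of cut-points to the analyst.

```python
import pandas as pd

# Hypothetical per-speaker counts for two senses assumed to belong to
# one cluster (names and figures are illustrative, not the study's).
df = pd.DataFrame(
    {"wicked_good": [2, 3, 0, 0], "awesome_great": [3, 2, 0, 1]},
    index=["A", "B", "C", "D"],
)

# Step 1: add up the counts of all senses belonging to the cluster.
df["cluster1_total"] = df["wicked_good"] + df["awesome_great"]

# Step 2: collapse the ordinal totals into a binary high/low variable.
# A median split is just one possible banding choice.
cutoff = df["cluster1_total"].median()
df["cluster1_high"] = (df["cluster1_total"] > cutoff).astype(int)
print(df["cluster1_high"].tolist())  # -> [1, 1, 0, 0]
```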
The following independent variables are considered in the analyses: age group,
gender, education, National Statistics Socio-Economic Classification score for a participant’s profession11 (hereafter, NSEC), and a postcode or a neighbourhood variable,
Table 5. Socio-demographic variables (independent variables) used to investigate their association with the use of different clusters of senses (dependent variables)

Independent variable   Coded as          Categories
Age group              (1, 2, 3, 4)      Up to 18; 19–30; 31–60; Over 60
Gender                 (1, 2)            Male; Female
NSEC                   (1, 2, 3)         Higher; Medium; Lower
Education              (1, 2, 3, 4, 5)   Schooling prior to the age of 16; Secondary school; College; University; Current student
Postcode               (1, 2, 3)         Lower property prices; Middle property prices; Higher property prices
10. A visual bander is an SPSS tool to recode values of a variable into groups. Data frequently
need to be manipulated before analyses are conducted. Data may need to be recoded, computations may need to be made, new variables may need to be created, or certain records may need
to be selected (for more information, see Einspruch 2005).
11. See Office for National Statistics: Standard Occupational Classification (2000) for details.
which is based on property values in areas defined by the postcode of a participant’s
residence. For a summary of the coding of the independent variables, see Table 5.
6. Logistic regression
Having explored the data via HAC, confirmatory statistical analyses are carried out
in order to validate emergent groups in the data. More specifically, I aim to assess
the overall effect of socio-demographic categories on the use of particular clusters
of meanings, hypothesising that the clustering solution can be explained by the categories of age, gender, NSEC, education and/or postcode value. In order to address
the above-mentioned aim, a multifactor statistical model is employed, i.e., a model
that considers several external factors simultaneously and measures their effect on the
use of each cluster. In addition, the appropriate statistical approach needs to allow us
to check for confounding variables. Socio-demographic factors may constitute such
cases, i.e., education and occupation may be confounded, as people who are more
educated are likely to have better jobs.
Logistic regression analysis fulfils these requirements. Logistic regression can be
used to test hypotheses about the relationship of several independent variables to a
dichotomous dependent variable (see Hosmer and Lemeshow 1989; Kleinbaum 1994;
Speelman (this volume); Tabachnick and Fidell 2001 for introductions to logistic
regression). Logistic regression is increasingly being used in linguistic studies (e.g.
Benki 1998; Bresnan et al. 2007; Kallel 2007; Levshina et al. (this volume); Tummers
et al. 2004).
The logistic regression model also allows for estimating odds ratios for each of the
independent variables in the model. For instance, one may establish how many times
a given meaning is more likely to be used by age group <up to 18> than by age group
<19–30>. Logistic regression also provides information on variance (the percentage of variation in the dependent variable that is explained by the independent ones) and is used to determine the importance of independent variables.
In the current study, logistic regression is performed using SPSS 18 to assess
the overall effect of socio-demographic factors (independent variables) on the use
of clusters. All responses (including missing values) for seventy-two participants are
included in the analysis. All sociolinguistic factors are entered into the model, and
then the factors are examined to verify whether they meet removal criteria using a
forward stepwise method. The final model is established once no further variables are
eligible for removal. The final model is then reported. The resultant fitted model informs us about significant changes in regression coefficients (expressed as B) between
predictors.
In cases where a stable regression model could not be established and a final solution could not be found (even by modifying model criteria, such as increasing the
number of iterations: i.e. the series of approximations used by the logistic regression),
I obtain insights into investigated variation by using multivariate statistical modelling
based on decision trees (for more details, see Section 7).
6.1 Logistic regression of Cluster 1
Logistic regression analysis is performed to verify the hypothesis that the likelihood
of high use of Cluster 1 can be modelled from speakers’ age, gender, NSEC, education
and/or postcode (neighbourhood). Logistic regression analysis on Cluster 1 yields an
unstable solution, so one cannot make predictions regarding the use of this cluster.
Nevertheless, interesting findings can be obtained from examining decision trees (see
Section 7.1).
6.2 Logistic regression of Cluster 2
Logistic regression analysis is performed to verify the hypothesis that the likelihood
of high use of Cluster 2 can be predicted from speakers’ age, gender, NSEC, education
and/or postcode (neighbourhood) value.
The summary of the logistic regression analysis of the use of Cluster 2 is presented in Table 6. The final model reported includes variables that best account for
the observed variation. Insignificant variables are excluded from the model. Table 6
shows the coefficients of regression Beta (hereafter, B), their standard errors, the Wald
chi-square statistics, associated p-values, and odds ratios.12 The resultant fitted model
indicates which independent variables are included in the final logistic model. It also
informs us about significant changes in regression coefficients (B) between predictors. B determines the direction of the relationship between a given predictor and
the dependent variable (the use of Cluster 2). If B is positive, the odds for the use of
Cluster 2 are increased; when B is negative, then the odds are decreased; B equalling
0 leaves the odds unchanged. Explanations of the indicator variables can be found
directly under each regression table.
Model summary. According to the model, a high use of Cluster 2 can be modelled
from speakers’ age (p = .005) and NSEC (p = .005).
Age. The most significant differences of use exist between age groups <19–30>
and <up to 18> (p = .001, B = –5.752), and also between age groups <over 60> and
<31–60> (p = .009, B = 3.407). This means that ‘middle’ age groups speak more similarly to each other. The analysis of probability measures (hereafter, P) presented in
12. For further discussion of stepwise regression, regression coefficient, iterations and the output of the logistic regression analysis in SPSS, see Brace et al. (2006) and Norušis (1999).
Table 6. Summary of the logistic regression analysis of Cluster 2

Variables      B        S.E.    Wald     df   p      Odds ratio
AgeGroup                        12.907   3    .005
AgeGroup(1)a   –5.752   1.709   11.333   1    .001   .003
AgeGroup(2)b   –.313    .841    .138     1    .710   .732
AgeGroup(3)c   3.407    1.301   6.855    1    .009   30.179
NSEC                            10.564   2    .005
NSEC(1)d       3.905    1.467   7.085    1    .008   49.672
NSEC(2)e       1.010    .929    1.180    1    .277   2.745
Constant       –.056    .380    .021     1    .883   .946

a: change between the age group <19–30> in relation to the age group <up to 18>
b: change between the age group <31–60> in relation to the age group <19–30>
c: change between the age group <over 60> in relation to the age group <31–60>
d: change between the NSEC group <Higher> in relation to the NSEC group <Medium>
e: change between the NSEC group <Medium> in relation to the NSEC group <Lower>
Table 7. Probability values used for logit of Cluster 2

Age group   Age group probability   NSEC group   NSEC probability
Up to 18    2.5%                    NSEC 1       94.7%
19–30       88.9%                   NSEC 2       26.5%
31–60       91.6%                   NSEC 3       11.6%
Over 60     26.6%
Table 7 indicates that speakers of age groups <19–30> and <31–60> are most likely to
be high users of Cluster 2 (P = 88.9% and 91.6% respectively).
NSEC. The significant ‘jump’ in B-coefficients exists between NSEC2 and NSEC1
(p = .008, B = 3.905). Speakers who occupy higher occupations (NSEC1) are most
likely to exhibit higher use of Cluster 2 (P = 94.7%), in comparison to speakers of
middle and lower occupations (P (NSEC2) = 26.5%, P (NSEC3) = 11.6%).
These findings are graphically presented in the form of logistic function estimate values (logit) in Figures 4 and 5. The bars with positive values (above 0) represent categories of independent variables (age group and NSEC, respectively) whose occurrence corresponds to a higher probability of high use of Cluster 2; the taller the bar above 0, the higher that probability. The bars with negative values (below 0) represent categories whose occurrence corresponds to a higher probability of ‘not high’ use of Cluster 2; the lower the bar, the higher that probability.
In the logistic regression analysis, the predictive and explanatory power of the
fitted model needs to be assessed. In order to validate predicted probabilities, the
Figure 4. Logistic function estimate values for age group in Cluster 2
[Bar chart of logit values; categories: up to 18, 19–30, 31–60, over 60; vertical axis from –4 to 3.]
Figure 5. Logistic function estimate values for NSEC in Cluster 2
[Bar chart of logit values; categories: NSEC1, NSEC2, NSEC3; vertical axis from –3 to 4.]
c-statistic is used (see Peng et al. 2002: 6). The c-statistic analyses the proportion of observed to initially-predicted probabilities of occurrences of Cluster 2. In the case of Cluster 2, the fitted model (one that includes socio-demographic variables) achieves a success rate of 80.6%. This is an improvement both over the intercept-only model (51.4%), i.e. a model that accounts for the observed variation with a constant term only and none of the socio-demographic variables, and over the model that does not take NSEC into consideration (73.6%).
The explanatory power of the calculated model refers to how effectively it fits the
actual data for estimating the outcome variable (Moss et al. 2003: 925). This can be assessed by a number of ‘goodness-of-fit’ measures. –2 Log Likelihood (hereafter, –2LL) indicates the overall fit of the model; it reflects the significance of the unexplained variance in the model, and lower values indicate an improved model fit (an increased likelihood of the observed results). R-square measurements (Cox
and Snell, Nagelkerke tests) indicate how much variation the model actually explains.
Sometimes these measures may yield different results (for further discussion, see
Field 2005: 239–240). The Hosmer and Lemeshow test is another measure, considered by some researchers to be more accurate for assessing the goodness-of-fit of the model (Peng et al. 2002: 6). It tells us how closely the observed and predicted probabilities match; insignificant results in the Hosmer and Lemeshow test signify a model that fits the data well.
In the case of Cluster 2, –2LL (53.740) and an insignificant Hosmer–Lemeshow
test indicate that the model fits the data well and is more adequate for explaining variation than models that do not consider socio-demographic factors. R-square measurements (Cox and Snell = .472, Nagelkerke = .630) indicate that the variation in the
outcome variable is explained moderately well by the logistic regression model.
Logistic regression analysis shows that the use of Cluster 2 can be satisfactorily modelled from the age and NSEC of speakers, although age group has a more
significant overall effect on the use of the given variable than NSEC. Logistic regression analysis confirms our hypothesis and validates the external stability of Cluster 2.
6.3 Logistic regression analysis of Cluster 3
Logistic regression analysis is performed to verify the hypothesis that the likelihood of
high use of Cluster 3 is modelled from speakers’ age, gender, NSEC, education and/or
postcode (neighbourhood). Logistic regression on this cluster yields an unstable solution, so predictions regarding the use of this cluster cannot be made. Nevertheless,
interesting findings can be observed from examining decision trees (see Section 7.3).
7. Decision tree analysis
In cases where the logistic regression model cannot be established, I use the results of
another multivariate technique.
A decision tree analysis is a technique based on separating cases into segments
that are as different from each other as possible. For instance, with a decision tree
analysis one can easily detect segments and patterns such as ‘female bridge players
with at least 5 years’ experience are likely to win a game’, or ‘students who miss more
than 40 days of school a year are twice as likely to drop out’.
This procedure uses algorithms that predict the class membership of a dependent variable from the values of predictor variables. The choice of algorithm largely depends on the type of data. The algorithm chosen as most appropriate for our analysis is Chi-square Automatic Interaction Detection (hereafter, CHAID).13 This is
13. Other algorithms have also been considered: C&RT and QUEST (for a summary, see SPSS White Paper, Answer Tree Algorithm Summary, 2005). All p-values in the CHAID algorithm analysis were adjusted for multiple comparisons using the Bonferroni method (SPSS Answer Tree 3.0).
106 Justyna A. Robinson
a non-parametric stepwise regression procedure that continues splitting for as long as each split yields a significant p-value. The CHAID algorithm (available via Answer Tree
3.0) is used to examine factors predicting the use of senses in a cluster. It supplements
logistic regression analysis, especially in cases where a logistic regression model cannot be determined. However, this does not mean that the two techniques are simply different ways of answering the same question. The CHAID algorithm identifies groups of speakers
that use similar meanings in a similar way. Logistic regression estimates an overall
effect of an independent variable (i.e. age, gender, or social class of speakers) on the
use of a particular meaning cluster. Therefore, the two methods are two different ways
of looking at the same data.
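The core CHAID step can be sketched as follows: cross-tabulate each candidate predictor with the outcome and select the predictor whose table shows the strongest association. This is a simplified illustration, not the full algorithm: real CHAID also merges predictor categories and compares Bonferroni-adjusted p-values across predictors with differing degrees of freedom, and the variable names below are invented for the example.

```python
from collections import Counter

def chi_square(cases, predictor, outcome):
    """Pearson chi-square statistic for the predictor-by-outcome
    contingency table built from a list of case dictionaries."""
    cell = Counter((c[predictor], c[outcome]) for c in cases)
    row = Counter(c[predictor] for c in cases)
    col = Counter(c[outcome] for c in cases)
    n = len(cases)
    return sum(
        (cell[(r, k)] - row[r] * col[k] / n) ** 2 / (row[r] * col[k] / n)
        for r in row for k in col
    )

def best_split(cases, predictors, outcome):
    """Pick the predictor most strongly associated with the outcome
    (largest chi-square; equivalent to lowest p-value when df are equal)."""
    return max(predictors, key=lambda p: chi_square(cases, p, outcome))

# Toy data: age separates low/high users perfectly, gender does not.
cases = (
    [{"age": "young", "gender": "f", "use": "high"}] * 8
    + [{"age": "young", "gender": "m", "use": "high"}] * 8
    + [{"age": "old", "gender": "f", "use": "low"}] * 8
    + [{"age": "old", "gender": "m", "use": "low"}] * 8
)
print(best_split(cases, ["age", "gender"], "use"))  # → age
```

Here the age-by-use table yields a chi-square of 32 while gender yields 0, so the sketch splits on age, mirroring the first split in the trees discussed below.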
Decision trees have been widely used in database marketing research (Chaturvedi and Green 1995: 245; Magidson 1994; and Rao and Steckel 1995) and in clinical science (Barrio et al. 2006: 595; Boscarino et al. 2003: 303; and Saltini et al. 2004: 737) for performing classification or segmentation. However, the use of decision trees in linguistics
is rare (but cf. Heylen 2005; Robinson 2012a; Schmid 2010).
7.1 Decision tree of Cluster 1
I run the decision tree analysis in order to verify the importance of socio-demographic factors (summarised in Table 5) in predicting high use of Cluster 1. More specifically, this analysis shows whether there are any significant socio-demographic groups (e.g. age) or subgroups (e.g. age by gender) that use the senses in Cluster 1. The output of the analysis is presented in the form of a decision tree; the tree for the multivariate analysis of Cluster 1 is given in Figure 6.
The output presents several levels of significant splits (here, two levels). Each split is based on the rule of the lowest p-value: if two splits are significant, the decision tree follows the split on the independent variable for which the p-value is the lowest.
In the case of a tie, the rule with the higher chi-square value is listed first. In the
case of another tie, the rule with lower degrees of freedom (hereafter, df) is listed first.
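The split-selection ordering just described (lowest p-value first, then higher chi-square, then lower df) maps directly onto a lexicographic sort key; a small sketch with invented candidate splits:

```python
# Candidate splits as (variable, p-value, chi-square, df) tuples (invented values).
candidates = [
    ("NSEC",     0.0412, 6.07, 1),
    ("AGEGROUP", 0.0000, 45.91, 3),
    ("GENDER",   0.0412, 6.54, 1),   # ties NSEC on p-value; higher chi-square wins
]

# Lowest p-value first; on a tie, higher chi-square; on another tie, lower df.
ranked = sorted(candidates, key=lambda s: (s[1], -s[2], s[3]))
print([s[0] for s in ranked])  # → ['AGEGROUP', 'GENDER', 'NSEC']
```

Negating the chi-square value inside the key is what turns "higher chi-square first" into an ascending sort alongside the other two criteria.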
Both chi-square and df values are displayed in the tree for each split. In the decision
tree in Figure 6, the square at the top (Node 0) represents the characteristics of a
variable to be analysed. There are two categories in this variable: low uses of Cluster 1
and high uses of Cluster 1. There are 39 cases of the former and 33 cases of the latter,
accounting for, respectively, 54.17% and 45.83% of the variable. These frequencies are
visually presented in the form of bars at the bottom of each square (node). Low use of
the cluster is represented by a darker shade of grey, whereas a high use of the cluster
is represented by a lighter shade of grey.
Quantifying polysemy in Cognitive Sociolinguistics 107
[Figure 6 diagram: WARDCluster1 (binned). Node 0 (n = 72; low 54.17%, high 45.83%) splits by AGEGROUP (adj. p-value = 0.000, χ² = 45.9182, df = 3) into <up to 18>, <19–30>, <31–60> and <over 60>; the <31–60> node splits further by NSEC (adj. p-value = 0.0412, χ² = 6.0710, df = 1) into NSEC <3; 2> and NSEC <1>.]
Figure 6. Decision tree of Cluster 1
The first significant split takes into consideration the age group to which speakers belong. The statistics for the significance of this split are described just above the split. The statistics summary indicates that age group is the most significant predictor of using Cluster 1 (p < .001, χ² = 45.91, df = 3). Speakers who use this cluster most frequently belong to the two youngest generations; age group <up to 18> exhibits high
use in 94.44% of cases and age group <19–30> does so in 61.11% of cases. The results
are also presented graphically: light grey bars at the bottom of Nodes 1 and 2 (squares
in Figure 6 representing age groups <up to 18> and <19–30>, respectively) indicate
the high frequency of the category representing high usage of Cluster 1. Nodes 3 and 4 (representing age groups <31–60> and <over 60>, respectively) indicate the high proportion of speakers who exhibit a low usage of the senses grouped in Cluster 1. This is represented graphically by the dark grey bars at the bottom of Nodes 3 and 4 (Figure 6).
The second significant split is based on the NSEC of speakers (p = .0412, χ² = 6.07, df = 1). The multivariate analysis combined NSEC2 and NSEC3 (medium and lower occupations) and separated them from NSEC1 (higher occupations). All speakers who are <31–60> years old and occupy higher professional positions exhibit low usage of Cluster 1 (100% of low-usage responses).
The risk estimate for this decision tree is 0.18, which indicates that if I use the
decision rule based on the current decision tree I correctly classify 82% (100% minus
18%) of cases (the calculations of risk are not presented on the decision tree).
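The risk estimate can be reproduced from the leaf-node counts visible in Figure 6 (a sketch for illustration; each terminal node is given as its low-use and high-use counts, and every case is classified by its leaf's majority category):

```python
def risk_estimate(leaves):
    """Resubstitution risk: the minority count of each leaf is the number
    of cases that the majority-category decision rule misclassifies there."""
    total = sum(low + high for low, high in leaves)
    errors = sum(min(low, high) for low, high in leaves)
    return errors / total

# Terminal nodes of the Cluster 1 tree as (low use, high use) counts.
leaves = [(1, 17), (7, 11), (18, 0), (6, 5), (7, 0)]
print(round(risk_estimate(leaves), 2))  # → 0.18, i.e. 82% correctly classified
```

Summing the minority counts (1 + 7 + 0 + 5 + 0 = 13 of 72 cases) gives 0.18, matching the risk estimate reported for this tree.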
The multivariate analysis via Answer Tree 3.0 externally validates the use of Cluster 1, showing that the age of participants and, in the case of middle-aged speakers, their occupation predict high use of the senses grouped in Cluster 1.
7.2 Decision tree analysis of Cluster 2
The overall effect of socio-demographic factors in modelling high use of Cluster 2 is
established using logistic regression analysis. Decision tree analysis is run in order to verify whether any further insights can be obtained into the use of Cluster 2 in relation to socio-demographic dimensions, especially in determining significant subgroups of use.
Figure 7 illustrates the relative importance of socio-demographic factors in predicting the use of Cluster 2 meanings. The most important factor in predicting usage
is the age of participants (p = .0002, χ² = 20.8, df = 2). Speakers of age <19–60> are
grouped together as the highest users of Cluster 2. Moreover, multivariate analysis
shows that in every age group, speakers living in the most affluent neighbourhoods
(Postcode 3, i.e. above £142,795) or holding professional positions (NSEC1) most frequently exhibit high use of the meanings in the cluster (p < .05). Additionally, there
is a distinction at the level of gender (p = .01) in the youngest age group of speakers
living in the most affluent areas. In this group males are all ‘high users’ of Cluster 2
whereas females are all ‘low users’ of that cluster. The risk estimate indicates that 85% of cases can be correctly classified when applying the decision rule based on the current decision tree.
To conclude, decision tree analysis validates the findings of logistic regression
and provides additional evidence for the external stability of Cluster 2.
7.3 Decision tree of Cluster 3
Decision tree analysis is run in order to assess the relative importance of socio-demographic factors in predicting high use of Cluster 3 (see Figure 8).
The multivariate statistical analysis shows that age group is the most significant
predictor of using Cluster 3 (p < .001, χ² = 44.43, df = 2). Speakers who use this cluster
most frequently belong to the two oldest generations. All of the speakers <over 60> are
high users of the cluster. Speakers of age group <31–60> use this cluster in 77.78% of
Quantifying polysemy in Cognitive Sociolinguistics 109
[Figure 7 diagram: WARDCluster2 (binned). Node 0 (n = 72; low user 54.17%, high user 45.83%) splits by AGEGROUP (adj. p-value = 0.002, χ² = 20.6041, df = 2) into <up to 18>, <19–30; 31–60> and <over 60>; lower-level splits involve POSTCODE (adj. p-value = 0.0466, χ² = 5.8557, df = 1), NSEC (adj. p-values = 0.0153 and 0.0020; χ² = 7.8391 and 11.6157, df = 1) and GENDER (adj. p-value = 0.0105, χ² = 6.5399, df = 1).]
Figure 7. Decision tree of Cluster 2
[Figure 8 diagram: WARDCluster3 (binned). Node 0 (n = 72; low 45.83%, high 54.17%) splits by AGEGROUP (adj. p-value = 0.0000, χ² = 44.4320, df = 2) into <up to 18; 19–30>, <31–60> and <over 60>; the <31–60> node splits further by POSTCODE (adj. p-value = 0.0200, χ² = 7.3649, df = 1).]
Figure 8. Decision tree of Cluster 3
cases, especially when they live in the highest and lowest postcode bands (p = .02, χ² = 7.36, df = 1). The risk estimate indicates that 87.5% of cases can be correctly classified when applying the decision rule based on the current decision tree.
Decision tree analysis confirms the hypothesis and provides evidence for the external stability of Cluster 3. Overall, the external validity of the cluster solution has
been confirmed. The use of each of the three clusters can be predicted from the language of speakers who differ in socio-demographic terms.
8. Summary and discussion of results
Having carried out HAC on the usage data, I find that each of the three main clusters can be most satisfactorily predicted from the speech of a different generation (see the summary in Table 8). The use of Cluster 1 (innovative speech) is best
predicted from the speech of the youngest speakers, the use of Cluster 3 (historically older senses) is best predicted from the speech of older speakers, and the use of
Table 8. Summary of the exploratory and confirmatory analyses of the use of polysemous adjectives

           Exploratory analysis  Confirmatory analysis
Cluster 1  More recent senses    Younger generations
Cluster 2  Middle senses         Middle generations, especially NSEC1
Cluster 3  Oldest senses         Older generations
Cluster 2 (historically neither old nor recent senses) from the speech of middle age
groups. Additionally, the results of the statistical analysis of Cluster 2 show that speakers in professional occupations are mostly ‘high users’ of the senses grouped here.
The HAC reveals that sociolinguistically meaningful semantic usage patterns
emerge when usage evidence from several polysemous words is considered. It becomes apparent that the use of a selected group of senses can be most typical for a
socio-demographically defined group of speakers. In other words, there are speakers
for whom the same senses (e.g. fit ‘attractive’, gay ‘lame’, and wicked ‘good’) are the
most salient readings of polysemous categories (fit, gay, and wicked). This finding
suggests that not only individual words, such as awesome, but whole groups of polysemous adjectives currently undergoing semantic change form usage patterns that
can be explained by a very similar sociolinguistic distribution. This study validates
Robinson (2010a) by providing further evidence for the social grounding of polysemous conceptualisations and suggests that employing a socio-cognitive perspective in
linguistic research is clearly advantageous. This study also showcases the benefits of
engaging various statistical techniques to explore lexical meaning.14
References
Aldenderfer, M. S., & Blashfield, R. (1984). Cluster analysis. Newbury Park: Sage Publications.
Allan, K. & Robinson, J. A. (Eds.). (2012). Current methods in historical semantics. Berlin &
Boston: Mouton de Gruyter.
Barrio, G., De La Fuente, L., Toro, C., Vicente, T. M., Vallejo, F., & Silva, T. (2006). Prevalence of
HIV infection among young adult injecting and non-injecting heroin users in Spain in the
era of harm reduction programmes: Gender differences and other related factors. Epidemiology and Infection, 135(4), 592–603. DOI: 10.1017/S0950268806007266
14. Although in this chapter I carry out statistical analyses by employing software packages
such as ClustanGraphics 7.05, SPSS 18, and Answer Tree 3.0, the same analysis can also be performed with the help of R. The cluster analysis can also be performed with more recent versions
of SPSS.
Beitel, D. A., Gibbs, R., & Sanders, P. (2001). The embodied approach to the polysemy of the
spatial preposition on. In H. Cuyckens, & B. Zawada (Eds.), Polysemy in Cognitive Linguistics (pp. 241–260). Amsterdam: Benjamins.
Benki, J. R. (1998). Evidence for phonological categories from speech perception. Unpublished
PhD dissertation, University of Massachusetts.
Beretta, A., Fiorentino, R., & Poeppel, D. (2005). The effects of homonymy and polysemy on
lexical access: An MEG study. Cognitive Brain Research, 24(1), 57–65.
DOI: 10.1016/j.cogbrainres.2004.12.006
Blank, A. (2003). Polysemy in the lexicon and in discourse. In B. Nerlich, Z. Todd,
V. Herman, & D. D. Clarke (Eds.), Polysemy: Flexible patterns of meaning in mind and
language (pp. 267–293). Berlin & New York: Mouton de Gruyter.
Boscarino, J. A., Galea, S., Ahern, J., Resnick, H., & Vlahov, D. (2003). Psychiatric medication use among Manhattan residents following the World Trade Center disaster. Journal of
Traumatic Stress, 16(3), 301–306. DOI: 10.1023/A:1023708410513
Brace, N., Kemp, R., & Snelgar, R. (2006). SPSS for psychologists, 3rd edition. New York:
Palgrave Macmillan.
Breckenridge, J. N. (2000). Validating cluster analysis: Consistent replication and symmetry.
Multivariate Behavioural Research, 35(2), 261–286. DOI: 10.1207/S15327906MBR3502_5
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, R. H. (2007). Predicting the dative alternation. In
G. Boume, I. Kraemer, & J. Zwarts (Eds.), Cognitive foundations of interpretation (pp. 69–
94). Amsterdam: Royal Netherlands Academy of Science.
The British National Corpus, Version 3 (BNC XML Edition). Accessed via http://www.sketchengine.co.uk [March 2008].
Brugman, C. (1981). Story of over. MA thesis, University of California, Berkeley.
Bucholtz, M. (2011). White kids: Language, race and styles of youth identity. Cambridge:
Cambridge University Press.
Chambers, J. K., Trudgill, P., & Schilling-Estes, N. (2002). The handbook of language variation
and change. Oxford: Blackwell.
Chaturvedi, A., & Green, P. E. (1995). Software review: SPSS for Windows, CHAID 6.0. Journal
of Marketing Research, 32, 245–254. DOI: 10.2307/3152056
Clatworthy, J., Buick, D., Hankins, M., Weinman, J., & Horne, R. (2005). The use and reporting
of cluster analysis in health psychology: A review. British Journal of Health Psychology,
10(3), 329–358. DOI: 10.1348/135910705X25697
Coulmas, F. (Ed.). (1997). The handbook of sociolinguistics. Oxford: Blackwell.
Cuyckens, H., & Zawada, B. (Eds.). (2001). Polysemy in Cognitive Linguistics. Amsterdam &
Philadelphia: John Benjamins. DOI: 10.1075/cilt.177
Divjak, D., & Gries, St. Th. (2006). Ways of trying in Russian: Clustering behavioral profiles.
Corpus Linguistics and Linguistic Theory, 2(1), 23–60. DOI: 10.1515/CLLT.2006.002
Duda, R., Hart, P., & Stork, D. (2001). Pattern classification. New York: Wiley Interscience.
Dunbar, G. (2001). Towards a cognitive analysis of polysemy, ambiguity, and vagueness. Cognitive Linguistics, 12(1), 1–14. DOI: 10.1515/cogl.12.1.1
Eckert, P. (2000). Linguistic variation as social practice. Oxford: Blackwell.
Einspruch, E. L. (2005). An introductory guide to SPSS® for Windows®, 2nd edition. Thousand
Oaks, London & New Delhi: Sage Publications.
Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster analysis. London: Arnold.
Field, A. (2005). Discovering statistics using SPSS. London: Sage.
Fought, C. (2004). Sociolinguistic variation: Critical reflections. Oxford: Oxford University Press.
Geeraerts, D. (1989). Prospects and problems of prototype theory. Linguistics, 27(4), 587–612.
DOI: 10.1515/ling.1989.27.4.587
Geeraerts, D. (1993). Vagueness’s puzzles, polysemy’s vagaries. Cognitive Linguistics, 4(3), 223–
272. DOI: 10.1515/cogl.1993.4.3.223
Geeraerts, D. (1997). Diachronic prototype semantics: A contribution to historical-lexicology.
Oxford: Clarendon Press.
Geeraerts, D., Grondelaers, S., & Bakema, P. (1994). The structure of lexical variation. Meaning,
naming, and context. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110873061
Geeraerts, D., & Cuyckens, H. (Eds.). (2007). Oxford handbook of Cognitive Linguistics. Oxford:
Oxford University Press.
Geeraerts, D., Kristiansen, G., & Peirsman, Y. (Eds.). (2010). Advances in Cognitive Sociolinguistics. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226461
Gries, St. Th. (2006). Corpus-based methods and Cognitive Semantics: The many meanings
of to run. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics:
Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110197709
Gries, St. Th. (2007). Cluster analysis: A practical introduction with R (for Windows). Paper
presented at Departmental Research Seminar, University of Sheffield.
Gries, St. Th., & Stefanowitsch, A. (2006). Cluster analysis and the identification of collexeme
classes. In S. Rice, & J. Newman (Eds.), Empirical and experimental methods in cognitive/
functional research. Stanford: CSLI.
Hanks, P. (2000). Do word meanings exist? Computers and the Humanities, 34(1–2), 205–215.
DOI: 10.1023/A:1002471322828
Heylen, K. (2005). A quantitative corpus study of German word order variation. In S. Kepser,
& M. Reis (Eds.), Linguistic evidence: Empirical, theoretical and computational perspectives
(pp. 241–264). Berlin: Mouton de Gruyter. DOI: 10.1515/9783110197549.241
Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. New York: Wiley.
Janda, L. (1990). The radial network of a grammatical category – its genesis and dynamic structure. Cognitive Linguistics, 1(3), 269–288. DOI: 10.1515/cogl.1990.1.3.269
Kallel, A. (2007). The loss of negative concord in Standard English: Internal factors. Language
Variation and Change, 19(1), 27–49. DOI: 10.1017/S0954394507070019
Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities, 31(2), 91–
113. DOI: 10.1023/A:1000583911091
Klein, D., & Murphy, G. L. (2002). Paper has been my ruin: Conceptual relations of polysemous
senses. Journal of Memory and Language, 47(4), 548–570.
DOI: 10.1016/S0749-596X(02)00020-7
Kleinbaum, D. G. (1994). Logistic regression: A self-learning text. New York: Springer-Verlag.
DOI: 10.1007/978-1-4757-4108-7
Krishnamurthy, R., & Nicholls, D. (2000). Peeling an onion: A lexicographer’s experience of
manual sense-tagging. Computers and the Humanities, 34(1–2), 85–97.
DOI: 10.1023/A:1002407003264
Kristiansen, G., & Dirven, R. (Eds.). (2008). Cognitive Sociolinguistics: Language variation, cultural models, social systems. Berlin: Mouton de Gruyter. DOI: 10.1515/9783110199154
Labov, W. (2001). Principles of linguistic change: Social factors. Oxford: Blackwell.
Lakoff, G. (1987). Women, fire, and dangerous things: What categories reveal about the mind.
Chicago: Chicago University Press. DOI: 10.7208/chicago/9780226471013.001.0001
Lehrer, A. (1990). Polysemy, conventionality, and the structure of the lexicon. Cognitive Linguistics, 1(2), 207–246. DOI: 10.1515/cogl.1990.1.2.207
Lewandowska-Tomaszczyk, B. (2007). Polysemy, prototypes and radial categories. In
D. Geeraerts, & H. Cuyckens (Eds.), Oxford handbook of Cognitive Linguistics (pp. 139–
169). Oxford: Oxford University Press.
Magidson, J. (1994). The CHAID approach to segmentation modelling. In R. P. Bagozzi (Ed.),
Advanced methods of marketing research (pp. 118–159). Cambridge, MA.: Blackwell.
Milligan, G. W., & Cooper, M. C. (1986). A study of comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioural Research, 21(4), 41–58.
DOI: 10.1207/s15327906mbr2104_5
Milroy, L. (1980). Language and social networks. Oxford: Blackwell.
Milroy, L. (1987). Observing and analysing natural language: A critical account of sociolinguistic
method. Oxford & New York: Blackwell.
Moisl, H. L. (2009). Exploratory multivariate analysis. In A. Lüdeling, & M. Kytö (Eds.), Corpus
linguistics: An international handbook (pp. 874–898). Berlin: Mouton de Gruyter.
DOI: 10.1515/9783110213881.2.874
Moisl, H. L., & Jones, V. (2005). Cluster analysis of the Newcastle Electronic Corpus of Tyneside
English: A comparison of methods. Literary and Linguistic Computing, 20(1), 125–146.
DOI: 10.1093/llc/fqi026
Mojena, R. (1977). Hierarchical grouping methods and stopping rules: An evaluation. Computer Journal, 20(4), 359–363. DOI: 10.1093/comjnl/20.4.359
Moss, M., Wellman, D. A., & Cotsonis, G. A. (2003). An appraisal of multivariable logistic models in the pulmonary and critical care literature. Chest, 123(3), 923–928.
DOI: 10.1378/chest.123.3.923
Nerlich, B., Todd, Z., & Clarke, D. D. (2003). The acquisition of get between four and ten years.
In B. Nerlich, Z. Todd, V. Herman, & D. Clarke (Eds.), Polysemy: Flexible patterns of meaning in mind and language (pp. 333–357). Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110895698
Nerlich, B., Todd Z., Herman, V., & Clarke, D. D. (2003). Polysemy: Flexible patterns of meaning in
mind and language. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110895698
Norušis, M. J. (1999). SPSS regression models 10.0. Chicago: SPSS Inc.
Office for National Statistics. (2000). Standard Occupational Classification. Volume 2: Coding Index. London: The Stationery Office.
The Oxford English Corpus. Accessed via http://www.sketchengine.co.uk [March 2008].
Oxford English Dictionary Online. Oxford University Press. Accessed via http://www.oed.com
[March 2008].
Peng, C.-Y. J., Lee, K. L., & Ingersoll, G. M. (2002). An introduction to logistic regression analysis and reporting. Journal of Educational Research, 96(1), 3–14.
DOI: 10.1080/00220670209598786
Pütz, M., Robinson, J. A., & Reif, M. (Eds.). (2012a). Cognitive Sociolinguistics: Variation in
cognition and language use. Special issue of Review of Cognitive Linguistics, 10(2).
DOI: 10.1075/rcl.10.2.01int
Pütz, M., Robinson, J. A., & Reif, M. (2012b). The emergence of Cognitive Sociolinguistics: An
introduction. Annual Review of Cognitive Linguistics, 10(2), 241–263.
DOI: 10.1075/rcl.10.2.01int
Pütz, M., Robinson, J. A., & Reif, M. (Eds.). (2014). Cognitive Sociolinguistics. Social and cultural variation in cognition and language use. Benjamins Current Topics, 59. Amsterdam/
Philadelphia: John Benjamins.
Rakova, M., Pethő, G., & Rákosi, C. (Eds.). (2007). The cognitive basis of polysemy: New sources
of evidence for theories of word meaning. Frankfurt/Main: Peter Lang.
Rao, V. R., & Steckel, J. H. (1995). Selecting, evaluating, and updating prospects in direct mail
marketing. Journal of Direct Marketing, 9(20), 20–31. DOI: 10.1002/dir.4000090205
Ravin, Y., & Leacock, C. (Eds.). (2000). Polysemy: Theoretical and computational approaches.
Oxford: Oxford University Press.
Reif, M., Robinson, J. A., & Pütz, M. (Eds.). (2013). Variation in language and language use:
Sociolinguistic, socio-cultural and cognitive perspectives. Frankfurt/Main: Peter Lang.
Robinson, J. A. (2010a). Awesome insights into semantic variation. In D. Geeraerts,
G. Kristiansen, & Y. Peirsman (Eds.), Advances in Cognitive Sociolinguistics (pp. 85–109).
Berlin & New York: Mouton de Gruyter.
Robinson, J. A. (2010b). Semantic variation and change in present-day English. Unpublished
PhD dissertation, University of Sheffield. Available via http://etheses.whiterose.ac.uk/2232/
Robinson, J. A. (2012a). A sociolinguistic perspective on semantic change. In K. Allan, & J. A.
Robinson (Eds.), Current methods in historical semantics (pp. 191–231). Berlin & Boston:
Mouton de Gruyter.
Robinson, J. A. (2012b). A gay paper: Why should sociolinguistics bother with semantics? English Today, 28(4), 38–54. DOI: 10.1017/S0266078412000399
Saltini, A., Mazzi, M. A., Del Piccolo, L., & Zimmermann, C. (2004). Decisional strategies for
the attribution of emotional distress in primary care. Psychological Medicine, 34(4), 729–
739. DOI: 10.1017/S0033291703001260
Schmid, H. J. (2010). Decision trees. In A. Clark, C. Fox, & S. Lappin (Eds.), The handbook
of computational linguistics and natural language processing (pp. 180–196). Oxford:
Blackwell. DOI: 10.1002/9781444324044.ch7
SPSS (2005). Answer tree algorithm summary; SPSS white paper series.
Sweetser, E. (1990). From etymology to pragmatics: Metaphorical and cultural aspects of semantic structure. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511620904
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics. Boston: Allyn and Bacon.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Pearson
Addison-Wesley.
Taylor, J. R. (1995). Linguistic categorization: Prototypes in linguistic theory. Oxford: Oxford
University Press.
Tummers, J., Speelman, D., & Geeraerts, D. (2004). Quantifying semantic effects: The impact
of lexical collocations on the inflectional variation of Dutch attributive adjectives. In
G. Purnelle, C. Fairon, & A. Dister (Eds.), Le poids des Mots: Actes des 7emes journées internationales d’analyse statistique des données textuelles (pp. 1079–1088). Louvain-la-Neuve:
Presses Universitaires de Louvain.
Vanhove, M. (Ed.). (2008). From polysemy to semantic change: Towards a typology of lexical
semantic associations. Amsterdam: John Benjamins. DOI: 10.1075/slcs.106
The many uses of run
Corpus methods and Socio-Cognitive Semantics
Dylan Glynn
University of Paris VIII
Multifactorial usage-feature analysis (profile-based approach) has been successfully applied to polysemy research (Gries 2006; Glynn 2009, 2010). This chapter
represents a repeat analysis of Gries (2006). The study has three aims: (i) to
verify the results of the previous study; (ii) to identify limitations in the application of the statistical technique employed (hierarchical cluster analysis) in the
previous study; and (iii) to demonstrate the need to account for sociolinguistic
dimensions in polysemy research. The study is based on a sample of 500 occurrences of the lexeme to run, extracted in even proportions from British English
and American English and from online personal journals (blogs) and conversations (American National Corpus and British National Corpus).
Keywords: cluster analysis, Cognitive Semantics, corpus linguistics, polysemy,
multifactorial usage-feature analysis, sociolinguistics
1. Introduction
Gries’ (2006) study ‘Corpus-based methods and cognitive semantics: The many meanings of to run’ counts amongst the most influential contributions to the description of
polysemy in Cognitive Linguistics. It is important not because its methodology is
original, nor because it is complete or extensive, nor even because of the theoretical
claims it makes, but because it simply and overtly shows how corpus-driven methods
can be applied to the study of polysemy. Its contribution is the combination of theory
and method: two pieces of a puzzle that establish the foundations of a theoretically
and empirically coherent approach to the description of semasiological structure.
The current study does not challenge the theory, the method, or the results of
Gries (2006), but seeks to refine each. The chapter divides into two parts. The first
part considers three theoretical issues. Firstly, how can current corpus-driven methods of semantic analysis inform the description of prototype effects on conceptual
structure? Secondly, how is the notion of lexical sense operationalised in multifactorial usage-feature analysis? Thirdly, it is argued that the study of prototype structured
lexical senses must also integrate the social dimensions of language for descriptive
adequacy.
The second part of the chapter takes the form of a case study that repeats the
semasiological analysis of run presented in Gries (2006). This case study treats three
issues. Firstly, Gries’ results are largely confirmed, but it is shown that cluster analysis,
the statistical method employed, produces unstable representations. It is argued that
its use needs further research before it can be considered reliable. Secondly, correspondence analysis, an alternative statistical technique, is introduced. By adding a
sociolinguistic dimension to the analysis, it is shown that this statistical technique is
capable of representing a more complex semasiological structure. Thirdly, with the
use of confirmatory statistical modelling, it is demonstrated that in order to obtain descriptive adequacy, semasiological analysis must account for sociolinguistic structure.
2. Usage-based Cognitive Semantics
2.1 Corpus-driven radial network analysis
How can quantitative corpus-driven analysis inform our understanding of polysemy
and prototype structures in lexical semantics? In order to answer this question, we
need to identify the aims of the endeavour.
Lakoff (1990) presented the commitment to empiricism and inductive research as
the gold standard of Cognitive Semantics. Radial network analysis (Lakoff 1987) and
Frame Semantics (Fillmore 1985) are analytical models designed to offer an empirical
means for describing meaning structure, assuming both prototype effects and encyclopaedic semantics.1 The research of the era demonstrated that:
i. linguistic semantics, in the strict sense, cannot adequately account for meaning structure in language – instead, it demonstrated the need for ‘encyclopaedic
semantics’;
ii. necessary and sufficient conditions cannot adequately determine socio-conceptual categories – instead, it demonstrated the need for ‘prototype effects’.
Although prototype category theory and the radial network analysis that employs it (Lakoff 1987) represented a necessary and substantial step toward an empirical Cognitive Semantics, it did not quite attain that goal. Theoretically and empirically, the shortcomings are well established (Geeraerts 1993; Sandra and Rice 1995). Methodologically, radial network analysis often continued the Structuralist and Generativist tradition:
i. analytically, radial network analysis assumed this structure to take the form of discrete senses or ‘nodes’;
ii. empirically, radial network analysis employed introspection to determine semasiological structure.

1. Lakoff (1987) was an early protagonist of both the theory of prototype categorization and the model of radial network analysis. Prototype theory was developed and refined by Geeraerts (1989, 1993, 1997, 2000), Taylor (1989), and Kleiber (1990). Radial network analysis was developed and formalised by, especially, Rudzka-Ostyn (1989), Cuyckens (1993), and Janda (1993).

The many uses of run 119
Seen from this perspective, the Cognitive Semantics era of radial network analysis
made the first important steps towards empiricism, but ultimately fell short of the
mark. We will now briefly consider the shortcomings of radial network analysis and
then move on to consider how corpus-driven methods, especially the usage-feature (or profile-based) method (Geeraerts et al. 1994; Gries 2003), resolve these
shortcomings.
There is no need to re-cover well-trodden ground. Let us assume the theoretical
models of encyclopaedic semantics (Fillmore 1985; Lakoff 1987) and prototype categorisation (Rosch 1975; Lakoff 1987). The first replaces truth-conditional semantics
with world knowledge. The second replaces necessary and sufficient conditions with
prototype structure. These two criteria for semasiological structure accepted, we can
focus on the two problems: (i) the analytical assumption of discrete senses and (ii) the
methodological technique of introspection.
Firstly, the assumption of discrete senses is intuitively attractive. Indeed, just as it seems obvious that the world is flat, it would seem obvious that words have meanings and that we choose between those meanings in communication. Understood in these terms, senses are reified as discrete units. This naïve operationalisation may aid in language learning and dictionary writing and, typically, only comes amiss in inter-personal disputes. However, the evidence for discrete lexical senses is as naïvely
sound as the horizon is evidence for the flatness of the world.
At a broader conceptual level, Fuzzy Set Theory, a distinct yet related part of
Prototype Set Theory, disposes of the notion of discrete conceptual categories. If we
do not suppose that the concept associated with a lexeme is discrete, why then do
we continue to assume that the sub-categories of lexical senses are discrete? Lakoff’s
(1987) study of over identifies a list of usage-features in terms of minimal perceptual
distinctions expressed as image schemata. Yet rather than seeing meaning construction as the relative correlation of schema features, Lakoff continues to hunt for ‘senses’
as reified configurations of those schema features. Indeed, there appears no reason to
assume the existence of discrete lexical senses and there is a growing body of research
that refutes them (Geeraerts 1993; Kilgarriff 1997; Zlatev 2003; Glynn 2010 inter
alia). Let us assume, therefore, that discrete senses are a useful heuristic in discussing
semasiological structure in lexicography, but let us not assume that reified senses actually exist.
Secondly, radial network analysis employed introspective methodology. Although introspection has an essential and inarguable role in language research, both
for proposing hypotheses and performing analyses, with no truth-conditional tests
to help determine conceptual structure, it will only ever be one part of an empirical
science. Tyler and Evans (2001) have attempted to develop a ‘principled approach’
to identifying semasiological structure. This goes a long way towards minimising
the risk of ad hoc categorisation using introspection. It does not, however, offer the
possibility of result falsification. It is this second point that is essential. According to
their own models of language, Generativists and Structuralists both had means for
falsification using introspection. First, a proposed structure could be falsified by the
intuition of a native speaker, whose linguistic knowledge was thought to exactly represent the grammar of a language. Second, truth-conditional semantic tests could be
‘failed’, thus establishing the lack of membership of a discrete category. However, if we
assume a usage-based model of language, then neither of these possibilities is open
to us, making introspection severely limited in terms of its ability to test hypotheses
or falsify results.
Accusing the tradition of radial network analysis of relying solely on introspection and of assuming the existence of discrete senses is, perhaps, unfair. Lakoff (1987)
is careful in his wording to avoid the issue and Geeraerts (1989, 1995) explicitly develops a representational format that permits the description of polysemy without employing ‘nodes’. Moreover, there exist both corpus-driven and experimental studies in
the tradition (cf. Glynn, this volume 7–38). Indeed, the corpus-driven usage-feature
approach propounded by Gries (2006) goes back to the very origins of Cognitive Semantics (Dirven et al. 1982), just as elicitation-driven usage-feature analysis (Lehrer
1982) also finds itself at the origins of the theoretical paradigm. Moreover, as early as
Geeraerts (1993), Geeraerts et al. (1994), and Lehrer and Lehrer (1994), in both theoretical and empirical terms, the argument for a non-reified approach to lexical senses
was put forward. Therefore, even if it did not represent the main drive of research,
Cognitive Semantics can be argued to have been slowly moving towards empirical
methods and some in the field have long held that meanings cannot be understood
as reified objects.
Seen in this light, Gries (2006) is but one empirical Cognitive Semantic study in a long history. Its step was to explicitly apply corpus-driven usage-feature analysis and multivariate statistics to the question of prototype-structured polysemy. This step, interpreting
multivariate results in terms of prototype effects, is important. Empirical methods and
non-reified senses may be theoretically sound, but with no way of modelling non-discrete results or coherently representing the structuring of language, the application
of the method will struggle to gain ground. Therefore, a corpus-driven usage-feature
methodology will build on the radial network tradition by fulfilling four aims:
a. identifying encyclopaedic semantic structure;
b. identifying prototype effects in that conceptual structure;
c. positing non-discrete lexical sense;
d. positing results that can be empirically falsified.

2.2 Operationalising prototype structured non-discrete lexical senses
It is one thing to demonstrate that lexical senses are not discrete in nature, it is another question altogether to develop a rigorous means for identifying and describing
non-discrete semasiological structures. Multifactorial usage-feature analysis achieves
this while simultaneously and empirically accounting for prototype effects in the
structure. This section explains how the method represents non-discrete senses but
also how it offers corpus evidence for the prototype structuring of those senses.
Is it possible to make claims about the prototypicality of conceptual structure
with corpus data? Prototype structure is an analytical model; it is not an object of
study. It can be used to explain different structures in language, depending on how
it is operationalised. Gries (2006: 75) offers a “non-exhaustive” list of different operationalisations of the notion of prototype structured polysemy: intuition-determined
judgements of similarity and ‘goodness’; elicitation ease; diachronic evidence; centrality/predominance in a radial network, and so forth.
Geeraerts (1987: 288) argues that there are two basic operationalisations of prototypicality. He terms these the analytic and introspective criteria. Although his debate
was with the proponents of truth-conditional semantics, we can rephrase this, mutatis
mutandis, as frequency-based versus salience-based prototypicality. There are many
different approaches to prototype structure, but it is likely that they will all be based
on one of these two operationalisations: perceptual – conceptual ‘prominence’ versus
relative frequency ‘commonness’.
For the semasiological variation of a term such as run, we can suppose there
would be little debate that ‘fast pedestrian motion’ is the prototypical ‘sense’. From
synchronic frequency of use, diachronic evidence of earliest uses, and intuition-based
conceptual salience, to widely accepted theories of embodiment and primacy of perception, all evidence points unanimously to ‘fast pedestrian motion’ as the ‘central’
meaning. However, as Geeraerts (1987) shows, theoretically, there is no reason a priori to assume that prototype models using one or the other operationalisation would
offer the same results. Of course, this is not to say they will not. For an example as
conceptually basic as run, it is likely they will and this is why, in Gries’ (2006: 76) comparison of different methods, each method indicates the same prototype structure.
However, if we are developing a methodology for identifying semantic structure, it is
important we do not make the assumption that these different methods should necessarily offer convergent results.
Schmid (2000) and Gries (2003) have both made claims about the relationship
between frequency and conceptual structure. These claims have yet to be confirmed
empirically and the authors appear to have distanced themselves from their earlier position (Schmid 2010; Gries p.c.). Although frequency of occurrence surely has
an important role in determining conceptual structuring, conceptual and perceptual
salience are also likely to have an impact. It is, therefore, unlikely that there is a one-to-one index where more frequent equates to more central. Arppe and Järvikivi (2007),
Arppe et al. (2009), Tribushinina (2009) and Gilquin (2010) are recent examples of
research seeking to understand how these two fundamentally different operationalisations of prototypicality interact. Eventually, we may understand how their interaction impacts upon language structure and learning, but for the moment, this has not
been determined.
Having established the fact that we are restricting the notion of prototypicality
to one based on frequency, let us turn to the non-discrete senses that we attempt to
structure in these prototype terms. It is important to understand that the two senses
that Gries identifies as the most frequent, and therefore (proto)typical, were predicted
by a set of usage-features (Gries 2006: 85). They were not identified in the data as senses per se. For this reason, it is not, in fact, these two senses, but two ‘configurations of
features’, to use the terminology of Geeraerts et al. (1994), or the ‘behaviour profiles
of ID tags’ to use Gries’ terminology, that are the (proto)typical structures. This is why
we should not speak of the many ‘senses’ of run but of the many ‘uses’ of run, where
‘use’ is understood as a re-occurring configuration of features (or ID-tags). This does
not contradict Gries’ results. On the contrary, it further emphasises their theoretical
and methodological implications.
Given the context of tense (past), transitivity (transitive), complement syntax
(to + infinitive), and agent type (Human), Gries is able to predict, with 100% accuracy, occurrences of the lexeme that would be traditionally defined as the ‘fast pedestrian motion’ sense of run. However, what Gries has identified as a ‘sense’ is merely
a tendency for this configuration of features to occur together. In a given example,
one or two of these features may not be applicable, but, relatively, it would still be an
example of this usage/sense (as opposed to discrete lexical sense). What this gives us
is a non-discrete operationalisation of ‘lexical sense’.
It is important to understand that this is independent from prototypicality. The
fact that this configuration of features is so stable and predictive only means that this
particular sense is relatively discrete. In other words, it has a clear behavioural profile
and its usage pattern can be accurately identified. This is independent from the relative frequency of this configuration, which determines its (proto)typicality. In terms
of frequency, the features of ‘past tense’, ‘transitive’, ‘finite to infinitive syntax’, and
‘human agent’ were also the most frequent. Therefore, we have a non-discrete definition of a sense of run, but also a quantification of its typicality, or frequency-based
prototypicality.
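The operationalisation just described is easy to make concrete. The following Python sketch is a toy illustration only: the occurrences and their annotations are invented, not Gries’ actual data or feature set. It treats each occurrence of run as a configuration of usage-features (ID-tags) and takes the relative frequency of each recurring configuration as its frequency-based (proto)typicality:

```python
from collections import Counter

# Each occurrence of run is annotated for a handful of usage-features
# (ID-tags). The annotations below are invented for illustration.
occurrences = [
    {"tense": "past", "transitivity": "intr", "agent": "human"},
    {"tense": "past", "transitivity": "intr", "agent": "human"},
    {"tense": "pres", "transitivity": "trans", "agent": "human"},
    {"tense": "past", "transitivity": "intr", "agent": "human"},
    {"tense": "pres", "transitivity": "trans", "agent": "organisation"},
]

# A 'use' is a recurring configuration of features, not a reified sense.
configs = Counter(tuple(sorted(occ.items())) for occ in occurrences)

# Frequency-based (proto)typicality: the relative frequency of each
# configuration in the sample.
for config, count in configs.most_common():
    print(dict(config), count / len(occurrences))
```

Nothing in this representation requires senses to be discrete: the most frequent configuration plays the role of the (proto)typical ‘use’, while any individual occurrence may match a given configuration only partially.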
This discussion has sought to show how lexical senses can be operationalised
through usage-feature analysis and how this brings in, seamlessly, the notion of
frequency-based prototypicality. Yet the implications go further than providing an
operational definition of prototypicality and lexical sense. These configurations of usage-features are, in fact, usage-contexts. This brings us to the question of the sociolinguistic dimension of meaning.
2.3 Multidimensional prototype effects: Form, meaning and context
Having established that corpus-driven usage-feature analysis can attain the goals set
out in the era of radial network analysis, we now turn to what is argued to be an essential element of the object of study that is often side-lined. If we accept the usage-based
model of language and the two operationalisations above ((i) sense – the configuration of features and (ii) prototypicality – the frequency of those configurations),
the description of semasiological structure must integrate the social dimensions of
language into its analysis.
To appreciate why this is the case and why it is important, we must return to the
question of prototype structuring. The operationalisation of semasiological structure
in terms of frequency has an inherent limitation. Frequency, as a tool for determining
linguistic structure, must necessarily be treated relatively. It is precisely this mistake that Chomsky makes with the infamous argument that the fact that I live in New York is more common than I live in Dayton, Ohio tells us nothing of language structure. Chomsky (1964: 215)
holds that frequency is external to language structure and will give us information
about the world, or the context, instead of language itself. Although it is possible to
argue that language is a mirror of the world and therefore, at some level, I live in New
York is, in fact, more important in language than I live in Dayton, it would be difficult
to demonstrate this to be the case for the identification of semasiological prototypicality. Instead, we can simply examine frequency relative to context. To understand how this notion applies to polysemy, let us consider the example of Gries’ two frequency-based (proto)typical senses ‘fast pedestrian motion’ and ‘manage’. If the corpus had
been children’s literature or sports magazines, ‘fast pedestrian motion’ would likely be
frequent and ‘manage’ infrequent. By contrast, in the context of the economic news press, the ‘fast pedestrian motion’ sense is likely to be extremely infrequent, especially compared
to the ‘manage’ sense. It should be obvious how essential the notion of context is to
frequency-based studies of meaning.
Are we trying to determine a typical meaning that is true for all language in all contexts? Even if this were possible in principle (especially for a language as diverse as English), it is surely not achievable with any corpus currently available or that will be available in the
foreseeable future. Even taking a single context distinction, spoken versus written,
the largest and most ‘balanced’ corpus in existence is non-representative to an unimaginable degree. This is because, in reality, the amount of spoken language greatly outweighs the quantity of written language, whereas the reverse is currently true of electronic corpora.
In brief, the study of frequency-based prototype effects must be relative to context. We, therefore, must posit (proto)typicality structures, not for an entire language
but for a language context, a specific place and time. We must avoid employing usage-based methods to describe the reductionist notion of the langue of Structuralism
or the ideal speaker competence of Generativism. Our object of study is synchronically and diachronically varied – our models of conceptual structure must be sensitive
to this.
3. Case study: run in America and Britain in diaries and conversation
3.1 Two corpus-based studies on ‘run’
Our current study follows Gries (2006) as closely as possible in the set of usage-features (ID tags) analysed. The aim is not to test the results or to improve upon them
through more advanced statistical analysis or a larger, more diverse sample. The aim
is merely to show that even for a lexeme as culturally ‘simple’ and as socially ‘neutral’
as run, one must account for the social dimension of language in semantic analysis.2
In doing so, we will see why the statistical method he employs faces issues of reliability
and we will introduce a different statistical technique. We begin with a summary of
Gries’ (2006) study.
Gries’ analysis is based on 815 occurrences of the lemma to run, extracted from
the British component of the International Corpus of English and the Brown Corpus
of American English. Approximately 400 occurrences were taken from each. These
occurrences were manually analysed and categorised (using intuition) as belonging
to one of 48 senses. These senses were taken from the Collins Cobuild E-Dictionary,
the Merriam Webster’s American online dictionary, and the WordNet project. This
categorisation in terms of dictionary senses is the first factor of the analysis. Although
it is normally the goal of usage-feature analysis to determine different ‘senses’ through
the identification of ‘feature configurations’ (Geeraerts et al. 1994) or ‘behavioural
profiles’ (in Gries’ terminology), being able to match such configurations against
dictionary definitions is a useful heuristic. In Gries (2006), it is used to show how a frequency-based study can inform an understanding of prototype structure in polysemy.

2. Kudrnáčová (2010) has also followed up Gries’ (2006) study with a more fine-grained corpus-based semantic analysis. Her study is not quantitative, but her corpus-illustrated insights will inform future research. In descriptive terms, the next step is to apply a more detailed usage-feature analysis and begin, not with dictionary senses, but with a range of subtle semantic features. The senses should then be clusterings of those semantic features rather than simply matches between dictionary entries and observed occurrences.
The 815 occurrences in Gries’ data set are analysed for a range of factors, or usage
dimensions. These factors consist of the usage-features typical in this kind of methodology – formal and semantic features ranging from syntax and collocation, tense
and aspect, to the semantics of the argument structure and participants. In this study,
the formal factors include tense, aspect, voice, transitivity, mood, and clause type. The
semantic factors include subject type, object type, and complement type. These ‘type’
features are categories such as human, concrete countable object, concrete mass noun,
machines, abstract entities, organisations, locations, quantities, events, processes, etc.
The dictionary senses found are exemplified and enumerated. The most frequent
dictionary senses identified are that of ‘fast pedestrian motion’ (203 occurrences /
25%, exemplified on p. 63) and ‘manage’ (101 occurrences / 12% exemplified on
p. 71). The analysis and subsequent categorisation of the occurrences as dictionary
senses is systematically explained by example. It is this systematic explanation that is
used in the current study to repeat the analysis and to categorise the occurrences as
dictionary definitions.
The current study is based on 500 occurrences of run, 250 each of British and
American English, subdivided again into 125 examples each from conversation and
online personal diaries. The sample was restricted to this relatively small number due
to practical reasons – usage-feature analysis is laborious and resource consuming.
Since the point of the study is to investigate the need to include sociolinguistic parameters in polysemy research, the improved descriptive accuracy afforded by increasing this number would not substantially strengthen the demonstration.
Also, the methods under investigation must be shown to produce coherent results
with small numbers, since, for the same practical reasons, the usage-feature (or profile-based) method tends to deal with small samples. The British and American diary
examples were taken from the LiveJournal corpus, developed by Dirk Speelman, at
the University of Leuven, and the conversation examples were taken from the British
National Corpus and the American National Corpus. The usage-feature analysis is
replicated using the same dictionary senses employed by Gries and the same range of
formal and semantic usage-features.
An aside should be made here. Despite the fact that Gries more than adequately demonstrates the principle of the method, descriptively, the study is preliminary
(Gries 2006: 81). The obvious question of why one would focus on dictionary senses
(instead of solely usage-features, or ID-tags, to use Gries’ terminology) can be answered by the fact that the study’s aim is to show how prototype structure can be
handled with the method. Nevertheless, in terms of descriptive adequacy, this option
is far from ideal. Moreover, as the author stresses himself, the size of the sample is
too small to properly apply multivariate statistical analysis. It is not that the sample
is small in itself, but the type-token (or perhaps ‘sense-token’) ratio is not acceptable
for multifactorial analysis. Gries repeatedly stresses this point, but it should be added
that this problem is compounded by the fact that the study is not restricted to run,
but includes all the verb particle constructions based on run. Arguably, this makes
the study partially one of near-synonymy instead of polysemy. Many of the senses
identified are determined formally by the combination of the verb and the particle.
Verb particle constructions in Germanic, just like the prefixed verb constructions in
Slavic (see Fabiszak et al., this volume 223–252), challenge the distinction between
synonymy and polysemy. In any case, many of the senses in question are both formally and semantically distinct. A true test of the usage-feature method for the study of
semasiological variation is when that variation is not linked to any overt, or obvious,
formal distinction. By excluding the verb particle construction, Gries’ study would
have included less semantic variation but also less formal variation for ‘automatically’
determining it. This does not detract from the goal or the results of Gries’ study, but
future work should take such questions into account. Note that the current study also
uses dictionary senses as one of its analytical factors and includes the particle constructions. This is done solely to permit a comparison with Gries.
Table 1 lists the most common senses in the current study, compared with the
figures from Gries (2006). The list of senses applied in this study was determined by
the senses submitted to the hierarchical cluster analysis in Gries (2006: 82). For some of these senses, the number of occurrences (supplied in the preceding section, Gries 2006: 63–73) is not known. Although the reasoning behind the categorisation of the
examples as dictionary definitions is reasonably clear, taxonomical issues of hyperonymy in the discussion occasionally mean that the number of occurrences for a given
sense is not stated. This is the case for ‘function’ vs. ‘execute’ and ‘manage’ and for ‘free
motion’ versus ‘motion’ and ‘fast motion’.
The application of Gries’ dictionary senses to our data was reasonably straightforward, using the examples and explanations included in the study. There were, of
course, some classification issues. For example, what constitutes ‘fast’ in ‘fast motion’?
Table 1. Most frequent dictionary senses

Dictionary sense            Current study    Gries (2006)
‘fast pedestrian motion’    160 (32%)        203 (25%)
‘escape’                     57 (11.5%)       32 (4%)
‘motion’                     23 (4.5%)        24 (3%)
‘fast motion’                17 (3.5%)         4 (0.5%)
‘free motion’                17 (3.5%)         –
‘execute’                    18 (3.5%)        28 (3.5%)
‘in charge of’               16 (3%)          24 (3%)
‘manage’                     25 (5%)         101 (12%)
‘function’                   17 (3.5%)         –
‘become used up’             26 (5%)          14 (2%)
The large difference in the number of occurrences on this point suggests that there
may have been a difference in coding for this sense. Nevertheless, assuming there is
bound to be some analytical variation, the results are reasonably comparable. This is
especially true given the small size of the samples and the differences between the corpora. The principal differences are ‘become used up’ and ‘escape’, which are more frequent in this study, and ‘manage’, which is substantially more frequent in Gries’ study.
For this final difference, even if we allow for some confusion over the semantically
similar categories of ‘execute’, ‘in charge of’, ‘function’, and ‘manage’, the difference is
marked. We can suppose that such differences are a result of register. Indeed, this is
precisely the problem with frequency-based studies addressing (proto)typicality. Thematic variation, or variation in ‘topic of discourse’, can have a substantial effect, even
upon coarse-grained analysis of semasiological structure.
There is no need to examine such differences and similarities further. Both samples are small, with a high type-token ratio, which means statistical significance would
tell us little. Remembering that the ultimate point is to show that frequency-based
prototype structures are context dependent, it is sufficient to show that the overall
study is comparable to that of Gries’.
Below is an exemplified list of the most common senses. The examples are all
extracted from the LiveJournal Corpus sample under investigation. For further exemplification and discussion, see Gries (2006: 63–73).
(1) ‘fast pedestrian motion’
I want to like run into a bathroom at school and cry my eyes out whenever i
see him.
(2) ‘escape’
does anyone have about $400 laying around, i think i want to run away to Las
Vegas for a few days, lol.
(3) ‘motion’
Action Cat is really starting to like the new kitty, who I call Buddy cause he
has yet to receive a formal name. They run around and play all the time now
and it’s really cute.
(4) ‘fast motion’
Hang on, till I get the brake on, or you’ll run into the river.
(5) ‘free motion’
we’ve made three different trips … the group of friends that i run around with.
(6) ‘execute’
you know like it’s easier for you to go and run a program you know through
the disk.
(7) ‘in charge of’
there’s er it was for the er cat scanner and it was run by the Co-Op it was, it
was just oh I saw that sign outside.
(8) ‘manage’
I am now the new landlord of the rose and crown pub which mama used to
run.
(9) ‘function’
they said that uh cars would cost two dollars and they would run forever.
(10) ‘become used up’
Well, it doesn’t do so bad. It’s usually cigs we run out of not petrol.
3.2 Semasiological clustering without social dimensions
Gries (2006: 81–82) submits all the senses (minus ‘idiomatic’ ones) to an agglomerative hierarchical cluster analysis (see Divjak 2010; Divjak and Fieller, this volume,
405–442, for an explanation of the technique). The senses are clustered using the
full range of features (Gries 2006: fn. 19, p. 94). The results of the cluster analysis
are reasonably coherent, especially given the number of senses versus the number
of examples and the number of usage-features. There is some degree of intuitively
sound clustering, which could be re-interpreted as prototype structuring. Nevertheless, there is also a large amount of clustering that does not appear semantically motivated. Gries (2006: 81, 83) accepts this and suggests that the data sparseness is, at least
partially, to blame.
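For readers who wish to experiment, the clustering procedure itself can be sketched in a few lines. The Python fragment below implements a bare-bones agglomerative average-linkage clustering over toy sense profiles; the feature counts are invented, and the real analysis uses the full range of usage-features and R’s hclust rather than this hand-rolled version:

```python
import math

# Toy frequency profiles: each sense is a vector of usage-feature counts
# (invented numbers for illustration).
profiles = {
    "motion":      [9, 1, 0],
    "fast motion": [8, 2, 0],
    "manage":      [0, 7, 5],
    "execute":     [1, 6, 6],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_link(c1, c2):
    # Average linkage: mean pairwise distance between the two clusters' members.
    dists = [euclidean(profiles[a], profiles[b]) for a in c1 for b in c2]
    return sum(dists) / len(dists)

# Agglomerative clustering: repeatedly merge the two closest clusters.
clusters = [[s] for s in profiles]
merges = []
while len(clusters) > 1:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: average_link(clusters[ij[0]], clusters[ij[1]]),
    )
    merges.append((clusters[i], clusters[j]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(merges[0])  # → (['motion'], ['fast motion'])
```

Each step merges the pair of clusters with the smallest mean pairwise distance, which is what the ‘average’ agglomeration method behind a dendrogram such as Figure 1 does; reading the merge sequence bottom-up reproduces the tree.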
Replicating the procedure gives similar results – a reasonable degree of intuitively
sound clustering but also a reasonable amount of ‘noise’ in the dendrogram where
clusters make little or no sense. For the sake of brevity, we will not present the dendrogram, but instead present the cluster results obtained by simplifying the data and
limiting the usage-features used to cluster them. In order to obtain a more coherent
clustering of senses, rare senses were omitted. Also, the two most frequent senses,
‘fast pedestrian motion’ and ‘escape’, were omitted. These two senses were found to
systematically dominate the clustering, rendering the relations between the other
senses difficult to discern. We can suppose that these two senses were so distinct in
usage that the clustering could not model their relationship and more subtle relations
of the other senses simultaneously. This effect was found, regardless of the distance
measure used.
A combination of three factors is used to cluster the dictionary senses. Following
Gries’ study, these factors consist of Transitivity, Subject ‘Type’ Semantics and Object
‘Type’ Semantics. Figure 1 presents the results using the simplified range of senses and
these three factors. It is produced using the Euclidean distance measure (the simplest
distance measure) and ‘average’ as the agglomeration method (a common agglomeration method).

Figure 1. Hierarchical cluster analysis of dictionary senses. Distance matrix – Euclidean; agglomeration method – ‘average’
We must interpret such plots with caution. Even having removed the rarely
occurring senses, some of the remaining senses are still infrequent, for example –
‘caused motion’, ‘motion into difficulty’ and ‘campaign’. Nevertheless, the overall picture seems reasonably coherent. Examining the dendrogram, two broad sense clusters
emerge, clustered by the right and the left branches. The left branch includes most of
the abstract senses, with perhaps the exception of ‘function’, which is less abstract.
Note, however, that the analysis has ‘function’ and ‘diffuse’ as quite distinct from this
abstract cluster. It appears that the analysis has trouble incorporating these senses.
The intuitive adequacy of the model is left up to the reader, but it is worth pointing out
that the literal motion senses are coherently grouped together as well as the control
senses (‘manage’, ‘in charge of’, ‘execute’). However, the place of ‘become used up’ with
these two groupings of senses is not clear, nor is the relationship between the ‘control’
senses and the literal motion senses.
There does exist an internal logic to the cluster of abstract senses. The metaphoric motion senses are grouped together, just as are the ‘spread’ senses of ‘flow’,
‘exist in abundance’, and ‘extend temporarily’. The other groupings are not illogical,
but apart from representing abstract or metaphoric meanings of run, they share little
semantically.
Gries (2006: fn. 19, pp. 93–94) found that changing the distance matrix and/or the
agglomerating method did not alter the results. This was not the case with the current
data set. Experimenting with different agglomeration methods greatly improved or
worsened the interpretability of the dendrogram and, occasionally, the actual results
of the cluster analysis. Likewise, different distance measures also produced different
results. This could, perhaps, be a sign of the instability of the analysis – attempting to cluster 23 senses based on a sample of 500 is far from an ideal condition in multivariate statistics. Figure 2 presents the results for the Canberra distance matrix, clustered with the Ward agglomeration method. The different agglomeration methods did not change the results for the Canberra matrix, only their legibility; the Ward method gave the clearest dendrogram.
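The clustering procedure discussed here can be sketched in a few lines. The following is a minimal pure-Python illustration using a hypothetical toy frequency table (four senses, three usage-features), not the actual run data; it shows how the choice of distance measure (Euclidean versus Canberra) can change which senses group together, exactly the kind of instability described above.

```python
import math

# Toy sense-by-feature frequency profiles (hypothetical, not the run data)
profiles = {
    'manage':  (10, 0, 1),
    'execute': (8, 1, 1),
    'flow':    (0, 9, 2),
    'diffuse': (1, 2, 8),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canberra(a, b):
    # Canberra: sum of |x - y| / (|x| + |y|), skipping 0/0 terms
    return sum(abs(x - y) / (abs(x) + abs(y))
               for x, y in zip(a, b) if abs(x) + abs(y) > 0)

def average_linkage(data, metric):
    """Agglomerative clustering with average linkage; returns merge history."""
    clusters = {name: [name] for name in data}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        # Average linkage: mean pairwise distance between cluster members
        pairs = [(sum(metric(data[a], data[b])
                      for a in clusters[k1] for b in clusters[k2])
                  / (len(clusters[k1]) * len(clusters[k2])), k1, k2)
                 for i, k1 in enumerate(keys) for k2 in keys[i + 1:]]
        _, k1, k2 = min(pairs)
        merges.append((k1, k2))
        clusters[k1 + '+' + k2] = clusters.pop(k1) + clusters.pop(k2)
    return merges

# Different distance measures can yield different clusterings
print(average_linkage(profiles, euclidean))
print(average_linkage(profiles, canberra))
```

On this toy table the two metrics agree on the first merge but disagree on the second, mirroring the sensitivity to the choice of distance measure reported for the run data.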
In Figure 2, we again see two main branches. At first glance, the overall branching and clustering of the senses appears more coherent than that produced using the Euclidean distance measure. However, if we inspect the clustering more closely, intuitive semantic coherence is not wholly systematic. At the coarse-grained level, we have
lost the clear distinction between relatively concrete uses such as ‘manage’ and ‘literal motion’ versus ‘metaphoric motion’ as well as the extending and disseminating
senses. For the four sub-clusters, there is a little more semantic coherence. The first
sub-cluster of ‘execute’ and ‘manage’ is intuitively sound. The second is also coherent,
Figure 2. Hierarchical cluster analysis of dictionary senses. Distance matrix – Canberra; agglomeration method – ‘Ward’
The many uses of run 131
save for the sense ‘in charge of ’. This is semantically related to ‘execute’ and ‘manage’
and, therefore, given the small sample, is more or less in the ‘correct’ branch. The next
sub-cluster of ‘difficulty’ (run into difficulty), ‘campaign’ (run for election), and ‘meet’
(run into a friend) is semantically coherent, given a broad interpretation of ‘campaign’
that includes meeting people and difficulties. This is not as unlikely an interpretation
as one might first suppose. Recall that the different semantic types of objects and subjects determine these sense clusters.
Moving to the right across the clusters, the next sub-cluster of ‘exist in abundance’
and ‘extend temporarily’ is intuitively coherent. However, the rest of the group appears semantically heterogeneous. The last cluster on the right, although distinct with
a long branch stemming from the rest of the dendrogram, also lacks obvious semantic
coherence. Although one is able to interpret semantic structure here, it is not self-evident why ‘diffuse’ and ‘function’ or ‘broadcast’ and ‘increase’ should group together.
The point of both this small study and Gries’ is merely to consider two methodological possibilities. In light of this, the fact that the two distance matrices produced different clusterings raises important methodological questions. Standards and
checks for appropriateness need to be developed before the use of cluster analysis can
be relied upon to determine frequency-based semasiological structure.3
3.3 Semasiological clustering with social dimensions
Before we consider the effects of social variation on semantic structure, it must be
stressed that one would not expect to find substantial variation with these data and
for this lexeme. Therefore, even a small degree of variation is a sign of the extent of the
issue. There are four reasons for this:
1. In terms of cultural variation, run is a ‘simple’ lexeme. It is the kind of lexeme
where one would not expect variation across dialects.
2. In terms of register, run is a ‘neutral’ lexeme, not belonging to either formal or
informal registers. It is the kind of lexeme where one would expect relatively little
variation across text types. One exception to this might be the two central senses
of ‘fast pedestrian motion’ versus ‘manage’, where text type would be expected to
show variation in use.
3. Divjak and Gries (2006: 37) state that the Canberra distance matrix is best suited to small cell counts, such as we have here. Gries (2009: 317) says the choice is subjective. Gries and Stefanowitsch (2010: 79) employ the Manhattan distance matrix, citing Levy et al. (1999) as justification. Levy et al.’s study compares five distance matrices but not Canberra. It seems that the question of how the choice of distance matrix affects the results needs to be investigated systematically. Divjak and Fieller (this volume, 405–442) also discuss the range of methods.
3. Although there are certainly differences between American and British English, the dialects remain mutually intelligible for most speakers of both varieties, especially in written language and educated speech. In other words, the difference between American and British English is not that great, making dialect a good test case.
4. Although there are certainly differences between the registers of spoken conversation and online personal diaries, the style of the latter is extremely informal and dialogic. Unlike traditional diaries, authors here engage in discourse with readers, and the style of the genre is conversational and casual. Therefore, just as for dialect variation, one would not expect substantial differences across text types.
We could repeat the clustering presented in the previous section for the two dialects
and the two registers and compare the clustering. However, the cluster analyses on the
full data set are obviously unstable and halving the data would make any multivariate
analysis impossible. Let us begin, rather, with a Chi-squared test of independence
that identifies statistically significant differences along the lines of register and dialect.
A Pearson’s Chi-squared test of independence for dialect identifies significant
differences between the British and American data for the dictionary senses (p =
0.001263). The residuals show that ‘become used up’ but also ‘escape’ and ‘fast motion’
are more typical of the British use, and ‘meet’, ‘increase’ but also ‘execute’ and ‘diffuse’ of the American use. Register also reveals a significant difference (p = 6.376e-05)
with the residuals showing that ‘escape’, ‘fast pedestrian motion’, ‘metaphoric motion’
are associated with the diaries, and ‘caused motion’, ‘diffuse’, ‘execute’, ‘function’, and
‘increase’ with the conversation data. Having established that there is significant variation, let us move to trying to capture how that variation interacts with the semasiological structure.
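The logic of such a test can be sketched as follows. The counts below are hypothetical (rows: two dialects; columns: three senses), not the study’s data. The Pearson residuals – observed minus expected counts, scaled by the square root of the expected counts – identify which senses drive a significant result: positive residuals mark over-representation, negative residuals under-representation.

```python
import math

def chisq_residuals(table):
    """Pearson chi-squared statistic and the matrix of Pearson residuals."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    # residual_ij = (observed - expected) / sqrt(expected)
    resid = [[(obs - rows[i] * cols[j] / n) / math.sqrt(rows[i] * cols[j] / n)
              for j, obs in enumerate(row)]
             for i, row in enumerate(table)]
    stat = sum(r * r for row in resid for r in row)
    return stat, resid

# Hypothetical dialect-by-sense counts (rows: BrE, AmE;
# columns: 'meet', 'become used up', 'motion') -- not the study's data
table = [[5, 20, 30],
         [15, 10, 28]]
stat, resid = chisq_residuals(table)
# Here 'meet' has a large negative residual for BrE (under-represented)
# and a large positive one for AmE (over-represented)
print(round(stat, 3), round(resid[0][0], 3))
```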
Although cluster analysis is a powerful tool for identifying how the different senses are related, it cannot show how register and dialect affect those relations. Ideally,
given enough data, we could label the occurrences of the different senses for dialect
and register and even both simultaneously. The cluster analysis would then show the
relations between the different senses relative to the social factors, clustering, for instance, ‘fast pedestrian motion BrEng’ and ‘fast pedestrian motion AmEng’ etc. Although a straightforward procedure, for the number of senses involved, this would
require a much larger data set.
Another statistical technique, explained in Glynn (this volume, 443–485), is correspondence analysis. A multivariate and exploratory technique similar in many ways
to cluster analysis, it visualises relations between all the factors considered rather
than just one factor. Figure 3 presents the results of a binary correspondence analysis,
which examines the interaction of dialect, register, and dictionary sense.
Figure 3. Binary correspondence analysis of register, dialect, and dictionary sense
The first two dimensions of the analysis explain 87% of the variation (inertia), indicating a relatively stable analysis. It is immediately visible that American Conversation (AmE.Conv) is distinct in use relative to the dictionary senses, dominating the right two quadrants of the plot along the central axis. The senses ‘increase’, ‘diffuse’,
and ‘motion into difficulty’ (Difficulty) are distinctly and highly associated with the
American conversation data point on the right of the plot. In the bottom half of the
plot, we find a range of senses distinctly associated with the American diary genre (AmE.Blog). The senses ‘campaign’, ‘copy’, and perhaps ‘metaphoric motion’ (Met.
Motion) are highly and distinctly associated with American diary use. ‘Meet’ and ‘extend space’ are likely to be associated with American English but are not distinct to
either register, lying between the two data points for American Diary and American
Conversation.
Moving to the British uses, the plot becomes more difficult to interpret. The analysis suggests that there is less register variation in the British sample, the two data
points British Conversation (BrE.Conv) and British Diary (BrE.Blog) both lying in
the same top left quadrant. Nevertheless, the dialect variation is clear – the senses
‘flow’ and ‘extend time’ are highly and distinctly associated with the British use. Other
senses, such as ‘use up’, ‘cause motion’ and ‘escape’, are also relatively associated with
British use, but this association is not distinctive.
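For readers wishing to see what the reported ‘inertia’ is: it is the Pearson chi-squared statistic divided by the sample size, and correspondence analysis partitions it over the plotted dimensions via a singular value decomposition of the standardised residual matrix. A minimal sketch of the total-inertia computation, using a hypothetical dialect-by-sense table (not the study’s data):

```python
import math

def total_inertia(table):
    """Total inertia of a contingency table = Pearson chi-squared / n.
    Correspondence analysis decomposes this quantity via an SVD of the
    standardised residual matrix s."""
    n = sum(sum(row) for row in table)
    r = [sum(row) / n for row in table]        # row masses
    c = [sum(col) / n for col in zip(*table)]  # column masses
    # s_ij = (p_ij - r_i * c_j) / sqrt(r_i * c_j)
    s = [[(table[i][j] / n - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(len(c))]
         for i in range(len(r))]
    return sum(x * x for row in s for x in row)

# Hypothetical (dialect x sense) table -- not the study's data
table = [[5, 20, 30],
         [15, 10, 28]]
print(total_inertia(table))
```

The ‘explained’ percentages quoted for Figures 3 and 4 are each dimension’s share of this total.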
Figure 4. Multiple correspondence analysis. Burt matrix, method ‘adjusted’
In order to obtain a clearer picture of the interactions at hand, let us submit the
same data to a multiple correspondence analysis. The binary analysis, in Figure 3,
gives us a reliable and stable representation of the associations, but it cannot capture
interactions between dialect and register. This is because these two factors were concatenated in order to produce a two-dimensional contingency table for the analysis.
We can expand that table into a three-dimensional table and apply multiple correspondence analysis. The results are more difficult to interpret and can be less stable
(less accurately capturing and representing associations in the data). However, the
plot in Figure 4 was produced using the recently developed ‘adjusted’ method which
addresses both issues of stability and clarity. Fortunately, the results are clear and the explained inertia is 86.7% (Dim. 1: 61.2%, Dim. 2: 25.5%), which for an adjusted multiple correspondence analysis using Burt matrices is a stable result (Greenacre 2007).
Further details on and an explanation of the technique of correspondence analysis,
and its limitations and strengths, can be found in Glynn (this volume, 443–485).
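A Burt matrix, on which this kind of multiple correspondence analysis operates, is simply the symmetric cross-tabulation of every factor level against every other factor level. A minimal sketch with hypothetical observations coded for dialect and register (not the study’s data):

```python
def burt_matrix(observations, factors):
    """Burt matrix: cross-tabulation of all factor levels against all
    factor levels (including each factor against itself on the diagonal)."""
    levels = [f + ':' + v
              for f in factors
              for v in sorted({obs[f] for obs in observations})]
    index = {lv: i for i, lv in enumerate(levels)}
    burt = [[0] * len(levels) for _ in levels]
    for obs in observations:
        # each observation activates one level per factor
        active = [index[f + ':' + obs[f]] for f in factors]
        for i in active:
            for j in active:
                burt[i][j] += 1
    return levels, burt

# Hypothetical observations coded for dialect and register
obs = [
    {'dialect': 'BrE', 'register': 'conversation'},
    {'dialect': 'BrE', 'register': 'diary'},
    {'dialect': 'AmE', 'register': 'diary'},
]
levels, burt = burt_matrix(obs, ['dialect', 'register'])
print(levels)
for row in burt:
    print(row)
```

The diagonal blocks hold the marginal frequencies of each factor; the off-diagonal blocks are the ordinary two-way contingency tables between factors.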
The results presented in Figure 4 largely reflect the binary correspondence analysis, but by treating the factors of dialect and register independently, the analysis
affords us a clearer depiction of their interaction. Each of the four quadrants is characterised by one of the four sociolinguistic features: the top right – British dialect (Lang.
BrEng); the bottom right – diary register (Reg.Blog); the bottom left – American dialect (Lang.AmEng); and the top left – conversation register (Reg.Conversation).
We see that senses such as ‘execute’ and ‘diffuse’, lying between the American data point and the Conversation data point, are common to these two usage dimensions. The
senses ‘campaign’ and ‘metaphoric motion’, lying between the American data point
and the register of diary (Reg.Blog), are common to these dimensions. The senses
‘beyond’ the American data point, relative to the British dialect data point in the top
right-hand quadrant, are neutral with regard to register, but are distinctly American
in contrast to British. These senses include ‘extend in space’, ‘copy’, and ‘meet’.
Repeating the interpretation, beginning from the top right-hand quadrant and
the British data point, we see that ‘use up’ is distinctly typical of British conversation
and that ‘fast motion’ is typical of British diaries. The senses ‘flow’ and ‘extend time’
are less associated with a given register, but are distinctly British, relative to the American data. Again we see that register variation for the British use is less important.
Finally, note the position of ‘manage’ and ‘fast pedestrian motion’. These data
points, along with some other senses, are in the centre of the plot. The senses located
in the centre are the senses that are not affected by either of the two sociolinguistic
usage factors. These senses are central, but not just in the way that Gries (2006) argued. Although still understood in terms of frequency, we now also have two usage
dimensions, dialect and register. Not only are these senses among the most frequent,
they are among the senses least affected by context. This is a crucial refinement to the frequency operationalisation of (proto)typicality – uses that are common (frequent) across all contexts are more central to the meaning of a lexeme. This finding is equally as important as discerning which senses are typical of specific contexts.
Gries (2006) stresses that the small sample means that the study can only be seen
as a methodological test, rather than a fully descriptive analysis. For these reasons, the
statistical techniques employed are only exploratory. He suggests the use of configural
frequency analysis to identify statistical significance in the results, allowing one to
determine which correlations are not chance, and which may be simply a result of the
small sample. Although configural frequency analysis would be an excellent choice
for this, it requires more data than is available in either study. It also follows that, with
more data, log-linear analysis or multinomial logistic regression would be even better,
giving not only statistical significance but also predictive strength to the model. Such
analyses are now within the capabilities of corpus-driven research, but require a larger
scale analysis.
Moreover, before such an analysis is undertaken, the identification of senses must be better operationalised. The usage-features must be found to cluster into senses, and these multivariate senses must then be shown to be statistically significant. With senses based on clusters of usage-features (ID Profiles), rather
than revealed by matching occurrences with dictionary entries, we can then return
to the clustering. This step in corpus-driven polysemy research has begun (Glynn
2009, 2010, in press), but remains at the initial stages. Once we are armed with the
analytical tools to identify multivariate senses (rather than dictionary senses), then we
need to progress to modelling the semasiological structure and the prototype effects,
using more advanced statistical procedures such as configural frequency analysis and
log-linear analysis. The present purpose is to demonstrate that sociolinguistic effects must be integrated into the study of prototype structuring. To this end, let us submit the data to binary logistic regression.
Explained in Speelman (this volume, 487–533), logistic regression is a confirmatory multivariate technique that allows us not only to determine which of the usage features and/or dictionary senses are significantly associated with either of the sociolinguistic factors, but also to determine how important that association is.
Logistic Regression – Dialect
Let us begin with dialect. Three logistic regression models are reported: a multiple
model based on usage-features excluding dictionary senses (Model 1); a second multiple model that includes dictionary senses (Model 2); and a simple model with the
dictionary senses as a sole predictor variable (Model 3).
The models are all checked for multicollinearity, and factors producing a variance inflation of more than 2.5 are removed.4 Moreover, the models are checked for singularity with a kappa-calculated condition number – any model with a value higher than 6 is rejected.5 These strict checks on variance inflation and singularity assure an orthogonal model. The models are also checked for influential observations as well as overfitting, neither of which is a problem. Outliers are not removed. In a backward elimination of factors, model selection was based on significance values and Akaike’s information criterion (AIC), not on predictive strength.6 For readers unfamiliar with logistic regression, the testing of the model and the criteria for acceptability were extremely strict, making the results as conservative as possible.
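As an illustration of the variance inflation check: in the special case of exactly two predictors, the VIF of each reduces to 1 / (1 − r²), where r is the correlation between them. A minimal sketch with hypothetical predictor values (not the study’s data):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def vif_two_predictors(x, y):
    """With exactly two predictors, each one's VIF is 1 / (1 - r^2)."""
    return 1.0 / (1.0 - pearson_r(x, y) ** 2)

# Two strongly correlated (hypothetical) predictors: r = 0.9
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 5, 4]
print(vif_two_predictors(x, y))
```

Here r = 0.9 gives a VIF of about 5.3, which would fail the 2.5 cutoff applied above. In a full model, each predictor’s VIF is computed from the R² of regressing that predictor on all the others.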
For the sake of brevity, some non-significant levels are omitted, indicated by ‘…’.
Positive coefficients predict British English and negative coefficients (“–”) predict
American English. Since we are comparing models, only the coefficients and some
4. Some authorities indicate a variance inflation factor of 10 to be acceptable (DeMaris 2003: 517; Dodge 2008: 96; Chatterjee and Hadi 2006: 238; Marques de Sá 2007: 307; Speelman p.c.), while other authorities are non-committal (Faraway 2002: 117–120; Maindonald and Braun 2003: 201–203). Glynn (2010) and Speelman (this volume, 487–533) opt for a maximum inflation value of 4. Szmrecsanyi (2006: 215) notes that even values as low as 2.5 can be a cause for concern. Multicollinearity is a serious issue in regression and can lead to Type I errors. Since we do not necessarily understand the relationship between many of the factors in our model, a maximum VIF of 2.5 is used to determine which factors can be combined in the model.
5. Baayen (2008: 182) states that a condition number between 0 and 6 indicates no multicollinearity and 15 indicates a medium degree.
6. The AIC score helps compare the parsimony of different models. The scores are relative and
a lower number indicates a more parsimonious model.
Table 2. Logistic regression models for dialect

Coefficients                   Model 1        Model 2        Model 3
Transitivity – Transitive       0.596619*      0.487405º      –
Tense – Past                    0.658282*      –              –
Tense – Present                 0.387003       –              –
Aspect – Progressive            –              0.178450       –
Aspect – Simple                 –              0.542254º      –
Mood – Imperative               0.533600       0.392821       –
Mood – Interrogative            1.048808*      1.094234*      –
Clause Type – SubPronoun       –1.807732º     –2.225015       –
Clause Type – SubNP            –0.490190      –2.225015º      –
Clause Type …                   …              …              –
Subject – Human                –1.446052º      –              –
Subject – Locations            –0.289667       –              –
Subject – Machine              –1.564679*      –              –
Subject …                       …              –              –
Sense – Use Up                  –              0.955200º      0.94852*
Sense – Diffuse                 –             –1.364175º     –1.65945*
Sense – Execute                 –             –1.196454*     –1.14862*
Sense – Campaign                –             –1.289732      –1.65945
Sense – Fast Motion             –              0.792536       0.82546
Sense – Flow                    –              1.405831       1.55943
Sense – Increase                –             –2.287706*     –2.12945*
Sense – Meet                    –             –1.566828*     –1.59046*
Sense …                         –              …              …

Model statistics
d.f.                           20             27             20
G2                             41.39**        70.71***       60.16***
C                               0.668          0.716          0.671
Nagelkerke R2                   0.112          0.186          0.156
Bootstrapped R2                 0.0138         0.0435         0.057
essential model statistics are reported.7 In Table 2, the coefficients for each of the
levels (usage-features) are listed with the alpha levels (º p < 0.1, * p < 0.05, ** p < 0.01,
*** p < 0.001).8 The degrees of freedom (d.f.), Log Likelihood chi-squared or deviance
measure (G2), the C index or coefficient of concordance statistic (C), the Nagelkerke
7. The R output includes the estimated standard errors and the Wald chi-square, or z-test (z), obtained by dividing the coefficient by its error. See Speelman (this volume, 487–533) for a detailed explanation. The output for a logistic regression analysis, using rms in R, is explained in Baayen (2008: 204) and Gries (2009: 297). See also Chatterjee and Hadi (2006).
8. Significance levels are primarily used for model selection. Caution should be taken in interpreting them relatively (Faraway 2002: 126).
pseudo R2 (N. R2), and a bootstrapped estimation of the pseudo R2 (Boot. R2) are
included as model statistics.
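The C index reported for each model can be computed directly: over all pairs consisting of one positive and one negative outcome, it is the proportion of pairs in which the model assigns the higher predicted probability to the positive case (ties count as 0.5). A minimal sketch with hypothetical predictions (not the study’s data):

```python
def c_index(probs, outcomes):
    """C (concordance) index: share of (positive, negative) pairs in which
    the positive case receives the higher predicted probability."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = sum(1.0 if a > b else 0.5 if a == b else 0.0
                     for a in pos for b in neg)
    return concordant / (len(pos) * len(neg))

# Hypothetical predicted probabilities of 'British English' and true labels
probs = [0.9, 0.7, 0.4, 0.2]
outcomes = [1, 0, 1, 0]
print(c_index(probs, outcomes))
```

A C of 0.5 is chance-level discrimination and 1.0 is perfect; the values around 0.67–0.72 in Tables 2 and 3 thus indicate modest discriminative ability.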
Models 1 and 2 are the result of a backward elimination of a full model without
and with dictionary senses, respectively. Object and complement usage-features were,
interestingly, found to be not at all significant. This could be due to the large number of features (levels) belonging to these factors. A much larger data set is needed to
handle such complexity. For these same reasons, it was not possible to examine interactions. Moreover, in order to improve the model, some small cells were removed for
the variable of dictionary sense. This weakens predictive strength, but improves overall parsimony of the model. In model 2, entering subject semantics and dictionary
senses simultaneously shows signs of multicollinearity. The subject is, thus, omitted.
Although models 1 and 2 show some significant features, we are not interested in
finding differences between the two dialects, but in finding differences in the semasiological structuring of the two dialects. Since we do not know how the different formal
usage-features interact in terms of semasiological structure, it is difficult to interpret
models 1 and 2 in these terms. It is clear, however, that these usage-features do not
offer a predictive model, which reassures us that there are no ‘obvious’ differences that
would superficially distinguish one dialect from the other in a trivial manner.
Model 3, on the other hand, is simple to interpret. If we recall the residuals of the
Chi-squared test shown above, we see confirmation of those results, but this time with
statistical significance as well as a score indicating effect size. For the American data,
the sense ‘increase’ is by far the most distinctive feature, followed by ‘execute’, ‘meet’
and then ‘diffuse’. ‘Campaign’ is not statistically significant (p = 0.1338), but it would almost surely become significant with more data. We saw in the correspondence analysis
that ‘campaign’ is highly associated with American usage and it is known that this
sense is effectively unique to the American dialect. In British English, the verb stand
for an elected post is more typical than run for an elected post. The sense ‘motion into
difficulty’ behaved similarly to ‘campaign’, but was removed due to its small count.
With the current data set, five of the senses are statistically significant. This confirms what we saw in the Chi-squared test and correspondence analysis. In the logistic
regression, the coefficients give us a rank of influence similar to the Pearson’s residuals
obtained from the Chi-squared tests. This kind of ranking is exactly the type of information needed for understanding the effects of such dialect variation on prototype
structuring.
However, none of the five senses in question is a particularly strong predictor, though the sense ‘increase’, associated with American usage with a coefficient of 2.1 in absolute value, is comparatively strong. At the other end of the spectrum, ‘become used up’, predicting British
English is a relatively weak predictor. Ranked in order of influence, we now know that
‘increase’, ‘diffuse’, ‘meet’ and then ‘execute’ are distinctly American in use, where only
‘become used up’ is a significant predictor for British usage.
The statistics presented beneath the table of coefficients show that the model is
not predictively strong. A fourth model, without the infrequent senses of ‘campaign’
and ‘motion to difficulty’, produces comparable statistics (G2: 57.18, d.f. 18, R2: 0.151,
C: 0.667). The poor predictive strength of the model, of course, is to be expected. If
the differences between the dictionary senses in themselves were so great that one
could predict one dialect over the other with this information alone, we would have
such obvious semasiological variation that this study would not be needed. What the
logistic regression gives us is a clear and specific picture that, although all the senses
are possible in both dialects (‘campaign’ aside), the differences in frequency of occurrence are great enough that even with a small data set, significant differences can be
identified.
Register effects
Just as for the dialect effects, we will consider three models: a simple regression analysis of the dictionary senses and two multiple logistic regressions. Table 3 summarises
the models. Model selection followed the same criteria as for the previous logistic
regression. Positive coefficients predict conversation register and negative coefficients
predict diary register.
Just as for the regression analyses above, none of the models is predictively strong. This means that, for the features analysed, we cannot predict whether an example will be one dialect or the other, or one register or the other. This does not mean,
however, that we cannot interpret the table of coefficients to see where significant
differences do exist.
In model 1, we have an interesting selection of significant Subject Semantic and Object Semantic categories. The interaction of such features is likely to represent the usage-configurations of profiles that could be understood as non-reified senses. Evidence for this can be found in the high collinearity produced when these variables are entered into the regression with dictionary senses. Of course, further research is needed to ascertain what these configurations would consist of.
Many individual (as opposed to configurations of) Subject and Object Semantic
categories could be seen as operationalisations of different dictionary senses in their
own right. For example, the Object Semantic feature of ‘machine’ would be a reasonable operationalisation of the sense ‘operate’ (or ‘execute’), just as the Subject Semantic
category of ‘machine’ would indicate ‘function’. We see here highly significant and
important predictors of the register conversation. This is the kind of extra-linguistic effect on frequency-based prototype structure we referred to in section 1, when
we compared run ‘manage’ and run ‘fast pedestrian motion’. Rather than dictionary
senses, we see how different semantic features are interacting with other dimensions
of use. Although not predictively strong, model 1 offers interesting insights into the
kind of semantic variation we have between the two registers.
Table 3. Logistic regression models for register

Usage – feature (levels)              Model 1       Model 2       Model 3
Tense – Past                          –0.57321º     –             –
Tense – Present                        0.0518       –             –
Clause Type – Subj. NSub               –           –1.12794*      –
Clause Type – Subj. Pronoun            –           –0.46960       –
Clause Type – Subj. NP                 –           –0.67469*      –
Subject Sem. – Human                   1.21485º     –             –
Subject Sem. – Location                2.30656*     –             –
Subject Sem. – Machine                 2.53738**    –             –
Subject Sem. – Quality                 2.56357*     –             –
Subject Sem. …                         …            –             –
Object Sem. – Animate                  2.61971+     –             –
Object Sem. – Concrete Count Noun      1.92852**    –             –
Object Sem. – Concrete Mass Noun       2.12490**    –             –
Object Sem. – Events                   2.09651**    –             –
Object Sem. – Human                    1.60972*     –             –
Object Sem. – Location                 1.68428**    –             –
Object Sem. – Machine                  2.26511**    –             –
Object Sem. – Organisation             2.30594      –             –
Object Sem. – Quantity                 1.65779º     –             –
Object Sem. – NA                       1.17705º     –             –
Object Sem. …                          …            –             –
Sense – Become Used Up                 –            1.12163*      0.9639*
Sense – Diffuse                        –            2.09507**     1.9373*
Sense – Escape                         –           –0.60738º     –0.4453
Sense – Execute                        –            1.43060*      1.4265**
Sense – Function                       –            1.92870**     1.8684**
Sense – In Charge of                   –            1.36388*      1.1164*
Sense – Increase                       –            2.77507*      2.4073*
Sense – Manage                         –            0.82270º      0.5691
Sense – Meet                           –            0.73929º      0.4457
Sense – Motion to Difficulty           –            1.88122*      1.8320*

Model statistics
d.f.                                  25           26            19
G2                                    67.45***     82.31***      63.95***
C                                      0.697        0.721         0.69
Nagelkerke R2                          0.176        0.211         0.167
Bootstrapped Nagelkerke R2             0.064        0.077         0.063
Models 2 and 3 differ little. The addition of other factors in the multiple regression of model 2 merely identifies some syntactic variation. The dictionary senses
again confirm what we saw in the Chi-squared test, although we now see that none
of the senses are significantly associated with the diary register. We see also that the
senses ‘increase’, ‘motion to difficulty’ and ‘diffuse’ are the senses most highly associated with conversation.
4. Summary
This study examined the semasiological variation of run, replicating the study of Gries
(2006), but adding two sociolinguistic contexts. The aim was to add two usage dimensions to the polysemy ‘map’ in order to more accurately represent usage-structure.
The descriptive findings of Gries are largely confirmed. One further sense, ‘escape’,
was found to be relatively important. More importantly, the study added to Gries’
results by identifying which senses are neutral with regard to the two contexts considered. That certain senses are not affected by sociolinguistic variation, yet others are,
adds considerable weight to the argument that they represent the prototype senses.
The theoretical and methodological goal of this study was straightforward – to
demonstrate that, although at times subtle, sociolinguistic factors have a significant impact upon semasiological structure. Cognitive Linguistics has propounded
a usage-based model of language since its beginning, but its approach to semantic
structure remained largely Structuralist and Generativist in its assumptions and
methodologies. The increased use of observational techniques as well as multivariate
statistics improves our understanding of the complexity of polysemy structures, but
also brings out the need to treat semantic structure in a radically new way. Our very
own model of language states that language structure is varied and emergent, and
that categorisation is rarely discrete. We must accept the ramifications of this in our
research and resist the temptation to assume that discrete reified lexical senses exist
or that those senses exist in some abstract system, independent from the variation of
societies and cultures that use them. Since meaning is emergent, multidimensional,
and ultimately non-reifiable, a description of polysemy that is both cognitively and
communicatively realistic will depend upon developing empirical methodology that
can adequately describe the complexity of this object of study.
References
Arppe, A., & Järvikivi, J. (2007). Every method counts: Combining corpus-based and experimental evidence in the study of synonymy. Corpus Linguistics and Linguistic Theory, 3,
131–159.
Arppe, A., Gilquin, G., Glynn, D., Hilpert, M., & Zeschel, A. (2010). Cognitive corpus linguistics: Five points of debate on current theory and methodology. Corpora, 5, 1–27.
DOI: 10.3366/cor.2010.0001
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R.
Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511801686
Chatterjee, S., & Hadi, A. (2006). Regression analysis by example. Chichester & New Jersey:
Wiley. DOI: 10.1002/0470055464
Chomsky, N. (1964). A transformational approach to syntax. In J. A. Fodor, & J. Katz (Eds.), The
structure of language (pp. 211–241). Englewood Cliffs: Prentice-Hall.
Cuyckens, H. (1993). The Dutch spatial preposition “in”: A cognitive-semantic analysis. In
C. Zelinsky-Wibbelt (Ed.), The semantics of prepositions: From mental processing to natural
language processing (pp. 27–72). Berlin & New York: Mouton de Gruyter.
DeMaris, A. (2003). Regression with social data: Modeling continuous and limited response variables. Chichester & New Jersey: Wiley.
Dirven, R., Goossens, L., Putseys, Y., & Vorlat, E. (1982). The scene of linguistic action and
its perspectivization by SPEAK, TALK, SAY, and TELL. Amsterdam & Philadelphia: John
Benjamins. DOI: 10.1075/pb.iii.6
Divjak, D. (2010). Structuring the lexicon: A clustered model for near-synonymy. Berlin & New
York: Mouton de Gruyter.
Divjak, D., & Gries, St. Th. (2006). Ways of trying in Russian: Clustering behavioral profiles.
Corpus Linguistics and Linguistic Theory, 2, 23–60. DOI: 10.1515/CLLT.2006.002
Dodge, Y. (2008). The concise encyclopedia of statistics. Heidelberg & New York: Springer.
Faraway, J. (2002). Extending the linear model with R: Generalized linear, mixed effects and nonparametric regression models. London: Chapman and Hall.
Fillmore, C. (1985). Frames and the semantics of understanding. Quaderni di Semantica, 6,
222–254.
Geeraerts, D. (1987). On necessary and sufficient conditions. Journal of Semantics, 5, 275–291.
DOI: 10.1093/jos/5.4.275
Geeraerts, D. (1989). Prospects and problems of prototype theory. Linguistics, 27, 587– 612.
DOI: 10.1515/ling.1989.27.4.587
Geeraerts, D. (1993). Vagueness’s puzzles, polysemy’s vagaries. Cognitive Linguistics, 4, 223–
272. DOI: 10.1515/cogl.1993.4.3.223
Geeraerts, D. (1995). Representational formats in Cognitive Semantics. Folia Linguistica, 39,
21–41.
Geeraerts, D. (1997). Diachronic prototype semantics: A contribution to historical lexicology.
Oxford: Clarendon Press.
Geeraerts, D. (2000). Salience phenomena in the lexicon: A typology. In L. Albertazzi (Ed.),
Meaning and Cognition (pp. 125–136). Amsterdam & Philadelphia: John Benjamins.
Geeraerts, D., Grondelaers, S., & Bakema, P. (1994). The structure of lexical variation: Meaning,
naming, and context. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110873061
Gilquin, G. (2010). Corpus, cognition and causative constructions. Amsterdam & Philadelphia:
John Benjamins. DOI: 10.1075/scl.39
Glynn, D. (2009). Polysemy, syntax, and variation: A usage-based method for Cognitive Semantics. In V. Evans, & S. Pourcel (Eds.), New directions in Cognitive Linguistics (pp. 77–
106). Amsterdam & Philadelphia: John Benjamins.
The many uses of run 143
Glynn, D. (2010). Testing the hypothesis: Objectivity and verification in usage-based Cognitive Semantics. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 239–270). Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110226423
Glynn, D. (In press). Cognitive socio-semantics: The theoretical and analytical role of context
in meaning. Review of Cognitive Linguistics.
Gries, St. Th. (2003). Multifactorial analysis in corpus linguistics: A study of particle placement.
London & New York: Continuum Press.
Gries, St. Th. (2006). Corpus-based methods and Cognitive Semantics: The many senses of
to run. In St. Th. Gries & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110197709
Gries, St. Th. (2009). Statistics for Linguistics with R: A practical introduction. Berlin: Mouton de
Gruyter. DOI: 10.1515/9783110216042
Gries, St. Th., & Stefanowitsch, A. (2010). Cluster analysis and the identification of collexeme
classes. In S. Rice, & J. Newman (Eds.), Empirical and experimental methods in cognitive/
functional research (pp. 73–90). CSLI: Stanford.
Janda, L. (1993). A geography of case semantics: The Czech dative and the Russian instrumental.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110867930
Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities, 31, 91–113.
DOI: 10.1023/A:1000583911091
Kleiber, G. (1990). Sémantique du prototype : catégorie et sens lexical. Paris: Presses Universitaires de France.
Kudrnáčová, N. (2010). On pragmatic and cognitive processes in meaning variation. Linguistica Silesiana, 31, 55–67.
Lakoff, G. (1987). Women, fire, and dangerous things: What categories reveal about the mind. Chicago & London: University of Chicago Press. DOI: 10.7208/chicago/9780226471013.001.0001
Lakoff, G. (1990). The invariance hypothesis. Cognitive Linguistics, 1, 39–74.
DOI: 10.1515/cogl.1990.1.1.39
Lehrer, A. (1982). Wine and conversation. Bloomington: Indiana University Press.
Lehrer, K., & Lehrer, A. (1994). Fields, networks, and vectors. In F. Palmer (Ed.), Grammar and
meaning: A festschrift for John Lyons (pp. 26–47). Cambridge: Cambridge University Press.
Levy, J., Bullinaria, J., & Patel, M. (1999). Explorations in the derivation of cooccurrence statistics. South Pacific Journal of Psychology, 10, 99–111.
Maindonald, J., & Braun, J. (2003). Data analysis and graphics using R: An example-based approach. Cambridge: Cambridge University Press.
Marques de Sá, J. (2007). Applied statistics using SPSS, STATISTICA, MATLAB and R.
Heidelberg & New York: Springer. DOI: 10.1007/978-3-540-71972-4
Rosch, E. (1975). Cognitive representations of semantic categories. Journal of Experimental Psychology, 104, 192–233. DOI: 10.1037/0096-3445.104.3.192
Rudzka-Ostyn, B. (1989). Prototypes, schemas, and cross-category correspondences: The case
of ask. In D. Geeraerts (Ed.), Prospects and problems of prototype theory (Special edition of
Linguistics 27) (pp. 613–661). Berlin & New York: Mouton de Gruyter.
Sandra, D., & Rice, S. (1995). Network analysis of prepositional meaning: Mirroring whose
mind – the linguist’s or the language user’s? Cognitive Linguistics, 6, 89–130.
DOI: 10.1515/cogl.1995.6.1.89
144 Dylan Glynn
Schmid, H.-J. (2000). English abstract nouns as conceptual shells: From corpus to cognition.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110808704
Schmid, H.-J. (2010). Does frequency in text instantiate entrenchment. In D. Glynn, &
K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 101–
135). Berlin & New York: Mouton de Gruyter.
Szmrecsanyi, B. (2006). Morphosyntactic persistence in spoken English: A corpus study at the
intersection of variationist sociolinguistics, psycholinguistics, and discourse analysis. Berlin
& New York: Mouton de Gruyter. DOI: 10.1515/9783110197808
Taylor, J. (1989). Linguistic categorization: Prototypes in linguistic theory. Oxford: Clarendon
Press.
Tribushinina, E. (2009). On prototypicality of dimensional adjectives. In J. Zlatev, M. Andrén,
M. Johansson Falck, & C. Lundmark (Eds.), Studies in language and cognition (pp. 111–
128). Newcastle: Cambridge Scholars.
Tyler, A., & Evans, V. (2001). Reconsidering prepositional polysemy networks: The case of over.
Language, 77, 724–65. DOI: 10.1353/lan.2001.0250
Zlatev, J. (2003). Polysemy or generality? Mu. In H. Cuyckens, R. Dirven, & J. Taylor (Eds.),
Cognitive approaches to lexical semantics (pp. 447–494). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110219074.447
Visualizing distances
in a set of near-synonyms
Rather, quite, fairly, and pretty
Guillaume Desagulier
Université Paris 8, Université Paris Ouest Nanterre La Défense, UMR 7114 MoDyCo
I aim to uncover revealing aspects of the conceptual structure of four English
moderators (rather, quite, fairly, and pretty) and shed new light on previous
studies on degree modifiers. I develop an original methodology for handling
and visualizing measures of significant attraction between lexical items. This
methodology combines univariate and multivariate statistics. Collexeme analysis is used as input for hierarchical agglomerative cluster analysis, and multiple
distinctive collexeme analysis is used as input for correspondence analysis. Visualizing collexemes with exploratory tools does more than depict proximities
and distances between individuals and variables. It is also an accurate means to
(a) unveil fine semantic differences in a set of near-synonymous constructions,
(b) determine entrenchment continua, and (c) represent a significant part of the
complex inventory of intensifying constructions.
Keywords: Cognitive Construction Grammar, collostructional analysis,
correspondence analysis, degree modifiers, hierarchical cluster analysis,
near-synonymy
1. Introduction
It is a well-known fact that natural languages avoid true synonymy: “languages abhor absolute synonyms just as nature abhors a vacuum” (Cruse 1986: 270). This is
why absolute synonyms are rare whereas near-synonyms are extremely frequent. In
Cognitive-Grammar terms (Langacker 1987; 1991), synonymous expressions have
identical conceptual content and impose the same construal upon that conceptual
content, while near-synonyms share the same conceptual content but differ in terms of
construal.1 If we could measure conceptual content similarities and construal differences, and then represent them graphically, we could unveil fine semantic differences
in sets of near-synonyms.
Bearing in mind that “a word is known by the company it keeps” (Firth 1957),
I explore the collocation preferences of four English degree modifiers: rather, quite,
pretty, and fairly. Because these adverbs are near-synonyms, we may expect them to
share identical conceptual content but differ in how this conceptual content is construed. Although these adverbs can grade other adverbs (pretty badly, fairly easily) or
noun phrases (quite a shock, rather a surprise), I restrict my investigation to the prototypical contexts where they are used as degree modifiers of adjectives, as examples
(1) to (4) illustrate:
(1) I heard a rather odd conversation.
(Corpus of Contemporary American English)
(2) These two teams are quite similar in some ways. (ibid.)
(3) I think it’s fairly easy for anyone to get anything they want (…). (ibid.)
(4) He seemed in pretty good form. (ibid.)
In the above sentences, rather, quite, pretty, and fairly index the properties of the adjectives they modify as ‘not fully X’. They function as ‘word modifiers’, not as ‘phrasal
modifiers’ (Stoffel 1901). They are constituents of adjective phrases and they scale
inherent qualities or properties of the heads.2
This paper addresses three challenges. The first challenge is to determine the conceptual content that rather, quite, pretty, and fairly presumably share. The second challenge is to spot the distinctive construals that these adverbs impose on the conceptual
content of the adjectives they modify. The third challenge is to identify how adjectives
modify the conceptual content of moderators. The working hypothesis is that overlap
in collocation preferences will reveal that rather, quite, fairly, and pretty have similar
conceptual content, whereas subtle differences will indicate that these four degree
modifiers impose different modes of construal.
1. For example, the distributive quantifiers each and every both particularize a referent by referring to its discreteness in a group. But they impose different ways of construing this conceptual content: each profiles a discrete entity within a group, whereas every profiles a group of discrete entities.

2. Moderators, and degree modifiers in general, should not be confused with scalar focus modifiers (even, rarely, barely). While the former are inherently scalar, focus modifiers are not: they merely evoke a scale (Traugott 2008).

In the pages that follow, moderators are contrasted in their collocation preferences in the 415M-word Corpus of Contemporary American English (Davies 1990-present). The proposed method requires extracting all tokens of the <moderator + adjective> construction and implementing two techniques from a family of methods known as collostructional analysis. First, a collexeme analysis (Stefanowitsch & Gries 2003) determines which adjectives are most strongly attracted to each moderator. Overlap in collocation preferences will be an indication that moderators have similar conceptual content. Then, a multiple distinctive collexeme analysis (Gries & Stefanowitsch 2004) determines those adjectives that are distinctively associated with each moderator. If, despite overlap, the moderators attract distinct adjective classes, then (a) moderators form a functionally coherent paradigm, and (b) this paradigm has a complex internal structure.
When one investigates semantic similarities and differences between related
lexemes or constructions, one can choose among many association measures, such
as mutual information (pointwise MI, MI2, MI3), the t-score, the Fisher-Yates exact
test, the binomial test, the χ2 test, etc. Not all of them are equally satisfactory, though (Evert 2004; Wiechmann 2008). For example, pointwise mutual information (Church & Hanks 1990) tends to overestimate rare words and is sensitive to data sparseness (Manning and Schütze 1999: 180–181; Kilgarriff 2001). The χ2 test presupposes that the linguistic phenomenon under investigation is distributed randomly across a corpus, but as Kilgarriff (2005) puts it, “language is never, ever, ever, random”.
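The behaviour of these measures can be sketched on toy 2×2 contingency tables. All counts below are invented for illustration, and the helper names (`pmi`, `fisher_p`) are mine; the sketch contrasts pointwise MI, which ranks a hapax pair highest, with the Fisher-Yates exact test, which rewards better-attested associations:

```python
import math
from scipy.stats import fisher_exact

def pmi(pair_freq, w1_freq, w2_freq, corpus_size):
    """Pointwise mutual information: log2( P(w1,w2) / (P(w1) * P(w2)) )."""
    p_pair = pair_freq / corpus_size
    p_w1 = w1_freq / corpus_size
    p_w2 = w2_freq / corpus_size
    return math.log2(p_pair / (p_w1 * p_w2))

def fisher_p(pair_freq, w1_freq, w2_freq, corpus_size):
    """One-tailed Fisher-Yates exact test on the 2x2 co-occurrence table."""
    table = [[pair_freq, w1_freq - pair_freq],
             [w2_freq - pair_freq,
              corpus_size - w1_freq - w2_freq + pair_freq]]
    return fisher_exact(table, alternative="greater")[1]

N = 1_000_000  # hypothetical corpus size

# A hapax pair: each word occurs once, and they occur together once.
# PMI is maximal here, even though the evidence is a single token.
print(pmi(1, 1, 1, N), fisher_p(1, 1, 1, N))

# A well-attested collocation: 100 joint tokens out of 1,000 each.
# PMI is lower, but the association is far better supported, which the
# Fisher test reflects with a much smaller p-value.
print(pmi(100, 1000, 1000, N), fisher_p(100, 1000, 1000, N))
```

The exact test makes no distributional assumptions and does not inflate the score of single-occurrence pairs, which is why measures of this family are described as “safe” above.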
Collostructional analysis is a good place to start because it is a “safe” option that
is based on association measures that (a) do not overestimate effect for low-frequency
pairs and (b) do not violate distributional assumptions. However, one of the collostructional analysis methods, namely multiple distinctive collexeme analysis, should
be used cautiously. Multiple distinctive collexeme analysis is best used to investigate
collocation preferences within a predefined closed set of functionally similar constructions (i.e. alternations). Therefore, one needs to make sure that the paradigm
under investigation is structured coherently before running such an analysis. As far
as moderators are concerned, this is no easy task because they can scale upwards or
downwards depending on the adjectives they modify (Paradis 1997: 87). Additionally,
past research has shown that context dependency between adverbs and adjectives is
not always decisive (Allerton 1987; Paradis 1997; Athanasiadou 2007). It is often impossible to decide whether quite is a maximizer or a moderator. A second difficulty
is to generalize the output of collostructional analyses. Collexeme and (multiple) distinctive collexeme analyses output large tables containing as many rows as there are
collexemes for each variable. The linguist uses these tables to find out whether collexemes fall into distinct semantic classes, but such classes are generally scattered across
many rows, and are invisible to the naked eye. Finally, in order to fully determine the
semantic profile of the <moderator + adjective> construction, one needs to capture
not only the semantic relations between each moderator and its distinctive adjectives
(which collostructional analysis does), but also the semantic relations between moderators and between adjectives. A multifactorial analysis can handle all three types of
semantic relations.
To solve each of these issues, the output of collexeme and multiple distinctive collexeme analyses serves as input for two multifactorial methods: hierarchical cluster analysis
(Divjak & Fieller, this volume 405–442) and correspondence analysis (Benzécri 1973;
Benzécri 1984; Greenacre 2007). Both are unsupervised, exploratory clustering techniques. They help find structure in multivariate data by grouping observations.
In both cases, the linguist makes no assumption as to what groupings should be there.
Hierarchical cluster analysis uses distance matrices to cluster data in a tree-like
format – more specifically a dendrogram. This method is used to determine the preferred collexemes of 23 degree modifiers in English and to cluster the results to see
if the classes that one obtains are consistent with the traditional functional classes
that degree modifiers fall into: maximizers, boosters, moderators, approximators and
diminishers (Allerton 1987; Paradis 1997; Quirk et al. 1985). If moderators form a coherent functional category, it is relevant to perform a multiple distinctive collexeme analysis. If the category does prove coherent, that coherence will stem precisely from the kind of collocates that moderators attract.
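As a sketch of this clustering step, the dendrogram construction might look as follows in Python with SciPy. The association scores are invented stand-ins for real collexeme output, and only six modifiers (not all 23) are shown:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Invented association-score profiles (rows = degree modifiers,
# columns = collexeme strengths for four adjectives). Real input
# would be the output of a collexeme analysis over all 23 modifiers.
modifiers = ["rather", "quite", "fairly", "pretty", "very", "extremely"]
profiles = np.array([
    [3.1, 2.8, 0.4, 0.2],   # rather
    [2.9, 3.0, 0.6, 0.3],   # quite
    [2.7, 2.5, 0.5, 0.4],   # fairly
    [2.8, 2.6, 0.7, 0.5],   # pretty
    [0.3, 0.4, 3.2, 3.0],   # very      (booster-like profile)
    [0.2, 0.3, 2.9, 3.1],   # extremely (booster-like profile)
])

# Ward amalgamation on Euclidean distances; cut the tree into two classes.
Z = linkage(profiles, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(modifiers, labels)))
# dendrogram(Z, labels=modifiers) would draw the tree itself.
```

With these toy scores, the four moderators fall into one cluster and the two booster-like adverbs into another, which is the kind of match against the traditional functional classes that the real analysis tests.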
To account for the semantic correspondences between moderators and adjectives, I focus on the distinctive lexical preferences of the four moderators rather, quite,
fairly, and pretty. More specifically, I submit the output of multiple distinctive collexeme analysis to correspondence analysis, a multifactorial approach that provides
a low-dimensional map of the data by calculating distances between the rows and between the columns of a contingency table. The larger the distance is between two rows or
columns, the further apart the row or column coordinates will be on the map. Correspondence analysis has two advantages. First, it captures semantic relations between
(a) moderators, (b) adjectives, and (c) moderators and adjectives. Second, it offers an
elegant and efficient way of visualizing distances between lexical variables and collexemes, even when these are computed from large tables.
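The core computation behind correspondence analysis can be sketched as a singular value decomposition of the standardized residuals of a contingency table. The counts below are invented; real input would be the multiple distinctive collexeme table:

```python
import numpy as np

# Invented contingency table: rows = moderators, columns = adjectives.
counts = np.array([
    [120,  30,  10,   5],   # rather
    [ 40, 100,  20,  15],   # quite
    [ 10,  25,  90,  30],   # fairly
    [  5,  15,  35, 110],   # pretty
], dtype=float)

P = counts / counts.sum()            # correspondence matrix
r = P.sum(axis=1)                    # row masses
c = P.sum(axis=0)                    # column masses
# Standardized residuals: (P - r c^T) / sqrt(r c^T)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of rows and columns; the first two dimensions
# give the low-dimensional map on which distances are read off.
row_coords = (U * sv) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
# Total inertia is the sum of squared singular values (= chi-squared / N).
inertia = (sv ** 2).sum()
print(row_coords[:, :2])
print(inertia)
```

Plotting `row_coords` and `col_coords` on the same axes yields the joint map of moderators and adjectives described above.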
The paper is organized as follows. Section 2 reviews previous qualitative and
quantitative research on degree modifiers of adjectives. Section 3 presents the corpus
and the methods used. Section 4 summarizes the results. In Section 5, I discuss the
results and explain why an original combination of univariate and multivariate statistical methods can only improve our understanding of near-synonymy.
2. Previous research
2.1 Typologies of degree modifiers
Degree modifiers – also known as intensifiers – are a subclass of degree words
(Bolinger 1972). They give specifications of degree concerning the adjectives they
modify. Adverbs such as very, extremely, and absolutely scale adjectival properties “upwards”, whereas other adverbs, such as slightly, a little, and somewhat, scale adjectival properties “downwards”. Rather, quite, fairly, and pretty set the qualities that gradable
adjectives denote to a moderate level. Along with moderately and relatively, these degree modifiers are known as ‘moderators’ (Paradis 1997).
Like most degree modifiers, rather, quite, fairly, and pretty are typologically unstable because they do not always neatly fit in the functional categories that linguists
have assigned them. For example, quite is likely to be interpreted as a maximizer
when it modifies an extreme/absolutive adjective (this novel is quite excellent) or a
telic/limit/liminal adjective (quite sufficient), but it is likely to be a moderator when
it modifies a scalar adjective (quite big) (Paradis 1997: 87). Past research has shown
that context dependency between adverbs and adjectives is not always decisive. It is
often impossible to decide whether quite is a maximizer or a moderator. For example,
quite is ambiguous when it modifies the adjective different (Allerton 1987: 25). Recent claims that quite has undergone grammaticalization have led linguists to think
that this adverb is ambiguous with absolutive and scalar adjectives alike (the village is
quite beautiful, the play is quite good). Similarly, rather, pretty and fairly can scale upwards or downwards due to increasing subjectification (Nevalainen & Rissanen 2002;
Athanasiadou 2007).
Given that context dependency is the only way to determine whether a given degree modifier such as quite is a maximizer or a moderator, conducting a quantitative
study of collocational preferences between adverbs and adjectives across corpora is a
methodologically sound approach. It is also a popular approach, as evidenced by the
number of corpus-based studies of degree modifiers from the early 1990s onwards
(Altenberg 1991; Lorenz 2002; Kennedy 2003; Simon-Vandenbergen 2008). Most of
these quantitative approaches are intuitively attractive, but they are problematic for
two reasons. First, some of them do not rely on corpora that are sufficient in size
and thus fail to provide a detailed picture of the collocational preferences of degree
modifiers. By way of illustration, the London-Lund Corpus, which Altenberg (1991) and Paradis (1997) exploit, contains 0.5M words. Some linguists could object that it
is presumably too small to reveal relatively rare, but not necessarily statistically insignificant, patterns of attraction between intensifiers and adjectives. Second, linguists
tend to use measures that are inadequate to reveal two-way interactions between collocants. Raw frequencies and coarse-grained relative frequencies such as percentages
and counts per n-thousand words (Altenberg 1991; Paradis 1997) underestimate significant collocations because they do not filter away lexemes that are highly frequent
regardless of the specific contexts where they occur. Likewise, the choice of pointwise
mutual information (Kennedy 2003) is not particularly helpful for the reasons outlined in Section 1.3
3. This can be confirmed easily on COCA, whose search interface allows end users to compute MI scores: (corpus.byu.edu/mutualinformation.asp).
In the wake of Stoffel (1901), typologies of degree modifiers have proliferated
(Quirk et al. 1985; Allerton 1987; Paradis 1997), a sign that these adverbs do not all
lend themselves easily to a classificatory exercise. One recurring issue is the lack of
a fully satisfactory organizing principle. Quirk et al. (1985) posit a coarse-grained
distinction between ‘amplifiers’, which “scale upward from an assumed norm”, and
‘downtoners’, which “scale downwards”.4 Amplifiers further subdivide into ‘maximizers’ and ‘boosters’, and downtoners into ‘approximators’, ‘compromisers’, ‘diminishers’,
and ‘minimizers’. However, Quirk et al.’s classification is problematic in at least two
respects. Firstly, its internal structure is unjustified and inconsistent. In this regard,
Allerton (1987: 18–19) and Paradis (1997: 24) observe that the distribution of degree
modifiers across the aforementioned categories is inaccurate. Secondly, Quirk et al.
disregard the fact that the semantic influence between an intensifier and the head it
modifies is bidirectional.
Allerton observes that some approximators do not occur with all adjectives (e.g.
virtually unique vs. ?virtually large). He proposes a four-entry classification of degree
modifiers of adjectives (scalar modifiers, telic modifiers, absolutive modifiers, differential modifiers). This classification takes into account the collocational restrictions
of the <degree modifier + adjective> sequence (1987: 19–21). However, Allerton gives
no corpus evidence to support his claims and, although convincing, his model is not
empirically grounded.
Paradis’s taxonomy is finer-grained. Like Quirk et al., Paradis (1997: 27–28) proposes a general bipartition between modifiers that scale upwards (‘reinforcers’) and
modifiers that scale downwards (‘attenuators’). She postulates the existence of a cline
between these two poles. She unifies each set on the cline by means of the principle of cognitive synonymy (Cruse 1986). She further subdivides ‘reinforcers’ and ‘attenuators’ into ‘totality modifiers’ and ‘scalar modifiers’. Unlike Quirk et al., but like
Allerton, Paradis is well aware that some degree modifiers are polysemous, semantically fuzzy, and therefore paradigmatically unstable. Unlike Allerton’s, Paradis’s
typology is empirically founded. She devotes a chapter of her monograph to the distribution of degree modifiers of adjectives across the London-Lund Corpus of spoken English (1997: Chapter 2). Her goal is both to examine the distribution of degree
modifiers of adjectives across modes (spoken vs. written) and to inspect the detail of
collocational preferences in the hope of establishing a subtle classification. She concludes that intonation is the key to understanding the meaning and the grading force
of degree modifiers (1997: Chapter 4).
4. Strictly speaking, Quirk et al.’s two-branch model is restricted to degree modifiers used as subjuncts, but as Paradis (1997: 23–24) observes, this model applies equally well to degree modifiers of adjectives.

Previous corpus-based approaches to degree modifiers make extensive use of raw counts, or relative frequencies that do not filter away overrepresented adjectives. Other approaches may also suffer from the kind of corpus that they use, the lack of statistical
methods, or the selection of inadequate statistics. If one uses a corpus that is relatively
small, statistical methods that are invalid for low counts, such as (pointwise) mutual
information or χ2, are inappropriate, and one ends up being trapped in a vicious circle. This is a problem if one wants to investigate collocations, some of which might be
relatively rare but by no means insignificant, both statistically and semantically.
2.2 Corpus-based Cognitive Linguistics
Cognitive Linguistics is a usage-based approach to language that makes no principled distinction between language use and language structure. A linguistic unit is
entrenched and stored in grammar when a usage pattern generalizes across recurring instances of language use. The more frequently speakers encounter a linguistic
unit, the more the linguistic unit is entrenched, i.e. established as a cognitive routine.
In Cognitive Linguistics, entrenched patterns of usage provide privileged access to
speakers’ knowledge of their language.
The meaning of a linguistic unit “involves both conceptual content and the construal of that content” (Langacker 2008: 44). Conceptual content is referred to as a
domain: a consistent knowledge representation that serves as a basis for the construal
of other conceptual units. For instance, adjectives such as damp, dampish, moist or
dry relate to conceptual units to be construed with respect to the domain of wetness.
The content that the domain designates can further be construed with respect to the
more basic domain of property. Construal is the way a speaker presents a conceptual
representation through the choice of a linguistic expression. While lexemes and constructions bring with them a conventional meaning, this coded meaning is modified
by the context in which linguistic units occur. Far from being a fixed, pre-assigned
feature, meaning is negotiated locally and socially. By adopting a specific focal adjustment, the speaker linguistically organizes a scene and influences the way the conceptual representation that the expression evokes will be received by the hearer.
This is consistent with the theoretical framework of Cognitive Construction
Grammar (CCxG), as laid out in Langacker (1987), Goldberg (1995), and further
refined in Goldberg (2003, 2006, 2009) and Langacker (2008, 2009). One of the tenets
of CCxG concerns the link between constructions and construal.5 Most Construction Grammar supporters recognize that a given construction is both a product and a
vector of conceptualization (see Note 1). Goldberg (2003: 219) notes that a linguistic
pattern counts as a construction “as long as some aspect of its form or function is not
strictly predictable from its component parts or from other constructions recognized
to exist”. Productive or semi-productive constructions such as day after day, twistin’ the night away, or boy, was she in trouble! are idiosyncratic and non-compositional. They must be learned on the basis of the input. Likewise, it may well be the case that entrenched types of the <moderator + adjective> construction provide access to conceptual structures in ways that are not necessarily compositional.

5. The fact that the noun construction has two corresponding verbs, construct and construe, is not innocent.
Competence-based versions of Construction Grammar might deny the <moderator + adjective> pattern constructional status, or at least warn that we should not treat all collocation patterns de facto as constructions. In Fillmorean Construction Grammar, for example, only collocation patterns that are truly productive count as constructions, other kinds of collocation patterns being consigned to the ‘meta-grammar’ together with other ‘patterns of coining’ (Kay 2013). In contrast, usage-based
models of Construction Grammar such as CCxG have a broader understanding
of the notion of a construction. In a neo-Saussurean fashion, they consider that all
form-meaning pairings count as constructions: “it’s constructions all the way down”
(Goldberg 2003). To put it plainly, according to competence-based models of Construction Grammar, the linguist has no a priori reason to think that a unit counts as
a construction, whereas according to usage-based models, linguists have no a priori
reason to believe that a unit is not a construction. In this paper we adopt a usage-based
perspective because it leaves it to empirical analysis to decide whether a unit counts
as a construction or not. Our choice is reinforced by the fact that the quantitative
apparatus designed for constructions can also handle regular collocation patterns (i.e.
patterns that competence-based models consider as not necessarily constructional).
The entrenchment of some instances of the <moderator + adjective> construction explains why moderators scale in different ways depending on the adjective they
modify. For example, the scaling force of pretty depends on the adjective it modifies.
As Athanasiadou points out, “pretty small is not very small, but pretty straight is very/quite straight” (Athanasiadou 2007: 557). More recently, Goldberg (2006: 45) allowed
that “individual patterns that are fully compositional are recorded alongside more
traditional linguistic generalizations”. This means that all types of the <moderator +
adjective> construction, i.e. entrenched as well as non-entrenched types, are worth
investigating because each is the trace and the vector of construal mechanisms in the
domain of degree modification.
Since corpus-based linguistics provides a comprehensive array of methods to
capture context and knowledge, it is not surprising that it has become central in
the investigation of cognitive patterns of usage in Cognitive Linguistics (Gries &
Stefanowitsch 2006). Paradis (1997: 62) claims that the semantic relation between degree
degree modifier, which in turn restricts the interpretation of the adjective. This means
two things. First, we should not examine probabilistic co-occurrences of words regardless of their morphosyntactic environment. Instead, we should inspect co-occurrences of moderators and adjectives within constructional patterns because both
moderators and adjectives contribute to the meaning of the <moderator + adjective>
construction. Second, the semantic interaction between a moderator and an adjective
is not necessarily compositional.
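As a toy illustration of this constructional extraction step, a pattern match over POS-tagged text might look as follows. The tag labels and sentences are invented; COCA itself is queried through its own search interface, not with raw regular expressions over the texts:

```python
import re
from collections import Counter

# Invented POS-tagged input in word_TAG format (not real COCA output).
tagged = ("I_PP heard_VVD a_AT rather_RG odd_JJ conversation_NN ._. "
          "These_DD2 teams_NN2 are_VBR quite_RG similar_JJ ._. "
          "It_PP is_VBZ fairly_RG easy_JJ ._. "
          "He_PP seemed_VVD in_II pretty_RG good_JJ form_NN ._.")

# Match a moderator tagged as a degree adverb (RG, an assumed label)
# immediately followed by an adjective (JJ), i.e. the word-modifier
# context rather than phrasal or nominal uses.
pattern = re.compile(r"\b(rather|quite|fairly|pretty)_RG (\w+)_JJ")
pairs = Counter(pattern.findall(tagged))
print(pairs)
```

The resulting <moderator, adjective> frequency table is exactly the kind of input that the collostructional analyses described in Section 3 take over from there.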
2.3 Cognitive synonyms or near-synonyms?
According to Paradis (1994: 160; 1997: 71), moderators are ‘cognitive synonyms’ because substituting one for the other does not change the truth-value of the proposition (Cruse 1986: 270ff.).6 Cognitive synonyms share identical conceptual content
and differ only in style (die vs. pass away), register (drunk vs. pissed), and connotation
(firm vs. stubborn). In other words, cognitive synonymy is a matter of both sameness
and difference.
Near-synonyms are sometimes referred to as ‘plesionyms’ (Hirst 1995; Edmonds
& Hirst 2002; Storjohann 2009). Plesionyms differ from cognitive synonyms because
the former involve a slight change in denotational reference, be it in terms of degree
(drunk vs. hammered), fuzzy boundary (forest vs. woods), viewpoint (slim vs. skinny),
intensity (break vs. destroy), etc. Substituting near-synonyms alters truth conditions
but the sentences where they appear remain semantically similar. Cognitive synonymy and near-synonymy are often hard to tell apart, and an opposition between these
two concepts is somewhat unhelpful. According to Edmonds and Hirst, near-synonyms bring with them a finer-grained representation than cognitive synonyms. Their
conclusion is that “it should not be necessary to make a formal distinction between
cognitive synonyms and plesionyms” (Edmonds & Hirst 2002).
The conceptual content that moderators share is essentially functional: rather,
quite, fairly, and pretty moderate the qualities denoted by the adjectives they modify.
But moderators are not completely interchangeable in all contexts. Two reasons have
been put forth. First, moderators do not express the same degree of moderation. Second, they involve distinct modes of construal (Paradis 2000, 2008). I will put forth another reason: moderators do not always operate within the same conceptual domains.
3. Method
I combine two broad types of statistics: analytical statistics and multifactorial methods. Analytical statistics uses measures such as frequencies in the domain of hypothesis testing. Rather than formulate and test hypotheses, multifactorial methods aim to
formulate a statistical model, i.e. the statistical description of variation and complex
relations in a multi-variable dataset.
6. Cruse’s definition of cognitive synonymy echoes Quine’s (1951).
154 Guillaume Desagulier
To describe and interpret the distribution of the four moderators, we shall start
with a study of collocation patterns. The information concerning the distribution of
moderators will serve as input for two methods of multifactorial analysis. The paragraphs that follow will justify the choice of the Corpus of Contemporary American
English, explain how the data was extracted, and present the measures of lexico-grammatical co-occurrence that are used, namely collexeme analysis, and multiple distinctive collexeme analysis. Finally, two multifactorial analyses will be introduced:
hierarchical cluster analysis and correspondence analysis. Their input will be provided by the results of the collexeme and multiple distinctive collexeme analyses.
3.1 The corpus
At the time of writing, the Corpus of Contemporary American English (Davies,
1990-present), henceforth COCA, consists of 414,771,808 words of spoken and written American English divided among 169,140 texts. The spoken part contains approximately 85 million words and consists of transcripts of conversation from TV
and radio programs. The written part is divided evenly between four genres: fiction
(80 million words), popular magazines (86 million words), newspapers (82 million
words), and academic journals (82 million words). The whole corpus spreads across
the period from 1990 to 2010, and 20 million words are added each year. One advantage of COCA is that it is probably the largest publicly and freely available annotated
corpus of English. Its size and sampling scheme increase the reliability and validity
of observations of relatively rare linguistic phenomena. However, one major disadvantage with COCA is that the texts themselves are not available for download for
copyright reasons. The only way to run queries is via the native search interface.7 Nevertheless, it is simple to copy and export query results into a text editor or a spreadsheet, clean up the data, cross-tabulate, and run statistical tests.
Davies claims that COCA is balanced. To some extent this is true because words
are evenly distributed across genres. Yet, as is the case with all major corpora, balance
is more an ideal than a reality. COCA also suffers from the fact that speaker-related
metadata are not computed relative to the overall size of the corpus, making them
very difficult to measure statistically. For example, COCA indicates the identity of the
speakers, but gender is left implicit. However, for the current purposes, the fact that
speaker-related information is hard to obtain is of little importance since the main
7. This could be a major issue in light of Leech’s “standards of good practice” for corpus users
and corpus compilers (Leech 1997: 6). At the top of Leech’s list is the following recommendation: “[t]he raw corpus should be recoverable.” Even if the whole texts are unavailable to the
COCA end-users, it is nevertheless possible to “dispense with the annotations, and to revert to
the raw corpus” (ibid.), although partially.
focus of the present study is on function-related distributions and collocation patterns, regardless of dialectal issues.
Last but not least, queries on the COCA yield duplicates. This is largely due to
automatic text collection, a small price to pay for a corpus of this size. I have not
removed duplicates for two reasons. First, duplicates that result from a genuine verbatim quotation are very hard to distinguish from duplicates that result from an error
in automatic text compilation. Second, the number of duplicates does not affect the
statistics in any significant way. All corpora come with restrictions, and it is important
to bear these in mind when interpreting results. All things considered, the advantages
of COCA outweigh its disadvantages.
3.2 From collocates to collexemes
A common assumption in corpus linguistics is that the context of a variable (lexical or phrasal) reveals important aspects of its syntactic and semantic properties
(Sinclair 1991; Biber et al. 1998). The easiest way to analyze the context of a variable is
to extract its collocates and determine those that most typically combine with the variable. According to Sinclair (1991: 170), collocation is “the occurrence of two or more
words within a short space of each other in a text”. Traditionally, the conventional
size of what Sinclair calls ‘a short space’ varies from 1 to 5 words on both sides of the
variable, depending on the case study. Syntactic studies tend to examine wider spans,
while lexical studies (e.g. those that focus on idiomatic compounds such as bread and
butter) examine shorter spans. Some authors also assume that statistical methods can
easily filter out the noise generated by a wide concordance span, but once again the
wider the span, the weaker the syntactic claims the linguist can make.
In the literature on intensifiers, little is said as to the optimal span to investigate.
In his study on amplifiers, Kennedy inspects “a window of two words” on each side
so as to “retrieve collocations that may have been separated by intervening words”
(2003: 472–473). Amplifiers can modify elements that precede or follow. This fact
alone justifies a two-way search. Amplifier collocates can be adjectives, other adverbs,
or verbs, and given their multifunctionality, “intervening words” are frequent. Moderators are slightly different. Insofar as they are ‘word modifiers’ (Stoffel 1901), there is
no need to inspect too wide a range of words, and since these adverbs are premodifiers
it is useless to inspect the left context.8 Bearing these specificities in mind, I extracted
8. In many collocation-based studies, node words or phrases are often clustered on the basis
of their collocates within a relatively large span. One big problem with this method is that
the semantic relation between many of these collocates and the node(s) is loose, which brings
about noise. By adopting a constructional approach, and therefore treating the sequence <moderator + adjective> as a construction, I restrict the semantic investigation to the syntactic frame
of the construction and inevitably minimize the risk of obtaining noise in the data points.
all adjectives that occur in the first two slots to the right of rather, quite, fairly, and
pretty. I also adopted the “paradigmatic reduction” outlined in Lorenz (2002: 144) and
considered only moderators of adjectives so as to focus on subtle semantic relations.
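The extraction step just described can be sketched as follows. This is an illustrative stdlib-Python sketch, not the author's actual pipeline (which queried COCA's web interface and used R); the simplified tagset, in which adjectives are tagged 'ADJ', and the function name are my assumptions, since COCA itself uses a CLAWS-style tagset.

```python
# Collect adjectives occurring in the first two slots to the right of a
# moderator, given POS-tagged text as (word, pos) pairs.
# NOTE: the 'ADJ' tag is a simplification; COCA uses CLAWS-style tags.
MODERATORS = {"rather", "quite", "fairly", "pretty"}

def right_window_adjectives(tagged_tokens, span=2):
    """Return (moderator, adjective) pairs found within `span` words
    to the right of each moderator occurrence."""
    pairs = []
    for i, (word, _pos) in enumerate(tagged_tokens):
        if word.lower() in MODERATORS:
            for w, p in tagged_tokens[i + 1 : i + 1 + span]:
                if p == "ADJ":
                    pairs.append((word.lower(), w.lower()))
                    break  # one adjective head per moderator occurrence
    return pairs
```

The two-word window allows for an intervening word, as in "rather more unusual", while the left context is never inspected, in line with the premodifier argument above.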
Collocation extraction is a preliminary step to determining sequences of moderators and adjectives that co-occur more frequently than would be expected by chance.
Determining collocation strength is a particularly helpful way of spotting differences
between semantically similar expressions. Even though we assume that rather, quite,
pretty and fairly are near-synonyms, and therefore map onto similar content domains
in a quality-related conceptual space, we may expect significantly distinct collocation
patterns for each moderator. A related assumption is that distinct collocation patterns
reflect subtle differences in schematic-domain profiling.
As mentioned earlier, the semantic relation between degree modifiers and adjectives has been described as bidirectional. While this is a convincing assumption, a
simple collocate-based inquiry based on raw counts and/or basic relative frequencies
is unsatisfactory because it fails to distinguish those adjectives that are significantly
attracted to a degree modifier from those that are frequent in the corpus regardless of
the context where they appear.9
In this study of the <moderator + adjective> construction, we shall turn to collexeme analysis (Stefanowitsch & Gries 2003), a method that can both handle two-way
semantic influences and filter out adjectives that have a high overall token frequency in the corpus. Collexeme analysis is part of a family of methods known as collostructional analysis (Hilpert, this volume, 391–404).10 It investigates which lexemes
typically occur in a given slot in a single grammatical construction (e.g. the X-er, the
better). It takes as input the frequency of a lexeme in a given construction, the frequency of the same lexeme in all other constructions, the frequency of the construction
with other lexemes, and the frequency of all other constructions with other lexemes
(Stefanowitsch & Gries 2003: 218). The data is tabulated and the 2x2 table is submitted
to the Fisher-Yates Exact test (Pedersen 1996). Unlike χ2-statistics, the Fisher-Yates
Exact test does not presuppose or violate any distributional assumptions (Kilgarriff
2005). Unlike χ2-statistics and Mutual Information, the Fisher-Yates Exact test is not
invalid when counts are low. This means that rare collocations in COCA (such as quite
medical or pretty social) can nevertheless be included in the overall calculation of collostruction strength. The p-value that one obtains through the Fisher-Yates Exact test
is an indication of the association strength between the lexeme and the construction:
9. Raw counts provide the simplest association measure for pair types. When the construction
under investigation displays low token frequency, raw counts are often the only option. However, they should be regarded as a last resort.
10. Each of these methods measures the degree of attraction or repulsion between lexical items
and constructions. In collostructional analysis, a construction is defined as a conventional pairing of form and meaning, in light of Goldberg (1995).
the smaller the p-value, the stronger the association between a lexeme and a construction. In recent versions of collexeme analysis, the p-value is log-transformed (as −log10 of the p-value) so that distinctions between very small p-values are easier to identify. In the present study, this operation will be repeated for all adjective
types that co-occur with each moderator. Collexeme analysis will first help determine which adjectives are most strongly attracted to the construction <moderator +
adjective> by quantifying the bidirectional attraction between a moderator and the
adjective it modifies. Second, it will help confirm that rather, quite, pretty, and fairly
are functional synonyms by revealing significant overlap in the selection of adjectives.
The more significant the overlap, the more these four moderators map onto similar
content domains, and the stronger the claim that moderators are cognitive synonyms.
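To make the computation concrete, here is a minimal stdlib-Python sketch of the procedure: a one-tailed Fisher-Yates exact test on the 2x2 table, with the p-value returned as −log10(p). It is an illustration only; the published analyses were run with Gries's Coll.analysis script for R, and the cell counts in the usage note below are invented.

```python
import math

def log_comb(n, k):
    # log of the binomial coefficient C(n, k), via lgamma to avoid huge integers
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def collostruction_strength(a, b, c, d):
    """One-tailed Fisher-Yates exact test for attraction on a 2x2 table.
    a: lexeme A in construction C; b: A in all other constructions;
    c: C with other lexemes; d: everything else.
    Returns -log10(p), the usual collostruction-strength transform."""
    n = a + b + c + d
    row1, col1 = a + b, a + c          # totals for lexeme A and construction C
    denom = log_comb(n, col1)
    # one-tailed p = P(X >= a) under the hypergeometric distribution
    p = sum(math.exp(log_comb(row1, k) + log_comb(n - row1, col1 - k) - denom)
            for k in range(a, min(row1, col1) + 1))
    return -math.log10(min(p, 1.0))
```

For instance, with invented counts a = 8, b = 2, c = 2, d = 8 the function returns roughly 1.94, i.e. p < 0.05; with a perfectly independent table such as a = b = c = d = 5 it stays well below the 1.3 significance threshold mentioned in note 13.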
Cognitive synonyms also display differences. To make these differences apparent, one can look for the distinctive collexemes of each construction. This is where a
second method of collostructional analysis is needed: distinctive collexeme analysis
(Gries & Stefanowitsch 2004). Insofar as we want to compare four near-synonymous
constructions, <rather + adjective>, <quite + adjective>, <pretty + adjective>, and
<fairly + adjective>, and contrast them in their respective collexeme preferences,
we should implement multiple distinctive collexeme analysis (Hilpert 2006; Gilquin
2007). This method differs from simple collexeme analysis because it filters away
overlapping collocates and retains only those adjectives that are specific to each moderator. The output makes it possible to classify the distinctive adjectives according
to their function and meaning, and to get a better grasp of the individual functional
specificities of rather, quite, pretty, and fairly.
3.3 Collexemes as input for multivariate statistics
Moderator collexemes can serve as input for two usage-based techniques whose aim is
to capture semantic relations between near-synonyms on the basis of multiple factors:
hierarchical cluster analysis and correspondence analysis. Both methods are exploratory. Instead of testing a hypothesis in relation to pre-assigned categorical clusters,
multifactorial analyses follow a bottom-up approach. Indeed, clusters are determined
by the similarity of the members of the same groupings and their dissimilarity to the
members of other groupings. In theory, these methods dispense with the linguist’s
preconception of how the data is categorized. But in practice the clustering results in
large part from the criteria that the linguist adopts to combine points into clusters.
Recent research on lexical near-synonymy in Cognitive Linguistics has made
extensive use of cluster analysis (Divjak 2006, 2010; Divjak & Fieller this volume,
405–441; Divjak & Gries 2008). Hierarchical cluster analysis is the generic name of a
family of statistical techniques for clustering data (i.e. for structuring observed data
into groups) and representing them graphically in a tree-like format. Among these
clustering techniques, I use hierarchical agglomerative clustering, whereby individual
data points are successively agglomerated into similar clusters, and similar clusters
are merged iteratively into bigger clusters until one last cluster is obtained.11 Results
appear in the form of a dendrogram, which facilitates an objective and accurate identification of semantic classes and subclasses for a given lexical set. Hierarchical agglomerative cluster analysis will be used to justify the existence of the paradigm of
moderators within the broader paradigm of degree modifiers of adjectives. The collostruction strength of the collexemes of 23 English degree modifiers of adjectives will
serve as input to compute the distance matrix.
Correspondence analysis is another distance-based clustering technique that represents the structure of cross-tabulations graphically on a multi-dimensional plane
(Benzécri 1973; Benzécri 1984; Greenacre 2007; Glynn, this volume, 443–486). In
lexical semantics, correspondence analysis offers a convenient way of mapping correlations between lexical items graphically. Like hierarchical cluster analysis, correspondence analysis has become a popular method in Cognitive Semantics (Glynn
2010a). Interpreting a correspondence map is relatively simple: the closer the data
points on the map, the stronger the correlation between these data points. However,
interpreting a map can also be tricky, since it flattens multi-dimensional distances
onto a two-dimensional plane. Nevertheless, correspondence analysis yields promising results in areas such as exemplar-based semantics or cognitive sociolinguistics
(Glynn 2010b). Cognitive linguists like the fact that a correspondence map may provide access to the complex conceptual maps that structure language knowledge and
language use. Correspondence analysis is used to show how moderators imply a specific construal depending on the adjectives they correlate with. Input is provided by
the cross-tabulation of the frequencies of 25 distinctive collexemes for each moderator. The list of 25 distinctive collexemes will be obtained via the multiple distinctive
collexeme analysis outlined above.
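The quantity that correspondence analysis decomposes over its dimensions is the total inertia of the cross-tabulation, i.e. the chi-squared statistic divided by the grand total. A stdlib-Python sketch of that quantity follows (illustrative only; the correspondence maps in studies of this kind are typically produced with dedicated CA software in R):

```python
def total_inertia(table):
    """Total inertia of a contingency table (chi-squared / grand total),
    the quantity that correspondence analysis decomposes across dimensions.
    `table` is a list of rows of non-negative counts."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    # chi-squared: sum over cells of (observed - expected)^2 / expected
    chi2 = sum((table[i][j] - row_sums[i] * col_sums[j] / n) ** 2
               / (row_sums[i] * col_sums[j] / n)
               for i in range(len(table)) for j in range(len(col_sums)))
    return chi2 / n
```

A perfectly independent table has zero inertia (nothing to plot); the more the moderators differ in their collexeme profiles, the larger the inertia spread across the dimensions of the map.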
This paper attempts to show that near-synonymy is a complex phenomenon
whose study can only benefit from a combination of different statistical methods.
These methods range from frequencies and collocations – including (multiple distinctive) collexeme analysis – to exploratory multifactorial techniques – namely cluster analysis and correspondence analysis. If cognitive synonymy is indeed a matter of
sameness and difference, only a combination of statistical techniques can provide a
sound-enough basis for a complete picture of semantic relations to emerge regarding
moderators.
11. Hierarchical divisive clustering proceeds in reverse order: it starts at the root and successively splits the clusters.
4. Results
4.1 Collexeme analysis
Because rather, quite, pretty, and fairly are alternate ways of expressing moderation,
we should expect them to display similarities and differences. Each feature will be
illustrated in turn, starting with similarities, which will be captured by means of a collexeme analysis. Collexeme analysis generates a ranked list of attracted lexemes (i.e.
‘collexemes’) and rejected lexemes. This list can be used to determine what meanings
are congruent with the semantics of the construction and what meanings are incongruent. For our current purpose, we shall focus on congruent meanings.
Table 1. Top 10 adjectival collocates of rather, quite, fairly, and pretty in COCA

rather
adjective      frequency in corpus    frequency in construction
large                  119,992        260
different               17,021        231
small                  165,348        189
difficult                6,672        129
unusual                 17,736        116
simple                  48,134         96
limited                  3,533         95
good                   378,826         89
high                   191,591         85
strange                 23,457         80

quite
adjective      frequency in corpus    frequency in construction
different               17,021        2,247
sure                   137,372        1,347
clear                   81,553          805
good                   378,826          578
right                    4,558          548
possible                88,919          458
similar                 60,967          328
ready                   55,338          300
common                  63,239          278
simple                  48,134          271

fairly
adjective      frequency in corpus    frequency in construction
good                   378,826        346
easy                    59,914        337
large                  119,992        281
common                  63,239        278
high                   191,591        247
simple                  48,134        243
new                     64,824        202
certain                 71,149        201
small                  165,348        190
typical                 20,483        154

pretty
adjective      frequency in corpus    frequency in construction
good                   378,826        7,492
sure                   137,372        1,241
bad                     90,297          743
clear                   81,553          729
big                    187,641          583
tough                   33,746          486
cool                     3,235          481
close                   94,845          471
hard                   123,894          436
strong                  69,137          383
Let us postulate that each moderator of adjectives forms a construction that attracts certain adjectival collexemes and rejects others. To calculate the collostruction
strength of a given adjective A for a given <moderator + adjective> construction C,
collexeme analysis needs four frequencies: the raw frequency of A in C, the raw frequency of A in all other constructions (i.e. ¬C), the frequency of C with adjectives
other than A (i.e. ¬A), and the frequency of ¬C with that of ¬A (Stefanowitsch &
Gries 2003: 218, Hilpert this volume). To generate a ranked list of attracted adjectives based on collostruction strength, this operation is repeated for each adjective
that co-occurs with each moderator in the corpus. Table 1, above, summarizes the
frequencies of the 10 most frequent adjectives in the four <moderator + adjective>
constructions in COCA, i.e. <rather + adjective>, <quite + adjective>, <fairly + adjective> and <pretty + adjective>.
Each of the four subcomponents of the table is submitted to a collexeme analysis, along with the following information:
corpus size: 415M words
frequency of <rather + adj.>: 12,574
frequency of <quite + adj.>: 29,735
frequency of <fairly + adj.>: 10,834
frequency of <pretty + adj.>: 35,949
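To make the input concrete, the four cell counts needed for a single pair, e.g. <quite + different>, can be assembled from Table 1 and the figures above. This is illustrative arithmetic only; the variable names are mine, and the corpus size is the 414,771,808 words reported in Section 3.1.

```python
# Building the 2x2 input table for the pair <quite + different>,
# from the frequencies reported in the text.
corpus_size = 414_771_808        # total words in COCA (Section 3.1)
freq_quite_adj = 29_735          # all <quite + adjective> tokens
freq_different = 17_021          # 'different' tokens in the corpus (Table 1)
freq_quite_different = 2_247     # 'different' inside <quite + adjective>

a = freq_quite_different                     # A in C
b = freq_different - freq_quite_different    # A in all other constructions
c = freq_quite_adj - freq_quite_different    # C with adjectives other than A
d = corpus_size - a - b - c                  # neither A nor C

print(a, b, c, d)  # 2247 14774 27488 414727299
```

These four counts are exactly the quantities fed to the Fisher-Yates exact test described in Section 3.2.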
Table 2 presents the ten most strongly attracted adjectives of the four moderators
based on collostruction strength.12
Table 2. Top 10 collexemes of rather, quite, fairly, and pretty

rather
adjective          coll. strength
large                     1025.19
different                  709.26
unusual                    705.49
small                      521.34
difficult                  480.6
odd                        436.5
remarkable                 415.88
limited                    413.37
vague                      400.68
strange                    384.69

quite
adjective          coll. strength
different                15082.43
sure                      8231.32
clear                     4923.96
possible                  2216.18
similar                   1614.27
good                      1482.02
ready                     1480.81
simple                    1357.46
remarkable                1297.76
common                    1259.48

fairly
adjective          coll. strength
easy                      2686.55
common                    2078.88
simple                    1883.49
large                     1751.18
good                      1523.3
straightforward           1433.43
certain                   1326.03
typical                   1315.09
high                      1250.31
consistent                1112.44

pretty
adjective          coll. strength
good                     61820.34
sure                      8148.39
clear                     4764.46
bad                       4734.36
tough                     3635.13
cool                      3628.34
amazing                   2848.64
big                       2604.22
close                     2531.19
strong                    2139.59

12. It was performed thanks to the script Coll.analysis 3.2 for R (Gries 2007).
One might object that there is little difference between the ranked list based on
raw frequencies (Table 1) and the ranked list based on collostruction strength (Table 2). However, the latter is more reliable since collostruction strength is derived from an association measure – here Dunning’s log-likelihood ratio – which uses absolute and collocate frequencies to determine which lexemes co-occur more frequently than expected in a given construction. All the above collexemes are attracted to each construction at the very significant level of p < 0.001, since coll.strength > 3 (see note 13).
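Since Dunning's log-likelihood ratio is named as the association measure here, a minimal stdlib-Python sketch of G2 over a 2x2 table may help. It is an illustration only, with the same cell layout as above; the chapter's own figures were computed with the R script cited in note 12.

```python
import math

def log_likelihood_ratio(a, b, c, d):
    """Dunning's log-likelihood ratio (G2) for a 2x2 co-occurrence table.
    Cells follow the collexeme layout: a = lexeme in construction,
    b = lexeme elsewhere, c = construction with other lexemes, d = rest."""
    n = a + b + c + d
    def term(obs, row_total, col_total):
        # contribution of one cell: obs * ln(obs / expected)
        expected = row_total * col_total / n
        return obs * math.log(obs / expected) if obs > 0 else 0.0
    return 2 * (term(a, a + b, a + c) + term(b, a + b, b + d)
                + term(c, c + d, a + c) + term(d, c + d, b + d))
```

G2 is zero when observed counts match the expectations under independence and grows with the strength of the association, in either direction.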
Upon inspection, collexeme analysis informs a semantic analysis of moderator
constructions. The collexemes of rather split into sets of adjectives that denote spatial
dimension (large, small, limited), atypicality (unusual, odd, vague, strange, remarkable), difference, and difficulty. The collexemes of quite split into the following classes:
difference/similarity (different, similar), epistemic modality (sure, possible), dynamic
modality (ready), positive value (good, clear), (a)typicality (common vs. remarkable),
and simplicity. As regards fairly, collexemes divide up into adjectives that denote simplicity (easy, simple, straightforward), typicality (common, typical, consistent), spatial
dimension or position (large, high), positive value (good), and epistemic modality
(certain). The collexemes of pretty fall into the following sets: positive/negative values
(good vs. bad, cool, clear, strong), difficulty (tough), nonstandard identification (amazing), spatial dimension or position (big, close), and epistemic modality (sure).
As expected, collexeme analysis reveals that the behavioral profiles of moderators are both similar and different. Similarity is evidenced by the significant degree of
overlap in the selection of collexemes. Many adjectives co-occur with more than one
moderator (certain, clear, common, different, difficult, good, large, remarkable, simple,
sure). Some semantic classes are shared among moderators. For example, both rather
and quite are compatible with the expression of (a)typicality. Qualities that belong to the class of spatial extension can be moderated by means of fairly or rather.
The expression of epistemic modality is a feature common to quite, pretty, and fairly.
This confirms that moderators are synonyms to some extent.
However, at a finer-grained level of analysis, semantic correspondences between
moderators are not that straightforward. Even though the expression of atypicality
is observed with quite, it is in fact a distinctive feature of rather because adjectives
denoting that semantic class are more varied with rather. Even though the expression
of epistemic modality is common to the three moderators, it is actually characteristic of quite. The adjective different co-occurs with both rather and quite. However, a
quick look at the collostruction strength shows that different is far more distinctive
of quite (coll. strength = 15082.43) than it is of rather (coll. strength = 709.26). Good
is attracted to pretty, fairly, and quite, but it is definitely more distinctive of pretty
13. As a rule, if collostruction strength is based on p-values, the following equivalences hold
true: coll.strength > 3 = p < 0.001; coll.strength > 2 = p < 0.01; coll.strength > 1.3 = p < 0.05.
(coll. strength = 61820.34) than it is of fairly (coll. strength = 1523.3), and quite (coll.
strength = 1482.02).14
Collexeme analysis is not the easiest way of spotting differences in collocational
preferences because it does not filter away overlapping adjectives and it requires a
tedious comparison of collostruction strengths to determine relevant thresholds of
attraction. This can be done by means of a multiple distinctive collexeme analysis
(MDCA), which contrasts constructions in their distinctive collocational preferences by getting rid of overlapping collexemes. As explained above, this method is best
suited for comparing related constructions, preferably alternations. That rather, quite, pretty, and fairly are alternative ways of expressing moderation is undeniable.
However, the internal structure of the paradigm that these four adverbs belong to is
problematic because some of these adverbs (e.g. quite) can be used as maximizers. To
make sure the paradigm of moderators is internally coherent and thus to maximize
the interpretability of MDCA, one interesting option is to conduct a hierarchical cluster analysis over a pool of 23 degree modifiers in English and see if the four adverbs
cluster together on the basis of their preferred collexemes.
4.2 Collexeme analysis as input for hierarchical cluster analysis
Hierarchical cluster analysis describes a range of multifactorial methods for investigating structure in data, with the goal of identifying subgroups of similar objects.
Following Gries & Stefanowitsch (2010), I use hierarchical agglomerative clustering
(Everitt et al. 2011: Section 4.2) to see how English degree modifiers cluster on the
basis of their preferred collexemes. The 23 degree modifiers, which include the four
moderators under investigation, are: a bit, a little, absolutely, almost, awfully, completely, entirely, extremely, fairly, frightfully, highly, jolly, most, perfectly, pretty, quite,
rather, slightly, somewhat, terribly, totally, utterly, very. Originally, Paradis selected
them because they epitomize the degree modifier paradigm in most lexicographic
works (1997: 15–17). If rather, quite, pretty, and fairly cluster together, then these four
adverbs form a homogeneous paradigm despite their multifunctional behavior.
For each of the 23 adverb types listed above, I first extracted all adjectival collocates in COCA, amounting to 432 adjective types and 316,159 co-occurrence tokens.
Then, I conducted a collexeme analysis for each of the 23 degree modifiers. To reduce
the data set to manageable proportions, the 35 most attracted adjectives were selected
on the basis of their collostruction strength. For these 23 adverb types and their 432
adjective types, a 23-by-432 co-occurrence table containing the frequency of each adverb-adjective pair type was submitted to a hierarchical agglomerative cluster analysis,
which requires a distance object as input. The distance object is a dissimilarity matrix
14. It is also a collexeme of rather, although to a much lesser extent (coll. strength = 34.34).
Figure 1. Cluster dendrogram of 23 degree modifiers of adjectives in English, clustered according to their adjectival collexemes (distance: Canberra; cluster method: Ward). [The dendrogram itself is not reproducible here: its leaves are the 23 adverbs, each node is annotated with AU and BP support values and a cluster rank, and the y-axis shows Height from 200 to 600.]
that one obtains by converting tabulated frequencies into distances with a user-defined
distance measure. When variables are ratio-scaled, the linguist can choose from several distance measures (Euclidean, City-Block, correlation, Pearson, Canberra, etc.).15
For our purpose, the measure of dissimilarity of the adverb types in the columns was
computed using the Canberra distance metric, because it handles the relatively large
number of empty occurrences best (see Divjak & Gries [2006: 37] for further methodological details). Finally, one needs to apply an amalgamation rule that specifies
how the elements in the distance matrix get assembled into clusters. Here, clusters
were amalgamated using Ward’s method (Ward 1963), which evaluates the distances
between clusters using an analysis of variance. This method has the advantage of generating clusters of moderate size.16 Figure 1 shows the resulting dendrogram.
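The distance and amalgamation steps can be sketched as follows. This stdlib-Python sketch uses average linkage for brevity, whereas the analysis reported here used Ward's method with bootstrap support via the R package pvclust; the function names are mine.

```python
# Sketch of the clustering pipeline: Canberra distances over co-occurrence
# vectors, then naive agglomerative clustering (average linkage for brevity;
# the chapter's analysis uses Ward's method via pvclust in R).
def canberra(u, v):
    """Canberra distance: sum of |u_i - v_i| / (|u_i| + |v_i|), skipping 0/0 pairs."""
    return sum(abs(x - y) / (abs(x) + abs(y)) for x, y in zip(u, v) if x or y)

def average_linkage(c1, c2):
    # mean pairwise distance between the members of two clusters
    return sum(canberra(u, v) for u in c1 for v in c2) / (len(c1) * len(c2))

def cluster(vectors):
    """Agglomerate row vectors until one cluster remains; returns a nested
    tuple of row indices (a text-only stand-in for a dendrogram)."""
    members = {i: [v] for i, v in enumerate(vectors)}
    tree = {i: i for i in members}
    next_id = len(vectors)
    while len(members) > 1:
        i, j = min(((i, j) for i in members for j in members if i < j),
                   key=lambda ij: average_linkage(members[ij[0]], members[ij[1]]))
        members[next_id] = members.pop(i) + members.pop(j)
        tree[next_id] = (tree.pop(i), tree.pop(j))
        next_id += 1
    return tree[next_id - 1]
```

With the 23-by-432 table of adverb-adjective frequencies as `vectors`, each row index would stand for one of the 23 adverb types, and the nesting of the returned tuple would mirror the dendrogram in Figure 1.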
The plot should be read from bottom to top. There are three numbers around each
node. The number below each node specifies the rank of the cluster (here, from 1 to
21, i.e. from the 1st generated cluster to the 21st). The two numbers above each node
15. For reasons of space, I cannot discuss the reasons why one should prefer a distance measure
over another. A description of some distance measures can be found in Gries (2010: 313–316).
16. All computations were performed with R 2.13 (R Development Core Team 2011) and the
package pvclust (version 1.2-2, www.is.titech.ac.jp/~shimo/prog/pvclust/). This package allows the user to include confidence estimates through multiscale bootstrap resampling, a possibility missing in other packages, such as hclust.
indicate two types of p-values,17 which are calculated via two different bootstrapping
algorithms: AU and BP. The number to the left indicates an ‘approximately unbiased’
p-value (AU) and is computed by multiscale bootstrap resampling. The number to
the right indicates a ‘bootstrap probability’ p-value (BP) and is computed by normal
bootstrap resampling. The number to the left is a much better assessment of how
strongly the cluster is supported by the data. In both cases, the closer the number is to
100, the stronger the cluster. AU p-values suggest the clusters we obtain represent the
data accurately. Indeed, the plot shows that the support values of most clusters are high, with AU p-values ranging from 79 to 96. An AU p-value of 96 implies that
the hypothesis that the cluster is invalid is rejected with a significance level of 0.04.
The dendrogram displays several homogeneous clusters:18
a. cluster 19 groups together maximizers; it breaks down into cluster 1 (completely,
totally) and cluster 13 (perfectly, absolutely, entirely, utterly);
b. cluster 9 groups together diminishers (slightly, a little, a bit, somewhat);
c. cluster 12 groups together moderators (rather, pretty, fairly, quite);
d. cluster 18 groups together boosters and breaks down into cluster 16 (most, very,
extremely, highly), cluster 6 (awfully, terribly), and cluster 14 (frightfully, jolly); the
presence of an approximator (almost) within the cluster of boosters (cluster 15) is
surprising but may be due to its intensive use as a sentential adverb rather than as a modifier of adjectives.19
The cluster analysis based on collexemes yields functionally and semantically motivated groups. As Paradis (1997: 27) observed, rather, quite, pretty, and fairly do cluster
together under the moderator paradigm (cluster 12) despite their multifunctionality.
However, the internal structure of this cluster still needs explaining. It is not clear
why fairly and quite cluster together (cluster 2), or why rather is not part of cluster 4, which groups together pretty and cluster 2. Furthermore, the stratification of
cluster 12 does not follow the conventional distribution of these moderators in terms of grading force
(rather > quite > pretty > fairly), as found in Paradis (1997: 148–155). For now, suffice
it to say that moderators form a functionally coherent class. Performing MDCA to
explore internal distinctions is therefore relevant.
17. The term “p-value” is the one adopted by the authors of the pvclust package; strictly speaking, these values are confidence estimates.
18. In the classification of degree modifiers that follows, I adopt Paradis’s terminology.
19. See Paradis (1997: 37) for confirmation.
Visualizing distances in a set of near-synonyms 165
4.3 Multiple distinctive collexeme analysis
We saw above that rather, quite, pretty, and fairly display similarities and differences.
One way to bring out these differences is to conduct a multiple distinctive collexeme
analysis. Instead of computing the degree of attraction between a lexical item and a
construction, distinctive collexeme analysis contrasts constructions in their respective collocational preferences (Gries & Stefanowitsch 2004). This method has proved
useful when it comes to distinguishing minimal semantic and functional differences
between near-synonymous constructions (e.g. the ditransitive vs prepositional dative
alternation). The input is slightly different from that of collexeme analysis.
This time, one needs to tabulate the type frequency of the collexeme in the first construction, the type frequency of the same collexeme in the second construction, and
the frequencies of the two constructions with words other than the collexeme under
investigation (Gries & Stefanowitsch 2004: 102). Again, 2 × 2 tables are submitted to
the Fisher-Yates exact test for each relevant lexeme. However, when one wants to
compare more than two constructions and input more complex tables, such as Table 3 below, the Fisher-Yates exact test cannot be used. Instead, one needs to carry out a
one-tailed exact binomial test, and the method is known as multiple distinctive collexeme analysis.20 The same script as for collexeme analysis was used.
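For the two-construction case, the Fisher-Yates computation can be sketched as follows; the counts are invented, and scipy's fisher_exact is shown as a stand-in for the author's R script.

```python
from scipy.stats import fisher_exact

# Invented 2x2 input for one collexeme in two competing constructions:
#                    construction A   construction B
# collexeme                 40               10
# all other words         4960             4990
table = [[40, 10], [4960, 4990]]

# One-tailed test: is the collexeme over-represented in construction A?
odds_ratio, p = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, p = {p:.2e}")
```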
Below, Tables 4 to 7 list, for each moderator, the ten most distinctive adjectives. MDCA compares the observed frequency of each adjective with its expected
frequency.
If adjectives were distributed at random over the different moderator constructions, we would not find any significant deviation between observed and expected
frequencies because the distribution of each adjective would follow the frequencies
of the moderators. For each construction token, the script performs a binomial test
Table 3. Input for a multiple distinctive collexeme analysis of an adjective in four moderator constructions

Construction     Adjective A    Other adjectives    Row totals
rather + adj          a               b                a+b
quite + adj           c               d                c+d
pretty + adj          e               f                e+f
fairly + adj          g               h                g+h
Column totals      a+c+e+g         b+d+f+h       a+b+c+d+e+f+g+h
20. Gilquin (2007) illustrates how multiple distinctive collexeme analysis determines the verbs that are distinctively associated with the non-finite verb slot of English periphrastic causative constructions.
Table 4. The 10 most distinctive adjectives of rather

rather + adj    observed frequency    expected frequency    pbin rather    SumAbsDev
odd                     74                  14.51               38.99         56.75
unusual                116                  37.58               30.40         46.39
strange                 80                  25.70               21.54         38.70
vague                   56                  12.71               24.75         35.86
difficult              129                  65.35               13.89         35.86
simplistic              28                   4.70               18.31         28.42
lengthy                 36                   9.26               13.81         27.53
peculiar                31                   5.94               17.21         27.08
bizarre                 51                  14.09               17.46         25.83
curious                 29                   5.94               14.91         25.09
formal                  29                   5.94               14.91         23.49
Table 5. The 10 most distinctive adjectives of quite

quite + adj    obs freq    exp freq    pbin quite    SumAbsDev
different        2247       825.30        Inf           Inf
right             548       184.75       238.72        410.01
possible          458       150.95       216.96        374.49
ready             300       104.02       120.33        206.00
true              235        91.23        70.29        120.24
likely            206        76.46        69.12        118.13
similar           328       148.32        66.10        116.57
capable           152        51.19        66.87        114.97
willing           142        49.55        56.31         96.25
correct           106        36.10        45.22         78.33
Table 6. The 10 most distinctive adjectives of fairly

fairly + adj       obs freq    exp freq    pbin fairly    SumAbsDev
common                278        83.86        76.85         145.26
easy                  337       108.56        83.94         130.39
new                   202        43.38        88.58         123.43
constant              117        16.63        84.16         123.41
recent                122        20.36        72.50         109.92
certain               201        66.03        49.00          98.80
typical               154        35.55        61.99          89.24
consistent            130        33.26        45.98          68.47
straightforward       123        30.73        44.94          65.25
regular                70        12.05        40.51          59.78
Table 7. The 10 most distinctive adjectives of pretty

pretty + adj    obs freq    exp freq    pbin pretty    SumAbsDev
good               7731      3613.04        Inf            Inf
bad                 758       343.37       201.75         349.84
cool                488       214.45       144.61         252.23
tough               489       226.02       122.02         213.70
big                 591       295.44       113.65         204.40
hard                447       225.61        83.52         143.97
scary               212        98.34        52.78          93.43
smart               196        91.32        48.17          83.66
amazing             347       195.03        44.89          82.05
close               498       314.03        40.52          77.60
to determine the probability of a particular observed frequency given the expected frequency.21 This probability is then log-transformed. The resulting value (pbin)
captures distinctiveness.22 It is used to determine whether a given adjective is distinctive for a particular construction, and whether the co-occurrence between
the adjective and the moderator construction is statistically significant. The
co-occurrence is statistically significant if the absolute distinctiveness value exceeds
1.3, which corresponds to p < 0.05. Finally, SumAbsDev gives the sum of all absolute pbin values for
a particular adjective. The higher the figure, the more the adjective deviates from its
expected distribution.
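The computation just described can be sketched as follows, with Python as a stand-in for the author's R script and all counts invented. The sign convention follows footnote 22, and the 1.3 threshold corresponds to p < 0.05 because pbin is a base-10 logarithm (−log10(0.05) ≈ 1.3).

```python
from math import log10
from scipy.stats import binom

# Invented MDCA input: one adjective occurring n times in total, and each
# construction's share of all construction tokens in the corpus.
n = 250
shares = {"rather": 0.058, "quite": 0.30, "pretty": 0.44, "fairly": 0.202}
observed = {"rather": 74, "quite": 60, "pretty": 80, "fairly": 36}

pbins = {}
for c, p_c in shares.items():
    obs, expected = observed[c], n * p_c
    if obs >= expected:   # attraction: P(X >= obs), positive sign
        pbins[c] = -log10(binom.sf(obs - 1, n, p_c))
    else:                 # repulsion: P(X <= obs), negative sign
        pbins[c] = log10(binom.cdf(obs, n, p_c))

sum_abs_dev = sum(abs(v) for v in pbins.values())
for c, v in pbins.items():
    flag = "*" if abs(v) > 1.3 else ""   # |pbin| > 1.3 means p < 0.05
    print(f"{c:7s} pbin = {v:+7.2f} {flag}")
print(f"SumAbsDev = {sum_abs_dev:.2f}")
```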
MDCA makes patterns of attraction more visible. It confirms that rather attracts
adjectives that denote atypicality/deviation from a norm (odd, unusual, strange, vague,
peculiar, bizarre, curious). Additionally, rather attracts adjectives that denote difficulty/simplicity. By far, the most distinctive collexeme of quite is different (pbin and
SumAbsDev = infinite). Its antonym (similar) is also among the 10 most distinctive
collexemes. Modal meanings are well represented: possible and likely denote epistemic
meaning, and ready, capable, and willing denote dynamic meaning. Also distinctive of
quite are adjectives that denote factuality (right, true, correct). Fairly attracts some sets
that are semantically close, such as typicality (common, typical), similarity/stability
(constant, consistent, regular), and epistemicity (certain). Other sets include easiness
(easy, straightforward), and time location (new, recent). Lastly, the most distinctive
collexeme of pretty is good (pbin and SumAbsDev = infinite). Good belongs to the
21. For instance, the probability of finding 74 occurrences of odd in <rather + adj.> when only 14.51 occurrences were expected.
22. pbin receives a positive sign when the collexeme occurs more frequently than expected in the construction and a negative sign when it occurs less frequently than expected. In short, positive values indicate attracted collexemes whereas negative values indicate repelled collexemes. For reasons of space, I have selected positive values only.
category of positive values, along with cool and smart. Pretty also attracts an antonym
such as bad, which denotes a negative value. Other distinctive semantic sets include
difficulty (tough, hard), spatial dimension or location (big, close), deviation from a
norm (amazing), and psychological stimulus (scary).
Since MDCA excludes overlap (i.e. those adjectives which collexeme analysis showed to be common to at least two moderators), it makes some of the tendencies revealed by collexeme analysis more apparent:
a. atypicality as well as difficulty/simplicity are the most distinctive features of
rather;
b. difference/similarity and modal meanings are the most distinctive features of
quite;
c. typicality is a distinctive feature of fairly, along with similarity/stability;
d. whatever their polarity, value judgments are the most distinctive features of pretty, along with difficulty and dimension/position.
But MDCA also reveals tendencies that were harder to grasp with collexeme analysis.
Indeed, moderators follow a division of labor in the expression of some functions.
Atypicality is a distinctive feature of rather, whereas typicality is a distinctive feature
of fairly. Easiness is a distinctive feature of fairly, whereas the expression of difficulty
is distinctive of both rather and pretty. There is a difference in register though: it seems
that pretty is less formal than rather (rather difficult vs. pretty tough, pretty hard). All
in all, MDCA shows that even though moderators are functionally close, they do not
profile the same conceptual domains.
Even though collexeme analysis and MDCA reveal tendencies that were much
harder to capture with only raw frequencies, the above observations are partial because of the limited number of selected collexemes. For a deeper assessment of the
synonymy of moderators and the division of labor that they follow, we should increase
the level of granularity of our analysis. One obvious solution is to investigate more
collexemes, but the more data we have, the more difficult it is to make generalizations.
Rather than inspect and compare collostruction-based frequency tables manually, we
should also be able to compute and visualize the relative attraction between (a) moderators, (b) adjectives, (c) moderators and adjectives. With this goal in mind, we can
use the output of MDCA as input for correspondence analysis.
4.4 Multiple distinctive collexeme analysis as input for correspondence analysis
Correspondence analysis (henceforth CA) is an exploratory statistical technique that
takes the frequencies of multiway tables as input, then summarizes and visualizes
distances between the variables. It determines the probability of global association
Table 8. Input for correspondence analysis (sampled)

Adjective (distinctive collexeme)    fairly    pretty    quite    rather
able                                     3         4        67         0
abstract                                14         7         5        22
accurate                                83        66       109         4
amazing                                  6       347        97        22
aware                                    1        11       109         0
awful                                    0       100        12         6
awkward                                  0        12         7        27
bad                                     19       758        32        22
beautiful                                1         2       129        21
big                                     56       591        45        23
…                                        …         …         …         …
between rows and columns, and tests this association using the χ2 test.23 Two rows/
columns will be close to each other if they associate with the columns/rows in the
same way.
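As an illustration, the global χ² test can be run on the sampled rows of Table 8; scipy is shown here as a stand-in for the author's R workflow, and footnote 23 reports χ² = 33623.82 with df = 297 for the full 100 × 4 table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# First five rows of Table 8 (raw frequencies);
# columns ordered fairly, pretty, quite, rather.
counts = np.array([
    [ 3,   4,  67,  0],   # able
    [14,   7,   5, 22],   # abstract
    [83,  66, 109,  4],   # accurate
    [ 6, 347,  97, 22],   # amazing
    [ 1,  11, 109,  0],   # aware
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.3g}")
```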
Table 8 shows a sample of the input used for CA. It brings together the 25 most
distinctive collexemes of each moderator and the raw frequency of each collocation
type. The whole table contains 400 cells. CA uses these frequencies to compare (a) row
profiles, i.e. adjectives, (b) column profiles, i.e. moderators, and (c) row profiles and column profiles, i.e. moderators and adjectives. This method reintroduces overlap because the table contains raw frequencies of adjectives that co-occur with at least two
moderators. As we saw above, overlap is a characteristic of the paradigm of moderators. Taking overlap into account is therefore a way of mapping co-occurrence patterns more realistically than if we simply ignored it.
CA projects the multidimensional distances onto a two-dimensional plane that
maps the correlations between the variables. More precisely, it transforms the input
table (i.e. a table of numerical information) into a graphic display in which each row
and each column is represented as a point in a Euclidean space. Figure 2, below, is the
graphic output of CA.24
23. Since CA is an exploratory technique, one does not need to check whether the conditions
of use of χ2-statistics are met. For our current purpose, the hypothesis of independence can be
rejected because χ2 = 33623.82, df = 297, and p-value < 2.2e-16.
24. To conduct CA and output the graph, I used R with the packages FactoMineR (http://cran.r-project.org/web/packages/FactoMineR/index.html) and dynGraph (http://cran.r-project.org/web/packages/dynGraph/index.html).
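Computationally, CA amounts to a singular value decomposition of the standardized residuals of the frequency table; FactoMineR's CA function performs these steps and adds the plotting. Below is a plain-numpy sketch run on the sampled rows of Table 8 only, so the coordinates and inertia shares will not match those of the full analysis.

```python
import numpy as np

# Sampled rows of Table 8; columns: fairly, pretty, quite, rather.
N = np.array([
    [ 3,   4,  67,  0],
    [14,   7,   5, 22],
    [83,  66, 109,  4],
    [ 6, 347,  97, 22],
    [ 1,  11, 109,  0],
], dtype=float)

P = N / N.sum()                        # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)    # row masses, column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

eig = sv ** 2                          # principal inertias (eigenvalues)
share = eig / eig.sum()                # proportion of inertia per dimension
rows = (U * sv) / np.sqrt(r)[:, None]      # row (adjective) coordinates
cols = (Vt.T * sv) / np.sqrt(c)[:, None]   # column (moderator) coordinates

print("inertia shares of dims 1-2:", np.round(share[:2], 3))
for name, (x, y) in zip(["fairly", "pretty", "quite", "rather"], cols[:, :2]):
    print(f"{name:7s} dim1 = {x:+.2f}  dim2 = {y:+.2f}")
```

Two rows (or columns) with similar profiles end up with nearby coordinates, which is exactly the proximity interpretation applied to the CA biplots below.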
[Figure: CA biplot. Horizontal axis: Dimension 1 (52.77% of inertia); vertical axis: Dimension 2 (28.14% of inertia).]
Figure 2. CA biplot of the <moderator + adjective> construction in COCA
The plot is built along two axes, which are the principal axes of inertia.25 Their
intersection defines the average profile of all the points in the cloud. CA decomposes
the overall inertia by identifying a small number of representative dimensions. Each
axis corresponds to a dimension. The plot displays only two dimensions, which are
selected according to their eigenvalues. The eigenvalue of a dimension measures how
much information is present along the axis of that dimension. The first axis (dimension 1, eigenvalue = 0.533) represents 52.77% of the inertia, whereas the second axis
(dimension 2, eigenvalue = 0.284) represents 28.14% of the inertia. There is a third
dimension, whose eigenvalue is 0.193.
Even though dimension 3 accounts for 19.08% of the inertia, it is not taken into
account in the plot. This is not a problem because the first two dimensions already
explain 80.91% of the information contained in the input table, and the results can be
interpreted with enough accuracy without dimension 3.
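These percentages follow directly from the eigenvalues: a dimension's share of the inertia is its eigenvalue divided by the total inertia. A quick check (the small discrepancies with respect to the reported figures come from the eigenvalues being rounded to three decimals):

```python
# Eigenvalues of the three CA dimensions, as reported above (rounded).
eigenvalues = [0.533, 0.284, 0.193]
total = sum(eigenvalues)

shares = [100 * ev / total for ev in eigenvalues]
print([f"{s:.2f}%" for s in shares])              # close to 52.77, 28.14, 19.08
print(f"dims 1 + 2 together: {shares[0] + shares[1]:.2f}%")  # close to 80.91
```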
Because the plot contains a lot of data, we should examine each dimension in turn.
On the horizontal axis, dimension 1 contrasts pretty and quite. Each of them attracts
25. In CA, “inertia” is very similar to the “moment of inertia” in applied mathematics. It measures the total variance of the data table.
[Figure: CA biplot with semantic annotation. Horizontal axis: Dimension 1 (57.51% of inertia); vertical axis: Dimension 2 (27.06% of inertia).]
Figure 3. CA biplot of the <moderator + adjective> construction in COCA (with semantic annotation)
its own cloud of adjectives, and each cloud is clearly delimited. On the vertical axis,
dimension 2 opposes fairly and rather (at the top of the cloud) to pretty and quite (at
the bottom of the cloud). This goes against the implicit assumption that moderators
split up between rather and quite on the one hand, and pretty and fairly on the other
hand (see, for example, Downing & Locke 2006). The proximity between fairly and
rather is evidenced by the continuum formed by their distinctive collexemes (both
horizontally and vertically). Comparatively, only two adjectives (surprising and difficult) stand halfway between rather and quite. In all likelihood, the boundary between
fairly and rather can be drawn above a cluster of adjectives with negative connotations
(obscure, crude, mundane, lengthy, simplistic), which are distinctive of rather. At this
stage, it is still difficult to spot any division of labor among moderators because of the
granularity of the plot. Figure 3, above, presents the graphic output of CA once all
adjectives have been semantically annotated. The relative position of moderators in
the cloud is very similar to the configuration displayed in Figure 2.
Annotating adjectives makes it considerably easier to identify the functional specificities of each moderator as well as the division of labor among them. The specificities of each moderator are listed below:
rather: dimension or position in space (e.g. long, high), atypicality/oddity (e.g. odd, bizarre), negative attitudes (e.g. ironic), unclearness (e.g. vague, obscure);
quite: epistemic, dynamic, and factual meanings (e.g. likely, able, true), difference (e.g. different, separate), psychological states (e.g. surprised, concerned, content);
fairly: location in time (e.g. recent, new), typicality (e.g. typical, common, standard);
pretty: appreciative and unappreciative values (e.g. good, great vs. bad, awful), cleverness and stupidity (e.g. smart vs. stupid, dumb), difficulty (e.g. difficult, tough, hard), psychological stimuli (e.g. scary, funny).
The above list shows that moderators follow a division of labor in the intensification
of some complementary meanings:
– rather modifies spatial location and atypicality whereas fairly modifies time location and typicality;
– rather modifies the expression of negative attitude whereas quite modifies the
expression of positive attitude;
– pretty modifies the expression of difficulty whereas fairly and rather modify the
expression of simplicity;
– pretty modifies the expression of psychological stimuli whereas quite modifies the
expression of psychological states;
– quite modifies the expression of difference whereas fairly and rather modify the
expression of similarity/stability.
Lastly, some meanings are not distinctive of any moderator in particular:
– modifying the degree of surprise/salience and atypicality/extraordinariness is
common to pretty, rather, and quite;
– modifying the degree of simplicity and similarity/stability is common to both
fairly and rather.
To summarize, we have three major configurations:
– first configuration: moderators operate within one conceptual content;
– second configuration: two complementary aspects of a conceptual domain are
intensified by two distinct moderators;
– third configuration: one conceptual content can be intensified indiscriminately
by different moderators.
5. Discussion and conclusion
In this paper, we have proposed and combined several statistical methods to provide
a bidirectional semantic modeling of the <moderator + adjective> construction. We
have made three points. Firstly, we have reasserted the need for better statistics in the
collocation-based study of degree modifiers. Collostructional analysis is superior to
most techniques based on raw counts and/or percentages because it filters out co-occurring pairs that are unrealistically frequent or rare, regardless of the size
of the corpus. Secondly, we have shown that combining univariate and multivariate
statistics can help map usage patterns and conceptual structure in a set of near-synonyms. The relationship between moderators and adjectives is indeed bidirectional,
and it can be represented spatially. Thirdly, my results partly support Paradis (1997)
regarding the cognitive synonymy of moderators, which are both similar and different. Moderators are similar because they have a functional basis in common, namely
modifying the degree of a property denoted by an adjective. Moderators are also different because they do not modify the same classes of adjectives. In Cognitive Grammar terms, moderators do not always operate within the same conceptual domains. If
they do, they follow a division of labor.
These findings are of great significance to the study of constructions since two
items that co-occur significantly are likely to be entrenched as a constructional unit.26
Figure 2 shows that some pairs are more entrenched than others. For example, quite
surprised is more entrenched than quite surprising; pretty crazy is more entrenched
than pretty silly; rather vague is more entrenched than rather abstract; and fairly
straightforward is more entrenched than fairly easy. Once a pairing of lexemes is sufficiently entrenched, it is likely to acquire a meaning/function of its own. Figure 3
shows that the division of labor of moderators is not limited to the expression of intensification. It includes the expression of various meanings, such as the expression of
modality, value judgments, dimension, position in time or in space, etc.
Cognitive Construction Grammar takes an inventory approach to the mental
representation of grammar. Such an approach assumes that grammar predominantly
stores language structure in a complex constructional network instead of building
structure “on demand”. In a section on partial productivity, Goldberg postulates that
the candidates for the verb slot in the ditransitive construction are stored in speakers’
memories as similarity clusters on the basis of their type frequencies (1995: 133–136).
The higher the type frequency, the bigger the cluster, and the more productive the
26. However, we should be wary of establishing too strong a correspondence between high
frequencies and entrenchment (Geeraerts 2000). Studies in cognitive semantics have shown
that what determines the entrenchment of a linguistic unit is not so much its absolute frequency as its frequency of occurrence relative to the frequency of similar units in
similar contexts (Geeraerts, Grondelaers & Bakema 1994). Also, some linguistic units are entrenched not because they occur frequently, but because they are salient (Schmid 2007, 2010).
verb class is. In an effort to map usage patterns, Goldberg provides a two-dimensional representation where verbs cluster spatially according to similarity (1995: 135).
Accordingly, give, pass, bequeath or grant are good candidates for the verb slot in the
ditransitive construction, whereas envy or forgive are poor (but by no means impossible) candidates. Goldberg's map is a theoretical abstraction because it is not based
on actual corpus data or a similarity metric, unlike Figure 2 and Figure 3 in
Section 4. Although the latter bear on a different case study, they can be considered
as corpus-driven and statistically grounded extensions of Goldberg’s graphic intuition. Figure 2 is flexible enough to represent both entrenched collocations (e.g. rather
vague, quite different, fairly new, pretty good) and collocations that are improbable, yet
possible (e.g. rather neat, quite cool, fairly stupid, pretty right). It reflects the fact that
speakers tend to use certain adjectives with certain degree-modifiers, but can also
extend moderators idiosyncratically to other classes of adjectives. The existence of
dense, neat clusters of adjectives around pretty and quite suggests that speakers are
more conservative in their use of adjectives with these two moderators. Figure 3 confirms that adjectives cluster around moderators on the basis of semantic similarity. In
sum, I have presented evidence that shows that types of the <moderator + adjective>
construction form a network structured by similarity clusters.
Recent studies on near-synonymy in the Cognitive Linguistic framework have
concluded that multifactorial techniques can help map usage patterns (Glynn 2010b)
and spot “clusters in the mind” (Divjak & Gries 2008; Divjak 2010). Given how representative the corpus I have used is, the same kind of conclusions can be drawn regarding the <moderator + adjective> construction, pending experimental verification.
Perhaps in comparison to other less macroscopic approaches, the results presented in this paper seem conditional. Nevertheless, these results take usage-based
representations of linguistic units seriously and are verifiable. One can test these findings with a different corpus and compare the results. Confirmatory statistics such as
logistic or log-linear regression will have to corroborate the claim that these findings
are not due to chance and provide a faithful representation of the reality of the data.
For reasons of space, I have deliberately left aside three aspects of the <moderator +
adjective> construction, some of which have received much attention in the past,
such as syntactic idiosyncrasies (Allerton 1987; Gilbert 1989), grading force (Paradis
1997), and subjectivity (Nevalainen & Rissanen 2002; Athanasiadou 2007). However,
I believe that the methodology presented in this paper can shed new light on each of
these aspects. Regarding syntactic idiosyncrasies, distinctive collexemes can be used
as input for correspondence analysis to obtain a clearer picture of alternations such
as <quite a + adjective> vs. <a quite + adjective>, <quite[MODERATOR] + scalar adjective> vs. <quite[MAXIMIZER] + absolutive adjective>, or <a fairly + adjective + noun> vs.
<NP + be + fairly + adjective>. Regarding grading force, adjectives can be annotated
according to gradability in correspondence analysis following the categories proposed
in Paradis (1997: 49), namely non-gradable, scalar, extreme, limit. Regarding subjectivity, the methodology I have proposed can be applied to a diachronic corpus (see
also Hilpert 2006), so as to conduct a quantitative assessment of subjectification over
the long-term history of rather, quite, fairly, and pretty.
I have made use of multifactorial methods to visualize data that is not properly
speaking multifactorial. Instead of clustering lexical co-occurrence data alone, it will
be in the interest of future research to expand what I have presented in Figure 3 and
integrate more strata of richly annotated data within the same plot (e.g. information
concerning grading force, boundedness, and semantic classes of adjectives). The resulting two-dimensional map will provide a finer-grained representation of the scale
of synonymy of moderators. It will also help explain why collostructions featuring the
same adjective do not have the same connotation depending on which moderator is
used.27
To summarize, using multiple distinctive collexemes as input for correspondence
analysis has three assets. First, it enables the linguist to ignore collocates that would
be frequent or infrequent whatever the context (because of their overall frequency profile throughout the corpus) and focus on relevant pairs. Once distinctive collexemes have been identified, their raw frequencies can be used safely in a multi-way
table to map correlations between moderators and adjectives by means of a multivariate statistical technique.
Second, one graphic output is enough to synthesize similarities and differences
in a set of near-synonyms. Similarities between moderators (e.g. rather and fairly) are
evidenced by their relative proximity on the map. Differences (e.g. pretty vs. quite) are
made apparent by the relative distance between items. This is also true of adjectives,
which tend to cluster according to meaning.
Third, visualizing distinctive collexemes in a correspondence analysis plot can
do more than depict proximities and distances within separate paradigms. It is also
a potentially accurate means of determining entrenchment continua along the two
dimensions that structure the Euclidean space. Hopefully, the methodology I have
proposed can be used to represent the complex inventory of constructions that shapes
speakers’ grammars.
References
Allerton, D. J. (1987). English intensifiers and their idiosyncrasies. In R. Steele & T. Threadgold
(Eds.), Language topics: Essays in honour of Michael Halliday (pp. 15–31). Amsterdam &
Philadelphia: John Benjamins.
27. For example, even though good is a distinctive collexeme of pretty, raw frequencies show
that it also co-occurs with fairly and quite. Presumably, it does not have the same meaning in
each construction. It will be in the interest of future research to include context-based variables
to clarify these variations in meaning.
Altenberg, B. (1991). Amplifier collocations in spoken English. In S. Johansson & A. Stenström
(Eds.), English computer corpora: Selected papers and research guide (pp. 127–149). Berlin
& New York: Mouton de Gruyter.
Athanasiadou, A. (2007). On the subjectivity of intensifiers. Language Sciences, 29, 554–565.
DOI: 10.1016/j.langsci.2007.01.009
Benzécri, J-P. (1973). L’analyse des données, 2. L’analyse des correspondances. Paris: Dunod.
Benzécri, J.-P. (1984). Analyse des correspondances, exposé élémentaire (2nd ed.). Paris: Dunod.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics – Investigating language structure
and use. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511804489
Bolinger, D. L. M. (1972). Degree words. The Hague: Mouton. DOI: 10.1515/9783110877786
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 22–29.
Cruse, D. A. (1986). Lexical semantics. Cambridge: Cambridge University Press.
Davies, M. (1990–present). The Corpus of Contemporary American English (COCA): 410+ million words. http://corpus.byu.edu/coca
Divjak, D. (2006). Ways of intending: Delineating and structuring near-synonyms. In St. Th.
Gries & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches
to syntax and lexis (pp. 19–56). Berlin & New York: Mouton de Gruyter.
Divjak, D. (2010). Structuring the lexicon: A clustered model for near-synonymy. Berlin & New
York: Mouton de Gruyter.
Divjak, D., & Gries, St. Th. (2006). Ways of trying in Russian: Clustering behavioral profiles.
Corpus Linguistics and Linguistic Theory, 2, 23–60. DOI: 10.1515/CLLT.2006.002
Divjak, D., & Gries, St. Th. (2008). Clusters in the mind? Converging evidence from near synonymy in Russian. The Mental Lexicon, 3, 188–213. DOI: 10.1075/ml.3.2.03div
Downing, A., & Locke, P. (2006). English grammar: A university course (2nd ed.). London:
Routledge.
Edmonds, P., & Hirst, G. (2002). Near-synonymy and lexical choice. Computational Linguistics,
28, 105–144. DOI: 10.1162/089120102760173625
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis (5th ed.). Oxford:
Wiley-Blackwell. DOI: 10.1002/9780470977811
Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. Unpublished
doctoral dissertation. Institut für maschinelle Sprachverarbeitung. University of Stuttgart.
Firth, J. R. (1957). A synopsis of linguistic theory, 1930–1955. In J. R. Firth (Ed.), Studies in
linguistic analysis. Special volume of the Philological Society (pp. 1–32). Oxford: Blackwell.
Geeraerts, D. (2000). Salience phenomena in the lexicon: A typology. In L. Albertazzi (Ed.),
Meaning and Cognition (pp. 79–101). Amsterdam & Philadelphia: John Benjamins.
Geeraerts, D., Grondelaers, S., & Bakema, P. (1994). The structure of lexical variation: Meaning,
naming, and context. Berlin, New York: Mouton de Gruyter.
Gilbert, E. (1989). Quite, rather. Cahiers de recherche en grammaire anglaise, 4, 4–61.
Gilquin, G. (2007). The verb slot in causative constructions. Finding the best fit. Constructions,
SV1-3/2006. www.elanguage.net/journals/index.php/constructions/article/view/18/23
Glynn, D. (2010a). Corpus-driven Cognitive Semantics. An introduction to the field. In
D. Glynn & K. Fischer (Eds.), Corpus-driven Cognitive Semantics: Quantitative approaches
(pp. 1–42). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226423.1
Glynn, D. (2010b). Synonymy, lexical fields, and grammatical constructions. A study in usage-based Cognitive Semantics. In H. Schmid & S. Handl (Eds.), Cognitive foundations of
linguistic usage-patterns: Empirical studies (pp. 89–118). Berlin & New York: Mouton de
Gruyter.
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure.
Chicago: University of Chicago Press.
Goldberg, A. E. (2003). Constructions: A new theoretical approach to language. Trends in Cognitive Sciences, 7, 219–224. DOI: 10.1016/S1364-6613(03)00080-9
Goldberg, A. E. (2006). Constructions at work: The nature of generalization in language. Oxford:
Oxford University Press.
Goldberg, A. E. (2009). The nature of generalization in language. Cognitive Linguistics, 20, 201–
224. DOI: 10.1515/COGL.2009.013
Greenacre, M. J. (2007). Correspondence analysis in practice (2nd ed.). Boca Raton: Chapman &
Hall/CRC. DOI: 10.1201/9781420011234
Gries, St. Th. (2007). Coll.analysis 3.2. A program for R for Windows 2.x.
Gries, St. Th. (2010). Statistics for linguistics with R: A practical introduction. Berlin & New York:
Mouton de Gruyter.
Gries, St. Th., & Stefanowitsch, A. (2004). Extending collostructional analysis: A corpus-based
perspective on ‘alternations’. International Journal of Corpus Linguistics, 9, 97–129.
DOI: 10.1075/ijcl.9.1.06gri
Gries, St. Th., & Stefanowitsch, A. (2006). Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110197709
Gries, St. Th., & Stefanowitsch, A. (2010). Cluster analysis and the identification of collexeme classes. In J. Newman & S. Rice (Eds.), Empirical and experimental methods in cognitive/functional research (pp. 73–90). Stanford, CA: CSLI.
Hilpert, M. (2006). Distinctive collexeme analysis and diachrony. Corpus Linguistics and Linguistic Theory, 2, 243–256. DOI: 10.1515/CLLT.2006.012
Hirst, G. (1995). Near-synonymy and the structure of lexical knowledge. In J. Klavans (Ed.),
AAAI Symposium on Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity (pp. 51–56). Cambridge (Mass.): AAAI Press.
Kay, P. (2013). The limits of (Construction) Grammar. In T. Hoffmann & G. Trousdale (Eds.),
The Oxford Handbook of Construction Grammar (pp. 32–48). Oxford: Oxford University
Press.
Kennedy, G. (2003). Amplifier collocations in the British National Corpus: Implications for
English language teaching. TESOL Quarterly, 37, 467–487. DOI: 10.2307/3588400
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6, 97–133.
Kilgarriff, A. (2005). Language is never, ever, ever, random. Corpus Linguistics and Linguistic
Theory, 1, 263–276.
Langacker, R. W. (1987). Foundations of Cognitive Grammar. Vol. 1. Theoretical prerequisites.
Stanford: Stanford University Press.
Langacker, R. W. (1991). Foundations of Cognitive Grammar. Vol. 2. Descriptive application.
Stanford: Stanford University Press.
Langacker, R. W. (2008). Cognitive Grammar: A basic introduction. Oxford: Oxford University
Press.
Langacker, R. W. (2009). Cognitive (construction) grammar. Cognitive Linguistics, 20, 167–176.
Leech, G. (1997). Introducing corpus annotation. In R. Garside, G. Leech, & T. McEnery (Eds.),
Corpus annotation: Linguistic information from computer text corpora (pp. 1–18). London:
Longman.
178 Guillaume Desagulier
Lorenz, G. (2002). Really worthwhile or not really significant? A corpus-based approach
to the delexicalization and grammaticalization of intensifiers in Modern English. In
I. Wischer, & G. Diewald (Eds.), Speech, place, and action: Studies in deixis and related
topics (pp. 143–161). New York: Wiley.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing.
Cambridge, MA: MIT Press.
Nevalainen, T., & Rissanen, M. (2002). Fairly pretty or pretty fair? On the development and
grammaticalization of English downtoners. Language Sciences, 24, 359–380.
Paradis, C. (1994). Compromisers – a notional paradigm. Hermes, 13, 157–167.
Paradis, C. (1997). Degree modifiers of adjectives in spoken British English. Lund: Lund University Press.
Paradis, C. (2000). It’s well weird: Degree modifiers of adjectives revisited: The nineties. In J. M.
Kirk (Ed.), Corpora galore: Analyses and techniques in describing English (pp. 147–160).
Amsterdam & Atlanta: Rodopi.
Paradis, C. (2008). Configurations, construals and change: Expressions of degree. English Language and Linguistics, 12, 317–343.
Pedersen, T. (1996). Fishing for exactness. In Proceedings of the South-Central SAS Users Group
Conference (pp. 188–200). Austin, TX.
Quine, W. V. O. (1951). Main trends in recent philosophy: Two dogmas of empiricism. The
Philosophical Review, 60, 20–43.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the
English language. London & New York: Longman.
R Development Core Team. (2011). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Schmid, H.-J. (2007). Entrenchment, salience, and basic levels. In D. Geeraerts & H. Cuyckens
(Eds.), The Oxford handbook of Cognitive Linguistics (pp. 117–138). Oxford: Oxford University Press.
Schmid, H.-J. (2010). Does frequency in text instantiate entrenchment in the cognitive system? In D. Glynn & K. Fischer (Eds.), Quantitative methods in cognitive semantics: Corpus-driven approaches (pp. 101–133). Berlin & New York: Mouton de Gruyter.
Simon-Vandenbergen, A.-M. (2008). Almost certainly and most definitely: Degree modifiers
and epistemic stance. Journal of Pragmatics, 40, 1521–1542.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Stefanowitsch, A., & Gries, St. Th. (2003). Collostructions: Investigating the interaction of
words and constructions. International Journal of Corpus Linguistics, 8, 209–243.
Stoffel, C. (1901). Intensives and downtoners: A study in English adverbs. Heidelberg: Carl
Winter.
Storjohann, P. (2009). Plesionymy: A case of synonymy or contrast? Journal of Pragmatics, 41,
2140–2158.
Traugott, E. C. (2008). The semantic development of scalar focus modifiers. In A. van Kemenade
& B. Los (Eds.), The handbook of the history of English (pp. 335–359). Oxford: Blackwell.
Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the
American Statistical Association, 58, 236–244.
Wiechmann, D. (2008). On the computation of collostruction strength: Testing measures
of association as expressions of lexical bias. Corpus Linguistics and Linguistic Theory, 4,
253–290.
A case for the multifactorial assessment
of learner language
The uses of may and can in French-English
interlanguage
Sandra C. Deshors and Stefan Th. Gries
New Mexico State University / University of California, Santa Barbara
In this study, we apply Gries and Divjak’s Behavioral Profile approach to compare native English can and may, learner English can and may, and French pouvoir. We annotated over 3,700 examples across three corpora according to more than 20 morphosyntactic and semantic features and analysed the features’ distribution with a hierarchical cluster analysis and a logistic regression. The cluster analysis shows that French learners of English build up fairly coherent categories that group the English modals together, followed by pouvoir, but that they also consider pouvoir to be semantically more similar to can than to may. The
regression strongly supports learners’ coherent categories; however, a variety of
interactions shows where learners’ modal use still deviates from that of native
speakers.
Keywords: Behavioral Profiles, hierarchical cluster analysis, logistic regression,
modal verbs
1. Introduction and overview
Acquiring a foreign language is one of the most cognitively challenging tasks, given that languages differ at every level of linguistic analysis. From a cognitively and
psycholinguistically-oriented perspective, learning a language requires identifying a
very large amount of co-occurrence data – tense t and number n require subject-verb
agreement with morpheme m, idiom i consists of word w and word x, communicative function f is communicated with intonation curve c, etc. – as well as storing and
retrieving them. Crucially, these types of co-occurrences are typically probabilistic
only rather than absolute/deterministic and, thus, hard to discern and learn: usually,
learners need to cope with many-to-many mappings between forms and functions,
and often it is only the confluence of differently predictive information on several
levels of linguistic analysis that narrows down the search for a particular meaning (in
comprehension) or a particular form (in production). In the Competition Model by
Bates and MacWhinney (1982, 1989), for example, this situation is modeled on the
assumption that forms and functions are cues to functions and forms, respectively,
and many different cues of different strengths, validities, and reliabilities must be integrated to arrive at, say in production, natural-sounding choices.
Semantics is a particularly tricky linguistic domain in this regard, in one’s native language, but even more so in foreign language learning. Not only do languages often carve up semantic space very differently (so that the categories of the language acquired first will influence category formation in the language(s) acquired subsequently), but semantic differences are also often much less explicitly noticeable (than, say, the presence or absence
of a plural morpheme), which makes the identification of probabilistic co-occurrence
patterns all the more difficult. In order to allow for a precise description of semantic,
or more generally functional, characteristics of synonyms, antonyms, and senses of
polysemous words, Gries and Divjak developed the so-called Behavioral Profile (BP)
approach (cf. Gries and Divjak 2009). This approach, to be discussed in more detail
below, is highly compatible with a psycholinguistic perspective of the type outlined
above and involves a very fine-grained annotation of corpus data as well as their statistical analysis.
The method of behavioral profiles has been successfully employed in a variety
of contexts – synonyms, antonyms, and word senses of polysemous words have been
studied both within one L1 and across two different L1s – and has received initial experimental support, but so far there have been no studies that test the BP approach’s
applicability to L1 and L2 data, which is what we will undertake here. The semantic
domain we will explore is one that has proven particularly elusive, namely, modality.
While many semantic phenomena can be clearly delineated and, to some degree, explained by the linguistic analyst, modality has been much more problematic; in fact,
even the scope of the notion of modality has not really been agreed upon yet. In this
chapter, we specifically focus on the semantic domain of possibility as reflected in:
– the choices of can vs. may in essays written by native speakers of English;
– the choices of can vs. may in essays written by French learners of English;1
– the use of pouvoir in essays written by native speakers of French.
In Section 2, we discuss in what sense these modals pose a particular challenge to the
analyst as well as present previous corpus-based work on can and may and highlight
1. Following Bartning (2009), the term “advanced learner” is henceforth assumed to refer
to “a person whose second language is close to that of a native speaker, but whose non-native
usage is perceivable in normal oral or written interaction” (Hyltenstam et al. 2005: 7, cited in Bartning 2009: 12).
some of the shortcomings of such work. In Section 3, we discuss the BP approach
in general as well as our own data and methods in particular. Section 4 presents the
results of our exploration, and Section 5 concludes the chapter.
2. Setting the stage
2.1 What is problematic about the modals?
As near synonyms in the domain of modality, may and can have fueled much theoretical debate with regard to their semantic relations. The two forms have overlapping semantics which simultaneously cover the meanings of possibility, permission and ability (cf. Collins 2009). This means that both forms can be used to express epistemic, deontic and dynamic types of possibility. It follows that the semantic investigation of may and can raises two problematic questions: first, to what extent can the various senses of each form be distinguished, and second, to what degree are the two forms semantically equivalent?
With regard to the first question, studies such as Leech (1969) and Coates (1983)
have illustrated the difficulty in distinguishing between the senses of may and can.
Leech (1969: 76), for instance, notes that “[t]he permission and possibility meanings
of may are close enough for the distinction to be blurred in some cases”. Similarly,
Coates (1983: 14) identifies a “continuum of meaning” – i.e. gradience – in which
possible modal uses shade into each other. In the case of the meanings of can, for
instance, Coates notes that while permission and ability correspond to the core of two
largely intersecting fuzzy semantic sets, possibility, on the other hand, is found “in the
overlapping peripheral area” (p. 86).
With regard to the issue of the semantic equivalence of may and can, the literature
reveals similarly debated standpoints. While some studies recognize the similarities
of the two forms, others do not. In the former case, for instance, Collins (2009: 91)
states that “[t]he two modals of possibility may and can, share a high level of semantic overlap” (despite their differing frequency of occurrence and different degrees of
formality), and Leech (1969: 75) notes that “[i]n asking and giving permission, can
and may are almost interchangeable”. Conversely, studies such as Coates (1983) have
clearly distinguished the two forms. For instance, while Coates (1983) does recognize
that the English modals share certain meanings and can be organized into semantic
clusters, she generally denies the synonymy of may and can by classifying the two
forms into two distinct semantic groups. Although she accepts that the two forms
may have overlapping meanings in some cases, she claims that even then, the two
forms do not occur in free variation.
The occurrence of one form over the other has been shown to be influenced, to
some extent, by its linguistic context. It has indeed been illustrated that particular
co-occurring grammatical categories interfere with the interpretation of the modals.
Leech (2004: 77), for instance, notes that certain uses of may are only to be found in
particular grammatical contexts: “only the permission sense, for instance, is found
in questions (…) and the negation of the possibility sense is different in kind from
the negation of the permission sense”. Generally, several grammatical categories have
been recognized as interacting with the uses of may and can. While negation is one
category that has commonly been identified (cf. Hermerén 1978; Palmer 1979; Coates
1980, 1983; De Haan 1997; Huddleston 2002; Radden 2007; Byloo 2009), voice and
sentence types have also been shown to have similar influences on the forms.
Overall, the above-mentioned studies all provide clear illustrations of the complexity of the semantic relations between may and can on the basis of empirically
gathered evidence. However, they all tend to be based on generalized observations of
idiosyncratic behavioral tendencies. In that respect, they all raise the issue of how to
provide a more systematic account of the modals’ semantic characteristics and how to
integrate qualitative findings into a quantitative and empirically-grounded approach.
2.2 Previous corpus-based work on the modals
2.2.1 Native English
As already mentioned above, Hermerén (1978) has shown that the semantics of the
modals in native English are morphosyntactically motivated to a considerable degree such that linguistic categories such as voice, grammatical person, type of main
verb (action, state, etc.), aspect and sentence type influence the interpretation of the
modals: “if these categories can be shown to modify the meaning of the modal […] it
is important that this should be accounted for in the description of the semantics of
the modals” (p. 74). While this claim calls for empirical validation, one implication of
Hermerén’s (1978) argument is that the quantitative study of modal forms will require
a powerful and versatile methodological approach. In a very similar fashion, Klinge
and Müller (2005: 1) argue that, to capture the essence of modal meaning, “it seems
necessary to cut across the boundaries of morphology, syntax, semantics and pragmatics and all dimensions from cognition to communication are involved”.
A second corpus-based study of the modals in native English is Gabrielatos and
Sarmento (2006). This study illustrates an attempt to account for syntactic contextual information while using a quantitative corpus-based approach to investigate core
English modals (i.e. can, could, may, might, must, shall, should, will and would). Although their study does not involve the comparison of English varieties, it presents a comparative analysis of the frequency of use of the modals in an aviation
corpus and a representative corpus of American English. Generally, it raises the following questions:
– To what degree do syntactic structures and modal forms interact contextually?
– To what degree does such interaction affect investigated modal forms semantically?
– How can such interaction be quantitatively investigated in a corpus including
cross-linguistic and interlanguage data?
The authors acknowledge that the modals’ distribution varies as a function of their
syntactic contexts and they show that frequencies of occurrence of core English
modals reflect the type of syntactic environment in which they feature: “there is a
great deal of variation in the use of modal verbs and the structures they occur in,
depending on the context of use” (p. 234). However, their lack of a suitable cognitively-motivated theoretical framework prevents them from providing a meaningful
interpretation of the data and from further exploring their findings.
To date, Collins (2009: 1) presents:
the largest and most comprehensive [study] yet attempted in this area [modality]
based on an analysis of every token of the modals and quasi-modals (a total of
46,121) across the spoken and written data.
Collins (2009) investigates the meanings of the modals in three parallel corpora of
contemporary British English, American English and Australian English. Despite the
author’s recognition that a corpus quantitative approach “typically combined with a
commitment to the notion of ‘total accountability’ may influence hypotheses applied
to the data, or formulated on the basis of it” (p. 5) and despite the large size of his data
set, his analysis is of limited informative value due to:
– a theoretical framework that does not allow for the full exploitation of the linguistic context of the modals, and;
– a statistical approach that inhibits rather than unveils linguistic patterns at play in
the data.
With regard to the first point, Collins (2009) restricts his approach to the identification of the forms’ lexical meanings. His theoretical framework consists of a traditional tripartite taxonomy including epistemic, deontic and dynamic senses. Regrettably,
while he recognizes that some uses of the modals can yield preferences for particular
syntactic environments, his analysis does not address that fact in a systematic quantitative fashion. As for the second point, while, statistically, Collins (2009) limits his
investigation to providing frequency tables of modal forms, his overall approach is
problematic because it is based on the erroneous assumption that the frequent occurrence of a modal form warrants its linguistic relevance. In the case of may and can, for
instance, Collins uses raw frequencies to show that deontic may is the “least common”
sense of the three, accounting for 7% of the tokens as against epistemic may (79%) and dynamic may (8.1%). However, he does not show whether the (low) frequency of deontic
may is significantly different from the also low frequency of dynamic may, and our
analysis of his data shows that, excluding the indeterminate cases, the distribution of
may’s senses across the American, Australian, and British data is highly significant
(χ2 = 42.68; df = 4; p < 0.001). This, in turn, raises the questions of:
– To what extent are Collins’ (2009) frequencies of the occurrences of modal forms
in each corpus comparable?
– Since the observed frequency discrepancies are not a matter of chance, what
motivates, linguistically, the different uses of each form in each independent
corpus?
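The chi-squared test of independence reported above can be computed directly from a contingency table of sense-by-variety counts. The sketch below is illustrative only: the counts are invented placeholders, not Collins’ (2009) actual figures; it simply shows how a 3 × 3 table (three senses by three varieties) yields df = 4.

```python
# Pure-Python chi-squared test of independence for a two-way table.
# The counts are INVENTED, not Collins' (2009) data.

def chi_squared(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum((obs - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))
    df = (len(table) - 1) * (len(table[0]) - 1)  # (3-1)*(3-1) = 4 here
    return chi2, df

# Rows: epistemic / deontic / dynamic may; columns: AmE / AusE / BrE
toy_counts = [[790, 810, 760],
              [70,  40,  95],
              [80,  85,  60]]
chi2, df = chi_squared(toy_counts)
print(round(chi2, 2), df)
```

The statistic is then compared against the chi-squared distribution with the computed degrees of freedom to obtain a p-value.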
So in sum, while studies such as Gabrielatos and Sarmento (2006) and Collins (2009) provide many descriptive results, they are often largely or exclusively form-based and fall short of determining which of the many frequencies are statistically and/or linguistically relevant. As a result, such studies do not come close to allowing us to develop a characterization of modals that lets us classify/predict modal use.
2.2.2 Learner English and contrastive approaches
From a cross-linguistic and an interlanguage perspective, investigating the modals
raises two related issues, namely (i) the possibility of a lack of (direct) semantic equivalence between the modal forms in the learner’s native language (L1) and his/her
target language (L2), and (ii), the fact that such cross-linguistic semantic dissimilarity
will affect the uses of the forms in L2. The modals may and can and native French
pouvoir illustrate the case in point. Despite the fact that all three forms contribute to
the expression of the semantic notion of possibility, pouvoir synchronically covers
the whole range of the modal uses of may and can.
One corpus-based study of learners’ use of modals is Aijmer (2002), which is
based on a corpus of Swedish L2 English writers. She compares (i) the frequencies
of key modal words in native English and advanced Swedish-English interlanguage,
as well as (ii) frequencies encountered in Swedish learner English with those from
comparable French and German L2 English. Aijmer’s study indicates “a generalized
overuse of all the formal categories of modality” and she further points out that “it
is only at a functional level that any underuse was detected, with the learner writers
failing to use may at all in its root meaning” (p. 72).
Similarly, Neff et al. (2003) investigate the uses of modal verbs (can, could, may and might) by writers from several L1 backgrounds. Neff et al. (2003) use a
learner corpus including Dutch-, French-, German-, Italian-, and Spanish-English
interlanguage, which they contrast with a reference corpus of American university
English. Neff et al. (2003:â•›215) identify the case of can as potentially interesting “since
it is overused by all non-native writers”. They further report that the frequency of may
by French native speakers stands out in comparison to the frequencies by all other
non-native speakers included in the study, but since their study does basically nothing
but compare raw frequencies of occurrence regardless of any contextual features, it is
not particularly illuminating.
Generally, and similar to Gabrielatos and Sarmento (2006) and Collins (2009),
both Aijmer (2002) and Neff et al. (2003) made the convenient, but ultimately problematic, methodological decision to rely on information that is
retrievable without human effort. In addition, even the studies that address learner
use do not relate their findings to the wider context of (second) language acquisition.
In a corpus-based contrastive study, Salkie (2004) investigates the nature of the
semantic relations between the three forms in native English and native French. He
uses a subpart of the parallel corpus INTERSECT (cf. Salkie 2000), and focuses on
three working hypotheses, namely that:
– “pouvoir corresponds more closely to one of the English modals rather than the other” (p. 169);
– “pouvoir is less specific than the English modals” (p. 170);
– “pouvoir has a sense which is different from both the English modals but is not just a general sense of possibility” (p. 170).
While Salkie (2004) concludes in favour of the third hypothesis, it is worth pointing out, however, that his results were based on only 100 randomly extracted occurrences of each English modal form (i.e. may and can) and their respective French
translations.
By way of a more general summary, it is probably fair to say that corpus-based
approaches to modality in L1 and L2s leave something to be desired. Some studies point
to the immense complexity of the subject but do not choose multifactorial or multivariate methods that are capable of addressing this degree of complexity. In addition,
some studies are based on large numbers of modals but, frankly, do not do very much
with the vast amount of data other than present arrays of statistically under-analyzed
frequency tables. On the other hand, the analytically much more interesting studies
of the kind of Salkie (2004) are based on very small samples. Finally, many studies are
largely if not exclusively form-based and focus only on learners’ over-/underuse of
modals in particular examples or kinds of contexts.
2.3 Characteristics of the present study
2.3.1 Methodological considerations
The above discussion fairly clearly indicates what kind of approach would be desirable, namely one that:
– can integrate linguistic information and patterning from many different levels of
linguistic analysis in a way alluded to by Hermerén (1978), as well as Klinge and
Müller (2005);
– involves a sample that is not only studied with regard to more linguistic parameters, but is at the same time also larger than those of previous studies that aimed at more than description;
– explores similarities and differences of L1 uses of can and may, but also explores
the way these English modals are used in L2 English (here by French learners) as well as how the same concept is expressed by the learners in their L1 (here with pouvoir).
Given these demands, we decided to use the so-called Behavioral Profile approach,
which fits the above wish list very well. It combines the statistical methods of contemporary quantitative corpus linguistics with a cognitive-linguistic and psycholinguistic perspective or orientation (cf. Divjak and Gries 2006, 2008, 2009; Gries 2006,
2010b; Gries and Divjak 2009, 2010; and others). As such, it diverges radically from
the above-mentioned more traditional corpus-based approaches to modality in both
L1 and L2. Methodologically, it involves four steps:
– the retrieval of all instances of a word’s lemma from a corpus in their context;
– a manual annotation of a number of features characteristic of the use of the word
forms in the data; these features are referred to as ID tags and typically involve
morphosyntactic and semantic features in particular. Each ID tag contributes to
the profiling of the investigated lexical item(s);
– the generation of a table of co-occurrence percentages, which specify, for example, which words (from a set of near-synonymous words) or senses (of a polysemous word) co-occur with which morphosyntactic and/or semantic ID tags; it is
these vectors of percentages that are called profiles;
– the evaluation of that table by means of statistical techniques.
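The third step can be sketched as follows. The instances, ID tags, and values below are invented toy data, not drawn from the corpora used in this study; the point is only to show how annotated instances become per-word vectors of co-occurrence percentages (the profiles).

```python
# Minimal sketch of the "behavioral profile" step: turn annotated instances
# into per-word vectors of ID-tag co-occurrence percentages.
# Instances and ID tags are INVENTED toy data.

from collections import Counter

# Each instance: (word form, {ID tag: value, ...})
instances = [
    ("can", {"voice": "active",  "negation": "no"}),
    ("can", {"voice": "active",  "negation": "yes"}),
    ("can", {"voice": "passive", "negation": "no"}),
    ("may", {"voice": "active",  "negation": "no"}),
    ("may", {"voice": "active",  "negation": "no"}),
]

def behavioral_profile(word):
    """Proportion of the word's instances carrying each ID-tag:value pair."""
    hits = [tags for w, tags in instances if w == word]
    counts = Counter((tag, val) for tags in hits for tag, val in tags.items())
    return {tv: n / len(hits) for tv, n in counts.items()}

for w in ("can", "may"):
    print(w, behavioral_profile(w))
```

It is these vectors of percentages, one per word (or per sense), that are subsequently submitted to clustering or regression.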
Given that this approach is entirely based on various kinds of co-occurrence information, it comes as no surprise that, just like much other work in corpus linguistics, the BP approach assumes that “the distributional characteristics of the use of an
item reveals many of its semantic and functional properties and purposes” (Gries and
Otani 2010:â•›3). While these previous studies have investigated a variety of different
lexical relations (near synonymy, polysemy, antonymy) both within languages (English, Finnish, Russian) and across languages (English and Russian), the present study
will add to the domains in which Behavioral Profiles have been used in two ways:
(i) so far, no non-native language data have been studied, and (ii) we will add French
to the list of languages studied.
As the first BP study focusing on learner data, and only the second BP study that
compares data from different languages, this paper is still largely exploratory. We will
mainly be concerned with the following two issues:
– To what degree can Behavioral Profiling handle the kind of learner data that are inherently messier and more volatile than native data, and provide a quantitatively adequate and fine-grained characterization of the use of can and may by native speakers and learners, and how does that use compare to French speakers’ use of pouvoir?
– As a follow-up, and if meaningful groups of uses emerge, to what degree do the
distributional characteristics that BP studies typically include allow us to predict
native speakers’ and learners’ choices of modal verbs, and how do these speaker
groups differ?
The former question will be explored with the kind of cluster-analytic approach usually employed in BP studies; for the latter question, we will turn to a logistic regression
(cf. Arppe 2008 for another BP approach using (multinomial) regression).
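The regression step can be illustrated with a deliberately simplified sketch. This is not the authors’ actual analysis (BP studies typically use R’s glm() or similar): the two binary context features and the data points are invented, and the model is fitted here with plain stochastic gradient ascent purely for self-containedness.

```python
# Illustrative sketch: a logistic regression predicting the choice of
# can (1) vs. may (0) from binary context features. Features and data
# points are INVENTED; the fitting method is plain gradient ascent.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, lr=0.5, epochs=2000):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            w = [wj + lr * (yi - p) * xj for wj, xj in zip(w, xi)]
    return w

# Columns: intercept, negated clause?, passive voice? -- toy annotations
X = [[1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 0, 0]]
y = [1, 1, 1, 0, 0, 1]  # 1 = can, 0 = may

w = fit(X, y)
# A positive weight means the feature favours can over may in these toy data.
print([round(wj, 2) for wj in w])
```

In the real analysis the predictors are the annotated ID tags, and the fitted coefficients (and their interactions) show which contextual features drive the choice between the two modals.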
2.3.2 Theoretical orientation
In previous studies, the BP approach was used for more than just the quantitative description of the data. Rather, it is firmly grounded in usage-based/exemplar-based approaches within Cognitive Linguistics and psycholinguistics, and it attempts to relate the results of the statistical exploration of the data to such approaches. While this orientation
is also compatible with our current goals, there is one particular earlier model in L2/
FLA research that is especially well-suited to, or compatible with, our current objectives, namely the Competition Model (CM) by Bates and MacWhinney (cf. Bates and
MacWhinney 1982, 1989). This model is “a probabilistic theory of grammatical processing which developed out of a large body of crosslinguistic work in adult and child
language, as well as in aphasia” (Kilborn and Ito 1989: 261). MacWhinney (2004: 3)
himself characterized it as a “unified model [of language acquisition] in which the
mechanisms of L1 learning are seen as a subset of the mechanisms of L2 learning”.
The CM is characterized by the two following assumptions:
– Linguistic signs map forms and functions onto each other (probabilistically) such
that forms and functions are cues to functions and forms respectively.
– In language production, forms compete to express underlying intentions or functions, and in language comprehension, the input contains many different cues of
different strengths, validities, and reliabilities, which must be integrated: native
speakers “depend on a particular set of probabilistic cues to assign formal surface
devices in their language to a specific set of underlying functions” (Bates and
MacWhinney 1989: 257).
As a usage-based and probabilistic model, the CM assumes that both frequency and
function determine the choice of grammatical forms in language production; as with
most usage-based and/or corpus-linguistic approaches, we too consider frequency in
a corpus as a proxy for frequency of exposure (in both comprehension and production). Cross-linguistically, this is an important assumption because across languages
cues are instantiated in different ways and speakers assign them varying degrees of
strength. It is therefore important to describe and explain L1 statistical regularities as
“[t]hey are part of the native speaker’s knowledge of his/her language, and they are an
important source of information for the language learner” (Bates and MacWhinney
1989: 15).
Overall, Kilborn and Ito (1989: 289) conclude that existing psycholinguistic studies have successfully demonstrated that the CM is appropriate for the characterization
of learner language through cue distributions and they report “extensive evidence for
the invasion of L1 strategies into L2 processing”. In addition, it is also obvious how
much the CM is compatible with a BP approach. The main notions that drive the
Competition Model are cue strengths, validities, and reliabilities, and all of these are
essentially conditional probabilities, i.e. percentages. While the BP approach as such
does not cover the full complexity of how conditional cue strengths, validities, and
reliabilities can interact, it is a useful and experimentally validated (cf. Divjak and
Gries 2008) approach employing a similar logic.
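The link between these CM notions and conditional probabilities can be made concrete with a small sketch. The cue and function labels and the counts below are invented for illustration; the definitions follow the Competition Model’s notions of cue availability, reliability, and overall validity:

```python
# Toy data: each utterance pairs one surface cue with the underlying
# function it marks (cue/function labels are invented for illustration).
utterances = [
    ("preverbal_position", "agent"), ("preverbal_position", "agent"),
    ("preverbal_position", "patient"),
    ("animacy", "agent"), ("animacy", "agent"),
    ("case_marking", "agent"),
]

def cue_availability(cue, data):
    """P(cue present): how often the cue is there to be used at all."""
    return sum(1 for c, _ in data if c == cue) / len(data)

def cue_reliability(cue, function, data):
    """P(function | cue): how often the cue, when present, marks the function."""
    with_cue = [f for c, f in data if c == cue]
    return with_cue.count(function) / len(with_cue)

def cue_validity(cue, function, data):
    """Overall validity = availability * reliability."""
    return cue_availability(cue, data) * cue_reliability(cue, function, data)
```

On these toy counts, preverbal position is available in half of the utterances and marks the agent in two thirds of its occurrences, so its overall validity for agenthood is one third.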
A theory of language transfer requires that we have some ability to predict where the phenomena in question will and will not occur. In this regard contrastive analysis alone falls short; it is simply not predictive. (Gass 1996: 324)
3. Data and methods
3.1 Retrieval and annotation
The data are from three untagged corpora: the French subsection of the International
Corpus of Learner English (henceforth ICLE-FR), the Louvain Corpus of Native English
Essays (LOCNESS), and the Corpus de Dissertations Françaises (CODIF). All corpora
included in the present work were collected by the Centre for English Corpus Linguistics (CECL) at the Université Catholique de Louvain (UCL) and made available
to us by the Director of the Centre, Professor Sylviane Granger. ICLE-FR has a total
of 228,081 words, including 177,963 words of argumentative texts and 50,118 words
of literary texts. LOCNESS is a 324,304-word corpus that includes three sub-data sets: a 60,209-word sub-corpus of British A-Level essays, a 95,695-word sub-corpus
of British university essays and a sub-corpus of American university essays that has
168,400 words. The CODIF is a corpus of essays written by French-speaking undergraduate students in Romance languages at the Université Catholique de Louvain (UCL). CODIF also includes argumentative and literary texts and has a total of
100,000 words.2
2. Information on the total number of words featuring in each individual text type (i.e. argumentative, literary) is not available.
Table 1. Excerpt of an annotation table including selected variables

Case  Match    Corpus  ClType       Use      VerbSemantics       Neg          RefAnim
5     may      native  coordinate   process  ment/cog/emotional  affirmative  animate
133   may      native  main         state    copula              affirmative  inanimate
1760  may      native  main         process  ment/cog/emotional  negative     animate
1886  can      il      coordinate   process  ment/cog/emotional  affirmative  animate
2876  cannot   il      subordinate  state    abstract            negative     inanimate
3540  peut     fr      main         process  ment/cog/emotional  negative     animate
3645  peuvent  fr      subordinate  process  abstract            negative     inanimate
Given their compositions, the three corpora included in our study are highly comparable. They all consist of written data produced by university students (ICLE, CODIF, the LOCNESS British and American university sections) or by students approaching university entrance (i.e. the LOCNESS British A-Level section).3 All participants’ contributions take the form of an essay of approximately 500 words. In terms of content, all essays deal with similar topics such as crime, education, the Gulf War, Europe, or university degrees.
The data we subjected to the BP approach consist of instances of may and can in
native English and French-English interlanguage as well as pouvoir in native French
from the above corpora. Using scripts written in R (cf. R Development Core Team
2010), we retrieved 3,710 occurrences of the investigated modal forms from all
sub-corpora, which were imported into spreadsheet software and annotated for 22
morphosyntactic and semantic variables.4 Table 1 exemplifies this database with a
very small excerpt of these data, and Table 2 presents the total range of variables included in the study and their respective levels.
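Since the corpora are untagged, retrieval amounts to pattern matching over raw text. The authors used R scripts (R Development Core Team 2010); the Python sketch below is an illustrative stand-in, and the regular expression and helper name are our assumptions, not their code:

```python
import re

# Matches the English forms of interest and some French inflected forms of
# pouvoir; the French forms listed here are illustrative, not the authors' query.
MODAL_RE = re.compile(r"\b(?:can(?:not)?|may|peut|peuvent|pouvoir)\b",
                      re.IGNORECASE)

def retrieve_modals(text, corpus_label):
    """Return one annotation-ready row per hit, with a small context window."""
    rows = []
    for m in MODAL_RE.finditer(text):
        rows.append({
            "Match": m.group(0).lower(),
            "Corpus": corpus_label,  # e.g. native / il / fr
            "Context": text[max(0, m.start() - 30):m.end() + 30],
        })
    return rows
```

Each retrieved row would then be annotated manually for the morphosyntactic and semantic variables of Table 2.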
For each variable, an encoding taxonomy was designed prior to annotation. Due
to the large number of variables included in this study and the absence of a number of
them from previous studies on the English modals, not all encoding taxonomies were
theoretically motivated. In cases where the annotation is not based on accounts from
the existing literature, a bottom-up approach was adopted for the identification of recurrent features in the data. This procedure, for instance, was carried out in the case of
the variable VerbSemantics where, prior to annotation, recurrent semantic features
were identified as characteristic of the lexical verbs used alongside the modals.
3. The inclusion of the LOCNESS British A-Level section alongside sub-corpora solely including university participants is not judged problematic as LOCNESS only involves English
native speakers whose level of English is not expected to develop any further.
4. Although the annotation process included a variable encoding the semantic role of the
subject referent of the modals, this study does not account for that variable due to its high correlation with VOICE.
Table 2. Overview of the variables used in the study and their respective levels

Type           Variable                                  Levels
data           Corpus                                    native, interlanguage, French
syntactic      GramAcc (acceptability)                   yes, no
               Neg (negation)                            affirmative, negated
               SentType (sentence type)                  declarative, interrogative
               ClType (clause type)                      main, coordinate, subordinate
morphological  Form                                      can, may, pouvoir (and negated forms)
               SubjMorph (subject morphology)            adj., adv., common noun, proper noun,
                                                         relative pronoun, date, noun phrase, etc.
               SubjPerson (subject person)               1, 2, 3
               SubjNumber (subject number)               singular, plural
               Voice                                     active, passive
               Aspect                                    perfect, perfective, progressive
               Mood                                      indicative, subjunctive
               SubjRefNumber (subject referent number)   singular, plural
semantic       Senses                                    epistemic, deontic, dynamic
               SpeakPresence                             weak, medium, strong
               Use                                       accomplishment, achievement, process, state
               VerbSemantics                             abstract, general action, action incurring
                                                         transformation, action incurring movement,
                                                         perception, etc.
               RefAnim (subject referent animacy)        animate, inanimate
               AnimType (subject referent animacy type)  animate, floral, object, place/time,
                                                         mental/emotional, etc.
Because of space restrictions, we are not able to provide a more comprehensive
account of the annotation process (but cf. Deshors 2010 for details). However, three
variables – Senses, VerbType, and VerbSemantics – require some brief explanatory
comments.
3.1.1 The variable Senses
As for Senses, the semantic category of modality includes a wide range of heterogeneous meanings that many scholars have attempted to unite under a variety of
categorization systems (cf. Palmer 1979; Coates 1983; Bybee and Fleischman 1995;
Huddleston 2002; Nuyts 2006; Byloo 2009). While Depraetere and Reed (2006: 277)
note that “in classifying modal meanings, it is possible to use various parameters as
criterial to their classification”, this study assumes a coding taxonomy based on a
traditional tripartite distinction between epistemic, deontic and dynamic meanings.
Following Nuyts (2006: 6), epistemic senses concern “an indication of the epistemic
estimation, typically, but not necessarily, by the speaker, of the chances that the state
of affairs expressed in the clause applies in the world”. Consider (1) as an illustration
of epistemic may:
(1) indeed, Europe 92 may lead to the disappearance of cultural differences
Following Palmer (1979: 58), deontic modality refers to cases where “[b]y uttering a
modal, a speaker may actually give permission (may, can)”. (2) illustrates deontic can:
(2) if all public schools started to say you can only come here if you are Hispanic
or if you are Polish, our schooling system would be in great chaos
Finally, dynamic meanings denote “an ascription of a capacity to the subject-participant of the clause (the subject is able to perform the action expressed by the main
verb in the clause)” (Nuyts 2006: 3). Generally, dynamic modality expresses the potentiality of an event occurring. Nuyts’s type of dynamic modality includes ability/capability cases where the possibility of event occurrence stems from the ability of the
(grammatical) subject to carry out the event. In that regard, the term ability is not restricted to a ‘physical’ interpretation and equally applies to mental and technical types
of ability. Example (3) illustrates dynamic can:
(3) Mrs Ramsay is the central character because she can see the whole personality
of the other ones
Generally, our frequencies of use of may and can in their different senses match those
previously encountered in existing studies solely concerned with the native use of
the modals, such as Coates (1980) and Collins (2009). While Coates (1980: 218), for
instance, reports that “by far the most common usage of may is to express epistemic
possibility”, she stresses the distinctive nature of the uses of may and can:
The patterns resulting from my analysis of the data (…) leads me to conclude
that in normal everyday usage may and can express distinct meanings: may is
primarily used to express epistemic possibility, while can primarily expresses root
possibility.5
3.1.2 The variable VerbType
The variable VerbType targets the lexical verbs with which the forms are used and
characterizes their telicity. Conceptually, the variable VerbType follows Vendler
(1967) in its recognition that the notion of time is crucially related to the use of a
5. Coates (1980, 1983) categorizes modal meaning according to a two-way distinction that
includes epistemic and non-epistemic modality. She refers to the latter type as “root” modality.
verb and is “at least important enough to warrant separate treatment” (p. 143). This
variable assesses:
– whether may and can have preferences for lexical verbs denoting a state, a process, an accomplishment or an achievement,6 and, if so,
– in which type of corpus these preferential patterns occur.
3.1.3 The variable VerbSemantics
Similarly to the variable VerbType, VerbSemantics identifies the type of semantic
information conveyed by the lexical verbs used with the modals. The internal organization of this variable results from a bottom-up approach and does not follow any
particular theoretical framework. This variable consists of levels denoting abstract processes, physical actions, actions incurring movement, actions incurring some physical transformation, communicative processes, mental/cognitive/emotional processes, perception processes, and verbal statements involving a copula verb. Example (4) illustrates a case where the lexical verb expresses a mental/cognitive/emotional process:
(4) Her search for the final touch can be seen as a search for harmony
Once all matches were annotated, the resulting data table was evaluated statistically.
3.2 The BP approach in this study: Statistical analysis
As mentioned above, the data were evaluated in two different ways.7 The first of these
involved the type of cluster analysis that is characteristic of much work using the BP
methodology. In this first part, we used Gries’s (2010a) R script Behavioral Profiles
1.01 and computed five behavioral profiles, one for each modal form as occurring in
each language variety, i.e. native can, native may, interlanguage (IL) can, IL may, and
native pouvoir (FR). Such profiles consist of vectors of co-occurrence percentages of a
single modal form with each level of all independent variables and provide form-specific summaries of their semantic and morphosyntactic behavior in each sub-corpus. In a second step, the profiles were assessed statistically with a hierarchical cluster
analysis to explore the similarity and differences between the modal forms, and in
keeping with previous studies (cf. Divjak and Gries 2006), we chose the Canberra
metric as a measure of (dis)similarity and Ward’s rule as an amalgamation strategy.
6. Accomplishment verbs encode verbal statements that imply a unique and definite time
period; achievement verbs encode verbal statements that imply a unique and definite time instant; process verbs identify statements that reflect non-unique and indefinite time periods;
state verbs identify statements that reflect non-unique and indefinite time instants.
7. All statistical computations and plots were performed with R (for Linux), version 2.11.0
(see R Development Core Team 2010).
Following Gries and Otani (2010), we computed different cluster analyses, one involving all variables that the uses of the modals were annotated for, one for only the
syntactic variables, and one for only the semantic variables.
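In code, a behavioral profile vector and the Canberra metric amount to the following. This is a pure-Python sketch with a simplified data representation and invented tag names; the actual analysis used Gries’s Behavioral Profiles 1.01 script and R’s clustering routines:

```python
def behavioral_profile(rows, levels):
    """Co-occurrence percentages of one modal form with each variable level.
    `rows` is the list of annotated uses of that form, each represented here
    as a set of ID tags (a simplification of the full annotation table)."""
    n = len(rows)
    return [100 * sum(1 for r in rows if level in r) / n for level in levels]

def canberra(p, q):
    """Canberra distance, the (dis)similarity measure chosen for clustering
    (cf. Divjak and Gries 2006); zero-zero components are skipped."""
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(p, q) if a or b)
```

The five profile vectors (native can and may, IL can and may, FR pouvoir) would then be amalgamated hierarchically with Ward’s rule.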
The second analytical step involved a binary logistic regression including the following variables and predictors:
– Form as the dependent variable with only two levels here: can vs. may;
– GramAcc, Neg, SentType, ClType, SubjMorph, SubjPerson, SubjNumber,
Voice, Aspect, Mood, SubjRefNumber, Senses, SpeakPresence, Use, VerbSemantics, RefAnim, AnimType as independent variables in the form of main
effects;
– all these variables’ interactions with Corpus as additional predictors (to see
which variables’ influence on modal use differs the most between L1 English and
L2 English).
The logistic regression was then combined with a model selection process during which insignificant predictors were discarded from the model: first insignificant interactions, then individual variables that were not significant and did not participate
in a significant interaction.
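The elimination order just described can be sketched as a small loop. This is an illustrative sketch, not the authors’ code: `p_value` is a stand-in for a real significance test (e.g. a likelihood-ratio test refitting the model without the term), and the colon notation for interactions follows R’s formula syntax:

```python
def select_model(terms, p_value, alpha=0.05):
    """Backwards elimination: drop non-significant interaction terms first
    (worst p-value first), then non-significant main effects that do not
    take part in a surviving interaction."""
    terms = set(terms)
    # Phase 1: insignificant interactions ("A:B" as in R formulas).
    while True:
        cand = [t for t in terms if ":" in t and p_value(t) >= alpha]
        if not cand:
            break
        terms.remove(max(cand, key=p_value))
    # Phase 2: insignificant main effects not protected by an interaction.
    while True:
        cand = [t for t in terms
                if ":" not in t and p_value(t) >= alpha
                and not any(":" in u and t in u.split(":") for u in terms)]
        if not cand:
            break
        terms.remove(max(cand, key=p_value))
    return terms
```

Note that a non-significant main effect survives as long as it still participates in a significant interaction, which is why a predictor such as SubjNumber can remain in the final model despite being non-significant on its own.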
4. Results and discussion
4.1 Cluster analysis
Our first cluster analysis yielded the results shown in Figure 1. The left plot is a dendrogram of the five modal forms that were clustered; the right plot represents average
silhouette widths for assuming two, three, and four clusters. The average silhouette
widths point to a two-cluster solution, maybe a three-cluster solution, but the difference is minor since the former would result in a French-vs.-English clustering, and the latter in a French-vs.-can-vs.-may clustering. This is compatible with the analysis of Salkie, who argued that pouvoir is very different from both can and may, and intuitively, both these solutions “make sense”, which provides first evidence in favor of the approach. To anticipate the potential objection that this may seem trivial, let us mention
that it is in fact not. The data in Figure 1 show that the BP vectors are good and robust
descriptors of how the modals behave because many other theoretically possible cluster solutions, such as the ones listed in (5), would not have made linguistic sense at all.
(5) a. {{{can_il may_native pouvoir} can_native} may_il}
    b. {{{can_native may_il pouvoir} can_il} may_native}
    c. {{can_il may_native} {pouvoir may_il} can_native}
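The average silhouette widths used to decide between the two-, three-, and four-cluster solutions can be computed from pairwise distances as follows. This is a pure-Python sketch; real work would use, e.g., R’s cluster package, and singleton clusters are assigned a width of 0 by convention:

```python
def avg_silhouette(dist, clusters):
    """Average silhouette width of a partition. `dist(a, b)` gives the
    distance between two items; `clusters` is a list of sets of items."""
    widths = []
    for c in clusters:
        for i in c:
            if len(c) == 1:          # convention: singletons score 0
                widths.append(0.0)
                continue
            # a: mean distance to the item's own cluster
            a = sum(dist(i, x) for x in c if x != i) / (len(c) - 1)
            # b: mean distance to the nearest other cluster
            b = min(sum(dist(i, x) for x in oc) / len(oc)
                    for oc in clusters if oc is not c)
            widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)
```

Higher average widths indicate a better-separated partition; this is the statistic plotted against the number of clusters when choosing between the candidate solutions.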
Figure 1. Dendrogram for all independent variables (il = interlanguage)
Figure 2. Dendrograms for all morphosyntactic variables (left panel) and all semantic variables (right)
However, in what follows we show that a fine-grained comparative description of
cross-linguistic language varieties can be obtained by focusing on differences between
the independent variables used for clustering. Consider Figure 2, which shows the
dendrograms for all morphosyntactic variables and all the semantic variables in the
left and right panel, respectively.
Interestingly, the results show that the intuitively very reasonable dendrogram in Figure 1 is not replicated by looking at morphosyntax or semantics alone, which at least to some extent contrasts with Gries and Otani’s results, where the three clusterings did not differ very much from each other. The reasonable similarities of Figure 1 emerge only when all variables are combined. In particular, in both panels of Figure 2 can_il and can_native are grouped together, but then the remaining forms are grouped differently. In the morphosyntactic dendrogram, the two kinds of may are successively amalgamated and the French pouvoir is only added after all English forms have been
Figure 3. Snakeplot for most extreme differences between syntactic ID tags of may (interlanguage minus native)
clustered. In other words, morphosyntactically, we find a clear English-French divide,
but interlanguage may is too different from native may to be grouped together. To
identify the source of this difference, we used what in BP approaches has been called
a snakeplot, namely a plot of the pairwise differences between the percentages for, in
this case, may_il and may_native (cf. Divjak and Gries 2009 or Gries and Otani 2010 for
more examples).
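A snakeplot is just the sorted vector of these pairwise differences; the sketch below shows the underlying computation (the ID-tag names and percentages are invented, though their direction mirrors the finding reported next):

```python
def snake_diffs(profile_a, profile_b):
    """Pairwise differences between two behavioral profiles, given as dicts
    mapping ID tags to co-occurrence percentages, sorted for plotting."""
    return sorted(((tag, profile_a[tag] - profile_b[tag]) for tag in profile_a),
                  key=lambda kv: kv[1])
```

Tags at the extremes of the sorted vector are the ones where the two forms diverge most.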
As indicated in Figure 3, the main morphosyntactic ways in which learners deviate from native speakers are that learners underuse may in subordinate clauses and
in negated clauses. This is in fact an interesting finding because it means that learners
disprefer the rarer of the two modals – may – in those contexts which are already
morphosyntactically more challenging, as if using can is the default they resort to
when they are already under a higher processing load (cf. the so-called complexity
principle).
In the semantic dendrogram, by contrast, we find a different patterning. Semantically, can_il and can_native are again very similar and grouped together early, but then
the next clustering step groups the two forms of may together. However, interestingly,
it is not the English forms that are then all grouped together – rather, contrary to
Salkie’s earlier analysis, pouvoir is semantically more similar to can than may is.
4.2 Logistic regression
The model selection process involved thirteen steps during which insignificant predictors were discarded. The final, minimal adequate model includes 16 significant variables and 6 significant interactions and is highly significant overall: log-likelihood chi-squared = 3296.47; df = 60; p < 0.001; the correlation between the
Table 3. Overview of the results of the final GLM model

Predictor       Chi-square (df)     Predictor               Chi-square (df)
Corpus            24.9 (1) ***      AnimType                   98.2 (11) ***
GramAcc           13.8 (1) ***      Voice                      55.0 (1) ***
Use               67.9 (1) ***      SentType                   47.2 (1) ***
Elliptic         100.0 (2) ***      Negation                   87.2 (1) ***
ClType            10.9 (1) ***      SpeakPresence           29905.9 (2) ***
VerbType          97.4 (2) ***      Corpus:ClType              60.0 (2) ***
VerbSemantics    384.9 (6) ***      Corpus:VerbSemantics       32.2 (6) ***
SubjPerson        26.6 (2) ***      Corpus:SubjNumber          37.4 (1) ***
SubjNumber         1.3 (1) ns       Corpus:RefAnim            122.2 (1) ***
SubjMorph         49.1 (4) ***      Corpus:AnimType           118.2 (11) ***
RefAnim           59.2 (1) ***      Corpus:Negation            12.0 (1) ***
observed forms – may vs. can – and predicted probabilities is very high: R2 = 0.955.
Correspondingly, the model’s classificatory power is also very high, with a classification accuracy of 99%. Table 3 summarizes all the significant variables and
interactions yielded in the final model.
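The classification accuracy just mentioned is simply the share of observed forms that the fitted probabilities recover. A minimal sketch, in which the 0.5 cut-off and the probabilities in the test are our assumptions:

```python
def classification_accuracy(pred_probs, observed, threshold=0.5):
    """Share of tokens whose predicted form (here: 'may' if the modelled
    probability of may reaches the threshold, else 'can') matches the
    observed form."""
    preds = ["may" if p >= threshold else "can" for p in pred_probs]
    return sum(p == o for p, o in zip(preds, observed)) / len(observed)
```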
Overall, the final model includes one significant interaction involving a morphological variable (out of seven morphological variables), two significant interactions
involving syntactic variables (out of three syntactic variables) and three significant interactions involving semantic variables (out of eight semantic variables). But what do
the interactions reflect? Let us begin with Corpus:ClType, as represented in Figure 4.
The frequencies of may and can differ with regard to the type of clauses in which
they occur in native and learner English. The (weak!) effect is that, in interlanguage
Figure 4. Bar plots of relative frequencies of Corpus:ClType
Figure 5. Bar plots of relative frequencies of Corpus:Neg
English, can is more strongly preferred over may in main clauses than it is in native
English.
While, as previously noted, existing literature concerned with the native use of
the modals commonly recognizes negation as “an important aspect of modal meaning” (Hermerén 1978), our study not only confirms the need to include negation in
an investigation of the uses of the modals but further recognizes its significance as a
morphological criterion to assess interlanguage (dis)similarity. Consider Figure 5 for
the interaction Corpus:Neg.
Figure 5 shows that, while all speakers prefer to use can in negated clauses, the interlanguage speakers do so more strongly. This result does not come as a surprise: On
the one hand, this is also compatible with the complexity principle – negated clauses
are more complex and preferred with the more frequent modal. On the other hand,
where epistemic may not would be used in English, French speakers would tend to
use a lexical verb along with the adverb peut-être to indicate the speaker’s uncertainty,
as illustrated in (6):
(6) a. This may not be the case
b. Ce n’est peut-être pas le cas
Consider Figure 6 for the interaction Corpus:SubjNumber.
While native speakers use can more often with singular subjects than with plural
subjects, it is the other way round with the learners, again a result compatible with the
complexity principle.
While the native speakers’ choices of may and can do not vary much between
animate and inanimate subjects, the learners’ choices do: with animate subjects, they
prefer can much more strongly. Figure 7 represents the interaction Corpus:RefAnim.
Figure 6. Bar plots of relative frequencies of Corpus:SubjNumber
Figure 7. Bar plots of relative frequencies of Corpus:RefAnim
Consider Figure 8 for the interaction Corpus:VerbSemantics; the upper panel
represents the interlanguage data, the lower panel represents the native speaker data,
and the bars are sorted from large absolute pairwise differences (left) to small absolute
pairwise differences (right).
The learners and the native speakers differ most strongly with semantically more
abstract verbs and time/place verbs, as in He thinks that if he can achieve one impossible act, then this will change everything.
The learners prefer can with abstract verbs more strongly than the native speakers, but they prefer may more strongly with time/place verbs. However, there are also
(less pronounced) differences for verbs that would typically have a human agent.
Figure 8. Bar plots of relative frequencies of Corpus:VerbSemantics
For instance, the learners prefer may with communication verbs and can with action-transformation verbs. Virtually no difference at all is found with copulas.
As for the final interaction, Corpus:AnimType, we do not represent it here
graphically. While it is significant, the large number of categories plus the fact that the
most pronounced differences occur with a small number of very infrequent categories
does not yield much in terms of interesting findings.
As for the main effects, we will not discuss them here in detail. This is because these main effects by definition do not tell us anything about how the use of can and may differs across language varieties (since these variables do not interact with Corpus). However,
since they do tell us something about which modal verb is preferred by both native
speakers and learners, we summarize them here visually in Figure 9. The x-axis lists
the main effects, on the y-axis we show the percentage of can obtained for levels of
these main effects, and then the levels are plotted at their observed percentage of can;
the dashed line represents the overall percentage of can in the data.
Figure 9. Main effects of the logistic regression
Finally, a brief look at the regression’s misclassifications seems to indicate that
they did not occur randomly. While all 34 misclassifications occurred in the interlanguage data, 29 of them occurred with may in a form characteristic only of the
French-English learner language. In the large majority of those misclassifications,
may is found to express a possibility that results from some sort of theoretical demonstration. Consider the examples in (7) and (8). While the ones in (7) illustrate our
current point, (8) provides an additional example of an atypical occurrence of learner
may, which clearly denotes a strong sense of possibility and whose interpretation is
heavily reminiscent of that of can.
(7) a. So we may say that …
b. To conclude, we may say that …
c. As a conclusion, we may say that …
d.This is why we may now speak of the stupefying effect
e. This is the reason why we may say that …
(8) “Dresden is an old town”, we may read of its history
5. Concluding remarks
By way of summary, the BP approach and the subsequent logistic regression allow us to recognize how can and may (in native and learner English), as well as pouvoir,
relate to each other as well as what helps determine native speakers’ and learners’
choices. On the whole, distributionally we do find the expected groupings: the cans,
then the mays, and only then pouvoir. However, it is interesting that, semantically,
English can is more similar to French pouvoir than to English may, and the subsequent regression results provided some initial information on why that is so. More
specifically, the way learners choose one of the two verbs is often compatible with
a processing-based account in terms of the complexity principle – they choose the
more basic and frequent can over may when the environment is complex – but is also
strongly influenced by the animacy of the subject and the semantics of the verb: can is
overpreferred by learners with animate subjects and with abstract verbs, and underpreferred with time/place verb semantics.
With regard to the modals per se, our results confirm previous studies’ recognition of the influential role of the linguistic context in the uses of may and can. Indeed,
while the main effects included in our final logistic regression model support studies
that have identified morphosyntactic components such as Voice and SentType as
particularly influential categories (Leech 1969, 2004; Huddleston 2002; Collins 2009),
our results reveal the necessity to also take the semantic context of modals more seriously, as reflected by the strong effects of VerbType and VerbSemantics.
More generally speaking and in the parlance of the Competition Model, the cluster analysis and the high classification accuracy of the regression suggest that, on the
whole, the learners have built up mental categories for can and may that are internally
rather coherent. However, the interactions in the regression show that these cues are
weighted incorrectly and sometimes trigger a verb choice that is not in line with native speaker choices, but that even this kind of incorrect choice is largely predictable
(because the regression can still make the correct classifications (cf. Deshors 2010 for
more detailed discussion as well as a distinctive collexeme analysis revealing additional verb-specific preferences). In other words, even though this is the first study
involving learner data (and only the second involving different languages), the BP approach and especially the follow-up in terms of the logistic regression are therefore an
interesting diagnostic: (i) the overall results can testify to the strength of the categories
that are being studied, and (ii) the regression with its inclusion of the interactions of
all variables with “native speaker vs. learner” exactly pinpoints where interactions
become significant, i.e. where the categories of the learner are still substantially different from the native speaker. For further applications and extensions, see Gries and
Wulff (2013) for a similar application to the choice of (of- and s-) genitives by native
speakers and learners, and Gries and Deshors (to appear) for an even more advanced
approach to precisely pinpoint where non-native speakers’ choices deviate from those
of native speakers and how much so. Needless to say, more, and more rigorous, testing is necessary, but to our knowledge this is the first study proposing this kind of
approach more generally and the use of a regression with a native-learner variable
as a measure of L2 “proficiency”; the results illustrate that learners’ “non-nativeness”
manifests itself at all linguistic levels simultaneously.
References
Aijmer, K. (2002). Modality in advanced Swedish learners’ written interlanguage. In S. Granger,
J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition
and foreign language teaching (pp. 55–76). Amsterdam: John Benjamins.
Arppe, A. (2008). Univariate, bivariate and multivariate methods in corpus-based lexicography: A study of synonymy. Unpublished PhD dissertation, University of Helsinki. Available at: <http://urn.fi/URN:ISBN:978-952-10-5175-3>.
Bartning, I. (2009). The advanced learner variety: 10 years later. In E. Labeau, & F. Myles (Eds.),
The advanced learner variety: The case of French (pp. 11–40). Frankfurt/Main: Peter Lang.
Bates, E., & MacWhinney, B. (1982). Functionalist approaches to grammar. In E. Wanner, &
L. R. Gleitman (Eds.), Language acquisition: The state of the art (pp. 173–218). Cambridge:
Cambridge University Press.
Bates, E., & MacWhinney, B. (1989). Functionalism and the competition model. In B. MacWhinney, & E. Bates (Eds.), The cross-linguistic study of sentence processing (pp. 3–73).
Cambridge: Cambridge University Press.
Bybee, J., & Fleischman, S. (1995). Modality in language and discourse. Amsterdam: John
Benjamins. DOI: 10.1075/tsl.32
Byloo, P. (2009). Modality and negation: A corpus-based study. Unpublished PhD dissertation,
University of Antwerp.
Coates, J. (1980). On the non-equivalence of may and can. Lingua, 50(3), 209–220.
DOI: 10.1016/0024-3841(80)90026-1
Coates, J. (1983). The semantics of the modal auxiliaries. London: Croom Helm.
Collins, P. (2009). Modals and quasi modals in English. Amsterdam: Rodopi.
De Haan, F. (1997). The interaction of modality and negation: A typological study. New York:
Garland.
Depraetere, I., & Reed, S. (2006). Mood and modality in English. In B. Aarts, & A. MacMahon
(Eds.), The handbook of English linguistics (pp. 268–287). London: Blackwell.
Deshors, S. C. (2010). A multifactorial study of the uses of may and can in French-English
interlanguage. Unpublished PhD dissertation, University of Sussex.
Divjak, D. S., & Gries, St. Th. (2006). Ways of trying in Russian: Clustering behavioral profiles.
Corpus Linguistics and Linguistic Theory, 2(1), 23–60. DOI: 10.1515/CLLT.2006.002
Divjak, D. S., & Gries, St. Th. (2008). Clusters in the mind? Converging evidence from near
synonymy in Russian. The Mental Lexicon, 3(2), 188–213. DOI: 10.1075/ml.3.2.03div
Divjak, D. S., & Gries, St. Th. (2009). Corpus-based cognitive semantics: A contrastive study
of phasal verbs in English and Russian. In K. Dziwirek, & B. Lewandowska-Tomaszczyk
(Eds.), Studies in cognitive corpus linguistics (pp. 273–296). Frankfurt/Main: Peter Lang.
Gabrielatos, C., & Sarmento, S. (2006). Central modals in an aviation corpus: Frequency and
distribution. Letras de Hoje, 41(2), 215–240.
Gass, S. (1996). Second language acquisition and linguistic theory: The role of language
transfer. In W. C. Ritchie, & T. K. Bhatia (Eds.), Handbook of second language acquisition
(pp. 317–340). San Diego: Academic Press.
Gries, St. Th. (2006). Corpus-based methods and cognitive semantics: The many meanings of
to run. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in cognitive linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin: Mouton de Gruyter.
DOI: 10.1515/9783110197709
A case for the multifactorial assessment of learner language 203
Gries, St. Th. (2010a). Behavioural Profiles 1.01: A program for R 2.7.1 and higher.
Gries, St. Th. (2010b). Behavioral profiles: A fine-grained and quantitative approach in corpus-based lexical semantics. The Mental Lexicon, 5(3), 323–346.
Gries, St. Th., & Deshors, S. C. (To appear). Using regressions to explore deviations between
corpus data and a standard/target: two suggestions. Corpora.
Gries, St. Th., & Divjak, D. S. (2009). Behavioral profiles: A corpus-based approach to cognitive
semantic analysis. In V. Evans, & S. Pourcel (Eds.), New directions in cognitive linguistics
(pp. 57–75). Amsterdam: John Benjamins.
Gries, St. Th., & Divjak, D. S. (2010). Quantitative approaches in usage-based cognitive semantics: Myths, erroneous assumptions, and a proposal. In D. Glynn, & K. Fischer (Eds.),
Quantitative cognitive semantics: Corpus-driven approaches (pp. 333–354). Berlin: Mouton
de Gruyter.
Gries, St. Th., & Otani, N. (2010). Behavioral profiles: A corpus-based perspective on synonymy and antonymy. ICAME Journal, 34, 121–150.
Gries, St. Th., & Wulff, S. (2013). The genitive alternation in Chinese and German ESL learners:
Towards a multifactorial notion of context in learner corpus research. International Journal of Corpus Linguistics, 18(3), 327–356.
Hermerén, L. (1978). On Modality in English: A study of the semantics of the modals. Lund:
LiberLäromedel/Gleerups.
Huddleston, R. D. (2002). The Cambridge grammar of the English language. Cambridge:
Cambridge University Press.
Hyltenstam, K., Bartning, I., & Fant, L. (2005). High Level Proficiency in Second Language Use. Research program for Riksbankens Jubileumsfond. Stockholm University. <http://www.biling.su.se/~AAA>.
Kilborn, K., & Ito, T. (1989). Sentence processing strategies in adult bilinguals. In B. MacWhinney, & E. Bates (Eds.), The cross-linguistic study of sentence processing (pp. 257–291).
Cambridge: Cambridge University Press.
Klinge, A., & Müller, H. H. (2005). Modality: Intrigue and inspiration. In A. Klinge, & H. H.
Müller (Eds.), Modality studies in form and function (pp. 1–4). London: Equinox.
Leech, G. (1969). Towards a semantic description of English. Bloomington, IN: Indiana University Press.
Leech, G. (2004). Meaning and the English verb. London & New York: Longman.
MacWhinney, B. (2004). A unified model of language acquisition. Retrieved from <http://
psyling.psy.cmu.edu/papers/CM-general/unified.pdf> [Accessed 18 June 2010].
Neff, J., Dafouz, E., Herrera, H., Martínez, F., & Rica, J. P. (2003). Contrasting learner corpora: The use of modal and reporting verbs in the expression of writer stance. In
S. Granger, & S. Petch-Tyson (Eds.), Extending the scope of corpus-based research: New
applications, new challenges (pp. 211–230). Amsterdam: Rodopi.
Nuyts, J. (2006). Modality: Overview and linguistic issues. In W. Frawley (Ed.), The expression
of modality (pp. 1–26). Berlin: Mouton de Gruyter.
Palmer, F. (1979). Modality and the English modals. London & New York: Longman.
Radden, G. (2007). Interaction of modality and negation. In W. Chłopicki, A. Pawelec, & A. Pokojska (Eds.), Cognition in language: Volume in Honour of Professor Elżbieta Tabakowska (pp. 224–254). Kraków: Tertium.
R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. <http://www.R-project.org>.
204 Sandra C. Deshors and Stefan Th. Gries
Salkie, R. (2000). Corpus linguistics: A brief guide to research in French language and linguistics. AFLS Cahiers, 6, 44–52.
Salkie, R. (2004). Towards a non-unitary analysis of modality. In L. Gournay, & J.-M. Merle
(Eds.), Contrastes: mélanges offerts à Jacqueline Guillemin-Flescher (pp. 169–182). Paris:
Ophrys.
Vendler, Z. (1967). Verbs and times. In Z. Vendler (Ed.), Linguistics in philosophy (pp. 97–121).
New York: Cornell University Press.
Dutch causative constructions
Quantification of meaning and meaning
of quantification
Natalia Levshina, Dirk Geeraerts, and Dirk Speelman
F.R.S. – FNRS, Université catholique de Louvain / University of Leuven
This chapter is a multivariate corpus-based study of two near-synonymous
periphrastic causatives with doen and laten in Dutch. Using multiple logistic
regression and classification trees, the study explores the conceptual differences
between the constructions. The results support the existing definition of doen
as the direct causation auxiliary, and the interpretation of laten as the indirect
causative (e.g. Verhagen and Kemmer 1997). However, the analyses also reveal
more specific patterns: the most distinctive semantic pattern of doen is affective
causation, whereas the contexts with the highest probability of laten refer to
inducive causation. These differences remain valid when we control for geographic and thematic variation, as well as for the individual Effected Predicates
treated as random effects in a mixed model.
Keywords: classification trees, logistic regression, mixed model, periphrastic
causatives
1. Introduction
This chapter is a contribution to empirical Cognitive Semantics (e.g. Glynn and
Fischer 2010).1 It is a corpus-based multivariate onomasiological study (cf. Tummers
et al. 2005), which uses quantitative corpus evidence to describe, explain and predict the choices that speakers make between semantically related constructions when
they categorize their experience. To do so, the linguist needs to identify the relevant
semantic, pragmatic, social and other features that influence this choice. This kind of
1. This research was supported with a grant from the Flemish Research Fund – FWO
(G033008). The authors would also like to thank Kris Heylen for his help in collecting the corpus data. The usual disclaimers apply.
study requires advanced statistical multivariate techniques, such as logistic regression, which allow the researcher to model the impact of each factor, while controlling
for the others.
The approach is relatively new, but a number of studies have already been carried out. It is interesting to note that a substantial share of these studies focus on
constructions that differ from each other with respect to information structure and
processing. Examples are the dative alternation in English (e.g. Bresnan et al. 2007),
presence or absence of the presentative er-construction in Dutch (Grondelaers et al.
2007), particle placement in English (Gries 2003), word order variation in Dutch final verbal clusters (de Sutter 2009) and in the German ‘middle field’ (Heylen 2005).
It seems that these alternations, a challenge for traditional linguistic descriptions,
have benefited the most from the multifactorial probabilistic methods due to a variety of ways in which the underlying information-processing factors can be captured.
However, more “semantic” constructional variation, like the one discussed here, can
benefit from these methods too because highly abstract grammatical meaning can
be captured in a corpus by a multitude of indirect indicators used as circumstantial
evidence.
The current chapter, which is an elaboration of the pilot study carried out by
Speelman and Geeraerts (2009), focuses on the near-synonymous Dutch causative
constructions with doen and laten. Our study incorporates several linguistic, thematic
and geographical factors in a multivariate statistical model, which allows us to test the
existing semantic hypotheses about doen and laten, keeping the conceptual factors
apart from the other sources of variation. We argue that the distinctive conceptual
features that emerge in the quantitative model constitute the distinctive prototypes of
the constructions – the semantic configurations with the highest intercategorial cue
validity.
Although the Prototype Theory of categorization (Rosch 1975; Rosch and Mervis
1975) has been dominant in Cognitive Linguistics, many psychological and, more
recently, linguistic studies (e.g. Medin and Schaffer 1978; Bybee and Eddington 2006)
have demonstrated the crucial role of specific exemplars (or low-level schemata) in
category organization and development. This is why we also test whether the abstract
semantic differences between the constructions still hold if we take into account the
lexemes that fill the effected predicate slot, many of which display a strong preference
for doen or laten. The method applied for this purpose is mixed-effect modelling with
the general semantic and other factors as fixed effects and the specific effected predicates as random effects.
The chapter has the following structure. First, we give a brief introduction of the
Dutch causative constructions. In Section 3, the data and the potentially relevant variables are presented. Section 4 reports the results of the multiple logistic regression
analysis and additional tests, which are interpreted linguistically and cognitively in
Section 5. The chapter ends with a summary of our findings.
2. Dutch causative constructions
Modern standard Dutch has two periphrastic causatives with the infinitive: the constructions with doen ‘do’ and laten ‘let’. They share the same schematic pattern: an
initiator causes another entity to acquire a state or perform an action. Consider example (1):
(1) De politie deed/liet de auto stoppen.
    the police did/let the car stop
    'The police stopped the car'.
Using the terminology from Kemmer and Verhagen (1994), de politie ‘the police’, is
the causer of the event; de auto ‘the car’ is the causee that performs the action specified
by the effected predicate stoppen ‘stop’. The forms deed and liet are the past forms of
the causative auxiliaries doen and laten, respectively. The most striking feature of the
Dutch causatives is that laten as a causative auxiliary can refer both to the enabling
and coercive types of causation (see Verhagen and Kemmer 1997: 69). Compare the
situations in (2a), (2b) and (2c):
(2) a. De trainer liet de spelers loopoefeningen doen. [coercive]
       the coach let the players running-exercises do
       'The coach made the players do running exercises'.
    b. Hij liet iedereen zijn roman lezen. [ambiguous]
       he let everyone his novel read
       'He made/had/let everyone read his novel'.
    c. De politie liet de dader ontsnappen. [enabling]
       the police let the criminal escape
       'The police let the criminal escape'.
There have been a number of usage-based studies that have tried to establish the
differences between the constructions (Kemmer and Verhagen 1994; Verhagen and
Kemmer 1997; Degand 2001; Stukker 2005; Speelman and Geeraerts 2009). Verhagen
and Kemmer (1997) write about the semantic difference between doen and laten in
terms of the speaker’s conceptualization of the situation as direct or indirect causation, respectively. Direct causation means that “there is no intervening energy source
‘downstream’ from the initiator: if the energy is put in, the effect is the inevitable
result” (Verhagen and Kemmer 1997: 70). Indirect causation, which also includes the
situations of enablement and permission, emerges when the situation “can be conceptualized in such a way that it is recognized that some other force besides the initiator
is the most immediate source of energy in the effected event” (ibid.: 67).
Speelman and Geeraerts (2009) showed, in their multivariate analysis of the Corpus of Spoken Dutch, that there is also a substantial amount of geographic and register variation in the use of the constructions. From the conceptual point of view, it
was suggested that doen is an obsolescent form with a tendency towards semantic
and lexical specialization, most probably in direct physical causation (Speelman and
Geeraerts 2009: 200), although this hypothesis was not tested.
The highly abstract conceptual patterns, such as direct and indirect causation,
cannot be directly observed in a corpus-based study. Our aim is to explore a set of
independent contextual factors (“diagnostic features”, according to Speelman and
Geeraerts 2009) that can serve as indirect, or circumstantial, evidence of semantic differences between the constructions. These contextual factors, operationalized as independent variables in the logistic regression model, are listed in Section 3, as well as
the extralinguistic (geographic and thematic) variables that are explored in this study.
3. Data and variables
3.1
Data
The study is based on an 8 million token corpus of Netherlandic and Belgian Dutch,
compiled from the TwNC and LeNC newspaper corpora (2001–2002). The corpus
was balanced with regard to four subject domains of the articles: politics, economy,
football and music. We used a syntactically parsed version of the data, which was
obtained with the help of the Alpino parser of Dutch (Bouma et al. 2001). This allowed us to extract the contexts with constructions automatically. The contexts were
then checked manually to avoid spurious hits and formally similar but functionally
different constructions, such as the adhortative laten in Laten we gaan ‘Let’s go’. We
also excluded idiomatic expressions with effected predicates that do not occur independently, e.g. begaan, which only occurs in the set expression laten begaan ‘release,
give freedom’. After the manual cleaning, we were left with 6,808 observations, which
were then coded for seven semantic, syntactic, geographical and thematic variables
presented in the next section.
3.2
The response variable
The speaker’s choice for doen or laten in the given context was used as the binary
response variable. The distribution of the constructions in the data set was skewed towards laten, which occurred 5,636 times, while doen was used in only 1,172 contexts, roughly one fifth as often.
3.3
The linguistic predictors
The variable CrSem refers to the semantic class of the causer: animate (humans and
animals) or inanimate (material and abstract entities). All previous studies reported
the more frequent use of animate causers with laten and inanimate ones with doen.
Verhagen and Kemmer (1997) and Stukker (2005) studied the causer’s semantics only
in combination with the semantic class of the causee. They found, however, that inanimate causers in combination with both animate and inanimate causees tend to be used
more frequently with doen because these configurations correspond to physical and
affective causation types, respectively, which imply direct causation. The most typical configuration for laten, which normally represents inducive causation, consists
of animate causers and causees. This type of causation is indirect because humans
cannot influence other humans’ minds directly, telepathy disregarded (Verhagen and
Kemmer 1997: 71). The remaining possibility, the combination of animate causers
with inanimate causees, allows both for direct and indirect interference of the causer.
CeSem stands for the semantic class of the causee, which can also be animate or
inanimate. If the causer is not the main source of energy in indirect causation, then
it should most probably be the causee (cf. Stukker 2005). Thus, one could expect a
higher degree of animacy of the causee in the laten-construction in comparison with
doen. This variable has not been examined separately in any of the previous studies,
although from Verhagen and Kemmer’s (1997) description of causation types it follows that the chances for inanimate causees to be used with doen are somewhat higher
than for animate ones. Both explicit and implicit causees (see below) were classified,
depending on the context and the semantics of the effected predicate. Nevertheless,
we were unable to classify 13 cases with implicit causees, so we left those contexts out.
CdEventSem describes the semantic class of the caused event. It can be mental
or non-mental (physical or social). In case of metaphorical meaning, we assigned the
semantic class that corresponded to the target domain. For example, in (3) the caused
event was coded as mental:
(3) Het doet het belletje rinkelen.
    it makes the bell_DIM ring
    'It rings a bell'.
This variable was included to test whether doen is associated with physical causation, as Speelman and Geeraerts (2009) suggested.
EPTrans refers to the transitivity (including ditransitivity) or intransitivity of the
effected predicate. The previous studies showed that laten is more favoured by transitive verbs, which was regarded as evidence for the indirectness of the causative situations indicated by this construction because it involved a longer causation chain with
more participants.
The variable CeSynt was inspired by Kemmer and Verhagen’s (1994) observations
about the laten-construction. In some contexts, the causee in Dutch allows not only
for zero-marking as in (4a), but also for the prepositions aan and door, the dative and
instrumental/agentive markers in Dutch, respectively, as in (4b) and (4c):2
(4) a. Hij liet zijn vrouw zijn nieuwe gedicht lezen.
‘He made/let his wife read his new poem’.
b. Hij liet zijn nieuwe gedicht aan zijn vrouw lezen.
‘He let his wife read his new poem.’
c. Hij liet zijn nieuwe gedicht door zijn vrouw lezen.
‘He had his new poem read by his wife’.
d. Hij liet zijn nieuwe gedicht lezen.
‘He had his new poem read’.
Kemmer and Verhagen (1994) argue on the basis of cross-linguistic evidence that
prepositional or indirect-object marking of the causee implies a smaller degree of integration of the causee into the causative event and its lower affectedness in comparison with the default zero-marking (or, for personal pronouns, marking with the case
of the direct object). This smaller integration and affectedness is typical of indirect
causation. Therefore, we should expect prepositional marking to boost laten.
On the other hand, Kemmer and Verhagen also suggested that implicitness of the causee, as in (4d), implies even greater peripherality and non-affectedness of the causee (Kemmer and Verhagen 1994: 139). However, a more recent study by
Loewenthal (2003) has shown that implicit causees in the laten-construction have a
moderate degree of affectedness, although the peripherality claim may still hold. Considering all this, and also the low frequencies of the prepositional marking, we distinguished two levels of the predictor: “Central” (the causee is explicit and unmarked)
and “Peripheral” (the causee is implicit or marked with a preposition).
4. Statistical analysis
Multiple logistic regression allows us to model the speaker’s behaviour by taking into
account several factors that influence the speaker’s choice simultaneously. The analyses were carried out with the help of R statistical software (R Development Core
Team 2010). The first step of multivariate analysis is the selection of variables that
have an impact on the speaker’s choice. To select the relevant variables, we used the
forward and backward stepwise selection procedures based on Akaike’s Information
2. Note that all three marking options are available only for one effected predicate, lezen ‘read’.
The aan-marking is typical for verbs of perception and, consequently, mental causees, whereas
the preposition door normally marks agentive causees.
Table 1. Results of multiple regression (simple main effect model)

Predictor                    Estimate (log odds ratio)
(Intercept)                  –4.38 (p < 0.001)
CrSem = Inanimate             3.44 (p < 0.001)
EPTrans = Intransitive        1.48 (p < 0.001)
Country = BE                  0.68 (p < 0.001)
CdEventSem = Mental           0.79 (p < 0.001)
CeSynt = Peripheral          –0.90 (p < 0.001)
SubjectDomain = Football      0.12 (p = 0.38)
SubjectDomain = Music         0.45 (p < 0.001)
SubjectDomain = Politics      0.35 (p = 0.009)
Criterion (AIC). This criterion helps strike a balance between the predictive power of
a model and its parsimony. All predictors, except CeSem, entered the final model (see
Table 1).3
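How AIC trades fit against parsimony can be sketched in a few lines. This is our own illustration, not part of the original analysis: the log-likelihood below is back-calculated from the AIC of 3695.8 reported in Table 3, under the assumption that the main-effect model estimates 9 coefficients (the intercept plus the 8 dummy terms of Table 1).

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2*logL, penalizing complexity."""
    return 2 * n_params - 2 * log_likelihood

# Working backwards from the reported AIC of 3695.8 with an assumed
# 9 estimated coefficients: logL = (2*9 - 3695.8) / 2 = -1838.9.
ll_main = (2 * 9 - 3695.8) / 2
print(aic(ll_main, 9))  # recovers 3695.8

def better(model_a, model_b):
    """Each model is a (log_likelihood, n_params) pair; keep the lower AIC.
    A richer model wins only if its gain in log-likelihood outweighs the
    2-per-parameter penalty, which is what stepwise selection exploits."""
    return model_a if aic(*model_a) <= aic(*model_b) else model_b
```

An extra parameter that leaves the log-likelihood unchanged raises AIC by 2, so the stepwise procedure discards it.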
The order of the variables in the table reflects their importance in predicting the
speaker’s behaviour, as selected by the forward stepwise algorithm. The most important predictor is CrSem, and the least influential one is SubjectDomain. The column
with the estimates provides the log odds ratios of the doen-construction for the given
value of the predictor in comparison with the reference level (the values of the variables not mentioned in the table: the animate causer, the transitive effected predicate,
the Netherlands, the explicit zero-marked causee, the non-mental effected predicate,
and Economy as the article’s subject domain). If the log odds ratio is equal to 0, doen
and laten have equal chances to occur, which means that the predictor is not informative. A positive value means that the chances of doen are higher for the given value
in comparison with the reference level of the same predictor. A negative log odds
ratio, conversely, stands for relatively higher chances of laten. For example, the inanimate causer increases the log odds ratio of doen in comparison with the reference
level, the animate causer, by 3.40, which corresponds to the simple odds ratio of 29.96
(i.e. the chances for inanimate causers to occur in the doen-construction are almost
30 times as high as those of the animate causers). The p-values next to the estimate
demonstrate how confident one can be that the estimate is not equal to 0: the lower
the p-value, the more certain one can be. Conventionally, a value of α = 0.05 is used as
a cut-off point for significant effects.
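The arithmetic behind this interpretation can be sketched as follows (our illustration, not output of the chapter's model): a logistic-regression coefficient is a log odds ratio, and exponentiating it yields the simple odds ratio.

```python
import math

def log_odds_to_odds_ratio(b):
    """exp() of a logistic coefficient (log odds ratio) gives the odds ratio."""
    return math.exp(b)

def log_odds_to_probability(x):
    """Inverse logit: turn a summed linear predictor into a probability."""
    return math.exp(x) / (1 + math.exp(x))

# The chapter's worked example: a log odds ratio of 3.40 for inanimate
# causers corresponds to a simple odds ratio of about 29.96.
print(round(log_odds_to_odds_ratio(3.40), 2))  # 29.96

# A log odds ratio of 0 leaves the odds unchanged (odds ratio 1),
# i.e. that predictor level is uninformative.
print(log_odds_to_odds_ratio(0.0))  # 1.0
```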
The overall quality of the model is satisfactory, as the measurements in the left-hand column of Table 3 suggest. The most intuitive measure is the proportion of correct predictions of doen and laten by the model. We can correctly predict 90.1% of
3. Not all statisticians agree on the value of stepwise selection (e.g. Harrell 2001). However,
analysis of the full model and single term deletion tests yield the same model structure.
the speakers’ choices (the cut-off probability is set to 0.5). However, this is not the
most informative measure because laten is so frequent that if we simply predicted
laten for all contexts, we would be correct in 82.8% of the cases, which serves as the
baseline. There are special measures that neutralize the skewness, for example, the
index of concordance C, also called the area under the ROC curve (see Hosmer and
Lemeshow 2000: 160), which is 0.893 in our case. This number means that for all pairs
of contexts with doen and laten, the model assigns a higher probability to the auxiliary actually observed in the context in almost 90% of the cases. C is equal to 0.5 if
the predictions are random, and is equal to 1 if they are perfect. The related measures
are Somers’ Dxy = 0.787 (rank correlation between the predicted probabilities and observed responses ranging from 0 to 1) and Goodman-Kruskal’s Gamma = 0.795 (with
the range from –1 to 1). R2 is Nagelkerke’s generalized R2 index for logistic regression
models, which is analogous to the measure of explained variation in linear models. It
stands for a proportional reduction in the absolute value of the log-likelihood measure in comparison with the intercept-only model (see e.g. Menard 2001: 24–27 for
more details). It ranges from 0 (no predictive power) to 1 (a perfect fit of the data). For
this model, R2 = 0.523. All these values demonstrate that the model has a substantial
predictive power.
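The index of concordance can be computed directly from predicted probabilities. The sketch below uses a tiny invented dataset (not the chapter's model output) and relies on the identity Dxy = 2C − 1, which is consistent with the reported values up to rounding (2 × 0.893 − 1 ≈ 0.787).

```python
def concordance_c(probs, outcomes):
    """Index of concordance C (area under the ROC curve), computed as the
    proportion of concordant pairs: for every (doen, laten) pair of
    observations, does the model assign the higher doen-probability to the
    actual doen context?  Ties count as half-concordant."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]  # observed doen
    neg = [p for p, y in zip(probs, outcomes) if y == 0]  # observed laten
    concordant = sum(1.0 if p > q else 0.5 if p == q else 0.0
                     for p in pos for q in neg)
    return concordant / (len(pos) * len(neg))

# Toy predicted probabilities of doen (invented, purely illustrative).
probs    = [0.9, 0.6, 0.7, 0.2, 0.1]
outcomes = [1,   1,   0,   0,   0]

c = concordance_c(probs, outcomes)
dxy = 2 * c - 1  # Somers' Dxy relates to C as Dxy = 2C - 1
print(c, dxy)
```

Random predictions give C = 0.5 and hence Dxy = 0, matching the chance baselines described in the text.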
However, one more thing should be taken into account. The effect of some of the
predictors on the response variable may be non-additive, i.e. it cannot be explained by the sum of the effects of the predictors taken separately. An example of such an interaction from health care is a situation in which the same medical treatment
produces different effects on patients depending on their sex or age. In this study,
we focused on the interactions between the intralinguistic variables (semantic and
syntactic ones). We selected a model with interaction terms on the basis of AIC. The
procedure was performed in such a way that all five semantic and syntactic variables
in all possible combinations had a chance to occur in the model. One three-way interaction CrSem:EPTrans:CeSynt turned out to be significant. Table 2 displays the model
with the interaction (it also lists the relevant lower-order interaction terms).
In a model with interaction terms, interpretation of the coefficients is less straightforward because we can no longer estimate an independent impact of a predictor that
participates in an interaction without taking into account different levels of the other
variables with which it interacts. For example, the estimate for CrSem = Inanimate
in Table 2 should be interpreted as the combination of CrSem = Inanimate AND EPTrans = Transitive AND CeSynt = Central (the two latter terms are the reference levels of the corresponding variables).
Table 3 lists the summary statistics for the two models. One can see that the predictive power of the model with interactions is slightly better in comparison with the
main-effect only model, although not dramatically. This shows that our main-effect
only model was informative, but too coarse to deal with some combinations of the
predictor values.
Table 2. Model with main effects and three-way interactions

Predictor                                            Estimate (log odds ratio)
(Intercept)                                          –3.59 (p < 0.001)
CrSem = Inanimate
  (for EPTrans = Transitive and CeSynt = Central)     3.67 (p < 0.001)
EPTrans = Intransitive
  (for CrSem = Animate and CeSynt = Central)          0.41 (p = 0.051)
Country = BE                                          0.68 (p < 0.001)
CdEventSem = Mental                                   0.78 (p < 0.001)
CeSynt = Peripheral
  (for EPTrans = Transitive and CrSem = Animate)     –1.93 (p < 0.001)
SubjectDomain = Football                              0.16 (p = 0.27)
SubjectDomain = Music                                 0.49 (p < 0.001)
SubjectDomain = Politics                              0.34 (p = 0.014)
EPTrans = Intransitive: CeSynt = Peripheral
  (for CrSem = Animate)                               3.48 (p < 0.001)
CrSem = Inanimate: CeSynt = Peripheral
  (for EPTrans = Transitive)                         –0.31 (p = 0.437)
CrSem = Inanimate: EPTrans = Intransitive
  (for CeSynt = Central)                              0.26 (p = 0.459)
CrSem = Inanimate: EPTrans = Intransitive:
  CeSynt = Peripheral                                –2.60 (p < 0.001)
Table 3. Summary statistics for two models

Statistic                              Model without interactions      Model with interactions
Number of observations                 6795 (doen: 1170, laten: 5625)  6795
Proportion of correct predictions
  (baseline = 82.8%)                   90.1%                           90.3%
C                                      0.893                           0.91
Dxy                                    0.787                           0.821
Gamma                                  0.795                           0.829
Generalized R2                         0.523                           0.553
AIC                                    3695.8                          3525.2
Three-way interactions are hard to grasp intuitively, so we used another technique, named CART (Classification And Regression Trees), to visualize the interactions in a convenient way. The algorithm splits up the observations according to the
values of each predictor, trying to separate the observations with doen from those
with laten in the best possible way. It begins with the split that allows the cleanest
separation, and then proceeds with the resulting subsets, choosing the next best split.
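This greedy split selection can be sketched with the Gini impurity, rpart's default criterion for classification trees. The child counts below are summed from the leaves of Figure 1 (5111 + 318 animate-causer observations versus the rest), so the impurity figures are our own back-calculation, not output reported in the chapter.

```python
def gini(counts):
    """Gini impurity of a node given its class counts (here: laten, doen)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_impurity(children):
    """Impurity of a candidate split: size-weighted mean of child impurities."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

# Root node of the doen/laten data: 5625 laten, 1170 doen.
root = gini([5625, 1170])

# Candidate split CrSem = Anim, with (laten, doen) counts per child
# summed from the leaves of Figure 1: animate vs. inanimate causers.
crsem_split = weighted_impurity([(5111, 318), (514, 852)])

# CART keeps the split with the largest impurity reduction, then
# recurses on the resulting subsets.
print(round(root, 3), round(crsem_split, 3))  # 0.285 0.182
```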
[Figure 1 shows the classification tree. The root splits on CrSem = Anim: animate causers end in a Laten leaf (5111 laten / 318 doen). Inanimate causers are split on EPTrans = Tr: intransitive effected predicates end in a Doen leaf (187 / 669), while transitive ones are split on CeSynt = Periph into a Laten leaf (307 / 122) for peripheral causees and a Doen leaf (20 / 61) for central causees.]
Figure 1. Classification tree for semantic and syntactic variables
The procedure then cross-validates the resulting tree against different subsets of data,
selecting the most parsimonious model with the purest “leaves”.
Figure 1 shows the classification tree for our dataset (only the semantic and syntactic variables took part in the classification). It was implemented with the help of the
rpart package in R.4 The minimum number of observations allowed in a split was 20;
the algorithm performed 10 cross-validations. Under these conditions, only three of
the five linguistic features take part in splits: CrSem, EPTrans and CeSynt. Recall that
they are also the ones that interact significantly in the regression model. Each split is
labeled with a decision rule, e.g. “CrSem=Anim”. If the condition is met, i.e. for all animate causers, one should follow the left branch; otherwise, the right branch should be
explored. The names of the leaves display the predominant construction in the group;
the numbers below stand for the number of laten- and doen-observations. The error
rate of the classification was low: 9.5%, in comparison with 17.2% if we simply always
predicted laten as the default auxiliary (cf. the baseline in Table 4).
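Both error rates can be reproduced from the leaf counts of Figure 1; this recomputation is ours, not part of the original analysis. In each leaf the tree predicts the majority class, so the minority count is misclassified.

```python
# (laten, doen) counts in the four leaves of Figure 1.
leaves = [(5111, 318), (307, 122), (20, 61), (187, 669)]

total = sum(l + d for l, d in leaves)
# Each leaf predicts its majority class, so the minority count
# in every leaf is the number of misclassified observations.
errors = sum(min(l, d) for l, d in leaves)
error_rate = errors / total

# Baseline: always predict laten, so every doen observation is an error.
doen_total = sum(d for l, d in leaves)
baseline_rate = doen_total / total

print(round(error_rate * 100, 1), round(baseline_rate * 100, 1))  # 9.5 17.2
```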
The first observation one can make is that doen needs more conditions to be met
(inanimate causer AND intransitive effected predicate, or, in very few cases, inanimate
causer AND transitive effected predicate AND explicit unmarked causee), whereas
the animate causer alone is sufficient to obtain a leaf with a high probability
of laten, which also contains the largest number of observations. In addition, laten
4. An alternative solution is to use conditional inference trees. Their main advantage is that
they neutralize the bias towards covariates with many possible splits (see Hothorn et al. 2006).
However, all linguistic variables in the present analysis are binary, so this factor should not
cause problems.
emerges in more specific situations with the inanimate causer, transitive effected predicate and peripheral causee. The classification also tells us that the features EPTrans
and CeSynt are relevant for the classification only in the case of the inanimate causer.
For animate causers, these features are not powerful enough to influence the outcome.
In a similar way, the syntactic expression of the causee has a decisive effect only in the
case of an inanimate causer and a transitive effected predicate.
5. Linguistic interpretation of the statistical models
Some of our expectations based on the (in)direct causation hypothesis were confirmed by the marginal effects of the variables in the simple main effect model: inanimate causers, intransitive effected predicates and syntactically central causees do
favour doen. However, we found no indication that the semantic class of the causee
is significant, although one might expect that an animate causee is a better candidate
for an indirect causation event because it is the main source of energy in the causation process. This lack of evidence creates a dilemma that is common in empirical
studies (Geeraerts 1999). On the one hand, it may cast doubt on the indirect-direct
causation hypothesis in the way it was formulated above. Alternatively, one could
question our operationalization of the causee’s role in terms of animacy or inanimacy.
The latter scenario seems to be more reasonable. The fact that doen is preferred by
mental caused events suggests that the causation categorized with doen frequently involves animate causees as experiencers. In contrast, the laten-construction, preferred
by more dynamic non-mental caused events, contains animate causees, who can play
a more active, agentive role. Therefore, other, more sophisticated ways of determining
the causee’s role could be helpful.
The observed preference for doen with mental caused events is unexpected. In combination with inanimate causers, doen seems to be highly associated with affective
causation, which involves a stimulus (causer) that triggers a mental reaction of an
experiencer (causee). This behaviourist-like causation type is very direct. However,
Verhagen and Kemmer’s (1997) theory did not predict the predominance of affective
causation within the semantics of doen; Speelman and Geeraerts (2009) even spoke
about direct physical causation as doen’s specialization.
Next, we calculated the probabilities of doen and laten for every configuration of
the linguistic features, as predicted by the model with interactions. To do so, we first
calculated the sum of the relevant estimates provided in Table 2, including the intercept value, and then transformed the resulting log odds ratios into probabilities.5 The
5. According to the formula P = exp(x)/(1+exp(x)), where x is the sum of the log odds ratios
(coefficients) for all variables (the relevant values) and the intercept.
216 Natalia Levshina, Dirk Geeraerts, and Dirk Speelman
configurations with the highest probabilities of doen and laten can be illustrated by
contexts (5) and (6), respectively.
(5) Het artikel (…) deed mij terugdenken aan mijn ontmoeting met de Algerijnse
ambassadeur in Brussel, voorjaar 1972 op een receptie in Den Haag.
‘The article (…) made me think back to my meeting with the Algerian ambassador in Brussels in the spring of 1972 at a reception in The Hague’.
(6) Prinses Juliana zou in de jaren 60 liefdesbrieven aan haar dochters hebben laten
analyseren door een grafoloog.
‘They say that in the 1960s Princess Juliana had love letters to her daughters
studied by a graphologist’.
Context (5), where the probability of doen is 89.7%, contains an inanimate causer,
an intransitive effected predicate, an explicit unmarked causee and a mental caused
event. This is a typical example of affective causation. Context (6) illustrates the configuration with the highest probability (99.2%) of laten, combining an animate causer,
a transitive effected predicate, a peripheral causee and a non-mental caused event.
The example evokes the service frame, when the causer uses the causee’s professional
services to have some work done. Note that both examples contain animate causees,
who play different semantic roles (an experiencer and agent, respectively).
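The probability figures cited for (5) and (6) follow from the inverse-logit transformation given in footnote 5. A minimal sketch is given below; the coefficient values are invented for illustration and are not the estimates from Table 2.

```python
import math

def to_probability(x):
    """Inverse logit: P = exp(x) / (1 + exp(x))."""
    return math.exp(x) / (1 + math.exp(x))

# Hypothetical sum of intercept and coefficients for one feature
# configuration (invented values, not the paper's Table 2 estimates):
log_odds = -2.0 + 3.1 + 1.1   # intercept + two feature coefficients
p_doen = to_probability(log_odds)
```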
In his linear discriminant analysis of verb particle placement, Gries (2003) interprets the clusters of attributes with the highest distinctive load (the highest discriminant
scores of the sentences) as prototypes of each of the two constructions that he contrasts. Can we follow this approach and claim that contexts like (5) and (6) exemplify
the prototypes of the constructions with doen and laten, respectively?
Indeed, it was shown by Rosch and Mervis (1975) that the features that help
maximally distinguish the given category from the others are also the features that
are maximally shared between the members of the same category. The presence of
these features, operationalized as a family resemblance score of a category member,
also correlated positively with the member’s prototypicality in the category based on
typicality ratings. This could lead us to the conclusion that the doen- and laten-observations such as (5) and (6), with the most distinctive features, should also be the best
representatives of the categories from the intracategorial perspective.
However, the more recent studies of natural language categories (e.g. Ceulemans
and Storms 2010) show that these kinds of salience do not always correlate. The results
of additional experiments (Levshina 2011) show that significant positive correlations
between the intra- and intercategorial types of salience operationalized in several different ways are observed only in the case of doen, and not in the case of laten. A possible explanation is that laten, which is used much more frequently than doen, is also
a more heterogeneous category. In a polysemous category, the sense that is the most
semantically distant from a contrasting category may not be central intracategorially. In addition, one can imagine that the distinctive features of laten with regard to
another construction may be different from those with regard to doen. All this means
that we should be very specific about the perspective and operationalization when
using the term ‘prototype’. The configurations of the distinctive features exemplified
by the sentences (5) and (6) are thus distinctive corpus-based ‘prototypes’ with regard
to the choice between doen and laten modelled with the help of logistic regression.
In addition, the analyses reveal that doen is indeed quantitatively and semantically restricted, as Speelman and Geeraerts (2009) wrote. This can be illustrated by the
classification tree in Figure 1, which shows clearly that a larger number of semantic
and syntactic conditions should be satisfied for doen to have more chances to occur in
a context than laten. Therefore, doen seems to have more Gestalt-like semantics than
laten, which has a looser set of semantic features. This conclusion can be supported
by the fact that laten is a highly schematic auxiliary with a semantic range from permission to coercion.
So far, we have not discussed the behaviour of the Subject Domain variable. The
effect of the topic on the distribution of doen and laten was not predicted by any of
the previous hypotheses. It would be natural to assume that different topics differ
with regard to lexicon. This is why we looked at the top five most popular effected
predicates in the four subject domains, which are shown in Table 4 with their relative
frequencies.
One can see that the four topics are dominated by the same highly frequent verbs:
zien ‘see’, weten ‘know’, horen ‘hear’, denken ‘think’, liggen ‘lie’, vallen ‘fall’, gaan ‘go’
with different relative frequencies. This might not be a serious problem if these highly frequent verbs did not demonstrate a pronounced preference for either laten or
doen. For instance, weten is a typical laten-verb, with 450 occurrences in our data,
all of them with laten. Some previous research, e.g. Levshina et al. (2009), which was
based on collostructional analysis (Stefanowitsch and Gries 2003), also showed that
attraction between the effected predicates and the constructions (auxiliaries) is indeed very strong. This is why the differences in the relative frequencies of these influential predicates may have an effect on the distribution of doen and laten in the
subject domains. One of the ways to capture this idiomatic difference is to incorporate the lexical “noise” in the model as random effects. This method, called mixed-effect modelling, has proved to be a powerful tool in linguistic research, especially in
Table 4. Top five most frequent effected predicates in the four subject domains

Economy               Football             Music                 Politics
zien ‘see’ 13%        liggen ‘lie’ 9%      horen ‘hear’ 11%      weten ‘know’ 10%
weten ‘know’ 11%      zien ‘see’ 8%        denken ‘think’ 8%     zien ‘see’ 6%
liggen ‘lie’ 5%       weten ‘know’ 5%      zien ‘see’ 4%         vallen ‘fall’ 4%
vallen ‘fall’ 3%      vallen ‘fall’ 4%     klinken ‘sound’ 3%    gaan ‘go’ 2%
stijgen ‘go up’ 3%    spelen ‘play’ 2%     weten ‘know’ 2%       denken ‘think’ 2%
psycholinguistic experiments with individual subject- and item-related noise (see a
variety of case studies in Baayen 2008). It is also helpful in corpus-based studies when
idiosyncrasies of individual words cannot be handled with the help of coarse-grained
semantic classifications. Using the lmer function from the lme4 package in R, we fit a mixed-effect model with the
effected predicates (1,165 types represented by 6,795 tokens) as random effects. By
doing so, we “tell” the model that some effected predicates may inherently prefer doen,
and that some verbs may prefer laten. The algorithm slightly lowers or increases the
value of the intercept for each verb depending on its preferences in the data. It also
takes into account the frequency of the verb in the data set. Ideally, we would have
to do the same for the constructional slots of the causer and the causee. However,
application of this method is less straightforward for nominal slots because of pronominal reference and lower type-token ratios. There is also evidence that the ties between nouns
and constructions in which they appear are weaker than those between constructions
and verbs (cf. Tomasello et al. 1997).
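The intuition behind verb-wise random intercepts can be conveyed with a toy calculation. The sketch below is an empirical-Bayes-style approximation, not the estimation procedure lmer actually performs: each verb's log-odds of doen is pulled toward the overall log-odds, and frequent verbs retain more of their own preference than rare ones.

```python
import math
from collections import defaultdict

def shrunken_intercepts(observations, prior_weight=5.0):
    """Toy sketch of verb-wise intercept adjustment: shrink each verb's
    log-odds of doen toward the overall log-odds by adding prior_weight
    pseudo-observations at the overall rate. Rare verbs stay near the
    grand mean; frequent verbs keep their own preference. (A real mixed
    model estimates this jointly with the fixed effects.)"""
    n_doen = sum(1 for _, outcome in observations if outcome == "doen")
    p_overall = n_doen / len(observations)
    grand = math.log(p_overall / (1 - p_overall))
    counts = defaultdict(lambda: [0, 0])          # verb -> [doen count, total]
    for verb, outcome in observations:
        counts[verb][1] += 1
        counts[verb][0] += outcome == "doen"
    adjustments = {}
    for verb, (k, m) in counts.items():
        p = (k + prior_weight * p_overall) / (m + prior_weight)
        adjustments[verb] = math.log(p / (1 - p)) - grand
    return adjustments
```

On this logic, a verb such as weten, attested 450 times and always with laten, would receive a strongly negative adjustment, whereas a verb seen only twice would remain close to the grand mean.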
Fitting a mixed-effect model (with main effects only) yields the results shown
in Table 5. The factors and the tendencies that we had in the corresponding model
without random effects remain very similar, with the exception of the subject domain,
which ceases to contribute substantially to the model’s performance (according to
AIC). Therefore, we can conclude that the difference in probabilities of doen and laten
across different topics is due to the lexical effects. Also, most of the absolute values of
the coefficients in the mixed model are slightly higher than those in the fixed-effect
only model (see Table 1) because we have filtered out some part of the lexical “noise”,
which caused overfitting in the initial model.
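The AIC used for this model comparison trades goodness of fit against the number of parameters; a minimal sketch follows, with invented log-likelihood values purely for illustration.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: lower values indicate a better
    trade-off between goodness of fit and model complexity."""
    return 2 * n_parameters - 2 * log_likelihood

# Invented illustration: dropping a predictor that adds little fit
# can lower the AIC even though the log-likelihood gets slightly worse.
with_domain = aic(log_likelihood=-1800.0, n_parameters=9)
without_domain = aic(log_likelihood=-1801.0, n_parameters=6)
```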
The model demonstrates that the abstract features related to direct or indirect
causation are still significant when conditioned on the lexical effects (cf. Bresnan et al.
2007: 87). At the same time, additional tests show that the random-effect model alone
would correctly predict the choice of doen and laten in a vast majority
of cases (78% for the Netherlandic subcorpus and 74% for the Belgian data). This can
be seen as evidence of strong exemplar effects at the level of the effected predicates.
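The reported rates correspond to simple classification accuracy under a 0.5 probability threshold; the sketch below uses hypothetical helper names of our own.

```python
def predict_choice(p_doen, threshold=0.5):
    """Pick the auxiliary whose predicted probability exceeds the threshold."""
    return "doen" if p_doen > threshold else "laten"

def accuracy(doen_probabilities, observed):
    """Share of contexts where the predicted auxiliary matches the corpus."""
    predictions = [predict_choice(p) for p in doen_probabilities]
    return sum(p == o for p, o in zip(predictions, observed)) / len(observed)
```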
Table 5. Logistic regression model with effected predicates as random effects

Predictor                    Estimate (log odds ratio)
(Intercept)                  –5.95 (p < 0.001)
CrSem = Inanimate             4.22 (p < 0.001)
EPTrans = Intransitive        2.25 (p < 0.001)
Country = BE                  0.76 (p < 0.001)
CdEventSem = Mental           1.11 (p < 0.001)
CeSynt = Peripheral          –0.83 (p = 0.003)
SubjectDomain = Football     Not Available
SubjectDomain = Music        Not Available
SubjectDomain = Politics     Not Available
However, the best-performing model is the one with both the abstract and the lexical
features. This finding is perfectly in line with the non-reductionist constructionist
approach to language, which assumes that high-level generalizations coexist with
low-level schemata in the speaker’s knowledge about constructions (Langacker 1987;
Goldberg 1995, 2006).
6. Conclusion
In this multivariate corpus-based onomasiological probabilistic study, we used logistic regression to find the factors that influence the choice between the causative constructions with doen and laten by speakers of Dutch. The analyses showed that the
highest probability of doen is observed in the contexts of affective causation: an inanimate stimulus causing a conceptually and syntactically central cognizer to experience
some mental state (an intransitive event). Conversely, the laten-construction has the
highest chances of being observed when the causer is animate, the effected predicate is transitive, the causee is implicit or marked with a preposition (syntactically
and conceptually peripheral), and the caused event is non-mental. This configuration
can be exemplified by the service frame. The combinations of these features, which
have the highest cue validity with regard to the choice between the categories, can
be regarded as the distinctive prototypes of the constructions. Their relations with
the other salience phenomena, such as family resemblance, goodness of membership,
entrenchment, etc., are to be explored empirically, although there are indications that
the inter- and intracategorial typicality measures tend to correlate for compact categories without rich polysemy (in our case, doen).
We also found evidence of exemplar effects in categorization at the level of the
lexemes that fill in the effected predicate slot, which can serve as powerful predictors
of the speaker’s choice on their own. However, the best prediction is achieved when
the model combines both the above-mentioned semantic generalizations and the
lexemes. This supports the constructionist hypothesis that the mind stores linguistic
knowledge at different levels of generalization.
In addition, the results show that the doen-construction has more chances of being chosen by a Flemish speaker than by a Dutch one, which supports the previous
findings by Speelman and Geeraerts (2009) for spoken data and can be explained
historically (the Flemish variety is believed to retain more archaic features). The constructions also display different behaviour across the four subject domains, which, as
the mixed-effect model demonstrates, can be explained by the domain-specific differences in the distribution of the effected predicates.
References
Baayen, R. H. (2008). Analysing linguistic data: A practical introduction to statistics using R.
Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511801686
Bouma, G., van Noord, G., & Malouf, R. (2001). Alpino: Wide-coverage computational analysis of Dutch. In W. Daelemans, K. Sima’an, J. Veenstra, & J. Zavrel (Eds.), Computational
linguistics in the Netherlands 2000: Selected papers from the Eleventh CLIN meeting (pp. 45–
59). Amsterdam: Rodopi.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, R. H. (2007). Predicting the dative alternation. In
G. Bouma, I. Kraemer, & J. Zwarts (Eds.), Cognitive foundations of interpretation (pp. 69–
94). Amsterdam: Royal Netherlands Academy of Science.
Bybee, J. (2010). Language, usage and cognition. Cambridge: Cambridge University Press.
DOI: 10.1017/CBO9780511750526
Bybee, J., & Eddington, D. (2006). A usage-based approach to Spanish verbs of ‘becoming’.
Language, 82(2), 323–355. DOI: 10.1353/lan.2006.0081
Ceulemans, E., & Storms, G. (2010). Detecting intra and inter categorical structure in semantic
concepts using HICLAS. Acta Psychologica, 133(3), 296–304.
DOI: 10.1016/j.actpsy.2009.11.011
De Sutter, G. (2009). Towards a multivariate model of grammar: The case of word order variation in Dutch clause final verb clusters. In A. Dufter, J. Fleischer, & G. Seiler (Eds.),
Describing and modeling variation in grammar (pp. 225–254). Berlin & New York: Mouton
de Gruyter.
Degand, L. (2001). Form and function of causation. A theoretical and empirical investigation of
causal constructions in Dutch. Leuven: Peeters.
Geeraerts, D. (1999). Idealist and empiricist tendencies in cognitive semantics. In T. Janssen,
& G. Redeker (Eds.), Cognitive linguistics: Foundations, scope and methodology (pp. 163–
194). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110803464.163
Geeraerts, D. (2006). Salience phenomena in the lexicon: A typology. In D. Geeraerts (Ed.),
Words and other wonders: Papers on lexical and semantic topics (pp. 74–97). Berlin & New
York: Mouton de Gruyter. DOI: 10.1515/9783110219128.1.74
Glynn, D., & Fischer, K. (2010). Quantitative methods in Cognitive Semantics: Corpus-driven
approaches. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226423
Goldberg, A. E. (1995). Constructions: A construction grammar approach to argument structure.
Chicago: University of Chicago Press.
Goldberg, A. E. (2006). Constructions at work: The nature of generalizations in language.
Oxford: Oxford University Press.
Gries, St. Th. (2003). Multifactorial analysis in corpus linguistics: A study of particle placement.
New York: Continuum.
Grondelaers, S., Geeraerts, D., & Speelman, D. (2007). A case for a cognitive corpus linguistics.
In M. Gonzalez-Marquez, I. Mittleberg, S. Coulson, & M. Spivey (Eds.), Methods in cognitive linguistics (pp. 149–169). Amsterdam & Philadelphia: John Benjamins.
Harrell, F. E. (2001). Regression modelling strategies with applications to linear models, logistic
regression, and survival analysis. Heidelberg & New York: Springer.
Heylen, K. (2005). A quantitative corpus study of German word order variation. In S. Kepser,
& M. Reis (Eds.), Linguistic evidence: Empirical, theoretical and computational perspectives
(pp. 241–264). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110197549.241
Hosmer, D. W. & Lemeshow, S. (2000). Applied logistic regression. New York: Wiley.
DOI: 10.1002/0471722146
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional
inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
DOI: 10.1198/106186006X133933
Kemmer, S., & Verhagen, A. (1994). The grammar of causatives and the conceptual structure of
events. Cognitive Linguistics, 5(2), 115–156. DOI: 10.1515/cogl.1994.5.2.115
Langacker, R. W. (1987). Foundations of cognitive grammar: Volume I: Theoretical prerequisites.
Stanford: Stanford University Press.
Levshina, N. (2011). Doe wat je niet laten kan: A usage-based analysis of Dutch causative constructions [Do what you cannot let: A usage-based analysis of Dutch causative constructions]. Unpublished doctoral dissertation, University of Leuven.
Levshina, N., Geeraerts, D., & Speelman, D. (2009). Collostructional analysis of Dutch causative constructions. Paper presented at the Third International AFLiCo Conference, 28 May,
Paris.
Loewenthal, J. (2003). Meaning and use of causeeless causative constructions with laten in
Dutch. In A. Verhagen, & J. van de Weijer (Eds.), Usage-based approaches to Dutch (pp. 97–
130). Utrecht: LOT.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological
Review, 85(3), 207–238. DOI: 10.1037/0033-295X.85.3.207
Menard, S. (2001). Applied logistic regression analysis. Thousand Oaks: Sage.
R Development Core Team (2010). R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria. <http://www.R-project.org>.
Rosch, E. (1975). Cognitive representation of semantic categories. Journal of Experimental Psychology: General, 104(3), 192–233. DOI: 10.1037/0096-3445.104.3.192
Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7(4), 573–605. DOI: 10.1016/0010-0285(75)90024-9
Speelman, D., & Geeraerts, D. (2009). Causes for causatives: The case of Dutch doen and laten.
In T. Sanders, & E. Sweetser (Eds.), Causal categories in discourse and cognition (pp. 173–
204). Berlin & New York: Mouton de Gruyter.
Stefanowitsch, A., & Gries, St. Th. (2003). Collostructions: Investigating the interaction between words and constructions. International Journal of Corpus Linguistics, 8(2), 209–243.
DOI: 10.1075/ijcl.8.2.03ste
Stukker, N. (2005). Causality marking across levels of language structure. Unpublished PhD dissertation, University of Utrecht.
Tomasello, M., Akhtar, N., Dodson, K., & Rekau, L. (1997). Differential productivity in young
children’s use of nouns and verbs. Journal of Child Language, 24(2), 373–87.
DOI: 10.1017/S0305000997003085
Tummers, J., Heylen, K., & Geeraerts, D. (2005). Usage-based approaches in cognitive linguistics: A technical state of the art. Corpus Linguistics and Linguistic Theory, 1(2), 225–261.
DOI: 10.1515/cllt.2005.1.2.225
Verhagen, A., & Kemmer, S. (1997). Interaction and causation: Causative constructions in
modern standard Dutch. Journal of Pragmatics, 27(1), 61–82.
DOI: 10.1016/S0378-2166(96)00003-3
The semasiological structure
of Polish myśleć ‘to think’
A study in verb-prefix semantics
Małgorzata Fabiszak, Anna Hebda, Iwona Kokorniak,
and Karolina Krawczak
Adam Mickiewicz University, Poznań
The aim of the present chapter is to investigate the semasiological structure of
the Polish verb myśleć ‘to think’ relative to the construal imposed by its prefixes.
It juxtaposes the results of cognitive linguistic corpus-based introspective analysis with the results of a series of statistical tests of almost 4,000 manually coded
example sentences. Multiple correspondence analysis employs the techniques
presented in Glynn (this volume), hierarchical cluster analysis follows Divjak
and Fieller (this volume) and logistic regression is based on Speelman (this
volume). The study shows that the do- prefix has a strong preference for clausal,
processual complements, while the wy- prefix opts for nominal complements,
suggesting objectifiable results of the thinking process.
Keywords: hierarchical cluster analysis, logistic regression, multiple
correspondence analysis, Polish verbal prefixes, Polish verb myśleć ‘to think’
1. Introduction
Given that it is a universal human capacity to think and to verbalize one’s thoughts
in an intersubjective context, it is a reasonable assumption that a lexeme designating such a cognitive activity and state would be present in every language (Fortescue
2001: 15). However, despite its universality, an accurate explanation of the meaning
and use of a cognition verb, such as myśleć in Polish or think in English, is a particularly challenging task. The difficulty in representing the semantic structure of such predicates stems from their conceptual schematicity. This schematicity is elaborated by
instantiation in context to produce any one of the verb’s multiple senses. Along lexical
and discursive lines, it encodes various loads of intentionality, modality, subjectivity,
and in some of its uses, it is highly grammaticalized. Through detailed corpus-driven
conceptual analysis, our project takes up the challenge of systematically describing
the semantic structure of prefixed think verbs in Polish.1 Specifically, the analysis examines the contribution of six prefixes to the semantics of the mental predicate relative to its object. The first stage of the study employs introspective analysis in order
to establish a set of testable hypotheses about the semantic contribution of these
highly schematic prefixes. The second stage turns to a quantitative analysis of a large
number of occurrences, annotated manually for an array of grammatical and semantic features. The ‘behavioural profiles’ of the verbs that result from this analysis are
identified through subsequent multivariate statistical modeling (Geeraerts et al. 1994;
Gries 2006; Glynn 2010b).
Thinking, in most general terms, can be perceived as “internalized (and abbreviated) speech”, which is thus tantamount to “self-awareness” (Fortescue 2001: 17f.).
This “private, internal activity” can be further specified into at least three kinds of processes, which consist of “evaluating” someone or something, “believing in the truth of
a proposition” or “‘mulling over’ some mental content” (Fortescue 2001: 30).
On a metalinguistic plane, ‘think’ can be regarded as a language-independent
semantic universal (Wierzbicka 1992, 1996, 1997, 1999), which has recently been corroborated in several cross-linguistic studies, e.g. by Amberber (2003) on Amharic
or by Junker (2003) on East Cree, an Algonquian language. The question remains,
however, as to which senses of ‘think’ should be universally constitutive of the core of
human thought (Fortescue 2001: 32). It is, after all, indisputable that language-specific
exponents of think are characterized by polysemy and that they enter onomasiological relations by way of “lexical elaboration” (Goddard 2003).
Attempts have also been made to reduce such presumably irreducible concepts. Jackendoff (1983: 234ff., as cited in Fortescue 2001: 32) factorized such mental
predicates into the formula [be [rep[x]][in y’s mind]]. This formula is similar to the
metaphorical representation of the thought process, as analysed within Conceptual
Metaphor Theory, where the mind is understood as a container for thoughts-objects
(Lakoff and Johnson 1980; Szwedek 2007). Perhaps a good solution would be to view
the mental verb ‘think’ as belonging to “a network of experiential categories” or as a
“radial” or “polycentric” category, whose most focal sense should be “the lowest common denominator” shared by all senses of ‘think’ (Fortescue 2001: 32–35).
1. The EmBeR project focuses on the conceptualization of abstract categories, such as
Em(otion), Be(lief), and R(eason) in Polish and English. It uses corpus-driven techniques
and multivariate statistics in an effort to capture the complex semantic structure of the three
categories. Team members include Małgorzata Fabiszak, Anna Hebda, Iwona Kokorniak,
Karolina Krawczak, and Barbara Konat at Adam Mickiewicz University, Poznań, Poland. We
are greatly indebted to Dylan Glynn for introducing us to the methods of multivariate statistics.
Our thanks also extend to three anonymous reviewers and the editors of this volume for their
constructive comments. Any remaining shortcomings remain our own.
Linguistically speaking, ‘think’ is indeed characterized by semantic generality and
impoverishment (Danielewiczowa 2002: 131), but its conceptual schematic structure
is contextually expanded (Kustova 2000: 250). It is therefore of utmost importance to
examine the structural and semantic characteristics of this mental verb, bearing in
mind that even seemingly negligible differences in verb valency may be significant
(Danielewiczowa 2000: 231). There is at least a twofold distinction to be drawn in
the semasiological structure of ‘think’ between states and activities, which is formally
realized in English in the use of aspects, where the progressive can only be applied
to non-stative mental events (e.g. Vendler 1967: 110f.; D’Andrade 1987, 1995). This
bipartite division leads to the treatment of the mind as a “processor” or “container”
(D’Andrade 1987, as cited in Palmer 2003: 270). Goddard (2003: 112) goes beyond
Vendler’s (1967) bifurcation into think about and think that, and specifies the semantic prime think into four subgroups, depending on their conceptual syntactic features, whereby:

a. X thinks about Y [topic of thought]
b. X thinks something (good/bad) about Y [complement]
c. X thinks like this: ——— [quasi-quotational complement]
d. X thinks that [——— ]S [propositional complement]. (Goddard 2003: 112)
Danielewiczowa (2000) proposes a further distinction, showing that thinking can also
be dynamic, intentional, knowledge-driven, factual or hypothetical. It can be conscious or overwhelmingly intensive and subconscious, it can induce change in the
subject, be a point-of-time or period-of-time event, refer to an object or be self-oriented. The object of thinking can be abstract or concrete, factual or hypothetical,
verifiable or non-verifiable (Danielewiczowa 2002: 132f.). It can vary considerably in
terms of content, length, cohesion, origin, and structure (Pawłowska 1981). Object
semantics will also be affected by how the object is introduced. For example, the complementizer że [that] is likely to specify the object more than pronouns such as co
[what], gdzie [where], dlaczego [why], which only hint at what the object relates to
(Pawłowska 1981).
Insofar as the grammatical form of mental verbs influences their “semantic potential” and “pragmatic effects” (Danielewiczowa 2000: 265), their aspect, object form
and object semantics will naturally entail changes in the function and profiling of such
predicates. It is the aim of the present study to identify the semasiological organization of and onomasiological relations between the instantiations of think relative to
the construal imposed by prefixes. Obviously, the schematic semantics of the unprefixed predicate contributes to the content of each of the prefixed forms. It constitutes
the superordinate meaning component, the “semantic primitive” or “building block”
of the subordinate expressions (Wierzbicka 1992: 9). Their semantico-pragmatic load
is specified and disambiguated by their prefixal modification. It is, therefore, expected
that the internal structure of such prefixed ‘think’ forms and the near-synonymous
relations holding between them will be particularly distinct.
Myśleć ‘think’, when unaccompanied by any prefix, is an imperfective verb. Its
aspect may change into the perfective one by means of adding a prefix, such as do- (domyślić się ‘guess sth’), po- (pomyśleć ‘think sth’), prze- (przemyśleć ‘think over’),
roz- (rozmyślić się ‘change one’s mind’), wy- (wymyślić ‘come up with’), za- (zamyślić
się ‘fall into deep thought’), and others. However, the ‘perfectivized’ verb form can
be further ‘imperfectivized’ through suffixation (cf. Comrie 1976), e.g. domyślać się
‘guess’, rozmyślać ‘ponder’, wymyślać ‘invent’. In the present contribution, the semantics of the aforementioned prefixes relative to their perfective and imperfective aspects will be considered in quantitative terms in relation to their combination with
objects. The procedure adopted here is in line with the tradition of corpus-driven
Cognitive Linguistics, as practised by Dirven et al. (1982), Geeraerts et al. (1994),
Gries (2003, 2006), Divjak (2006, 2010), and Glynn (2009, forthc.), to mention but a
few. In accordance with this tradition, introspective conceptual analysis, if crucial and
necessary, is viewed as insufficient, as it cannot produce falsifiable answers to research
questions. It is, therefore, essential for a large number of examples of the object of
study to be subjected to detailed manual conceptual analysis of their syntactic and
semantic characteristics. Once this stage is completed, the annotated data can be submitted to exploratory and, subsequently, confirmatory multivariate statistical modelling. The purpose of this procedure is to find and verify “patterns of usage features”
(Glynn 2009, 2010b), also known in the field as “feature configurations” (Geeraerts
et al. 1994) or “behavioural profiles” (Gries 2006; Divjak and Gries 2006).
Summing up, thinking in its simplicity and universality is a multidimensional
mental event of states and processes, which can be dynamic, factual, hypothetical,
consciously entertained by the subject, or unconscious, based on and expressing
knowledge, or unfounded, oriented toward an object or objectless, intentional or unintentional. All these features contribute to the holistic semasiological structure of
‘think’, rather than portioning it indefinitely. In this article we seek to account for
some of the active semantic components in Polish prefixed myśleć, mapping its meaning from usage events. Section 2 presents a conceptual linguistic analysis of the Prefix,
Aspect and Object Form relations, illustrated with examples from the PWN Corpus
of the Polish Language. Section 3 describes the corpus, Section 4 outlines the coding
procedure, and Section 5 presents the multivariate analysis. The results of the usage-feature analysis are submitted to the exploratory statistical techniques: multiple
correspondence analysis and hierarchical cluster analysis. Where applicable, the results are further supported by the confirmatory method of logistic regression analysis. The selection of these three statistical techniques is motivated by the different
affordances they offer to researchers. Correspondence analysis has been developed for
nominal categorical data and allows the visualisation of the correspondences between
multiple factors – linguistic features. The most promising tendencies observed in the
correspondence analysis maps are then further tested with the confirmatory statistical
method – logistic regression, which goes beyond the identification of correspondences and gives probability scores indicating which of the correspondences are significant
and what is the contribution of the individual features to the overall pattern. Finally,
cluster analysis offers a representation which can be best compared with the result of
the introspective conceptual analysis offered in Section 2.
2. Introspective conceptual analysis of the prefixed forms
of myśleć ‘to think’ in Polish
At first sight, the grammatical constructions that the verb myśleć ‘think’ and its
prefixed forms go into may appear to be arbitrary and unpredictable. However, as
Langacker (1991: 294) points out, the unpredictability of grammar arises from the objectivist approach to semantics. By contrast, the Cognitive Grammar approach finds:

semantic value in every one of its uses. Moreover, it is precisely because of their conceptual import – the contrasting images they impose – that alternate grammatical devices are commonly available to code the same situation. (Langacker 1991: 295)

Thus, a fine-grained analysis should elucidate the complex nature of grammatical structures (Langacker 1991: 294).
Myśleć ‘think’ can be considered an imperfective verb. Its aspect may be changed to perfective by adding a prefix, such as do- (domyślić się), po- (pomyśleć), prze- (przemyśleć), roz- (rozmyślić się), wy- (wymyślić), za- (zamyślić się), and
others. However, these ‘perfectivized’ forms may undergo suffixal imperfectivization,
resulting in the imperfective form, e.g. domyślać się, rozmyślać, wymyślać. The semantics of the aforementioned prefixes will be considered below and their combination
with objects will be accounted for.
The prefix do- may combine with verbs to indicate an approximation to a goal or
result (Śmiech 1986: 90–91). In cognitive terms, it represents the path image schema
in which the goal or final point of the path along which a movement takes place is
in focus. Reaching the goal may involve encountering certain difficulties along the
way, where the trajector (TR) makes every effort to achieve the goal regardless of any
obstacles. The intensity of the effort is observed by Dickey (2009), who calls do- verbs
“intensive-resultative”.
In domyśla/ić się the TR corresponds with the subject of the main clause. What
constitutes the landmark (LM) here is the metonymic gathering of information in the
course of the thinking activity. Whether the activity is complete or not is indicated, in
the case of domyśla/ić się, by the suffix, where -ać, as in (1a–b), construes an uncompleted process, and -ić, as in (2a–b), a completed process. The gathered information
228 Małgorzata Fabiszak et al.
in the LM position is usually rendered explicitly in sentences, i.e. either with a nominal phrase, as in (1a) and (2a), or with a clausal complement, as in (1b) and (2b).
(1) a. Nie domyślał się przyczyny sporu.
       [Not DO-thought-masc.3sg.impf refl cause-sg.gen argument-sg.gen]
       ‘He did not guess what the reason for the argument was.’
    b. Nie domyślał się, że ktoś jest w pokoju.
       [Not DO-thought-masc.3sg.impf refl that someone-nom is in room-sg.loc]2
       ‘He did not guess that someone was in the room.’
(2) a. Nietrudno było domyślić się przeznaczenia tego miejsca.
       [easily was DO-think-inf refl purpose-gen this-gen place-gen]
       ‘One could easily guess what the purpose of this place was.’
    b. Zaskoczony, zupełnie nie kontrolowałem wtedy twarzy i mogę się
       domyślić, co wyczytała z niej ta biedna brzydula
       [Surprised totally not controlled-masc.3sg.impf then face-gen and
       can-1sg.pres refl DO-think-inf what out-read-fem.3sg.perf from
       it-fem.gen this poor-fem-3sg ugly girl-nom]
       ‘Surprised, I did not control my facial expression and I can guess
       what she read in it, this poor ugly girl.’
The nominal phrase that domyśla/ić się can be combined with takes the genitive case, which in one of its uses may conceptualize a goal that is being aimed at (Rudzka-Ostyn 2000: 201). It is not due to chance, then, that do- predicates are called ‘attainment verbs’, as both their nominal complement and the prefix construe the attainment of a desired result (e.g. Dickey unpubl. manusc.: 15).
Additionally, the reflexive pronoun in domyśla/ić się indicates that the experiencer of the process taking the subject position is both the instigator and the passive
experiencer affected by the result of the process (Dąbrowska 1997: 93). The ‘beneficial’
or ‘favourable’ effect of the process on the subject has been observed by Przybylska
2. In the word-for-word gloss the prefixes are rendered with capital letters in their Polish
form. We follow here Tabakowska (2003a).
(2006: 60–61), as well as by Tabakowska (2003b: 15), who considers it rather natural
that “one does not, under normal circumstances, want to cause damage to oneself”.
Another prefix, po-, may form delimitative verbs to indicate (i) a short duration
of an action; (ii) a limited nature of an action; or (iii) when combined with the dative
reflexive pronoun sobie, an ‘affective’ state of the subject (Piernikarski 1975: 33). In
combination with myśleć, po- tends to indicate sense (ii). It may change the verb into
a perfective one, which according to Dickey’s (2000) ‘east-west aspect theory’ has the
meaning of temporal definiteness, where a temporally definite event “is unique in
the temporal fact structure of a discourse, i.e. (…) it is viewed as both (a) a complete
whole and (b) qualitatively different from preceding and subsequent states of affairs”
(Dickey and Hutcheson 2003: 27–28).3
When it comes to complementation, pomyśleć may take a clausal complement, as
in (3a), or a prepositional phrase with o ‘about’, and a nominal phrase in the locative
(LOC) case. It may refer to “dispersed motion in any direction relative to a landmark”
in the abstract temporal domain (Radden and Dirven 2007: 321), thus indicating that
the object of thinking is not directly focused on, as in (3b). In less frequent cases, the
object of thinking is profiled by means of nad ‘over’ with the nominal in the instrumental (INSTR) case, where the preposition, in one of its spatial senses, construes
a multiplex path with “a single point-like TR moving in a variety of directions (…)
to trace a multiplex collection of paths, which in turn can be construed as a mass”
(Dewell 1994: 375). Metaphorically, in relation to pomyśleć, this schema can be understood as considering alternative ways to follow before reaching a final solution, where
only the solution is mentioned explicitly, as in (3c):
(3) a. Z radością pomyślałaś, że spotkałaś mężczyznę, który lubi cię taką,
       jaka jesteś.
       [With joy-instr PO-thought-fem.2sg.perf that met-fem.2sg.perf man-gen
       who like-3sg.pres you-acc such like be-2sg.pres]
       ‘You have thought with joy that you met a man who likes you the way
       you are.’
    b. Gdyby miał normalne warunki do pracy, nawet nie pomyślałby o wyjeździe.
       [if had-masc.3sg normal conditions-acc to work-gen even not
       PO-thought-masc.3sg.cond about departure-loc]
       ‘If he had normal working conditions, he wouldn’t even think about
       going abroad.’
3. However, Śmiech (1986: 18–19) adds that a verb with the po- prefix, although taking a perfective form, may behave like an imperfective one. In the case of pomyśleć, the process seems to
be completed and the verb form is perfective.
    c. Gdyby chociaż można było gdzieś tam dostać pracę to może pomyślałbym
       nad powrotem
       [if at least can-3sg.pres was-3sg somewhere there get-inf job-acc
       then perhaps PO-thought-masc.3sg.cond over comeback-instr]
       ‘If it were possible to get a job somewhere there, I would think about
       coming back.’
Prze-, yet another prefix, in the spatial domain may depict a three-dimensional and
bounded LM, such as a tunnel in which the TR moves from one end to the other,
where the TR “gradually fills the whole volume of the landmark” (Pasich-Piasecka
1993: 19). When extended metaphorically, prze- may carry the meaning of ‘entirely
through’, as observed by Pasich-Piasecka (1993: 18). In przemyśleć, the prefix not only
implies the in-depth nature of the mental activity, but also points to its completeness.
The TR does not move in the physical, but in the temporal space here, covering a
certain period of time to perform the activity. The entity depicted, in the accusative
case as in (4), constitutes the LM which is directly affected by the experiencer of the
process:
(4) Wszystko przemyślał, wszystko rozważył
    [all-acc PRZE-thought-masc.3sg.perf all-acc considered-masc.3sg.perf]
    ‘He thought everything over, he considered everything.’
As Janda (1993:â•›143–145) points out, the accusative is the case marker which shows
the dynamicity of the process and perfectly reflects the fact that the entity taking it is
directly influenced by the TR and, as a result, may undergo certain changes. The prefix
in the verb przemyśleć points to the completeness of the temporally extended process,
which quite often may be complemented by other elements in the sentence, such as
cały ‘whole’, wszystko ‘all’, etc. Due to its extended and complete sense, przemyśleć belongs to the perdurative group of verbs (Grochowska 1979: 70; Przybylska 2006: 160;
Dickey 2009, unpubl. manusc.).
Roz-, in its basic image schema, represents the TR and LM constituting one entity before a change, which takes different forms afterwards. Thus, the comparison
of the two states of the entity before and after the change profiles different senses of
roz-, which are to a great extent determined by the verb they go with (Przybylska
2006: 201). When in combination with roz-, myśleć may take either an imperfective
or a perfective form. In the former case, the observed change is in the intensity of the
mental activity. As Przybylska (2002: 271) notes, the activity is represented in both the
basic and the prefixed form of the verb; the difference may lie in the duration and intensity of the activity. The same observation is made by Dickey (unpubl. manusc.: 14),
who calls verbs of this type ‘procedural’ as they “do not alter the basic lexical meaning
of the source verb” and modify only the aforementioned aspects. Compare the following sentences (invented examples):
(5) a. Piotr myślał o Marysi.
       [Peter-masc.nom.sg was thinking-3sg.impf about Mary-fem.loc.sg]
       ‘Peter was thinking about Mary.’
    b. Piotr rozmyślał o Marysi.
       [Peter-masc.nom.sg through-was thinking-3sg.impf about Mary-fem.loc.sg]
       ‘Peter mused/was musing about Mary.’
In both (5a) and (5b), the verbs take the imperfective form; in the latter case, however,
the prefix emphasizes the extendedness of the process.
The dispersed nature of the mental activity is additionally highlighted by the
preposition which rozmyślać may go with. Just like pomyśleć, it may be combined
with o ‘about’ and the nominal phrase in LOC, as in (5b), or with nad ‘over’ and the
substantive in INSTR, as in (5c). The prepositions and the case markers show that
the objects are not directly affected by the process. The prepositions, whose senses,
discussed earlier for pomyśleć, are also relevant for rozmyślać, profile either proximity
or a variety of directions to follow, respectively. The cases, on the other hand, indicate
just the area, but not the direct focus, of attention.
(5) c. Otarł je szybko i, zacisnąwszy szczęki, rozmyślał gorączkowo nad
       sposobami ratunku
       [Wiped-masc.3sg.perf them fast and clenching jaws-acc was
       ROZ-thought-masc.3sg.impf frantically over ways-instr help-gen]
       ‘He wiped them fast and, clenching his jaws, he frantically pondered
       over the ways to help.’
    d. Zrób sobie kawę i przestań bezproduktywnie rozmyślać!
       [Make-imp self-dat coffee-acc and stop-imp unproductively ROZ-think-inf]
       ‘Make yourself a coffee and stop pondering unproductively.’
As can be observed, the sense of roz- is strongly related to the prepositional senses it
goes with. However, rozmyślać does not have to take any object and it still implicitly
shows that the TR’s (subject’s) thoughts go in more than one direction, as in (5d).
Roz- can also take a perfective form with myśleć, resulting in a reflexive verb rozmyślić się ‘change one’s mind’. The observed change in the subject’s mental state is
from the ‘normal’ process of the mental activity represented by the unprefixed
form to the ‘changed’ mental state represented by the prefixed form. The reflexive
pronoun emphasizes the internal mental change of the subject, which may also bring
about a change in the subject’s behaviour, frequently conceived of by observers as a
negative change (Przybylska 2002: 279–280). The behavioural change may be considered to be the oblique result of the mental activity (Dickey unpubl. manusc.: 15). Consider (6), where Salisz’s change of mind may result in his change of behaviour and his
calling another person back, reflected in (6) by zawoła go z powrotem ‘call him back’:
(6) Pędził co sił w obawie, że Salisz się rozmyśli, że zawoła go z powrotem.
    [Rushed-masc.3sg.impf what strength-pl.gen in fear-loc that Salisz refl
    ROZ-think-3sg.fut that call-3sg.fut him-acc with return-instr]
    ‘He ran with all his might for fear that Salisz would change his mind and
    call him back.’
The prefix wy- in the spatial domain construes the TR’s emergence from the LM, or
its coming into existence by leaving the bounded region of the LM, thus evoking the
container image schema. Some correspondences can be found between wy- and
English out of, which in its prototypical image schema depicts “removal or departure
of one concrete object from within another object or place” (Lindner 1983: 81). In
combination with myśleć, wy- metaphorically refers to a mental state as a result of
which one, as in (7a), or more ideas, as in (7b), emerge from one’s mind:
(7) a. Sama wymyśliłam przepis na moje najukochańsze ciasto.
       [alone-fem WY-thought-fem.3sg.perf recipe-acc on my loveliest-acc
       cake-acc]
       ‘I came up with the recipe for my favourite cake myself.’
    b. Co pewien czas panny wymyślały takie wieloznaczne słowa.
       [What some time maidens-nom WY-thought-fem.3pl.impf such
       ambiguous-acc words-acc]
       ‘Every now and again girls came up with such ambiguous words.’
In the perfective form of the verb (7a), one focuses on the effect of the process of
thinking, indicating also its thorough nature. The result is reflected in the direct object position, representing the LM, which takes the accusative marking, usually in the
singular form, to profile a single completed event.
In the imperfective form of the verb (7b), however, the distributive nature of
wy- is more conspicuous (Śmiech 1986). As Dickey (unpubl. manusc.) observes, in
this case “closure is either not reached or not salient in the conceptualization, e.g. for
states, events in process, [or] habitually repeated events”. Whether the mental activity
is in constant process, producing regular results, or whether it can be considered as a
habitually repeated event should not really matter; what counts in this case is that the
effect, represented by the nominal phrase in the accusative case, frequently takes the
plural form, emphasizing the distributive nature of the mental process.
The prefix za-, among many other senses distinguished by Tabakowska
(2003a: 166–172), can represent a construal of ‘excess’ with intransitive perfective
verbs, being extended from the sense of ‘going beyond a boundary’. In the group of
intensive-resultative verbs, Dickey (unpubl. manusc.: 15) classifies za… się (lit. ‘behind’… refl) verbs as absorptives, as they construe a continuous process whose subject, by becoming deeply engrossed in the activity, loses control over it. As the mental
activity occurs independently of the TR’s will, some adverse consequences, or, in other words, oblique results, may be expected to take place.
In the analysis of the prefixed verbs and the objects that they take, one should be
able to observe that the thinking activity may give either direct or oblique results, as
distinguished by Dickey (2009, unpubl. manusc.), thus forming two endpoints of a
continuum. The former – giving a direct, and usually positively loaded, effect of the
mental process – is represented by the accusative case and is combined with wymyślić/
ać ‘think out’ or przemyśleć ‘think through’. Then comes domyślić/ać się ‘guess’, indicating the goal of the process in the genitive case. It is followed by pomyśleć o+LOC/
nad+INSTR ‘think about/over’ or rozmyślać o+LOC/nad+INSTR ‘ponder about’. Objects in both the locative, construing an indirect constant movement of the TR round
the focal point, and the instrumental, not indicating the goal of the action, but serving
[Figure 1 here: a continuum from direct to oblique result. At the direct-result end, wy- and prze- take ACC objects and do- takes GEN objects; in the middle, po- and roz- take o+LOC/nad+INST objects and po- and do- take Clause objects; at the oblique-result end, roz- and za- are REFL and Intransitive.]

Abbreviations: ACC – Accusative; GEN – Genitive; LOC – Locative; INST – Instrumental; REFL – Reflexive; and Intr – Intransitive.

Figure 1. Relationship between prefixed verb and the affectedness of the object4

4. Domyślić się and zamyślić się are reflexive verbs, which may take complements, while rozmyślić się is reflexive and intransitive.
to manifest it (Janda 1993: 143), are not directly affected by the action, and thus appear to be more oblique. Nor is the clausal object that occurs with pomyśleć, że… ‘think that…’ or domyślić się, że… ‘guess that…’ directly affected. Rozmyślić się ‘change one’s
mind’ and zamyślić się ‘be lost in thought’ constitute the end of the continuum, taking oblique objects. Their presence is not mentioned in sentences at all; however, the
verbs themselves, as Dickey (2009: 27) points out, may indicate an adverse or harmful
result of the process, which represents the most extreme degree of subjectification, as
understood by Langacker (1999).
Introspective and qualitative analysis of corpus examples, the results of which are
represented above in Figure 1, allows us to formulate the following hypothesis: wy-,
prze-, do-myślić (się) on the one hand, and po-, do-, roz-, za-myślić (się) on the other,
are conceptually similar. This hypothesis will be explored with correspondence analysis and hierarchical cluster analysis in Section 5. First, however, we will devote some
space to the description of the corpus and the coding procedure. At this juncture, it
is also noteworthy that our research question is further specialized, as intransitive
reflexive forms, which do not take any object, will be excluded from the present study
due to its exclusive focus on the relation holding between the prefixed forms of Polish
‘think’ and the semantics of their objects.
3. The corpus
The lexico-grammatical analysis of the lexemes under study has been conducted on
the online version of the PWN Corpus of Polish. Glynn (2009: 84) stresses the importance of representativity and corpus size in lexical semantic studies:
Content words repeat infrequently and lexical variation is typically sensitive to
extra-linguistic factors. These two conditions mean that for a lexical semantic
study to capture any degree of semantic subtlety of even the most common usages associated with a given lexeme, the corpus must be large and preferably representative of various types of language and register.
It is essential, however, that we be aware that “even the largest corpus in the world is
but a microscopic fraction of actual language use” (Glynn, this volume).
The PWN Corpus of Polish consists of 40 million words sampled from 386 books,
977 issues of 185 newspapers and magazines, 84 recorded (and transcribed) conversations, 207 websites, as well as several hundred leaflets. It is, therefore, relatively balanced both in terms of genre and topic distribution. It must be noted, however, that it
is biased toward unspontaneous written language. Tables 1 and 2 show the percentage
ratios of particular text types and topics in the corpus.
Unfortunately, there is no dedicated tool that would allow the researcher to tag
the genre and topic of specific word uses automatically. To avoid the misattribution
Table 1. Genre distribution in the PWN Corpus of Polish

Source                                      Percentage of corpus   Number of text samples
fiction                                     20                     195
non-fiction                                 21                     192
newspapers and magazines                    45.5                   997 issues/185 titles
spoken                                       4.5                    84
leaflets                                     5.5                   272 files
Internet (websites, blogs, chats, forums)    3.5                   207
Table 2. The PWN Corpus of Polish: Topic distribution

Subject matter                     Percentage
philosophy, religion                7
history, geography                 17
literary criticism, linguistics     9
sciences                            9
politics, economy                  14
social sciences                     5
applied sciences                    8
arts                                5.5
other                              25.5
of examples to the specific topics on the basis of limited co-text, we did not include
these variables in our analysis. It is thus strictly a semantic-syntactic analysis. The
tables above are only provided in order to give the reader an idea of what the corpus
represents.
The study has been designed in such a way so as to examine the relationship and
potential correlation between the prefixes do-, po-, prze-, roz-, wy- and za-, the aspect,
object form and object semantics. With that purpose in mind, the dataset has been
manually tagged in order to measure the extent to which the lemmas in question may
be synonymous.
4. Feature annotation
As observed by Divjak (2006: 22), a project aimed at establishing the degree of similarity between lexical items requires “precise syntactic and semantic data on the distribution of the potentially near-synonymous lexemes over constructions and of their
collocates over the slots of those constructions”. Therefore, after all the contextualized
instances of the investigated lexemes were extracted from the corpus, each of them
was “translated into metalanguage”, as Divjak (2006: 21) puts it. In other words, every
occurrence of domyślać, domyślić (się), pomyśleć, przemyśleć, przemyśliwać, rozmyślać,
rozmyślić, wymyślać, wymyślić and zamyślić was analysed and annotated for a number
of formal and semantic properties. Laborious and time-consuming as it was, manual
tagging of the data was necessary for securing the presence of all the elements crucial
to the semantic description of a lexeme. This approach, referred to as ‘usage-feature’
analysis (Glynn 2010c: 8), has proven successful in corpus-driven Cognitive Linguistic research (Gries 2003, 2006; Divjak 2006, 2010; Gries and Stefanowitsch 2006;
Grondelaers et al. 2007; Glynn 2009, 2010a, in press; Speelman et al. 2009; Glynn and
Fischer 2010).
The decision as to which variables and features to code was made on the basis of relevant literature, including, among others, Vendler (1967), Divjak (2006),
Gries (2006), and Glynn (2009, 2010b). The annotation used by Divjak (2006), Gries
(2006), and Glynn (2009, 2010b) proved particularly informative. According to
Divjak (2006: 23), what is essential is that “the main participants of events or situations are encoded”. At first, the occurrences were annotated for a limited, yet diverse, variety of morphological, syntactic and semantic features, including tense, aspect and voice; transitivity and mood; as well as the subject’s and complement’s form, humanness, animacy, abstractness, etc. (see Gries 2006: 73, 75; Divjak 2006: 34–36; and Glynn 2010b: 245–249 for more examples). With time (and the growth of the
coders’ expertise), new categories were added to the coding schema. The description
of the morphological features now also includes the prefix, person and number of the
verb. As for syntactic properties, apart from mood and transitivity, negation has been
incorporated into the adopted usage-feature set. Finally, with regard to the semantics
of subjects and objects attested in the corpus, the following categories were added:
subject visibility, competence and specificity, as well as some further formal features
such as object case/person/number, and (in the case of clausal complements) clause
semantics.
Given that it is the mutual dependence between the prefix, the aspect and the
complement type that lies at the heart of the present analysis, every direct object, be it a
noun phrase, a clause, or a reduced clause, was decomposed into its formal and semantic characteristics. Consequently, NP complements (coded for case and number) were
labelled as human (e.g. Piotr ‘Peter’/detektyw ‘detective’), concrete (e.g. potrawa ‘dish’),
or abstract (e.g. plan ‘plan’/rozwiązanie ‘solution’), while clausal complements were
grouped into achievements (i.e. single moment actions e.g. die), accomplishments (i.e.
durative, heterogeneous processes, e.g. reach a verdict), states (e.g. be, have), activities
(i.e. durative and uncompleted actions, e.g. to be hiding in the shed), or hypothetical (i.e.
conditionals or clauses in the future tense).5 Pronominal direct objects were tagged for
case, number and person (if applicable) and their co-referents were classified according to the same principles as NP complements. In the analysis, however, it turned out
that object form (NP or clause) and object semantics were the most informative. NP
object case is not a feature independent of the prefix (a particular prefixed form takes a
5. The first four clausal categories were distinguished on the basis of Vendler (1967).
Table 3. Prefixed forms of myśleć and their co-occurrence with Object Forms

Predicate        NP   Clause  Reduced  Pronoun  Narrative  Intrans.  Lexeme  Lemma
                              Clause                                 total   total
domyślać się     70    338      3        51        2         78        542    892
domyślić się     28    237      3        22        1         59        350
pomyśleć        155    570     46        81       57         89        998    998
przemyśleć      115     10      0        86        0         14        225    253
przemyśliwać     12      5      0         7        1          3         28
rozmyślać        56     38      0        28        9         49        180    244
rozmyślić się     0      0      0         0        0         64         64
wymyślać        168      0      0        36        0         59        269   1326
wymyślić        569     78      0       373        0         37       1057
zamyślić się     14     20      1         5        0        145        185    185
Total          1187   1296     53       689       70        597       3892   3892
particular object case). NP object number did not seem to interact in any regular ways
with the prefixes, which is why these two features were later disregarded.
Altogether, a total of 3,892 tokens were extracted from the corpus as representative of the six investigated lemmas in their two aspectual forms. Table 3 illustrates the
distribution of the lemmas in the dataset as well as the number of NP/pronominal/
clausal complements for every type. Table 3 shows that not all the prefixed forms are
equally frequent, with only 28 examples of przemyśliwać and as many as 1,057 tokens
of wymyślić. Relative to the representativeness of the corpus, we could claim that, vis-à-vis its relative frequency, wymyślić is the most conspicuous of the analysed lexemes.
Moreover, the distribution of the same prefix between the two aspects is far from
balanced. As many as 61% of the do- forms are Imperfective, as if the process of reaching the goal indicated by do- was more often focused on than the very act of having
reached it. Even though the prefix prze- focuses on the period of time covered by the
verb, it also emphasizes the completion and is thus far more commonly used in the
Perfective Aspect (89% of cases). Since we focus on the verb-object interaction here,
we disregard the intransitive uses of roz-. Considering its transitive uses, roz- is always
Imperfective, po- is always Perfective, while za-, unlike po-, can potentially co-occur
with either aspect. In the investigated corpus it is also restricted to the Perfective Aspect. For wy- the ratio is 80% Perfective to 20% Imperfective. This may suggest that
it is only do- which can be either Perfective or Imperfective, while the other prefixes
show a clear preference for one Aspect: po-, prze-, wy- and za- for the Perfective and
roz- for the Imperfective. It can be explained by the processual, dispersed nature of
roz-, as indicated in the introspective part of the study.
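The aspect ratios just cited can be recomputed from the lexeme totals in Table 3. A minimal sketch (Python is our choice of illustration language, not the authors'):

```python
# Aspect preference per prefix, computed from the lexeme totals in
# Table 3 (imperfective vs. perfective member of each aspectual pair).
pairs = {
    "do-":   {"impf": 542, "perf": 350},    # domyślać się / domyślić się
    "prze-": {"impf": 28,  "perf": 225},    # przemyśliwać / przemyśleć
    "wy-":   {"impf": 269, "perf": 1057},   # wymyślać / wymyślić
}

for prefix, n in pairs.items():
    total = n["impf"] + n["perf"]
    print(f"{prefix:6s} {n['impf'] / total:.0%} Imperfective, "
          f"{n['perf'] / total:.0%} Perfective")
```

The printed figures reproduce the percentages reported above: 61% Imperfective for do-, 89% Perfective for prze-, and 80% Perfective for wy-.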
As the aim of this paper is to explain the interaction of Prefix and Object grammar and semantics in the creation of meaning, the intransitive forms will be discarded
from further analysis. Also, the Narrative selects almost exclusively for the po- prefix,
so there is no need to investigate their interaction with other prefixes. The Reduced
Clauses are rare; hence, these occurrences are grouped with full Clause complements.
In the cases where the semantics of the pronominal object could be identified and
coded, the examples were included in further analysis. Those cases, however, whose
semantics could not be analysed, were deleted. Following this, the pronominal objects
were re-coded as NPs when they stood for NPs and as Clauses if they stood for clauses, as in the construction myśleć o tym, że … ‘to think about it that…’.
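The clean-up just described (dropping pronominal objects whose semantics could not be identified, and re-coding the rest by what they stand for) can be sketched as follows; the field names are hypothetical illustrations, not the authors' actual annotation labels:

```python
# Each annotated corpus hit is a record of usage features; the field
# names here are invented stand-ins for the schema described above.
tokens = [
    {"lemma": "wymyślić", "object_form": "NP", "referent": None},
    {"lemma": "pomyśleć", "object_form": "Pronoun", "referent": "Clause"},
    {"lemma": "domyślić się", "object_form": "Pronoun", "referent": None},
]

def recode(tok):
    """Re-code a pronominal object as NP or Clause according to its
    referent; return None when the referent could not be identified."""
    if tok["object_form"] != "Pronoun":
        return tok
    if tok["referent"] is None:          # semantics unanalysable: delete
        return None
    return {**tok, "object_form": tok["referent"]}

cleaned = [t for t in (recode(tok) for tok in tokens) if t is not None]
print([t["object_form"] for t in cleaned])   # ['NP', 'Clause']
```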
As will be shown in the sections that follow, a careful analysis of all the annotated features and their subsequent submission to multivariate statistics have enabled
us to identify “patterns of usage” (Glynn 2009). Apart from producing results corresponding to and, therefore, corroborating the hypothesis presented in Section 2,
the exploratory statistics applied here will also reveal interesting variation in usage
not predicted by the introspective analysis. This demonstrates that introspection does
indeed need to be complemented by quantitative measures to produce more comprehensive results.
5. Multivariate analysis of the results of feature annotation
In this section, our data are further explored with two multivariate techniques: correspondence analysis and cluster analysis (for descriptions of these methods see Glynn,
this volume, and Divjak and Fieller, this volume). Then, the confirmatory method
of logistic regression analysis is used to corroborate the results of the correspondence analyses. These will enable us to investigate the contribution of Object Semantics
and other essential factors to the holistic meaning structure of the prefixed cognition
verb myśleć. The section ends with a discussion of the interaction of all the analysed
variables.
First, a chi-square test is applied to the data to check if the variation between the
lexemes is significant:
Pearson’s chi-square test
χ² = 3803.6, df = 55, p-value < 2.2e-16
The chi-square test demonstrates that there is significant variation between the lexemes under investigation. In Table 4, the inspection of Pearson’s residuals shows that
the correlation between the prefix do- and Imperfective Aspect is the most important,
as is that between po- and Perfective Aspect. In regard to Object Form, the strongest
and most significant correlations hold between do- and Object Clause, and wy- and
Object NP. Finally, with respect to Object Semantics, the prefix do- correlates importantly with Accomplishment, Activity and State, po- with Achievement and Hypothetical clauses, while wy- correlates with Concrete and Abstract, prze- with Abstract, and
roz- with Human nominal objects.
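The statistic and the residuals of this kind are standard computations; the following sketch applies them to a small prefix-by-aspect table built from the lexeme totals of Table 3 (the chapter's actual test crosses all six prefixes with twelve features, hence df = 55):

```python
import numpy as np
from scipy.stats import chi2_contingency

# 3x2 contingency table of prefix (rows) by aspect (imperfective,
# perfective), with counts taken from the lexeme totals in Table 3.
observed = np.array([
    [542, 350],   # do-
    [  0, 998],   # po-
    [ 28, 225],   # prze-
])

chi2, p, dof, expected = chi2_contingency(observed)

# Pearson residuals: (observed - expected) / sqrt(expected); large
# positive values mark cells over-represented under independence.
residuals = (observed - expected) / np.sqrt(expected)

print(f"chi-square = {chi2:.1f}, df = {dof}, p = {p:.3g}")
print(np.round(residuals, 3))
```

The large positive residual for do- with the Imperfective and the large negative one for po- with the Imperfective mirror the pattern reported for the full table.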
Table 4. Pearson’s residuals for the prefixed verb forms of Polish myśleć ‘to think’

                         do-       po-      prze-     roz-      wy-      za-
Imperfective           17.014   –14.556    –3.739   14.704    –4.415   –3.134
Perfective            –10.499     9.125     2.251   –9.035     2.611    1.902
Object Clause          11.915    10.229    –6.088   –0.772   –16.923    0.639
Object NP             –11.959   –10.95      6.305    0.865    17.488   –0.609
Object ABSTRACT        –9.362   –13.081     8.166   –0.273    16.819   –0.775
Object ACCOMPLISHMENT   3.871    –0.009    –0.671   –0.795    –2.628   –0.823
Object ACHIEVEMENT      3.91      4.404    –1.931    0.039    –6.608    0.074
Object ACTIVITY         6.095     4.426    –1.246   –1.165    –8.353   –0.124
Object CONCRETE        –6.118    –2.245    –2.708   –1.902     9.222   –0.454
Object HUMAN           –2.588     0.122    –1.401    5.578     0.633    0.813
Object HYPOTHETICAL    –3.637    10.093    –2.06     3.509    –6.625    1.322
Object STATE           12.882     4.57     –4.853   –2.847   –12.313    0.086
Let us now turn to correspondence analysis in order to see how these factors
interact with one another. Figure 2 presents the results of a multiple correspondence
analysis. It employs a Burt matrix/‘joint’ method to correct for low explained inertia
(Greenacre 1993). The explained inertia is 81.2% (Dim. 1 = 68.4, Dim. 2 = 12.8),
which is a stable result in multiple correspondence analysis.
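For readers less familiar with the Burt-matrix variant, the construction can be sketched as follows: each observation is coded into a 0/1 indicator matrix Z with one column per category level, and the Burt matrix B = ZᵀZ cross-tabulates every pair of levels. The observations below are invented for illustration; a full MCA would additionally require an eigendecomposition of the suitably weighted Burt matrix (Greenacre 1993).

```python
# Toy annotated observations (prefix, aspect); invented for illustration.
observations = [
    ("do", "imperf"), ("do", "imperf"), ("po", "perf"),
    ("wy", "perf"), ("wy", "imperf"), ("po", "perf"),
]

# Fix a stable ordering of category levels across both variables
prefixes = sorted({p for p, _ in observations})
aspects = sorted({a for _, a in observations})
levels = [("prefix", p) for p in prefixes] + [("aspect", a) for a in aspects]

# Indicator (disjunctive) matrix Z: one 0/1 column per category level
Z = [[1 if lv in (("prefix", p), ("aspect", a)) else 0 for lv in levels]
     for p, a in observations]

# Burt matrix B = Z^T Z: each diagonal block is one variable's frequency
# table, each off-diagonal block a two-way cross-tabulation
k = len(levels)
B = [[sum(Z[i][r] * Z[i][c] for i in range(len(Z))) for c in range(k)]
     for r in range(k)]
```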
As was predicted by the analysis of the results of Table 4, the Imperfective Aspect
correlates highly with the do- and roz- prefixes. The combination of the goal-indicating do- with the Imperfective Aspect shows the focus on the processual nature of
[MCA plot not reproduced here]
Figure 2.  Multiple correspondence analysis: Prefixes of the Polish verb myśleć ‘to think’ and their correlation with Aspect, Object Form and Object Semantics
240 Małgorzata Fabiszak et al.
the verb domyślać się in the conceptualization. The prefix roz- indicates the dispersed
nature of the verb it is attached to and, in combination with the Imperfective, emphasizes the extensive nature of the thinking process. As for the hypothesized correlation
between wy-, prze-, do-myślić (się), on the one hand, and po-, do-, roz-, za-myślić (się),
on the other (see Section 2 and Figure 1), Figure 2 partly confirms these assumptions,
which were arrived at via theoretical and introspective reasoning. Firstly, the prefixes
prze- and wy- in Figure 2 are grouped together relative to nominal objects of thought.
Secondly, za- and po- also correspond to one another relative to the Perfective Aspect and hypothetical clausal objects of thought. Finally, there is a third correlation
discernible between do- and roz-, which were originally expected to belong together
with the previous group; here, they are clustered together relative to the Imperfective
Aspect.
In the next step of data analysis, we will combine two factors: Aspect and Prefix,
with a view to visualizing more clearly the perfective and imperfective uses of the relevant prefixed predicates. Accordingly, the next plot presents the binary correspondence analysis of Prefix combined with Aspect, Object Form and Object Semantics.
The binary correspondence in Figure 3 clearly captures the dispersion of object
types relative to the prefixes. It is a stable analysis, with 95% explained inertia (Dim. 1:
67.16, Dim. 2: 27.84). The plot shows that nominal Abstract objects correlate closely
with Perfective prze- and Perfective and Imperfective wy-, which are plotted to the
left of the centre. Above this grouping, we can see that Concrete nominal objects are
most closely correlated with the Perfective wy-, while Human nominal objects are in
distinct correlation with Imperfective roz-. This is made evident by the semantic feature being pushed away from the rest of the plot and located just above the prefix in
question. All these correlations are illustrated in examples (8a–e) below:
(8) a. Przemyślałam jeszcze raz całą sytuacje od początku i doszłam do tego
samego.
‘I thought this situation over once more and came to the same
[conclusion].’ (prze+Perf+ObjNP+ObjAbstr)
b. Jeżeli zorientuje się, że wróciłyśmy, wymyśli jakieś usprawiedliwienie.
‘If he realizes we have come back, he will invent some excuse.’
(wy+Perf+ObjNP+ObjAbstr)
c. Bo jak facet 500 lat temu mógł wymyślic helikopter lub rower?
‘How could a guy invent a helicopter or a bike 500 years ago?’
(wy+Perf+ObjNP+ObjConcr)
d. Mówi, że fryzjerstwo jest twórcze: wymyśla własne patenty, na przykład
układanie na krem Nivea, dla efektu i z braku żelu.
‘He says that hairdressing is creative: he invents his own patents, for
example, doing hair with the use of the Nivea cream, for effect and for
want of gel.’ (wy+Imperf+ObjNP+ObjAbstr)
e. Patrzyła na mnie i rozmyślała o Witku.
‘She was looking at me and thinking about Witek.’
(roz+Imperf+ObjNP+ObjHum)
The mental predicate przemyśleć in (8a) designates a thorough, temporally extended
activity coming to its end, which may bring a change to the object affected by the
mental process. The choice of abstract objects of thought seems justifiable for such an
in-depth analysis. It also stands to reason that the resultative, goal-oriented prefix wy- attracts abstract and concrete nominals (8b, c), which profile the effect of the thinking
process in a reified and comprehensive way. The imperfective form wymyślać, rather
than focusing primarily on the effect of thinking, emphasizes the distributive nature
of the process, in the example here additionally reinforced by the plural use of the
nominal in the object position (8d). It also makes perfect sense (example (8e)) why
the prefix roz-, accentuating the extensive nature of the mental activity, should tend
towards objects designating humans, normally our loved ones.
Now, when moving to the right-hand bottom quadrant of Figure 3, we can see
how the Prefix do- combined with the Imperfective Aspect correlates with Activity
and State, while do- combined with the Perfective Aspect is in close correlation with
State and in distinct correlation with Accomplishment, as exemplified below (9a–d):
(9) a. Albo nie chciała wtedy powiedzieć wprost, że się domyśla, że w tajemnicy
przed nią popalał heroinę.
‘Or she didn’t want to say directly that she was coming to realize that he
was clandestinely smoking heroin.’ (do+Imperf+ObjClause+ObjActivity)
b. W pierwszej części filmu nikt się nie domyśla, że ten szczuplutki chłopiec o
wąskiej twarzy i dużych ustach to kobieta.
‘In the first part of the film nobody realizes that this slim boy with a narrow
face and lavish lips is a woman.’ (do+Imperf+ObjClause+ObjState)
c. Twój rozmówca natychmiast domyśli się, że przeżywałeś jakiś dramat.
‘Your interlocutor will realize at once that you were experiencing some
drama.’ (do+Perf+ObjClause+ObjState)
d. W żaden sposób nie mogliśmy się domyślić, co ich tu sprowadziło.
‘We could in no way work out what brought them here.’
(do+Perf+ObjClause+ObjAccomplishment)
The sentences above demonstrate that the verb myśleć, when coupled with the attainment prefix do- in either aspect, tends to co-occur with (1) objects designating states
that affect other people in the surrounding world (examples (9b and c)), to which the
conceptualiser is denied full access; (2) bounded or unbounded activities that happen
beyond the speaker’s immediate field of attention (example (9a)); and (3) accomplishments whose genesis and course of development do not fall within the speaker’s known
reality either (example (9d)). Therefore, it can be seen that the processual meaning of
[Correspondence plot not reproduced here]
Figure 3.  Binary correspondence analysis: Prefix combined with Aspect, Object Form and Object Semantics
the verb and the semantic features of the said objects of thought, characterized by a
certain degree of obscurity, are compatible.
Considering the right-hand upper quadrant of the plot, we can see the correlation
between the prefix po- and Hypothetical, Achievement and Activity clauses, which are
illustrated in examples (10a–c):
(10) a. Pomyślałem, że i ja mógłbym latać.
‘I thought that I could fly too.’ (po+Perf+ObjClause+ObjHypo)
b. Pomyślałem, że niczego nie spostrzegł.
‘I thought that he hadn’t noticed anything.’ (po+Perf+ObjClause+ObjAch)
c. Można było pomyśleć , że się o coś modli.
‘One could think that he was praying for something.’
(po+Perf+ObjClause+ObjActivity)
This delimitative prefix is construed against the background of a variety of possible
paths, one of which is selected and leads to the final point, becoming the focus of
attention, to use Langacker’s parlance. The final point here is represented by clauses
designating hypothetical events, achievements or activities. This shows that pomyśleć
is more directly correlated with clauses, imposing upon them its delimitative nature.
It is also noteworthy that activities and achievements are more directly observable
and more easily enclosed in or confined to a single event of thought than accomplishments or states.
Hypothetical clausal objects are attracted by the prefixes roz- and za-, as exemplified below (11a–b):
(11) a. “Widocznie każdy musi wszystko przeżyć od nowa”, rozmyślał na głos
Olejniczak.
‘“Apparently everyone has to re-live everything”, Olejniczak was thinking
aloud.’ (roz+Imperf+ObjClause+ObjHypothetical)
b. “Może warto spróbować…”, zamyślił się.
‘“Maybe it’s worth trying”, he thought.’
(za+Perf+ObjClause+ObjHypothetical)
It is noteworthy that in Section 2 both these prefixes were hypothesized to be conceptually similar. They both profile a mental process, whereby the subject is absorbed in
thought about a given object. Due to its dispersed character, the procedural intensifying prefix roz- designates an expanse of mental space affected by the thinking process,
without actually impacting whatever constitutes the object of pondering. Here, the
mind seems to be construed in a more processual manner. The intensive-resultative
za-, on the other hand, construes a situation in which either the subject seems to be
behind the curtain of thought, or his/her mind, viewed as a container, appears to be
completely covered by or filled in with the mental state (cf. Tabakowska 2003a).6 Both
these mental processes engross the subject to such an extent that any other activity is
impossible. Hypothetical objects which correlate closely with these prefixes concern
the unknown, the possible, the unrealized, which naturally intrigues humans and invites extensive (roz-) as well as deep (za-) consideration.
Let us now consider the difference in the use of nominal and clausal objects. Both
correspondence analyses reveal that this distinction is basic in the data. These two
object types divide the prefixes into two discrete groups represented in the plots by
two separate clusters on the right and left halves. Correspondence analysis offers no
statistical confirmation as to the significance or accuracy of what it identifies. In order
to obtain such information, we have to turn to logistic regression (see Speelman, this
volume).
The model below produces accurate and predictive results, which can be assessed
on the basis of the C score (0.846) and the estimated R2 (0.513). Although not expressed as a probability, the C score of 0.85 can be understood as approximating an
85% success rate in predicting the behaviour of the data in taking a nominal or clausal
object (see Speelman, this volume, for a detailed discussion on how to interpret the
model statistics). The estimated R2 score at 0.513 is a very strong indication of a stable
and predictive model, any figure above 0.3 being accepted as a strong result. The model was checked for overfitting and multicollinearity, neither of which posed a problem.
6. The prefix za- is also present in such words as zasłonięty ‘covered with’ and zapełniony ‘filled
with’.
Variance inflation factor values were beneath 2.5, which is well below accepted levels
of tolerance (Dodge 2008: 96).
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2793  -0.6885  -0.5204   0.5139   2.0329

Coefficients:
                   Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)         0.22609     0.37899    0.597   0.55080
Aspect Perfective  -0.56256     0.17132   -3.284   0.00102 **
Prefix do-         -1.59438     0.37208   -4.285  1.83e-05 ***
Prefix po-         -0.98245     0.34924   -2.813   0.00491 **
Prefix prze-        1.86671     0.39706    4.701  2.59e-06 ***
Prefix roz-        -0.04704     0.42383   -0.111   0.91162
Prefix wy-          2.29423     0.35414    6.478  9.27e-11 ***
---
Signif. codes: ‘***’ 0.001, ‘**’ 0.01, ‘*’ 0.05, ‘.’ 0.1

Null deviance:     3795.0 on 2738 degrees of freedom
Residual deviance: 2463.5 on 2732 degrees of freedom
AIC: 2477.5
Number of Fisher Scoring iterations: 4

Obs    Model L.R.  d.f.  P  C      Dxy    R2     Brier
2739   1331.48     6     0  0.846  0.692  0.513  0.14
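The C score itself has a simple combinatorial definition: over all pairs consisting of one nominal-object and one clausal-object observation, it is the proportion of pairs in which the model assigns the higher probability of a nominal object to the observation that in fact has one, ties counting as half. A sketch with invented probabilities:

```python
# Invented fitted probabilities of a nominal object (1 = nominal,
# 0 = clausal); not the actual model's output.
pred = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
outcome = [1, 1, 0, 1, 0, 0]

# Concordance index C: share of (event, non-event) pairs ranked
# correctly; C = 0.5 is chance level, C = 1.0 a perfect ranking.
pairs = concordant = tied = 0
for i in range(len(pred)):
    for j in range(len(pred)):
        if outcome[i] == 1 and outcome[j] == 0:
            pairs += 1
            if pred[i] > pred[j]:
                concordant += 1
            elif pred[i] == pred[j]:
                tied += 1

c_index = (concordant + 0.5 * tied) / pairs   # 8/9 ≈ 0.889 here
```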
A closer look at the table of coefficients tells us how the model predicts the outcome
as either clausal or nominal. In the table, negative coefficients predict the occurrence
of clausal objects and positive coefficients predict the occurrence of a nominal object.
The strongest predictor for the clausal object is the prefix do-. The second feature predicting clausal objects is po-. However, the coefficient score is close to zero, indicating
that its role is minor, despite its statistical significance. We can now return to the
results of correspondence analysis in Figure 3 to better understand these findings.
Although both po- and do- are present, we now know that in fact do- is the more important association in the cluster.
Nominal complementation, in turn, is strongly predicted by the prefixes wy- and
prze-, which again bears out the correlations visualized in Figure 3. The fact that the
prefix roz- is the weakest predictor, neither significant nor important, also corresponds to the findings of the correspondence analysis, where it is located on the line
separating the halves containing nominal and clausal objects, respectively.
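The coefficients can also be turned into concrete predicted probabilities via the inverse logit. The sketch below uses the estimates reported above and assumes, as the output suggests, that Imperfective and the prefix za- are the reference levels absorbed by the intercept.

```python
import math

# Coefficients as reported in the model output; za- and Imperfective
# are assumed to be the (dummy-free) reference levels.
intercept = 0.22609
aspect_perfective = -0.56256
prefix_coef = {"do": -1.59438, "po": -0.98245, "prze": 1.86671,
               "roz": -0.04704, "wy": 2.29423, "za": 0.0}

def p_nominal(prefix, perfective):
    """Predicted probability of a nominal (vs clausal) object."""
    logit = intercept + prefix_coef[prefix]
    if perfective:
        logit += aspect_perfective
    return 1.0 / (1.0 + math.exp(-logit))

# Perfective wy- strongly favours nominal objects (p ≈ 0.88), while
# perfective do- strongly favours clausal ones (p ≈ 0.13).
```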
Overall, the above findings indicate that the attainment prefix do-, in particular,
but also the delimiting po- are more importantly associated with processual objects,
as designated by clausal complements. The resultative, goal-oriented prefix wy-, on
the other hand, and the prefix prze- correlate importantly with the abstract category
of things, profiled by nominal objects. This corroborates the results of the exploratory correspondence analyses presented above. It is also interesting in itself that no
predictive model could be generated with the inclusion of object semantics. This may
indicate that the semantics of the object of thought for these mental predicates can
only be distinguished at a higher and more schematic level of analysis. Rather than
drawing a fine-grained distinction between particular types of objects relative to particular verbs, it appears that, at least for some prefixes, a more coarse-grained preference can be identified – either for processual or reifying construal of the object. The
semantically coarse-grained results of the logistic regression analysis are compatible
with the schematic semantic characterization of the prefixes in question here. Taking
the strongest predictors for either response variable feature, we can see that the attainment, process-oriented do- is attracted to clausal objects, whereas the resultative
wy-, bringing about more or less concrete but objectifiable results, is more likely to be
associated with nominal phrases.
Having analysed and interpreted the correspondence analyses, which brought
to light a number of revealing interdependencies between various semantic and formal features, and displayed interesting continua thereof, some of which have been
confirmed by the Logistic Regression Analysis, we may turn to Hierarchical Cluster
Analysis.
The insights from the correspondence analyses and the logistic regression analysis will allow us to interpret the results of the cluster analysis. The cluster analysis, in
Figure 4, shows that roz- and za-, prze- and wy-Imperf as well as do- and po- cluster
with each other, whereas wy-Perf stands out from the rest. This clustering only partly
correlates with the hypothesis from Section 2, where wy-, prze-, and do- clustered at
one end of the continuum, while po-, do-, roz- and za- clustered at the other. However,
the correspondence analysis conducted above allows us to account for this discrepancy. As was shown in examples (8a–c), wy- and prze- correlate with NP objects, but do- correlates with Clausal objects. This is why prze- combined with Perfective Aspect and
wy- combined with Imperfective Aspect cluster together – they both take Abstract
nominal objects. Wy- combined with Perfective Aspect is most importantly correlated
with concrete nominal objects. This is why it occupies a separate branch in the cluster.
The introspective analysis, which relied on the native speaker’s competence of
what is acceptable in the language, could not account for the difference in the frequency of use of do- with clausal vs. nominal objects. Similarly, the analysis based
on intuition revealed that po- and roz- can both take nominal objects, yet it failed
to ascertain that this potential is highly significant only for roz-. This is probably the
reason why the HCA shows a clustering of po- and do-, and not of po- and roz-. This
[Dendrogram not reproduced here; edges are annotated with au/bp support values]
Figure 4.  Hierarchical cluster analysis (distance: Euclidean; cluster method: Ward)
correlation is also clearly visible in the correspondence analysis and is a result of the
propensity of these prefixes to correlate with clausal objects such as Achievement,
Activity and State. They are conceptually similar, as they both construe thinking as
a process, which is metaphorically understood as a journey on/along the road/path
(po-) leading to a goal (do-). The roz-, za- cluster is caused by the tendency of both
these prefixes to co-occur with Hypothetical clausal objects. Its presence can be further justified by the schematic meaning of the prefixes, which indicate that the subject
is construed as being deep in thought to such an extent that any other activity is
hindered.
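The geometry behind such a dendrogram is easy to make concrete. As an illustrative proxy for the frequency profiles actually clustered in Figure 4, the sketch below computes Euclidean distances between three of the residual profiles from Table 4; do- and po- come out mutually closest, matching their shared branch.

```python
import math

# Pearson residual profiles for three prefixes, read column-wise from
# Table 4 (rows in the order listed there); used here only as an
# illustrative stand-in for the authors' clustering input.
profiles = {
    "do": [17.014, -10.499, 11.915, -11.959, -9.362, 3.871,
           3.91, 6.095, -6.118, -2.588, -3.637, 12.882],
    "po": [-14.556, 9.125, 10.229, -10.95, -13.081, -0.009,
           4.404, 4.426, -2.245, 0.122, 10.093, 4.57],
    "wy": [-4.415, 2.611, -16.923, 17.488, 16.819, -2.628,
           -6.608, -8.353, 9.222, 0.633, -6.625, -12.313],
}

def euclidean(a, b):
    """Euclidean distance, the metric used for the dendrogram."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_do_po = euclidean(profiles["do"], profiles["po"])   # ≈ 41.2
d_do_wy = euclidean(profiles["do"], profiles["wy"])   # ≈ 65.3
d_po_wy = euclidean(profiles["po"], profiles["wy"])   # ≈ 59.8
```

Ward's method then merges, at each step, the two clusters whose fusion yields the smallest increase in within-cluster variance, which is why it is typically paired with Euclidean distances, as in Figure 4.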
6. Conclusion
The present paper describes the semasiological structure of the onomasiological field
of prefixed think verbs in Polish relative to their object form, object semantics and
aspect. These factors have been selected on the basis of previous pilot studies that
revealed their significance for the research question. The aim of providing an accurate
semantic description has been attained in a twofold manner. First, the prefixes were
subjected to a detailed introspective investigation in light of a number of relevant theoretical approaches. In this phase, we projected the inter- and intra-category conceptual relations, constituting the background against which crucial decisions could be
made regarding the annotation schema and the hypotheses to be tested at the quantitative stage of the study.
The second step involved submitting the annotated data to multivariate statistical
modelling in the form of two exploratory methods, correspondence analysis and hierarchical cluster analysis, and one confirmatory method, logistic regression analysis.
The application of two mutually complementary phases of analysis ensured greater
reliability of results. Employing pertinent exploratory methods enabled us to identify
actual patterns of use in the object of our study, as well as patterns surfacing in multiple dimensions that no introspection could possibly bring to light. Importantly, the
results thus obtained are replicable and falsifiable, which adds to the accuracy of the
model produced.
Further stages of research into the semantics of the Polish mental verbs should
focus on the contribution of Subject grammar and Subject semantics as well as modality and negation to the construal of the mental process. So far, Fabiszak, Hebda and
Konat (2012) have run a pilot study on the contribution of Subject form and Object
form and semantics to the construal of the scene in the Polish verb wierzyć ‘to believe’,
while Fabiszak and Hebda (2011) looked at the role of Subject, mood and negation
in the meaning construction of wierzyć ‘to believe’. Kokorniak and Krawczak (2010),
in turn, focus on nominal complementation of the verbs myśleć ‘think’ and sądzić
‘suppose’. The construal of prefixed think verbs in Polish relative to the grammatical
subject person and adverbial modification is the object of study in Krawczak and
Kokorniak (2012). Clausal that and zero complementation of the English verb think
relative to an array of syntactic and semantic properties is analysed in Krawczak and
Kokorniak (2010), while Krawczak and Fabiszak (2011) concentrate on the construal
of the nominal as well as the clausal complementation of the verbs myśleć ‘think’ and
wierzyć ‘believe’.
References
Amberber, M. (2003). The grammatical encoding of ‘thinking’ in Amharic. Cognitive Linguistics, 14, 195–219. DOI: 10.1515/cogl.2003.008
Comrie, B. (1976). Aspect: An introduction to the study of verbal aspect and related problems.
Cambridge: Cambridge University Press.
D’Andrade, R. (1987). A folk model of the mind. In D. Holland, & N. Quinn (Eds.), Cultural
models in language and thought (pp. 112–148). Cambridge: Cambridge University Press.
DOI: 10.1017/CBO9780511607660.006
D’Andrade, R. (1995). The development of cognitive anthropology. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9781139166645
Dąbrowska, E. (1997). Cognitive Semantics of the Polish dative. Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110814781
Danielewiczowa, M. (2000). Główne problemy opisu i podziału czasownikowych predykatów
mentalnych. [Main problems in the description and classification of verbal mental predicates.] In R. Grzegorczykowa, & K. Waszakowa (Eds.), Studia z semantyki porównawczej.
[Studies in comparative semantics] Vol. 1 (pp. 227–247). Warszawa: Wydawnictwo UW.
Danielewiczowa, M. (2002). Wiedza i niewiedza: Studium polskich czasowników epistemicznych.
[Knowing and not knowing: A study of Polish epistemic predicates.] Warszawa: Katedra
Lingwistyki Formalnej UW.
Dewell, R. B. (1994). Over again: Image-schema transformations in semantic analysis. Cognitive Linguistics, 5, 351–380. DOI: 10.1515/cogl.1994.5.4.351
Dickey, S. M. (Unpublished manuscript). Subjectification and the Russian perfective.
Dickey, S. M., & Hutcheson, J. (2003). Delimitative verbs in Russian, Czech and Slavic. In R. A.
Maguire, & A. Timberlake (Eds.), American contributions to the Thirteenth International
Congress of Slavists (pp. 23–36). Columbus, Ohio: Slavica. Retrieved from
<http://kuscholarworks.ku.edu/dspace/bitstream/1808/5473/1/Dickey%20%26%20
Hutcheson%20Delimitatives.pdf> [Accessed 9th November 2009].
Dickey, S. M. (2000). Parameters of Slavic aspect: A cognitive approach. Stanford: CSLI.
Dickey, S. M. (2009). Subjectification and the East-West aspect division. Paper presented at the
9th Slavic Cognitive Linguistics Conference, 16th October 2009, Prague.
Dirven, R., Goossens, L., Putseys, Y., & Vorlat, E. (1982). The scene of linguistic action and
its perspectivization by SPEAK, TALK, SAY, and TELL. Amsterdam & Philadelphia: John
Benjamins. DOI: 10.1075/pb.iii.6
Divjak, D., & Gries, St. Th. (2006). Ways of trying in Russian: Clustering behavioral profiles.
Corpus Linguistics and Linguistic Theory, 2, 3–60. DOI: 10.1515/CLLT.2006.002
Divjak, D. (2006). Ways of intending: A corpus-based Cognitive Linguistic approach to
near-synonyms in Russian. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 19–56). Berlin & New
York: Mouton de Gruyter.
Divjak, D. (2010). Structuring the lexicon: A clustered model for near-synonymy. Berlin & New
York: Mouton de Gruyter.
Dodge, Y. (2008). The concise encyclopedia of statistics. Berlin: Springer.
Fabiszak, M., & Hebda, A. (2011). Social and individual cognition, modality and negation in
the use of the Polish verb wierzyć ‘to believe’. Paper presented at SLE 2011, Logrono, Spain.
Fabiszak, M., Hebda, A., & Konat, B. (2012). Dichotomy between private and public experience: The case of Polish wierzyć ‘believe’. In Ch. Hart (Ed.), Online proceedings of UK-CLA
meetings 1 (pp. 164–176). Hertfordshire: The UK Cognitive Linguistics Association. Retrieved from <http://www.uk-cla.org.uk/proceedings/volume_1>.
Fortescue, M. (2001). Thoughts about thought. Cognitive Linguistics, 12, 15–45.
DOI: 10.1515/cogl.12.1.15
Geeraerts, D., Grondelaers, S., & Bakema, P. (1994). The structure of lexical variation: Meaning,
naming, and context. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110873061
Glynn, D. (2009). Polysemy, syntax, and variation: A usage-based method for Cognitive Semantics. In V. Evans, & S. Pourcel (Eds.), New directions in Cognitive Linguistics (pp. 77–
106). Amsterdam & Philadelphia: John Benjamins.
Glynn, D. (2010a). Synonymy, lexical fields, and grammatical constructions: A study in usage-based Cognitive Semantics. In H.-J. Schmid, & S. Handl (Eds.), Cognitive foundations
of linguistic usage-patterns: Empirical studies (pp. 89–118). Berlin & New York: Mouton de
Gruyter.
Glynn, D. (2010b). Testing the hypothesis: Objectivity and verification in usage-based Cognitive Semantics. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 239–270). Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110226423
Glynn, D. (2010c). Corpus-driven Cognitive Semantics: An overview of the field. In D. Glynn,
& K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 1–
42). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226423.1
Glynn, D. (Forthcoming). Mapping meaning: Corpus methods for Cognitive Semantics.
Cambridge: Cambridge University Press.
Glynn, D., & Fischer, K. (Eds.). (2010). Quantitative methods in Cognitive Semantics: Corpus-driven approaches. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226423
Goddard, C. (2003). ‘Thinking’ across languages and cultures: Six dimensions of variation. Cognitive Linguistics, 14, 109–140. DOI: 10.1515/cogl.2003.005
Greenacre, M. (1993). Correspondence analysis in practice. London: Academic Press.
Gries, St. Th. (2003). Multifactorial analysis in Corpus Linguistics: A study of particle placement.
London: Continuum Press.
Gries, St. Th. (2006). Corpus-based methods and Cognitive Semantics: The many senses of
to run. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110197709
Gries, St. Th., & Stefanowitsch, A. (Eds.). (2006). Corpora in Cognitive Linguistics: Corpus-based
approaches to syntax and lexis. Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110197709
Grochowska, A. (1979). Próba opisu reguł łączliwości przedrostka prze- z tematami czasownikowymi. [An attempt at the description of the combinatory rules of the prefix prze- with
verb roots.] Polonica, 5, 59–74.
Grondelaers, S., Geeraerts, D., & Speelman, D. (2007). A case for a cognitive Corpus Linguistics. In M. Gonzalez-Marquez, I. Mittleberg, S. Coulson, & M. Spivey (Eds.), Methods in
Cognitive Linguistics (pp. 149–169). Amsterdam & Philadelphia: John Benjamins.
Jackendoff, R. (1983). Semantics and cognition. Cambridge, MA: MIT Press.
Janda, L. (1993). A geography of case semantics: The Czech dative and the Russian instrumental.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110867930
Junker, M.-O. (2003). A native American view of the ‘Mind’ as seen in the lexicon of cognition
in East Cree. Cognitive Linguistics, 14, 167–194. DOI: 10.1515/cogl.2003.007
Kokorniak, I., & Krawczak, K. (2010). Thinking about thinking: Constructions of Polish mental
verbs in discourse. Paper presented at Syntax in Cognitive Grammar, Częstochowa.
Krawczak, K., & Kokorniak, I. (2010). Verbs of cognition, their construal, and complementation in interactive events. Paper presented at the 4th International Conference of the German Cognitive Linguistics Association, Bremen.
Krawczak, K., & Fabiszak, M. (2011). Cognition verbs in Polish, their construal and complement semantics. Paper presented at the 3rd Conference of the Scandinavian Association for
Language and Cognition, Copenhagen.
Krawczak, K., & Kokorniak, I. (2012). A corpus-driven quantitative approach to the construal
of Polish think. Poznań Studies in Contemporary Linguistics, 48, 439–472.
DOI: 10.1515/psicl-2012-0021
Kustova, G. (2000). Niektóre problemy opisu predykatów mentalnych. [Some problems in the
description of mental predicates.] In R. Grzegorczykowa, & K. Waszakowa (Eds.), Studia z
semantyki porównawczej [Studies in comparative semantics] Vol. 1 (pp. 249–263). Warszawa: Wydawnictwo UW.
Lakoff, G., & Johnson, M. (1980). Metaphors we live by. Chicago: University of Chicago Press.
Langacker, R. (1991). Foundations of Cognitive Grammar. Vol. 2. Descriptive application.
Stanford: Stanford University Press.
Langacker, R. (1999). Losing control: Grammaticalization, subjectification, and transparency.
In A. Blank, & P. Koch (Eds.), Historical semantics and cognition (pp. 147–175). Berlin &
New York: Mouton de Gruyter.
Lindner, S. (1983). A lexico-semantic analysis of English verb particle constructions. Trier: LAUT.
Palmer, G. (2003). Talking about thinking in Tagalog. Cognitive Linguistics, 14, 251–280.
Pasich-Piasecka, A. (1993). Polysemy of the Polish verbal prefix prze-. In E. Górska (Ed.), Images from the cognitive scene (pp. 11–26). Kraków: Universitas.
Pawłowska, R. (1981). Znaczenie i użycie czasownika ‘myśleć’. [The meaning and use of the
verb ‘think’.] Polonica, 7, 149–160.
Piernikarski, C. (1975). Czasowniki z prefiksem po- w języku polskim i czeskim: Na tle rodzajów
akcji w językach słowiańskich. [Verbs with the po- prefix in Polish and Czech: In the background of Aktionsarten in Slavic languages.] Warszawa: PWN.
Przybylska, R. (2002). Struktura schematyczno-wyobrażeniowa prefiksu czasownikowego roz-.
[Image-schematic structure of the verbal prefix ‘roz-’.] Polonica, 21, 269–286.
Przybylska, R. (2006). Schematy wyobrażeniowe a semantyka polskich prefiksów czasownikowych
do-, od-, prze-, roz-, u-. [Image schemata and the semantics of Polish verb prefixes do-,
od-, prze-, roz-, u-.] Kraków: Universitas.
Radden, G., & Dirven, R. (2007). Cognitive English grammar. Amsterdam & Philadelphia: John
Benjamins. DOI: 10.1075/clip.2
Rudzka-Ostyn, B. (2000). Z rozważań nad kategorią przypadka. [Ruminating on the category of
case.] Kraków: Universitas.
Schlesinger, I. M. (1998). Cognitive space and linguistic case: Semantic and syntactic categories in
English. Cambridge: Cambridge University Press.
Śmiech, W. (1986). Derywacja prefiksalna czasowników polskich. [Prefix derivation of Polish
verbs.] Wrocław: Ossolineum.
Speelman, D., Tummers, J., & Geeraerts, D. (2009). Lexical patterning in a construction grammar: The effect of lexical co-occurrence patterns on the inflectional variation in Dutch
attributive adjectives. Constructions and Frames, 1, 87–118. DOI: 10.1075/cf.1.1.05spe
Szwedek, A. (2007). An alternative theory of metaphorisation. In M. Fabiszak (Ed.), Language
and meaning: Cognitive and functional perspectives (pp. 312–327). Frankfurt/Main: Peter
Lang.
Tabakowska, E. (2003a). Space and time in Polish: The preposition za and the verbal prefix za-.
In H. Cuyckens, T. Berg, R. Dirven, & K.-U. Panther (Eds.), Motivation in language: Studies
in honor of Günter Radden (pp. 153–177). Amsterdam & Philadelphia: John Benjamins.
Tabakowska, E. (2003b). The notorious Polish reflexive pronouns: A plea for Middle Voice.
Glossos 4. Retrieved from <http://www.seelrc.org/glossos/issues/4/tabakowska.pdf> [Accessed 9th November 2008].
Vendler, Z. (1967). Linguistics in philosophy. Ithaca, NY: Cornell University Press.
The semasiological structure of Polish myśleć ‘to think’ 251
Wierzbicka, A. (1992). Semantics, culture, and cognition: Universal human concepts in culture-specific configurations. Oxford: Oxford University Press.
Wierzbicka, A. (1996). Semantics: Primes and universals. Oxford: Oxford University Press.
Wierzbicka, A. (1997). Understanding cultures through their key words. Oxford: Oxford University Press.
Wierzbicka, A. (1999). Emotions across languages and cultures. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511521256
A multifactorial corpus analysis
of grammatical synonymy
The Estonian adessive and adposition peal ‘on’
Jane Klavan
University of Tartu
In the present study, both monofactorial and multifactorial corpus methods
are applied to the alternative use of the analytic adpositional construction and
the synthetic case construction in present-day written Estonian. A total of 600 examples
from the fiction sub-corpora of written Estonian were annotated for sixteen factors (seven morphosyntactic and nine semantic). In order to determine
which factors are most influential in determining the choice between the two
constructions, a logistic regression model was built to fit the data. The analysis
confirmed the statistical influence of the following factors: lexical complexity,
type of verb, word order, word class, and mobility. The results reported in this
study align with previous research, which has shown that case affixes are used to
express more abstract relations and adpositions more concrete ones.
Keywords: adpositional construction, case construction, Finno-Ugric
languages, logistic regression
1. Introduction
To express a spatial scene where a vase is located on a table, one can either use the
adessive case (example (1a)) or the adpositional construction with the postposition
peal ‘on’ (example (1b)) in Estonian:
(1) a. vaas on laual
       vase.sg.nom be-prs.3sg table.sg.ad
       ‘the vase is on the table’.
1. I am indebted to Dylan Glynn for his valuable comments and his extensive help with the
logistic regression analysis. Methodologically, the present chapter relies heavily on Glynn
(2007, 2010).
    b. vaas on laua peal
       vase.sg.nom be-prs.3sg table.sg.gen on.ad
       ‘the vase is on the table’.
Among other things, I will be taking a closer look at one of the central claims made
in the literature about the difference between the ways the two strategies, i.e. the synthetic vs. analytic way of expressing meaning, are used. It is claimed that case affixes
are used to express more abstract relations and adpositions to express more concrete
ones (e.g. Bartens 1978; Comrie 1986; Luraghi 1991: 66–67; Ojutkangas 2008; Hagège
2010: 37–38; Lestrade 2010). Adpositions are said to be semantically more specific
than cases and they are used to express the less predictable spatial meanings; cases,
on the other hand, are more abstract and used to express more frequent spatial meanings (Lestrade 2010). The chapter proceeds from the theoretical premises of both
Construction Grammar (Goldberg 1995, 2006) and Cognitive Grammar (Langacker
1987, 2008), where one of the basic general assumptions is that of no-synonymy –
if two constructions differ syntactically, they also differ either semantically
or pragmatically (Goldberg 1995: 67).
At present, there is no detailed corpus study on this topic and the few previous
studies that are available proceed from an approach based largely on introspection
(Rannat 1991; Vainik 1995). Accordingly, the main aim of the chapter is to carry out
a completely corpus-based analysis of the adessive case construction and the adpositional construction with the postposition peal ‘on’. The central focus will be on finding
out how significant different morphosyntactic and semantic factors are in determining the use of either the adessive constructions or the adpositional constructions, and
which factors are the most important ones. Both monofactorial and multifactorial
statistical techniques are used to analyse the results. The monofactorial analysis is
seen as a necessary and beneficial first step in the quantitative analysis of the results.
The purpose is to move from simpler, univariate exploratory analysis to more
complex but more powerful multivariate analysis. Monofactorial techniques make it
possible to identify which variables are statistically significant, but by employing such
a multivariate statistical technique as regression analysis, it is possible to determine
the contribution of the different semantic and morphosyntactic variables to the alternation and calculate the relative strength of each individual variable.
A number of researchers have successfully argued for the benefits of using multifactorial analyses (e.g. Gries 2003; Wulff 2003; Glynn 2007, 2010; Szmrecsanyi 2010).
Gries (2003: 155), among others, emphasises that “variation phenomena are too multifaceted to be treated adequately by means of minimal-pair tests and the researcher’s own judgements”. The chapter was inspired by numerous other studies on different
variation phenomena, including those specifically about the synthetic-analytic distinction (for dative alternation in English, see Bresnan et al. 2007, Bresnan and Ford
2010; for English comparative alternation, Mondorf 2003; for the English genitive
alternation, Rosenbach 2003, Szmrecsanyi 2010; for particle placement alternation in
English transitive phrasal verbs, Cappelle 2009).
The chapter consists of six sections. In the following section, an overview of the
functions of the Estonian adessive case and the adposition peal is given. Section 3
describes the data sample; Section 4 gives an overview of the variables coded and
presents the monofactorial results. The results of a multifactorial statistical (logistic
regression) analysis are presented and discussed in Section 5; the chapter ends with a
conclusion (Section 6).
2. The Estonian adessive case and the adposition peal ‘on’
2.1 The Estonian adessive case
Estonian nouns and adjectives decline in fourteen cases; six of these cases are referred
to as locative cases and can be divided into interior locative cases (illative, inessive,
elative) and external locative cases (allative, adessive, ablative) (see Table 1). The Estonian adessive case belongs to the set of external locative cases and expresses, first
and foremost, spatial or temporal relations. It normally takes the role of an adverbial
or attribute in the clause (Erelt et al. 1995: 58). Estonian external locative cases express spatial relations of an open surface and they form a three-part series – allative,
adessive, ablative – expressing direction, location and source, respectively (Erelt et al.
2007: 240; see Table 1). The present chapter focuses only on the adessive, although
the other forms of external locative cases are also said to be synonymous with the respective forms of the adposition peal ‘on’ (direction: allative -le ~ peale ‘onto’; source:
ablative -lt ~ pealt ‘off ’).
Table 1. Estonian locative cases as exemplified by the noun laud ‘table’

                      Interior                         Exterior
lative (direction)    illative  laua-sse ‘into table’    allative  laua-le ‘onto table’
locative (location)   inessive  laua-s ‘in table’        adessive  laua-l ‘on table’
separative (source)   elative   laua-st ‘out of table’   ablative  laua-lt ‘off table’

Although the primary meaning of the locative cases was the expression of spatial relations, in modern Estonian they fulfil a number of abstract functions. For instance, it is more frequent for the Estonian adessive to mark the possessor or agent (examples (2c) and (2d)) than, for example, location (example (2a)); the following functions of the Estonian adessive case are relevant for the present analysis (Erelt et al. 2007: 250):
(2) a. Location
       Vaas on laual.
       vase.sg.nom be-prs.3sg table.sg.ad
       ‘The vase is on the table’.
    b. Instrument
       Mari mängib klaveril mõnd lugu.
       Mari.nom play-prs.3sg piano.sg.ad some tune.sg.part
       ‘Mari plays some tunes on the piano’.
    c. Possessor
       Maril on kaks last.
       Mari.ad be-prs.3pl two child.sg.part
       ‘Mari has two kids’.
    d. Agent with finite verb forms
       See asi ununes mul kiiresti.
       this thing.sg.nom forget-prs.3sg me.sg.ad quickly
       ‘I quickly forgot about that thing’.
2.2 The Estonian adposition peal ‘on’
In addition to the locative cases, location and change of location in Estonian can be
expressed with adpositions, adverbs, and nouns declined in interior and exterior locative cases (Erelt et al. 1993: 71). In Estonian reference grammars, adpositions are
treated as uninflected words that are used together with nouns and express similar
meanings as case endings. In comparison with adpositions, the meaning of cases is
said to be much more abstract and the usage range much broader (Palmeos 1985;
Erelt et al. 1995: 33–34; Erelt et al. 2007: 191). This is in line with the general claims
made concerning the differences between adpositions and case affixes (Comrie 1986;
Hagège 2010; Lestrade 2010). Nevertheless, as is stressed in the following sub-section,
there are still instances where both the adessive case and the adposition peal ‘on’ are
seen as semantic alternatives in Estonian.
A distinctive morphological characteristic of Estonian adpositions is that, like locative cases, they constitute three-member sets that are semantically and grammatically
divided into the lative, locative, and separative forms (see Table 2).

Table 2. The three-member sets of Estonian postpositions sees ‘in’ and peal ‘on’

                      Interior                  Exterior
lative (direction)    illative  si-sse ‘into’     allative  pea-le ‘onto’
locative (location)   inessive  see-s ‘in’        adessive  pea-l ‘on’
separative (source)   elative   see-st ‘out of’   ablative  pea-lt ‘off’

The adposition peal ‘on’ takes external locative case endings: peale – peal – pealt. In the present chapter, only the locative form peal ‘on’ is discussed.
At the clause level, the Estonian adpositional phrase has two basic functions –
that of an adverbial and an adverbial modifier (Erelt et al. 1993: 137). The adposition peal
‘on’ is polysemous; relevant for the present study are the following senses:
(3) a. Location
       Leib on laua peal.
       bread.sg.nom be-prs.3sg table.sg.gen on.ad
       ‘Bread is on the table’.
    b. Place
       Turu peal oli suur sagimine.
       market.sg.gen on.ad be-pst.3sg big commotion.sg.nom
       ‘There was a big commotion on the market’.
    c. Instrument
       Mängi klaveri peal ette!
       play-imp.2sg piano.sg.gen on.ad ahead
       ‘Play something on the piano!’
2.3 The parallel use of the Estonian adessive and the adposition peal ‘on’
When comparing the meanings of the adposition peal ‘on’ (examples in (3)) to those
of the adessive (examples in (2)), it can be seen that these two forms are used as
alternatives to each other, especially in the functions of expressing location, place
and instrument. According to Palmeos (1985: 15), the analytic construction – genitive
together with the adposition peal ‘on’ – expresses the same meaning as the synthetic
adessive. At the same time, it has been claimed in Estonian reference grammars that
the meaning of adpositions is more concrete and specific than that of the cases (Erelt
et al. 2007: 191). This has also been mentioned by Palmeos (1985: 18), who notes that
the analytic construction conveys the meaning more clearly than the synthetic one.
This clarity of expression is partly due to the grammatical homonymy inherent in
the Estonian language – in some cases, when using the synthetic construction, it is
not clear whether we are expressing location or possession and sometimes, the use of
the adessive to express location is not possible because the possessive reading is too
strong.
Nevertheless, there are still numerous instances where both the adessive case and
the adposition peal ‘on’ can express more or less the same meaning. A small-scale
corpus analysis showed that in the 5-million-word fiction sub-corpus of the Balanced
Corpus of Estonian (2008), there are 314 different Landmarks used with both the
adessive case and the adposition peal ‘on’. Furthermore, similar results concerning the parallel use of these two constructions were obtained with an open production task that
studied the expression of different spatial relations in Estonian (Salm 2010). Klavan
et al. (2011) have reported the results of a forced choice task and a production task,
the aim of which was to determine which semantic factors play a role in the use of the
adessive and adposition peal ‘on’. The results of these studies indicate that the adessive
is used when there is an abstract relation between Trajector and Landmark and the
Landmark is a place; the adposition peal ‘on’ is used when there is an unconventional
spatial relation between Trajector and Landmark and when the Landmark is a thing.
However, the two tasks also yielded results where there was no significant difference
between the two locative constructions. The present study hopes to provide converging evidence and to shed new light upon this issue by building upon the previous
work, but using a different methodology (mono- and multifactorial corpus analysis)
and looking at other factors besides the semantic ones.
3. The data sample
The data analysed in this chapter come from two corpora of present-day written Estonian. A data sample of 300 instances of the Estonian adessive case from the fiction
sub-corpus of the Morphologically Disambiguated Corpus (2010; size 104,000 words)
and 300 instances of the Estonian adposition peal ‘on’ from the fiction sub-corpus of
the Balanced Corpus of Estonian (2008; size 5 million words) was collected. Based on
the findings from the previous variation and spatial expression studies cited above,
the data were manually coded for multiple variables or ‘predictors’, which are outlined
in the following section. It is worth pointing out that in order to get the 300 suitable instances of the adessive case, it was necessary to manually work through 1,700
instances of this construction; the reason being that the adessive fulfils a number of
other functions in Estonian besides expressing spatial relations.
4. Corpus-linguistic operationalizations and monofactorial results
The original coding schema included more than 30 variables; in the present analysis
only the sixteen most important ones are discussed. These sixteen variables can be
divided into two groups and include the following: seven morphosyntactic factors
(length of the Landmark phrase, lexical complexity of the Landmark, word order, verb
lemma, syntactic function of the Landmark phrase, word class of Landmark and Trajector) and nine semantic factors (type of relation between Landmark and Trajector,
type of Landmark, animacy, number and mobility of Landmark and Trajector, relative
size between Landmark and Trajector). The following sub-sections give an overview
of the operationalizations of these variable groups and their different levels.
The significance of each variable was also tested individually and these results are
presented directly after the presentation of the respective variable. The monofactorial
analysis of the results relies on Gries (this volume) and predominantly makes use of
the Chi-squared test to evaluate the raw frequency counts encountered in the corpus.
Following Gries (ibid.), I also computed the effect sizes and inspected the Pearson
residuals for each variable in order to determine how strong the results of the Chi-squared tests are and what exactly determines the difference between the frequency
counts. As Gries (ibid.) points out, the effect size theoretically ranges from 0 (‘no
effect’) to 1 (‘perfect correlation’).
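The monofactorial procedure just described (a Chi-squared test, an effect size, and an inspection of the Pearson residuals) can be sketched in a few lines of code. The following Python function is an illustrative reconstruction, not the chapter's own script, and the function and variable names are my own. Applied to the counts reported below for Table 12 (abstract vs. spatial relations), it reproduces the χ²(1) = 5.04 given in Section 4.2.1.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic, effect size (Cramer's V),
    and cell-wise Pearson residuals for a table of raw frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res = (observed - expected) / math.sqrt(expected)
            res_row.append(res)
            chi2 += res ** 2
        residuals.append(res_row)
    # For a 2 x 2 table Cramer's V reduces to the phi coefficient
    k = min(len(row_totals), len(col_totals))
    v = math.sqrt(chi2 / (n * (k - 1)))
    return chi2, v, residuals

# Table 12 counts: relation type (abstract/spatial) by construction (adessive/peal)
chi2, v, res = chi_squared([[49, 71], [251, 229]])
print(round(chi2, 2))  # 5.04
```

For a 2 × 2 table, Cramér's V equals the φ coefficient reported throughout the chapter; the sign of each residual shows which construction is over-represented in that cell.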
4.1 Morphosyntactic factors
4.1.1 Length of the Landmark phrase
Previous studies on grammatical alternation have shown that the more explicit form
is favoured in cognitively complex environments (Mondorf 2003: 294). This phenomenon has also been referred to as Rohdenburg’s complexity principle (ibid.). Thus, given the option of using either a synthetic or an analytic locative construction, longer words and phrases have been found to favour the analytic variant and shorter ones the synthetic variant (Cooper and Ross 1975; Hawkins 1994; Wasow 1997; Arnold et al. 2000; Mondorf 2003; Wulff 2003). Mondorf (2003: 251–253) argues for a presumably universal tendency, a phenomenon that she terms analytic support:
In cognitively more demanding environments which require an increased processing load, language users – when faced with the option between a synthetic
and analytic variant – tend to compensate for the additional effort by resorting
to the analytic form.
Although Mondorf (2003) focuses on the English comparative construction, the
claim is that this compensatory strategy can be extended to other kinds of variation
phenomena that draw on the synthetic-analytic distinction. Since length constraints
have been found to affect other syntactic variation phenomena as well (e.g. Cooper
and Ross 1975; Hawkins 1994; Wasow 1997; Arnold et al. 2000; Wulff 2003), it was
decided to code the present data for the length of the locative phrase. Measures of
syntactic complexity can be efficiently operationalized by counting the number of
graphemic words (Bresnan and Ford 2010). For the present analysis, length of the
Landmark phrase was measured both in words and syllables. In line with Bresnan and
Ford (2010: 9, fn. 8), length was transformed by the logarithm in order to compress
extreme values and bring the distribution more closely into the logistic regression
model assumption of linearity.
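To illustrate the effect of the transformation (with made-up lengths, not the actual data): the logarithm compresses the long tail of the length distribution, so that rare very long phrases exert less leverage on the regression estimates.

```python
import math

# Hypothetical Landmark-phrase lengths in words (illustrative only)
lengths = [1, 1, 2, 2, 3, 7]
logged = [math.log(n) for n in lengths]

# The raw scale spans 1-7 words; the log scale compresses the extreme
# value, pulling the outlying 7-word phrase closer to the bulk of the data.
print(max(lengths) - min(lengths))          # 6
print(round(max(logged) - min(logged), 2))  # 1.95
```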
The results show that the mean length of the Landmark phrase with the adessive (1.95) differed highly significantly from the mean length of the Landmark phrase with the postposition peal ‘on’ (1.46), t(78) = 4.78, p < .001. As seen from Table 3, the adessive was predominantly used with Landmark phrases that were two or more words long and the postposition peal with Landmark phrases that were one word long (examples (4a) and (4b)). If we take it that a longer Landmark phrase implies a cognitively more complex environment (cf. Wasow 1997; Cappelle 2009: 149), Rohdenburg’s complexity principle does not hold here, i.e. the more explicit form (the postposition peal) was not favoured in a cognitively more complex environment.

Table 3. Length of the Landmark phrase in words

Number of words in the Lm phrase    Adessive    peal    Total
1 word                                   114     192      306
2 words                                  123      88      211
3 words                                   41      13       54
4 words                                   15       5       20
5 words                                    2       2        4
6 words                                    2       0        2
7 words                                    3       0        3
Total                                    300     300      600
(4) a. ILU1980\stkt0001:
Pihlakad kasvasid sellelsamal põllupeenral.
rowan.pl.nom grow-pst.3plthis.sg.ad flowerbed.sg.ad
‘The rowans were growing on this flower bed’.
b. MJ_A:
Kai peal oli meeletu trügimine…
pier.sg.gen on.ad be-pst.3sg mad.sg.nom pushing.sg.nom
‘On the pier people were pushing madly…’
According to Mondorf (2003: 254), who discusses the English comparative construction, one of the effects of the so-called more-support that reflects the general analytic
support is that:
a separate lexeme as degree marker rather than an inflectional suffix can serve
both as an unambiguous signal indicating increased processing load to the reader
and as a less condensed and more explicit way of structuring a complex phrase.
Since the synthetic variant in -er allows recognition only after the adjective and its inflection have been processed, complex environments should call for early recognition
and hence the analytic variant would be used in English for the comparative construction (Mondorf 2003: 255). However, in the case of the Estonian adessive and the adposition peal ‘on’, this signalling argument does not work because both the adessive case
marker and the postposition peal ‘on’ follow the locative phrase (see examples (4a)
and (4b)). Thus, quite the opposite effect may play a role in Estonian. Precisely because, in the case of Estonian locative phrases, the postposition only comes at the end of the phrase, the locative case is a better signal that what we have is a long locative phrase expressing a support relation – in Estonian, all of the words in the adessive locative phrase are marked for the adessive case, as sellelsamal ‘this’ in example (4a) above.

In addition, the results also show that the mean length of the Landmark phrase in syllables was significantly different with the adessive (4.87) and the postposition peal (3.15), t(435) = 7.18, p < .001. The proportion of peal ‘on’ uses with Landmark phrases one or two syllables long was considerably higher than the proportion of adessive uses with Landmark phrases of the same length; when the Landmark phrase was more than 10 syllables long, the proportion of adessive occurrences was much greater than that of peal ‘on’ (Table 4).

Table 4. Length of the Landmark phrase in syllables

Number of syllables in the Lm phrase    Adessive    peal    Total
1 syllable                                    7      33       40
2 syllables                                  64     125      189
3 syllables                                  61      31       92
4 syllables                                  45      59      104
5 syllables                                  32      17       49
more than 5 syllables                        91      35      126
Total                                       300     300      600
4.1.2 Lexical complexity of Landmark
It was decided to also code the Landmark for what has been termed here as lexical
complexity, with the levels of ‘compound’ and ‘single lexeme’, e.g. writing desk vs.
desk. Table 5 shows that the number of occurrences where the Landmark word was
a compound was significantly higher with the adessive (88 instances) than with the
adposition peal ‘on’ (17 instances), χ²(1, N = 600) = 56.57, p < .001, φ = 0.31. This gives
further indication of the adposition peal being preferred with shorter, less complex
Landmark phrases, while the adessive tends to be used with longer and more complex
Landmark phrases.
Table 5. Lexical complexity of Landmark

Landmark word     Adessive    peal    Total
compound               88       17      105
single lexeme         212      283      495
Total                 300      300      600
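As a check on the arithmetic, the χ² = 56.57 reported for Table 5 is consistent with a continuity-corrected (Yates) Chi-squared computed on the 2 × 2 counts. The sketch below is a reconstruction under that assumption, not the chapter's own code; it reproduces both the statistic and the effect size φ = 0.31.

```python
import math

def yates_chi_squared(table):
    """Chi-squared with Yates' continuity correction for a 2x2 table,
    plus the phi coefficient as effect size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (abs(observed - expected) - 0.5) ** 2 / expected
    return chi2, math.sqrt(chi2 / n)

# Table 5 counts: compound vs. single-lexeme Landmarks by construction
chi2, phi = yates_chi_squared([[88, 17], [212, 283]])
print(round(chi2, 2), round(phi, 2))  # 56.57 0.31
```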
Table 6. Position of the locative phrase within the clause

Position of locative phrase    Adessive    peal    Total
final                               129     111      240
initial                              83      83      166
middle                               88     106      194
Total                               300     300      600
4.1.3 Word order: Position of the locative phrase
Several researchers have discussed the principle of end-weight in relation to grammatical variation (Wasow 1997; Cappelle 2009: 149–150). This principle states that “long, complex phrases tend to come at the ends of clauses” (Wasow 1997: 81). However, it is not entirely clear what is meant by ‘weight’: it can be either length or complexity (cf. Wasow 1997; Cappelle 2009: 149). In the present chapter it is simply assumed that, because the analytic adpositional construction contains the extra lexeme peal ‘on’, it creates a heavier constituent and thus weighs more than the synthetic adessive case. However, it should be noted that length is only
one aspect of how the principle of end-weight can be operationalized. For the present
study, the word order factor was coded according to the position of the locative phrase
in the clause with the levels of ‘final’, ‘initial’ and ‘middle’. Estonian is a language with
a relatively free word order and the locative phrases can come at the beginning of a
clause (in which case the clause is referred to as an existential clause in Estonian reference grammars), at the end of a clause or in the middle of a clause.
However, there was no significant difference between the two construction types
in the present dataset (see Table 6), χ²(2, N = 600) = 3.02, p = .22. Contrary to the
purported principle of end-weight, the locative construction with the adessive case
was slightly more frequent in the final position. This can be, at least partly, explained
by the fact that the mean length of the locative constructions with the adessive was
longer than the mean length of the locative constructions with the postposition peal
‘on’ and that these longer Landmark phrases with the adessive predominantly occur
in the final position within a clause (see Section 4.1.1 above).
4.1.4 Word order: Loc_Nom vs. Nom_Loc
In addition to the position of the locative phrase within the clause, it was decided to
code whether the locative phrase follows or precedes the Trajector phrase. This factor
has two levels – ‘Nom_Loc’ and ‘Loc_Nom’. ‘Loc’ refers to the locative phrase and
‘Nom’ to the Trajector phrase. The relevant frequencies are given in Table 7. In general, it can be seen that the preferred word order is such that the locative phrase follows the Trajector (393 occurrences in total). Although the raw frequencies indicate
that when the locative phrase precedes the Trajector, the adessive is used more often
than the adposition peal ‘on’, statistically these results are not significant, χ²(1, N = 600) = 1.66, p = .19.

Table 7. Word order: Loc_Nom vs. Nom_Loc

Loc_Nom vs. Nom_Loc    Adessive    peal    Total
Loc_Nom                     111      96      207
Nom_Loc                     189     204      393
Total                       300     300      600

Table 8. Verb lemmas used with locative constructions

Verb lemma         Adessive    peal    Total
action verbs            113      91      204
existence verbs          40      85      125
motion verbs             46      35       81
posture verbs            51      45       96
no verb                  50      44       94
Total                   300     300      600
4.1.5 Verb lemma
Every instance of the adessive and the adposition peal ‘on’ was coded for the verb lemma used in these sentences. In total, there were 212 different verbs used with these locative constructions. The verbs were subcategorised into different groups based largely
on Levin (1993) and include the following levels: ‘action verbs’ (e.g. tegema ‘to do’),
‘existence verbs’ (e.g. olema ‘to be’), ‘motion verbs’ (e.g. jooksma ‘to run’), and ‘posture
verbs’ (e.g. istuma ‘to sit’). In addition, this factor also had the level of ‘no verb’ which
was used for elliptical sentences where no overt verb lemma was expressed. The raw
frequencies of these verb groups are given in Table 8. The Chi-squared test revealed
that the frequencies of the two constructions significantly differed by verb lemma
(χ²(4, N = 600) = 22.76, p < .001, φ = 0.19). The Pearson residuals show that existence
verbs determine the difference between these frequency counts – the adpositional
construction is significantly more often used with existence verbs like olema ‘to be’,
asuma ‘to be situated’, and asetsema ‘to be placed’.
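How the Pearson residuals single out the existence verbs can be retraced from the Table 8 counts. The following sketch is again an illustrative reconstruction (the variable names are my own): the existence-verb cells yield the largest absolute residuals, roughly ±2.85, well beyond the conventional ±1.96 threshold.

```python
import math

# Table 8 counts: verb group by construction (adessive, peal)
table = {
    "action":    (113, 91),
    "existence": (40, 85),
    "motion":    (46, 35),
    "posture":   (51, 45),
    "no verb":   (50, 44),
}
col_totals = [sum(v[i] for v in table.values()) for i in (0, 1)]  # 300, 300
n = sum(col_totals)

# Pearson residual per cell: (observed - expected) / sqrt(expected)
residuals = {}
for group, counts in table.items():
    row_total = sum(counts)
    residuals[group] = tuple(
        (obs - row_total * col / n) / math.sqrt(row_total * col / n)
        for obs, col in zip(counts, col_totals)
    )

# The largest |residual| belongs to the existence verbs (peal preferred)
largest = max(residuals, key=lambda g: abs(residuals[g][0]))
print(largest, round(residuals["existence"][1], 2))  # existence 2.85
```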
4.1.6 Syntactic function of the Landmark phrase
Both the adessive and the adpositional construction can fulfil two syntactic functions – that of an adverbial and a modifier. It was therefore decided to code the syntactic function of the locative phrase in the dataset with precisely these two levels:
‘adverbial’ and ‘modifier’. The results show that there is a modest, but significant difference (Table 9), χ²(1, N = 600) = 5.78, p = .016, φ = 0.09. Although both the adessive
case and the postposition peal ‘on’ predominantly fulfil the function of an adverbial, the adessive case is slightly more frequently used in the function of a modifier, as in example (5).

Table 9. Syntactic function of the Landmark phrase

Syn. function of Lm phrase    Adessive    peal    Total
adverbial                          264     281      545
modifier                            36      19       55
Total                              300     300      600
(5) ILU1980\stkt0023:
Paigad küünarnukkidel ja hõlmadel tükkisid lahti…
patch.pl.nom elbow.pl.ad and flap.pl.ad start-pst.3pl apart
‘Patches on the elbows and flaps started to come apart…’
4.1.7 Word class of Landmark and Trajector
Different expression types have been found to affect the choice of syntactic alternatives (see, for example, Bresnan and Ford 2010). Both Landmarks and Trajectors were
coded for the following types: ‘noun’, ‘pronoun’, ‘verb phrase’. The majority of Landmarks used with the adessive case and the postposition peal ‘on’ are noun phrases
(Table 10). However, when the Landmark is a pronoun, the postposition peal ‘on’ is
used more frequently. This difference is significant – χ²(1, N = 600) = 10.10, p = .001,
φ = 0.13. This result is related to the variable length of the Landmark (see Section 4.1.1
above). Pronouns are short words and these results reflect, once again, the tendency to
use the postposition peal ‘on’ with shorter Landmarks.
Table 10. Word class of Landmark

Word class of Landmark    Adessive    peal    Total
noun                           292     274      566
pronoun                          8      26       34
Total                          300     300      600

Table 11. Word class of Trajector

Word class of Trajector    Adessive    peal    Total
noun                            210     166      376
pronoun                          49      82      131
verb phrase                      41      52       93
Total                           300     300      600
The majority of Trajectors used with both locative constructions are also noun
phrases (Table 11). Nevertheless, curiously, the same tendency to use the adpositional construction with pronouns occurs and the difference is also significant,
χ²(2, N = 600) = 14.76, p < .001, φ = 0.16.
4.2 Semantic variables
Numerous cognitive-functional studies on spatial language expressions have shown
that various properties of Trajector and Landmark participating in the locative construction influence the use of spatial expressions (e.g. Talmy 1983; Herskovits 1986;
Vandeloise 1991; Feist and Gentner 2003; Coventry and Garrod 2004; Carlson and
Van der Zee 2005). In the vein of this research tradition, various semantic properties
of Landmarks and Trajectors were coded in the present data.
4.2.1 Type of relation between Landmark and Trajector
It has been suggested in previous work on cases and adpositional constructions that
cases are semantically more abstract than adpositions (Bartens 1978; Comrie 1986;
Luraghi 1991: 66–67; Ojutkangas 2008; Hagège 2010: 37–38; Lestrade 2010). Both the
Estonian adessive and the adposition peal ‘on’ can express spatial and abstract relations between a Trajector and a Landmark. This variable was coded in the dataset with
the levels of ‘abstract’ and ‘spatial’ in order to see if this general assumption of cases
expressing abstract relations was also borne out in the present data. A relation was
coded abstract when either the Trajector or Landmark was abstract or the relation
itself was abstract, i.e. if there was a meaning transfer. There was a marginally significant
difference, χ²(1, N = 600) = 5.04, p = .02, φ = 0.10; indeed, the adposition peal ‘on’ is
more frequent with abstract relations in the present dataset (Table 12). However, one
must bear in mind here that for the present analysis, only such occurrences where the
alternation between the adessive and the adposition peal ‘on’ is possible were looked
at. If we compare the general usage of these two constructions, we can easily see that
the adessive case expresses abstract functions, where the use of an adposition is not
possible (cf. Section 2 above, examples (2c) and (2d)).
Table 12. Type of relation between Landmark and Trajector

Relation type    Adessive    peal    Total
abstract              49      71      120
spatial              251     229      480
Total                300     300      600
4.2.2 Type of Landmark
It can also be predicted that there is a general difference between what types of Landmarks are used together with either the Estonian adessive case or the adposition peal
‘on’. Accordingly, Landmarks in the dataset were coded for their type, the levels of
which were ‘location’ (e.g. street) and ‘object’ (e.g. table). The rationale for distinguishing between small, easily manipulable objects and large static objects or locations is that a location should lend itself more easily to abstraction and hence is more likely to be used with the adessive (Bartens 1978). On the other hand, as has been put forward in previous studies, adpositions are more concrete and specific than cases, and they convey the spatial location of an object more clearly; they should thus be more frequent with small, easily manipulable objects as Landmarks (Bartens 1978; Palmeos 1985: 18; Comrie 1986; Luraghi 1991: 66–67; Ojutkangas
2008; Hagège 2010: 37–38; Lestrade 2010). The results of the Chi-squared test indicate that the frequency counts of the adessive and the adposition peal ‘on’ significantly
differed by the type of Landmark (χ²(1, N = 600) = 11.21, p < .001, φ = 0.14) – the
adessive tends to be used when the Landmark is a location and the adposition peal ‘on’
when it is an object (Table 13).
4.2.3 Animacy of Landmark and Trajector
Since animacy is considered to be a very important cognitive category and is discussed in numerous linguistic and psycholinguistic studies (for overviews, see, for
example, de Vega et al. 2002: 121–122; Feist and Gentner 2003: 2; Bresnan and Ford 2010: 10), it was decided to code this category for the Estonian adessive and adposition peal ‘on’ dataset as well. This variable has only two levels – ‘animate’ and ‘inanimate’. Unsurprisingly, the results show that the adposition peal ‘on’, rather than the
adessive, is used in cases of animate Landmarks (Table 14).
Table 13. Type of Landmark

Landmark    Adessive    peal    Total
location         169     128      297
object           131     172      303
Total            300     300      600
Table 14. Animacy of Landmark

Animacy of Landmark    Adessive    peal    Total
animate                       2      17       19
inanimate                   298     283      581
Total                       300     300      600
A multifactorial corpus analysis of grammatical synonymy 267
The results of the Fisher’s exact test revealed a significant difference – p < .001; the
odds ratio is 0.11. However, it must once again be emphasised here that the adessive has another important function in Estonian besides expressing space – that of expressing possession (see example (2c) above). Since animate objects are very apt to possess things, the combination of an animate Landmark and the adessive case fulfils the function of the possessive construction. Therefore, if there is a need in Estonian to talk about an object placed on top of an animate Landmark, this would be
expressed with the adposition peal ‘on’.
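The sample odds ratio of 0.11 reported above follows directly from the cell counts in Table 14; a minimal illustration (in Python rather than the R used for the chapter's analysis):

```python
def odds_ratio(table):
    """Sample (cross-product) odds ratio for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Table 14: rows = animate/inanimate Landmarks, columns = (adessive, peal)
print(round(odds_ratio([[2, 17], [298, 283]]), 2))  # 0.11
```

An odds ratio well below 1 means the odds of the adessive are much lower with animate Landmarks, i.e. animate Landmarks strongly favour the adposition. The exact p-value of Fisher's test itself requires summing hypergeometric probabilities (e.g. scipy.stats.fisher_exact returns both the odds ratio and the p-value).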
There was no significant difference between the adessive and the adposition peal
‘on’ for the variable animacy of Trajector (Table 15), χ²(1, N = 600) = 0.44, p = .51. The
data do, however, confirm the general tendency of Trajectors to be more frequently
animate than inanimate.
4.2.4 Number of Landmark and Trajector
Another cognitively and typologically important category in grammar is number
(Greenberg 1966), which also plays a role in certain grammatical variations (e.g.
Bresnan and Ford 2010: 179). In the dataset, both Landmarks and Trajectors were
coded either ‘plural’ or ‘singular’. When context and formal plural marking were in
conflict or when there was ambiguity, the analysis proceeded from the context. The
results (Table 16) show that there is no difference between the two locative constructions. Although the proportion of plural Landmarks is a little higher with the adessive
construction, this difference is not significant: χ²(1, N = 600) = 2.84, p = .09.
Interestingly, there is a difference in the use of the adessive and the adposition
peal ‘on’ according to whether the Trajector is singular or plural – the proportion of
adessive occurrences with a plural Trajector is higher than the proportion of the adposition peal ‘on’ occurrences with a plural Trajector (Table 17) and this difference is
significant: χ²(1, N = 600) = 4.03, p = .04, φ = 0.08. This may be due to an interaction
Table 15.╇ Animacy of Trajector
Animacy of Trajector
Adessive
peal
Total
animate
inanimate
Total
167
133
300
175
125
300
342
258
600
Table 16. Number of Landmark

Number of Landmark    Adessive    peal    Total
plural                      40      27       67
singular                   260     273      533
Total                      300     300      600
Table 17. Number of Trajector

Number of Trajector    Adessive    peal    Total
plural                      94      72      166
singular                   206     228      434
Total                      300     300      600
Table 18. Mobility of Landmark

Mobility of Landmark    Adessive    peal    Total
mobile                       116     178      294
static                       184     122      306
Total                        300     300      600
with another variable – type of Landmark.2 Since the adessive tends to be used with
locations and since locations are large, immobile objects such as streets or fields, one
would expect to find more than one object (Trajector) located on such large locations,
e.g. people/cars on the street vs. a person/a car on the street. Indeed, a little more than
half of the occurrences with a plural Trajector have location as the type of Landmark
in the dataset.
4.2.5 Mobility of Landmark and Trajector
Following de Vega et al. (2002), both Landmarks and Trajectors in the dataset were
coded for the variable mobility, the levels of which were ‘mobile’ and ‘static’. Mobile
objects are those that do not have a fixed position in the environment, either because
they move by themselves (e.g. humans, animals) or can be moved by an external agent
(e.g. a table). Static objects (the majority of which in the dataset are also locations, but
not all) have a fixed position in the environment (e.g. streets, trees).
It can be seen from Table 18 that the adessive very frequently occurs with a static
Landmark and the adposition peal ‘on’ with a mobile Landmark; this difference is also
highly significant: χ²(1, N = 600) = 25.64, p < .001, φ = 0.21. There was no significant
difference between these two constructions for the mobility of Trajector, χ²(1, N =
600) = 0.15, p = .69 (Table 19).
4.2.6 Relative size between the Landmark and the Trajector
In Cognitive Linguistic analyses of spatial expressions, it has been claimed that Landmarks tend to be larger than Trajectors (Talmy 1983; Herskovits 1986; Langacker
1987; Vandeloise 1991). In order to validate this claim and to see whether this factor
influences the use of the Estonian adessive case and the postposition peal ‘on’, the
2. I am indebted to Krista Ojutkangas for this suggestion (p.c.).
Table 19. Mobility of Trajector

Mobility of Trajector    Adessive    peal    Total
mobile                        268     265      533
static                         32      35       67
Total                         300     300      600
Table 20. Relative size between the Landmark and the Trajector

Relative size between Tr and Lm    Adessive    peal    Total
conventional                            193     140      333
same                                     58      95      153
unconventional                           49      65      114
Total                                   300     300      600
relative size between the Landmark and Trajector was coded either as ‘conventional’
(Landmark > Trajector), ‘same’ (Landmark = Trajector) or ‘unconventional’ (Landmark < Trajector). The results indicate that, in general, Landmarks do indeed tend to be larger than Trajectors. Moreover, there is a difference between the adessive and the
adposition peal ‘on’. The adessive is used when the Trajector is smaller than the Landmark and the adposition peal ‘on’ when the Trajector and the Landmark are of the
same size or when the Trajector is bigger than the Landmark (Table 20). This difference is significant: χ²(2, N = 600) = 19.63, p < .001, φ = 0.18.
4.3 Summary of the variables
Many of the variables described above and put forward in the literature on variation were confirmed to have a significant effect on the use of the Estonian adessive and the adposition peal ‘on’. Table 21 summarises all the variables argued to contribute to the alternation between the Estonian adessive and the adposition peal ‘on’; the p-values and effect sizes were obtained using the Chi-square test (see Gries, this volume).
Out of the sixteen variables discussed, nine were highly significant, three marginally significant and four were not significant. The highly significant factors were the
length of the Landmark phrase (in both words and syllables), the lexical complexity
of Landmark, the verb lemma, word class of Landmark and Trajector, type, animacy
and mobility of Landmark, and the relative size between Trajector and Landmark. The
factors that were marginally significant were the syntactic function of the Landmark
phrase, type of relation between Trajector and Landmark, and the number of Trajector; the factors that were not significant include the following: word order, animacy of
Trajector, number of Landmark, and mobility of Trajector.
Table 21. Variables that are argued to contribute to the alternation between the Estonian adessive and the adposition peal ‘on’

Variable name                                  Level for the adessive   Level for the adpositional   p-value   Effect size
                                               construction             construction
Length of Lm phrase in words                   2 or more words          1 word                       < .001    –
Length of Lm phrase in syllables               4 or more syllables      1–3 syllables                < .001    –
Lexical complexity of Landmark                 compound                 single lexeme                < .001    0.3
Word order: Position of the locative phrase    –                        –                            .22       –
Word order: Nom_Loc/Loc_Nom                    –                        –                            .19       –
Verb lemma                                     action verbs             existence verbs              < .001    0.2
Syntactic function of Lm phrase                modifier                 –                            .02       0.1
Word class of Lm                               noun                     pronoun                      < .001    0.1
Word class of Tr                               –                        pronoun                      < .001    0.2
Type of relation between Tr & Lm               –                        abstract                     .03       0.1
Type of Lm                                     location                 object                       < .001    0.1
Animacy of Lm                                  –                        animate                      < .001    0.1
Animacy of Tr                                  –                        –                            .51       –
Number of Lm                                   –                        –                            .09       –
Number of Tr                                   plural                   –                            .04       0.1
Mobility of Lm                                 static                   mobile                       < .001    0.2
Relative size Tr & Lm                          Lm > Tr                  Lm = Tr; Lm < Tr             < .001    0.2
One of the most surprising results among the less significant factors was the variable word order. Taking into consideration previous studies on grammatical variation,
specifically the proposed end-weight principle (see Section 4.2.1), it was predicted that
since the adposition peal ‘on’ adds an extra word to the locative phrase, thus making
the whole locative phrase longer, it would prefer the final position within the clause.
Instead, the locative phrases with the adessive occurred slightly more frequently in
the final position. A possible explanation is the interaction between the length of the
Landmark phrase and word order – longer Landmark phrases were used with the
adessive in the dataset. It is clear, however, that this issue needs further research.
Another morphosyntactic variable that seems to play at least a marginal role in
the alternation between the two locative constructions is the syntactic function of
the locative phrase. The first function is considerably more frequent, but if the locative phrase is used in the modifier function, it tends to be expressed by the adessive rather than by the adposition. The results also show that there was a tendency for the
adessive to occur when the Landmark was a location, and for the adposition peal ‘on’
to occur when it was an object. Unexpectedly, the plurality of the Trajector stood out –
with plural Trajectors, the adessive construction was frequent. This interacts with the
previous factor – type of Landmark – the adessive occurring with locations as Landmarks. Locations, in turn, tend to imply more than one Trajector.
Out of the nine highly significant factors, five were morphosyntactic and four
were semantic factors. Length of the Landmark phrase proved to be highly significant
in a number of ways. First and foremost, the mean scores for the length of Landmark
phrase both in words and syllables were different for the adessive and the adposition
peal ‘on’. If the Landmark phrase is two or more words or four or more syllables long, the adessive is used; if the Landmark phrase is composed of only one word or is three or fewer syllables long, the adposition peal ‘on’ is used. The tendency to use the adposition with shorter Landmarks was also illustrated by the factor lexical complexity of the Landmark phrase – single lexemes were used with the adposition peal ‘on’ and compound words with the adessive. Unexpectedly, there was a significant difference between the two constructions in the type of verb used – the adposition peal ‘on’ was frequently used with existence verbs. Another unexpected significant
factor was the word class of Trajector – the adposition peal ‘on’ was frequently used
with pronominal Trajectors.
From the semantic factors, type, animacy and mobility of Landmark, and the
relative size between Trajector and Landmark were highly significant. Due to the fact
that animate Landmarks with the adessive case express the possessor, the spatial support relation with animate Landmarks is expressed with the adposition peal ‘on’. The
factor mobility of Landmark interacts with the type of Landmark – the adessive is
used with static Landmarks, which in many cases are locations, and the adposition
peal ‘on’ with mobile Landmarks, which in many cases are objects. Furthermore, the
adessive is used when the Landmark is bigger than the Trajector; when the Landmark
and the Trajector are of the same size or when the Trajector is bigger than the Landmark, the adpositional construction was more frequent.
Although the results of the monofactorial analysis indicate that a number of factors are significant in the alternation between the Estonian adessive and the adposition peal ‘on’, this way of analysing the data is not sufficient on its own. When speakers
use either of these locative constructions, they probably do not consider the value
of one factor only – in actual language use, all of the factors interact simultaneously
and need to be analysed as such. Therefore, a multifactorial approach is necessary to
determine which of the variables are more decisive and predictive for the choice of the
construction. In the following section, I will present the results of a logistic regression
analysis.
5. Multifactorial results. Logistic regression analysis
Multiple logistic regression (Glynn 2007: 241–275; Baayen 2008: 195–208) is used to quantify the contribution of the factors presented above to the alternative use of the Estonian adessive and the adposition peal ‘on’. Being a confirmatory modelling technique, regression analysis gives probabilistic scores and calculates the explanatory power of the model (Glynn 2010: 257). I will not discuss here the details that lie behind this statistical technique. The interested reader is referred elsewhere in this volume for a detailed overview of what the model does, what its inner workings are, and its weaknesses and strengths; Glynn (2007, 2010), for example, gives an introduction to the mechanics of this technique.
The results of the monofactorial analyses showed that many different factors were
important in determining the difference between the adessive and the adpositional
constructions. All of these factors were included in the regression modelling, and several models were run with various combinations of these factors. Due to concerns about multicollinearity, a number of factors could not be entered into the model simultaneously; this in turn increased the total number of models created. After comparing a range of models, the most significant and explanatory model was selected. The analysis was performed in R (version 2.10.1) and the model is presented below:
Binomial Logistic Regression

Locative Phrase ~ Verb_Lemma + WO_LocNom + LM_LexComp + LM_WC + LM_Mobility

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1839  -1.0214   0.2539   0.9923   2.2529

Coefficients:
                      Estimate  Std.Error  z value  Pr(>|z|)
(Intercept)            -1.0388     0.3293   -3.155   0.00161  **
Verb_Lemma Existence    1.4509     0.3347    4.334  1.46e-05  ***
Verb_Lemma Action       0.3116     0.2913    1.070   0.28471
Verb_Lemma Motion       0.1471     0.3485    0.422   0.67304
Verb_Lemma Posture      0.3205     0.3334    0.962   0.33626
WO_LocNom Nom_Loc       0.5132     0.2061    2.490   0.01277  *
LM_LexComp Comp        -2.2415     0.3244   -6.911  4.83e-12  ***
LM_WC Pron              2.0282     0.6405    3.167   0.00154  **
LM_Mobility Mobile      0.9781     0.1929    5.070  3.97e-07  ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 822.06 on 592 degrees of freedom
Residual deviance: 676.47 on 584 degrees of freedom
AIC: 694.47

Number of Fisher Scoring iterations: 4

Summary of Model               Predictive Power of Model
D.f.: 8                        C: 0.764
Model L.R.: 145.59             Pseudo R²: 0.29
P: 0                           Somers’ Dxy: 0.527
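The coefficient estimates translate into predicted probabilities via the logistic function: the applicable coefficients are summed and the sum is mapped onto (0, 1). The sketch below (in Python, not the R used for the chapter's analysis) assumes, as the chapter states, that positive log-odds predict the adpositional construction, and that the intercept absorbs the reference level of each factor; the feature labels are illustrative, not the actual R factor codings.

```python
import math

# Coefficient estimates copied from the fitted model above; the intercept is
# assumed to correspond to the reference levels of all factors.
coefs = {
    "Intercept": -1.0388,
    "Verb_Lemma=Existence": 1.4509,
    "WO_LocNom=Nom_Loc": 0.5132,
    "LM_LexComp=Comp": -2.2415,
    "LM_WC=Pron": 2.0282,
    "LM_Mobility=Mobile": 0.9781,
}

def predict_peal(features):
    """Predicted probability of the adpositional (peal) construction,
    assuming positive log-odds favour peal, as stated in the text."""
    logit = coefs["Intercept"] + sum(coefs[f] for f in features)
    return 1 / (1 + math.exp(-logit))

# Hypothetical token: existence verb, Nom_Loc order, mobile simplex Landmark
print(round(predict_peal(["Verb_Lemma=Existence", "WO_LocNom=Nom_Loc",
                          "LM_Mobility=Mobile"]), 2))    # 0.87 -> peal
# Hypothetical token: compound Landmark, other factors at reference levels
print(round(predict_peal(["LM_LexComp=Comp"]), 2))       # 0.04 -> adessive
```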
Adding interactions and changing the order of the levels did not improve the model.
The Nagelkerke R², a pseudo R² statistic for logistic regression, gives us an idea of the
variability accounted for by the present model. As Glynn (2010: 259) points out, any figure above 0.3 is a sign of predictive power (for the model under discussion, the figure is just below 0.3, at 0.29). Another important score is the C-score (ROC), which is a scaled rank correlation between predicted and observed outcomes. Although not expressed in terms of probability of success, it can be interpreted as a rough indicator of such, where 1 represents perfect predictions and .5 pure chance. Although the C-score of 0.764 does not indicate a predictively strong model (.8 is typically taken as the threshold for a predictive model), it does indicate that the predictor variables explain a reasonable amount of the differences in use. The model was also checked for multicollinearity by calculating the variance inflation factors. Overdispersion does not appear to be a serious issue either. Although the indicators of the model’s explanatory power are all close to the lower boundary, it can be concluded that the model still has some statistical significance and explanatory power, and a cursory look at the results is warranted.
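The C-score and Somers’ Dxy are two views of the same rank correlation: Dxy = 2C − 1 (and indeed 2 × 0.764 − 1 ≈ 0.53, in line with the reported 0.527). A minimal sketch of how C is computed, on toy predictions:

```python
def c_index(probs, outcomes):
    """Concordance index: the share of (positive, negative) pairs in which the
    predicted probability is higher for the positive token; ties count 0.5."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    score = sum((1.0 if pp > pn else 0.5 if pp == pn else 0.0)
                for pp in pos for pn in neg)
    return score / (len(pos) * len(neg))

# Toy data: 1 = adpositional (peal) token, 0 = adessive token
c = c_index([0.9, 0.6, 0.6, 0.2], [1, 1, 0, 0])
print(c, 2 * c - 1)  # 0.875 0.75  (C and Somers' Dxy)
```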
The model includes the factors type of verb (Verb_Lemma), word order (WO_LocNom), lexical complexity of the Landmark phrase (LM_LexComp), word class of the Landmark (LM_WC) and mobility of the Landmark (LM_Mobility) and their different values or features – these predict the outcome of an example as either the adessive or the adpositional construction. In the list of the estimates of the coefficients, negative numbers predict the adessive construction and positive numbers the
adpositional construction. Lexical complexity of the Landmark is the most important
predictor for the adessive construction. This confirms the results of the monofactorial analysis, where we saw that when the Landmark word is a compound word, the
adessive construction is used. Important predictors of the adpositional construction
are the word class of the Landmark with the level of ‘pronoun’, verbs of existence, mobility of the Landmark with the level of ‘mobile’, and a word order in which the locative phrase follows the Trajector. Glynn (2010: 260) points out that
“[a]s a rule, any figure higher or lower than +/–1 is a relatively important predictor”.
Since almost all of the significant levels or values are around +/–1, it can be concluded
that in combination, the factors included in the model are fairly strong predictors. It
seems, therefore, that although the monofactorial results showed that for the present dataset a large number of factors were highly significant in determining the alternative use of the adessive and the adposition peal ‘on’, only five of these factors are predictive when we consider all of the factors together in combination, i.e. in a truly multifactorial situation, which everyday language use no doubt is.
One of the conclusions to be drawn from the fact that, in theory, the model could
have more explanatory power is that there may be other factors not included in the
present analysis that may play an even more significant role in determining the use of
these two constructions. For instance, all of the discourse-functional variables, such
as topic, register, preceding and subsequent mention of the adessive or adpositional
construction, etc., and variables like idiolect and dialect are absent from the present
analysis. Furthermore, the corpus data discussed in this study comes from a corpus
of fiction, where the language used is that of edited written texts. Incorporating data
from spoken or internet language (i.e. unedited texts) may provide very useful insights into the analysis and the model in general, which in turn would produce more
coherent results that would predict the data more accurately. Nevertheless, both the
monofactorial and multifactorial analyses of the present data also systematically indicate that there are significant differences between the Estonian adessive and the
adposition peal ‘on’, as can be predicted if we proceed from the premises of Cognitive
Grammar and Construction Grammar.
6. Conclusion
The present chapter looked at the alternation between the Estonian adessive case and
the adposition peal ‘on’ in the corpus of present-day written Estonian. Both the monofactorial and multifactorial analyses showed that the use of these two constructions
is determined by a variety of morphosyntactic and semantic variables. More specifically, the multifactorial analyses of the data confirmed the statistical influence of the
following factors: lexical complexity of Landmark, type of verb, word order, word
class of Landmark, and the mobility of Landmark. The adessive tends to be used when the Landmark is lexically more complex, when it is static, and when the locative phrase precedes the Trajector. The adposition peal ‘on’ tends to be used together with verbs of
existence and pronominal Landmarks, when the Landmark phrase is lexically simple,
when the Landmark is animate and mobile, and when the locative phrase follows the
Trajector.
Even though the results of the corpus analysis confirmed the prediction that there
are differences in the use of the adessive and the adposition peal ‘on’, we should be
careful about drawing any far-reaching conclusions because the model obtained as a
result of running the logistic regression analysis is only marginally powerful. This may
indicate that there are other more important factors that better predict the alternation
between the adessive and the adpositional construction which are absent from the
present analysis (e.g. discourse-functional factors, idiolect, dialect).
The corpus analysis results discussed in the present chapter do not facilitate a
very good comparison with the results obtained by Klavan et al. (2011), since the
latter included in their studies only semantic factors, thus excluding morphosyntactic ones. Nevertheless, as the logistic regression analysis showed, at least one of the
semantic factors was an important predictor – mobility of Landmark; static Landmarks predict the adessive construction and mobile Landmarks the adpositional construction. Klavan et al. (2011) showed, in turn, that with locations as Landmarks the
adessive is used, and with objects as Landmarks the adposition peal ‘on’ is used; since
locations are predominantly static and objects mobile, there is converging evidence
that the type of Landmark does influence the alternative use of the adessive and the
adposition peal ‘on’. It is hoped that the chapter succeeded in demonstrating the necessity and utility of using a combination of methodologies in studying grammatical
(and other) variation phenomena.
References
Arnold, J. E., Wasow, T., Losongco, A., & Ginstrom, R. (2000). Heaviness vs. newness: The
effects of complexity and information structure on constituent ordering. Language, 76(1),
28–55.
Baayen, H. (2008). Analyzing linguistic data: A practical introduction to statistics using R.
Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511801686
Balanced Corpus of Estonian (2008). Retrieved from <http://www.cl.ut.ee/korpused/grammatikaliides>
Bartens, R. (1978). Synteettiset ja analyyttiset rakenteet lapin paikanilmauksissa [Suomalais-ugrilaisen Seuran toimituksia 166]. Helsinki: Suomalais-Ugrilainen Seura.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, H. (2007). Predicting the dative alternation. In
G. Bouma, I. Kraemer, & J. Zwarts (Eds.), Cognitive foundations of interpretation (pp. 69–
94). Amsterdam: Royal Netherlands Academy of Science.
Bresnan, J., & Ford, M. (2010). Predicting syntax: Processing dative constructions in American
and Australian varieties of English. Language, 86(1), 168–213. DOI: 10.1353/lan.0.0189
Cappelle, B. (2009). Contextual cues for particle placement: Multiplicity, motivation, modelling. In A. Bergs, & G. Diewald (Eds.), Context in construction grammar (pp. 145–191).
Amsterdam & Philadelphia: John Benjamins.
Carlson, L., & Van der Zee, E. (2005). Functional features in language and space: Insights from
perception, categorization, and development. Oxford: Oxford University Press.
Comrie, B. (1986). Markedness, grammar, people, and the world. In F. R. Eckman, E. A.
Moravcsik, & J. R. Wirth (Eds.), Markedness (pp. 85–106). New York: Plenum.
DOI: 10.1007/978-1-4757-5718-7_6
Cooper, W. E., & Ross, J. R. (1975). World order. In R. E. Grossman, J. L. San, & T. J. Vance
(Eds.), Chicago linguistic society: Papers from the parasession on functionalism (pp. 63–
111). Chicago: Chicago Linguistic Society.
Coventry, K. R., & Garrod, S. C. (2004). Saying, seeing, and acting: The psychological semantics
of spatial prepositions. New York: Psychology Press.
de Vega, M., Rodrigo, M. J., Ato, M., Dehn, D. M., & Barquero, B. (2002). How nouns and
prepositions fit together: An exploration of the semantics of locative sentences. Discourse
Processes, 34(2), 117–143. DOI: 10.1207/S15326950DP3402_1
Erelt, M., Kasik, R., Metslang, H., Rajandi, H., Ross, K., Saari, H., Tael, K., & Vare, S. (1993).
Eesti keele grammatika II: Süntaks [The grammar of Estonian II: Syntax]. Tallinn: Eesti
Teaduste Akadeemia Keele ja Kirjanduse Instituut.
Erelt, M., Kasik, R., Metslang, H., Rajandi, H., Ross, K., Saari, H., Tael, K., & Vare, S. (1995).
Eesti keele grammatika I: Morfoloogia [The grammar of Estonian I: Morphology]. Tallinn:
Eesti Teaduste Akadeemia Eesti Keele Instituut.
Erelt, M., Erelt, T., & Ross, K. (2007). Eesti keele käsiraamat [Handbook of Estonian]. Tallinn:
Eesti Keele Sihtasutus.
Feist, M., & Gentner, D. (2003). Factors involved in the use of in and on. In R. Alterman &
D. Kirsh (Eds.), Proceedings of the twenty-fifth annual meeting of the Cognitive Science Society (pp. 390–395). Boston MA: Cognitive Science Society.
Glynn, D. (2007). Mapping meaning: Toward a usage-based methodology in cognitive semantics. Unpublished PhD thesis, Katholieke Universiteit Leuven.
Glynn, D. (2010). Testing the hypothesis: Objectivity and verification in usage-based cognitive
semantics. In D. Glynn, & K. Fischer (Eds.), Quantitative cognitive semantics: Corpus-driven approaches (pp. 239–270). Berlin: Mouton de Gruyter. DOI: 10.1515/9783110226423
Goldberg, A. (1995). Constructions: A construction grammar approach to argument structure.
Chicago: University of Chicago Press.
Goldberg, A. (2006). Constructions at work: The nature of generalization in language. Oxford:
Oxford University Press.
Greenberg, J. (1966). Language universals, with special reference to feature hierarchies. The
Hague: Mouton de Gruyter.
Gries, St. Th. (2003). Grammatical variation in English: A question of ‘structure vs. function’?
In G. Rohdenburg, & B. Mondorf (Eds.), Determinants of grammatical variation in English
(pp. 155–173). Berlin: Mouton de Gruyter.
Hagège, C. (2010). Adpositions. Oxford: Oxford University Press.
DOI: 10.1093/acprof:oso/9780199575008.001.0001
Hawkins, J. A. (1994). A performance theory of order and constituency. Cambridge: Cambridge
University Press.
Herskovits, A. (1986). Language and spatial cognition: An interdisciplinary study of the prepositions in English. Cambridge: Cambridge University Press.
Klavan, J., Kesküla, K., & Ojava, L. (2011). Synonymy in grammar: the Estonian adessive case
and the adposition peal ‘on’. In S. Kittilä, K. Västi, & J. Ylikoski (Eds.), Studies on case, animacy and semantic roles (pp. 113–134). Amsterdam: John Benjamins.
Langacker, R. W. (1987). Foundations of Cognitive Grammar. Volume I: Theoretical prerequisites.
Stanford: Stanford University Press.
Langacker, R. W. (2008). Cognitive Grammar: A basic introduction. Oxford: Oxford University
Press. DOI: 10.1093/acprof:oso/9780195331967.001.0001
Lestrade, S. (2010). The space of case. Unpublished PhD dissertation, Radboud University
Nijmegen.
Levin, B. (1993). English verb classes and alternations: A preliminary investigation. Chicago:
University of Chicago Press.
Luraghi, S. (1991). Paradigm size, possible syncretism, and the use of adpositions with cases
in flective languages. In F. Plank (Ed.), Paradigms: The economy of inflection (pp. 57–74).
Berlin: Mouton de Gruyter.
Mondorf, B. (2003). Support for more-support. In G. Rohdenburg, & B. Mondorf (Eds.), Determinants of grammatical variation in English (pp. 251–304). Berlin: Mouton de Gruyter.
DOI: 10.1515/9783110900019
Morphologically Disambiguated Corpus (2010). Retrieved from <http://www.cl.ut.ee/korpused/morfkorpus/>
Ojutkangas, K. (2008). Mihin suomessa tarvitaan sisä-grammeja? Virittäjä, 112(3), 382–400.
Palmeos, P. (1985). Eesti keele grammatika II: Kaassõna [The grammar of Estonian II: Adposition]. Tartu: TRÜ trükikoda.
Rannat, R. (1991). Noomeni sünteetiliste ja analüütiliste vormide kasutus [The use of the synthetic and analytic forms of the noun]. Unpublished BA dissertation, University of Tartu.
Rosenbach, A. (2003). Aspects of iconicity and economy in the choice between the s-genitive
and the of-genitive in English. In G. Rohdenburg, & B. Mondorf (Eds.), Determinants
of grammatical variation in English (pp. 379–411). Berlin: Mouton de Gruyter.
Salm, S. (2010). Kaassõnade ‘sees’ ja ‘peal’ ning vastavate kohakäänete kasutust mõjutavad tegurid [The factors influencing the use of Estonian adpositions sees ‘in’ and peal ‘on’ and the
corresponding locative cases]. Unpublished BA dissertation, University of Tartu.
Szmrecsanyi, B. (2010). The English genitive alternation in a cognitive sociolinguistic perspective. In D. Geeraerts, G. Kristiansen, & Y. Peirsman (Eds.), Advances in cognitive sociolinguistics (pp. 141–166). Berlin & New York: Mouton de Gruyter.
Talmy, L. (1983). How language structures space. In H. Pick, & L. P. Acredolo (Eds.), Spatial
orientation: Theory, research and application (pp. 225–282). New York: Plenum Press.
DOI: 10.1007/978-1-4615-9325-6_11
Vainik, E. (1995). Eesti keele väliskohakäänete semantika kognitiivse grammatika vaatenurgast [The semantics of Estonian external locative cases from the perspective of Cognitive
Grammar]. Tallinn: Eesti Keele Instituut.
Vandeloise, C. (1991). Spatial prepositions: A case study from French. Chicago: University of
Chicago Press.
Wasow, T. (1997). Remarks on grammatical weight. Language Variation and Change, 9(1), 81–
105. DOI: 10.1017/S0954394500001800
Wulff, S. (2003). A multifactorial corpus analysis of adjective order in English. International
Journal of Corpus Linguistics, 8(2), 245–82. DOI: 10.1075/ijcl.8.2.04wul
A diachronic corpus-based multivariate
analysis of “I think that” vs. “I think zero”
Christopher Shank, Koen Plevoets, and Hubert Cuyckens
Bangor University / University College Ghent / University of Leuven
This corpus-driven study seeks to explain the choice between the zero complement and the that complement constructions, when occurring with the mental
state predicate think. Previous studies have identified a range of factors that are
argued to explain the alternation patterns. Such studies have also proposed that
there is a diachronic drift towards zero complementation. Based on a sample
of 9,720 think tokens from both spoken and written corpora, dating from between 1560 and 2012, we test the hypothesis of diachronic change and the effect of eleven
proposed factors on the constructional alternation. Using logistic regression,
we demonstrate that, contrary to previous studies, there is in fact a diachronic
decrease in zero complementation. Moreover, the study also demonstrates the
importance of understanding the interaction of the various factors that explain
the near-synonymous relation, including, especially, between the spoken and
written modes.
Keywords: complementation, logistic regression, mental state verb,
near-synonymy, that/zero alternation
1. Introduction
This chapter uses a corpus-based approach in conjunction with logistic regression analysis to understand the motivation behind the diachronically varied alternation between the near-synonymous zero and that complementizer constructions with the verb think. The analysis examines the period from 1560 to 2012, in both written and spoken genres, as exemplified below.
(1) I think that Powder is a vile bragger, he doth nothing but cracke.
(CED, 1560–1673)
(2) I think you can marry non but me; seinge we are sworne to be true.
(CED, 1560–1673)
280 Christopher Shank, Koen Plevoets, and Hubert Cuyckens
In previous studies, it has been suggested that this alternation is being lost and that the zero form is generalizing to replace the that form (Rissanen 1991; Thompson and Mulac 1991; Palander-Collin 1999). The current chapter tests this hypothesis by means of a stepwise logistic regression analysis of 5,801 tokens of think, the most frequently used complement-taking verb of cognition, spanning the period from 1560 to 2012. The literature has also put forward a number of motivating factors promoting the zero form. Logistic regression determines the importance of different factors by treating the alternation as a choice that it attempts to predict from the behaviour of the factors. Our regression model tests whether the factors proposed in the literature do indeed predict the zero form, how these individual factors affect one another's predictive power when combined, both synchronically and diachronically, and finally how well they predict the zero form over time. By determining the interaction of time with each of the structural conditioning factors, this study adds an innovative diachronic perspective to existing research into the zero/that alternation.
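The analyses reported in this chapter were run in R; purely to illustrate the principle of treating the alternation as a binary choice predicted from the behaviour of factors, the following self-contained Python sketch fits a logistic regression by gradient ascent to invented data. The factor names, effect sizes, and counts below are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented data: each observation pairs two binary factors
# (spoken mode? extra material in the matrix clause?) with an outcome
# (1 = zero complementizer, 0 = that). The generating process is
# hypothetical: spoken mode favours zero, extra matrix material favours that.
random.seed(42)
data = []
for _ in range(400):
    spoken = 1.0 if random.random() < 0.5 else 0.0
    extra = 1.0 if random.random() < 0.5 else 0.0
    p_zero = sigmoid(1.0 + 1.5 * spoken - 1.5 * extra)
    data.append((spoken, extra, 1 if random.random() < p_zero else 0))

# Fit an intercept and two log-odds coefficients by gradient ascent on the
# log-likelihood: the model attempts to predict the that/zero choice
# from the behaviour of the factors.
w = [0.0, 0.0, 0.0]
for _ in range(2000):
    grad = [0.0, 0.0, 0.0]
    for spoken, extra, y in data:
        err = y - sigmoid(w[0] + w[1] * spoken + w[2] * extra)
        grad[0] += err
        grad[1] += err * spoken
        grad[2] += err * extra
    w = [w[j] + 0.005 * grad[j] for j in range(3)]

# A positive coefficient raises the predicted odds of zero; a negative one
# lowers them, which is how regression estimates of this kind are read.
print([round(c, 2) for c in w])
```

The recovered coefficients have the signs built into the toy generating process, which is precisely the sense in which a fitted model "tests" whether a proposed factor predicts the zero form.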
Logistic regression is an established technique for investigating near-synonymous constructions. Recent applications of the technique to questions of constructional near-synonymy include Heylen (2005), Bresnan et al. (2006), Grondelaers et al. (2007, 2008), Divjak (2010), Glynn (2010), Speelman and Geeraerts (2010), Klavan (this volume), Levshina (this volume), and Speelman (this volume), inter alia. This study extends the principle to diachronic research. Although the analysis is restricted to formal usage features, we believe that these formal characteristics of use are sufficient to distinguish the two semantically highly similar forms, and we hope that the differences between them can be charted over time.
We begin with a review of the literature dealing with the that/zero alternation in order to characterize the construction under investigation and to review the factors that have previously been claimed to condition the use of either that or zero complementation. In Section 3, our data and methodology are explained. After presenting our results in Section 4, we offer a conclusion in Section 5.
2. Review of the literature
The increase in structural/clausal flexibility that emerged in English starting in the late ME and EModE periods had a profound impact on many facets of early English syntax, especially as regards the fixation of SVO word order, clause combining, and complementation patterns. One of the more important shifts, especially for grammaticalization research, concerns the observed decrease in the frequency of the that-complementizer and a corresponding increase in the zero complementizer form (Rissanen 1991; Hopper and Traugott 1993; Finegan and Biber 1995; Palander-Collin 1999). The most often cited study is Rissanen (1991), who used the Helsinki Corpus to examine the development and use of the that/zero alternation in think, know, say, and tell constructions with object clauses in Late Middle and Early Modern English. His analysis revealed a steady increase in the deletion of that as an object clause link in think constructions, from 14% in the years 1350 to 1420 up to nearly 70% by the period 1640 to 1710. Other researchers, such as Finegan and Biber (1995), have expanded upon Rissanen's claims regarding the that/zero alternation by analyzing a similar data set taken from the ARCHER corpus covering the period from 1650 to 1990. This type of analysis has illustrated the role that variables such as genre play in the alternation by demonstrating how more formal genres, such as sermons, medical articles, and personal letters, often retained the that-complementizer form.
Initially, researchers used early corpus-based methodologies to document the diachronic increase in the zero complementizer with a number of different verbs (e.g. say, tell, think, know) from ME through PDE, and then turned their attention to factors in both the matrix and complement clauses that might be motivating the observed and ongoing structural/clausal changes. With regard to matrix clause features, the subject has drawn attention from a number of authors. Thompson and Mulac (1991) utilized chi-square tests to demonstrate that the higher relative frequency of a verb (e.g. think and guess) and the presence of I or you (versus other subject forms) as the subject of the matrix verb facilitate the presence of the zero complementizer. Their findings were complemented and built upon by Rissanen (1991) and Biber and Finegan (1995), who showed, via a simple proportional contrastive analysis, that the subject type (i.e. pronominal subjects), the person of the subject governing the object clause (especially 1st person), and the text type (especially its informality) also contributed to a decline in the frequency of the that-clause. Other studies showing that pronouns, particularly I or you, favour the use of zero include Elsness (1984), Tagliamonte and Smith (2005), and Torres Cacoullos and Walker (2009).
Another matrix clause factor that has received attention in the literature is the
absence of additional material in the matrix clause. It is believed that matrix clauses containing elements other than a subject and a (simplex) verb are more likely to
be followed by that. Such elements may be adverbials (Thompson and Mulac 1991a;
Torres Cacoullos and Walker 2009); negations or periphrastic forms in the verbal
morphology of the matrix clause predicate (Thompson and Mulac 1991a; Torres
Cacoullos and Walker 2009).
The presence of intervening material between matrix and complement has been
widely discussed as a factor favouring that (Finegan and Biber 1985; Rissanen 1991;
Rohdenburg 1996; Tagliamonte and Smith 2005; Torres Cacoullos and Walker 2009).
Adjacency of matrix and complement clause is believed to minimize syntactic and
cognitive complexity (Torres Cacoullos and Walker 2009). Besides the risk of ambiguity, which Rohdenburg (1996: 160) regards as a special type of cognitive complexity, the presence of intervening material has been related to a heavier cognitive processing load. In the words of Rohdenburg (1996: 161), “any elements capable of delaying the processing of the object clause and thus the overall sentence structure favour the use of an explicit signal of subordination”.
Another factor which has received attention in the literature on zero/that alternation is the subject of the complement clause. It has been suggested that pronominal
subjects as opposed to full NPs favour the use of zero (Elsness 1984; Finegan and Biber
1985; Rissanen 1991; Thompson and Mulac 1991a; Rohdenburg 1996; Tagliamonte
and Smith 2005; Torres Cacoullos and Walker 2009).
(3) Bill, I understand you have a special guest with you. (COCA)
(4) Well, I’m not, because I understand that most of his girlfriends have either been, you know, like the hooker or porn star types. (COCA)
The high discourse topicality of pronouns has been proposed as an explanatory principle (Thompson and Mulac 1991a: 248), as has Rohdenburg’s (1996: 151) complexity principle, which entails that “in the case of more or less explicit grammatical
options the more explicit one(s) will tend to be favoured in cognitively more complex
environments”. While Elsness (1984) regards I and you as particularly conducive to
zero complementation, Torres Cacoullos and Walker’s (2009: 28) multivariate study results in the following ordering of subjects from least to most favourable to that: it/there < I < other pronoun < NP. Elsness (1984) adds that short NPs and NPs with definite or unique reference are more likely to select the zero variant than longer and indefinite NPs. In Kearns (2007a: 494), first and second person subjects (i.e. I, you but also we) are compared to third person subjects, but identical rates of zero and that are found for both data sets. Kearns (2007a: 493; 2007b: 304) also examines the length
of the complement clause subject as a possible factor, operationalizing it in terms of
a three-way distinction between pronouns, short NPs (one or two words) and long
NPs (three or more words). The study reveals significant differences, including one
between short and long NPs.
Finally, other factors that have been shown to increase the frequency of the zero form include an “appropriate light heavy weight distribution pattern in the matrix and complement clause”, an “anaphoric relationship or givenness of the complement clause” (summarized in Kaltenböck 2004: 52), coreferentiality of either tense or person between the matrix and complement clauses, and the absence of the harmony of polarity between the matrix and complement clauses (Torres Cacoullos and Walker 2009). A considerable amount of the corpus-based research into the loss of the complementizer with the matrix verb think, while informative and clearly important, has, however, been inherently limited in terms of actual
explanatory or predictive power due to what we believe are a number of underlying methodological issues. For example, the seminal and often cited work by Rissanen (1991) and Finegan and Biber (1995) on that-deletion utilized the Helsinki Corpus, which covers a period ranging from c. 730 to 1710 and contains only 1.5 million words, a small corpus by today’s standards. The sample sizes that these authors worked with were often fewer than 30 tokens per period, which severely limited the generalizability of their results. An example of this type of limitation is presented in Table 1.

Table 1. Zero and that in Early Modern English: Subject types (Rissanen 1991: 281)

          Model 1 (1500–1570)              Model 3 (1640–1710)
          Pronoun        other             Pronoun        other
          zero   that    zero   that      zero   that    zero   that
Say        37     47       7     33         80      8      22     22
Tell        6     13       2      7         47     25       9     25
Know       18     12       5      5         22     13       7      4
Think      16      7       6      6         48      2      19      9
Total      77     79      20     51        197     48      57     60
In addition to small sample sizes, a number of the individual factors deemed predictive of the zero form were developed by contrasting and comparing the simple percentages of occurrence of a feature or variable in the resulting data sets, a high percentage of occurrence being assumed to predict the presence of the zero complementizer. This type of methodological approach may reveal general structural trends and patterns, but it gives no substantive insight into how strongly a given factor positively or negatively influences the that/zero deletion process. Nor does it allow any valid inferences to be made about whether a particular factor remains a significant predictor over time.
More recent studies have benefited from larger written and spoken corpora (e.g. the COBUILD, Brown, and Santa Barbara corpora), which have allowed larger samples to be extracted and analyzed. These improvements have also coincided with the incorporation of statistical techniques such as chi-square testing into research designs and methodologies. The results from these newer studies, however, are limited by the fact that a chi-square test only reveals whether a relationship exists between two variables (i.e. between a structurally predictive variable and the presence of the zero complementizer form in a given sentence). It does not indicate which specific outcome or which diachronic direction is being predicted: the variable could, contra expectations, actually be predicting the that complementizer form.
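This limitation can be made concrete with a small worked example; the 2×2 counts below are invented purely for illustration. The chi-square statistic registers only that an association exists, while the odds ratio of the same table shows which outcome the variable favours.

```python
# Invented 2x2 table: rows = matrix subject is I/you vs other,
# columns = zero vs that complementizer.
table = [[90, 30],   # I/you subject: zero, that
         [40, 60]]   # other subject: zero, that

row_totals = [sum(r) for r in table]
col_totals = [table[0][j] + table[1][j] for j in range(2)]
n = sum(row_totals)

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (table[i][j] - expected) ** 2 / expected

# The chi-square value says only *whether* the variables are associated;
# the odds ratio says in *which direction*: > 1 means I/you favours zero.
odds_ratio = (table[0][0] * table[1][1]) / (table[0][1] * table[1][0])

print(round(chi2, 2), round(odds_ratio, 2))
```

With these invented counts the chi-square is large (about 27.6, far beyond the 3.84 threshold for one degree of freedom), but only the odds ratio of 4.5 tells us that the I/you rows favour zero rather than that.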
In addition, the results presented in synchronically oriented studies are often used to make inferences about diachronic patterns and the effects of specific motivating factors; factors which were initially proposed or hypothesized by looking at small sample sizes (n < 30) and simply comparing the occurrence of a particular factor within a data set. It is with these methodological limitations and concerns in mind that we have designed our study. To address these problems, our study utilizes an empirically motivated framework, a range of large diachronic corpora, and an appropriate statistical technique (logistic regression analysis).
3. Data and methods of the current study
Our analysis was based on tokens retrieved from the written and spoken corpora listed in Tables 2 and 3.
The WordSmith concordance program was first used to identify the total number of inflected forms of think (i.e. think, thinks, thinking and thought) in both the written and spoken corpora for each period between 1560 and 2012. These results were then used to calculate the overall percentage of each inflected form relative to the others within the different periods. The percentages were then applied to the extracted subsets in order to ensure that the subsets would be proportionally similar, in terms of inflected forms, to the larger corpora from which they were taken. This two-step process resulted in
Table 2. Written corpora

Sub-period of written English   Time span   Corpora                                        Number of words
Early Modern English (EModE)    1560–1710   Innsbruck Corpus of Letters; CEECS I Corpus    2,848,314
                                            (1560 onward); CEECS II Corpus; Corpus of
                                            English Dialogues (CED); Corpus of Early
                                            Modern English Texts (CMET); Lampeter Corpus
                                            (portion up to 1710)
Late Modern English (LModE)     1710–1920   Corpus of Late Modern English Texts, Extended  15,413,159
                                            Version (CLMETEV); Lampeter Corpus (portion
                                            from 1710 onward)
Present-Day English (PDE)       1920–2009   The Time Corpus (Time); The Corpus of          500,000,000
                                            Contemporary American English, written
                                            component (COCA)
Table 3. Spoken corpora

Sub-period of spoken English    Time span   Corpora                                        Number of words
Early Modern English (EModE)    1560–1710   Corpus of English Dialogues (CED); Old Bailey  980,320
                                            Corpus (OBC)
Late Modern English (LModE)     1710–1913   Old Bailey Corpus (OBC)                        113,253,011
Present-Day English (PDE)       1920–2012   The Corpus of Contemporary American English,   133,180,448
                                            spoken components (COCA); American National
                                            Corpus, spoken components (ANC); London-Lund
                                            Corpus (L-Lund); Alberta Corpus, 2010
                                            component (Alberta)
Table 4. Total number of tokens for think retrieved from the written and spoken corpora

Written data                    Spoken data
Date         Tokens             Date         Tokens
1560–1579    n = 100            1560–1579    n = 68
1580–1639    n = 638            1580–1639    n = 481
1640–1710    n = 1346           1640–1710    n = 451
1710–1780    n = 1440           1710–1780    n = 537
1780–1850    n = 1201           1780–1850    n = 556
1850–1920    n = 1297           1850–1913    n = 527
1920–1989    n = 280            1980–1993    n = 229
1990–2009    n = 317            1994–2012    n = 252
Total        n = 6619           Total        n = 3101
the following datasets for the verb think: n = 6,619 tokens from the written English corpora and n = 3,101 tokens from the spoken English corpora (see Table 4).
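The two-step proportional extraction described above can be sketched as follows; the counts and subset size are invented for illustration.

```python
# Invented counts of inflected forms of "think" in one period of a corpus.
corpus_counts = {"think": 6000, "thinks": 900, "thinking": 600, "thought": 2500}
total = sum(corpus_counts.values())

# Step 1: the share of each inflected form in the full corpus for that period.
proportions = {form: count / total for form, count in corpus_counts.items()}

# Step 2: apply those shares to the subset to be extracted, so the sample
# mirrors the corpus distribution of inflected forms for the period.
subset_size = 500
quota = {form: round(subset_size * p) for form, p in proportions.items()}

print(quota)
```

With these hypothetical counts the 500-token subset is drawn as 300 think, 45 thinks, 30 thinking, and 125 thought, matching the 60/9/6/25 percent split of the larger corpus.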
The full set of extracted sentences (n = 9,720) was analyzed and divided into those containing either a that-clause or a zero complementizer. The distribution of the remaining written (n = 2,217) and spoken (n = 3,584) sentences and the resulting data sets are presented in Table 5 and Table 6.
A comparison of the diachronic relative frequency patterns of the that versus zero
forms per million words with the verb think indicates that the frequency of the zero
form has remained relatively constant vis-à-vis the that-complementizer from 1560
to 2010 in both spoken and written genres. The zero form is clearly the more frequent
Table 5. think in written corpora. Distribution of that-clauses and zero complementizer clauses from EModE to PDE in written corpora

Period       think-that              think-zero
             n          N            n           N
1560–1579    21         214.00       17          173.24
1580–1639    18         59.23        133         437.65
1640–1710    65         174.51       200         558.27
1710–1780    79         123.19       290         535.29
1780–1850    103        151.66       316         545.23
1850–1920    101        175.47       359         680.69
1920–1989    40         109.44       204         561.92
1990–2009    24         106.20       247         912.90
Total        451                     1766

n: absolute frequency; N: normalized frequency per million
Table 6. think in spoken corpora. Distribution of that-clauses and zero complementizer clauses from EModE to PDE in spoken English

Period       think-that              think-zero
             n          N            n           N
1560–1579    8          92.97        28          324.78
1580–1639    29         86.37        116         345.48
1640–1710    10         23.75        212         447.47
1710–1780    22         45.64        412         854.10
1780–1850    12         26.09        439         938.68
1850–1913    16         47.50        418         1305.45
1980–1993    97         449.18       857         3152.25
1994–2012    129        471.64       779         3139.33
Total        323                     3261

n: absolute frequency; N: normalized frequency per million
form from 1560 to 2012, and this comports with the previous literature on think and its claims regarding diachronic that/zero variation patterns.
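The normalized frequencies (N) in Tables 5 and 6 are rates per million words of running text, which is what makes subcorpora of very different sizes comparable. The computation is simply the token count divided by the subcorpus size, scaled to one million; the numbers below are hypothetical.

```python
def per_million(n_tokens, corpus_words):
    """Normalized frequency N: occurrences per million words of running text."""
    return n_tokens / corpus_words * 1_000_000

# Hypothetical example: 1,400 zero tokens in a 2.5-million-word subcorpus.
print(round(per_million(1400, 2_500_000), 2))
```

Run in reverse, the same relation lets one gauge subcorpus sizes from a table: for instance, the 247 written zero tokens with N = 912.90 in Table 5 imply a 1990–2009 written subcorpus of roughly 270,000 words.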
The (n = 2217) written and (n = 3584) spoken sentences containing either a that
or zero complementizer clause were then coded for 26 features within three categories: corpus information, matrix clause features and complement clause features. The
features included information such as the time period of the corpus (e.g. 1710–1780),
the inflected form of the token and the full context in which it appeared. The matrix
Figure 1. Think in written data – that versus zero distribution per million words (plotting the normalized frequencies in Table 5)
Figure 2. Think in spoken data – that versus zero distribution per million words (plotting the normalized frequencies in Table 6)
and complement clauses of each extracted token were also coded for features such as person, tense, polarity, the length of the subject (pronoun / np-short, 1–2 words / np-long, 3+ words), and coreferentiality (or lack thereof). In addition, the presence (or absence) of additional elements within the matrix clause (elements between the subject and the matrix verb) was noted, along with intervening elements (between the matrix clause and the complementizer) and the location of those intervening elements (either before the complementizer, or after it and before the complement clause subject).
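One way to picture the resulting annotation is as one record per token. The feature names below are illustrative stand-ins, not the authors' actual labels.

```python
# A hypothetical coded token: one record of the kind of annotation described
# above. All field names are invented for illustration.
token = {
    "period": "1710-1780",
    "mode": "spoken",
    "matrix_subject": "I",
    "matrix_internal_elements": False,  # nothing between matrix subject and verb
    "intervening_elements": False,      # matrix and complement clauses adjacent
    "cc_subject_length": "pronoun",     # pronoun / np-short (1-2 words) / np-long (3+)
    "tense_coreferential": True,
    "polarity_harmonic": True,
    "complementizer": "zero",           # the outcome to be modelled
}

# Tabulating a factor against the outcome is then a simple filter over tokens:
tokens = [token]
n_zero_with_i_or_you = sum(
    1 for t in tokens
    if t["complementizer"] == "zero" and t["matrix_subject"] in ("I", "you")
)
print(n_zero_with_i_or_you)
```

Coding every token into such records is what allows the alternation to be handed to a regression model as a table of predictors and one binary response.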
Once the coding process was completed, the data were submitted to the multiple
logistic regression analysis modelled with the factors that have been claimed in the
Table 7. Factors which favour the presence of the zero complementizer (summarized in Kaltenböck 2004: 52)
1. Matrix clause subjects are either I or you. (Elsness, 1984; Thompson and Mulac 1991;
Tagliamonte and Smith 2005; Kearns 2007a)
2. The absence of extra elements in the matrix clause (viz. auxiliaries, indirect objects,
adverbials) which reduce the ability of the matrix to function as an epistemic phrase by
additional semantic content (cf. Thompson and Mulac 1991; Rohdenburg 1996)
3. The absence of intervening elements between the matrix and complement clause, making
explicit boundary marking (disambiguation) with that unnecessary (Rissanen 1991;
Tagliamonte and Smith 2005)
4. Pronominal subject of the complement clause, co-referential with the matrix clause subject (Elsness 1984; Torres Cacoullos and Walker 2009)
5. The length of the matrix clause subject (pronoun > np-short > np-long) (Thompson and
Mulac 1991; Rissanen, 1991)
6. The length of the complement clause subject (it > pronoun > np-short > np-long)
(Thompson and Mulac 1991; Rissanen, 1991; Rohdenburg 1996)
7. Coreferentiality of tense between the matrix and complement clauses (Torres Cacoullos
and Walker 2009)
8. Coreferentiality of polarity between the matrix and complement clauses (Torres
Cacoullos and Walker 2009)
literature (see Section 2) to favour the presence of the zero complementizer. Table 7
lists the factors that were included in our analysis.1
The statistical technique for our analysis is stepwise logistic regression (run with the stepAIC function in the R library MASS).2 The stepwise selection procedure ran in both directions. The maximal model contained all main effects
1. In fact, we ran multiple analyses which also incorporated factors such as a pronominal subject (versus NP) in the matrix clause (Elsness 1984), a pronominal subject (versus NP) in the complement clause (Thompson and Mulac 1991; Rissanen 1991), and a 1st or 2nd person complement clause subject (Thompson and Mulac 1991). All factors cited in Section 3 were first analyzed in separate models and then in various small subsets of factors, in order to get a better understanding of their relative potential in explaining the that/zero alternation. These exploratory steps eventually led to the factors in Table 7. We would like to thank Dylan Glynn for his helpful comments in this respect.
2. The logical extension for our many predictors would be to fit a mixed-effects model. However, none of the factors in our data is a random effect, so we opted for a straightforward logistic regression, as explained in Speelman (this volume). The stepwise selection procedure then serves to identify the important predictors. The general outline of this methodology was suggested to us by Stefan Th. Gries, to whom we express our gratitude.
plus the two-way interactions with mode and period (together with the interaction
between period and mode itself).3
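Stepwise selection of this kind compares candidate models by Akaike's Information Criterion, AIC = 2k - 2 log L, adding or dropping terms so long as the AIC falls. A minimal Python sketch of one such comparison follows; the log-likelihoods and parameter counts are invented.

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2*logL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidate models from one step of a both-ways search.
candidates = {
    "main effects only":         aic(-1850.0, 12),
    "+ interaction with mode":   aic(-1820.0, 16),
    "+ interaction with period": aic(-1845.0, 16),
}
best = min(candidates, key=candidates.get)
print(best, round(candidates[best], 1))
```

The penalty term 2k is what stops the search from simply accumulating interactions: in the sketch, the mode interaction improves the log-likelihood enough to pay for its four extra parameters, while the period interaction does not.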
The resulting model after stepwise selection contained 9 main effects and 8 interactions which predict the zero form; see Table 8 for the model summary.4 The model fits the observed variation well, with a C-statistic (the area under the ROC curve) just above the threshold of 80%. The explained variation (as expressed in Nagelkerke's pseudo-R²) is rather modest at 27%, just below the baseline of 30% (see Speelman, this volume, for how to interpret the model diagnostics of the C-statistic and Nagelkerke's R²).5 The next section discusses the coefficients.
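The C-statistic can be read as the probability that a randomly chosen zero token receives a higher predicted probability of zero than a randomly chosen that token, with ties counting one half. A small Python sketch with invented predictions:

```python
def c_statistic(pairs):
    """Concordance index: share of (zero, that) pairs in which the model
    gives the zero token the higher predicted probability (ties count 0.5)."""
    zero_probs = [p for p, y in pairs if y == 1]
    that_probs = [p for p, y in pairs if y == 0]
    concordant = 0.0
    for pz in zero_probs:
        for pt in that_probs:
            if pz > pt:
                concordant += 1.0
            elif pz == pt:
                concordant += 0.5
    return concordant / (len(zero_probs) * len(that_probs))

# Invented (predicted probability of zero, observed outcome) pairs.
preds = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]
print(c_statistic(preds))  # 8 of the 9 zero/that pairs are concordant
```

A value of 0.5 would mean the model ranks tokens no better than chance, while the chapter's reported 0.803 means the model orders roughly four out of five such pairs correctly.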
4. Discussion of the results
Because of the rather complex structure of our regression model (with 9 main effects and 8 interactions), we interpret the results in Table 8 with the visual aid of so-called effect plots (obtained with the R library effects; see Gries 2013: 303). Furthermore, we divide the discussion into the main effects, the interactions with mode, and the interactions with period. In 4.1, we present the nine main effects (period, mode, matrix-internal elements, intervening elements between the matrix and complement clause, complement clause subject length, I or you as the matrix clause subject, harmony of tense, subject coreferentiality, and harmony of polarity) which predict the use of the zero form. In 4.2 we discuss the four statistically significant interactions with mode, viz. the interactions with intervening elements between the matrix and complement clauses, matrix-internal elements, subject coreferentiality, and the length of the complement clause subject. In 4.3, we finally offer the diachronic picture of the conditioning factors for zero use, i.e.
3. Mode is the distinction between written and spoken language. In the results of the statistical analysis, however, its label has been changed to TYPE.
4. The ANOVA type III tests revealed that the main effect of coreferentiality of tense (CC.T.
co.ref) and the two interactions of mode with matrix-internal elements (mat.int:TYPE)
and with the length of the complement clause subject (CC.length:TYPE) are “border-significant” (i.e. slightly higher than 0.05). This is reflected in Table 8 in the fact that some levels of
these factors are not significant or are also border-significant. As the (stepwise) selection of the
factors was based on Akaike’s Information Criterion and not on Likelihood-Ratio testing, the
discussion of the results in Section 4 will nevertheless cover all effects.
5. The model was also checked for multicollinearity and although TYPE and period revealed
variance inflation factors of 5.4 and 5.2, respectively, the rest of the predictors received scores
of approximately 4 or under. Given the complexity of the model, we feel the assumption of
orthogonality is met. See Speelman (this volume) for an explanation of the question of multicollinearity in logistic regression.
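For reference, a variance inflation factor is computed from the R² obtained when one predictor is regressed on all the others; the R² of 0.815 below is back-computed from the VIF of about 5.4 reported for TYPE, purely for illustration.

```python
def vif(r_squared):
    """Variance inflation factor for one predictor, given the R-squared of
    regressing that predictor on all the other predictors."""
    return 1.0 / (1.0 - r_squared)

# An R-squared of about 0.815 corresponds to the VIF of roughly 5.4
# reported for TYPE (the R-squared is a back-computed, illustrative value).
print(round(vif(0.815), 1))
```

A common rule of thumb treats VIF values above 5 or 10 as signs of problematic collinearity, which is why the authors flag TYPE and period but accept the remaining predictors at around 4 or under.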
Table 8. Stepwise regression analysis; 9 main effects and 8 two-way interactions

Coefficients:
                                      Estimate   Std. Error  z value   Pr(>|z|)
(Intercept)                            7.14770    0.85237      8.386   < 2e-16   ***
mat.int: mat int                      -1.12456    0.14464     -7.775   7.55e-15  ***
CC.length: Pronoun                    -2.79835    0.86310     -3.242   0.001186  **
CC.length: NP Short                   -3.99093    0.86946     -4.590   4.43e-06  ***
CC.length: NP Long                    -4.79996    0.95717     -5.015   5.31e-07  ***
interv: interv                        -2.55597    0.55110     -4.638   3.52e-06  ***
I.or.U: non                           -0.89387    0.09886     -9.042   < 2e-16   ***
Mode: written                         -2.26338    0.46837     -4.832   1.35e-06  ***
CC.Pol.co.ref: non                    -1.35818    0.39459     -3.442   0.000577  ***
CC.Pcoref: coref                       1.08250    0.28079      3.855   0.000116  ***
CC.T.co.ref: non                      -0.15270    0.08981     -1.700   0.089100  .
Period                                -0.54998    0.11807     -4.658   3.19e-06  ***
interv: interv * Mode: written        -1.38356    0.30941     -4.472   7.77e-06  ***
mat.int: mat int * Mode: written      -0.33652    0.18580     -1.811   0.070107  .
Mode: written * CC.Pcoref: coref      -1.38028    0.34021     -4.057   4.97e-05  ***
Mode: written * Period                 0.31019    0.05057      6.134   8.59e-10  ***
CC.Pol.co.ref: non * Period            0.29591    0.06973      4.243   2.20e-05  ***
CC.length: Pro * Period                0.33643    0.12078      2.786   0.005343  **
CC.length: NP short * Period           0.44723    0.12154      3.680   0.000234  ***
CC.length: NP long * Period            0.48836    0.13219      3.694   0.000220  ***
interv: interv * Period                0.17660    0.08285      2.132   0.033043  *
CC.length: Pro * Mode: written         0.78566    0.37388      2.101   0.035610  *
CC.length: NP Short * Mode: written    0.38367    0.36568      1.049   0.294098
CC.length: NP Long * Mode: written     0.16318    0.40981      0.398   0.690489

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Table 8. (continued)

Deviance Residuals:
Min -3.1176; 1Q 0.2271; Median 0.3528; 3Q 0.4883; Max 2.8074

Null deviance: 4557.8 on 5800 degrees of freedom
Residual deviance: 3607.8 on 5777 degrees of freedom
AIC: 3655.8; Model L.R.: 949.99; d.f.: 23; C: 0.803; R²: 0.278

Legend: The following coding abbreviations are used in the regression analysis: mat.int = absence of matrix-internal elements; CC.length = length of the complement clause subject; interv = lack of intervening elements between the matrix and complement clauses; I.or.U = subject pronoun I or you; CC.Pol.co.ref = polarity between the complement and matrix clauses; CC.Pcoref = coreferentiality of person between the complement and matrix clauses.
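The log-odds estimates in Table 8 can be translated into odds ratios by exponentiation. For instance, the coefficient of -1.12456 for matrix-internal elements corresponds to an odds ratio of about 0.32: the presence of extra matrix material cuts the odds of the zero form to roughly a third. The conversion is computed here in Python purely for illustration.

```python
import math

def odds_ratio(coef):
    """A logistic-regression coefficient is a change in log-odds;
    exponentiating it gives the multiplicative effect on the odds."""
    return math.exp(coef)

# Coefficient for matrix-internal elements, taken from Table 8.
print(round(odds_ratio(-1.12456), 2))
```

The same reading applies throughout the table: positive estimates (e.g. CC.Pcoref: coref at 1.08250) multiply the odds of zero by a factor greater than one, negative estimates by a factor below one.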
the interactions with period. It shows that there are significant changes across time
in the extent to which mode, the absence of intervening elements between the matrix
and complement clauses, length of the complement clause subject, and harmony of
polarity predict the use of zero.
4.1 Main effects
The first step in our process involved determining whether the factors proposed in the
literature indeed predict the zero complementizer form. The results from this initial
procedure are presented in Figures 3–6. As it gives the overall picture of the diachronic change, we begin by examining the main effect of time or ‘period’ as a predictor of
the zero form.
Main effect: Period
Figure 3a shows that there is a steady loss of the zero form from the earliest period of
1560 to the most recent period of 2012. The decrease is gradual but statistically significant. This finding is most noteworthy as it runs counter to previous claims that there
is a diachronic increase in zero complementation.
Main effect: Mode
The analysis of the effect of mode (i.e. spoken versus written language) upon the presence of the zero form reveals that the zero form occurs significantly more often in the spoken than in the written data (Figure 3b). This is consistent with claims in the literature that zero complementation is especially common in spoken language, and it confirms findings from previous studies (cf. Finegan and Biber 1985; Rissanen 1991).
Figure 3. Main effects – structural factors predicting the zero form (a. Period; b. Mode, labelled Type)
Although the difference between the main effects of spoken versus written data is slight, it will become apparent in 4.2, where we discuss the interactions of the other factors with mode, that the effects of conditioning factors may differ considerably depending on the mode.
Main effect: Absence of matrix internal elements
In Figure 4a, the analysis reveals that the absence of elements within the matrix clause
is a good predictor for the zero form; when the matrix clause does not contain any
material in addition to the subject and the verb, the probability of the zero form is
significantly greater compared to matrix clauses that do contain extra material.
Main effect: Absence of intervening elements
In Figure 4b, we see that the absence of intervening material between the matrix and
complement clause is an even stronger predictor for the zero form. When intervening
material is present, the zero complementizer rate drops to just above 60%. Thus, when there are elements separating the matrix clause from the complement clause, there is almost a 40% chance that that will be realized. This adds support to the validity of the complexity principle proposed in the literature (cf. Rohdenburg 1996).
Main effect: Length of the complement clause subject
1.0
Predicted probability of SubC = ’zero’
Predicted probability of SubC = ’zero’
In Figure 5a, we turn our attention to the complement clause subject, examining the
role that the weight of the subject (i.e. it, pronoun, NP-short and NP-long) plays in
predicting the presence of the zero form. The results in this plot confirm the order
that Torres Cacoullos and Walker (2009) previously arrived at; the subject pronoun it
is more likely to occur with the zero form than other pronouns, which are still better
[Figure 4. Main effects – structural factors predicting the zero form. Panel a: Matrix internal elements (abs vs. mat.int); Panel b: Intervening elements (abs vs. interv); y-axis: predicted probability of SubC = 'zero'.]
294 Christopher Shank, Koen Plevoets, and Hubert Cuyckens
[Figure 5. Main effects – structural factors predicting the zero form. Panel a: Length of the complement clause subject (it, pro, np-short, np-long; CC.length); Panel b: Subject I or you (I.or.U vs. non); y-axis: predicted probability of SubC = 'zero'.]
predictors of zero than short noun phrases and long noun phrases. This shows that
the shorter and referentially lighter complement clause subjects are better predictors
of the zero form.
Main effect: I or you
Turning our attention back to the main clause subject, we now examine the effect of
the subject pronouns I or you as a predictor (Figure 5b). As with complement clause
length, our results confirm previous literature (see Elsness 1984; Thompson and
Mulac 1991; Tagliamonte and Smith 2005; Kearns 2007; Torres Cacoullos and Walker
2009) that the 1st and 2nd persons are indeed better predictors than other subject forms (i.e. 3rd person and plural forms) for the presence of the zero form.
Main effect: Cotemporality
In Figure 6a, we examine the effect of cotemporality, i.e. whether a construction in
which the verbs of the matrix and complement clauses have the same tense is more
likely to be used with the zero complementizer. In the plot, there is not much difference between cotemporality and non-cotemporality, and Table 8 indeed revealed this effect to be only borderline significant. Nevertheless, the coefficient for non-cotemporality was negative, indicating that cotemporality tends to favour the zero form.
Main effect: Coreferentiality of person between matrix and complement clause
Figure 6b presents the effect of another type of ‘harmony’ between the matrix and
complement clause, viz. coreferentiality between the respective subjects. Structures
with coreferential subjects are slightly more likely to get a zero form than those with
non-coreferential subjects.
[Figure 6. Main effects – structural factors predicting the zero form. Panel a: Cotemporality; Panel b: Coreferentiality of person; Panel c: Harmony of polarity; y-axis: predicted probability of SubC = 'zero'.]
Main effect: Harmony of polarity between the matrix and complement clauses
The main effect of harmony of polarity between the matrix and complement clauses
(Figure 6c) is that it is the disharmonious patterns that significantly predict the zero
form. When there is harmony of polarity between the matrix and complement clauses, the zero form is slightly less likely to occur than with a mixed, i.e. disharmonious,
pattern. This refines Torres Cacoullos and Walker’s (2009) finding that harmony of
polarity is not a significant conditioning factor for the zero form.
Now that we have completed our discussion of the main effects, we look at the
interaction with the spoken or written mode.
4.2 Mode
In this section, we see that mode has an impact on the strength of other factors. Recall
that although there was a significant difference in the main effect between spoken and
written language, the difference was not that great. This section, however, reveals that
some factors may be better predictors for the zero form in one mode than in the other.
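In regression terms, such mode-dependent effects correspond to interaction terms: the effect of a factor on the log-odds of the zero form is allowed to differ between spoken and written data. A toy illustration (invented counts, not the study's data) of how an interaction shows up as a difference of differences on the log-odds scale:

```python
import math

# Invented cell counts (zero, that) for intervening material by mode,
# chosen to mimic the pattern described in the text.
cells = {
    ("spoken", "absent"):   (900, 100),
    ("spoken", "present"):  (650, 350),
    ("written", "absent"):  (800, 200),
    ("written", "present"): (350, 650),
}

def log_odds(zero, that):
    return math.log(zero / that)

# Effect of removing intervening material, per mode (log-odds difference).
effect_spoken = (log_odds(*cells[("spoken", "absent")])
                 - log_odds(*cells[("spoken", "present")]))
effect_written = (log_odds(*cells[("written", "absent")])
                  - log_odds(*cells[("written", "present")]))

# A non-zero interaction means the factor's strength depends on mode.
interaction = effect_written - effect_spoken
print(f"spoken effect: {effect_spoken:.2f}, "
      f"written effect: {effect_written:.2f}, interaction: {interaction:.2f}")
# → spoken effect: 1.58, written effect: 2.01, interaction: 0.43
```

In a fitted logistic regression this difference of differences is exactly what the coefficient of a factor × mode interaction term estimates, along with a confidence interval for it.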
Mode: Absence of intervening elements
Figure 7a allows us to compare the conditioning effect of intervening elements between the matrix clause and the complement clause in the spoken and written modes. Recall that the absence of intervening elements was a very good predictor overall. The interaction confirms this earlier finding; in both panels we observe a dramatic difference in complementizer use between the presence and absence of material. A notable difference, however, resides in the extent to which the presence of intervening material in the written mode disfavours the zero form. When there is intervening material in the written mode, we are much less likely to get the zero form than in the spoken mode, so much so that the explicit complementizer that in fact becomes more likely;
the zero rate drops to below 0.4. It may be that writers are guided more strongly by the complexity principle than speakers and feel the need to insert that to make clause boundaries clearer when intervening material risks impairing clarity.

[Figure 7. Mode and structural factors predicting the zero form. Panel a: Intervening elements (abs vs. interv); Panel b: Matrix internal elements (abs vs. mat.int); separate spoken and written panels; y-axis: predicted probability of SubC = 'zero'.]
Mode: Absence of matrix internal elements
Figure 7b presents the results for the effect of elements between the matrix clause subject and verb in the spoken and written modes respectively. We see that the absence of
matrix-internal elements is a good predictive factor for both spoken and written data.
The steepness of the plot lines shows that the difference in predictive power between the presence and absence of matrix-internal elements is comparable for both modes; however, in the written mode, the zero form is used less overall, as indicated by the lower points for both levels of the factor in the written panel.
Mode: Coreferentiality of person between the matrix and complement clause
The analysis of the effect of the coreferentiality of person between the matrix and
complement clause subjects reveals an interesting difference with regard to mode
(Figure 8a). Coreferentiality of person leads to higher levels of zero in the spoken
data. In the written data, confidence intervals show that the difference between coreferentiality and non-coreferentiality is not significant.
Mode: Length of the complement clause subject
The final factor that we examine in this section is the effect that the length of the
complement clause subject has as a predictor of the zero form relative to mode. As
before, the plot in Figure 8b shows that within the spoken data the following cline exists: it > pro > np-short > np-long. A comparison between the two modes shows that
short and long NPs tend more strongly towards that in the written mode than in the spoken mode.

[Figure 8. Mode and structural factors predicting the zero form. Panel a: Coreferentiality of person (coref vs. non; CC.Pcoref); Panel b: Length of the complement subject (it, pro, np-short, np-long; CC.length); separate spoken and written panels; y-axis: predicted probability of SubC = 'zero'.]

Overall, the length of the complement clause subject has a stronger effect
on written data than on spoken data. Again, the complexity principle, i.e. the need to
mark off clause boundaries, may motivate writers’ choice of the that-complementizer
as opposed to the zero form. In addition, the concern with clarity fostered by standardization and prescriptivism may also play a role.
We now turn to the final stage of our analysis and look at the effect of the structural factors across the eight time periods (i.e. 1560–2012). Thus, in the following
sections, we discuss the interactions with period which came out as significant.
4.3 Period
The interaction effects with period were significant for the following factors: mode, absence of intervening elements, complement clause length, and harmony of polarity between the matrix and complement clauses. This final step in the analysis
offers a diachronic perspective; it shows whether the import of a given factor becomes
stronger or weaker over time.
Period: Mode
Figure 9a shows the effect of mode over time. In the earliest periods, the zero form was far more prevalent in the spoken data than in the written data. Over time, however, the zero form has decreased in the spoken mode and increased in the written mode, so that in PDE the two modes are at the same predictive level. As Figure 9a shows, the endpoints in PDE for both modes are almost identical, which suggests that mode, in and of itself, is no longer a good or significant predictor of the zero form with these verbs.
[Figure 9. Period and structural factors predicting the zero form. Panel a: Mode (spoken vs. written) across periods 1–8; Panel b: Intervening elements (interv: abs vs. interv: interv) across periods 1–8; y-axis: predicted probability of SubC = 'zero'.]
Period: Absence of intervening elements
The analysis of the diachronic effect of the absence of intervening elements between
the matrix and complement clauses (Figure 9b) gives a result which confirms what
has been argued in the literature on that/zero variation, namely that the absence of
intervening elements is a strong predictor of the zero form. The results show that this effect weakens over time; however, it still remains quite robust relative to the presence of intervening elements. The values in the right panel suggest that intervening elements predict the explicit that-complementizer throughout all periods, although the effect gets weaker; this finding is somewhat less robust, however, as shown by the larger confidence intervals.
Period: Length of the complement clause subject
In Figure 10a, the analysis of the effect of the length of the complement clause subject over time shows a clear division between it and other pronouns on the one hand and NPs on the other: the former have been, and remain, the stronger predictors of the zero form, while the latter (i.e. NPs) are actually increasing in their predictive power for the zero form. Still, they have yet to reach the level of it or other pronouns. Furthermore, an examination of the start and endpoints for it and other pronouns shows that they are higher than those of NPs at every stage of their development and that it and other pronouns remain the stronger predictive factors in PDE.
Period: Harmony of polarity between the matrix and complement clauses
The final significant effect over time to be discussed in this section is the interaction
between harmony of polarity and period. Figure 10b shows that when there is harmony of polarity, there is a distinct tendency towards more that over time; harmony of polarity used to be a stronger predictor of the zero form than it is now. This trend is the opposite in the non-harmonious data; here, the level of zero use has actually increased over time.

[Figure 10. Period and structural factors predicting the zero form. Panel a: Length of the complement subject (CC.length: it, pro, np-short, np-long) across periods 1–8; Panel b: Harmony of polarity (CC.Pol.co.ref: coref vs. non) across periods 1–8; y-axis: predicted probability of SubC = 'zero'.]
5. Discussion
This study has shown that, contrary to claims and speculation in the literature to the effect that there has been an overall diachronic tendency towards more zero complementizer use at the expense of that-complementation, the most frequent complement-taking mental verb in Present-Day English, think, in fact exhibits a diachronic decrease in zero complementation and a concomitant increase in that use. This trend can be observed both as a main effect, when the data for this verb are aggregated, and when the interactions with mode (i.e. spoken and written data) are explored.
The rigorous methodological approach developed and utilized in this study, and the attention given to ensuring sufficiently large and representative sample sizes from each period, have also highlighted fundamental problems in previous work on this topic, which has often relied heavily upon descriptive statistics. As evidenced by our initial presentation of findings in Section 3, a reliance on descriptive statistics (often presented in the literature in conjunction with a chi-square test) can unintentionally obscure important interactions between factors or variables and fail to reveal the stability or robustness of diachronic trend lines or patterns.
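A toy example (entirely invented figures) illustrates the danger: an aggregate zero rate can remain flat across two periods even though the rate moves in opposite directions within each mode, a pattern that a single descriptive table or chi-square test on the aggregated data would never show.

```python
# Invented counts (zero, that) by period and mode. The aggregate zero
# rate is identical in both periods, yet the spoken rate falls while the
# written rate rises -- visible only once the data are disaggregated.
data = {
    ("early", "spoken"):  (720, 80),    # 90% zero
    ("early", "written"): (60, 140),    # 30% zero
    ("late",  "spoken"):  (300, 100),   # 75% zero
    ("late",  "written"): (480, 120),   # 80% zero
}

def rate(zero, that):
    return zero / (zero + that)

agg = {}
for period in ("early", "late"):
    z = sum(data[(period, m)][0] for m in ("spoken", "written"))
    n = sum(sum(data[(period, m)]) for m in ("spoken", "written"))
    agg[period] = z / n
    print(period,
          "aggregate:", round(agg[period], 2),
          "spoken:", round(rate(*data[(period, "spoken")]), 2),
          "written:", round(rate(*data[(period, "written")]), 2))
```

A multivariate model with mode, period, and their interaction as predictors recovers the two opposing trends that the aggregate rate conceals.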
From a descriptive perspective, it would appear that the zero form for think is robust, or at least remains consistent over time, and thus one could reasonably infer that the factors which have been proposed to facilitate the zero form are either
equally predictive or also remain significant over time. It is only when a methodology
such as the one used in this study is applied that the true role of the various factors
becomes apparent along with the diachronic robustness of the predicted or expected
trends and/or patterns vis-à-vis a dependent variable such as the presence of the zero
complementizer.
In addition to invalidating the long-standing assumption that this particular
complement-taking verb has diachronically developed towards higher levels of zero
complementation, this study also highlights the need to differentiate between individual factors when examining complementation patterns.
It became apparent, firstly, that the factors mentioned in the literature may differ considerably in the extent to which they actually predict zero use, as was revealed by the analysis of interactions between various factors and mode (i.e. spoken versus written data) or period (i.e. time).
The effect of intervening material between the matrix and complement clause is a case in point. As we observed in Figure 4, it was a strong predictor overall: when intervening material was present, the zero complementizer rate dropped to just above 60%. Its predictive power, however, was revealed to be much stronger in the written mode than in the spoken mode; when intervening material is present in the written mode, that is favoured. This effect was also shown to be quite robust and stable over time (i.e. period), with a slight decrease seen in PDE (Figure 9).
The significance of the length of the complement clause subject, however, was not limited to just one mode. This factor was significant in both the written and spoken data sets. Our results therefore suggest that the following cline may be present in terms of predicting the zero form:
it > pro > NP-short > NP-long
zero --------------------------------- that
This cline, however, when examined diachronically, also revealed a small but consistent decrease for both the it and pro subject forms over time towards PDE. This general
decline was seen with a great majority of the factors tested in this study.
Finally, the analysis of mode revealed that the zero form occurs overall significantly more often in the spoken than in the written data. Yet, once again, by
diachronically examining the effect of mode over time (i.e. period) we see that in
PDE this previously significant finding now essentially disappears. In the most recent
period, the trend lines indicate that both modes are equally predictive of the presence
of the zero form and that mode has lost any real significance as a predictive factor. We
believe that these results have shown the strength of the methodology and approach.
Having established the structural factors that lie behind the choice of the complementizer for think, and having demonstrated that this alternation is not merely a diachronic phenomenon, we see two important next steps. Firstly, the data set needs to be extended to include a larger set of mental state verb types. This may
reveal additional differences in the way the that/zero alternation has evolved with similar
verbs of cognition, as well as shedding more light on how the effect of a conditioning
factor may differ from verb to verb. Secondly, the addition of a range of semantic and
pragmatic features to the analysis may offer more direct insights into the differences
between the two forms. Although the manual annotation involved may necessitate
smaller samples, extending the analysis in this manner will further explain the role of
the alternation.
6. Conclusion
In this chapter we have tried to demonstrate the advantages of utilizing a rigorous
empirically oriented framework and statistical analysis when exploring the diachronic development of clause combining and complement clause patterns with the verb
think. Our approach has allowed us to test a range of motivating factors for actual
diachronic significance, to calculate the cumulative effects of these factors/variables
against each other and over time, and to determine what factors, if any, retain any
predictive power for the presence of the zero complementizer form.
In addition, our methodology has permitted us to identify the roles that the absence of intervening material between the matrix and complement clauses, the length of the complement clause subject, and mode and period play in predicting the presence of the zero form.
Finally, our use of a regression analysis has allowed us to assess the effect of factors, or lack thereof, across mode and over time. Our findings indicate that, contrary
to expectations and predictions, there is a steady statistically significant loss of the
zero form from the earliest period of 1560 to the most recent period of 2012. It is
based on these findings and our overall results that we believe the approach demonstrated in this chapter has real potential for helping us to understand the role or roles
that structural factors play in distinguishing near-synonymous forms in some aspects
of diachronic language change.
References
Aijmer, K. (1997). I think – an English modal particle. In T. Swan, & O. Jansen Westvik (Eds.),
Modality in Germanic languages: Historical and comparative perspectives (pp. 1–47). Berlin
& New York: Mouton de Gruyter. DOI: 10.1515/9783110889932.1
ANC = American National Corpus (2002). Linguistic Data Consortium, University of
Pennsylvania.
Bolinger, D. (1972). That’s that. (Janua linguarum. Series Minor, 155). The Hague: Mouton de
Gruyter.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, H. (2007). Predicting the dative alternation. In
G. Bouma, I. Kraemer, & J. Zwarts (Eds.), Cognitive foundations of interpretation (pp. 69–
94). Amsterdam: Royal Netherlands Academy of Science.
Brinton, L. J., & Traugott, E. C. (2005). Lexicalization and language change. (Research Surveys in
Linguistics.) Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511615962
BROWN = Francis, W. N., & Kucera, H. (1979). The Brown Corpus. Department of Linguistics,
Brown University.
Bybee, J. (2002). Cognitive processes in grammaticalization. In M. Tomasello (Ed.), The new
psychology of language Vol. II (pp. 145–167). New Jersey: Lawrence Erlbaum.
CEECS I & II = Corpus of Early English Correspondence Sampler (CEECS) <http://khnt.hit.
uib.no/icame/manuals/ceecs/>.
CEMET = Corpus of Early Modern English texts (Extended version). See De Smet (2005).
CLMETEV = Corpus of Late Modern English texts (Extended version). See De Smet (2005).
COCA = Davies, M. (2008). The Corpus of Contemporary American English (COCA): 1990–
present. Retrieved from <http://www.americancorpus.org>.
De Smet, H. (2005). A Corpus of Late Modern English. ICAME Journal, 29, 69–82.
Divjak, D. (2010). Structuring the lexicon: A clustered model for near-synonymy. Berlin & New
York: Mouton de Gruyter.
Elsness, J. (1984). That or zero: A look at the choice of the object clause connective in a corpus
of American English. English Studies, 65, 519–533. DOI: 10.1080/00138388408598357
Finegan, E., & Biber, D. (1995). That and zero complementizers in Late Modern English: Exploring ARCHER from 1650–1990. In B. Aarts, & C. Meyer (Eds.), The verb in contemporary
English: Theory and description (pp. 241–257). Cambridge: Cambridge University Press.
Glynn, D. (2010). Testing the hypothesis: Objectivity and verification in usage-based Cognitive Semantics. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 239–270). Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110226423
Gries, S. Th. (2013). Statistics for linguistics with R: A practical introduction. 2nd revised edition.
Berlin & Boston: Mouton de Gruyter. DOI: 10.1515/9783110307474
Grondelaers, S., Speelman, D., & Geeraerts, D. (2008). National variation in the use of er
“there”. Regional and diachronic constraints on cognitive explanations. In G. Kristiansen,
& R. Dirven (Eds.), Cognitive Sociolinguistics: Language variation, cultural models, social
systems (pp. 153–204). Berlin & New York: Mouton de Gruyter.
Grondelaers, S., Geeraerts, D., & Speelman, D. (2007). A case for cognitive corpus linguistics.
In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson, & M. Spivey (Eds.), Methods in Cognitive Linguistics (pp. 49–169). Amsterdam & Philadelphia: John Benjamins.
Heylen, K. (2005). A quantitative corpus study of German word order variation. In S. Kepser,
& M. Reis (Eds.), Linguistic evidence: Empirical, theoretical and computational perspectives
(pp. 241–264). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110197549.241
Kaltenböck, G. (2004). That or no that – that is the question: On subordinator suppression in
extraposed subject clauses. Vienna English Working Papers, 13, 49–68.
Kearns, K. (2007a). Epistemic verbs and zero complementizer. English Language and Linguistics, 11, 475–505. DOI: 10.1017/S1360674307002353
Kearns, K. (2007b). Regional variation in the syntactic distribution of null finite complementizer. Language Variation and Change, 19, 295–336. DOI: 10.1017/S0954394507000117
LAMPETER = Lampeter Corpus of Early Modern English Tracts (1641–1732). Retrieved from
<http://khnt.hit.uib.no/icame/manuals/LAMPETER/LAMPHOME.htm>.
Palander-Collin, M. (1999). Grammaticalization and social embedding: I THINK and METHINKS in Middle and Early Modern English (Mémoires de la Société Néophilologique de Helsinki, LV). Helsinki: Société Néophilologique.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1972). A grammar of contemporary English.
London: Longman.
Rissanen, M. (1991). On the history of that/zero as clause object links in English. In K. Aijmer,
& B. Altenberg (Eds.), English corpus linguistics: Studies in honor of Jan Svartvik (pp. 272–
289). London & New York: Longman.
Rohdenburg, G. (1996). Cognitive complexity and increased grammatical explicitness in English. Cognitive Linguistics, 7, 149–182. DOI: 10.1515/cogl.1996.7.2.149
Speelman, D., & Geeraerts, D. (2010). Causes for causatives: The case of Dutch ‘doen’ and ‘laten’.
In T. Sanders, & E. Sweetser (Eds.), Causal categories in discourse and cognition (pp. 173–
204). Berlin & New York: Mouton de Gruyter.
Tagliamonte, S., & Smith, J. (2005). No momentary fancy! The zero ‘complementizer’ in English
dialects. English Language and Linguistics, 9, 289–309. DOI: 10.1017/S1360674305001644
Thompson, S. A., & Mulac, A. (1991). A quantitative perspective on the grammaticalization of
epistemic parentheticals in English. In E. Traugott, & B. Heine (Eds.), Approaches to grammaticalization (pp. 313–339). Amsterdam & Philadelphia: John Benjamins.
TIME = Davies, M. (2007). TIME Magazine Corpus (1920s–2000s). Retrieved from <http://
corpus.byu.edu/time>.
Torres Cacoullos, R., & Walker, J. A. (2009). On the persistence of grammar in discourse formulas: A variationist study of that. Linguistics, 47, 1–43. DOI: 10.1515/LING.2009.001
Traugott, E. C., & Dasher, R. B. (2002). Regularity in semantic change. Cambridge: Cambridge
University Press.
Traugott, E. C., & König, E. (1991). The semantics-pragmatics of grammaticalization revisited.
In E. C. Traugott, & B. Heine (Eds.), Approaches to grammaticalization, Volume 1. Focus
on theoretical and methodological issues (pp. 189–218). Amsterdam & Philadelphia: John
Benjamins. DOI: 10.1075/tsl.19.1
Section 2
Statistical techniques
Techniques and tools
Corpus methods and statistics for semantics
Dylan Glynn
University of Paris VIII
The use of corpora in semantic research is a rapidly developing method. However, the range of quantitative techniques employed in the field can make it difficult for the non-specialist to keep abreast of methodological developments. This chapter serves as an introduction to the use of corpus methods in Cognitive Semantic research and as an overview of the relevant statistical techniques and the software needed for performing them. The discussion and description are intended for researchers in semantics who are interested in adopting quantitative corpus-driven methods. The discussion argues that there are fundamentally two corpus-driven approaches to meaning, one based on observable formal patterns (collocation analysis) and another based on patterns of annotated usage-features (feature analysis). The discussion then introduces and
explains each of the statistical techniques currently used in the field. Examples
of the use of each technique are listed and a summary of the software packages
available in R for performing the techniques is included.
Keywords: collocation analysis, corpus linguistics, semantics, statistics,
usage-feature analysis (behavioural profile)
1. Introduction
This chapter offers an explanation of the corpus methods represented in the book
and a brief overview of the various statistical techniques employed. It is designed as a
resource for those less familiar with the field, but also as a reference for those already
working with corpus-driven methods in Cognitive Semantics. Specifically, corpus-driven Cognitive Semantics is understood as the work beginning with Dirven et al.
(1982), Schmid (1993, 2000), Geeraerts et al. (1994, 1999) and Gries (1999, 2003),
and currently represented in the edited volumes of Gries and Stefanowitsch (2006),
Stefanowitsch and Gries (2006), Lewandowska-Tomaszczyk and Dziwirek (2009),
Glynn and Fischer (2010), Geeraerts et al. (2010), Divjak and Gries (2012), Gries and
Divjak (2012), Pütz et al. (2012), Reif et al. (2013), Glynn and Sjölin (2014), and in
the monographs Hilpert (2008, 2012), Divjak (2010a), Gilquin (2010), Dziwirek and
Lewandowska-Tomaszczyk (2011), Hoffmann (2011), and Glynn (forthc.).
In this chapter, corpus-driven Cognitive Semantics is argued to divide into two
methodologies, or analytical approaches, based either on the formal analysis of collocations or the semantic analysis of features. This proposed distinction is described in
Section 2. Following this, Section 3 describes the quantitative techniques used in such
research. It lists and explains the techniques and offers examples of how they are used,
giving detailed references on where the application of each technique is explained in
the literature.
2. Collocations and features: Two approaches to corpora
A common misconception amongst cognitive linguists is that corpus-driven research,
and indeed, the quantitative analysis of corpus data, does not involve any close analysis
of actual examples. This is not necessarily the case. Within Cognitive and Functional
Linguistics, broadly speaking, there is a wide range of approaches to corpus data, from
simply counting the number of occurrences of a given form in a given context to the
development of complex computational models trained on enormous text banks. For
corpus-driven research in semantics, where the ‘meaning’ of a given linguistic form
is in question, it is possible to broadly identify at least two approaches. All the studies
in the first section of this book fall into one of these two categories. The first of these
is based on formal, and therefore, observable, patterns. We can term this approach
‘collocation analysis’. Secondly, the corpus analysis can be based on patterns of annotated features, which we term ‘feature analysis’. In the former, the analysis seeks
to identify formal patterns so as to interpret them as indices of meaning structure
and in the latter, the analysis seeks to directly identify semantico-pragmatic patterns
through close manual annotation. Although the approaches can be combined (cf.
Stefanowitsch and Gries 2008), they tend to be used separately and possess distinct
strengths and weaknesses.
The first ‘type’ of corpus-driven research, collocation analysis, is more established
and is typical of mainstream Corpus Linguistics. Collocation studies identify the
co-occurrence of linguistic forms in a given sample of naturally occurring language.
Firth’s (1957: 179) now famous phrase, “you shall know a word by the company it
keeps”, is a succinct way of capturing the aim of this approach. When extended to
other parts of language, such as syntactic patterns or indeed text types and genres, the
large-scale study of collocation is a powerful tool for making generalisations about
language use. Cognitive and Functional Linguistics are particularly concerned with
why a given form is used and so it follows that in order to answer research questions
of this nature, inferences as to the semantic, functional, or conceptual motivation for
the collocation must be made in post hoc interpretation.
Despite this subjective step in the use of collocation analysis in Cognitive-Functional Linguistic research, the analytical approach has important advantages. Firstly, to the extent that one can retrieve forms automatically, one can consider extremely large
samples, making studies (relatively) representative of a given language or part of language. Secondly, forms are objectively identifiable, making this step largely independent of subjective analysis. However, this statement warrants qualification. Even if a
form is objectively identifiable, linguists are typically interested in only certain uses of
a given form and, often, these specific uses cannot be retrieved automatically. In such
situations, the decision as to which occurrences are representative of the category is
typically a question for debate (cf. Perek, this volume, 61–86).
Moreover, collocation studies rely on some measurement of association. Raw frequency of co-occurrence can be misleading because if one of the forms is extremely
frequent, then relatively high co-occurrence may just be a result of the overall high
frequency of that form. The problem of how to determine the degree of association, or
‘attraction’, is fundamental. Common ways of measuring the degree of association for
lexical co-occurrence are the mutual information (MI) score, the z-score (standard
score), the t-score and the log-likelihood. Many Corpus Linguistics programs, both
on-line and stand-alone, automatically generate some of these scores. Collostructional analysis is one alternative to such measures. Developed by Stefanowitsch and Gries
(2003, 2005) and Gries and Stefanowitsch (2004a, 2004b) and described in Hilpert
(this volume, 391–404), it is a suite of methods that use the Chi-squared or Fisher
exact test to compute degree of association. These techniques allow the researcher to
consider the co-occurrence, not just of lexemes, but also of syntactic patterns. Collostructional analysis has proven popular in Cognitive Linguistics.
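As a rough illustration of the measures just mentioned, both an MI score and a one-sided Fisher exact p-value can be computed from a 2×2 contingency table of a node word and a candidate collocate. The counts below are invented, and the sketch is not the collostructional software itself, merely the arithmetic behind such measures.

```python
import math

# Invented 2x2 table: a = node word with collocate, b = node word with
# other words, c = collocate elsewhere, d = all remaining tokens.
a, b, c, d = 40, 960, 210, 998790
N = a + b + c + d

# Mutual information: log2 of observed over expected co-occurrence.
expected = (a + b) * (a + c) / N
mi = math.log2(a / expected)

# One-sided Fisher exact test: P(co-occurrence >= a) under the
# hypergeometric null, computed via log-gamma to avoid huge integers.
def log_comb(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hypergeom_pmf(k):
    return math.exp(log_comb(a + b, k) + log_comb(c + d, (a + c) - k)
                    - log_comb(N, a + c))

p_value = sum(hypergeom_pmf(k) for k in range(a, min(a + b, a + c) + 1))
print(f"MI = {mi:.2f}, Fisher p = {p_value:.3g}")
```

Collostructional analysis applies the same Fisher exact logic to the co-occurrence of a lexeme and a construction rather than of two words; in practice, ready-made implementations in R are normally used rather than hand-rolled code like this.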
One of the newest advances in the use of collocation is the application of Word
Space modelling to semantic research questions within computational linguistics. The
principle is to extend the analysis of collocation beyond one or two words or even syntactic patterns, to whole lines, paragraphs and even entire texts. Such approaches give
rich collocation-based behavioural profiles of a given linguistic form. The implications
for such analytical techniques in semantics are only now being realised. This methodology is not represented in the volume. Peirsman et al. (2010) and Sagi et al. (2001) are
examples of the application of these methods to research in semantic relations.
The number of studies employing a collocation approach, even restricted to Cognitive Linguistics, is enormous. A small sample of recent studies includes Newman
and Rice (2004, 2006), Deignan (2005), Delorge (2009), Pęzik (2009), Van Bogaert
(2010), Colleman (2010) and Zeschel (2010). Applications of collostructional analysis
include Wulff (2006), Wulff et al. (2007), Hilpert (2008, 2009) and Gilquin (2010).
In general terms, it is possible to identify a second quantitative approach in
corpus-driven Cognitive Semantics, one that focuses on the manual analysis of
310 Dylan Glynn
usage-features. Although less traditional in the mainstream of Corpus Linguistics,
the general principle has a long tradition in Cognitive Linguistics (Dirven et al. 1982;
Rudzka-Ostyn 1989, 1995; Fillmore and Atkins 1992; Geeraerts et al. 1994) and, more
recently, is gaining currency in Functional Linguistics (Fischer 2000; Scheibman 2002;
Kärkkäinen 2003; Pichler 2013). The principle of combining the results of this usage-feature analysis with multivariate statistics begins with Geeraerts et al. (1999) and
Gries (2003). It is termed the behavioural-profile approach by Gries and Divjak (2009)
and Divjak and Gries (2009) and multifactorial usage-feature analysis by Glynn (2009,
2010b).1 The principle is simple: for a large sample of a given linguistic phenomenon,
various formal, semantic, and/or social ‘linguistic features’ (or ‘ID tags’ in the terminology of Gries and Divjak 2009) are identified and ascribed to each occurrence. It is
worth noting that the method per se has also been independently developed in social
psychology and computational linguistics. In the former, it is termed the analysis of
components (cf. Scherer 2005; Fontaine et al. 2013) and in the latter, sentiment analysis
(Wiebe et al. 2005; Verdonik et al. 2007; Daille et al. 2011; Balahur and Montoyo 2012;
Read and Carroll 2012; Taboada and Carretero 2012).
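The annotation principle can be illustrated with a minimal sketch: each occurrence of a form is ascribed a set of 'ID tags', and aggregating these tags yields a quantified usage-profile. The feature names and values below are invented for illustration.

```python
from collections import Counter

# Hypothetical usage-feature annotation: each corpus occurrence of a form
# is ascribed formal, semantic, and social features ('ID tags').
occurrences = [
    {"sense": "literal",    "tense": "past",    "register": "spoken"},
    {"sense": "figurative", "tense": "present", "register": "written"},
    {"sense": "literal",    "tense": "present", "register": "spoken"},
    {"sense": "literal",    "tense": "past",    "register": "spoken"},
]

# A simple behavioural profile: relative frequency of feature combinations.
profile = Counter((occ["sense"], occ["register"]) for occ in occurrences)
for (sense, register), n in sorted(profile.items()):
    print(sense, register, n / len(occurrences))
```

In practice, such tables of annotated occurrences are the input to the multivariate techniques surveyed below.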
The approach consists of the repeated application of what is essentially a ‘traditional’ linguistic analysis to hundreds, or even thousands, of naturally occurring
examples. This procedure results in a quantified usage-profile of the linguistic phenomenon in question. Usage-feature analysis is employed, with varying degrees of
statistical sophistication, to examine phenomena of all kinds, from syntactic variation
and semantics (Heylen 2005a; Bresnan et al. 2007; Speelman et al. 2009), to discourse
studies and conversation analysis (Scheibman 2002; Kärkkäinen 2003; Flores Salgado
2011; De Cock 2014a, 2014b), and even gesture research (Zlatev and Andrén 2009;
Morgenstern et al. 2011).
The limitations of the approach are twofold. Firstly, the detailed manual analysis
is as subjective as any traditional linguistic analysis and is open to the same vagaries,
theoretical biases and human error. Secondly, the manual analysis, or annotation, of
examples is meticulous and laborious. This, combined with the simple practical reality of limited resources, means that samples are relatively small. The resulting sample
1. Since Multifactorial Usage-Feature Analysis (Behavioural Profile Approach) is less known
in corpus circles, a selection of current examples of its use include: Geeraerts et al. (1999),
Gries (1999, 2003, 2006, 2010), Szmrecsanyi (2003, 2010), Wulff (2003, 2009), Heylen (2005a),
Divjak (2006, 2010a, 2010b), Divjak and Gries (2006, 2009), Bresnan et al. (2007), Grondelaers
et al. (2007, 2008), Glynn (2009, 2010a, 2010b, 2014a, 2014b, forthc.), Janda and Solovyev
(2009), Speelman et al. (2009), Speelman and Geeraerts (2010), Krawczak and Glynn (2011,
in press), Krawczak and Kokorniak (2012), Levshina (2012), Levshina et al. (2013a, 2013b),
Krawczak (2014a, 2014b), and Deshors (2014). Doctoral dissertations focusing on developing
the method include Gries (2000), Grondelaers (2000), Heylen (2005a), Glynn (2007), Arppe
(2008), Robinson (2010b), Deshors (2011), Levshina (2011), Barnabé (2012), Klavan (2012),
and Diehl (2014).
Techniques and tools 311
Table 1. Observational differences in collocation and feature analysis of corpora

               Stage 1: Analysis of data    Stage 2: Interpretation of analysis
Collocation    objective                    subjective
Feature        subjective                   objective
size makes it more difficult to be sure of representativity and harder to obtain statistically significant results.
The advantages of the approach are also twofold. Firstly, the method allows the
operationalisation and quantification of traditional linguistic analyses. This is no
trivial matter because it permits hypothesis testing and produces falsifiable results
for research questions not easily approached using traditional corpus methods (cf.
Geeraerts 2010; Glynn 2010b; Stefanowitsch 2010). Secondly, an important strength
lies in the possibility of treating the results obtained through the usage-feature analysis with multivariate statistics. This is especially important for non-modular theories
of linguistics, such as Cognitive Linguistics, because multivariate statistics permits
an analysis to handle the complexity of the interaction of the different dimensions
of language structure simultaneously (such as lexis, syntax, phonology, society, etc.),
creating a multidimensional and socio-conceptually realistic profile of the use of a
linguistic form or the role of a linguistic function.
Geeraerts (2011) compared the two corpus approaches, underlining that both
are subjective, but are at different stages in their application. Table 1 summarises
Geeraerts’ point about subjectivity.
Juxtaposing the two analytical approaches like this is, of course, a simplification.
At the first stage of analysis, collocation studies are often not entirely objective because
of questions such as what constitutes a ‘form’. Firstly, forms are polysemous and only
certain uses may be relevant for a given study. In such a situation, manual selection is
often the only solution. Secondly, the forms themselves are typically composite and
so formal variation itself can cause category issues. In other words, is a given formal
variant an example of the form in question or is it a ‘different’ form? Again, in such
situations, subjective categorisation enters the analysis. Turning to feature analysis,
the subjective first step is not always particularly subjective. Often, feature analysis is
largely based on observable phenomena. For example, grammatical features can be
crucial to usage-feature analysis and are annotated automatically, or if done manually,
are done so objectively.
At the stage of interpretation the same objective-subjective blurring occurs. For
collocation analysis, as Desagulier (this volume, 145–178) shows, statistical analysis
can help add a degree of objectivity to interpreting the collocation patterns observed.
A similar caveat is needed for the usage-feature method. Although multivariate statistics may help us to objectively distinguish semantico-pragmatic patterns from
non-patterns, we still must decide if those patterns answer the research question at
hand, which is an inherently subjective step.
3. Statistical techniques and tools
Often one of the most confusing issues in the application of quantitative techniques to
linguistic research is the myriad of different techniques available. This section is primarily intended for the reader who has some experience with quantitative methods,
presenting an overview of the techniques relevant to corpus linguistic research. For
the reader who has little experience in quantitative techniques, the overview will be
technical but, it is hoped, still informative.
It is important to understand that statistics is a rapidly growing science with constant new advances as well as many uncertainties and conflicts. Perhaps more importantly, we must also remember that statistical techniques are only analytical tools. No
statistical technique will identify a linguistic fact or explain any linguistic structure.
Nevertheless, statistical tools can be used by linguists to help look for language structure – assuming one knows where to look. They can also be used to confirm the probability that the results of an analysis are not a chance occurrence. Statistics can help
linguists in what they have been doing for centuries, describing and explaining language, but statistical techniques are only tools in that endeavour.
Just as there is sometimes a misconception that statistics can answer linguistic
questions, there exists a misconception that quantitative corpus-driven research is
devoid of ‘real’ linguistic analysis. Nothing is further from the truth. Corpus-driven
linguists deal with real language and in large quantities. The ‘numbers’ presented in
corpus-driven research are not the analysis; they are a quantitative summary of the
analysis, which must, in turn, be interpreted. Corpus-driven linguists, for the most
part, deal with language in a relatively close and fine-grained way; they just deal with
large quantities of it.
One of the aims of this book is to showcase and explain the use of a small set of
statistical techniques that can be helpful for traditionally trained linguists in their
research. The aim is not to teach statistics or the computer programs for performing
statistical analyses, but simply to introduce some of the possibilities. In this section,
we begin with a short description of the computer applications available for performing statistics, and then briefly consider a fundamental theoretical question for the
statistical sciences – type of data. This question is essential to understand before one
can decide which statistical techniques are appropriate in a given situation. This is followed by a systematic summary of the techniques currently used in the field, examples
of their use, as well as examples of texts that explain how they are used. The description ends with a detailed list of the different commands and packages for performing
these statistical techniques in the programming suite R.
Statistical software
There are many computer applications, commercial and otherwise, that enable the
researcher to perform statistical analysis. In this volume, the statistical program that
is used by most authors is R. This program is, in fact, a powerful programming suite
with enormous potential. The explanatory chapters all use R and the reader is taken
step-by-step through the necessary “code”, or command lines, needed to perform the
analyses. No attempt is made to demonstrate the full functionality of the program,
merely to offer a working knowledge of how to perform specific analyses.
This volume focuses upon R for three reasons. Firstly, it is a free and cross-platform program. Secondly, since it is open source, as soon as new statistical techniques
develop, new software modules are written and uploaded for the public. Thirdly, the
program is one of the two most commonly used programs for statistics in the social sciences (there are, of course, many more, especially programs devoted to specific techniques). The other is SPSS, which is also an extremely powerful and widely used tool but which, unlike R, includes a graphic user interface. Since R is equally powerful, arguably more up to date, entirely free and used by the majority of authors in the book, its only drawback is its command-line interface. However, in the following chapters, the command line is presented in simple step-by-step instructions and, it is hoped, will not pose too many problems for
the beginner. It is true that the command-line may seem daunting at first, but if the
steps are followed line-by-line, the only difference with ‘button-for-button’ (as in a
graphic interface application) is one of familiarity.
Other important application suites include SAS, Statistica, and Stata, which are
all powerful and versatile. SAS is command-line, like R, and in some ways R can be
seen as the open-source version of SAS. It is arguably the most complete statistical
programming suite, but is rarely used in the social sciences. Statistica and Stata are
comparable to SPSS. They too have graphic user interfaces, are relatively user friendly
and, just like SPSS, are costly. Statistica is restricted to the Windows operating system, but has a relatively large and helpful online community. Stata is cross-platform, but is probably less common than Statistica. It is not really possible to say which suite is the best, since certain techniques are extremely well covered in one suite and not in another. Due to its being open source, R is surely the suite with the most options and
also the quickest to respond to developments within the domain of statistics, but, of
course, that does not mean its implementation of those techniques is the best.
If the reader is familiar with any of these other programs, the descriptions of the
statistical techniques in the book, as well as their interpretation and application, will
still be useful. Lastly, it should be noted that a graphic user interface is under development for R. This is not drawn upon because its development is not yet complete and
the commands/R sessions described in this book are sufficiently straightforward that
readers who are not familiar with statistics or command-line will not have problems
following.
Types of data
Before choosing a statistical technique, one must first know what ‘type’ of data one
is dealing with. This is because different types of data require different statistical
techniques. The most basic distinction is between what is called continuous data and
categorical data. The former typically come from measurements and therefore make a
continuum, for example 1.0, 1.1, 1.2 … 1.8, 1.9, 2.0. This kind of data is probably the
most common and comes from diverse sources such as age, time, height, dosage, temperature, response times, and, arguably, grammatical judgements. Continuous data
are typical in psychology and psycholinguistics. The second kind of data is categorical,
also called ‘discrete data’, ‘tabular data’ or ‘count data’. It is this kind of data, as corpus
linguists, with which we are most often concerned. Such data include, for example,
the frequency of occurrence of a linguistic form, the number of times it occurs in a
given tense, or in a given register. In these examples, the data are said to be nominal because each of the categories is independent of the others. However, categorical
data can also be ordered. This is the case when, for example, the categories follow a
natural sequence or ranking, such as young, middle-aged, and old or when a sentence
is short, medium or long in length. Ordered categorical data share properties of both
nominal categorical and continuous data. Grammatical judgements, on a scale of 1 to
7, for example, could be argued to be continuous or ordered categorical. Technically,
it is ordered because a respondent cannot enter 3.5, for example, but is forced to make
a discrete choice upon what is, in reality, a continuous scale of acceptability. However,
if we assume that no respondent would perceive differences finer than the points on the scale, then we can treat the scale as a true measurement and, therefore, as continuous.
Table 2. Types of data in statistics

Data type    Example of data              Description                  Example of use
Continuous   1, 1.1, 1.2 … 1.8, 1.9, 2;   Sequential (ordered) but     Response times in
             1, 2, 3, 4, 5, 6, 7          non-discrete / continuous    Psycholinguistics
Ordered      short, medium, long;         Sequential (ordered) but     Different periods in
             cold, warm, hot              discrete / non-continuous    diachronic linguistics
Nominal      apples, peaches, pears;      Independent                  Different lexemes in
             y'all, you lot, youse        and discrete categories      Corpus Linguistics
Although there is occasionally debate on the issue, most statistical techniques are designed for one of these kinds of data. For example, least squares estimation and linear regression are used for continuous data and maximum likelihood and logistic regression for categorical data, just as principal components analysis is used for continuous data and correspondence analysis for categorical data. Table 2 summarises the differences.
Statistical techniques for corpus linguistics
Statistics is an immense science – there are countless tests and corrections for those
tests. There are even more exploratory techniques with various algorithms that each
technique can employ and different ways for representing results of those exploratory
techniques. Confirmatory analysis has again as many different techniques, but this
time, seemingly endless sets of diagnostics to check the validity of the results. It must be stressed that the techniques presented here scratch the surface not only of what is possible but also of the problems that exist.
We begin with significance tests and association measures. Although not statistical techniques per se, they are tools that are important to the field. We then cover
exploratory methods, the results of which cannot be used to make claims about structure beyond the sample. In other words, what is found with these techniques may be
restricted to the corpus or the extract of the corpus being examined. These exploratory techniques do not test hypotheses or make predictions about the population (real
language). The description then turns to confirmatory techniques, which are more
complex in their application but which make predictive claims and can test hypotheses in terms of statistical significance or the probability that observed structures exist
in real language beyond the sample.
Sample, significance and independence
Establishing that the occurrence of something in a given sample is more or less common than would be expected by chance or that two sets of data are more different
than would be expected by chance are basic steps in inductive research. Pearson’s Chi-squared test and Fisher’s Exact test are omnipresent in research based on samples of
categorical data. Gries (this volume) explains these tests and shows how to apply them
in R. Other tests useful for corpus data include the exact binomial test, McNemar’s
paired Chi-squared test, and the proportions test. These are used for investigating
relations in frequency tables. An excellent explanation of these tests and their commands in R can be found in Dalgaard (2008: Ch. 8) and Baayen (2008: Section 4.1.1). See also Gries (2009b: 125–127, 158–176; 2013: 165–172), Everitt and Hothorn (2010: Ch. 3), and Adler (2010: 360–367).
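The logic behind Pearson’s Chi-squared test on a 2×2 frequency table can be sketched in a few lines: compare each observed cell frequency with the frequency expected under independence. The counts below are invented, and Python is used for illustration (the chapters themselves use R).

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson's Chi-squared statistic for a 2x2 frequency table
    [[a, b], [c, d]], computed from expected cell frequencies."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    # Sum of squared deviations, each scaled by its expected frequency.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: a form in two registers vs the rest of each subcorpus.
stat = chi_squared_2x2(30, 70, 10, 90)
print(round(stat, 2))
```

The statistic is then compared against the Chi-squared distribution (here with one degree of freedom) to obtain a p-value; in R this is a one-line call to chisq.test.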
Collocation and association measures
Within Cognitive Linguistics, collostructional analysis has proven to be one of the
most important methods for investigating collocations. Developed by Stefanowitsch
and Gries (2003, 2005) and Gries and Stefanowitsch (2004a, 2004b), the principle
can be combined with a range of association measures for determining the degree of
collocational ‘strength’ (the measure is typically calculated with a p-value obtained
from a Fisher exact test, log-transformed). These calculations are not yet implemented in most corpus annotation or concordance software. However, Stefan Gries has
developed R scripts (semi-automated sets of commands) for performing the tests.2
Hilpert (this volume, 391–404) explains three varieties of collostructional analysis:
collexeme analysis, distinctive collexeme analysis, and covarying collexeme analysis.
2. For more information, contact Stefan Th. Gries. His contact details can be found on his
website: http://www.linguistics.ucsb.edu/faculty/stgries/.
Examples of use include Wulff (2003), Hilpert (2008), Stefanowitsch and Gries (2008),
Colleman (2009), and Gilquin (2010).
The aim of quantifying degree of association between two forms in terms of frequency is not unique to the collostructional suite. Corpus Linguistics has developed
an array of calculations to determine relative degree of association, especially between
individual words. The most common are the mutual information (MI), the z-score,
the t-score, and the log-likelihood. There is important variation in the results obtained from using any one test over another. Evert (2009) offers a detailed discussion
on the matter; see also Wiechmann (2008), Wulff (2010) and Desagulier (this volume,
145–178). The z-score and the t-score are both explained with the R-code in Johnson
(2008: Ch. 3) and Dalgaard (2008: Ch. 5). The freely available Ngram Statistics Package extracts sequences from a corpus and calculates a range of association measures.
All these scores are used extensively in collocation-based corpus linguistics.
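The collostructional measure described above, a log-transformed p-value from a Fisher exact test, can be sketched as follows. The one-tailed p-value is a sum of hypergeometric probabilities; the lexeme-by-construction counts are invented, and the Python code is a simplified illustration, not Gries’s R scripts.

```python
import math

def fisher_exact_p(a, b, c, d):
    """One-tailed Fisher exact test (over-representation) for the 2x2
    table [[a, b], [c, d]]: sum the hypergeometric probabilities of all
    tables at least as extreme as the observed one."""
    n = a + b + c + d
    p = 0.0
    for k in range(a, min(a + b, a + c) + 1):
        p += (math.comb(a + b, k) * math.comb(c + d, (a + c) - k)
              / math.comb(n, a + c))
    return p

# Invented counts: a lexeme inside vs outside a construction,
# against all other lexemes inside vs outside that construction.
p = fisher_exact_p(8, 2, 20, 70)
strength = -math.log10(p)  # collostruction strength as a positive score
print(round(strength, 2))
```

The log-transformation simply turns very small p-values into conveniently large positive scores, so that stronger attraction yields a higher number.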
Cluster analysis
Cluster analysis is a diverse family of techniques, which, as the name suggests, cluster
data. K-means clustering is used when one knows how many clusters there should
be in advance; the technique ‘sorts’ the data accordingly. More common in semantic
research is hierarchical clustering, which is used as an exploratory technique for the
identification of clusters in the data. Importantly, by identifying clusters, it also sorts
the data into the clusters it has ‘discovered’. The technique begins with a set of features
and then uses them to group the features of a given variable (for instance, a list of
senses, concepts, words, or constructions). It represents the results in a dendrogram,
a kind of plot that depicts groups in an intuitively transparent way as dependencies
clustered in branches. Cluster analysis is an excellent technique for determining
which forms are similar to each other and which are different.
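The agglomerative logic of hierarchical clustering can be sketched directly: repeatedly merge the two closest clusters until one remains. The lexemes and distances below are invented, and single linkage (cluster distance = closest pair of members) is only one of several possible linkage criteria.

```python
# Invented pairwise distances between four near-synonymous lexemes.
dist = {
    ("begin", "start"): 1.0, ("begin", "commence"): 2.0,
    ("start", "commence"): 2.5, ("begin", "finish"): 9.0,
    ("start", "finish"): 8.5, ("commence", "finish"): 9.5,
}

def d(x, y):
    return dist.get((x, y)) or dist.get((y, x)) or 0.0

clusters = [("begin",), ("start",), ("commence",), ("finish",)]
merges = []
while len(clusters) > 1:
    # Single linkage: distance between clusters = closest pair of members.
    i, j = min(
        ((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
        key=lambda ij: min(d(x, y) for x in clusters[ij[0]]
                                   for y in clusters[ij[1]]),
    )
    merged = clusters[i] + clusters[j]
    merges.append(merged)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

print(merges[0])  # the tightest pair is merged first
```

The sequence of merges is exactly what a dendrogram depicts: the earlier the merge, the lower the branch joining the items.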
It is explained by Divjak and Fieller (this volume, 405–442). Other explanations using R code include Crawley (2007: 742–744), Baayen (2008: 138–148), Johnson (2008: Ch. 6), and Ledolter (2013: Ch. 15). Härdle and Simar (2007: Ch. 11), Izenman (2008: Ch. 12), Drenan (2009: Ch. 25), Everitt and Hothorn (2010: 18), Afifi et al. (2011: Ch. 16) and Marden (2011: Ch. 12) represent detailed, yet approachable, explanations
without R code. Everitt et al. (2011) is surely the most comprehensive work devoted to
the technique, and although quite technical, is a systematic and excellent reference for
using cluster analysis. The book provides no explanations for performing the analysis,
but does give information on which software packages are available for many of the
analyses it describes.
Examples of use in Cognitive Linguistics include Schulze (1991), Chaffin (1992),
Myers (1994), Sandra and Rice (1995), Ravid and Hanauer (1998), Rice et al. (1999),
Gries (2006), Divjak (2006, 2010a), Divjak and Gries (2006), Gries and Hilpert (2008),
Valenzuela Manzanares and Rojo López (2008), Janda and Solovyev (2009), Louwerse
and Van Peer (2009), Robinson (this volume, 87–116), Glynn (2010a, 2014a, 2014b,
this volume, 117–144), Levshina (2012), Szmrecsanyi (2013), and Krawczak and
Glynn (in press).
Correspondence analysis
Correspondence analysis is an exploratory technique that helps identify associations
in the data, such as patterns in the combinations of linguistic features. The technique
is designed for dealing with complex interactions where it is not known a priori which dimension, be that syntax, semantics, pragmatics, or social context, structures the behaviour of the data. For instance, it can help find which semantic features typically
occur with a set of grammatical forms or constructions, but also how these two dimensions interact relative to social variation. It visualises these associations in biplots,
which, although arguably difficult to interpret, represent rich depictions of complex
structures.
Glynn (this volume, 443–486) explains the application and interpretation of two
varieties of correspondence analysis: binary correspondence and multiple correspondence analysis. There exist several comprehensive books devoted to the technique:
Benzécri (1980, 1992), Murtagh (2005), Greenacre (2007 [1993], 2010), and Le Roux
and Rouanet (2010). Amongst these, Greenacre (2007) is probably the standard book
of reference. Useful introductions include Le Roux and Rouanet (2004: Chs. 2 and 5),
Everitt (2005: Ch. 5), Härdle and Simar (2007: Ch. 13), Baayen (2008: Ch. 5), Izenman
(2008: Ch. 17), and Husson et al. (2011: Chs. 2 and 3). The last of these, Husson et al.
(2011), is particularly clear and includes some of the most recent developments.
Examples of use include Arppe (2006), Glynn (2007, 2009, 2010a, 2010b, 2014a,
2014b, this volume, 117–144), Szelid and Geeraerts (2008), Plevoets et al. (2008),
Glynn and Sjölin (2011), Krawczak and Glynn (2011), Barnabé (2012), Krawczak and
Kokorniak (2012), Nordmark and Glynn (2013), Levshina et al. (2013b), Desagulier
(this volume, 145–178; in press), Delorge et al. (this volume, 39–60), Fabiszak et al.
(this volume, 223–252), and Krawczak (2014a, 2014b; in press).
Multidimensional scaling
This technique is similar to correspondence analysis in its functionality and output.
It identifies correlations between levels (features) in frequency tables. Explanation in
R can be found in Rencher (2002: Ch. 15, Section 1), Everitt (2005: Ch. 5), Baayen (2008: 136–138), Drenan (2010: Ch. 23), Maindonald and Braun (2010 [2003]: 383–384), and Everitt and Hothorn (2009: Ch. 17; 2011: 121–127). A new volume, which is one of the most comprehensive applied works on the technique to date and one that includes explanation in R, is Borg et al. (2013). Adler (2010: 525, 541ff., 564) lists the wide range of functions in R for applying multidimensional scaling, but without examples of use. Härdle and Simar (2007: Ch. 15) and Izenman (2008: 13) offer more detailed explanations of how the technique functions. See Le Roux and Rouanet (2004: 12–14) and Cadoret et al. (2011) for comparison between multidimensional scaling and
correspondence analysis. Borg and Groenen (2005) is a complete description, containing both mathematical theory and details of application and interpretation. Cox
and Cox (2001) is equally detailed, though more concerned with mathematical theory. Nevertheless, the work includes helpful chapters on biplots and correspondence
analysis. Examples of its use within the field include Bybee and Eddington (2006),
Clancy (2006), Croft and Poole (2008), Szmrecsanyi (2010), Hilpert (2012), Heylen
and Ruette (2013), and Ruette et al. (in press, forthc.). Although not a corpus study,
Berthele (2010) is another recent example.
Configural frequency analysis
This is a simple and powerful technique, yet surprisingly uncommon outside the German linguistic tradition. It can be seen as a simplified log-linear analysis (see below) or
as multiple Chi-squared tests; indeed, it functions by creating log-linear combinations
of factors to predict cell frequencies typically based on Chi-squared tests. The technique offers possibilities for significance testing in multivariate models where no clear
response variable exists, by identifying which correlations in a multiway frequency
table are significant. The main limitation for the application of this technique is sample size. For a given analysis, all cells must have at least one occurrence and no more than 20% of the cells should have fewer than 5 occurrences. An excellent explanation, though with no
R code, can be found in Tabachnick and Fidell (2007: Ch. 16). Gries (2009b: 240–252) offers a clear explanation of how to implement it, but note that this is omitted from the newest version of his book (Gries 2013). Von Eye (2002) is a textbook devoted to the subject and von Eye et al. (2010) represents the state of the art. Hierarchical configural frequency analysis has been used by Stefanowitsch and Gries (2005, 2008),
the subject and von Eye et al. (2010) represents the state-of-the-art. Hierarchical configuration frequency analysis has been used by Stefanowitsch and Gries (2005, 2008),
Wulff et al. (2007), Hilpert (2009, 2012), Jing-Schmidt and Gries (2009), Schmidtke-Bode (2009), Berez and Gries (2010), Hoffmann (2011), and Kööts et al. (2012).
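The core of configural frequency analysis, comparing each cell’s observed frequency with the frequency expected under independence of the variables, can be sketched as follows. The two-way configuration and its counts are invented; a 'type' is a configuration whose Chi-squared component exceeds the relevant critical value.

```python
# Invented counts for configurations of two binary variables.
cells = {  # (tense, register) -> observed frequency
    ("past", "spoken"): 35, ("past", "written"): 15,
    ("present", "spoken"): 15, ("present", "written"): 35,
}
n = sum(cells.values())

# Marginal totals for every level of every variable.
margin = {}
for config, freq in cells.items():
    for level in config:
        margin[level] = margin.get(level, 0) + freq

components = {}
for config, obs in cells.items():
    # Expected frequency = n * product of the marginal proportions.
    exp = n
    for level in config:
        exp *= margin[level] / n
    components[config] = (obs - exp) ** 2 / exp  # one component per cell

print(components[("past", "spoken")])
```

A full analysis would correct the per-cell significance level for the multiple tests (e.g. with a Bonferroni correction), which this sketch omits.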
Linear discriminant analysis
Discriminant analysis is a classification technique that functions in a similar way to
logistic regression and classification tree analysis (see below). However, linear discriminant analysis requires normally distributed data and continuous predictor variables, two conditions that are rarely met in Corpus Linguistics.3
Venables and Ripley (2002: 331–338), Crawley (2007: 744–747), Baayen (2008: 154–160), Adler (2010: 440–444) and Maindonald and Braun (2010: 385–391) offer
explanations appropriate for the intermediate user. Everitt (2005: Ch. 7), Härdle and
Simar (2007: Ch. 12), Tabachnick and Fidell (2007: Ch. 9), Izenman (2008: Ch. 8) and
Afifi et al. (2011: Ch. 11) offer more substantial descriptions of discriminant analysis,
3. Cf. Stevens (2001), Arppe (2008: 164), Baayen (2008: 154), Heylen et al. (2008), and Hoffmann (2011: 95) for discussion on the problems associated with the implementation of discriminant analysis. See also Divjak (2010a: 138), who defends its use.
but offer no explanation for performing the analysis in R. Provided the criteria are met,
the method is a powerful classification technique and has been used by Gries (2003),
Wulff (2003), and Divjak (2010a) in the field.
Classification tree analysis
An alternative to linear discriminant analysis is a data mining technique designed
for categorical data called classification tree analysis. It is closely related to another
technique termed regression tree analysis, which is used for continuous data. Together they are referred to as CART (or classification and regression tree analysis). The
classification tree analysis technique employs an algorithm called recursive partitioning. For a given binary response variable (a vs. b), the algorithm begins with this
alternation and asks which of the predictors (the other variables in the model) is best
at predicting the choice between the two alternatives in the response variable. The
algorithm continues this process for each of the two branches until all the predictor
variables are ‘used up’. This recurring branching gives us a ‘tree’ that shows how the
different variables predict the outcome, a vs. b.
Classification tree analysis is explained and presented with R code in Crawley (2007: Ch. 21), Baayen (2008: 148–154), and Adler (2010: 406–417, 446–452). Other substantial descriptions include Venables and Ripley (2002: Ch. 9), Everitt and
Hothorn (2010: Ch. 9), Maindonald and Braun (2010: Ch. 11), and Marden (2011:
Ch. 11). The method has enjoyed some popularity in Cognitive Linguistic research,
being both straightforward to apply and to interpret. Within the field, examples of its
use include Klavan et al. (2011), Robinson (2012; this volume, 87–116), and Levshina
et al. (this volume, 205–222).
Bootstrapping regression trees and what is termed the random forests technique represent an important avenue for the development of these techniques. Bootstrapping is a widely used technique that randomises the data in order to test explanatory strength and, thus, to ascertain confidence scores for the observed data through comparison with the randomised version of the data. The application of such techniques to classification tree analysis is opening up a new set of statistical alternatives to logistic regression analysis (see below). See Everitt and Hothorn (2010: 170–173), Strobl et al. (2009a), Adler (2010: 414–417), and Maindonald and Braun (2010: 369–372) for
a description. Such techniques have yet to be applied in the field.
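The resampling idea behind bootstrapping can be sketched as follows: draw many resamples of the observations with replacement and use the spread of the resampled statistic as a confidence interval. The sample values (e.g. rates of a variant across speakers) are invented.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible
sample = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.53, 0.70, 0.64, 0.57]

def mean(xs):
    return sum(xs) / len(xs)

# 1000 bootstrap resamples, each the same size as the original sample.
boot_means = sorted(
    mean(random.choices(sample, k=len(sample))) for _ in range(1000)
)
low, high = boot_means[25], boot_means[975]  # 95% percentile interval
print(round(low, 2), round(high, 2))
```

The same resampling logic underlies random forests, where each tree in the ensemble is grown on a different bootstrap resample of the data.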
Regression analysis
In its various forms, regression analysis is one of the most widely used and powerful
techniques in statistics. The importance of regression techniques lies in their ability
to ‘predict outcomes’. The outcome is the term used to refer to a linguistic choice or
a linguistic variant. This can be any kind of linguistic phenomenon, from lexemes,
gestures, grammatical constructions and phonological patterns to the meanings of
words, pragmatic functions, even gender, period, sociolect or dialect. The principle
320 Dylan Glynn
of how a regression analysis works is simple. The regression analysis takes our linguistic analysis of the data and builds a model that attempts to predict the behaviour
of whatever phenomenon we are interested in explaining. If the model can predict
which linguistic phenomenon (choice or variant, for example) is used, based on the
linguistic analysis, then we can say that the analysis is accurate and at least adequate in distinguishing the phenomena under consideration.
The linguistic choice or variety is understood as the response variable, which is
‘predicted’ by the independent variables, or the factors and features of the linguistic
analysis. The model provides a great deal of information about how the linguistic
analysis predicts the behaviour of the response variable but three pieces of information are crucial. Firstly, it tells us which of the linguistic factors and features are statistically significant in predicting the outcome. Secondly, it tells us the effect size of
those features and factors; in other words, the relative importance of that factor or
feature in predicting the outcome. Lastly, it tells us how accurately a combination of all the significant factors and features distinguishes between the linguistic phenomena
(the forms, uses or varieties being investigated). The following sections summarise
several types of regression that are designed for categorical outcomes. This family of regression techniques is typically referred to as logistic regression.
The standard references for logistic regression modelling include Agresti (2013
[1990, 2002], also 2007) and Hosmer and Lemeshow (2013 [1989, 2000]). Harrell (2001, also 2012) and Faraway (2006, also 2002) are also widely used reference books
for the technique. Two other useful references include Hilbe (2009) and Menard
(2010, also 2002). Once the basics have been mastered, and perhaps even before then,
these books should be consulted. Especially useful is Thompson (2009), an unpublished and freely downloadable book that accompanies, step-by-step, Agresti’s work,
with the R code needed to perform most of what his books cover.
A note of caution is needed for the reader with little experience in statistics. None
of the aforementioned books are designed for novice users, but they need to be consulted before regression analysis is used in research. Actually performing regression
analysis is not particularly difficult. The complexity of confirmatory modelling lies
not in applying the techniques (fitting the models), but in knowing which of the many
algorithms and options one should use for the data and also applying and understanding the diagnostics of the model. Since confirmatory modelling tests hypotheses, one
runs the risk of what is termed a Type I Error. This is statistics parlance, more or less, for concluding that an effect exists when it does not. Before one reports findings
obtained with regression modelling, one should always have the results thoroughly
checked by a statistician.
Techniques and tools 321
Binary logistic regression
Currently, the most common regression analysis for categorical data is binary logistic
regression. This technique takes one or more ‘predictor’ or ‘explanatory’ variables and
attempts to predict the outcome of a binary response variable, such as the use of one
sense or near-synonym over another (start vs. begin, for instance). The regression
analysis ‘models’ the data, permitting it to indicate which features, or ‘levels’, are most
important in distinguishing the binary outcome. It also indicates the statistical significance of each of these predictions. Finally, scores for the overall success of the model
in predicting the outcome can be obtained.
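What binary logistic regression does, mapping a weighted sum of predictor values through the logistic function to a probability of one outcome over the other, can be sketched in pure Python with a simple gradient-descent fit. This is a bare-bones illustration, not the algorithms behind R's model-fitting functions; the single dummy predictor (formal register) and the learning settings are assumptions made for the example.

```python
import math

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(b0 + b.x))) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)            # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi   # predicted minus observed
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict(w, x):
    """Predicted probability of the outcome coded 1."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Toy alternation: y = 1 for 'begin'; the one predictor is formal register (0/1).
X = [[1.0]] * 5 + [[0.0]] * 5
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
w = fit_logistic(X, y)
print(round(predict(w, [1.0]), 2), round(predict(w, [0.0]), 2))
```

The fitted intercept and slope are the model's 'effect sizes': the slope is the change in log-odds of begin over start associated with formal register.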
As one of the most widely employed techniques in categorical statistics, it has a diverse range of tutorials and textbooks devoted to it. Specifically designed for
linguists, Speelman (this volume, 487–533) offers a concise introduction to applying
the technique, as do Baayen (2008: Ch. 6), Dalgaard (2008), Johnson (2008: Ch. 5), and Gries (2009b: 291–306; 2013: Ch. 5). Speelman and the latter two
explanations include R code. Crawley (2005: Ch. 16) also includes lucid explanations
of much of the R code needed.
More general explanations, which remain accessible to the relative beginner,
include Chatterjee and Hadi (2006: Ch. 12), Faraway (2006: Chs. 2–4), Gelman
and Hill (2007: Ch. 5), Sheather (2009: Ch. 8), Everitt and Hothorn (2010: Ch. 7),
Maindonald and Braun (2010: Ch. 8), Azen and Walker (2011: Chs. 8, 9), and Field
et al. (2012: Ch. 8). As mentioned above, the ‘standard’ references for the technique
include Harrell (2001), Faraway (2006), Hilbe (2009), Menard (2010: Chs. 8, 9),
Agresti (2013: Chs. 4–7; 2007: Chs. 4, 5), and Hosmer and Lemeshow (2013).
The technique is widely used in sociolinguistics and has a well-established tradition in Cognitive Linguistics. A few examples of use include Szmrecsanyi (2003,
2006), Heylen (2005b), Grondelaers et al. (2007, 2008), Speelman et al. (2009), Divjak
(2010a), Glynn (2010b, this volume, 117–144), Robinson (2010a, 2010b, this volume,
87–116), Speelman and Geeraerts (2010), Deshors (2011, 2014), Levshina (2011), and
Deshors and Gries (this volume, 179–204).
Loglinear analysis
Multiway frequency analysis or loglinear analysis is a technique not yet widely used
in the field. Unlike binary logistic regression, loglinear analysis is not limited to determining the difference between a maximum of two possibilities. Therefore, it can be
used to predict the behaviour of several senses, lexemes, or constructions. The technique is similar to configural frequency analysis, described above. Where configural frequency analysis examines configurations of sets of cells in a multiway frequency table, loglinear analysis looks at the interaction of the variables that make up that table. Another way to think of loglinear analysis is as a logistic regression analysis without a response variable (start vs. begin, for instance).
Instead of this response variable, one attempts to predict the actual cell frequencies of the table with the minimal number of factors.
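The idea of predicting frequencies from main effects alone can be made concrete with the simplest loglinear model, the independence model for a two-way table, in pure Python. This is a didactic sketch, not R's loglinear fitting functions, and the observed counts are invented.

```python
import math

def independence_expected(table):
    """Expected counts under the main-effects-only (independence) loglinear
    model: E[i][j] = row_total[i] * column_total[j] / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def g_squared(observed, expected):
    """Likelihood-ratio statistic G^2 = 2 * sum(obs * ln(obs / exp)); a large
    value means main effects alone cannot reproduce the observed frequencies."""
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )

# Invented 2 x 2 table: verb (start/begin) by register (informal/formal).
observed = [[30, 10], [10, 30]]
expected = independence_expected(observed)
print(expected)                                  # → [[20.0, 20.0], [20.0, 20.0]]
print(round(g_squared(observed, expected), 2))   # → 20.93
```

The large discrepancy between observed and expected counts shows that an interaction term (verb by register) is needed; in a full loglinear analysis, one searches for the smallest set of such terms that reproduces the table.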
Gries (this volume) offers a brief introduction to the technique, where it is termed
“Poisson regression”. Adler (2010: 394–395, 444) offers a very short explanation, but suggests a range of functions in R that can be used for fitting loglinear models (Adler 2010: 227, 425, 437–438, 543, 557–558, 569). Thompson’s (2009: Chs. 8, 9) R manual
for Agresti (2002) has two detailed chapters devoted to the technique. Short explanations include Oakes (1998: Ch. 5), Agresti (2007: Ch. 7; 2013: Chs. 9, 10), Faraway (2006: 61–67, 93–95), Dalgaard (2008: Ch. 15), Gries (2009b: 240–248; 2013: 324–327), Tarling (2009: Ch. 7), Maindonald and Braun (2010: 258–266), Afifi et al. (2011: Ch. 17), Azen
and Walker (2011: Ch. 7), Smith (2011: Ch. 4), Field et al. (2012: Ch. 18), and Ledolter
(2013: Ch. 7). Von Eye and Mun (2013) is a new volume devoted to the technique and
includes practical explanations in R. However, the book is relatively theoretical and
may prove challenging for learners. For users of SPSS, Tabachnick and Fidell (2007:
Ch. 16) present a thorough explanation. Kroonenberg (2008) is an approachable, non-technical volume devoted to the topic, and Christensen (1997) is older and more
technical, but comprehensive. Finally, Hilbe (2011) offers a less orthodox discussion,
contextualising loglinear modelling as a means for identifying multivariate dependencies. With an example-based discussion, the author reveals how the approach ties
in with other techniques. Within the field of Cognitive Linguistics, Krawczak and
Glynn (in press) and Glynn (forthc.) are examples of its use.
Multinomial logistic regression
This extension of binary logistic regression (explained above) is also called polychotomous logistic regression, or polytomous logistic regression. The principle is the same
as for binary logistic regression, save that there are multiple nominal outcomes. The
technique, however, still requires a baseline for the model, that is, an outcome that serves as the point of reference for the ‘other’ outcomes (start vs. begin, set off, and commence, for example).
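The baseline-category principle can be sketched in pure Python: the reference outcome's score is pinned at zero, every other outcome gets its own coefficients against that baseline, and the scores are turned into probabilities. The coefficient values below are invented for illustration; they are not the output of any fitted model.

```python
import math

def multinomial_probs(x, coefs, categories):
    """Baseline-category logit: the first category is the reference, with its
    linear score fixed at 0; each remaining category has its own intercept
    and slopes, interpreted as log-odds against that baseline."""
    scores = [0.0]                       # the baseline outcome
    for intercept, slopes in coefs:
        scores.append(intercept + sum(b * xj for b, xj in zip(slopes, x)))
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    return {c: e / total for c, e in zip(categories, exp_scores)}

# Invented coefficients for 'begin' and 'commence' against baseline 'start',
# with one predictor: formal register (0/1).
coefs = [(0.2, [1.5]), (-1.0, [3.0])]
probs = multinomial_probs([1.0], coefs, ["start", "begin", "commence"])
print({k: round(v, 2) for k, v in probs.items()})
# → {'start': 0.07, 'begin': 0.39, 'commence': 0.53}
```

Fitting such a model means estimating one intercept-and-slope set per non-baseline outcome, which is why interpretation is less straightforward than in the binary case.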
Arguably the most approachable descriptions to date are Hilbe (2009: Ch. 10),
Orme and Combs-Orme (2009: Ch. 3), and Ledolter (2013: Ch. 11), but see also
Agresti (2007: Ch. 6). Arppe (2008) represents a detailed study on possible alternatives to this technique. For SPSS users, Tarling (2009: Ch. 6) and Azen and Walker
(2011: Ch. 10) include a step-by-step, example-based explanation. For Stata users, Long and Freese (2006) is clear; its explanations are also useful independently of the statistical package used. The application of multinomial logistic regression is not straightforward and the technique has not yet enjoyed wide use in the field. However, as
quantitative approaches to semantics continue, its application is likely to be an important contribution. Arppe (2008), Nordmark and Glynn (2013), Krawczak (2014a,
2014b, in press), and Glynn (forthc.) represent examples of its application in Cognitive Linguistics.
Ordinal logistic regression
Also referred to as ordered multinomial logit regression or proportional odds regression, the technique is a special case of logistic regression where the response is multiple and ordered, such as ‘short’, ‘medium’, ‘long’ or ‘young’, ‘older’, and ‘oldest’. At
least three ways of modelling ordinal regression exist; the most common is called the
proportional method. The principle is straightforward. Rather than a single binary response, one has a series of cumulative binary responses. For example, for an ordered list of choices A,
B, C or D, one attempts to predict the outcome of A versus B, C, or D, then in turn A
or B versus C and D, and finally A or B or C versus D. If these response variables A,
B, C, and D are ordered, this can be interpreted as determining what factors predict
that ordering.
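The splitting of an ordered response into cumulative binary contrasts can be made concrete with a short pure-Python sketch. The labels and ordering are invented; a proportional-odds model would then fit one set of slopes, with a separate intercept per threshold, across these contrasts.

```python
def cumulative_splits(labels, order):
    """Recode an ordered outcome into the cumulative binary contrasts behind
    proportional-odds models: A vs. B|C|D, then A|B vs. C|D, then A|B|C vs. D.
    Each threshold maps every observation to 1 (at or below it) or 0 (above)."""
    rank = {category: i for i, category in enumerate(order)}
    return {
        "<=" + order[k]: [1 if rank[l] <= k else 0 for l in labels]
        for k in range(len(order) - 1)   # no contrast for the top category
    }

# Invented ordered responses.
observations = ["short", "long", "medium", "long"]
print(cumulative_splits(observations, ["short", "medium", "long"]))
# → {'<=short': [1, 0, 0, 0], '<=medium': [1, 0, 1, 0]}
```

The 'proportional odds' assumption is precisely that the same predictor effects hold at every one of these thresholds, with only the intercept differing.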
The most accessible explanations of such modelling can be found in Baayen
(2008: Ch. 6), Hilbe (2009: Ch. 9), Orme and Combs-Orme (2009), and Tarling (2009:
Ch. 8). O’Connell (2006) is a user-friendly textbook devoted to the technique, but
intended for users of SPSS. Long and Freese (2006) is comparable for users of Stata. Agresti (2013: 86–98) offers a description of some of the basic issues and tests
involved with ordered categories, and Agresti (2007: Ch. 6) offers a more detailed
description, though somewhat theoretical. In terms of theory, Agresti (2010) represents a comprehensive work of reference. Johnson and Albert (1999) is a detailed and
somewhat technical book devoted to the subject. This is a good reference, but has little
explanation of application and only includes a software guide for the program MATLAB.
Mixed-effects logistic regression
Sometimes also called multilevel modelling or hierarchical modelling, this technique
is similar to ‘normal’ logistic regression, except that the model accounts for both
‘fixed’ effects (that is, the predictors in the model) and ‘random’ effects (or factors
we know a priori are ‘noise’ in the model). For example, if one is looking at examples
from a small set of sources, such as a set of authors in a diachronic corpus or speakers in discourse analysis, one does not want the individual traits of those authors or
speakers influencing the outcome of the analysis. These unwanted effects are treated
as ‘random’ in the model. Put simply, mixed-effects regression analysis accounts for
those ‘unwanted’ factors, and ‘neutralises’ their effects, preventing them from skewing
results. The principle can be applied to any form of regression, including the ordinal
and multinomial regression explained above. Speelman (this volume, 487–533) offers
a succinct explanation.
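The intuition behind random intercepts, namely that each speaker gets an individual baseline which is pulled back towards the overall mean in proportion to how little data that speaker contributes, can be sketched with a toy partial-pooling calculation. This is a caricature of what mixed-model estimation actually does; the shrinkage constant and the data are invented for the example.

```python
def random_intercepts(data, shrink=5.0):
    """Toy partial pooling: each speaker's intercept is its observed mean,
    shrunk towards the grand mean; speakers with few observations are
    shrunk the most, which is the idea behind random intercepts."""
    values = [v for vs in data.values() for v in vs]
    grand = sum(values) / len(values)
    intercepts = {}
    for speaker, vs in data.items():
        n = len(vs)
        weight = n / (n + shrink)        # less data => smaller weight
        intercepts[speaker] = grand + weight * (sum(vs) / n - grand)
    return intercepts

# Invented binary outcomes (1 = 'begin') for two speakers.
data = {"speaker_A": [1] * 8 + [0] * 2, "speaker_B": [1, 0]}
print({k: round(v, 3) for k, v in random_intercepts(data).items()})
# → {'speaker_A': 0.783, 'speaker_B': 0.679}
```

Speaker B, with only two observations, is pulled much closer to the grand mean of 0.75 than speaker A: idiosyncratic speaker behaviour is thereby prevented from skewing the fixed-effect estimates.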
An older, but thorough, description can be found in Edwards (2000: Ch. 4).
Gelman and Hill (2007) offer an extremely detailed, yet approachable, book on the
matter. Crawley (2007: Ch. 19), Baayen (2008: Ch. 7), Maindonald (2008: Ch. 10),
Sheather (2009: Ch. 10), and Tarling (2009: Ch. 9) give clear introductions to the
method, as does Johnson (2008: 255–260). See also Crawley (2007: Ch. 19), who gives one of the clearest explanations of how to distinguish random variables from fixed
variables, and Maindonald and Braun (2010: Ch. 10), who offer a thorough description of the interpretation of the output in R. Finally, Hox (2010) is a work devoted to
the technique. It is broad in its coverage, with a theoretical orientation, but it remains
approachable for the faux-débutant, serving as an excellent book of reference. Mixed
models are beginning to become more common in the Cognitive Linguistic literature;
examples include Bresnan et al. (2007), Divjak (2010b), Klavan (2012), Levshina et al.
(2013a; this volume, 205–222); Krawczak and Glynn (in press), and Glynn (2014a).
Table 3 summarises the different techniques described here. Although the table
systematically covers the techniques for categorical data, it does not include any techniques for continuous data. Moreover, it does not include many of the recent advances
and variants, such as random forest classification or hierarchical configural frequency
analysis. Tabachnick and Fidell (2007: 29–31) offer an excellent breakdown of many of the multivariate techniques available; so too does Baayen (2008: Appendix B).
Tummers et al. (2005), Heylen et al. (2008) and Gilquin and Gries (2009) offer extensive discussions on the quantitative state-of-the-art in Cognitive Linguistics.
Just as the number of different statistical techniques can be overwhelming for
someone first learning, so too can the number of packages and commands available
for performing them in R. Packages are modules that expand R’s functionality and
the commands are the computer prompts that make them operate. One of R’s most important strengths is the fact that it has a vibrant community, with countless active internet fora and just as many people writing packages to refine and advance the application of every imaginable statistical technique. The downside to this, of course, is that a simple search request on the Internet can result in an overload of information and
options. In response to this problem, Table 4 represents a concise reference list for the
functions and packages in R for performing the multivariate techniques described
above. It is far from complete, being designed as a quick reference for the intermediate user who wishes to get started on a method with which he or she is not yet
familiar. Also included are references for tutorials and textbooks on the functions
and packages. A complete list would be impossible since many of the techniques have
a number of packages devoted, or partially devoted, to them and other techniques
have many variants. Moreover, it must be remembered that for the confirmatory techniques, there also exist large numbers of diagnostic and visualisation options, most
of which are performed with the use of other more general or more specific packages
and functions.
Certain books can be recommended for the reader who wishes to go back and investigate the basics that this volume skips, and also for the reader who wishes to delve
deeper into the kinds of methods presented here. Baayen’s (2008) Analyzing Linguistic
Data is an excellent place to start. Another highly recommended guide for starting
statistical analysis using R in Linguistics is Dalgaard’s (2008) Introductory Statistics
with R. If used in combination with Baayen (2008), one should be able to move on
Table 3. Quantitative techniques and their usage in corpus-based Cognitive Linguistics

Technique | Type | Object | Example of application | Explanation
T-score, Z-score, MI score | measure | collocation strength | identifying multi-word patterns – Biber (2009); identifying constructional variants – Wong (2009) | Evert (2009), Biber & Jones (2010)
Chi-squared test, Fisher’s exact test | univariate | probability / independence | synonymy, constructional – Wulff (2006); polysemy, lexical – Robinson (2010) | Dalgaard (2008), Everitt & Hothorn (2009), Gries (this volume)
Collostructional analysis | univariate | collocation strength | synonymy, constructional – Hilpert (2008); synonymy, constructional – Gilquin (2010) | Stefanowitsch & Gries (2003), Gries & Stefanowitsch (2004a), Hilpert (this volume)
Hierarchical cluster analysis | multivariate | associations btw. objects of single variable | polysemy, lexical – Gries (2006); synonymy, lexical – Divjak (2010a) | Baayen (2008), Everitt et al. (2011), Divjak & Fieller (this volume)
Multidimensional scaling | multivariate | associations btw. objects of multiple variables | synonymy, morphological – Croft & Poole (2008); relations between variants – Berthele (2010) | Baayen (2008), Izenman (2008), Everitt & Hothorn (2010)
Correspondence analysis | multivariate | associations btw. objects of multiple variables | synonymy, concepts – Szelid & Geeraerts (2008); polysemy, constructional – Glynn (2009) | Le Roux & Rouanet (2010), Husson et al. (2011), Glynn (this volume)
Configural frequency analysis | multivariate | associations btw. objects of multiple variables | polysemy, constructional – Hilpert (2009); synonymy, constructional – Hoffmann (2011) | von Eye (2002), Tabachnick & Fidell (2007), Gries (2009b)
Discriminant analysis | multivariate | identify factors that lead to an outcome / prediction | synonymy, constructional – Gries (2003); synonymy, lexical – Divjak (2010a) | Tabachnick & Fidell (2007), Baayen (2008), Maindonald & Braun (2010)
Classification tree analysis | multivariate | identify factors that lead to an outcome / prediction | polysemy, lexical – Robinson (2012a); synonymy, lexical – Levshina et al. (this vol.) | Venables & Ripley (2002), Everitt & Hothorn (2010), Maindonald & Braun (2010)
Loglinear analysis | multivariate | predict correlation, multiple response variables | synonymy, constructional – Krawczak & Glynn (in press); polysemy, lexical – Glynn (forthc.) | Kroonenberg (2008), Hilbe (2011), Smith (2011)
Binary logistic regression | multivariate | predict outcome, binary response variable | synonymy, constructional – Szmrecsanyi (2003); synonymy, lexical – Speelman & Geeraerts (2010) | Orme & Combs-Orme (2009), Everitt & Hothorn (2010), Speelman (this volume)
Ordinal logistic regression | multivariate | predict outcome, ranked response variable | synonymy, lexico-constructional – Klavan (2012); synonymy, constructional – Glynn & Krawczak (forthc.) | Baayen (2008), Tarling (2009), Orme & Combs-Orme (2009)
Multinomial logistic regression | multivariate | predict outcome, multiple response variables | synonymy, lexical – Arppe (2008); synonymy, lexical – Krawczak (2014a) | Long & Freese (2006), Tarling (2009), Orme & Combs-Orme (2009)
Mixed-effects logistic regression | multivariate | predict outcome, include random variables | synonymy, constructional – Divjak (2010b); synonymy, constructional – Levshina et al. (this vol.) | Baayen (2008), Maindonald & Braun (2010), Smith (2011)
Table 4. Functions and packages for categorical multivariate statistics in R

Technique | Function | Package | R code tutorial
Hierarchical cluster analysis | hclust | stats* | Crawley (2007: 738ff.); Zhao (2013)
Hierarchical cluster analysis | agnes | cluster | Kaufman & Rousseeuw (2005); Maechler (2013)
Hierarchical cluster analysis | pvclust | pvclust | Suzuki & Hidetoshi (2006); Suzuki (2013)
K-means cluster analysis | kmeans | stats* | Crawley (2007: 742ff.); Zhao (2013)
K-means cluster analysis | clara | cluster | Kaufman & Rousseeuw (2005); Maechler (2013)
K-means cluster analysis | pamk | fpc | Hennig (2013)
Binary correspondence analysis | corresp | MASS* | Venables & Ripley (2002: 326ff.); Ripley (2013)
Binary correspondence analysis | ca | ca | Greenacre (2007); Nenadić & Greenacre (2007)
Binary correspondence analysis | anacor | anacor | de Leeuw & Mair (2009a, 2013a)
Multiple correspondence analysis | mca | MASS* | Venables & Ripley (2002: 329f.); Ripley (2013)
Multiple correspondence analysis | mjca | ca | Greenacre (2007); Nenadić & Greenacre (2007)
Multiple correspondence analysis | MCA | FactoMineR | Lê et al. (2008); Husson et al. (2013)
Multidimensional scaling | cmdscale | stats* | Baayen (2008: 136ff.); Johnson (2008: 208ff.)
Multidimensional scaling | sammon | MASS* | Maindonald & Braun (2010: 284f.)
Multidimensional scaling | smacofSym | smacof | de Leeuw & Mair (2009b, 2013b)
Configural frequency analysis | cfa | cfa | Funke et al. (2007); von Eye & Mair (2008)
Configural frequency analysis | hcfa | cfa | Gries (2010: 248ff.); von Eye et al. (2010: 265ff.)
Configural frequency analysis | cfa2 | cfa2 [4] | No tutorials available, cf. Schönbrodt (2013)
Linear discriminant analysis | lda | MASS* | Baayen (2008: 167ff.); Maindonald & Braun (2010: 385ff.)
Linear discriminant analysis | discrim | ade4 | Chessel et al. (2004); Chessel & Dufour (2013)
Linear discriminant analysis | rda | klaR | Roever et al. (2013)
Classification tree analysis / random forest classification | rpart | rpart | Zhao (2012: 32ff.); Therneau et al. (2013)
Classification tree analysis / random forest classification | tree | tree | Venables & Ripley (2002: 266)
Classification tree analysis / random forest classification | ctree | party | Zhao (2013: 29ff.)
Classification tree analysis / random forest classification | cforest | party | Strobl et al. (2009a, 2009b)
Classification tree analysis / random forest classification | randomForest | randomForest | Maindonald & Braun (2010: 351ff.); Liaw & Wiener (2002)
Loglinear analysis | glm | MASS* | Maindonald & Braun (2010: 258ff.); Baguley (2012)
Loglinear analysis | loglm | MASS* | Thompson (2009: 142ff.); Baguley (2012)
Loglinear analysis | quasipois | aod | Lesnoff & Lancelot (2013)

4. The package cfa2 is not currently in the CRAN repository for R but can be found in the RForge repository. This repository is typically used for packages still under development. A simple command listed on the RForge site for the package will install it as effortlessly as installation using the ‘normal’ method in R.
Table 4. (continued)

Technique | Function | Package | R code tutorial
Binary logistic regression | glm | MASS* | Baayen (2008: 195ff.); Everitt & Hothorn (2010: 122ff.)
Binary logistic regression | lrm | rms† | Harrell (2001: 257ff.; 2012: 221ff.); Baayen (2008: 195ff.)
Binary logistic regression | MCMClogit | MCMCpack | Martin et al. (2010)
Ordinal logistic regression | polr | MASS* | Faraway (2006: 117ff.); Maindonald & Braun (2010: 270ff.)
Ordinal logistic regression | lrm | rms† | Baayen (2008); Johnson (2008)
Ordinal logistic regression | clm | ordinal | Christensen (2012)
Multinomial logistic regression | multinom | nnet | Faraway (2006: 106); Thompson (2009: 118)
Multinomial logistic regression | polytomous | polytomous | Arppe (2014)
Multinomial logistic regression | mlogit | mlogit | Field et al. (2012: 325); Croissant (2013)
Multilevel logistic regression (mixed effects) | lmer | lme4 | Baayen (2008: 278ff.); Bates (forthc.: 1ff.)
Multilevel logistic regression (mixed effects) | glmmPQL [5] | MASS* | Johnson (2008: 255ff.); Thompson (2009: 179ff.)
Multilevel logistic regression (mixed effects) | MCMCglmm | MCMCglmm | Hadfield (2010)

* Recommended package for the base installation. This means it comes ‘pre-installed’.
† In older textbooks and tutorials, this package is called Design. The package rms is simply a new version. The command line to use the package is unchanged, and so older descriptions remain helpful.
from what is covered in this volume on all three fronts – developing knowledge of R,
the basic statistical principles and tests, as well as advanced statistical analysis.
Gries’ (2009a) Quantitative Corpus Linguistics with R is another book to consider. Although an excellent book, it is designed more for corpus linguistics per se than
multivariate analysis. More in line with the focus of this volume is Gries’ (2009b)
Statistics for Linguistics Using R. It covers the basics thoroughly and introduces some
multivariate statistical techniques. A new edition, Gries (2013), expands the chapter on precisely the techniques covered in this volume.
Johnson’s (2008) Quantitative Methods in Linguistics is a good debutant-level statistics textbook using R – it explains both the command line and the statistics lucidly and concisely. However, it ‘orders’ the different statistical techniques relative to different subfields of linguistics. This could be misleading for the novice and not particularly
logical for the reader with some knowledge in the field, since most of the techniques
are not at all restricted to the subfield Johnson ascribes to them.

5. The function glmmPQL uses a so-called penalised quasi-likelihood, which has lost favour in the research community (Crawley 2007: 655). Although the functions lmer and MCMCglmm are more up to date in this regard, glmmPQL in the MASS package still works perfectly well, especially when learning, since some of the command line is closer to other regression functions a learner may have already mastered.

However, the explanations of the techniques are clear, especially concerning the issues that lie between
the very basic and more advanced study, such as understanding data distribution and
samples. Slightly more advanced books, though approachable, are Everitt and Hothorn
(2010; 2011). These volumes are excellent textbooks for researchers with an introductory knowledge in statistics and/or with R, but who wish to adopt multivariate techniques – veritable handbooks. Although the examples are not linguistic, they are clear
and well chosen. The statistical techniques covered are all explained through the use of
examples. The demonstration of the R code is systematic and complete. Finally, Keen
(2010) offers a thorough coverage of the graphic possibilities in R. Appropriate for
novice and expert alike, the book is practically orientated with detailed examples of
the R code.
References
Adler, J. (2010). R in a nutshell: A desktop quick reference. Sebastopol: O’Reilly Media.
Afifi, A., May S., & Clark, V. A. (2011). Practical multivariate analysis (5th ed.). London: Chapman & Hall.
Agresti, A. (2007). An introduction to categorical data analysis (2nd ed.). Hoboken: John Wiley.
DOI: 10.1002/0470114754
Agresti, A. (2010). Analysis of ordinal categorical data (2nd ed.). Hoboken: John Wiley.
DOI: 10.1002/9780470594001
Agresti, A. (2013) [1990, 2002]. Categorical data analysis (3rd ed.). New York: John Wiley.
Arppe, A. (2006). Frequency considerations in morphology: Finnish verbs differ, too. SKY Journal of Linguistics, 19, 175–189.
Arppe, A. (2008). Univariate, bivariate and multivariate methods in corpus-based lexicography – A study of synonymy. Unpublished PhD dissertation, University of Helsinki.
Azen, R., & Walker, C. (2011). Categorical data analysis for the behavioral and social sciences.
New York & Hove: Routledge.
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R.
Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511801686
Baguley, T. (2012). Loglinear models. Online Supplement 5 to Serious stats: A guide to advanced
statistics for the behavioral sciences. Basingstoke: Palgrave. Available at: http://www.
palgrave.com/psychology/baguley/students/supplements.html.
Balahur, A., & Montoyo, A. (2012). Semantic approaches to fine and coarse-grained feature-based opinion mining. In H. Horacek, E. Métais, R. Muñoz, & M. Wolska (Eds.),
Natural language processing and information systems (pp. 142–153). Berlin: Springer.
Barnabé, A. (2012). Le schème du chemin en grammaire et sémantique anglaises. Unpublished
PhD dissertation, Université Bordeaux 3.
Bates, D. (Forthcoming). lme4: Mixed-effects modeling with R. Heidelberg & New York: Springer. Preprints available at: http://lme4.r-forge.r-project.org/lMMwR/lrgprt.pdf.
Benzécri, J.-P. (1980). Pratique de l’analyse des données. Paris: Dunod.
Benzécri, J.-P. (1992). Correspondence analysis handbook. New York: Dekker.
Berthele, R. (2010). Investigations into the folk’s mental models of linguistic varieties. In
D. Geeraerts, G. Kristiansen, & Y. Peirsman (Eds.), Advances in cognitive sociolinguistics
(pp. 265–290). Berlin & New York: Mouton de Gruyter.
Biber, D. (2009). A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing. International Journal of Corpus Linguistics, 14, 275–311.
DOI: 10.1075/ijcl.14.3.08bib
Biber, D., & Jones, J. (2009). Quantitative methods in Corpus Linguistics. In A. Lüdeling, &
M. Kytö (Eds.), Corpus Linguistics: An international handbook. Vol. 2. (pp. 1287–1304).
Berlin & New York: Mouton de Gruyter.
Borg, I., Groenen, P., & Mair, P. (2013). Applied multidimensional scaling. Heidelberg & New York: Springer. DOI: 10.1007/978-3-642-31848-1
Borg, I., & Groenen, P. (2005). Modern multidimensional scaling (2nd ed.). Heidelberg & New York: Springer.
Bresnan, J., Cueni, A., Nikitina, T., & Baayen, H. (2007). Predicting the dative alternation. In G. Bouma, I. Krämer, & J. Zwarts (Eds.), Cognitive foundations of interpretation (pp. 69–94). Amsterdam: Royal Netherlands Academy of Arts and Sciences.
Bybee, J., & Eddington, D. (2006). A usage-based approach to Spanish verbs of ‘becoming’.
Language, 82, 323–355. DOI: 10.1353/lan.2006.0081
Cadoret, M., Lê, S., & Pagès, J. (2011). Multidimensional scaling versus multiple correspondence
analysis when analyzing categorization data. In B. Fichet, D. Piccolo, R. Verde, & M. Vichi
(Eds.), Classification and multivariate analysis for complex data structures (pp. 301–308).
Heidelberg & New York: Springer. DOI: 10.1007/978-3-642-13312-1_31
Chaffin, R. (1992). The concept of a semantic relation. In A. Lehrer, & E. Kittay (Eds.), Frames,
fields, and contrasts: New essays in semantic and lexical organisation (pp. 253–288).
London: Lawrence Erlbaum.
Chatterjee, S., & Hadi, A. (2006). Regression analysis by example. London: John Wiley.
DOI: 10.1002/0470055464
Chessel, D., & Dufour, A.-B. (2013). Analysis of ecological data: Exploratory and Euclidean
methods in environmental sciences. Available at: http://cran.r-project.org/web/packages/
ade4/ade4.pdf.
Chessel, D., Dufour, A.-B., & Thioulouse, Y. (2004). The ade4 package – I: One-table methods.
R News, 4, 5–10.
Christensen, R. (1997). Log-linear models and logistic regression (2nd ed.). Heidelberg & New York: Springer.
Christensen, R. (2012). A tutorial on fitting cumulative link models with the ordinal package.
Available at: http://cran.r-project.org/web/packages/ordinal/vignettes/clm_intro.pdf.
Clancy, S. (2006). The topology of Slavic case: Semantic maps and multidimensional scaling.
Glossos, 7, 1–28.
Colleman, T. (2009). The semantic range of the Dutch double object construction. A collostructional perspective. Constructions and Frames, 1, 190–221. DOI: 10.1075/cf.1.2.02col
Colleman, T. (2010). Beyond the dative alternation: The semantics of the Dutch aan-Dative. In
D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches
(pp. 271–304). Berlin & New York: Mouton de Gruyter.
Cox, T., & Cox, M. (2001). Multidimensional scaling (2nd ed.). Boca Raton: Chapman & Hall.
Crawley, M. (2005). Statistics: An introduction using R. Southern Gate & Hoboken: John Wiley.
DOI: 10.1002/9781119941750
Crawley, M. (2007). The R book. Chichester: John Wiley. DOI: 10.1002/9780470515075
Croft, W., & Poole, K. (2008). Inferring universals from grammatical variation: Multidimensional scaling for typological analysis. Theoretical Linguistics, 34, 1–37.
DOI: 10.1515/THLI.2008.001
Croissant, Y. (2013). Estimation of multinomial logit models in R: The mlogit packages. Available at: cran.r-project.org/web/packages/mlogit/mlogit.pdf.
Daille, B., Dubreil, E., Monceaux, L., & Vernier, M. (2011). Annotating opinion–evaluation of
blogs: The Blogoscopy corpus. Language Resources and Evaluation, 45, 409–437.
DOI: 10.1007/s10579-011-9154-z
Dalgaard, P. (2008). Introductory statistics with R (2nd ed.). Dordrecht: Springer.
DOI: 10.1007/978-0-387-79054-1
De Cock, B. (2014a). A discourse-functional analysis of speech participant profiling in spoken
Spanish. Amsterdam & Philadelphia: John Benjamins.
De Cock, B. (2014b). The discursive effects of Spanish impersonals uno and se. In D. Glynn, &
M. Sjölin (Eds.), Subjectivity and epistemicity: Corpus, discourse, and literary approaches to
stance (pp. 103–120). Lund: Lund University Press.
De Leeuw, J., & Mair, P. (2009a). Simple and canonical correspondence analysis using the R
package anacor. Journal of Statistical Software, 31, 1–18.
De Leeuw, J., & Mair, P. (2009b). Multidimensional scaling using majorization: The R package
smacof. Journal of Statistical Software, 31, 1–30.
De Leeuw, J., & Mair, P. (2013a). anacor: Simple and canonical correspondence analysis. Available at: cran.r-project.org/web/packages/anacor/anacor.pdf.
De Leeuw, J., & Mair, P. (2013b). SMACOF for multidimensional scaling. Available at: http://
cran.r-project.org/web/packages/smacof/smacof.pdf.
Deignan, A. (2005). Metaphor and Corpus Linguistics. Amsterdam & Philadelphia: John
Benjamins. DOI: 10.1075/celcr.6
Delorge, M. (2009). A diachronic corpus study of the constructional behaviours of reception
verbs in Dutch. In B. Lewandowska-Tomaszczyk, & K. Dziwirek (Eds.), Studies in Cognitive Corpus Linguistics (pp. 249–272). Frankfurt/Main: Peter Lang.
Desagulier, G. (In press). Le statut de la fréquence dans les Grammaires de Constructions:
‘simple comme bonjour’? Langages.
Desagulier, G. (Submitted). Quite new methods for a rather old issue: Exploring and visualizing
collocation data from the BNC with correspondence analysis.
Deshors, S. (2011). A multifactorial study of the uses of may and can in French-English interlanguage. Unpublished PhD dissertation, University of Sussex.
Deshors, S. (2014). Identifying different types of non-native co-occurrence patterns: A corpus-based approach. In D. Glynn, & M. Sjölin (Eds.), Subjectivity and epistemicity: Corpus,
discourse, and literary approaches to stance (pp. 387–412). Lund: Lund University Press.
Diehl, H. (2014). On modal meaning in the uses of quite, rather, pretty and fairly as degree
modifiers in British English. Unpublished PhD dissertation, Lund University.
Dirven, R., Goossens, L., Putseys, Y., & Vorlat, E. (1982). The scene of linguistic action and its
perspectivization by speak, talk, say, and tell. Amsterdam & Philadelphia: John Benjamins.
DOI: 10.1075/pb.iii.6
Divjak, D. (2006). Ways of intending: A corpus-based Cognitive Linguistic approach to
near-synonyms in Russian. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 19–56). Berlin & New
York: Mouton de Gruyter.
Techniques and tools 331
Divjak, D. (2010a). Structuring the lexicon: A clustered model for near-synonymy. Berlin & New
York: Mouton de Gruyter.
Divjak, D. (2010b). Corpus-based evidence for an idiosyncratic aspect-modality relation in
Russian. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven
approaches (pp. 305–331). Berlin & New York: Mouton de Gruyter.
Divjak, D., & Gries, St. Th. (2006). Ways of trying in Russian: Clustering behavioral profiles.
Corpus Linguistics and Linguistic Theory, 2, 23–60. DOI: 10.1515/CLLT.2006.002
Divjak, D., & Gries, St. Th. (2009). Corpus-based Cognitive Semantics: A contrastive study of
phrasal verbs in English and Russian. In B. Lewandowska-Tomaszczyk, & K. Dziwirek
(Eds.), Studies in Cognitive Corpus Linguistics (pp. 273–296). Frankfurt/Main: Peter Lang.
Divjak, D., & Gries, St. Th. (Eds.). (2012). Frequency effects in language learning and processing.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110274059
Drennan, R. (2009). Statistics for archaeologists: A common sense approach (2nd ed.). Heidelberg
& New York: Springer.
Dziwirek, K., & Lewandowska-Tomaszczyk, B. (2011). Complex emotions and grammatical mismatches: A contrastive corpus-based study. Berlin & New York: Mouton de Gruyter.
Edwards, D. (2000). Introduction to graphical modelling (2nd ed.). Heidelberg: Springer.
DOI: 10.1007/978-1-4612-0493-0
Everitt, B. S. (2005). An R and S-PLUS companion to multivariate analysis. London: Springer.
Everitt, B. S., & Hothorn, T. (2010). A handbook of statistical analyses using R (2nd ed.). Boca
Raton: Taylor & Francis. DOI: 10.1201/9781420079340
Everitt, B. S., & Hothorn, T. (2011). An introduction to applied multivariate analysis with R.
Munich: Springer. DOI: 10.1007/978-1-4419-9650-3
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis (5th ed.). Chichester:
John Wiley. DOI: 10.1002/9780470977811
Evert, S. (2009). Corpora and collocations. In A. Lüdeling, & M. Kytö (Eds.), Corpus Linguistics:
An international handbook (pp. 1212–1249). Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110213881.2.1212
Faraway, J. (2002). Practical regression and anova using R. Available at: cran.r-project.org/doc/
contrib/Faraway-PRA.pdf.
Faraway, J. (2006). Extending the linear model with R: Generalized linear, mixed effects and nonparametric regression models. London: Taylor & Francis.
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. London & Thousand Oaks:
Sage.
Fillmore, C., & Atkins, B. (1992). Toward a frame-based lexicon: The semantics of risk and its
neighbours. In A. Lehrer, & E. Kittay (Eds.), Frames, fields, and contrasts: New essays in
semantic and lexical organisation (pp. 75–102). London: Lawrence Erlbaum.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In J. R. Firth (Ed.), Studies in
linguistic analysis (pp. 1–32). Oxford: Basil Blackwell.
Fischer, K. (2000). From Cognitive Semantics to Lexical Pragmatics: The functional polysemy of
discourse particles. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110828641
Flores Salgado, E. (2011). The pragmatics of requests and apologies: Developmental patterns in
Mexican students. Amsterdam & Philadelphia: John Benjamins. DOI: 10.1075/pbns.212
Fontaine, J., Scherer, K., & Soriano, C. (Eds.). (2013). Components of emotional meaning: A sourcebook. Oxford: Oxford University Press. DOI: 10.1093/acprof:oso/9780199592746.001.0001
Funke, S., Mair, P., & von Eye, A. (2007). cfa: R package for the analysis of configuration frequencies. Available at: http://cran.r-project.org.
Geeraerts, D. (2010). The doctor and the semantician. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 63–78). Berlin & New York:
Mouton de Gruyter.
Geeraerts, D. (2011). Entrenchment, conventionalization, and empirical method. Presented at
the 44th Meeting of the Societas Linguistica Europaea, Logroño.
Geeraerts, D., Grondelaers, S., & Bakema, P. (1994). The structure of lexical variation: Meaning,
naming, and context. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110873061
Geeraerts, D., Grondelaers, S., & Speelman, D. (1999). Convergentie en Divergentie in de Nederlandse Woordenschat. Amsterdam: Meertens Instituut.
Geeraerts, D., Kristiansen, G., & Peirsman, Y. (Eds.). (2010). Advances in cognitive sociolinguistics. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226461
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models.
Cambridge: Cambridge University Press.
Gilquin, G. (2010). Corpus, cognition and causative constructions. Amsterdam & Philadelphia:
John Benjamins. DOI: 10.1075/scl.39
Glynn, D. (2007). Mapping meaning: Toward a usage-based methodology in Cognitive Semantics. Unpublished PhD dissertation, University of Leuven.
Glynn, D. (2009). Polysemy, syntax, and variation: A usage-based method for Cognitive Semantics. In V. Evans, & S. Pourcel (Eds.), New directions in Cognitive Linguistics (pp. 77–
106). Amsterdam & Philadelphia: John Benjamins.
Glynn, D. (2010a). Synonymy, lexical fields, and grammatical constructions: A study in usage-based Cognitive Semantics. In H.-J. Schmid, & S. Handl (Eds.), Cognitive foundations
of linguistic usage-patterns: Empirical studies (pp. 89–118). Berlin & New York: Mouton de
Gruyter.
Glynn, D. (2010b). Testing the hypothesis: Objectivity and verification in usage-based Cognitive Semantics. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 239–270). Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110226423
Glynn, D. (2014a). The conceptual profile of the lexeme home: A multifactorial diachronic analysis. In J. E. Díaz-Vera (Ed.), Metaphor and metonymy across time and cultures (pp. 265–
293). Berlin & New York: Mouton de Gruyter.
Glynn, D. (2014b). The social nature of anger: Multivariate corpus evidence for context effects
upon conceptual structure. In I. Novakova, P. Blumenthal, & D. Siepmann (Eds.), Emotions in discourse (pp. 69–82). Frankfurt/Main: Peter Lang.
Glynn, D. (Forthcoming). Mapping meaning: Corpus methods for Cognitive Semantics.
Cambridge: Cambridge University Press.
Glynn, D., & Sjölin, M. (2011). Cognitive Linguistic methods for literature: A usage-based approach to metanarrative and metalepsis. In A. Kwiatkowska (Ed.), Texts and minds: Papers
in cognitive poetics and rhetoric (pp. 85–102). Frankfurt/Main: Peter Lang.
Glynn, D., & Krawczak, K. (Forthcoming). Social cognition, Cognitive Grammar and corpora:
A multifactorial approach to epistemic modality. Cognitive Linguistics.
Glynn, D., & Fischer, K. (Eds.). (2010). Quantitative Cognitive Semantics: Corpus-driven approaches. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110226423
Glynn, D., & Sjölin, M. (Eds.). (2014). Subjectivity and epistemicity: Corpus, discourse, and literary approaches to stance. Lund: Lund University Press.
Greenacre, M. (2007) [1993]. Correspondence analysis in practice (2nd ed.). London: Chapman
& Hall.
Greenacre, M. (2010). Biplots in practice. Bilbao: Fundación BBVA.
Gries, St. Th. (1999). Particle movement: A cognitive and functional approach. Cognitive Linguistics, 10, 105–145. DOI: 10.1515/cogl.1999.005
Gries, St. Th. (2000). Towards multifactorial analyses of syntactic variation: The case of particle
placement. Doctoral dissertation, University of Hamburg.
Gries, St. Th. (2003). Multifactorial analysis in Corpus Linguistics: A study of particle placement.
London: Continuum Press.
Gries, St. Th. (2006). Corpus-based methods and Cognitive Semantics: The many senses of
to run. In St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 57–99). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110197709
Gries, St. Th. (2009a). Quantitative Corpus Linguistics with R: A practical introduction. London:
Routledge.
Gries, St. Th. (2009b). Statistics for linguistics with R: A practical introduction (1st ed.). Berlin &
New York: Mouton de Gruyter. DOI: 10.1515/9783110216042
Gries, St. Th. (2010). Behavioral profiles: A fine-grained and quantitative approach in corpus
based lexical semantics. The Mental Lexicon, 5, 323–346. DOI: 10.1075/ml.5.3.04gri
Gries, St. Th. (2013). Statistics for linguistics with R: A practical introduction (2nd ed.). Berlin &
New York: Mouton de Gruyter. DOI: 10.1515/9783110307474
Gries, St. Th., & Divjak, D. (2009). Behavioral profiles: A corpus-based approach to cognitive
semantic analysis. In V. Evans, & S. Pourcel (Eds.), New directions in Cognitive Linguistics
(pp. 57–75). Amsterdam & Philadelphia: John Benjamins.
Gries, St. Th., & Hilpert, M. (2008). The identification of stages in diachronic data: Variability-based neighbor clustering. Corpora, 3, 59–81. DOI: 10.3366/E1749503208000075
Gries, St. Th., & Stefanowitsch, A. (2004a). Extending collostructional analysis: A corpus-based
perspective on ‘alternations’. International Journal of Corpus Linguistics, 9, 97–129.
DOI: 10.1075/ijcl.9.1.06gri
Gries, St. Th., & Stefanowitsch, A. (2004b). Co-varying collexemes in the into-causative. In
M. Achard, & S. Kemmer (Eds.), Language, culture, and mind (pp. 225–236). Stanford: CSLI.
Gries, St. Th., & Divjak, D. (Eds.). (2012). Frequency effects in language representation. Berlin &
New York: Mouton de Gruyter.
Gries, St. Th., & Stefanowitsch, A. (Eds.). (2006). Corpora in Cognitive Linguistics: Corpus-based
approaches to syntax and lexis. Berlin & New York: Mouton de Gruyter.
DOI: 10.1515/9783110197709
Grondelaers, S. (2000). De distributie van niet-anaforisch er buiten de eerste zinsplaats: Socio-
lexicologische, functionele en psycholinguïstische aspecten van er’s status als presentatief
signaal. Doctoral dissertation, University of Leuven.
Grondelaers, S., Geeraerts, D., & Speelman, D. (2007). A case for a cognitive Corpus Linguistics.
In M. Gonzalez-Marquez, I. Mittleberg, S. Coulson, & M. Spivey (Eds.), Methods in Cognitive Linguistics (pp. 149–169). Amsterdam & Philadelphia: John Benjamins.
Grondelaers, S., Speelman, D., & Geeraerts, D. (2008). National variation in the use of er
“there”: Regional and diachronic constraints on cognitive explanations. In G. Kristiansen,
& R. Dirven (Eds.), Cognitive Sociolinguistics: Language variation, cultural models, social
systems (pp. 153–204). Berlin & New York: Mouton de Gruyter.
Hadfield, J. (2010). MCMC methods for multi-response generalized linear mixed models: The
MCMCglmm R package. Journal of Statistical Software, 33, 1–22.
Härdle, W., & Simar, L. (2007). Applied multivariate statistical analysis. Heidelberg & New York:
Springer.
Harrell, F. (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. Heidelberg & New York: Springer.
Harrell, F. (2012). Regression modeling strategies. Unpublished manuscript, available at: www.
biostat.mc.vanderbilt.edu/wiki/pub/Main/RmS/rms.pdf.
Hennig, C. (2013). Flexible procedures for clustering. Available at: http://cran.r-project.org/
web/packages/fpc/fpc.pdf.
Heylen, K. (2005a). A quantitative corpus study of German word order variation. In St. Kepser,
& M. Reis (Eds.), Linguistic evidence: Empirical, theoretical and computational perspectives
(pp. 241–264). Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110197549.241
Heylen, K. (2005b). Zur Abfolge (pro)nominaler Satzglieder im Deutschen: Eine korpusbasierte Analyse der relativen Abfolge von nominalem Subjekt und pronominalem Objekt im
Mittelfeld, 264. Doctoral dissertation, University of Leuven.
Heylen, K., & Ruette, T. (2013). Degrees of semantic control in measuring aggregated lexical
distances. In L. Borin, A. Saxena, & T. Rama (Eds.), Approaches to measuring linguistic
differences (pp. 353–374). Berlin & New York: Mouton de Gruyter.
Heylen, K., Tummers, J., & Geeraerts, D. (2008). Methodological issues in corpus-based Cognitive Linguistics. In G. Kristiansen, & R. Dirven (Eds.), Cognitive Sociolinguistics: Language
variation, cultural models, social systems (pp. 91–128). Berlin & New York: Mouton de
Gruyter. DOI: 10.1515/9783110199154.2.91
Hilbe, J. (2009). Logistic regression models. London: Chapman & Hall.
Hilbe, J. (2011) [2007]. Negative binomial regression (2nd ed.). Cambridge: Cambridge University Press.
Hilpert, M. (2008). Germanic future constructions: A usage-based approach to language change.
Amsterdam & Philadelphia: John Benjamins. DOI: 10.1075/cal.7
Hilpert, M. (2009). The German mit-predicative construction. Constructions and Frames, 1,
29–55. DOI: 10.1075/cf.1.1.03hil
Hilpert, M. (2012). Constructional change in English: Developments in allomorphy, word formation, and syntax. Cambridge: Cambridge University Press.
Hoffmann, Th. (2011). Preposition placement in English: A usage-based approach. Cambridge:
Cambridge University Press. DOI: 10.1017/CBO9780511933868
Hosmer, D., & Lemeshow, S. (2013) [1989, 2000]. Applied logistic regression. Hoboken: John
Wiley. DOI: 10.1002/9781118548387
Hox, J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). Hove & New York:
Routledge.
Husson, F., Josse, J., Lê, S., & Mazet, J. (2013). Multivariate exploratory data analysis and
data mining with R. Available at: http://cran.r-project.org/web/packages/FactoMineR/
FactoMineR.pdf.
Husson, F., Lê, S., & Pagès, J. (2011). Exploratory multivariate analysis by example using R.
London: Chapman & Hall.
Izenman, A. (2008). Modern multivariate statistical techniques: Regression, classification and
manifold learning. Heidelberg & New York: Springer. DOI: 10.1007/978-0-387-78189-1
Janda, L., & Solovyev, V. (2009). What constructional profiles reveal about synonymy: A case
study of the Russian words for sadness and happiness. Cognitive Linguistics, 20, 367–393.
DOI: 10.1515/COGL.2009.018
Johnson, K. (2008). Quantitative methods in linguistics. Oxford: Blackwell.
Johnson, V., & Albert, J. (1999). Ordinal data modeling. Heidelberg & New York: Springer.
Kärkkäinen, E. (2003). Epistemic stance in English conversation: A description of its interactional
functions, with a focus on I think. Amsterdam & Philadelphia: John Benjamins.
DOI: 10.1075/pbns.115
Kaufman, L., & Rousseeuw, P. (2005) [1990]. Finding groups in data: An introduction to cluster
analysis. Hoboken: John Wiley.
Keen, K. (2010). Graphics for statistics and data analysis with R. Boca Raton: CRC Press.
Klavan, J. (2012). Evidence in linguistics: Corpus-linguistic and experimental methods for
studying grammatical synonymy. Doctoral dissertation, University of Tartu.
Klavan, J., Kesküla, K., & Ojava, L. (2011). Synonymy in grammar: The Estonian adessive case
and the adposition peal ‘on’. In S. Kittilä, K. Västi, & J. Ylikoski (Eds.), Studies on case, animacy and semantic roles (pp. 1–19). Amsterdam & Philadelphia: John Benjamins.
Krawczak, K. (2014a). Shame and its near-synonyms in English: A multivariate corpus-driven
approach to social emotions. In I. Novakova, P. Blumenthal, & D. Siepmann (Eds.), Emotions in discourse (pp. 84–94). Frankfurt/Main: Peter Lang.
Krawczak, K. (2014b). Epistemic stance predicates in English: A quantitative corpus-driven
study of subjectivity. In D. Glynn, & M. Sjölin (Eds.), Subjectivity and epistemicity: Corpus,
discourse, and literary approaches to stance (pp. 355–386). Lund: Lund University Press.
Krawczak, K. (In press). Corpus evidence for the cross-cultural structure of social emotions:
Shame, embarrassment, and guilt in English and Polish. Poznań Studies in Contemporary
Linguistics.
Krawczak, K., & Glynn, D. (2011). Context and cognition: A corpus-driven approach to parenthetical uses of mental predicates. In K. Kosecki, & J. Badio (Eds.), Cognitive processes in
language (pp. 87–99). Frankfurt/Main: Peter Lang.
Krawczak, K., & Kokorniak, I. (2012). Corpus-driven quantitative approach to the construal of
Polish ‘think’. Poznań Studies in Contemporary Linguistics, 48, 439–472.
DOI: 10.1515/psicl-2012-0021
Krawczak, K., & Glynn, D. (In press). Operationalising construal: Of/about prepositional profiling for cognitive and communicative predicates. In C. M. Bretones Callejas (Ed.), Construals in language and thought: What shapes what? Amsterdam: John Benjamins.
Kroonenberg, P. (2008). Applied multiway data analysis. New York: John Wiley.
DOI: 10.1002/9780470238004
Lê, S., Josse, J., & Husson, F. (2008). FactoMineR: An R package for multivariate analysis. Journal of Statistical Software, 25, 1–18.
Le Roux, B., & Rouanet, H. (2004). Geometric data analysis: From correspondence analysis to
structured data analysis. Dordrecht: Kluwer.
Le Roux, B., & Rouanet, H. (2010). Multiple correspondence analysis. London & Thousand
Oaks: Sage.
Ledolter, J. (2013). Data mining and business analytics with R. Hoboken: John Wiley.
DOI: 10.1002/9781118596289
Lesnoff, M., & Lancelot, R. (2013). Analysis of overdispersed data. Available at: http://
cran.r-project.org/web/packages/aod/aod.pdf.
Levshina, N. (2011). A usage-based study of Dutch causative constructions. Doctoral dissertation, University of Leuven.
Levshina, N. (2012). Comparing constructicons: A usage-based analysis of the causative construction with doen in Netherlandic and Belgian Dutch. Constructions and Frames, 4,
76–101. DOI: 10.1075/cf.4.1.04lev
Levshina, N., Geeraerts, D., & Speelman, D. (2013a). Towards a 3D-grammar: Interaction of
linguistic and extralinguistic factors in the use of Dutch causative constructions. Journal of
Pragmatics, 52, 34–48. DOI: 10.1016/j.pragma.2012.12.013
Levshina, N., Geeraerts, D., & Speelman, D. (2013b). Mapping constructional spaces: A contrastive analysis of English and Dutch analytic causatives. Linguistics, 51, 825–854.
DOI: 10.1515/ling-2013-0028
Lewandowska-Tomaszczyk, B., & Dziwirek, K. (Eds.). (2009). Studies in Cognitive Corpus Linguistics. Frankfurt/Main: Peter Lang.
Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22.
Long, J. S., & Freese, J. (2006) [2001]. Regression models for categorical dependent variables using
Stata. College Station: Stata Press.
Louwerse, M., & Van Peer, W. (2009). How cognitive is cognitive poetics? The interaction between symbolic and embodied cognition. In G. Brône, & J. Vandaele (Eds.), Cognitive poetics: Goals, gains and gaps (pp. 423–444). Berlin & New York: Mouton de Gruyter.
Maechler, M. (2013). Cluster analysis extended. Available at: http://cran.r-project.org/web/
packages/cluster/cluster.pdf.
Maindonald, J. (2008). Using R for data analysis and graphics: Introduction, code and commentary. Available at: http://www.maths.anu.edu.au/˜johnm/r/usingR.pdf.
Maindonald, J., & Braun, J. (2010) [2003]. Data analysis and graphics using R (3rd ed.).
Cambridge: Cambridge University Press.
Marden, J. (2011). Multivariate statistical analysis: Old school. Department of Statistics, University of Illinois at Urbana-Champaign. Available at: istics.net/pdfs/multivariate.pdf.
Martin, A. D., Quinn, K. M., & Park, J. H. (2010). Markov chain Monte Carlo (MCMC) package. Available at: http://mcmcpack.wustl.edu/.
Menard, S. (2002). Applied logistic regression analysis (2nd ed.). London & Thousand Oaks:
Sage.
Menard, S. (2010). Logistic regression: From introductory to advanced concepts and applications.
London & Los Angeles: Sage.
Morgenstern, A., Blondel, M., Caët, S., & Boutet, D. (2011). Hearing children’s use of pointing
gestures: From pre-linguistic buds to the blossoming of communication skills. Presentation at SALC III, Copenhagen.
Murtagh, F. (2005). Correspondence analysis and data coding with R and Java. London: Chapman & Hall. DOI: 10.1201/9781420034943
Myers, D. (1994). Testing for prototypicality: The Chinese morpheme gong. Cognitive Linguistics, 5, 261–280. DOI: 10.1515/cogl.1994.5.3.261
Nenadić, O., & Greenacre, M. (2007). Correspondence analysis in R, with two- and three-dimensional graphics: The ca package. Journal of Statistical Software, 20, 1–13.
Newman, J., & Rice, S. (2004). Patterns of usage for English sit, stand, and lie: A cognitively-inspired exploration in corpus linguistics. Cognitive Linguistics, 15, 351–396.
DOI: 10.1515/cogl.2004.013
Newman, J., & Rice, S. (2006). Transitivity schemas of English eat and drink in the BNC. In
St. Th. Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based
approaches to syntax and lexis. (pp. 225–260). Berlin & New York: Mouton de Gruyter.
Nordmark, H., & Glynn, D. (2013). Anxiety between mind and society: A corpus-driven
cross-cultural study of conceptual metaphors. Explorations in English Language and Linguistics, 1, 107–130.
O’Connell, A. (2006). Logistic regression models for ordinal response variables. London & Thousand Oaks: Sage.
Oakes, M. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.
Orme, J., & Combs-Orme, T. (2009). Multiple regression with discrete dependent variables.
Oxford: Oxford University Press. DOI: 10.1093/acprof:oso/9780195329452.001.0001
Peirsman, Y., Heylen, K., & Geeraerts, D. (2010). Applying word space models to sociolinguistics: Religion names before and after 9/11. In D. Geeraerts, G. Kristiansen, & Y. Peirsman
(Eds.), Advances in Cognitive Sociolinguistics (pp. 111–139). Berlin & New York: Mouton
de Gruyter. DOI: 10.1515/9783110226461
Pęzik, P. (2009). Extraction of multiword expressions for corpus-based discourse analysis. In
B. Lewandowska-Tomaszczyk, & K. Dziwirek (Eds.), Studies in Cognitive Corpus Linguistics (pp. 249–272). Frankfurt/Main: Peter Lang.
Pichler, H. (2013). The structure of discourse-pragmatic variation. Amsterdam & Philadelphia:
John Benjamins. DOI: 10.1075/silv.13
Plevoets, K., Speelman, D., & Geeraerts, D. (2008). The distribution of T/V pronouns in Netherlandic and Belgian Dutch. In K. Schneider, & A. Baron (Eds.), Variational pragmatics:
Regional varieties in pluricentric languages (pp. 181–209). Amsterdam & Philadelphia:
John Benjamins.
Pütz, M., Robinson, J. A., & Reif, M. (Eds.). (2012). Cognitive Sociolinguistics: Social and cultural
variation in cognition and language use. (Special edition of Annual Review of Cognitive
Linguistics, 10.)
Ravid, D., & Hanauer, D. (1998). A prototype theory of rhyme: Evidence from Hebrew. Cognitive Linguistics, 9, 79–106. DOI: 10.1515/cogl.1998.9.1.79
Read, J., & Carroll, J. (2012). Annotating expressions of Appraisal in English. Language Resources and Evaluation, 46, 421–447. DOI: 10.1007/s10579-010-9135-7
Reif, M., Robinson, J. A., & Pütz, M. (Eds.). (2013). Variation in language and language use:
Linguistic, socio-cultural and cognitive perspectives. Frankfurt/Main: Peter Lang.
Rencher, A. (2002). Methods of multivariate analysis (2nd ed.). New York: John Wiley.
DOI: 10.1002/0471271357
Rice, S., Sandra, D., & Vanrespaille, M. (1999). Prepositional semantics and the fragile link between space and time. In M. Hiraga, C. Sinha, & S. Wilcox (Eds.), Cultural, typological and
psycholinguistic issues in Cognitive Linguistics (pp. 107–127). Amsterdam & Philadelphia:
John Benjamins.
Ripley, B. (2013). Support functions and datasets for Venables and Ripley’s MASS. Available at:
http://cran.r-project.org/web/packages/MASS/MASS.pdf.
Robinson, J. A. (2010a). Awesome insights into semantic variation. In D. Geeraerts,
G. Kristiansen, & Y. Peirsman (Eds.), Advances in Cognitive Sociolinguistics (pp. 85–109).
Berlin & New York: Mouton de Gruyter.
Robinson, J. A. (2010b). Semantic variation and change in present-day English. Doctoral dissertation, University of Sheffield.
Robinson, J. A. (2012). A sociolinguistic perspective on semantic change. In K. Allan, & J. A.
Robinson (Eds.), Current methods in Historical Semantics (pp. 191–231). Berlin & New
York: Mouton de Gruyter.
Roever, C., Raabe, N., Luebke, K., Ligges, U., Szepannek, G., & Zentgraf, M. (2013). Classification and visualization. Unpublished manuscript, available at: http://cran.r-project.org/
web/packages/klaR/klaR.pdf.
Rudzka-Ostyn, B. (1989). Prototypes, schemas, and cross-category correspondences: The case
of ask. In D. Geeraerts (Ed.), Prospects and problems of prototype theory (pp. 613–661).
Berlin & New York: Mouton de Gruyter.
Rudzka-Ostyn, B. (1995). Metaphor, schema, invariance: The case of verbs of answering. In
L. Goossens, P. Pauwels, B. Rudzka-Ostyn, A.-M. Simon-Vandenbergen, & J. Vanparys
(Eds.), By word of mouth: Metaphor, metonymy, and linguistic action from a cognitive perspective (pp. 205–244). Amsterdam & Philadelphia: John Benjamins.
DOI: 10.1075/pbns.33
Ruette, T., Ehret, K., & Szmrecsanyi, B. (In press). Frequency effects in lexical sociolectometry are
insubstantial. In H. Behrens, & S. Pfänder (Eds.), Again on frequency effects in language.
Berlin & New York: Mouton de Gruyter.
Ruette, T., Geeraerts, D., Peirsman, Y., & Speelman, D. (Forthcoming). Semantic weighting
mechanisms in scalable lexical sociolectometry. In B. Szmrecsanyi, & B. Waelchli (Eds.),
Aggregating dialectology and typology: Linguistic variation in text and speech, within and
across languages. Berlin & New York: Mouton de Gruyter.
Sagi, E., Kaufmann, S., & Clark, B. (2011). Tracing semantic change with latent semantic analysis. In K. Allan, & J. Robinson (Eds.), Current methods in Historical Semantics (pp. 161–
183). Berlin & New York: Mouton de Gruyter.
Sandra, D., & Rice, S. (1995). Network analyses of prepositional meaning: Mirroring whose
mind – the linguist’s or the language user’s? Cognitive Linguistics, 6, 89–130.
DOI: 10.1515/cogl.1995.6.1.89
Scheibman, J. (2002). Point of view and grammar: Structural patterns of subjectivity in American
English conversation. Amsterdam & Philadelphia: John Benjamins. DOI: 10.1075/sidag.11
Scherer, K. (2005). What are emotions? And how can they be measured? Social Science Information, 44, 693–727. DOI: 10.1177/0539018405058216
Schmid, H.-J. (1993). Cottage and co., idea, start vs. begin: Die Kategorisierung als Grundprinzip
einer differenzierten Bedeutungsbeschreibung. Tübingen: Max Niemeyer.
DOI: 10.1515/9783111355771
Schmid, H.-J. (2000). English abstract nouns as conceptual shells: From corpus to cognition.
Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110808704
Schmidtke-Bode, K. (2009). Going-to-V and gonna-V in child language: A quantitative approach to constructional development. Cognitive Linguistics, 20, 509–553.
DOI: 10.1515/COGL.2009.023
Schönbrodt, F., Collins, L., & Stemmler, M. (2013). cfa2: Configuration frequency analysis with
a design matrix. Available at: http://www.rforge.net/cfa2/.
Schulze, R. (1991). Getting round to (a)round: Towards the description and analysis of a ‘spatial’ predicate. In G. Rauh (Ed.), Approaches to prepositions (pp. 253–274). Tübingen: Gunter
Narr.
Sheather, S. (2009). A modern approach to regression with R. New York: Springer.
DOI: 10.1007/978-0-387-09608-7
Smith, R. (2011). Multilevel modeling of social problems: A causal perspective. Heidelberg:
Springer. DOI: 10.1007/978-90-481-9855-9
Speelman, D., & Geeraerts, D. (2010). Causes for causatives: The case of Dutch ‘doen’ and ‘laten’.
In T. Sanders, & E. Sweetser (Eds.), Causal categories in discourse and cognition (pp. 173–
204). Berlin & New York: Mouton de Gruyter.
Speelman, D., Tummers, J., & Geeraerts, D. (2009). Lexical patterning in a Construction Grammar: The effect of lexical co-occurrence patterns on the inflectional variation in Dutch
attributive adjectives. Constructions and Frames, 1, 87–118. DOI: 10.1075/cf.1.1.05spe
Stefanowitsch, A. (2010). Empirical Cognitive Semantics: Some thoughts. In D. Glynn, &
K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 355–
380). Berlin & New York: Mouton de Gruyter.
Stefanowitsch, A., & Gries, St. Th. (2003). Collostructions: Investigating the interaction of
words and constructions. International Journal of Corpus Linguistics, 8, 209–243.
DOI: 10.1075/ijcl.8.2.03ste
Stefanowitsch, A., & Gries, St. Th. (2005). Covarying collexemes. Corpus Linguistics and Linguistic Theory, 1, 1–43. DOI: 10.1515/cllt.2005.1.1.1
Stefanowitsch, A., & Gries, St. Th. (2008). Register and constructional meaning: A collostructional case study. In G. Kristiansen, & R. Dirven (Eds.), Cognitive Sociolinguistics: Language variation, cultural models, social systems (pp. 129–152). Berlin & New York: Mouton
de Gruyter. DOI: 10.1515/9783110199154.2.129
Stefanowitsch, A., & Gries, St. Th. (Eds.). (2006). Corpus-based approaches to metaphor and
metonymy. Berlin & New York: Mouton de Gruyter. DOI: 10.1515/9783110199895
Stevens, J. (2001). Applied multivariate statistics for the social sciences (4th ed.). Mahwah:
Lawrence Erlbaum.
Strobl, C., Hothorn, T., & Zeileis, A. (2009a). Party on! A new, conditional variable importance
measure for random forests available in the party package. The R Journal, 1, 14–17.
Strobl, C., Malley, J., & Tutz, G. (2009b). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14, 323–348. DOI: 10.1037/a0016973
Suzuki, R. (2013). Hierarchical clustering with p-values via multiscale bootstrap resampling.
Available at: http://cran.r-project.org/web/packages/pvclust/pvclust.pdf.
Suzuki, R., & Shimodaira, H. (2006). Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics, 22, 1540–1542. DOI: 10.1093/bioinformatics/btl117
Szelid, V, & Geeraerts, D. (2008). Usage-based dialectology: Emotion concepts in the Southern
Csango dialect. Annual Review of Cognitive Linguistics, 6, 23–49. DOI: 10.1075/arcl.6.03sze
Szmrecsanyi, B. (2003). Be going to versus will/shall: Does syntax matter? Journal of English
Linguistics, 31, 295–323. DOI: 10.1177/0075424203257830
Szmrecsanyi, B. (2006). Morphosyntactic persistence in spoken English: A corpus study at the
intersection of Variationist Sociolinguistics, Psycholinguistics, and Discourse Analysis. Berlin
& New York: Mouton de Gruyter. DOI: 10.1515/9783110197808
Szmrecsanyi, B. (2010). The English genitive alternation in a cognitive sociolinguistic perspective. In D. Geeraerts, G. Kristiansen, & Y. Peirsman (Eds.), Advances in Cognitive Sociolinguistics (pp. 141–166). Berlin & New York: Mouton de Gruyter.
Szmrecsanyi, B. (2013). Grammatical variation in British English dialects. Cambridge:
Cambridge University Press.
Tabachnick, B., & Fidell, L. (2007). Using multivariate statistics (5th ed.). London: Pearson.
Taboada, M., & Carretero, M. (2012). Contrastive analyses of evaluation in text: Key issues in
the design of an annotation system for attitude applicable to consumer reviews in English
and Spanish. Linguistics and the Human Sciences, 6, 275–295.
Tarling, R. (2009). Statistical modelling for social researchers: Principles and practice. London &
New York: Routledge.
340 Dylan Glynn
Therneau, T., Atkinson, E., & Foundation, M. (2013). An introduction to recursive partitioning using the RPART routines. Available at: http://cran.r-project.org/web/packages/rpart/
vignettes/longintro.pdf.
Thompson, L. (2009). S-PLUS (and R) manual to accompany Agresti’s categorical data analysis
(2002). Available at: home.comcast.net/~lthompson221/Splusdiscrete2.pdf.
Tummers, J., Heylen, K., & Geeraerts, D. (2005). Usage-based approaches in Cognitive Linguistics: A technical state of the art. Corpus Linguistics and Linguistic Theory, 1, 225–261.
DOI: 10.1515/cllt.2005.1.2.225
Valenzuela Manzanares, J., & Rojo López, A. M. (2008). What can language learners tell us
about constructions? In S. De Knop, & T. De Rycker (Eds.), Cognitive approaches to pedagogical grammar? A volume in honour of René Dirven (pp. 197–230). Berlin & New York:
Mouton de Gruyter.
Van Bogaert, J. (2010). A constructional taxonomy of I think and related expressions: Accounting for the variability of complement-taking mental predicates. English Language and Linguistics, 14, 399–428. DOI: 10.1017/S1360674310000134
Venables, W., & Ripley, B. (2002). Modern applied statistics with S (4th ed.). Heidelberg: Springer. DOI: 10.1007/978-0-387-21706-2
Verdonik, D., Rojc, M., & Stabej, M. (2007). Annotating discourse markers in spontaneous
speech corpora on an example for the Slovenian language. Language Resources and Evaluation, 41, 147–180. DOI: 10.1007/s10579-007-9035-7
von Eye, A. (2002). Configural frequency analysis: Methods, models, and applications. Mahwah:
Erlbaum.
von Eye, A., & Mair, P. (2008) A functional approach to configural frequency analysis. Austrian
Journal of Statistics, 37, 161–173.
von Eye, A, Mair, P., & Mun, E.-Y. (2010). Advances in configural frequency analysis. London:
Guilford Press.
von Eye, A, & Mun, E.-Y. (2013). Log-linear modeling: Concepts, interpretation, and application.
Hoboken: John Wiley.
Wiebe, J., Wilson, T., & Cardie, C. (2005). Annotating expressions of opinions and emotions in
language. Language Resources and Evaluation, 39, 165–210.
DOI: 10.1007/s10579-005-7880-9
Wiechmann, D. (2008). On the computation of collostruction strength: Testing measures of
association as expressions of lexical bias. Corpus Linguistics and Linguistic Theory, 4, 253–
290. DOI: 10.1515/CLLT.2008.011
Wong, M. (2009). Gei constructions in Mandarin Chinese and bei constructions in Cantonese:
A corpus-driven contrastive study. International Journal of Corpus Linguistics, 14, 60–80.
DOI: 10.1075/ijcl.14.1.04won
Wulff, S. (2003). A multifactorial corpus analysis of adjective order in English. International
Journal of Corpus Linguistics, 8, 245–82. DOI: 10.1075/ijcl.8.2.04wul
Wulff, S. (2006). Go-V vs. go-and-V in English: A case of constructional synonymy? In St. Th.
Gries, & A. Stefanowitsch (Eds.), Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis (pp. 101–126). Berlin & New York: Mouton de Gruyter.
Wulff, S. (2009). Rethinking idiomaticity: A usage-based approach. London: Continuum.
Wulff, S. (2010). Marrying cognitive-linguistic theory and corpus-based methods: On the compositionality of English V NP-idioms. In D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches (pp. 223–238). Berlin & New York: Mouton de
Gruyter.
Techniques and tools 341
Wulff, S., Stefanowitsch, A., & Gries, St. Th. (2007). Brutal Brits and persuasive Americans:
Variety-specific meaning construction in the into-causative. In G. Radden, Köpcke, K.-M.,
Berg, Th., & Siemund, P. (Eds.), Aspects of meaning construction (pp. 265–281). Amsterdam & Philadelphia: John Benjamins.
Zeschel, A. (2010). Exemplars and analogy: Semantic extension in constructional networks. In
D. Glynn, & K. Fischer (Eds.), Quantitative Cognitive Semantics: Corpus-driven approaches
(pp. 201–221). Berlin & New York: Mouton de Gruyter.
Zhao, Y. (2013). R and data mining: Examples and case studies. Unpublished manuscript.
Available at: http://www.rdatamining.com.
Zlatev, J., & Andrén, M. (2009). Stages and transitions in children’s semiotic development. In
J. Zlatev, M. Andrén, C. Lundmark, & M. Johansson Falck (Eds.), Studies in language and
cognition (pp. 380–401). Newcastle: Cambridge Scholars.
Statistics in R
First steps
Joost van de Weijer and Dylan Glynn
Lund University and University of Paris VIII
The R Project for Statistical Computing is one of the most comprehensive and
widely used software options for statistical analysis. Moreover, it is open source,
freely available and entirely cross-platform. It is for these reasons that the following chapters all employ R to demonstrate the application and interpretation
of statistics. Like the commercially available software SAS, but unlike three
other widely used suites (SPSS, Stata, and Statistica), R is principally used via the
command line. The need to work with commands rather than a graphical user
interface can be a challenge for novice users, especially when combined with the
task of learning statistics. However, commands given in a step-by-step fashion are
arguably simpler than a graphical interface, which can overwhelm the novice user
with options. This chapter is an introduction to R focusing on how to import
data and make sure those data are in the correct format for analysis. Knowledge
of each of these steps is assumed in the following chapters.
Keywords: dataframe, cross-tabulation, formatting data, importing data
1. Installing R
R is freely available from www.r-project.org. On the site, one will find the CRAN
server link where the various versions (Linux, MacOSX and Windows) can be downloaded. Follow the instructions to install R onto a personal computer.
2. Commands
Once R has been installed on the computer, you can launch it by double-clicking the
program’s icon. A window entitled “R Console” should open. In this window, you will
see some introductory text, followed by an empty line that starts with a “>” sign. This
344 Joost van de Weijer and Dylan Glynn
sign is called the prompt. Unlike many other programs, R does not have a graphical
interface with drop-down menus or dialogue windows to do an analysis. Instead, you
need to tell the program what to do by typing commands directly in the console window. Numerous examples of commands are given throughout this book. Commands
can be simple or complex. An example of a command is given below. This command
tells R to read a text file and thus load the data into its working memory. The actual
command below will not work since the location of the file is not specified, but it
serves to explain the principle behind using the command line.
mydata = read.table("dataframe.txt")
For many users new to R, working with commands is difficult at first. Commands
have a very strict structure (called syntax). If they are not entered entirely correctly,
they do not work. The source of the error can be small (a missing comma, for instance) and the resulting error message is usually not very helpful to a beginner trying to determine what the problem actually is. Nevertheless, writing and memorising
commands gets easier through practice. A helpful strategy is to keep and maintain a
personal collection of command examples that you have used before and which are
useful for the kinds of analysis you typically carry out.
Let us examine the above command line by identifying its constituent elements.
The central element is the word read.table. This is one of many functions that
are built into R, and it is used for importing external data. More examples of the
read.table function are discussed below in Section 4. A function, such
as read.table, is a way of telling the program to do something, in this case load
the data, which is in the format of a table, into R. Note that read.table is followed
by an opening parenthesis and, a bit further on at the end, a closing parenthesis. The
text within the parentheses is the name of a data file, enclosed in quotation marks.
The action that is specified by the function is often performed ‘on something’; here,
the data file called dataframe.txt, which is stored on the computer. This part of the
command is called the argument.
Then notice the part to the left of the read.table command in the example
above. The result of the action performed by the command is saved internally in R
under the name mydata. Practically, this means that the content of the external data
file is copied, and saved in R under a new name. This name is more or less arbitrarily
chosen. There are some restrictions on the choice of names, but almost anything will
do, as long as it starts with a letter and does not contain any special characters. The
name could just as well have been olddata, sunday2112, pilotstudy, or xxx. It is up
to you to choose the name; choose something short and easy to remember.
Data or results that are saved in R are usually referred to as objects. Once an object
has been created, it will be available until the program is quit or until it is manually deleted from R. An object can be a single number, a series of numbers, a complete data
file, a graph, the output of an analysis, and so on. These are all stored in the so-called
workspace. In order to see a listing of what is in the workspace, type the command
ls(). In order to delete or remove an object, type remove() with the name of the
object between the parentheses.
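As a minimal illustration of these two commands (the object name x here is arbitrary, chosen only for the example):

```r
x = c(1, 2, 3)   # create a small object containing three numbers
ls()             # lists the names of all objects in the workspace
remove(x)        # deletes the object; rm(x) does the same
ls()             # x no longer appears in the listing
```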
3. The data file
Creating a data file that will be loaded into R is an important step in the analysis and
one that often leads to confusion when first learning how to use R. The idea is to take
the data, from whatever their source, and put them in a plain text file. This step needs
extreme care since text files can include hidden formatting and other information that
will prevent the data from being loaded. Typically, data are held in a spreadsheet file
(such as those produced by MS Office’s Excel, Open Office’s Calc or iWork’s Numbers)
or a database (such as those produced by MS Office's Access, Open Office's Base,
and Apple’s FileMaker). When just beginning, the tabular layout of a spreadsheet is
arguably the easiest. This is because when data are displayed in a spreadsheet, one can
easily see whether all cells are complete, whether the columns are aligned properly
and how many cases there are.
Once one is confident that the data are all clear and there are no empty cells and
so forth, one can copy and paste the data directly into a plain text file. It is important
that the file itself does not contain any ‘formatting’. If the file contains formatting or
invisible mark-up, R may not be able to read the file. There are many text-editing programs that can contain hidden formatting (such as WordPad for Windows and TextEdit for MacOSX). However, other text editors automatically ‘strip’ any formatting,
hidden or otherwise (such as NotePad in Windows and TextWrangler in MacOSX).
If the data are contained in a database, one needs to export the data to a plain text
file. This option is also available in the spreadsheet programs. Exporting data helps to
eliminate the problem of hidden formatting and mark-up. If the data contain diacritics
or non-roman characters, this option is the safest one because it is usually possible to
specify the encoding used in the text file (Unicode is the preferred option here).
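For instance, assuming the data have been exported as a tab-delimited text file in Unicode (UTF-8), the encoding can be declared when importing with the fileEncoding argument of read.table(), which is introduced fully in Section 4 (the filename here is, of course, only illustrative):

```r
# import a UTF-8 encoded, tab-delimited text file;
# fileEncoding tells R which encoding the file uses
mydata = read.table("dataframe.txt", header=TRUE, sep="\t",
    fileEncoding="UTF-8")
```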
The data in the text file, whether taken from a database or a spreadsheet, normally
exists in one of two formats: a flat or ‘raw’ dataframe form, where each row represents
a case and each column represents a variable, and a cross-tabulation or a contingency
table where this information is summarised numerically. An example of the flat or
‘raw’ dataframe format is illustrated in Table 1.
In most spreadsheets and databases, the data can be saved as a tab- or comma-delimited text file, which means that there is a tab or a comma between each column. On
screen, the columns in the text file may not look perfectly straight, but this does not
necessarily mean there is a problem – remember, there is no formatting.
R reads any type of column delimitation. In the commands below, it is assumed the
data are tab-delimited. The flat dataframe layout of the data normally results from the
Table 1. Example of a flat dataframe

verb   tense    person   figurativity
run    past     1        literal
run    past     1        metaphor
jog    future   2        metaphor
run    past     2        literal
skip   past     2        literal
jog    future   1        literal
Table 2. Example of a cross-tabulation

        future   past   pers.1   pers.2   literal   metaphor
jog     2        0      1        1        1         1
run     0        3      2        1        2         1
skip    0        1      0        1        1         0
manual analysis of examples in a database or spreadsheet. In this format, the numbers
of occurrences are not indicated, but instead each occurrence is listed in a large ‘flat’
file. This format of the data is the preferred format for multivariate statistics. One
of the most common problems faced when starting statistical analysis in R is that
the annotation or concordance tool used to obtain the data exports the results in a
numerical table (described below). Typically, such tables have omitted much of the
important information and cannot be used in most multivariate analyses. It is important to make sure that, whatever corpus tool is being used, the data can be exported to
the flat dataframe format.
The second typical format for data is a numerical tabulation. We refer to this as a
cross-tabulation or a contingency table. In contrast to the raw dataframe, this format
is a numerical summary of the data and is typically a result of manually counting occurrences or the results of questionnaires. However, the format can also be generated
by a number of spreadsheet, database, annotation and concordance programs. An
example of a cross-tabulation is shown in Table 2.
Using the first column in Table 1 as the row names, the data in Table 2 are
equivalent to those above. It is important to note that this table would look considerably different if we were to take a different column in Table 1 and use it to create
the row names. This data format is sometimes also referred to as a frequency table,
contingency table, xtab, or pivot table.
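Incidentally, once a flat dataframe such as Table 1 has been imported into R (as described in Section 4 below), a cross-tabulation of this kind can also be generated inside R itself. A minimal sketch, assuming the data from Table 1 are stored in an object named mydata:

```r
# count the co-occurrences of two variables:
# rows are the levels of verb, columns the levels of tense
table(mydata$verb, mydata$tense)
```

The result corresponds to the first two columns of Table 2 (the counts of each verb in the future and past tense).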
4. Importing the data into R
4.1 Importing data from a flat dataframe
The two formats of data are loaded into R differently. We will begin with the raw
datafile. The most common command for importing data into R is read.table(),
which was introduced in Section 2. Here we present this command again, but now
with two additions.
mydata = read.table("dataframe.txt", header=TRUE,
sep="\t")
In this example, there are three arguments to the command rather than just one as
in the earlier example. The arguments have been separated by commas. The first argument, "dataframe.txt" is the filename. The second argument, header=TRUE,
indicates that the first line of the data file contains the variable names, as is the case
in Table 1. If this argument had been omitted, then the fields in the first line of the
data file would have been interpreted as values rather than as names, and the columns
would have been labelled automatically instead (V1, V2, etc.). The third argument,
sep="\t", indicates that the columns are separated by tabs. Had the columns been
separated by commas instead, then the third argument would have been sep=",".
The object that has been created with the read.table() command above is called a
dataframe, and it has been named mydata.
However, these commands will still not work – we have to add one further piece of
information. We need to tell R where to find the data file. There are four possibilities:
mydata = read.table(file.choose(), header=TRUE, sep="\t")
mydata = read.table("clipboard", header=TRUE, sep="\t")
mydata = read.table("/Users/linguist/data/dataframe.txt",
header=TRUE, sep="\t")
setwd("/Users/linguist/data")
mydata = read.table("dataframe.txt", header=TRUE,
sep="\t")
In the first alternative, the name of the file has been replaced by file.choose(). This
argument causes a dialogue window to open from which the data file can be chosen.
With the second alternative, it is assumed that the data have been copied to the clipboard and R takes the data from there. This is similar to the copy and paste function
common in applications such as MS Word and Excel. Depending on the operating
system and the version of R being used, the second option often results in problems.
The third option tells R to go to a specific location and open the file. That location
can be on the hard drive of a personal computer, on a server, or even on the Internet.
The location is indicated as a path, where slashes “/” represent folders. This location
or path needs to be between inverted commas. In MacOSX and Linux, the command
given above indicates that the data file is in a folder called ‘data’, which is in a folder
called ‘linguist’, which in turn is in a folder called ‘Users’, which is on the boot drive of
the system. In Windows, one needs to add the name of the disk or volume typically
indicated by a lowercase letter followed by a colon, “c:” being the default label for a
boot drive. Therefore, for Windows, the equivalent command line looks like this:
mydata = read.table("c:/Users/linguist/data/dataframe.txt",
header=TRUE, sep="\t")
The fourth alternative, finally, consists of two commands. The first command sets the
so-called working directory, which is the path to the folder in which the datafile is
stored. The second command imports the data into a dataframe object. If you do not
know what the current working directory is, you can find out by typing the command
getwd().
The object that is being created with read.table() is called mydata. It is important to check that the dataframe created from a data file has been properly imported into R. It is not uncommon for something to go astray during the process of
importing, and this does not always result in an error message in R. Therefore, next
we provide examples of commands that can help make sure that the newly imported
data are in order. The simplest way of seeing what the data look like in R is by typing
the name of the object. We called our data object mydata.
mydata
Typing the name of the object, here mydata, will display the contents of the entire
object. However, if the object is a flat dataframe, this will result in the entire dataset
being displayed, which can be extremely cumbersome. In the case of a flat dataframe,
a better option is to look at the first few rows, using the command head():
head(mydata)
This command will display the column names of the dataframe (these are the names
that were in the first row of the data file; the headers in a spreadsheet and the cell labels in a database), followed by the first six rows. There is a corresponding command,
tail(), that displays the last six rows of the dataframe.
A second command that is useful in this regard is summary(). This command
generates a numerical summary of the dataframe. This is also useful for spotting
spelling mistakes and so forth in the analysis. Remember that R treats lowercase and
uppercase letters as distinct items and does not ignore invisible characters such as a
space or a tab. It is rare that a flat dataframe is without any mistakes. After importing
data, one must typically return to the database or spreadsheet and correct many small
typographic errors.
summary(mydata)
A third command that provides information about a dataframe is the command
str(). This command offers information about the structure of a dataframe. When
applied to the data from Table 1, we receive the following output:
str(mydata)
'data.frame':   6 obs. of 4 variables:
 $ verb        : Factor w/ 3 levels "jog","run","skip": 2 2 1 2 3 1
 $ tense       : Factor w/ 2 levels "future","past": 2 2 1 2 2 1
 $ person      : int  1 1 2 2 2 1
 $ figurativity: Factor w/ 2 levels "literal","metaphor": 1 2 2 1 1 1
The first line of the output shows that mydata is a dataframe with six observations
(rows) and four variables (columns). The next four lines show the four variables and
three types of information about them. First of all, they show their names. Second,
they show what type of variables they are. Here, three of the four variables are labelled
as Factor, which is the usual type for categorical variables. Additionally, the str()
command shows how many levels these variables have, that is, the number of features
that these variables possess. The variables tense and figurativity have two levels, while
verb has three. The fourth variable, person, is labelled as int, which means that the
values of that variable are whole numbers. The numbers 1 and 2 are labels for first and
second person, respectively. This has revealed an important error. R has assumed that
this variable is numerical because the labels used in the analysis were numbers. One
solution is to add a letter to the labels in the spreadsheet or database; the other is to
change it in R. This is further explained in Section 5.3.
4.2 Importing cross-tabulations
If the data are in a cross-tabulated format, as exemplified in Table 2, loading the data is
a little different. There are two ways to load the data in this case. Firstly, one can repeat
the command used above in Section 4.1 to load the data from a flat dataframe, but add
the argument row.names = 1.
mydata = read.table(file.choose(), header=TRUE, sep="\t",
row.names=1)
The read.table command is not designed for data in this format, so R does not
always treat the data as it should (for example, the str() and summary() commands
do not work). Nevertheless, for most purposes, loading the data in this way does not
Table 3. Layout for read.ftable() command

verb
        future   past   1stPers   2ndPers   literal   metaphor
jog     2        0      1         1         1         1
run     0        3      2         1         2         1
skip    0        1      0         1         1         0
pose any problems. If one wishes to see that the cross-tabulation has been correctly
loaded, enter the object name:
mydata
This brings up the cross-tabulation for inspection. A second, more orthodox, way to
load a cross-tabulation is to use the command read.ftable(). Notice an “f ” has
been added to the function read.table.
mydata = read.ftable(file.choose(), sep="\t")
This command tells R that the data are in the cross-tabulated format. Note that the
header=TRUE argument has been removed. The read.ftable command is sensitive to the actual layout of the data, especially to how the row and column names are
placed in the text file. For this command to work, the name of the first row must be
located on a previous line, independent from the column names. One must also remember to add a ‘blank’ first column beneath the first row-name. Table 3 exemplifies
this layout.
Both commands, read.table() and read.ftable(), expect a blank line (a carriage return) at the end of the text file, beneath the table.
4.2.1 Transposing a contingency table
Sometimes it is useful to transpose a cross-tabulation. Transposition means that the
entire data object is rotated by 90 degrees. That is, the rows become columns and the
columns become rows. For a raw dataframe, this is rarely useful. For cross-tabulations, however, where the data are summarised as the levels of one variable relative to
those of another variable (or variables), inverting the table means that a statistical
technique will examine the data from a different perspective. Transposition
can be done in R using the command t(), as in the following example:
mydata2 = t(mydata)
The object mydata2 is now an inverted, or transposed, version of the original
cross-tabulation.
5. Making changes to a dataframe in R
It is not uncommon that information in a dataframe needs to be changed or that
new information needs to be added. A possible way of doing this is to make changes
directly in the data file and to re-load the data, but for various reasons it might be desirable to keep the data file unchanged, and to do the modifications to the dataframe
in R instead.
5.1 Creating objects
To begin, the principle of creating objects in R needs to be explained. Two synonymous signs are used for creating objects, "=" and "<-", and they can be used interchangeably. The item to the left of the sign is the new object, and it is assigned
whatever is to the right. In Section 4.1 above, we created an object mydata using this
sign. This object is stored in the memory of the computer and remains there until one
quits R or one removes the object with the command rm() or remove(). When changing the data in R, for example, one option is to make a duplicate of those data; if
something goes wrong, one can then revert to the copy (or simply re-load the data file). Bear in mind, however, that
having too many data objects in R uses up precious memory. To help us
understand how objects can be created, let us duplicate our data:
mydata.copy = mydata
Now there are two identical copies of the data in R, one labelled mydata, the other
labelled mydata.copy. One does not normally use this functionality to duplicate
data, but to create a new object; for example, the results of a statistical analysis, which
one wishes to store in order to plot or to run further analyses upon at a later stage.
Working with large datasets or running many analyses can result in sluggish performance from the computer since all this information is stored in the active memory.
Therefore, objects not in use should be removed. To remove an object, type rm(). The
following line removes the duplicated data.
rm(mydata.copy)
5.2 Changing a variable name
The name of one or more variables in a dataframe can be changed using the command
colnames(). In order to do this, one must also know the number of the column with
the name that needs to be changed. In the example dataframe, there are four columns.
To see the names of all four columns, type:
colnames(mydata)
To change the name of a column, in this case the first one, type:
colnames(mydata)[1] = "newname"
5.3 Changing a variable type
A common manipulation is to change the variable type. An example is when the levels
of a categorical variable have been coded as numbers, as was the case with the variable
person in Table 1. The levels of this variable have been coded as 1 and 2 for first and
second person, respectively. This variable was automatically imported as an integer
variable, in other words, a numeric variable. As a consequence, certain commands
that are appropriate for a categorical variable do not work (e.g., the levels() command) if no action is undertaken. More importantly, this variable is not numerical
but categorical; the example sentence was either in the 1st person or the 2nd person.
Letting R assume it is a numerical variable will lead to errors in any subsequent statistical analysis. The solution is to change the variable type for person from integer
to factor using the factor() command:
mydata$person.fact = factor(mydata$person)
Note that we gave the transformed variable a new name (person.fact), which automatically created a new variable in the dataframe. While it is possible to keep the
original name after transformation, there is a risk of inadvertently running
the same transformation once again, and thus doubling the effect of the transformation. This risk is avoided by using a new name for the transformed variable. If one
wishes to replace the numerical variable with the same variable but understood by R
as a factor, use the command:
mydata$person = factor(mydata$person)
5.4 Changing values in a dataframe
There are many situations where values for a given variable need to be changed. These
may be values that were not entered correctly in the first place, values that need to be
rescaled, and so on. Here we illustrate how to change values in a dataframe by first
adding a new variable to mydata that divides the verbs into two groups, and then
changing some of the values of that new variable. Suppose, for instance, that we would
like to add a variable to mydata that indicates verb type, and that the verbs jog and
run belong to type "A" while skip belongs to type "B". Here we do that in two steps.
In the first step we create the new variable, call it verbtype, and assign it the value
"A" for all cases in the dataset:
mydata$verbtype="A"
In the second step, we change the values of verbtype into "B", but do that only for
the cases where the variable verb equals skip. This can be done using square brackets
([]) notation. In R, square brackets are often used to identify the position of a value,
as we saw above with the colnames() command. To find a value in a dataframe, we
need to specify the row(s) that contain a specific value, and the column(s). In other
words, we need to include two things within the square brackets: the row(s) and the
column(s). Within square brackets, the row is always specified first, followed by a
comma, followed by the specification of the column. Schematically:
[which row(s)? , which column(s)?]
In our example, the changes need to apply only to the rows where the variable verb
equals skip, and only in the column that contains the variable verbtype:
mydata[mydata$verb=="skip","verbtype"]
If we give this as a command to R, we get the value of the variable verbtype for the
cases where the variable verb equals skip. Now this value is still "A", but it needs to
be changed to "B". Setting the new value is easy:
mydata[mydata$verb=="skip","verbtype"]="B"
We can check the result by typing the name of the dataframe:
mydata
which shows:
  verb  tense person figurativity verbtype
1  run   past      1      literal        A
2  run   past      1     metaphor        A
3  jog future      2     metaphor        A
4  run   past      2      literal        A
5 skip   past      2      literal        B
6  jog future      1      literal        A
Learning to use the square brackets notation is very helpful when working with R. It is
a flexible way of manipulating dataframes, or getting information out of a dataframe.
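To illustrate this flexibility, here are a few further examples of the square brackets notation, applied to the same mydata object. Leaving the row or column specification empty means 'all rows' or 'all columns' respectively:

```r
mydata[1, ]                       # the first row, all columns
mydata[, "verb"]                  # all rows, the column verb
mydata[mydata$tense == "past", ]  # all rows where tense equals past
```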
5.5 Creating a subset of the dataframe
It happens frequently that certain cases need to be excluded from an analysis. These
may be cases containing missing values, outliers, or cases with incorrect values. In
R, there are several ways of creating a subset of a dataframe. Here we present one of
them, namely using the subset() command. Suppose, by way of illustration, that we
would like to exclude all cases for the verb skip from mydata. We could achieve
this by telling R:
subset(mydata,mydata$verb!="skip")
where the operator != means is not equal to. Note that the quotes around the word
"skip" are obligatory and that they must be straight quotes (not 'curly'). It sometimes happens, when code is copied and pasted into R from a text source, that the
quotes are curly and not straight. In this case, R will not recognise the command.
If, on the other hand, we would like to restrict the dataset to the verb skip only,
then we would write:
subset(mydata,mydata$verb=="skip")
In this second example of the subset() command, pay attention to the double ==
sign. This is also obligatory, or the command will not work.
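As an illustrative sketch using the running example, the result of subset() can be saved under a new name (the name mydata.noskip is invented here), and several conditions can be combined with the logical operators & (and) and | (or):

```r
# Keep the original dataframe intact and store the subset separately
mydata.noskip = subset(mydata, mydata$verb != "skip")

# Combine conditions: all past-tense occurrences of the verb run
subset(mydata, mydata$verb == "run" & mydata$tense == "past")
```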
5.6
Merging dataframes
Two dataframes can be merged in two ways, either the second dataframe is appended
below the first one, or it is added next to it. In the first merge, we add new cases to the
first dataframe; in the second we add new variables. Merging two dataframes is easy,
as long as the two dataframes match. If we append one dataframe to the other one,
both dataframes need to have the same number of columns, and the column names
need to be identical. The command for appending one dataframe to another one is
rbind() (‘row bind’):
rbind(dataframe1,dataframe2)
If we want to add the second dataframe to the right of the first one, and the two dataframes have the same number of rows, we can use the cbind() command (‘column
bind’):
cbind(dataframe1,dataframe2)
If two dataframes do not match completely, the commands rbind() and cbind()
produce errors. An alternative, for partly matched dataframes, is the command
merge(). This command allows one to specify which rows or columns in the first
dataframe are to be matched with those in the second.
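By way of a hypothetical sketch (the dataframe verbinfo and its column freqclass are invented for illustration), merge() matches rows on a shared column:

```r
# A second dataframe that adds one new variable per verb
verbinfo = data.frame(verb = c("run", "jog", "skip"),
                      freqclass = c("high", "mid", "low"))

# merge() matches the rows of the two dataframes on the shared
# column verb; rows are matched by value, not by position
merge(mydata, verbinfo, by = "verb")
```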
Statistics in R 355
5.7
Reordering the levels of a factor
In a categorical variable, there are different features or types that make up the variable. For example, in the variable verb, we have three types: run, skip, and jog. These
are referred to as levels and may be shown using the command levels(). To display
the levels of the variable verb, for instance, type:
levels(mydata$verb)
The output shows the three verbs, which, by default, have been ordered alphabetically,
i.e., jog, run and skip. This same order would also be applied when the verbs were to
be displayed in a graph or a table, or when choosing one of the verbs as a reference to
which the others are being compared. In these cases, the default order may need to
be changed into something else. One way of establishing this is by adding an option
to the factor() command. Suppose, for instance, that the three verbs were to be
reordered as skip, jog and run instead; we could then write:
mydata$verb=factor(mydata$verb, levels =c("skip","jog",
"run"))
The effect of this command is that the verbs will now be displayed in the specified
order.
Re-ordering the levels is especially important in regression analysis where the
first level serves as the reference level to the others (see Speelman, this volume 487–
533). If only the reference level needs to be set without affecting the order of the levels,
there is a second and simpler alternative, namely the relevel() command. If, for
instance, we would like to keep the original alphabetical order, but make the verb run
the reference level of the three, then we could write:
mydata$verb = relevel(mydata$verb,ref="run")
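It is good practice to check the result of either command with levels(); a minimal sketch:

```r
# After factor() with the levels option, the levels appear in the
# specified order
mydata$verb = factor(mydata$verb, levels = c("skip", "jog", "run"))
levels(mydata$verb)   # "skip" "jog" "run"
```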
5.8
Counting frequencies
The most basic command for obtaining frequency information from a dataframe was
mentioned above. The command summary() is the first port of call for examining
frequencies in a dataframe:
summary(mydata)
This command tells R to display a summary of the frequencies for each level of each
variable in the dataframe. However, it can only list the frequencies of up to 7 levels.
This means that if a variable has many different tags, features or levels, we need to
look specifically at the variable itself. For this, we use the table() command. When
applied to the variable verb in mydata, this will be:
table(mydata$verb)
This command tells R to list all the levels in the variable verb and their frequencies.
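The output of table() can also be post-processed with further base R commands; for instance (a sketch):

```r
# Relative frequencies (proportions) instead of raw counts
prop.table(table(mydata$verb))

# The same frequencies, sorted from most to least frequent
sort(table(mydata$verb), decreasing = TRUE)
```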
6. Converting data formats
We have seen that there are two data formats, the dataframe and the cross-tabulation.
The data file shown in Table 1 is an example of a flat or raw dataframe. Every row
represents a single case. Often this is the format of a data file that is imported into R.
It shows the data as they were collected by the researcher. However, we often want
to display the frequency of occurrences in a flat dataframe, in which case we need to
convert it to a contingency table or cross-tabulation. Moreover, this numerical format
is needed for certain common statistical techniques, such as chi-square tests, cluster
analysis, or correspondence analysis.
6.1
Converting a raw dataframe into a two-way contingency table
The table() command can be applied to two columns in a flat dataframe, resulting
in a two-dimensional contingency table. The contingency table expresses the relation
between the two variables, that is, how often the levels of the first variable co-occur
with the levels of the second variable. If we were merely looking at the variables verb
and tense (the first two columns in Table 1), then the following would create a contingency table:
xtab = table(mydata$verb,mydata$tense)
Typing the name of the table, xtab, then displays the following output:
xtab
     future past
jog       2    0
run       0    3
skip      0    1
However, this kind of command is of limited use with complex dataframes since it
produces only two-way contingency tables.
In order to summarise a more complex dataframe, we need to concatenate or
‘stack’ the columns. In other words, we make a string of two-way contingency tables,
all relative to the same variable and put them together in a row.
Table 4. Data from Table 1 converted to a stacked contingency table

verb   future  past  1stPrs  2ndPrs  literal  metaphor
jog         2     0       1       1        1         1
run         0     3       2       1        2         1
skip        0     1       0       1        1         0
6.2 Creating a stacked contingency table
The data in mydata are organised in more than two columns. That means that, in
principle, multiple two-way contingency tables can be constructed, displaying the relations between the variable in the first column and each of the subsequent variables.
These contingency tables can be collected in one larger table with the first variable
in the first column and the other variables in the columns to the right. In this case,
we say that the contingency tables have been stacked next to each other. This is how
Table 2 was created. For explanatory purposes, it is represented here as a stacked contingency table (Table 4).
In R, one can create a stacked contingency table from a flat dataframe with the
cbind() command that we also used earlier for combining dataframes. For instance,
cbind(table(mydata$verb,mydata$tense),
table(mydata$verb,mydata$person))
combines the two contingency tables of verb by tense and verb by person. The result
looks like this:
     future past 1 2
jog       2    0 1 1
run       0    3 2 1
skip      0    1 0 1
In this way, all combinations of columns can be stacked next to each other, yielding
the results displayed in Table 2. In the appendix, we provide a script that does this
automatically for an entire dataframe, and which also labels the columns with appropriate names. The script is called tablebind. If one wants to use this script, copy it
into a text file on the computer and save it in the working directory as tablebind.R.
In R, give the command:
source("tablebind.R")
This command makes tablebind() available as a new R-command, which takes the
flat dataframe as its argument. In order to run it, type:
tablebind(mydata)
This command yields the output displayed in Table 2. Note that the first column in
the dataframe is the one upon which the cross-tabulation will be calculated. If one
wants to generate a contingency table that cross-tabulates on figurativity, for example, then the
order of the columns in the original dataframe will have to be changed accordingly.
Section 8 explains in more detail how to run scripts like this and offers other options
for storing them.
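Such a reordering of columns can be done with the square brackets notation introduced above; a hypothetical sketch (assuming tablebind() has been sourced as just described):

```r
# Move figurativity into the first column so that tablebind()
# cross-tabulates on it instead of on verb
mydata2 = mydata[, c("figurativity", "verb", "tense",
                     "person", "verbtype")]
tablebind(mydata2)
```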
7. Making charts
In R, it is possible to make almost any type of chart, and adjust the layout to the smallest detail. The variety of possibilities is too large to describe in this introduction. Here
we provide just a few examples.
The generic command for making a chart in R is plot(). When this command
is applied to two continuous variables, we get a scatter plot. When applied to a categorical variable, we get a bar chart. If we apply this command to the variable verb
in mydata, for instance, we get a chart that shows the frequencies of the verbs in the
dataframe.
plot(mydata$verb)
The chart opens in a new window, called the Quartz-window in MacOSX (Figure 1).
Sometimes, plots can be hidden behind the console. The window menu will bring the
plot to the front.
There are several ways in which the layout of a chart can be changed. The first is
to set the overall graphical parameters of the chart with the command par(). The
command
par()
gives an overview of all the parameters and their settings. The background colour of a
graph, for instance, is abbreviated as bg. To see the current setting for this parameter,
type:
par("bg")
which gives the default colour of the graph background. Each of these parameters
can be modified. Changing the background colour to, say, light grey, is established by:
par(bg="lightgrey")
Modifications in the graphical parameters stay in effect for as long as the Quartz window is open. Once this window has been closed, the old settings are restored in a new
window.
[Figure 1 shows a bar chart of the frequencies of the three verbs jog, run, and skip, with the Y-axis running from 0.0 to 3.0]
Figure 1. Barplot using plot()
The second way of adjusting the chart layout is to specify additional options in
the plot command. These options make it possible, among other things, to change the
scales of the axes, change the plotting symbols, add labels, and add a title. The following example expands the limits of the Y-axis scale in the bar chart from Figure 1 from
0 to 5, adds the label ‘Verb’ to the X-axis, adds the label ‘Frequency’ to the Y-axis, and
adds the title ‘Barchart Example’:
plot(mydata$verb,
ylim=c(0,5),
xlab="Verb",
ylab="Frequency",
main="Barchart Example")
Note that the command has been written out on several lines. Finally, extra elements
(legend, arrows, stars, text, lines) can be added separately.
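A brief sketch of adding such extra elements after the plot command (the coordinates and label text are chosen arbitrarily for illustration):

```r
plot(mydata$verb, ylim = c(0, 5))
abline(h = 2, lty = 2)              # a dashed horizontal line at y = 2
text(1, 4.5, "verb frequencies")    # a text label at x = 1, y = 4.5
```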
Once the graph is finished, it can be saved as a file. In MacOSX, one simply saves
the file by choosing the menu “File” and then the option “Save”. The file is saved as a
.pdf by default, giving optimal quality. Many applications, such as Preview or Adobe
Acrobat have export options, which allow the file to then be saved under any format
and at resolutions defined by the user. Under Windows, there is a list of options such
as .pdf, .png, .jpg etc., under which the image file can be saved. Alternatively, the
chart can be written to a file directly from the Console window. The following three
commands show an example of saving the bar chart as a .pdf-file:
pdf("barchart.pdf")
plot(mydata$verb)
dev.off()
The same procedure can be followed for .jpg, .png and other file types. Vector formats
such as .pdf offer the best quality, but certain versions of MS Word do not accept .pdf
images or automatically convert them to poor-quality bitmap images. In that case,
saving the image as a high quality .png file is the best option. For .png, a resolution of
at least 600 dpi is recommended, but more is better.
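Writing a .png file from the console follows the same pattern as the .pdf example above; a sketch with an assumed image size of 5 by 5 inches at 600 dpi:

```r
# Save the bar chart as a .png file at 600 dpi; width and height
# are given in inches via the units argument
png("barchart.png", width = 5, height = 5, units = "in", res = 600)
plot(mydata$verb)
dev.off()
```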
R possesses a rich inventory of possibilities for visualising results. For any command, if one types a ? before the command and enters this, a screen will appear with
a help file. This will give the arguments that the command accepts, as well as an example of its use and hyperlinks to other related commands. Experiment by typing the
command:
?plot
8. Working with scripts
More often than not, an analysis consists of a series of commands. The first command
could read in the data, for example, the second command calculates some descriptive
statistics, and the third command makes a plot of the results. In that case, it is a good
idea to save them in a so-called script file. In order to make a script, select File > New
Document from the R menu. A new window will open in which one can add commands, exactly as they are typed in the console window. An example is shown below:
mydata=read.table("datafile.txt")
str(mydata)
summary(mydata)
Commands can be run directly from the script window. Click once on the command
that one wishes to run, then type command-Enter (Macintosh) or Control-r (Windows). The command and its output are then automatically displayed in the Console
window.
There are several advantages of working with scripts. One advantage is that one
builds up a collection of scripts with commands that are often used. A second advantage is that scripts can contain personal comments. Comments can be anything from
the date that the script was created, to the purpose of the script or the explanation of a
complex command. The following is an example of a very short script, which includes
the example command that we saw above, with a comment added giving information
on the date the script was created:
# Example R-script, created December 2011
mydata=read.table("datafile.txt")
The first line, starting with the hash sign (“#”), is the comment. The second line is the
command. Comments have no effect if run; they are there only for additional information or explanation and are ignored by R.
A third advantage of working with scripts is that script files allow one to write
very long commands on multiple lines. Commands to make plots, typically, become
very long. Writing them on separate lines (as was shown above in Section 7) usually
makes the structure of the command more transparent, and makes it easier, if there
were something wrong with the command, to spot the error.
Finally, R offers the possibility of syntax colouring, which also can help with seeing the structure in R commands. Syntax colouring means that different parts of the
script are shown in different colours. A specific colour is reserved for a comment,
another colour for a command keyword, a third colour for numbers, and so on. For
these reasons, script files can be a great help in learning to work with R. Script files are
normally saved on the computer with the extension .R.
9. Extending functionality with packages
When R is installed for the first time it contains many basic functions for doing
analyses and making plots. These base functions can be complemented with other
functions that are geared towards more special-purpose analyses, such as the ones
described in this book. Many so-called packages exist that contain functions used
within a specific discipline, or for producing specific types of plots, or doing specific
types of analysis. An incredibly rich collection of packages for performing all kinds of
statistical analyses exists, and this collection is constantly being improved upon and
added to. The packages are small and download quickly.
To use a package, it must first be installed, that is, it must be downloaded from
the R-website and saved to the computer. One way of installing a package is with
the install.packages() command. The following example shows how to install
the package called ca, for doing correspondence analysis described in Glynn (this
volume):
install.packages("ca")
This will firstly produce a prompt asking the user to choose a server (CRAN mirror)
and a list of different options will appear. Once a server has been selected, the package
can be installed.
Another way to install packages is to use the menu options. Under MacOSX,
it will be found under the menu “Packages & Data > Package Installer” and under
Windows “Packages > Install packages(s)”. Under MacOSX, choose the server and
then simply type the first letters of the name of the package and it should appear. Then
select “install”. It is recommended that the option “install dependencies” be checked.
Under Windows, the procedure is the same, except that one must scroll through a
long list to locate the desired package; “install dependencies”, however, is set by default
and need not be worried about.
An extremely common point of confusion for new users is the difference between
installing and loading packages / libraries. The packages contain the libraries that R
needs to perform its tasks. One only needs to install a package once on a computer,
but the library associated with it must be loaded each time R is started. Loading a
package can be done with the library() command:
library(ca)
Once loaded, a package is available until R is quit. Remember that it must be re-loaded with the library() command each time. Error messages that result from unloaded packages can be one of the most frustrating experiences for novice users. A
good way of preventing this from happening is to add the library() command to
the R-script.
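A common pattern at the top of a script combines the two steps (a sketch; require() attempts to load the package and returns FALSE if it is not installed):

```r
# Install the package only if it is missing, then load it
if (!require("ca")) install.packages("ca")
library(ca)
```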
10. Going further
This short introduction is designed to help new users get started with R. There are
many things to discover in R, and it will probably take some time to get a good grasp
of the kinds of commands, packages and graphs that are useful for the kind of analysis
for which one wants to use R. To conclude, we offer three pieces of advice that we
found helpful on the way to becoming experienced R users.
First, as mentioned above in Section 7, there is a built-in help function that shows
information about the syntax of a command, some examples, and often links to other
related commands. A call for help on a command is obtained with help(), placing the
command name in parentheses, or by prefixing the command name with ?.
Second, the Internet is a good place to search for help. There are numerous sites
with blogs, tutorials, and user fora. Here one can find R code, pose questions, and see
graphs. For the less experienced users, we can recommend the site Quick-R, which
contains many clear examples. Furthermore, the R-website also offers a manual.
Finally, the number of books on statistics using R within many different disciplines grows steadily. Books for linguistic analysis are Baayen (2008), Dalgaard
(2008), Johnson (2008) and Gries (2009, 2012). Focusing on graphics, Keen (2010)
and Mittal (2011) are accessible to beginners and are relatively complete. Other introductory books include Crawley (2007), Everitt & Hothorn (2009), Maindonald &
Braun (2010) and Adler (2010).
References
Adler, J. (2010). R in a nutshell: A desktop quick reference. Sebastopol: O’Reilly Media.
Baayen, H. (2008). Analyzing linguistic data: A practical introduction to statistics using R.
Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511801686
Crawley, M. (2007). The R book. Chichester: John Wiley. DOI: 10.1002/9780470515075
Dalgaard, P. (2008). Introductory statistics with R (2nd ed.). Dordrecht: Springer.
DOI: 10.1007/978-0-387-79054-1
Everitt, B. S., & Hothorn, T. (2009). A handbook of statistical analyses using R (2nd ed.). Boca
Raton: Taylor & Francis.
Gries, St. Th. (2009). Quantitative corpus linguistics with R: A practical introduction. London:
Routledge. DOI: 10.1515/9783110216042
Gries, St. Th. (2012). Statistics for linguistics with R: A practical introduction. Berlin & New York:
Mouton de Gruyter.
Keen, K. (2010). Graphics for statistics and data analysis with R. Boca Raton: CRC Press.
Johnson, K. (2008). Quantitative methods in linguistics. Oxford: Blackwell.
Maindonald, J., & Braun, J. (2010). Data analysis and graphics using R (3rd ed.). Cambridge:
Cambridge University Press.
Mittal, H. V. (2011). R graphs cookbook. Birmingham: Packt.
Appendix: The tablebind-script
This script can be copied or carefully entered into R and saved as a function, as explained in Section 8. The script and the data file can be downloaded from http://dx.doi.org/10.1075/hcp.43.13wei.additional.
tablebind=function(df.flat)
{
if(ncol(df.flat)<3)
return("Dataframe needs at least three columns.")
for(i in 1:ncol(df.flat))
df.flat[,i]=factor(df.flat[,i])
df.stacked.ncol=0
for(i in 2:ncol(df.flat))
df.stacked.ncol=df.stacked.ncol+length(levels(df.flat[,i]))
k=1
df.stacked.colnames=rep("X",df.stacked.ncol)
for(i in 2:ncol(df.flat))
for(j in 1:length(levels(df.flat[,i])))
{df.stacked.colnames[k]=
paste(colnames(df.flat[i]),
levels(df.flat[,i])[j],sep=".")
k=k+1
}
df.stacked=table(df.flat[,1],df.flat[,2],
exclude=c(NA),useNA="no")
for(i in 3:ncol(df.flat))
df.stacked=cbind(df.stacked,table(df.flat[,1],df.flat[,i],
exclude=c(NA),useNA="no"))
df.stacked=data.frame(df.stacked)
colnames(df.stacked)=df.stacked.colnames
return(df.stacked)
}
Frequency tables
Tests, effect sizes, and explorations
Stefan Th. Gries
University of California, Santa Barbara
This chapter provides an overview of statistical tests to analyze frequency data.
Specifically, it discusses the use, logic, and interpretation of chi-squared tests
of two-dimensional frequency tables as well as the computation of effect sizes
for such tables, followed by several extensions and follow-up procedures that
are not usually discussed (such as the analysis of sub-tables of tables and the
Marascuilo procedure). In addition, there is a brief discussion of how Poisson/
count regression can be used to analyze frequency data with more than two
dimensions.
Keywords: chi-squared test, frequency data, Marascuilo procedure, Poisson
regression
1. Introduction
1.1
Discrete vs. continuous data
Usage-based linguistics is essentially a distributional science in the sense that linguists
explore the distribution of linguistic elements at every level of linguistic analysis: phonology, morphology, syntax, semantics, pragmatics and text linguistics etc. Corpus
linguistics is no exception to this. More specifically, corpus linguists explore:
– the frequencies of occurrence of linguistic elements in corpora, for example, frequency lists;
– the dispersion of linguistic elements in corpora as in, for example, measures of
dispersion;
– the frequencies of co-occurrence of linguistic elements in corpora as in, for example, collocation, collocational frameworks, n-grams, colligations/collostructions
etc.
366 Stefan Th. Gries
Very often, the data we study as linguists are discrete in nature. That is, the linguistic
elements we study come in different categories and, trivially, if two elements are labeled the same, they belong to the same category, and if they are labeled differently, they
belong to different categories. In statistical approaches, this kind of scenario is usually
described with the terminology of variables (or factors) and their levels. For example,
when direct objects are studied, it may be interesting to describe them in terms of
which part of speech the direct object’s head is. In other terminology, each direct object studied is then described with regard to the variable Part of Speech by assigning
a particular variable level to it; depending on what the direct objects look like, the
following levels are conceivable: Part of Speech: lexical noun, Part of Speech:
pronoun, Part of Speech: semipronoun (such as matters or things), etc. Trivially, if
direct objects are categorized this way, then a direct object whose head is categorized
as Part of Speech: pronoun is, for the purposes of this analysis, identical to another
one whose head is categorized as Part of Speech: pronoun and different from one
whose head is categorized as Part of Speech: lexical noun.
On other occasions, the observed variables are actually continuous rather than discrete, but for the purposes of an analysis they may be grouped into two or more categories, such as:
– when the lengths of subjects falling between 1 and 30 syllables, for example, are
classified as falling into the categories Length: short (such as shorter than the
median length) and Length: long (i.e. longer than the median length);
– when the frequencies of closed-class words falling between 1 and 100,000, for
example, are classified into the categories Frequency: low (such as between
1 and 50 occurrences), Frequency: intermediate (such as between 51 and
1,000 occurrences), and Frequency: high (such as between 1,001 and 100,000
occurrences).
For the kinds of statistical methods to be discussed below, it does not really matter
whether the variables involved in a particular study are genuinely discrete or categorical in nature or have just been converted to discrete variables: the methods as well as
their results and potential implications are the same.
The analysis of multidimensional frequency tables – i.e., tables reporting observed co-occurrence frequencies of three or more features with regard to which elements have been classified – has a lot to offer to linguists in general and cognitive
linguists in particular. For example, frequency effects play an important role in most
flavors of Cognitive Linguistics and/or Construction Grammar:
– absolute token frequencies and conditional token probabilities are correlated
with (degree of) cognitive entrenchment and unit status (cf. Schmid 2000), age
and speed of the acquisition of constructions (cf. Brooks and Tomasello 1999 or
Goldberg, Casenhiser, and Sethuraman 2004), and phonological reduction (cf.
Bybee and Scheibman 1999);
– type frequencies are correlated with degrees of productivity and grammaticalization (cf. Bybee 1985);
– conditional probabilities as they can be derived from corpus frequencies are correlated with processing/parsing strategies (cf. Saffran, Aislin, and Newport 1996
or Saffran and Wilson 2003); etc.
But multidimensional frequency tables of course also arise in studies whose target
is not frequency effects per se, but where just the interrelations of several variables are
studied on the basis of corpus or experimental data.
The general idea in the analysis of two- or more-dimensional frequency tables is
to determine whether the frequencies observed in cells of the table are distributed in
a way that is significantly different from a random distribution and, if that is the case,
what is (most) responsible for the significant difference and what is not. The entities
that are included in an analysis because they are potentially responsible for significant
differences will be called predictors, and I use predictors here to refer to three different
things:
– levels of variables;
– individual variables;
– interactions of n variables.
The first two of these three different kinds of predictors are probably obvious from
what has been said so far, but the third may not be. An interaction of n variables is
defined as a non-additive, or unpredictable, joint effect of the n variables (on a dependent variable). Consider a case where the referents of subject and object NPs have
been coded with regard to a variable Clause (whether they are subjects or objects in
a main or a subordinate clause) and their Givenness in discourse (on a scale from 0
to 10). Let us assume:
– referents of subjects are more given than referents of objects;
– referents of subjects and objects in main clauses are more given than referents of
subjects and objects in subordinate clauses.
From this, one would expect the referents of subject NPs in main clauses to be most
given, because they combine the two features – ‘given’ and ‘being in a main clause’ – that
co-occur with high values of Givenness. If they turn out to be least given, however,
then this would be a two-way interaction between the variables GramRelation (with
the levels subject and object) and Clause (with the levels main and subordinate).
Before we turn to the actual analysis of frequencies of discrete data, I first need to
make a few general remarks that apply to virtually all evaluations of frequency tables,
in fact to most statistical methods in general.
1.2
Methodological preliminaries
The most central methodological issue that needs to be discussed briefly is how one of
the most fundamental principles of scientific reasoning bears upon statistical analyses
of frequency data. This most fundamental principle is entia non multiplicanda praeter
necessitatem, which is known as Occam’s razor, or sometimes also as the principle of
parsimony. It prohibits the inclusion of unnecessary explanatory notions into an analysis or, from the reverse perspective, it requires the analyst to show for each explanatory notion he wants to include that it is in fact necessary to include it. For statistical
analyses of frequency data, this means that a researcher (i) tries to build a model of the
observed data, i.e. a quantitative representation of the potentially relevant relations in
the data that contains all predictors under consideration, and then (ii) must successively determine whether the predictors currently included in the model may in fact
be included in the model or whether they have to be eliminated from consideration
because their influence is too small to be statistically reliable/significant or conceptually noteworthy/substantial. This means that, especially in the area of multifactorial
studies, the first statistical analysis is hardly ever the last because once a first statistical
model has been built, Occam’s razor dictates it be tested for parsimony; in the domain
of regression modeling, this process of slimming down predictors is often referred to
as model selection.
This principle is usually recognized in multifactorial studies (to varying degrees,
though), where many researchers now routinely go through a model selection process in which in a stepwise fashion predictors are excluded from consideration until
a model consists only of predictors that are significant themselves or that figure in
higher-order interactions that are significant. However, for both mono- and multifactorial applications, this principle is not as often recognized for predictors that are
neither interactions of variables nor individual variables, but variable levels. The above definition
of predictors requires that the inclusion of different variable levels should ideally be
scrutinized for whether variable levels must be kept apart just as the inclusions of separate variables and interactions should be. Note, though, that a conflation of variable
levels must make sense conceptually: it is not useful to create a new combination of
variable levels that looks nicer statistically but is conceptually senseless (cf. below for
an example) – modeling is usually only a means to an end, not an end in itself. The
principle of parsimony is therefore a very important methodological guideline and
will surface in different forms below.
2. How to analyze frequency tables
This section constitutes the main part of this chapter. In Section 2.1, I discuss the
simpler case of two-dimensional tables, whereas in Section 2.2, I explain the more
complex case of multidimensional tables. The discussion will be based on the open
source software R, which can be downloaded from <http://cran.at.r-project.org/>.
2.1
Two-dimensional tables
2.1.1 2-by-2 tables
The simplest case of a two-dimensional table is the 2-by-2 table, in which one nominal
or categorical variable is cross-tabulated with another nominal or categorical variable.
As an example, let us consider the question of whether the disfluencies uh and uhm
are differently frequent directly before nouns and verbs. That is, one variable is Disfluency, with the levels uh and uhm, and the other variable is Part of speech of the
following word, with the levels noun vs. verb.
The analysis of such tables is very straightforward. First, the data must be entered
into a matrix in R. To that end, the function matrix can be used, which requires
(i) the observed frequencies in a column-wise fashion (c(30, 50, 70, 20)) and
(ii) the number of columns the table has (ncol=2):
x<-matrix(c(30, 50, 70, 20), ncol=2)
While this creates the matrix of the frequencies, it is useful to add row and column
labels. The function list takes two vectors, first the row names, and second, the
column names:
attr(x, "dimnames")<-list(Disfluency=c("uh", "uhm"),
POS=c("Noun", "Verb"))
To see whether the data entry has been successful, the data plus the row and column
totals can then be inspected using the function addmargins:
addmargins(x)
          POS
Disfluency Noun Verb Sum
       uh    30   70 100
       uhm   50   20  70
       Sum   80   90 170
Table 1. Fictitious data on the correlation of Disfluency and Part of speech 1

         Noun   Verb   Totals
uh         30     70      100
uhm        50     20       70
Totals     80     90      170
370 Stefan Th. Gries
Such matrices are typically evaluated using a so-called chi-squared test (exceptions to
this will be discussed below). This test requires that all observations are independent
of each other and that at least 80% of the expected frequencies are greater than 5. If this is
the case, one can use the function chisq.test, which in the standard form to be
discussed here requires the matrix to be tested (x) and an argument to be explained
below (correct=FALSE); the result of the test should be saved into a new data structure, e.g., x.test:
x.test<-chisq.test(x, correct=FALSE)
Nothing is returned, but the data structure x.test now contains all the results. Three
things must now be done. First, one should inspect the frequencies that would have
been expected by chance – i.e. when there is no correlation between the kind of disfluency
and the part of speech of the following word – by calling the part of the test results that
contain the expected frequencies:
x.test$exp
          POS
Disfluency     Noun     Verb
       uh  47.05882 52.94118
       uhm 32.94118 37.05882
(One can also compute each expected frequency of a cell manually by dividing the product of the cell’s row and column total by the total of the table, e.g.,
100·80÷170 = 47.05882, etc.). Obviously, the expected frequencies are all greater than
5. Therefore, the next step is to determine whether the observed result from Table 1 is
significant – i.e. different enough from the expected result shown above – by calling
the overall result:
x.test
Pearson’s Chi-squared test
data: x
X-squared = 28.3671, df = 1, p-value = 1.004e-07
In this example, there is a highly significant correlation between the kind of disfluency
and the part of speech that follows: p is much smaller than the critical value of p = 0.05.
However, the fact that there is an overall significant result does not reveal which of
the four cells are most responsible for this effect and how. To identify these cells, one
should inspect the so-called Pearson residuals, which are computed as in (1).
	(1)	Pearson residual = (observed – expected) / √expected
x.test$res
          POS
Disfluency      Noun      Verb
       uh  -2.486729  2.344511
       uhm  2.972210 -2.802227
First, if the Pearson residual in a cell is positive/negative, then the observed frequency
in that cell is greater/less than the expected frequency in that cell. Second, the more
the Pearson residual deviates from 0, the stronger that effect. In this case, therefore,
the strongest effect is the preference of uhm before nouns, followed by the dispreference of uhm before verbs.
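Because expected frequencies, the chi-square statistic, and Pearson residuals are all simple arithmetic, they can be recomputed in a few lines of any language; the following Python sketch (an illustration only, independent of the chapter's R code) cross-checks the values for Table 1:

```python
# Cross-check of the chi-squared computation for Table 1 (pure arithmetic).
from math import sqrt

observed = {("uh", "Noun"): 30, ("uh", "Verb"): 70,
            ("uhm", "Noun"): 50, ("uhm", "Verb"): 20}
row_totals = {"uh": 100, "uhm": 70}
col_totals = {"Noun": 80, "Verb": 90}
n = 170

# expected frequency of a cell = (row total * column total) / grand total
expected = {(r, c): row_totals[r] * col_totals[c] / n for (r, c) in observed}

# chi-square = sum over all cells of (observed - expected)^2 / expected
chi_squared = sum((observed[cell] - expected[cell]) ** 2 / expected[cell]
                  for cell in observed)

# Pearson residual = (observed - expected) / sqrt(expected), as in (1)
residuals = {cell: (observed[cell] - expected[cell]) / sqrt(expected[cell])
             for cell in observed}

print(round(expected[("uh", "Noun")], 5))    # 47.05882
print(round(chi_squared, 4))                 # 28.3671
print(round(residuals[("uhm", "Noun")], 4))  # 2.9722
```

The three printed values match R's x.test$exp, x.test, and x.test$res output above.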
The final step is to compute an effect size. An effect size quantifies the strength of
the observed correlation independently of the sample size. In the case of 2-by-2 tables,
one standard effect size is Φ (phi), which theoretically ranges from 0 (‘no effect’) to 1
(‘perfect correlation’) and is computed as shown in (2). In this case, the correlation is
intermediately strong.
	(2)	Φ = √(χ² / n)
sqrt(x.test$stat/sum(x))
X-squared
0.4084912
The final question to be addressed is what to do when too many expected frequencies
are too small. While sparse data are always problematic in the sense that one does not
want to base potentially far-reaching generalizations on small data sets, there is a test
that can be used to test such tables for significance, too, which is called the Fisher-Yates
exact test. The R function that computes this test is fisher.test and its most
important argument is just the matrix containing the data:
fisher.test(x)
Fisher’s Exact Test for Count Data
data: x
p-value = 9.66e-08
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.08256479 0.35287555
sample estimates:
odds ratio
0.1734529
In this case, where the sample size and the expected frequencies are unproblematic anyway, the p-value provides the same kind of result: the distribution of the two
disfluencies before the two parts of speech is most likely not random, but there is
more. The output also provides another kind of effect size for 2-by-2 tables, the so-called odds ratio. The odds ratio is one of several measures that express how much
the distribution of a binary variable changes in response to another binary variable.
In this case, the odds ratio quantifies the ratio of the odds of uh (vs. uhm) before
nouns (30/50) to the odds of uh before verbs (70/20); the computation below uses the
equivalent ratio of the odds of a noun (vs. a verb) after uh (30/70) to those after uhm
(50/20):
(30/70)/(50/20)
[1] 0.1714286
The result is similar, but not identical, to the one provided by R, which uses a more
refined estimation algorithm (and also provides a confidence interval that is not addressed here (cf. Gries 2013: Section 3.1.5 for explanation and exemplification)). The
logic, however, is the same: the more the odds ratio differs from 1, the stronger the
effect. Sometimes, a scholar might not report odds ratios such as 0.5 and 1.5 directly,
but their logged counterparts, as shown below:
log(0.5)
[1] -0.6931472
log(1.5)
[1] 0.4054651
One reason for this is that odds ratios are often difficult to compare to each other: A
beginner might look at two odds ratios of 0.5 and 1.5 and – erroneously – think they
reflect equally strong effects because they are equally far away from 1. This is false as
the logs of the odds ratios show: the more a logged odds ratio deviates from 0, the
stronger the effect, which is why an odds ratio of 0.5 reflects a stronger effect than an
odds ratio of 1.5. (In addition, logged odds ratios are also important in the context of
logistic regression, a topic I cannot discuss here; cf. Gries 2013: 203–301 for detailed
explanation.)
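The asymmetry of raw odds ratios around 1, and why logging removes it, can be verified in a few lines; here is a Python sketch (an illustration only, independent of the chapter's R code):

```python
from math import log

# Odds of uh (vs. uhm) before nouns and before verbs (Table 1)
odds_noun = 30 / 50   # uh vs. uhm before nouns
odds_verb = 70 / 20   # uh vs. uhm before verbs
odds_ratio = odds_noun / odds_verb
print(round(odds_ratio, 7))  # 0.1714286

# Logged odds ratios put effects on both sides of 1 on a comparable
# scale: equal distances from 0 mean equally strong effects.
print(round(log(0.5), 7))  # -0.6931472
print(round(log(1.5), 7))  # 0.4054651
assert abs(log(0.5)) > abs(log(1.5))  # 0.5 reflects the stronger effect
```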
2.1.2 Larger two-dimensional r-by-c tables
Thankfully, the logic of 2-by-2 tables also applies to two-dimensional tables with more
than two rows (i.e. r > 2) and/or more than two columns (i.e. c > 2). Consider Table 2 for
an extended version of the above disfluencies example.
Table 2. Fictitious data on the correlation of Disfluency and Part of speech 2

          Noun   Verb   Conjunction   Totals
uh          30     70            90      190
uhm         50     20            40      110
silence     20      5            10       35
Totals     100     95           140      335
The data are entered in the same way as before:
x<-matrix(c(30, 50, 20, 70, 20, 5, 90, 40, 10), ncol=3)
attr(x, "dimnames")<-list(Disfluency=c("uh", "uhm",
"silence"), POS=c("Noun", "Verb", "Conjunction"))
And the analysis – including the check of the test’s requirement concerning the proportion
of expected frequencies that must not be smaller than 5 – is also no different:
x.test<-chisq.test(x, correct=FALSE)
x.test$exp
          POS
Disfluency     Noun      Verb Conjunction
   uh      56.71642 53.880597    79.40299
   uhm     32.83582 31.194030    45.97015
   silence 10.44776  9.925373    14.62687
x.test
Pearson’s Chi-squared test
data: x
X-squared = 45.2273, df = 4, p-value = 3.566e-09
x.test$res
          POS
Disfluency      Noun      Verb Conjunction
   uh      -3.547512  2.196002   1.1892280
   uhm      2.995361 -2.004245  -0.8805362
   silence  2.955245 -1.563384  -1.2097935
The expected frequencies are unproblematic: all of them are even larger than 9. Hence,
the p-value of the chi-squared test can be taken seriously, which points to an association between the kind of disfluency and the part of speech of the following word. The
nature of this association then becomes clear from the residuals: uh is dispreferred before nouns (negative residual of ≈–3.55) whereas uh is preferred before verbs (positive
residual of ≈2.2) and before conjunctions (positive residual of ≈1.2).
Two things remain to be done. First, one again needs to compute an effect size,
which for r-by-c tables with r > 2 and/or c > 2 is called Cramer’s V. Its formula is
shown in (3), where min(r, c) means ‘take the numbers of rows and columns and pick
the smaller of the two’.
	(3)	V = √(χ² / (n × (min(r, c) – 1)))
The effect size is now smaller than before:
sqrt(x.test$stat/(sum(x) * (min(dim(x))-1)))
X-squared
0.2598142
Second, one should explore whether the data can, or in fact must, be simplified as a
consequence of Occam’s razor, the principle that requires analysts to adopt the simplest possible model. In this case, for example, the observed data distinguish three
parts of speech – nouns, conjunctions, and verbs – but the (signs of the) residuals reveal that verbs and conjunctions behave alike so maybe a two-way distinction – nouns
vs. non-nouns – is sufficient.
To test heuristically which distinction to adopt, the data are entered again, but
this time the two levels of the part of speech that are suspected to behave the same
are conflated:
x.2<-matrix(c(30, 50, 20, 70+90, 20+40, 5+10), ncol=2)
attr(x.2, "dimnames")<-list(Disfluency=c("uh", "uhm",
"silence"), POS=c("Noun", "Not noun"))
addmargins(x.2)
          POS
Disfluency Noun Not noun Sum
   uh        30      160 190
   uhm       50       60 110
   silence   20       15  35
   Sum      100      235 335
Then, the analysis is repeated on the new merged data set:
x.2.test<-chisq.test(x.2, correct=FALSE)
x.2.test
Pearson’s Chi-squared test
data: x.2
X-squared = 43.1801, df = 2, p-value = 4.203e-10
x.2.test$res
          POS
Disfluency      Noun  Not noun
   uh      -3.547512  2.314141
   uhm      2.995361 -1.953958
   silence  2.955245 -1.927790
sqrt(x.2.test$stat/(sum(x.2) * (min(dim(x.2))-1)))
X-squared
0.3590205
The chi-square value has hardly changed and the effect size has even gone up considerably. Both of these facts suggest that the real distinction for the disfluencies in this
corpus may not be between nouns, verbs, and conjunctions, but just between nouns
and non-nouns, but this would have to be tested more rigorously (using, e.g., model
comparisons).
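The effect of the conflation can also be replicated step by step; the following Python sketch (an illustration only, independent of the chapter's R code) computes the chi-square value and Cramer's V for both the three-way and the conflated two-way classification of Table 2:

```python
from math import sqrt

def chi_squared_and_v(table):
    """Return the chi-square statistic and Cramer's V for a list-of-rows table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # chi-square sums (observed - expected)^2 / expected over all cells,
    # with expected = row total * column total / grand total
    chi2 = sum((obs - rt * ct / n) ** 2 / (rt * ct / n)
               for row, rt in zip(table, row_totals)
               for obs, ct in zip(row, col_totals))
    # Cramer's V as in (3)
    v = sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))
    return chi2, v

# Table 2: rows uh / uhm / silence, columns Noun / Verb / Conjunction
three_way = [[30, 70, 90], [50, 20, 40], [20, 5, 10]]
# Conflated version: columns Noun / Not noun (= Verb + Conjunction)
two_way = [[30, 160], [50, 60], [20, 15]]

chi2_3, v_3 = chi_squared_and_v(three_way)
chi2_2, v_2 = chi_squared_and_v(two_way)
print(round(chi2_3, 4), round(v_3, 4))  # 45.2273 0.2598
print(round(chi2_2, 4), round(v_2, 4))  # 43.1801 0.359
```

As in the R analysis, the chi-square value hardly changes while the effect size increases after the conflation.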
2.1.3 Additional applications
This section deals with two unfortunately less well-known, but nevertheless very useful methods in the analysis of different kinds of two-dimensional r-by-c tables.
Testing a subtable of a table
This section is concerned with the question of what to do when one has an r-by-c
table, but wishes to evaluate an s-by-d subtable of it with s ≤ r and d ≤ c. As an example, I
will use a study of the frequencies with which four emotion metaphors are distributed
over four registers (cf. Lohmann 2009 for an example). Lohmann studied the degree
to which certain supposedly very pervasive conceptual metaphors are attested in different genres. Consider Table 3 for a (fictitious) example of the kind of data such a
study may yield (I will single out four of its frequencies – 14, 25, 22, and 17 – below.)
If one entered the data …
x<-matrix(c(8, 31, 44, 36, 5, 14, 25, 38, 4, 22, 17, 12,
8, 11, 16, 24), ncol=4)
attr(x, "dimnames")<-list(Register=c("acad", "spoken",
"fiction", "news"), Metaphor=c("Heated fluid", "Light",
"NatForce", "Other"))
and did a chi-squared test on this table, then one would find a significant result (with
χ2 = 19.5151; df = 9; p = 0.02115). However, let us assume that one found these data
in a study, but that one is only interested in whether spoken conversation differed from fiction in the use of the metaphors Emotion is light and Emotion is a
natural force, i.e., the frequencies 14, 25, 22, and 17 in Table 3. Contrary to what quite a few
Table 3. Fictitious distribution of emotion metaphors in different genres

                   Emotion is a heated   Emotion    Emotion is a    Other   Totals
Register           fluid in a container  is light   natural force
Academic writing             8               5            4            8       25
Spoken conv.                31              14           22           11       78
Fiction                     44              25           17           16      102
News                        36              38           12           24      110
Totals                     119              82           55           59      315
people seem to think, one cannot simply extract this table from the overall table – i.e.,
pretend one had done a study oneself with just the variable levels and frequencies one
is interested in – and run a chi-squared test on it. Thus, the following (slightly shortened) code and its result are wrong:
subtable<-matrix(c(14, 25, 22, 17), ncol=2)
chisq.test(subtable, correct=FALSE) # WRONG!
Pearson’s Chi-squared test
data: matrix(c(14, 22, 25, 17), ncol = 2)
X-squared = 3.3016, df = 1, p-value = 0.06921
This test is wrong because its chi-square value is based on the marginal totals of the
subtable (e.g. 39 vs. 39 for Emotion is light, etc.), but does not take the overall observed frequencies of Emotion is light into consideration (e.g. 82 vs. 55, etc.). The
correct test is, unfortunately, slightly more lengthy and involves the following steps
(following Bortz, Lienert, and Boehnke 1990: Section 5.4.4).
First, one computes the chi-squared test that compares the observed row sums of
the subtable (36 vs. 42) to the ones expected from the proportions of row sums of the
whole table (78 vs. 102, i.e., 78/180 vs. 102/180):
chisq.test(c(36, 42), p=c(78, 102)/180)[c(1,7)]
$statistic
X-squared
0.2526975
$expected
[1] 33.8 44.2
Second, one computes the chi-squared test that compares the observed column sums
of the subtable (39 vs. 39) to the ones expected from the proportions of column sums
of the whole table (82 vs. 55, i.e. 82/137 vs. 55/137):
chisq.test(c(39, 39), p=c(82, 55)/137)[c(1,7)]
$statistic
X-squared
3.151996
$expected
[1] 46.68613 31.31387
Third, one computes the frequencies that would have been expected in the subtable
if the cells were distributed proportionally to the expected marginal totals according
to the usual two-dimensional chi-square formula mentioned above, by dividing the
product of the cell’s row and column total by the total of the table, as shown in Table 4.
Table 4. Expected frequencies (when the cells are proportional to the expected marginal
totals)

                 Emotion is light             Emotion is a natural force   Totals
Spoken convers.  (33.8 × 46.69) / 78 ≈ 20.23  (33.8 × 31.31) / 78 ≈ 13.57    33.8
Fiction          (44.2 × 46.69) / 78 ≈ 26.46  (44.2 × 31.31) / 78 ≈ 17.74    44.2
Totals           46.69                        31.31                            78
Table 5. Contributions to chi-square

                 Emotion is light               Emotion is a natural force
Spoken convers.  (14 – 20.23)² / 20.23 ≈ 1.92   (22 – 13.57)² / 13.57 ≈ 5.24
Fiction          (25 – 26.46)² / 26.46 ≈ 0.08   (17 – 17.74)² / 17.74 ≈ 0.03
As the penultimate step, one computes each table cell's contribution to the chi-square value by dividing the squared difference between the observed and the expected cell frequency by the expected frequency, as shown in Table 5. (By the way, these
correspond to the squared Pearson residuals mentioned in (1).)
In R, this can be done much more simply:
exp.temp<-matrix(c(20.23, 26.46, 13.57, 17.74), ncol=2)
sum(((subtable-exp.temp)^2)/exp.temp)
[1] 7.266921
The final step is then, at last, to compute the difference of this last chi-square value
and the sum of the other two. This difference is the required chi-square value and then
provides the desired p-value:
7.266921-(0.2526975+3.151996) # chi-square
[1] 3.862227
pchisq(3.862227, prod(dim(subtable)-1), lower.tail=F) # p-value
[1] 0.04938474
This chi-square value corresponds – disregarding rounding errors – to what an R
function for this method written by the author would provide, as the last row indicates:
sub.table(x, 2:3, 2:3) # the data and the rows/columns for the sub-table
[…]
$‘Chi-squared tests’
                                 Chi-square Df    p-value
Cells of subtable to whole table  7.2682190  3 0.06382273
Rows (within sub-table)           0.2526975  1 0.61518204
Columns (within sub-table)        3.1519956  1 0.07583417
Contingency (within sub-table)    3.8635259  1 0.04934652
As is now obvious, the data in the subtable actually produce a significant result: the
two kinds of metaphors are differently frequent in the two registers. Note again that
the wrong approach from above – just applying a separate chi-squared test to the subtable – did not return a significant result, which should demonstrate how important
it is to apply the correct methods.
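To see the whole procedure in one place, the following Python sketch (an illustration only; it is not the author's sub.table function) strings the three chi-square components together for the example just discussed; for df = 1, the p-value can be obtained from the complementary error function:

```python
from math import erfc, sqrt

def gof_chi2(observed, proportions):
    """Goodness-of-fit chi-square of observed counts against expected proportions."""
    n = sum(observed)
    total_p = sum(proportions)
    expected = [n * p / total_p for p in proportions]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, expected

# Step 1: observed subtable row sums vs. whole-table row proportions
chi2_rows, exp_rows = gof_chi2([36, 42], [78, 102])
# Step 2: observed subtable column sums vs. whole-table column proportions
chi2_cols, exp_cols = gof_chi2([39, 39], [82, 55])

# Step 3: expected cell frequencies proportional to the expected marginals
subtable = [[14, 22], [25, 17]]  # rows: spoken, fiction; cols: light, natforce
n_sub = 78
expected = [[r * c / n_sub for c in exp_cols] for r in exp_rows]
chi2_cells = sum((subtable[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for i in range(2) for j in range(2))

# Final step: contingency chi-square and its p-value; for df = 1, the
# survival function of the chi-square distribution is erfc(sqrt(x / 2))
chi2_contingency = chi2_cells - chi2_rows - chi2_cols
p_value = erfc(sqrt(chi2_contingency / 2))
print(round(chi2_contingency, 4))  # 3.8635
print(round(p_value, 4))           # 0.0493
```

Working with the unrounded expected frequencies yields the same values as the sub.table output above, i.e. without the small rounding error of the manual computation.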
The Marascuilo procedure
In order to determine how many different variable levels to retain in either a unidimensional vector of frequencies or percentages, one can use the so-called Marascuilo
procedure. Since this procedure can be applied to a simple vector of frequencies or
percentages, it can also be used for a 2-by-c table, where it tests which of the c variable
levels of the column variable are better conflated. To explore this procedure, we consider the alternation of particle placement exemplified in (4).
(4) a. He picked up the book.
b. He picked the book up.
Just like many other constituent order alternations in English, the choice of one order
by a speaker is determined by many different factors and usually made unconsciously.
One of the factors governing particle placement is the information status of the referent of the direct object (cf. Kruisinga and Erades 1953; Chen 1982; Gries 2003): on
the whole, it seems as if new referents prefer to occur after the particle (i.e. as in (4a))
whereas given referents prefer to occur before the particle (i.e. as in (4b)). However,
since given vs. new is only the most simplistic classification of information status,
one may want to include at least one additional level such as Status: inferable,
which characterizes referents which have not been mentioned before in the preceding
discourse, but which a hearer can infer on the fly from linguistic or contextual knowledge. Consider Table 6 for an example data set.
The first step in applying the Marascuilo procedure is as discussed above, i.e. enter the data and perform a chi-squared test for two-dimensional tables to determine
whether there is a correlation between information status and the constituent order:
x<-matrix(c(37, 13, 63, 37, 20, 40), ncol=3)
attr(x, "dimnames")<-list("Constituent order"=
c("Verb-Object-Particle", "Verb-Particle-Object"),
"Information status"=c("given", "inferable", "new"))
x.test<-chisq.test(x, correct=FALSE)
Table 6. Particle placement: Constituent order and Information status

                       given   inferable   new   Totals
Verb-Object-Particle      37          63    20      120
Verb-Particle-Object      13          37    40       90
Totals                    50         100    60      210
x.test$exp
                      Information status
Constituent order        given inferable      new
  Verb-Object-Particle 28.57143  57.14286 34.28571
  Verb-Particle-Object 21.42857  42.85714 25.71429
x.test
Pearson’s Chi-squared test
data: x
X-squared = 21.0914, df = 2, p-value = 2.631e-05
x.test$res
                      Information status
Constituent order          given  inferable       new
  Verb-Object-Particle  1.576841  0.7748272 -2.439750
  Verb-Particle-Object -1.820780 -0.8946933  2.817181
sqrt(x.test$stat/(sum(x) * (min(dim(x))-1)))
X-squared
0.3169151
The result is fairly obvious: there is a not particularly strong, but still highly significant, correlation or interaction between the two variables in the expected direction:
given and new referents prefer to occur before and after the particle, respectively. In
addition, inferable referents pattern more like given referents – they prefer to occur
before the particle – but less strongly so. According to Occam’s razor, one should now
test whether all three levels of Information status are required especially since in
this case a conflation of Information status: given and Information status: inferable as a counterpart to Information status: new would make sense – whereas
a conflation of Information status: given and Information status: new as a
counterpart to Information status: inferable would not.
The Marascuilo procedure requires three steps. First, one computes, within each information status, the percentages of the variable with two levels, in this case Constituent
order:
prop<-prop.table(x, 2) # the 2 means 'column-wise', 1 would mean 'row-wise'
prop
                      Information status
Constituent order      given inferable       new
  Verb-Object-Particle  0.74      0.63 0.3333333
  Verb-Particle-Object  0.26      0.37 0.6666667
Second, one computes all pairwise differences between the percentages of one constituent order in the three information states:
– 0.74 – 0.63 = 0.11 (given – inferable);
– 0.74 – 0.333 = 0.407 (given – new); and
– 0.63 – 0.333 = 0.297 (inferable – new).
Third, one compares each of the three differences to a threshold value that must be
computed with the rather complicated formula shown in (5) (for a significance value
of p = 0.05):
	(5)	critical range = √( χ²(p = 0.05; df = levels – 1) × ( perc1 × (1 – perc1) / Σcolumn1 + perc2 × (1 – perc2) / Σcolumn2 ) )
For the comparison (given – inferable), this translates into (6):
	(6)	√( 5.9915 × ( 0.74 × 0.26 / 50 + 0.63 × 0.37 / 100 ) ) = 0.1924091
Since the observed percentage difference of 0.11 is not larger than the critical percentage difference for p = 0.05 at df = 3 – 1 = 2 of approximately 0.19, the difference
between the percentages of Information status: given and Information status:
inferable is not significant. Again, this procedure is somewhat labor-intensive,
but can be computed easily using R. The output of applying a function for this procedure (mar)
to the matrix x returns, among other things, the following results for the pairwise
comparisons:
mar(x)
[…]
$‘pairwise comparisons’
      comparisons     diffs crit.ranges decisions
1 given-inferable 0.1100000   0.1924091        ns
2       given-new 0.4066667   0.2127105         *
3   inferable-new 0.2966667   0.1901492         *
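The same pairwise comparisons can also be computed from scratch; the following Python sketch (an illustration only, with the critical chi-square value 5.9915 for p = 0.05 and df = 2 hard-coded) reproduces the three decisions:

```python
from itertools import combinations
from math import sqrt

# Frequencies of Verb-Object-Particle and the column totals from Table 6
vop = {"given": 37, "inferable": 63, "new": 20}
totals = {"given": 50, "inferable": 100, "new": 60}
chi2_crit = 5.9915  # critical chi-square value for p = 0.05, df = 3 - 1 = 2

percents = {level: vop[level] / totals[level] for level in vop}

results = []
for level1, level2 in combinations(vop, 2):
    diff = abs(percents[level1] - percents[level2])
    # critical range as in (5)
    crit = sqrt(chi2_crit * (percents[level1] * (1 - percents[level1]) / totals[level1]
                             + percents[level2] * (1 - percents[level2]) / totals[level2]))
    results.append((level1, level2, diff, crit, "*" if diff > crit else "ns"))

for level1, level2, diff, crit, verdict in results:
    print(level1, level2, round(diff, 4), round(crit, 4), verdict)
# given inferable 0.11 0.1924 ns
# given new 0.4067 0.2127 *
# inferable new 0.2967 0.1901 *
```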
The results of the Marascuilo procedure at least suggest that one should conflate Information status: given and Information status: inferable into a new category Information status: non-new, evaluate that matrix, and report and interpret
those results:
x2<-matrix(c(100, 50, 20, 40), ncol=2)
attr(x2, "dimnames")<-list("Constituent order"=
c("Verb-Object-Particle", "Verb-Particle-Object"),
"Information status"=c("non-new", "new"))
x2.test<-chisq.test(x2, correct=FALSE)
x2.test$exp
                      Information status
Constituent order      non-new      new
  Verb-Object-Particle 85.71429 34.28571
  Verb-Particle-Object 64.28571 25.71429
x2.test
Pearson’s Chi-squared test
data: x2
X-squared = 19.4444, df = 1, p-value = 1.036e-05
x2.test$res
                      Information status
Constituent order       non-new       new
  Verb-Object-Particle  1.543033 -2.439750
  Verb-Particle-Object -1.781742  2.817181
sqrt(x2.test$stat/(sum(x2) * (min(dim(x2))-1)))
X-squared
0.3042903
Both the chi-square value and the effect size hardly change as a result of the elimination of one variable level, and given this loss of one df, the p-value is even much
smaller than before. (Note that other statistical approaches may come to different conclusions, which does not, however, obviate the need for some kind of test of whether
the three levels of Information status need to be kept separate and for an explicit
discussion of which test was used.)
This is a clear case in which Occam’s razor not only makes the results better, but
in which it also may lead to new findings: if the researcher had not already expected
that given and inferable were very similar – unless the researcher wanted to test
whether they are the same, he or she should have just coded one non-new information status – then Occam’s razor has helped to reveal this patterning.
2.2 Multidimensional tables
Two-dimensional frequency tables have probably the most widespread use of all frequency tables. However, most linguistic choices are not determined by only a single
variable, and while the analysis of multidimensional frequency tables is somewhat
more complex, a growing number of linguists have realized that very often only a
multifactorial study will reveal the most important generalizations and avoid erroneous interpretations arising from the omission of important predictor variables (cf. the
well-known example of Simpson’s paradox; cf. Sheskin 2011: 718–720).
Multidimensional frequency tables can be analyzed in many different ways,
which often makes it difficult for the beginner to choose one method over another:
loglinear models/Poisson regression, binary or multinomial logistic regression, (multiple) correspondence analysis, association rules, … are among the most frequently used
methods, but I cannot discuss them all here. Binary logistic regression is a very widely
used method but, as the name suggests, it is restricted to dependent variables with
only two levels (cf. Gries 2013: Section 5.3 for in-depth discussion as well as Baayen
2008: Section 6.3.1; Johnson 2008: Section 5.4; and Speelman, this volume). I will
therefore discuss one example of a Poisson regression (which could also be investigated with a binary logistic regression). Let me begin, however, with the warning
that this chapter cannot discuss all the tricky details of regression model selection so
readers are advised to brush up their knowledge in this area and/or study additional
materials (especially those readers who do not know linear regressions already); I find
Crawley’s (2005, 2012) books most instructive, and Faraway (2006) also provides a
good, though more technical, introduction.
In this section, I will discuss an example from a recent corpus study published in
the ICAME Journal (Hommerberg and Tottie 2007). Their study explores two complementation patterns of the verb try in British and American English: try to vs. try
and. Their goal is “to show how native speakers of present-day British and American
English actually use the two constructions”, and they use a data set from the Cobuild
Direct Corpus, whose size and composition is summarized in Table 7.
The variables Variety and Mode are self-explanatory; the variable Try refers
to whether speakers/writers used try to or try and, and the variable Clause refers to
whether the VP containing try is itself part of a to-clause (as in we’re going to try (to/
and)) (Hommerberg and Tottie 2007: 56).
I will assume that Table 7 is in R’s workspace as a data frame called x. The str
command summarizes the structure of the table as follows:
str(x)
‘data.frame’: 16 obs. of 5 variables:
 $ VARIETY: Factor w/ 2 levels “american”,“british”: 1 1 1 1 ...
 $ MODE   : Factor w/ 2 levels “spoken”,“written”: 1 1 1 1 2 ...
 $ TRY    : Factor w/ 2 levels “and”,“to”: 1 1 2 2 1 1 2 2 1 ...
Table 7. The data studied by Hommerberg and Tottie (2007)

Variety    Mode      Try   Clause   Freq.
american   spoken    and   other      120
american   spoken    and   to          90
american   spoken    to    other      381
american   spoken    to    to         174
american   written   and   other       10
american   written   and   to          26
american   written   to    other      219
american   written   to    to         167
british    spoken    and   other      503
british    spoken    and   to         706
british    spoken    to    other      150
british    spoken    to    to         133
british    written   and   other       49
british    written   and   to         127
british    written   to    other      230
british    written   to    to         144
 $ CLAUSE : Factor w/ 2 levels “other”,“to”: 1 2 1 2 1 2 1 2 ...
 $ FREQ   : int 120 90 381 174 10 26 219 167 503 706 ...
Several things are needed for a Poisson regression. First, the relevant R function is
glm, which is short for generalized linear model. Second, the function takes two main
arguments, the first of which is a formula that specifies which dependent variable –
the observed frequencies of occurrence – and which predictors – which independent
variables and which of their interactions – to include. Formulae in R are written as
“dependent variable ~ predictors/independent variables”, where n independent variables combined with asterisks mean ‘include the independent variables and their interactions’ while n independent variables combined with colons mean ‘include the
interaction of these independent variables’. The second argument is family=poisson, which instructs R to compute a Poisson regression with a log-link and not a
‘normal’ linear regression with a Gaussian identity function. In essence, this ensures
that the regression cannot predict negative values (which would not make sense since
frequencies cannot be negative; cf. Crawley 2012: Section 13.3 for discussion).
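The effect of the log-link can be illustrated with two lines of arithmetic; in the following Python sketch (illustrative values only, not taken from the data), a linear predictor that dips below zero still yields a positive predicted frequency once it is antilogged:

```python
from math import exp

# Under an identity link, a linear predictor of, say, -1.2 would itself be
# the predicted frequency -- an impossible negative count.
linear_predictor = -1.2

# Under a log-link, the model predicts log(frequency), so the predicted
# frequency is exp(linear predictor), which is positive for any real input.
predicted_frequency = exp(linear_predictor)
print(round(predicted_frequency, 4))  # 0.3012
assert predicted_frequency > 0
```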
Given the discussion of model selection, the first step of the actual analysis consists of fitting a maximal model in which all predictors are included. The following
code computes such a model, stores it into a data structure m1, and summarizes this
data structure (the output here is abbreviated and minimally altered).
m1<-glm(FREQ ~ VARIETY*MODE*TRY*CLAUSE, family=poisson)
summary(m1)
[…]
Coefficients:
                                   Estimate Std.Error z value Pr(>|z|)
(Intercept)                         4.78749   0.09129  52.444  < 2e-16 ***
VARIETYbrit                         1.43310   0.10159  14.106  < 2e-16 ***
MODEwrt                            -2.48491   0.32914  -7.550 4.36e-14 ***
TRYto                               1.15531   0.10468  11.037  < 2e-16 ***
CLAUSEto                           -0.28768   0.13944  -2.063  0.03911 *
VARIETYbrit:MODEwrt                 0.15614   0.36157   0.432  0.66586
VARIETYbrit:TRYto                  -2.36526   0.14005 -16.889  < 2e-16 ***
MODEwrt:TRYto                       1.93118   0.33989   5.682 1.33e-08 ***
VARIETYbrit:CLAUSEto                0.62671   0.15116   4.146 3.38e-05 ***
MODEwrt:CLAUSEto                    1.24319   0.39737   3.129  0.00176 **
TRYto:CLAUSEto                     -0.49606   0.16678  -2.974  0.00294 **
VARIETYbrit:MODEwrt:TRYto           0.82503   0.38592   2.138  0.03253 *
VARIETYbrit:MODEwrt:CLAUSEto       -0.62985   0.43542  -1.447  0.14803
VARIETYbrit:TRYto:CLAUSEto          0.03675   0.21309   0.172  0.86307
MODEwrt:TRYto:CLAUSEto             -0.73053   0.42051  -1.737  0.08235 .
VARIETYbrit:MODEwrt:TRYto:CLAUSEto -0.23079   0.48373  -0.477  0.63328
[…]
    Null deviance:  2.1620e+03 on 15 degrees of freedom
Residual deviance: -8.7486e-14 on  0 degrees of freedom
[…]
The main part of the output above is a table, which lists the included predictors, their
coefficient estimates and their significance tests. The most relevant columns are the
first (with the name of the predictor), the second headed Estimate, and the last with
the p-value for the predictor. The row for the intercept shows 4.78749 as an estimate,
the antilog of which is 120, the observed frequency of the combination of the alphabetically first factor levels: Variety: american Mode: spoken Try: and Clause:
other.
The estimates that R outputs then for the predictors reflect the difference between
the listed predictor and a reference level. For individual variables, the reference level is
the alphabetically first, unlisted level. For instance, the value of 1.43310 for Variety:
british means that the model estimates that, compared to the reference level of Variety: american, Variety: british increases (positive sign) the estimated frequencies. Similarly, the value of –2.48491 for Mode: written means that the model
estimates that, compared to the reference level of Mode: spoken, Mode: written
reduces (negative sign) the estimated frequencies (and more strongly so than Variety: british increases them).
Another way to understand the meanings of the coefficients is to compute the
predictions of the model. For example, for Variety: british, Mode: written, Try: to,
Clause: other, all one needs to do is to add up all coefficients whose predictors are
part of this configuration and antilog the sum:
exp(4.78749+1.43310-2.48491+1.15531+0.15614-2.36526+
1.93118+0.82503)
[1] 230.0002 # rounding difference only
This predicted frequency corresponds to the observed frequency because this is the
maximal model that contains all predictors. According to Occam’s razor, insignificant predictors must now be weeded out. Crucially, the elimination of insignificant
predictors always begins with the highest-order interactions and, as long as there are
still insignificant predictors, proceeds to lower-order interactions and then to individual variables. Crucially, a predictor is not removed, even if it is insignificant, as long as it
still participates in a higher-order interaction. In this case, there is only one four-way
interaction – Variety: Mode: Try: Clause – and it is not significant. Thus, one now
updates the first model by removing that interaction:
m2<-update(m1, ~. -VARIETY:MODE:TRY:CLAUSE)
However, one must now first check whether this simplification of the model was justified. This is how it is done:
anova(m1, m2, test="LRT")
[…]
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1         0    0.00000
2         1    0.22492 -1 -0.22492    0.6353
R evaluates the difference between the two models, and because they do not differ
from each other significantly (p = 0.6353), Occam's razor requires one to adopt the simpler
one, m2. One can then inspect this simpler model (with summary(m2)) and it becomes obvious that the only three-way interaction that is not significant is Variety:
Try: Clause. Hence:
m3<-update(m2, ~. -VARIETY:TRY:CLAUSE)
anova(m2, m3, test="LRT")
[…]
  Resid. Df Resid. Dev Df   Deviance Pr(>Chi)
1         1    0.22492
2         2    0.22633 -1 -0.0014192   0.9699
When this new model m3 is inspected, it is clear that it cannot be simplified any further: each predictor is significant or participates in a significant interaction (Variety:
Mode is not significant, but participates in Variety: Mode: Try, which is).
summary(m3)
[…]
Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                    4.79421    0.08228  58.268  < 2e-16 ***
VARIETYbrit                    1.42477    0.08915  15.981  < 2e-16 ***
MODEwrt                       -2.59755    0.23561 -11.025  < 2e-16 ***
TRYto                          1.14646    0.09105  12.592  < 2e-16 ***
CLAUSEto                      -0.30343    0.10550  -2.876  0.00403 **
VARIETYbrit:MODEwrt            0.29069    0.22849   1.272  0.20330
VARIETYbrit:TRYto             -2.34943    0.10553 -22.263  < 2e-16 ***
MODEwrt:TRYto                  2.05052    0.23742   8.637  < 2e-16 ***
VARIETYbrit:CLAUSEto           0.64521    0.10658   6.054 1.42e-09 ***
MODEwrt:CLAUSEto               1.40279    0.21997   6.377 1.80e-10 ***
TRYto:CLAUSEto                -0.47355    0.10385  -4.560 5.11e-06 ***
VARIETYbrit:MODEwrt:TRYto      0.67402    0.22825   2.953  0.00315 **
VARIETYbrit:MODEwrt:CLAUSEto  -0.82045    0.17546  -4.676 2.92e-06 ***
MODEwrt:TRYto:CLAUSEto        -0.90750    0.20548  -4.417 1.00e-05 ***
[…]
    Null deviance: 2161.98713 on 15 degrees of freedom
Residual deviance:    0.22633 on  2 degrees of freedom
[…]
Frequency tables 387
Often such data are condensed into a table that, simplifying a bit, conveniently summarizes each independent variable’s significance in one p-value. This table, a so-called ANOVA table, can be created as follows (cf. Gries 2013: 266, 271, for explanation).
library(car)
options(contrasts=c("contr.sum", "contr.poly"))
Anova(m3, type="III", test="LR") # the results are not shown here
options(contrasts=c("contr.treatment", "contr.poly"))
The table now combines predictors involving the same variables but different variable levels and confirms more succinctly what was already shown above: every predictor in m3 except Variety: Mode is significant.
How are the results interpreted? They are interpreted, as already hinted at above, on the basis of the coefficients. Since I cannot discuss all the findings in detail, some comments must suffice. The data show, trivially, that compared to the reference combination of Variety: american Mode: spoken Try: and Clause: other, setting Variety to british increases the predicted counts (the coefficient for Variety: british is positive) and setting Mode to written decreases the predicted counts (the coefficient for Mode: written is negative), etc.
More interesting, however, are the interactions that qualify these main effects. As just one example, consider the interaction Variety: british Try: to. This
strong and highly significant interaction means that, while setting both Variety to
british and Try to to increases the estimated counts (by the antilog of 1.42477 +
1.14646 ≈ 2.57123), their joint effect does not boost the counts accordingly, but decreases them by nearly the same amount (by the antilog of –2.34943); thus, compared
to the predicted frequency for Variety: american Mode: spoken Try: and Clause:
other, 120.81, the predicted frequency for Variety: british Mode: spoken Try:
to Clause: other, is increased by 24.8%, the antilog of 2.57123 – 2.34943. But then
there is also a significant interaction Variety: british Mode: written Try: to …
It is clear that complex interactions like these, and the degree to which they are significant, can hardly be recognized by just eyeballing the data; they are usually only comprehensible on the basis of well-designed graphs (e.g. bar plots of predicted frequencies).
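The 24.8% figure above can be retraced from the coefficients of m3; the following Python sketch (values copied from the model summary) is only a check of the arithmetic:

```python
import math

# Coefficients copied from summary(m3); the reference combination is
# Variety: american, Mode: spoken, Try: and, Clause: other
intercept    = 4.79421
variety_brit = 1.42477
try_to       = 1.14646
interaction  = -2.34943  # VARIETYbrit:TRYto

# Predicted frequency for the reference combination
reference = math.exp(intercept)

# Multiplicative effect of switching both Variety to british and Try
# to 'to', including their interaction
factor = math.exp(variety_brit + try_to + interaction)
print(round(reference, 2), round(factor, 3))  # about 120.81 and 1.248 (+24.8%)
```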
This concludes the discussion of this example. Many issues of Poisson regression could not be covered for reasons of space (such as testing the assumptions of Poisson regression or how to handle over-/underdispersion). The method is powerful when applied to complex data sets and is definitely worth exploring in more detail.
3. Conclusion
This brief chapter could, of course, not do justice to all the complexities that can and do arise in the study of frequency tables. I hope, however, that the above brief remarks and examples have shown how useful a statistically correct and comprehensive exploration of such data can be, and that they have whetted the reader’s appetite to explore such techniques in more detail (as well as their graphical exploration, which I could not address here at all), both in this volume and in the works referred to throughout this paper.
References
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R.
Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511801686
Bortz, J., Lienert, G. A., & Boehnke, K. (1990). Verteilungsfreie Methoden in der Biostatistik (2nd
ed.). Heidelberg: Springer. DOI: 10.1007/978-3-662-22593-6
Brooks, P., & Tomasello, M. (1999). How children constrain their argument structure constructions. Language, 75, 720–738. DOI: 10.2307/417731
Bybee, J. L. (1985). Morphology: A study of the relation between meaning and form. Amsterdam
& Philadelphia: John Benjamins. DOI: 10.1075/tsl.9
Bybee, J. L., & Scheibman, J. (1999). The effect of usage on degrees of constituency: The reduction of don’t in English. Linguistics, 37, 575–596. DOI: 10.1515/ling.37.4.575
Chen, P. (1982). Discourse and particle movement in English. Studies in Language, 10, 79–95.
DOI: 10.1075/sl.10.1.05che
Crawley, M. (2005). Statistics: An introduction using R. New York: John Wiley. DOI: 10.1002/9781119941750
Crawley, M. (2012). The R book (2nd ed.). Chichester: John Wiley. DOI: 10.1002/9781118448908
Faraway, J. J. (2006). Extending the linear model with R: Generalized linear, mixed effects and
nonparametric regression models. Boca Raton, FL: Chapman and Hall.
Goldberg, A. E., Casenhiser, D., & Sethuraman, N. (2004). Learning argument structure generalizations. Cognitive Linguistics, 14, 289–316.
Gries, St. Th. (2003). Multifactorial analysis in corpus linguistics: A study of particle placement.
London & New York: Continuum.
Gries, St. Th. (2013). Statistics for linguistics with R: A practical introduction. Berlin & New York:
Mouton de Gruyter. DOI: 10.1515/9783110307474
Hommerberg, C., & Tottie, G. (2007). Try to and try and? Verb complementation in British and
American English. ICAME Journal, 31, 45–64.
Johnson, K. (2008). Quantitative methods in linguistics. Malden, MA & Oxford: Blackwell.
Kruisinga, E., & Erades, P. A. (1953). An English grammar. Vol. I. Groningen: P. Noordhoff.
Lohmann, A. (2009). The register-specificity of metaphor. Paper presented at the workshop
‘Corpus, colligation, register variation’ of the 31st DGfS-Tagung.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants.
Science, 274, 1926–1928. DOI: 10.1126/science.274.5294.1926
Saffran, J. R., & Wilson, D. P. (2003). From syllables to syntax: Multilevel statistical learning by 12-month-old infants. Infancy, 4, 273–284. DOI: 10.1207/S15327078IN0402_07
Sheskin, D. J. (2011). Handbook of parametric and non-parametric statistical procedures (5th
ed.). Boca Raton, FL, London & New York: CRC Press.
Schmid, H.-J. (2000). English abstract nouns as conceptual shells: From corpus to cognition.
Berlin & New York: Mouton de Gruyter.
Collostructional analysis
Measuring associations between
constructions and lexical elements
Martin Hilpert
Université de Neuchâtel
This chapter offers a practical introduction to a set of corpus-linguistic analytical methods that are referred to collectively as ‘collostructional analysis’. The
overarching aim in conducting a collostructional analysis is to find out which
lexical items form collocations with a given grammatical construction. The
purpose of such an undertaking, in the context of this book, is to facilitate the
semantic analysis of grammatical constructions. The chapter discusses several
case studies in order to show that the semantic description of grammatical
constructions benefits from a quantitative analysis of the lexical material that
occurs with the respective constructions. A final discussion surveys advantages
and pitfalls of collostructional methods.
Keywords: collocations, quantitative analysis
1. Introduction
This chapter offers an introduction to collostructional analysis for the non-initiated.
It explains how it works, how it is used, and how its results can be interpreted. In doing so, it may serve as a primer for the original series of papers in which the different
collostructional methods are presented (Stefanowitsch and Gries 2003, 2005; Gries
and Stefanowitsch 2004a, 2004b). These papers should be referred to not only for
theoretical background detail, but also for further case studies and discussion of the
statistics that are implemented in the respective methods. Unlike the original papers,
this chapter discusses a few practical aspects of conducting collostructional analyses
that are meant to facilitate the process of getting started with one’s own projects.
The overarching aim in conducting a collostructional analysis is to find out which
lexical items are “typical” of a given grammatical construction. For example, given
a construction such as English keep on V-ing, which verbs typically occur in this
392 Martin Hilpert
constructional frame? As is explained in more detail below, simple raw frequency
counts sometimes do not provide a revealing answer to this question. Rather than
asking which elements occur most often, collostructional analysis uses relative frequency counts to determine which elements occur more frequently in a construction
than would be expected by chance.
A sceptic might wonder why an analysis of constructional collocates should be
useful. A collocational analysis of constructions is motivated by the insight that the
meaning of a construction tends to harmonize with the meanings of the lexical elements that typically occur in it. For instance, a verb will only be acceptable within a
given construction if its arguments are compatible with the roles specified by the construction; this is referred to as the Semantic Coherence Principle (Goldberg 1995: 50).
The English ditransitive construction, which conveys the constructional meaning of a
transfer, thus overwhelmingly occurs with transfer verbs such as give or send; the English way-construction, which evokes the effortful creation of a path through obstacles,
occurs with verbs such as make, work, and push. Analyzing the mutual dependencies
between constructions and lexical elements can yield insights into the meaning of
constructions, which is one of the basic aims of Construction Grammar. In practical
terms, if a research project has the goal of describing the semantics of a grammatical
construction, collostructional analysis is a good methodological choice for the job.
Collostructional analysis is not a single method, but is actually a cover term for
three related methods of corpus-linguistic inquiry. Each of these methods investigates
the mutual associations between grammatical constructions and lexical items:
– Collexeme analysis (Stefanowitsch and Gries 2003) is used to investigate which
lexical items typically occupy a given slot in a single grammatical construction
such as the keep on V-ing construction.
– Distinctive collexeme analysis (Gries and Stefanowitsch 2004a) contrasts two or
more constructions with regard to the lexical items that occur with them. This
is especially useful for the comparison of constructions that are roughly synonymous. Pairs of constructions such as the ditransitive vs. the prepositional Dative,
will vs. be going to, or active vs. passive can be fruitfully contrasted with regard to
the main verbs that characterize each respective variant.
– Covarying-collexeme analysis (Gries and Stefanowitsch 2004b; Stefanowitsch and
Gries 2005) reveals dependencies between lexical items that occupy two different
slots within the same construction. To illustrate, the English object-to-subject
raising construction (Proust is tough to read) holds a slot for an adjective and
another one for a verb in the infinitive. A covarying-collexeme analysis can determine typical adjective-verb combinations, which then allow conclusions about
the semantic frames that are most strongly associated with the construction.
Each of the three methods is presented in a separate section below. Before that, a few
remarks on practical matters are in order. Any collostructional analysis begins with
Collostructional analysis 393
the choice of an appropriate corpus and concordancing software. The next step in the
work process is an exhaustive extraction of the construction that one is interested in.
All examples are needed in order to obtain usable results. Depending on the form of
the construction, this may not be trivial. While it is fairly easy to retrieve all examples
of be going to V, even from an untagged corpus, finding all examples of the ditransitive construction or all examples of object-to-subject raising is more difficult. In most
cases, it is advisable to cast a wide net first and weed out false positives by manual
checking. This will often take some time. A practical way to handle this process is to
put all concordance lines in a spreadsheet file, noting for each example in a separate
column whether or not it instantiates the target construction. A cleaned-up concordance with all examples of the target construction is the basis for all three methods that
are outlined below (and in fact for many other corpus-linguistic methods as well).
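The spreadsheet step just described can of course also be scripted; the following Python sketch (the concordance lines and the 'keep' column are invented for illustration) shows the filtering logic:

```python
import csv
import io

# A hypothetical mini-concordance in tab-separated format; the 'keep'
# column records the outcome of the manual checking described above
raw = (
    "line\tkeep\n"
    "It is hard to be a corpus linguist.\ty\n"
    "It is here to stay.\tn\n"  # false positive: no predicative adjective
    "It is important to give evidence.\ty\n"
)

# Keep only the rows that were marked as true instances
rows = csv.DictReader(io.StringIO(raw), delimiter="\t")
cleaned = [r["line"] for r in rows if r["keep"] == "y"]
print(len(cleaned))  # 2
```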
In this chapter, the model construction that will serve to illustrate all three analytical methods is a subtype of the English it-extraposition construction (Kaltenböck
2005), in which an expletive subject it is followed by a copula, a predicative adjective
and a to-infinitive clause:
(1) a. It is hard to be a corpus linguist.
b. It is important to give evidence.
c. It is tempting to speculate about grammar.
The following sections discuss how collostructional analysis can be used for a semantic analysis of the it’s ADJ to V-construction. All steps are outlined in such a way
that they can be re-created directly by the reader, and in fact this is encouraged. Alternatively, a package with all the datasets discussed in this chapter is available upon
request from the author. The first analytical step would be to extract all examples of
the it’s ADJ to V-construction from a corpus such as the BNC, which can be freely
accessed online (Davies 2004). The construction is fairly frequent; a corpus search
for the pronoun it followed by a form of the copula and a to-infinitive yields 17,216
instances after manual clean-up.
2. Collexeme analysis
Collexeme analysis (Stefanowitsch and Gries 2003) is the most basic of the three collostructional methods. It is used to find out which lexical items in a given slot are most strongly attracted to a grammatical construction, for example the preferred adjectives of it’s ADJ to V. With a spreadsheet of a complete concordance, this question is easy to answer. In the BNC, it’s ADJ to V occurs with 462 different adjective types in total. Table 1 shows the ten most frequent types.
These raw frequencies are a start, but more information is needed. As was mentioned above, a collexeme analysis determines whether a given lexical item occurs
Table 1. The ten most frequent adjectives in the it’s ADJ to V-construction

Adjective     Tokens
possible       2,434
difficult      1,949
important      1,844
necessary      1,367
easy           1,139
impossible       995
hard             979
interesting      453
better           340
essential        338
Table 2. Token frequencies in the BNC and in the it’s ADJ to V-construction

Adjective     Corpus frequency (BNC)   Construction frequency (it’s ADJ to V)
possible      33,654                   2,434
difficult     21,608                   1,949
important     38,715                   1,844
necessary     17,863                   1,367
easy          13,935                   1,139
impossible     6,822                     995
hard          15,285                     979
interesting    9,417                     453
better        20,811                     340
essential      8,633                     338
with a construction more often than expected. In order to state whether 453 tokens
of interesting are anything out of the ordinary, the token frequencies of the adjectives
in Table 1 need to be compared against their overall token figures in the BNC. How
are these frequencies obtained? One possibility would be to consult a word frequency
list from the BNC, several of which can be found on the Internet. Another possibility
would be to conduct another corpus search, this time for all 462 adjective types that
are found in it’s ADJ to V. Whichever option is chosen, the results should be organized
in a table that lists each adjective with its full corpus frequency and its frequency in it’s
ADJ to V. Table 2 shows the uppermost part of such a table.
With this table in place, it is almost time to run the collexeme analysis. Saved in
.txt format with tab stops, it constitutes the most important input. In addition, it is
necessary to include two numbers: the first one is the token frequency of it’s ADJ to
V (17,216); and the second one is the combined token frequency of all elements that
could potentially occur in it’s ADJ to V, which means essentially all adjectives that
occur in the BNC (13,704,818). This figure is best obtained through a tag-sensitive
search for all adjectives, or again from a BNC word list. The most convenient way
to run a collostructional analysis is to use the open source software R, together with
a script written by Stefan Th. Gries that is available upon request (Gries 2004). The
script prompts the user for the following four items:
– The name of the construction one wishes to study (type: it’s ADJ to V).
– The combined frequency of all potential collexemes (type: 13704818).
– The combined frequency of all construction tokens (type: 17216).
– A table with the three columns ‘adjective’, ‘corpus_frequency’, and ‘construction_frequency’ (the script will open a pop-up window that allows the selection of the .txt file).
Armed with these pieces of information, the script generates a table that lists all adjectives that are found in the construction, together with their observed and expected frequencies and a measure of their attraction to the construction. This measure, which uses Fisher’s exact test and reflects the p-values resulting from that test, is referred to in the following as collostructional strength: the higher it is, the stronger the mutual attraction. Table 3 shows the most strongly attracted collexemes of it’s ADJ to V.
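The underlying computation can be sketched without the script itself; the following stdlib-only Python code builds the one-tailed Fisher-Yates p-value for a single collexeme (a reconstruction of the general logic, not the script's actual implementation):

```python
from math import exp, lgamma, log10

def log_choose(n, k):
    # Log of the binomial coefficient, computed via log-gamma
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_attraction(obs, word_freq, cx_freq, corpus_size):
    """One-tailed Fisher p-value for observing >= obs co-occurrences."""
    denom = log_choose(corpus_size, cx_freq)
    p = 0.0
    for k in range(obs, min(word_freq, cx_freq) + 1):
        p += exp(log_choose(word_freq, k)
                 + log_choose(corpus_size - word_freq, cx_freq - k)
                 - denom)
    return p

# Figures for 'advisable' from Table 3: 135 construction tokens out of
# 530 corpus tokens, with 17,216 construction slots and 13,704,818
# adjective tokens in the BNC overall
expected = 530 * 17216 / 13704818
p = fisher_attraction(135, 530, 17216, 13704818)
strength = -log10(p)
print(round(expected, 2), strength > 1.3)  # 0.67 True
```

Observing 135 tokens where only about 0.67 are expected yields an extremely small p-value, hence the very high collostructional strength reported in Table 3.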
In this particular case, the nine most strongly attracted collexemes are actually
all found within the ten most frequent adjectives, so one might wonder whether the
additional effort of a collostructional analysis is warranted. However, several points
suggest that it is. First, throughout the table, one can see diverging values of collostructional strength and construction frequency; essential is less frequent than better,
but more strongly associated with it’s ADJ to V. Second, the values of collostructional
strength can be used to distinguish between elements that are significantly attracted
to the construction (a value that is larger than 1.3 indicates significant attraction at
the p-level of 0.05; the value Infinite indicates that the probability of error is infinitely small) and those that are not, which is information that construction frequencies
alone cannot provide. The adjectives embarrassing and artificial (not shown in the
table) both occur four times in the database, but only the former is significantly attracted to the construction. Lastly, the analysis identifies elements with construction
frequencies that are significantly lower than those expected by chance. These elements, for instance, economic, clear, and modern (not shown in the table), can be said
to be repelled by the construction. Given the high overall frequencies of these adjectives, they should occur in the construction more often than they do. The following
examples from the BNC are thus unidiomatic, unusual examples of it’s ADJ to V.
Table 3. Collexemes of the it’s ADJ to V-construction

Collexeme     Corpus frequency   Construction frequency   Expected frequency   Collostruction strength
difficult     21,608             1,949                    27.15                Infinite
easy          13,935             1,139                    17.51                Infinite
essential      8,633               338                    10.85                Infinite
hard          15,285               979                    19.2                 Infinite
important     38,715             1,844                    48.64                Infinite
impossible     6,822               995                     8.57                Infinite
interesting    9,417               453                    11.83                Infinite
necessary     17,863             1,367                    22.44                Infinite
possible      33,654             2,434                    42.28                Infinite
advisable        530               135                     0.67                262.84
tempting         666               141                     0.84                261.67
better        20,811               340                    26.15                246.38
reasonable     6,072               207                     7.63                213.62
useful         9,960               198                    12.51                159.61
best          26,088               259                    32.78                136.88
wise           1,896               106                     2.38                132.50
convenient     1,966                95                     2.47                112.83
fair           8,166               143                    10.26                108.11
safe           6,330               130                     7.95                106.87
…             …                  …                        …                    …
(2) a. It is economic to make the accumulators faster than the store elements.
b. It is clear to see at most training clubs that praise is in very short supply.
c. Babies aren’t very nice for the first couple of years of their lives either,
though it is modern to say the opposite.
Once the results of a collostructional analysis have been obtained, it is the task of the
analyst to make sense of the data. Lists such as the one in Table 3 can be interpreted qualitatively to assess the constructional semantics. A particularly revealing result
would be to find that there are groups of semantically closely-related items that are
attracted to the construction under investigation. In the case of it’s ADJ to V, one can
identify adjectives that make reference to different scales, such as ease (difficult, easy,
hard), possibility (possible, impossible), importance (important, essential, necessary),
and advisability (advisable, better, best, wise). Further adjectives instantiating these
scales are found if more collexemes are inspected than just the top twenty. When lists
of attracted collexemes are broken down into sets of semantically related items, these
groupings should be done on the basis of close examination of actual examples in
their contexts. It is necessary to look beyond the mere list of attracted elements, since
lexical items typically can be used in a range of different meanings. Another important part of a collexeme analysis is to consider the repelled items in detail. The relative
absence of certain lexical items in a construction can yield clues about the constraints
that govern its usage, and so a characterization in negative terms can be very useful.
Before we turn to the next section, a few problematic issues need to be mentioned. One problem that can be observed in Table 3 concerns items that are very
strongly attracted to a construction. For no less than nine adjectives in it’s ADJ to V,
the collexeme analysis returns an infinitely strong value of collostruction strength.
This is an issue that will frequently arise in constructions with low type frequencies.
To illustrate, a collexeme analysis of what Schmid (2000) calls shell nouns (the fact
that CLAUSE, the decision to V, etc.) is very likely to register infinite collostructional
strength for nearly all participating nouns because these form such a small and relatively closed set. This is not a problem for collostructional analysis per se, but it has to
be noted that not all constructions are fruitfully investigated in this way. Good candidates for a collexeme analysis are semantically general constructions (be going to, the
caused-motion construction, the ditransitive construction, etc.) with a type frequency
of 100+ elements. The latter figure, of course, varies with the size of the corpus that is
chosen. Even when such a threshold is surpassed, as is the case for several shell noun
constructions, the analysis may still suffer if most of the types occur almost exclusively within those constructions, and not much in other contexts.
A second problem concerns polysemous elements such as the adjective hard,
which is very frequent in the it’s ADJ to V construction. In the construction, it is
always used in the sense of ‘difficult’. Nonetheless, its corpus frequency also reflects
all those instances in which it means ‘solid’. The inclusion of these instances weakens
its collostructional strength. Polysemous items are therefore at a disadvantage in a
collexeme analysis. There is no easy way to address this problem, as manual coding
for different word senses is often not a feasible option. The best strategy is to be on
the lookout for polysemous collexemes and to consider carefully where a collexeme
analysis might disfavor a particular element.
There are several other criticisms of collostructional analysis that have been raised and that would merit discussion; the interested reader is referred in particular to the criticisms in Bybee (2010) and the responses in Gries (2012).
3. Distinctive collexeme analysis
Distinctive collexeme analysis (Gries and Stefanowitsch 2004a) contrasts constructions in their respective collocational preferences. The method is particularly suited
for the study of semantically related constructions, such as, for example, the ditransitive construction and the prepositional dative construction. A distinctive collexeme
analysis can bring to light the fact that two constructions which at first glance appear to be synonymous do in fact display subtle differences. How then could an analysis
of this kind be used for the study of it’s ADJ to V? In order to apply it, an alternative
construction has to be identified, so that there are two constructions to compare. For
instance, it might be fruitful to contrast it’s ADJ to V against a corresponding construction with a non-extraposed predicative adjective, as illustrated in (3).
(3) a. It is hard to be a corpus linguist.
b. [To be/Being/The life of] a corpus linguist is hard.
If one is interested in the question of when and why speakers choose to use an extraposed adjective with an expletive subject pronoun it rather than a non-extraposed predicative construction, looking at the respective collocational preferences partly answers the ‘when’ question and at least provides hints for the ‘why’ question. In order
to run a distinctive collexeme analysis, both constructions need to be exhaustively
extracted from the same corpus. The data from the collexeme analysis of it’s ADJ to V
in the previous section can thus simply be re-used, but it has to be complemented by
a full concordance of what is referred to here as the X is ADJ construction. The BNC
holds 65,823 tokens of this construction that are distributed across 5,733 different
adjective types. Table 4 contrasts the ten most frequent adjective types in it’s ADJ to
V and X is ADJ.
For the computation of a distinctive collexeme analysis, this data needs to be
brought into a slightly different format, which is shown in Table 5. The input data for
a distinctive collexeme analysis consists of the respective token frequencies for all adjective types that are attested in either of the two constructions (5,754 in total), saved
Table 4. The ten most frequent adjectives in it’s ADJ to V and X is ADJ

Adjective     it’s ADJ to V   Adjective    X is ADJ
possible      2,434           right        1,229
difficult     1,949           available    1,032
important     1,844           wrong          905
necessary     1,367           concerned      859
easy          1,139           true           853
impossible      995           different      823
hard            979           necessary      796
interesting     453           dead           771
likely          357           good           682
better          340           possible       657
…             …               …            …
Table 5. Token frequencies in it’s ADJ to V and X is ADJ

Adjective     Construction frequency   Construction frequency
              (it’s ADJ to V)          (X is ADJ)
possible      2,434                    657
difficult     1,949                    279
important     1,844                    652
necessary     1,367                    796
easy          1,139                    383
impossible      995                    316
hard            979                     86
interesting     453                    136
better          340                    269
essential       338                    391
…             …                        …
in .txt format with tab stops. The script prompts the user for the frequencies of both
constructions and the location of the data file.
While some asymmetries can already be gleaned from Tables 4 and 5, a distinctive collexeme analysis can give some more tangible guidelines for a semantic contrast
of the two constructions.
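The expected frequencies that the method compares against are proportional splits of each adjective's total across the two constructions; a short Python sketch of this arithmetic, using the figures for difficult:

```python
# Observed frequencies of 'difficult' in the two constructions (Table 5)
# and the construction sizes reported in the text
obs_extraposed, obs_predicative = 1949, 279
n_extraposed, n_predicative = 17216, 65823

total = obs_extraposed + obs_predicative   # all relevant tokens of 'difficult'
grand = n_extraposed + n_predicative

# Split the total in proportion to the sizes of the two data sets
exp_extraposed = total * n_extraposed / grand
exp_predicative = total * n_predicative / grand
print(round(exp_extraposed, 2), round(exp_predicative, 2))  # 461.92 1766.08
```

These are exactly the expected values listed for difficult in Table 6a; the method then tests the observed counts against this split.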
Table 6a shows the ten most distinctive collexemes of it’s ADJ to V. It can be seen
that the results of the collexeme analysis are largely reproduced. The semantic scales
of possibility, ease, advisability, and importance all reappear, suggesting that these
parameters are central enough to the constructional semantics to figure in a contrast
with a predicative construction.
Table 6a. The top ten distinctive collexemes of it’s ADJ to V

Distinctive   Observed          Observed     Expected          Expected     Coll.
collexemes    (it’s ADJ to V)   (X is ADJ)   (it’s ADJ to V)   (X is ADJ)   strength
difficult     1,949             279           461.92           1766.08      Infinite
easy          1,139             383           315.55           1206.45      Infinite
hard            979              86           220.80            844.20      Infinite
important     1,844             652           517.48           1978.52      Infinite
impossible      995             316           271.8            1039.20      Infinite
necessary     1,367             796           448.44           1714.56      Infinite
possible      2,434             657           640.84           2450.16      Infinite
interesting     453             136           122.11            466.89      188.19
best            259             109            76.30            291.70       92.70
tempting        141               4            30.06            114.94       89.70
Table 6b. The top ten distinctive collexemes of X is ADJ

Distinctive   Observed          Observed     Expected          Expected     Coll.
collexemes    (it’s ADJ to V)   (X is ADJ)   (it’s ADJ to V)   (X is ADJ)   strength
available      0                1,032         213.96            818.04      104.87
concerned      0                  859         178.09            680.91       87.18
different      0                  823         170.63            652.37       83.51
dead           0                  771         159.85            611.15       78.21
right         92                1,229         273.88           1047.12       43.65
clear          1                  433          89.98            344.02       41.86
high           0                  378          78.37            299.63       38.24
present        0                  372          77.12            294.88       37.63
true          51                  853         187.42            716.58       37.16
successful     0                  341          70.70            270.30       34.49
The most distinctive collexemes of the predicative X is ADJ construction are
shown in Table 6b. They include adjectives denoting truth values (right, true), but
mostly consist of reasonably frequent adjectives that can only be predicated over referential nominals, i.e. not with a non-referential it. This concerns the elements available, concerned, dead, high, present, and successful. These do not occur in it’s ADJ to V
and consequently show up in Table 6b as distinctive for X is ADJ.
As was mentioned above, a distinctive collexeme analysis can be usefully applied
when a theoretical point hinges on the fact that two constructions are non-synonymous. Collocational differences can be used to make this point, and they allow the
analyst to flesh out the semantic contrasts between the two constructions.
4. Covarying-collexeme analysis
The analyses in the previous sections have established that the it’s ADJ to V-construction selects for adjectives that concern semantic frames such as possibility, ease, advisability, and importance. As yet, it has not been asked with what kinds of verbs
the construction might occur preferentially, and whether there could be any interdependencies between certain kinds of adjectives and certain kinds of verbs. A covarying-collexeme analysis addresses precisely this point. Given a construction with
two slots that can be lexically filled, can we determine combinations of elements that
occur more often than would be expected by chance? In the data at hand, the examples of it’s ADJ to V contain 6,257 different combinations, the ten most frequent of
which are shown in Table 7. Also shown are the respective overall frequencies of the
participating adjectives and verbs.
Table 7. The ten most frequent adjective-verb (ADJ-V) combinations of it’s ADJ to V

Combination             Tokens   ADJ frequency   V frequency
interesting to note     244        453             479
difficult to see        218      1,949           1,032
easy to see             190      1,139           1,032
important to note       164      1,844             479
important to remember   152      1,844             200
hard to see             149        979           1,032
fair to say             103        143             448
hard to imagine         102        979             283
hard to believe          93        979             224
possible to make         90      2,434             345
Table 8. Input for a covarying-collexeme analysis of it's ADJ to V

Adjective   Verb
absurd      suppose
absurd      suppose
absurd      suppose
absurd      think
absurd      think
absurd      think
absurd      assume
absurd      be
absurd      blame
absurd      compare
…           …
A cursory look at Table 7 already reveals that adjectives denoting ease and difficulty (difficult, easy, hard) co-occur with verbs that denote cognitive processes (see,
imagine, believe); the adjective important co-occurs with verbs that can be used to
introduce a statement (note, remember). A covarying-collexeme analysis can identify
those combinations that are significantly more frequent than expected, given the respective frequencies of their participating elements. This in turn may guide an analysis of the semantic frames that are associated with the construction that is studied.
Again, the input has to be brought into a specific format, which is shown in Table 8.
What is needed is a table with two columns in which all observations are listed. In the
present case, this means that the table has 17,216 lines, each of which consists of an
402 Martin Hilpert
Table 9. The top twenty covarying collexemes of it's ADJ to V

Combination               Observed   Expected   Coll. strength
interesting to note            244      12.60           278.86
fair to say                    103       3.72           133.29
important to remember          152      21.42           105.44
true to say                     49       1.33            75.72
reasonable to assume            55       1.89            66.40
hard to believe                 93      12.74            55.89
hard to imagine                102      16.09            54.33
important to note              164      51.31            43.98
unrealistic to expect           23       0.19            43.92
reasonable to suppose           30       0.76            40.76
easy to see                    190      68.28            39.73
important to realise            54       8.03            35.36
important to stress             45       5.89            33.87
interesting to compare          33       2.21            29.78
easy to forget                  33       3.04            28.47
reasonable to expect            25       1.02            27.58
hard to see                    149      58.69            26.31
unreasonable to expect          14       0.13            25.82
good to be                      51       8.38            25.16
important to recognize          41       6.64            24.67
adjective and a verb, separated by a tab stop. Whilst the other collostructional analyses usually take a matter of seconds to compute, a covarying-collexeme analysis with the dataset just described may take longer to run.
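The counting step behind such an analysis is straightforward to reproduce. The fragment below is a small Python sketch, not the Coll.analysis program for R that is normally used for this task; the eight (adjective, verb) rows are invented stand-ins for the 17,216 rows of the real input.

```python
from collections import Counter

# Invented (adjective, verb) observations, one pair per row of the
# two-column input format illustrated in Table 8.
rows = [
    ("hard", "see"), ("hard", "see"), ("hard", "imagine"),
    ("easy", "see"), ("easy", "forget"), ("important", "note"),
    ("important", "note"), ("important", "remember"),
]

n = len(rows)
pair_freq = Counter(rows)                # observed combination frequencies
adj_freq = Counter(a for a, _ in rows)   # marginal adjective frequencies
verb_freq = Counter(v for _, v in rows)  # marginal verb frequencies

# Under independence of the two slots, the expected frequency of a pair
# is (adjective frequency * verb frequency) / total observations.
for (adj, verb), obs in pair_freq.most_common():
    exp = adj_freq[adj] * verb_freq[verb] / n
    print(f"{adj} to {verb}: observed {obs}, expected {exp:.2f}")
```

Combinations whose observed frequency exceeds this expectation by a significant margin are the covarying collexemes reported below.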
The most important results of the analysis are presented in Table 9. The combinations are listed alongside their observed frequencies in the database, which in each
case exceed the value that is expected by chance. All combinations that are shown
display a significant mutual attraction, as can be read off the rightmost column.
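The expected values in Table 9 can in fact be recomputed from the marginal frequencies given in Table 7: under the assumption that the two slots are filled independently, the expected frequency of a combination is the product of its adjective and verb frequencies divided by the total of 17,216 observations. A quick check in Python:

```python
# Expected pair frequencies under independence of the two slots,
# using the marginal counts from Table 7 and N = 17,216 observations.
N = 17216
for combo, adj_f, verb_f in [
    ("interesting to note", 453, 479),
    ("easy to see", 1139, 1032),
    ("hard to see", 979, 1032),
]:
    print(f"{combo}: {adj_f * verb_f / N:.2f}")
```

The three results, 12.60, 68.28 and 58.69, match the Expected column of Table 9.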
Perhaps more tellingly than the analyses in the previous sections, the covarying-collexeme analysis brings out the rhetorical discursive functions of the it’s ADJ
to V-construction, which are also discussed by Kaltenböck (2005: 145). Many of the
combinations in Table 9 fulfil a function that can be described as setting the stage for
a new piece of information in discourse. Phrases such as It is interesting to note / fair to
say / important to remember / etc. do not carry focal information in themselves and are
usually less prominently stressed than the material that follows. But in addition, the
table lists combinations such as hard to believe / hard to imagine / hard to see, which a
speaker can use to signal a negative epistemic stance towards a following proposition.
In addition, there are combinations with the verbs expect and suppose that pertain to
the validity of an inference. A full semantic analysis of it’s ADJ to V should attempt to
flesh out these types in more detail and describe their interrelations. An in-depth look
at the covarying collexemes of a construction can thus yield more detailed insights
than a study that concentrates on just a single slot of that construction.
5. Concluding remarks
The preceding sections have surveyed a number of research questions that can be
addressed through different collostructional analyses. Hopefully, some readers will be encouraged to read further, familiarize themselves with the methods, and run their own analyses. Besides what has been presented in this chapter, further ways
to apply collostructional analyses have been suggested. For example, collostructional
analysis can be used for the study of sociolinguistic variation (Wulff et al. 2007) and
language change (Hilpert 2006, 2012). The value of any collostructional analysis ultimately depends on the way it is brought to bear on a question of theoretical interest.
Much as with other statistical methods, the hard task is not the running of the numbers, but the development of a research question that the chosen technique can answer. Used in this way, collostructional analysis is a tool that can substantially advance
our understanding of grammatical constructions and their meanings.
References
Bybee, J. L. (2010). Language, usage and cognition. Cambridge: Cambridge University Press.
DOI: 10.1017/CBO9780511750526
Davies, M. (2004). BYU-BNC: The British National Corpus. Available via http://corpus.byu.edu/
bnc.
Goldberg, A. E. (1995). Constructions. A construction grammar approach to argument structure.
Chicago: University of Chicago Press.
Gries, St. Th. (2004). Coll.analysis 3. A program for R for Windows 2.x.
Gries, St. Th. (2012). Frequencies, probabilities, association measures in usage-/exemplar-based
linguistics: Some necessary clarifications. Studies in Language, 36(3), 477–510.
DOI: 10.1075/sl.36.3.02gri
Gries, St. Th., & Stefanowitsch, A. (2004a). Extending collostructional analysis: A corpus-based
perspective on ‘alternations’. International Journal of Corpus Linguistics, 9(1), 97–129.
DOI: 10.1075/ijcl.9.1.06gri
Gries, St. Th., & Stefanowitsch, A. (2004b). Co-varying collexemes in the into-causative. In
M. Achard, & S. Kemmer (Eds.), Language, culture, and mind (pp. 225–236). Stanford: CSLI.
Hilpert, M. (2006). Distinctive collexemes and diachrony. Corpus Linguistics and Linguistic
Theory, 2(2), 243–256. DOI: 10.1515/CLLT.2006.012
Hilpert, M. (2012). Diachronic collostructional analysis. How to use it, and how to deal with
confounding factors. In J. Robinson, & K. Allan (Eds.), Current methods in historical semantics (pp. 133–160). Berlin & New York: Mouton de Gruyter.
Kaltenböck, G. (2005). It-extraposition in English: A functional view. International Journal of
Corpus Linguistics, 10(2), 119–159. DOI: 10.1075/ijcl.10.2.02kal
Schmid, H.-J. (2000). English abstract nouns as conceptual shells: From corpus to cognition.
Berlin/New York: Mouton de Gruyter. DOI: 10.1515/9783110808704
Stefanowitsch, A., & Gries, St. Th. (2003). Collostructions: Investigating the interaction between words and constructions. International Journal of Corpus Linguistics, 8(2), 209–243.
DOI: 10.1075/ijcl.8.2.03ste
Stefanowitsch, A., & Gries, St. Th. (2005). Covarying collexemes. Corpus Linguistics and Linguistic Theory, 1(1), 1–43. DOI: 10.1515/cllt.2005.1.1.1
Wulff, S., Stefanowitsch, A., & Gries, St. Th. (2007). Brutal Brits and persuasive Americans:
Variety-specific meaning construction in the into-causative. In G. Radden, K.-M. Köpcke,
T. Berg, & P. Siemund (Eds.), Aspects of meaning construction (pp. 265–281). Amsterdam:
John Benjamins.
Cluster analysis
Finding structure in linguistic data
Dagmar Divjak and Nick Fieller
University of Sheffield
Cluster analysis is an exploratory data analysis technique, encompassing a number of different algorithms and methods for sorting objects into groups. Cluster
analysis requires the analyst to make choices about dissimilarity measures,
grouping algorithms, etc., and these choices are difficult to make without an
understanding of their theoretical implications and a very good understanding
of the data. This chapter provides an introduction to the distance measures and
clustering algorithms most commonly used for cluster analytic work. Different
from Baayen (2008), Johnson (2008) and Gries (2009), its main aim is to equip
the researcher with at least a basic understanding of what is happening when a
dataset is explored with the help of a particular cluster analytic technique.
Keywords: clustering algorithms, distance measures
1. Introduction
We organisms are sensorimotor systems. The things in the world come in contact
with our sensory surfaces, and we interact with them based on what that sensorimotor contact “affords”. (…) At bottom, all of our categories consist in ways we behave
differently toward different kinds of things, whether it be the things we do or don’t
eat, mate with, or flee from, or the things that we describe, through our language, as
prime numbers, affordances, absolute discriminables, or truths. And isn’t that all that
cognition is for – and about?
(Stevan Harnad 2005: To cognize is to categorize: Cognition is categorization)
One of the key concepts in cognitive linguistics is categorization. To be able to categorize things is a necessary and innate capacity: we need to be able to recognize,
distinguish and understand in order to survive. Our categories signal, for example,
whether the mushroom we pick is edible or not, and whether the animal we encounter
is harmless or dangerous. Survival is the main goal of cognition. Categorization is
equally fundamental in language. Growing up, we not only learn which categories are
relevant for us to function in our environment. We also acquire the categories of our
language and learn to use a limited number of words and rules to name a large number
of different items and to express an unlimited number of experiences.
And things do not stop with language. As is the case in other disciplines, categorization is also important in the scientific study of language. As early as the 5th century BC, Sanskrit grammarians grouped words into classes – that would later become
known as parts of speech – distinguishing between inflected nouns and verbs and uninflected pre-verbs and particles. Categorization efforts have been carried out across
linguistic sub-disciplines, ranging from phonology, morphology, syntax and semantics, to discourse analysis and pragmatics. Despite its long history, even for parts of
speech there is currently no generally agreed-upon classification scheme that would
apply to all languages, or even a set of criteria upon which such a scheme should be
based. In some cases, linguists are only now trying to organize entities into groups so
that they can be compared and described.
With desktop computers replacing filing cabinets, it has become easy for linguists
to create very large databases that contain information on a multitude of properties.
At the same time, this complexity may make it too difficult for the human analyst to
detect any structure. One way of solving this issue is by running a cluster analysis.
Cluster analysis (a term first used by Tryon in 1939) is a multivariate analysis technique that organizes information about how similar objects or entities are so that
groups, or clusters, can be formed. Pioneered in machine learning, cluster analysis has
found its way via sociolinguistics (e.g. Shaw’s 1974 work on dialectal boundaries) to
linguistics, where it has now been used to describe a wide range of linguistic phenomena (see references elsewhere in this volume). It is also reaching beyond linguistics
into other Arts and Humanities disciplines, aiding, for example, in the classification of
texts according to relevant dimensions (see Alviar 2008 for the application of cluster
analysis in the style characterization of New Testament texts).
Cluster analysis is an exploratory data analysis technique, encompassing a number of different algorithms and methods for sorting different objects into groups in
such a way that the similarity between two objects of the same group is maximal and
the similarity between two objects that belong to different groups is minimal. In other
words, cluster analysis can be used to discover structures in data and it does this without explaining why that structure exists. Cluster analysis is thus not a routine single
statistical test based on probability theory; instead, it is a data analytic technique, a
collection of different algorithms that put objects into clusters according to well-defined similarity rules. It is mostly used when we do not have any a priori hypotheses,
but are in the exploratory phase of our research. Yet in contrast to other exploratory
methods of this type such as Principal Components Analysis and Multidimensional Scaling, cluster analysis requires the analyst to make choices about dissimilarity
measures, grouping algorithms, etc., and these choices are difficult to make without
having an understanding of their theoretical implications and a very good understanding of the data. This latter requirement is particularly important since, in contrast to many other statistical methods, there seem to be fewer diagnostics informing the analyst of the weaknesses of any proposed classification solution.
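The two choices just mentioned can be made concrete in a few lines of code. The sketch below is a pure-Python toy, not the R-based workflow this chapter otherwise assumes: the dissimilarity measure is Euclidean distance, the grouping algorithm is single-linkage agglomerative clustering, and the five two-dimensional data points are invented for illustration.

```python
from math import dist

# Invented data: five objects described by two numeric features each.
points = {"a": (0.0, 0.0), "b": (0.2, 0.1), "c": (5.0, 5.0),
          "d": (5.1, 4.9), "e": (9.0, 0.5)}

# Choice 1: a dissimilarity measure -- here, Euclidean distance.
def dissimilarity(p, q):
    return dist(points[p], points[q])

# Choice 2: a grouping algorithm -- here, single-linkage agglomerative
# clustering: repeatedly merge the two clusters whose closest members
# lie nearest to each other, until k clusters remain.
def single_linkage(labels, k):
    clusters = [[label] for label in labels]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dissimilarity(p, q)
                               for p in clusters[ij[0]]
                               for q in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return [sorted(c) for c in clusters]

print(single_linkage(list(points), 3))
```

Swapping in a different dissimilarity (e.g. Manhattan distance) or a different linkage (e.g. complete linkage, which merges on the farthest rather than the closest members) can yield a different partition of the same data, which is why these choices need to be understood rather than made by default.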
Before getting started, it is important to stress the focus of this chapter. Like
Baayen (2008: Section 5.1.5), Johnson (2008: Ch. 6) and Gries (2009: Ch. 5.5) before
us, this chapter provides an introduction to the distance measures and clustering algorithms most commonly used (for an idea of the amazing variety of functions for
cluster analysis that R has to offer, see http://cran.r-project.org/web/views/Cluster.html). Our main aim is, however, to equip the researcher with at least a basic understanding of what is happening behind the scenes when s/he explores his/her data with
the help of a particular cluster analytic technique using R (without actually answering
the one