LEL PhD training day, Jan 21 2008
Theory construction and comparison
“Disciplines differ considerably in the relative emphasis they place on data collection
versus theory construction. In physics, there is a clear division of labor between
experimentalists and theorists. Linguistics, too, has subfields (including psycholinguistics
and sociolinguistics) in which theories tend to be data-driven and others (notably
generative grammar) that focus almost exclusively on the formulation of elegant theories,
with little attention devoted to careful data collection. Unfortunately, the findings of the
experimentalists in linguistics very rarely play a role in the work of generative
grammarians. Rather, theory development tends to follow its own course, tested only by
the unreliable and sometimes malleable intuitions of the theorists themselves. The
theories are consequently of questionable relevance to the facts of language.”
(Wasow & Arnold 2005)
Invented examples
• Invention allows for the construction of basic examples, unencumbered
by extraneous or irrelevant material
• Invention allows for the construction of paradigms, in which examples
differ minimally with respect to a given characteristic
• Invention allows for the construction of “ungrammatical” examples
• Invention allows for the construction of examples which occur rarely (if
ever) in a given corpus
Introspection
The construction of invented examples relies on “primary intuitions”, i.e.
introspective judgements of well-formedness (typically by the linguist).
“Some argue that primary intuitions are cleaner than other forms of data
because they somehow escape the semantic and pragmatic dimensions of
language use. But making judgements of well-formedness is a type of
language use, albeit a somewhat unusual one. Consulting primary intuitions
unavoidably involves attempting to assign a meaning and to imagine a
context in which the expression under consideration might be used. By
leaving all contextual factors up to the imagination, the use of primary
intuitions regarding sentences in isolation is arguably more subject to
irrelevant interference than an experimental method that explicitly controls
context.” (Wasow & Arnold 2005)
Issues with primary intuitions
“With the explosive growth of language technologies, it is increasingly recognized
that the traditional ways of collecting linguistic data are deeply flawed.
Although grammaticality judgments are considered an extremely rich source
of data, it has long been evident that introspections about decontextualized,
constructed examples – especially in syntactic and semantic domains – are
unreliable and inconsistent, as pointed out by sociolinguists and dialectologists
(Labov 1975, 1996; Cornips & Poletto 2004). Improvements in experimental
judgment elicitation techniques have been suggested (Schütze 1996, Cowart
1997, Bard et al. 1996), but the constructed sentences used in many controlled
psycholinguistic experiments are themselves highly artificial, lacking discourse
cohesion and subject to assumptions about default referents (Roland & Jurafsky
2002). Moreover, theoretical linguists are usually unaware of the multiple
variables that are known to affect linguistic judgments and can hardly control
for them (Gries 2005).”
(Bresnan 2007)
Labov’s Working Principles (1975/1987)
I. The Consensus Principle: if there is no reason to think otherwise, assume
that the judgments of any native speaker are characteristic of all speakers
of the language.
II. The Experimenter Principle: if there is any disagreement on introspective
judgments, the judgments of those who are familiar with the theoretical
issues may not be counted as evidence.
III. The Clear Case Principle: disputed judgments should be shown to include
at least one consistent pattern in the speech community or be abandoned.
But
“Linguists are building on sand until they can answer basic questions: what
are the test-retest reliabilities of judgments of grammatical acceptability?
Under what conditions do introspections match speech production? What
are the sources of bias?”
Clear cases?
Variance in judgements of acceptability (Cowart 1997)
“The fact that the syntax literature is characterized by consistent inattention to the
methods by which sentence judgments are gathered and summarized is not
compatible with due concern for the role of error variance.”
Example: that-trace
a. Who do you suppose invited Ann to the circus?
b. Who do you suppose Ann invited to the circus?
c. Who do you suppose that invited Ann to the circus?
d. Who do you suppose that Ann invited to the circus?
Each informant judged five different representatives of each of the four sentence types. A standard
psychophysical scaling method was used to express relative acceptability. The analysis covers a subset of 88
informants who had all judged the same experimental and filler sentences.
Results
“The analysis reveals what might appear to be an extraordinary degree of
inconsistency in informant responses; for roughly 90% of the informants, the
average range of variation within sentence type categories exceeded the
difference between the ‘that’-trace violations (c) and the average of the other
three types. Indeed, on average, for individual informants the mean range within
categories was more than 80% larger than the size of the ‘that’-trace effect. The
‘that’-trace effect was smaller than the average range within categories even for
half of the ten informants who showed the least overall variation within
categories. In short, for most informants a great many of their individual sentence
judgments were unrepresentative of the average acceptability of the category
from which the sentence was drawn.” (Cowart 1997, p. 33)
Informants show the ‘that’-trace effect if the
average of the five bars in the third column
is lower than the average of the bars in the
other three columns.
Informant 1: “best”
Informant 2: “typical” (difference =1.8, but
range of variation within each sentence type
exceeds this)
Informants 3-4: very variable within sentence types, but show the
‘that’-trace effect overall
Informant 5: subject extraction with ‘that’
better than object extraction
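The per-informant comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the ratings are invented (they are not Cowart's data), and the names `ratings`, `that_trace_effect` and `mean_within_range` are hypothetical. The sketch compares the size of the ‘that’-trace effect (mean of types a, b, d minus mean of type c) with the average range of judgments within a sentence type.

```python
from statistics import mean

# Hypothetical ratings for ONE informant: five tokens for each of the four
# sentence types (a)-(d). The numbers are invented for illustration only.
ratings = {
    "a_subject_no_that": [6.0, 4.5, 7.0, 5.0, 6.5],
    "b_object_no_that": [6.5, 5.0, 7.5, 4.5, 6.0],
    "c_subject_that": [4.0, 2.5, 5.5, 3.0, 5.0],
    "d_object_that": [5.5, 4.0, 7.0, 5.0, 6.0],
}


def that_trace_effect(r):
    """Mean of the three other types minus the mean of type (c)."""
    others = [mean(v) for k, v in r.items() if k != "c_subject_that"]
    return mean(others) - mean(r["c_subject_that"])


def mean_within_range(r):
    """Average, over the four types, of (max - min) within each type."""
    return mean([max(v) - min(v) for v in r.values()])


effect = that_trace_effect(ratings)
spread = mean_within_range(ratings)
print(f"effect = {effect:.2f}, mean within-type range = {spread:.2f}")
print("within-type range exceeds effect:", spread > effect)
```

With these invented numbers the informant shows the effect on average, yet single judgments vary more within a type than the effect itself, which is the situation the quotation describes for most informants.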
Nevertheless
With a well-designed experiment for eliciting judgments, such error
variations do not obscure the overall result.
Cowart, p19.
What is more
The experiment reveals more than might have been predicted by simple
introspection: object-extraction with ‘that’ is significantly less acceptable
than either subject extraction or object extraction without ‘that’!
Statistical analysis can reveal the contribution of each factor, and their interaction:
Cowart, p124
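As a rough illustration of what "the contribution of each factor, and their interaction" means in a 2 x 2 design (extraction site by presence of ‘that’), the following sketch computes the two main effects and the interaction from cell means. The cell means are invented, not taken from Cowart, and the sketch gives descriptive contrasts only; a real analysis would add a significance test such as an analysis of variance.

```python
# Hypothetical cell means on an arbitrary acceptability scale
# (invented for illustration; not Cowart's results).
cell_means = {
    ("subject", "no_that"): 0.40,  # type a
    ("object", "no_that"): 0.45,   # type b
    ("subject", "that"): -0.60,    # type c
    ("object", "that"): 0.15,      # type d
}

# Cost of adding 'that', separately for subject and object extraction.
that_cost_subject = cell_means[("subject", "that")] - cell_means[("subject", "no_that")]
that_cost_object = cell_means[("object", "that")] - cell_means[("object", "no_that")]

# Main effect of 'that': the average cost across both extraction sites.
main_effect_that = (that_cost_subject + that_cost_object) / 2

# Main effect of extraction site: object minus subject, averaged over 'that'.
main_effect_site = (
    (cell_means[("object", "no_that")] - cell_means[("subject", "no_that")])
    + (cell_means[("object", "that")] - cell_means[("subject", "that")])
) / 2

# Interaction: how much larger the cost of 'that' is for subject extraction.
interaction = that_cost_subject - that_cost_object

print(f"main effect of 'that':       {main_effect_that:+.2f}")
print(f"main effect of extraction:   {main_effect_site:+.2f}")
print(f"interaction (subj. - obj.):  {interaction:+.2f}")
```

In this invented example ‘that’ lowers acceptability for object extraction as well, but lowers it much more for subject extraction; the ‘that’-trace effect proper corresponds to the interaction term.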
Good design of elicitation experiments
• Factorial design
Comparing
Who do you suppose invited Ann to the circus?
with
Who do you suppose that invited Ann to the circus?
is not good enough, because the contribution of just adding ‘that’ is not
controlled for.
Solution: add the object extraction pair.
• Presentation
No more than one sentence in a token set is presented to each informant.
Introduction of filler sentences.
Randomization of order of presentation.
• Scaling methods
Ratio scale methods vs category scale methods.
Category scales (e.g. 1-5) have the drawback that the degree of acceptability
indicated by the difference between 1 and 2 may be different from that
between 4 and 5. Nevertheless, they are conceptually simple.
Ratio scales ask informants to judge relative acceptability. They are more
complicated to present, but can work well.
• Number of informants
Must be sufficient to be of use for statistical evaluation. For a token set like
the ‘that’-trace set, minimally 8.
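The presentation requirements above can be met with a Latin-square rotation. The sketch below is one possible implementation, not a procedure prescribed by Cowart: the function name `build_lists`, the item labels, and the choice of 20 token sets and 20 fillers are assumptions made for illustration. Each list contains exactly one version of every token set, the versions are rotated across lists, fillers are added, and the order is shuffled.

```python
import random

CONDITIONS = ["a", "b", "c", "d"]  # the four sentence types of the design


def build_lists(n_token_sets, n_fillers, seed=0):
    """Return one presentation list per Latin-square group.

    Every list holds exactly one condition from each token set (rotated
    across lists), plus the filler items, in randomized order.
    """
    rng = random.Random(seed)  # fixed seed so the lists are reproducible
    lists = []
    for group in range(len(CONDITIONS)):
        items = [
            (f"set{s}", CONDITIONS[(s + group) % len(CONDITIONS)])
            for s in range(n_token_sets)
        ]
        items += [(f"filler{f}", "filler") for f in range(n_fillers)]
        rng.shuffle(items)
        lists.append(items)
    return lists


# 20 token sets give each informant five sentences per sentence type.
lists = build_lists(n_token_sets=20, n_fillers=20)
for i, lst in enumerate(lists):
    print(f"list {i}: first items {lst[:3]}")
```

Informants are then assigned to the four lists in equal numbers, so that every version of every token set is judged equally often across the experiment.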
Corpus and other usage data
Introspection and acceptability judgments are often a poor guide to the
space of grammatical possibility
Usage data can reveal the gaps.
Example: the dative alternation (Bresnan 2007)
That movie gives me the creeps
*That movie gives the creeps to me
From the 1970s, pairs such as these have been used to claim that the double
object construction is associated with possession, while the prepositional
object construction is “allative”
“But many examples of the kinds claimed to be ungrammatical can be found
in current use on the web, including (5) from Bresnan & Nikitina (2003):
(5) a. This life-sized prop will give the creeps to just about anyone!
Guess he wasn’t quite dead when we buried him!
b. Stories like these must give the creeps to people whose idea
of heaven is a world without religion . . .
Again we must ask whether we can trust these examples from the web. Could
they simply be unrepresentative anomalies fished up from the vast depths of
the internet?”
Evidently, the “end-weight” principle is relevant here and outweighs our
original expectations.
Corpora and quantitative data
Representative corpora can provide quantitative data which can be
statistically evaluated.
Such data can be compared with results/grammars obtained by introspection
and elicitation of judgments.
For example, Bresnan & Nikitina’s (2003/2007) analysis of the dative
alternation, using Treebank Wall Street Journal, compared with elicited
judgments in a controlled experiment in Bresnan (2006).
[Figure: results of the corpus study (Bresnan & Nikitina 2003/2007) alongside those of the elicitation experiment (Bresnan 2006)]
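To illustrate what "statistically evaluated" can mean for corpus counts, here is a minimal sketch of a chi-square test of independence on a 2 x 2 table. It is a deliberately simple stand-in and not the analysis used in the cited studies. The counts are invented, and the grouping factor (pronominal vs lexical recipient) is an assumption chosen only for the example.

```python
import math

# Hypothetical corpus counts (invented for illustration):
# rows = recipient type, columns = construction.
counts = {
    "pronoun_recipient": {"double_object": 80, "prepositional": 20},
    "lexical_recipient": {"double_object": 40, "prepositional": 60},
}

rows = list(counts)
cols = ["double_object", "prepositional"]
row_totals = {r: sum(counts[r].values()) for r in rows}
col_totals = {c: sum(counts[r][c] for r in rows) for c in cols}
total = sum(row_totals.values())

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
chi2 = 0.0
for r in rows:
    for c in cols:
        expected = row_totals[r] * col_totals[c] / total
        chi2 += (counts[r][c] - expected) ** 2 / expected

# For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.2f}, df = 1, p = {p_value:.2g}")
```

A small p-value here would indicate only that the choice of construction is not independent of the grouping factor in the sample; whether the sample is representative is a separate question.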
Usage data can be misleading (Pullum 2003)
“Prime ministers Tony Blair (UK) and Bertie Ahern (Eire) were due to turn
up arm in arm to be present at an encouraging announcement of agreement (it
was to prove illusory) between Unionists and Sinn Fein, said The Economist
(October 25th, 2003, p.52, column 1): ‘Downing Street duly announced it,
and up the prime ministers turned’.”
“When you're a descriptive grammarian like me, sometimes you have to trust the
corpus and modify your intuitive idea of what is grammatical, and sometimes you
have to use your intuitive knowledge of the language to ward off false impressions
the corpus might give you. It's not a straightforward matter. Science never is.”
Field data
Similar issues arise with field data, which may record spontaneous usage
(possibly collected under controlled circumstances) or may require
introspection.
The general point is that no one method is likely to fully succeed on its own.
For issues in standardizing elicitation techniques in the context of translated
data, see Cornips & Poletto 2005.
This transition from qualitative to quantitative analysis is a familiar one in the
development of science. But the qualitative model of linguistics is not easily
displaced. Many forms of linguistic behavior are categorically invariant.
Furthermore, the number, variety and complexity of linguistic relations are very great,
and it is not likely that a large proportion can be investigated by quantitative means.
At present, we do not know the correct balance between the two modes of analysis:
how far we can go with unsupported qualitative analysis based on introspection,
before the proposals must be confirmed by quantitative studies based on observation
and experiment.
(Labov 1987)
References
Bresnan, Joan 2006. Is syntactic knowledge probabilistic? Experiments with the dative alternation.
Prepublication from http://www.stanford.edu/~bresnan/publications/index.html
Bresnan, Joan 2007. A few lessons from typology. Linguistic Typology 11, 297-306.
Bresnan, Joan & Tatiana Nikitina 2003/2007. The gradience of the dative alternation. Prepublication from
http://www.stanford.edu/~bresnan/publications/index.html
Cornips, Leonie & Cecilia Poletto 2005. On standardising syntactic elicitation techniques. Lingua 115, 939-957.
Cowart, Wayne 1997. Experimental Syntax. Applying Objective Methods to Sentence Judgments. Thousand
Oaks, London, New Delhi: Sage Publications.
Labov, W. 1987. Some observations on the foundations of linguistics. Unpublished manuscript, available at
http://www.ling.upenn.edu/~wlabov/Papers/Foundations.html
Pullum, G.K. 2003. Corpus fetishism. Language Log.
http://itre.cis.upenn.edu/~myl/languagelog/archives/000122.html
Pullum, G.K. 2003. Up it turned. Language Log.
http://itre.cis.upenn.edu/~myl/languagelog/archives/000058.html
Wasow, Thomas & Jennifer Arnold 2005. Intuitions in Linguistic Argumentation. Lingua 115, 1481-1496.