LEL PhD training day, Jan 21 2008
Theory construction and comparison

“Disciplines differ considerably in the relative emphasis they place on data collection versus theory construction. In physics, there is a clear division of labor between experimentalists and theorists. Linguistics, too, has subfields (including psycholinguistics and sociolinguistics) in which theories tend to be data-driven and others (notably generative grammar) that focus almost exclusively on the formulation of elegant theories, with little attention devoted to careful data collection. Unfortunately, the findings of the experimentalists in linguistics very rarely play a role in the work of generative grammarians. Rather, theory development tends to follow its own course, tested only by the unreliable and sometimes malleable intuitions of the theorists themselves. The theories are consequently of questionable relevance to the facts of language.” (Wasow & Arnold 2005)

Invented examples
• Invention allows for the construction of basic examples, unencumbered by extraneous or irrelevant material
• Invention allows for the construction of paradigms, in which examples differ minimally with respect to a given characteristic
• Invention allows for the construction of “ungrammatical” examples
• Invention allows for the construction of examples which occur rarely (if ever) in a given corpus

Introspection
The construction of invented examples relies on “primary intuitions”, i.e. introspective judgements of well-formedness (typically by the linguist).

“Some argue that primary intuitions are cleaner than other forms of data because they somehow escape the semantic and pragmatic dimensions of language use. But making judgements of well-formedness is a type of language use, albeit a somewhat unusual one. Consulting primary intuitions unavoidably involves attempting to assign a meaning and to imagine a context in which the expression under consideration might be used. By leaving all contextual factors up to the imagination, the use of primary intuitions regarding sentences in isolation is arguably more subject to irrelevant interference than an experimental method that explicitly controls context.” (Wasow & Arnold 2005)

Issues with primary intuitions
“With the explosive growth of language technologies, it is increasingly recognized that the traditional ways of collecting linguistic data are deeply flawed. Although grammaticality judgments are considered an extremely rich source of data, it has long been evident that introspections about decontextualized, constructed examples – especially in syntactic and semantic domains – are unreliable and inconsistent, as pointed out by sociolinguists and dialectologists (Labov 1975, 1996; Cornips & Poletto 2004). Improvements in experimental judgment elicitation techniques have been suggested (Schütze 1996, Cowart 1997, Bard et al. 1996), but the constructed sentences used in many controlled psycholinguistic experiments are themselves highly artificial, lacking discourse cohesion and subject to assumptions about default referents (Roland & Jurafsky 2002). Moreover, theoretical linguists are usually unaware of the multiple variables that are known to affect linguistic judgments and can hardly control for them (Gries 2005).” (Bresnan 2007)
Labov’s Working Principles (1975/1987)
I. The Consensus Principle: if there is no reason to think otherwise, assume that the judgments of any native speaker are characteristic of all speakers of the language.
II. The Experimenter Principle: if there is any disagreement on introspective judgments, the judgments of those who are familiar with the theoretical issues may not be counted as evidence.
III. The Clear Case Principle: disputed judgments should be shown to include at least one consistent pattern in the speech community or be abandoned.

But
“Linguists are building on sand until they can answer basic questions: what are the test-retest reliabilities of judgments of grammatical acceptability? Under what conditions do introspections match speech production? What are the sources of bias?”

Clear cases?
Variance in judgements of acceptability (Cowart 1997):
“The fact that the syntax literature is characterized by consistent inattention to the methods by which sentence judgments are gathered and summarized is not compatible with due concern for the role of error variance.”

Example: that-trace
a. Who do you suppose invited Ann to the circus?
b. Who do you suppose Ann invited to the circus?
c. Who do you suppose that invited Ann to the circus?
d. Who do you suppose that Ann invited to the circus?
Each informant judged five different representatives of the four sentence types. A standard psychophysical scaling method was used to express relative acceptability. The analysis covers a subset of 88 informants who had all judged the same experimental and filler sentences.

Results
“The analysis reveals what might appear to be an extraordinary degree of inconsistency in informant responses; for roughly 90% of the informants, the average range of variation within sentence type categories exceeded the difference between the ‘that’-trace violations (c) and the average of the other three types. Indeed, on average, for individual informants the mean range within categories was more than 80% larger than the size of the ‘that’-trace effect. The ‘that’-trace effect was smaller than the average range within categories even for half of the ten informants who showed the least overall variation within categories. In short, for most informants a great many of their individual sentence judgments were unrepresentative of the average acceptability of the category from which the sentence was drawn.” (Cowart 1997, p. 33)

[Figure: per-informant judgment profiles, with five bars for each of the four sentence types. An informant shows the ‘that’-trace effect if the average of the five bars in the third column (type c) is lower than the average of the bars in the other three columns.]
– Informant 1: the “best” case
– Informant 2: “typical” (difference = 1.8, but the range of variation within each sentence type exceeds this)
– Informants 3–4: very variable, with a wide range of variation within sentence types, but show the ‘that’-trace effect overall
– Informant 5: subject extraction with ‘that’ judged better than object extraction

Nevertheless
With a well-designed experiment for eliciting judgments, such error variation does not obscure the overall result (Cowart 1997, p. 19).

What is more
The experiment reveals more than might have been predicted by simple introspection: object extraction with ‘that’ is significantly less acceptable than either subject extraction or object extraction without ‘that’! Statistical analysis can reveal the contribution of each factor, and their interaction (Cowart 1997, p. 124).
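To make the logic of the factorial comparison concrete, here is a minimal sketch in Python using invented ratings (not Cowart’s data): crossing extraction site (subject vs. object) with the presence of that lets the analysis separate the main effect of each factor from their interaction, and it is the interaction, a difference of differences, that constitutes the ‘that’-trace effect proper.

```python
# Minimal sketch of a 2x2 factorial decomposition of acceptability ratings.
# The numbers are invented for illustration; they are not Cowart's data.
from statistics import mean

# Hypothetical ratings (higher = more acceptable), one value per informant
# for each of the four sentence types a-d above.
ratings = {
    ("subject", "null"): [5.1, 4.8, 5.3, 4.9],  # a. Who do you suppose __ invited Ann ...?
    ("object",  "null"): [5.0, 4.9, 5.2, 5.1],  # b. Who do you suppose Ann invited __ ...?
    ("subject", "that"): [2.1, 2.6, 1.9, 2.4],  # c. Who do you suppose that __ invited Ann ...?
    ("object",  "that"): [4.2, 4.4, 4.0, 4.5],  # d. Who do you suppose that Ann invited __ ...?
}

cell = {cond: mean(vals) for cond, vals in ratings.items()}

# Main effect of 'that': average change in acceptability when 'that' is present.
effect_that = (mean([cell[("subject", "that")], cell[("object", "that")]])
               - mean([cell[("subject", "null")], cell[("object", "null")]]))

# Main effect of extraction site (subject vs. object).
effect_site = (mean([cell[("subject", "null")], cell[("subject", "that")]])
               - mean([cell[("object", "null")], cell[("object", "that")]]))

# Interaction: does 'that' hurt subject extraction more than object extraction?
# This difference of differences is the 'that'-trace effect itself.
interaction = ((cell[("subject", "that")] - cell[("subject", "null")])
               - (cell[("object", "that")] - cell[("object", "null")]))

print(f"main effect of 'that':       {effect_that:+.2f}")
print(f"main effect of extraction:   {effect_site:+.2f}")
print(f"interaction ('that'-trace):  {interaction:+.2f}")
```

On ratings constructed like these, the interaction term is large and negative: that by itself and subject extraction by itself cost little, while their combination is sharply degraded. This is exactly the contribution that the single-pair comparison discussed below cannot isolate.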
Good design of elicitation experiments
• Factorial design
Comparing Who do you suppose invited Ann to the circus? with Who do you suppose that invited Ann to the circus? is not good enough, because the contribution of simply adding that is not controlled for. Solution: add the object-extraction pair.
• Presentation
– No more than one sentence from a token set is presented to each informant
– Introduction of filler sentences
– Randomization of order of presentation
• Scaling methods
Ratio scale methods vs. category scale methods. Category scales (e.g. 1–5) have the problem that the degree of acceptability indicated by the difference between 1 and 2 may differ from that between 4 and 5; nevertheless, they are conceptually simple. Ratio scales ask informants to judge relative acceptability; they are more complicated to present, but can work well.
• Number of informants
Must be sufficient for statistical evaluation. For a token set like the ‘that’-trace set, minimally 8.

Corpus and other usage data
Introspection and acceptability judgments are often a poor guide to the space of grammatical possibility; usage data can reveal the gaps.

Example: the dative alternation (Bresnan 2007)
That movie gives me the creeps
*That movie gives the creeps to me
From the 1970s, pairs such as these have been used to claim that the double object construction is associated with possession, while the prepositional object construction is “allative”.
“But many examples of the kinds claimed to be ungrammatical can be found in current use on the web, including (5) from Bresnan & Nikitina (2003):
(5) a. This life-sized prop will give the creeps to just about anyone! Guess he wasn’t quite dead when we buried him!
b. Stories like these must give the creeps to people whose idea of heaven is a world without religion . . .
Again we must ask whether we can trust these examples from the web. Could they simply be unrepresentative anomalies fished up from the vast depths of the internet?”
Evidently, the “end-weight” principle is relevant here and outweighs our original expectations.

Corpora and quantitative data
Representative corpora can provide quantitative data which can be statistically evaluated. Such data can be compared with results/grammars obtained by introspection and elicitation of judgments. For example, Bresnan & Nikitina’s (2003/2007) analysis of the dative alternation, using the Treebank Wall Street Journal corpus, can be compared with the elicited judgments from the controlled experiment in Bresnan (2006).
[Figures: Bresnan & Nikitina 2003/2007 (corpus study); Bresnan 2006 (elicitation experiment)]
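As a deliberately crude illustration of how usage data can be quantified, the sketch below tabulates construction choice against a simple end-weight measure. The examples and the word-count measure of weight are invented for this handout; they are not Bresnan & Nikitina’s data or method.

```python
# Toy illustration of quantifying the end-weight tendency in the dative
# alternation. All examples are invented; 'weight' here is just word count.
from collections import Counter

# Each record: (construction, recipient NP, theme NP)
#   "DO" = double object          (give [recipient] [theme])
#   "PO" = prepositional dative   (give [theme] to [recipient])
examples = [
    ("DO", "me", "the creeps"),
    ("DO", "her", "a book about the history of the dative alternation"),
    ("DO", "the students", "a long and rather tedious assignment"),
    ("PO", "just about anyone who walks past the display", "the creeps"),
    ("PO", "people whose idea of heaven is a world without religion", "the creeps"),
    ("PO", "the committee that oversees graduate admissions", "the report"),
]

def weight(np_string):
    """Crude weight measure: number of orthographic words in the NP."""
    return len(np_string.split())

# Cross-tabulate construction choice by which argument is heavier.
table = Counter()
for construction, recipient, theme in examples:
    heavier = "recipient heavier" if weight(recipient) > weight(theme) else "theme heavier or equal"
    table[(heavier, construction)] += 1

for key in sorted(table):
    print(key, table[key])

# End-weight predicts that the heavier argument tends to come last:
# PO (... to recipient) when the recipient is heavy, DO when the theme is heavy.
```

In this toy table the heavier argument always ends up last; real corpus counts show a gradient tendency rather than a categorical split, which is why statistical modelling of such data matters.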
Usage data can be misleading (Pullum 2003)
“Prime ministers Tony Blair (UK) and Bertie Ahern (Eire) were due to turn up arm in arm to be present at an encouraging announcement of agreement (it was to prove illusory) between Unionists and Sinn Fein, said The Economist (October 25th, 2003, p. 52, column 1): ‘Downing Street duly announced it, and up the prime ministers turned’.”
“When you’re a descriptive grammarian like me, sometimes you have to trust the corpus and modify your intuitive idea of what is grammatical, and sometimes you have to use your intuitive knowledge of the language to ward off false impressions the corpus might give you. It’s not a straightforward matter. Science never is.”

Field data
Similar issues arise with field data, which may record spontaneous usage (possibly collected under controlled circumstances) or rely on introspection. The general point is that no one method is likely to succeed fully on its own. For issues in standardizing elicitation techniques in the context of translated data, see Cornips & Poletto 2005.

This transition from qualitative to quantitative analysis is a familiar one in the development of science. But the qualitative model of linguistics is not easily displaced. Many forms of linguistic behavior are categorically invariant. Furthermore, the number, variety and complexity of linguistic relations are very great, and it is not likely that a large proportion can be investigated by quantitative means. At present, we do not know the correct balance between the two modes of analysis: how far we can go with unsupported qualitative analysis based on introspection, before the proposals must be confirmed by quantitative studies based on observation and experiment. (Labov 1987)

References
Bresnan, Joan 2006. Is syntactic knowledge probabilistic? Experiments with the dative alternation. Prepublication from http://www.stanford.edu/~bresnan/publications/index.html
Bresnan, Joan 2007. A few lessons from typology. Linguistic Typology 11, 297-306.
Bresnan, Joan & Tatiana Nikitina 2003/2007. The gradience of the dative alternation. Prepublication from http://www.stanford.edu/~bresnan/publications/index.html
Cornips, Leonie & Cecilia Poletto 2005. On standardising syntactic elicitation techniques. Lingua 115, 939-957.
Cowart, Wayne 1997. Experimental Syntax: Applying Objective Methods to Sentence Judgments. Thousand Oaks, London, New Delhi: Sage Publications.
Labov, William 1987. Some observations on the foundations of linguistics. Unpublished manuscript, available at http://www.ling.upenn.edu/~wlabov/Papers/Foundations.html
Pullum, Geoffrey K. 2003. Corpus fetishism. Language Log. http://itre.cis.upenn.edu/~myl/languagelog/archives/000122.html
Pullum, Geoffrey K. 2003. Up it turned. Language Log. http://itre.cis.upenn.edu/~myl/languagelog/archives/000058.html
Wasow, Thomas & Jennifer Arnold 2005. Intuitions in linguistic argumentation. Lingua 115, 1481-1496.