Anticipation of clause-final heads. Evidence from eye-tracking and SRNs

Lars Konieczny ([email protected])
Center for Cognitive Science, University of Freiburg, Germany

Philipp Döring ([email protected])
Center for Cognitive Science, University of Freiburg, Germany

To appear in: Proceedings of the ICCS/ASCS-2003 Joint International Conference on Cognitive Science. Sydney, Australia, 13-17 July 2003

Abstract

In a Simple Recurrent Network simulation and an eye-tracking study, we investigated the processing of clause-final verbs. Following the integration cost hypothesis (Gibson, 1998), verbs should be harder to process the more complement integrations have to take place. In contrast, probabilistic prediction-based models, like Simple Recurrent Networks (SRNs, Elman, 1990), might anticipate verbs better the more dependents have been encountered beforehand. We trained SRNs on a subset of the German language to establish basic dependency relationships between verbs and their arguments in both verb-second and verb-final constructions. The test results established a clear anticipation hypothesis: the more arguments precede the verb, the lower the prediction error and hence the predicted reading times. The data from an eye-tracking experiment confirm the anticipation hypothesis: clause-final verbs are read faster when an additional Dative, instead of a noun-modifying Genitive, is read beforehand. Adverbial PP-adjuncts, in contrast to noun-modifying PPs, however, did not affect reading times. In general, the results support a restricted anticipation hypothesis.

Introduction

Working Memory and language

In German, like in many other languages, verbs are placed at the end of subordinate clauses (1).

(1) Jan glaubte, dass der Gast dem Onkel das Auto empfahl.
    Jan believed that the guest[nom] the uncle[dat] the car[acc] recommended.
    “Jan believed that the guest recommended the car to the uncle.”

Processing verb-final constructions poses a number of challenges to the human parser. Before the verb is actually encountered, its arguments have already been processed, and since the syntactic and semantic relations they participate in are not known until the verb is encountered, their role in these relations is vague at best. When the verb is eventually reached, it must be integrated with all its dependents. This is arguably a costly process. Gibson (1998), in his Dependency Locality Theory (DLT), made integration cost one of two central cost components. According to DLT, integration is more costly the more dependencies have to be established and the longer the distance to be crossed when a dependency is established. DLT adopts the single-resource view of working memory (Just and Carpenter, 1992), where all cost components consume energy from the same energy pool. People vary with respect to their memory capacity, and people with a smaller capacity are affected more by harder constructions than people with a larger capacity (King & Just, 1991). The individual capacity can be estimated by the reading span score (Daneman & Carpenter, 1980). According to DLT, low-span readers should be more affected by long-distance integrations than high-span readers.

An alternative view. MacDonald and Christiansen (2002, henceforth MC) proposed a strikingly different view of working memory. They adopt Elman’s (1990) Simple Recurrent Network (SRN) approach to language processing, which has been demonstrated to capture linguistic regularities, including limited recursive rules, from mere training on predicting the next word in the sentence. In SRNs, processing is indistinguishable from linguistic knowledge, its acquisition, and working memory. MC show that SRNs can be used to predict processing data, and their pattern of results apparently supports their claim (but see Konieczny and Ruh, 2003, for a critical discussion).
In SRNs, as in other network architectures, complexity arises primarily from (word order) irregularity of the input. Importantly, more experienced networks suffer less from irregular input. MC hence proposed that the notion of memory capacity be replaced by the subjects’ degree of linguistic experience.

Applied to the question of clause-final verb processing, SRNs lack a clear position, as the constructions to be considered exhibit regular word order. However, SRNs are stochastic devices that capture the probabilistic structure of the input. Since load is conceived of as an epiphenomenon of predictability, verbs should generally be easier to process when they are preceded by more dependents rather than fewer. Provided that the sequence of preceding dependents carries sufficient combinatory information, like the number and type of arguments, case, thematic roles etc., each additional dependent constrains the class of potential continuations, hence increasing the likelihood of the actual verb to come (cf. Konieczny, 1996).

In the remainder of the paper, we first test this hypothesis by training and testing an SRN with input generated from a suitable probabilistic grammar. Second, we report an eye-tracking study that addresses the question of integration vs. anticipation effects on clause-final verbs by varying the number of dependents that have been processed beforehand.

Simulation

We ran a simulation study to establish an SRN-based prediction for actual reading data.

Materials

Two corpora, one for training, the other for testing, were generated with SLG (Rohde, 2002), which takes a probabilistic constraint grammar and generates a sentence corpus that respects the distributional probabilities in the grammar.

Training corpus. The training corpus was generated from sixty-three words, among them intransitive, transitive and ditransitive verbs (taking Datives and Accusatives), where the Dative was either obligatory or optional.
Nouns were either Nominatives, Genitives, Datives or Accusatives. Genitives could be added after any other noun phrase. The majority of sentences in the training corpus were simple, verb-second main clauses, whose purpose was to train basic verb dependency relationships. In subordinate (verb-final) clauses, the grammar permitted NP-arguments to occur in the order NPnom - NPdat - NPacc - verb. A total of 15,000 sentences was generated and ordered randomly. A run through all 15,000 sentences is considered an epoch.

Test corpus. Forty sentences were generated, varying the number of arguments in the subordinate clause. Additionally, we added a genitive attribute to the main clause’s Subject-NP in half of the sentences. Sentences were constructed following the sequence NPnom - “that” - NPnom - (NPgen) - (NPdat) - NPacc - verb(subord.) - verb(matrix) - NPacc.

Method

Simulations were run on JavaNNS (© JavaNNS group, WSI, University of Tuebingen). We built an SRN consisting of 64 input and 64 output units for the 63 words and the end-of-sentence marker (EOS), and 128 hidden and 128 context units. The architecture of the network is shown in Figure 1.

[Figure 1: Network architecture. 64 input units, 128 hidden units, 64 output units, and 128 context units that receive a copy of the hidden layer.]

Parameter setting. Initial weights were randomly set between ±0.15. The learning rate was initially set to 0.05 and after 20 epochs to 0.02 to make learning more efficient. We ran 270 epochs, which means the net encountered 4,050,000 training sentences in total. About every 10 epochs, the test set was run with learning turned off. Like MC, we calculated the Grammaticality Prediction Error (GPE) for each output to evaluate word-by-word performance.
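To make the architecture and the error measure more concrete, the following is a minimal sketch in plain Python, not the JavaNNS setup used for the simulations. It assumes logistic units, no bias terms, a context layer initialized to zero, and a simplified definition of misses (the summed shortfall of each grammatical unit relative to its grammar probability); none of these details are specified in the text, and all function and class names are hypothetical. Training (backpropagation) is omitted.

```python
import math
import random

N_WORDS = 64        # 63 words plus the end-of-sentence marker
N_HIDDEN = 128      # hidden units, mirrored by 128 context units
INIT_RANGE = 0.15   # initial weights drawn from [-0.15, +0.15]


def learning_rate(epoch):
    """Learning rate schedule: 0.05 for epochs 1..20, 0.02 afterwards."""
    return 0.05 if epoch <= 20 else 0.02


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_matrix(rows, cols, rng):
    return [[rng.uniform(-INIT_RANGE, INIT_RANGE) for _ in range(cols)]
            for _ in range(rows)]


class SimpleRecurrentNetwork:
    """Forward pass of an Elman-style SRN (assumed: no biases, zero context)."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_in = init_matrix(N_HIDDEN, N_WORDS, rng)
        self.w_ctx = init_matrix(N_HIDDEN, N_HIDDEN, rng)
        self.w_out = init_matrix(N_WORDS, N_HIDDEN, rng)
        self.context = [0.0] * N_HIDDEN

    def step(self, word_index):
        """Present one word (one-hot input) and return next-word activations."""
        hidden = [
            logistic(self.w_in[h][word_index]
                     + sum(w * c for w, c in zip(self.w_ctx[h], self.context)))
            for h in range(N_HIDDEN)
        ]
        output = [
            logistic(sum(w * a for w, a in zip(self.w_out[o], hidden)))
            for o in range(N_WORDS)
        ]
        self.context = hidden  # copy hidden layer into the context layer
        return output


def gpe(output, grammatical_probs):
    """GPE = 1 - hits / (hits + false alarms + misses).

    grammatical_probs maps each grammatical unit index to its probability
    under the grammar; all other units count as ungrammatical.
    Misses are simplified here to sum(max(0, p_i - activation_i)).
    """
    hits = sum(output[i] for i in grammatical_probs)
    false_alarms = sum(a for i, a in enumerate(output)
                       if i not in grammatical_probs)
    misses = sum(max(0.0, p - output[i])
                 for i, p in grammatical_probs.items())
    total = hits + false_alarms + misses
    return 1.0 - hits / total if total > 0 else 0.0
```

With untrained weights the score is close to 1 for any word, because most of the output activation falls on ungrammatical units; a perfectly trained network that puts all activation on the grammatical continuations would score 0.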
The GPE takes hits (i.e., the sum of activation of all grammatical output units), false alarms (i.e., the sum of activation of all ungrammatical output units) and misses (i.e., the sum of activation by which grammatical units are activated less than their grammatical probability calls for) into account:

\[ \mathrm{GPE} = 1 - \frac{\text{hits}}{\text{hits} + \text{false alarms} + \text{misses}} \tag{2} \]

The GPE returns a value between 0 and 1. The GPE serves as an estimate of cognitive load and hence of reading times.

Hypotheses

SRNs learn to anticipate upcoming words based on previous input. The more constraining the input, the lower the error should be. The anticipation of verbs should thus benefit from processing their complements beforehand. We expect smaller GPEs with an increasing number of complements, with distance kept constant. Furthermore, if experience matters, predictions should get better with more epochs, especially for more distant complements.

Results

As shown in Figure 2, predictions of the subordinate verb continuously improved over 270 epochs. After about 45 epochs, predictions benefit from having processed a Dative earlier in the clause. An additional Genitive does not seem to have any substantial impact.

[Figure 2: GPE (y-axis, 0 to 1) over training epochs (x-axis, 0 to 300) for the four conditions dat- gen-, dat- gen+, dat+ gen-, dat+ gen+.]
Figure 2: GPEs at the subordinate verb show an advantage of dative-clauses after about 45 epochs.

The effect of an additional Dative reaches its maximum shortly after it starts at about 50 epochs and then shrinks slowly. While the GPE still gets smaller even after 200 epochs, anticipation does not improve with more experience.

We tried to pin down the source of the advantage for the dative-sentences. In order to do this, we compared the cumulated actual activations of the optional-dative verbs and the obligatory-dative verbs with the predictions of the stochastic grammar of SLG. Thus, we could examine whether the error was caused by the crucial verbs or by some other factor. The network managed to come very close to the probability distribution in the grammar for the sentences with a Dative, whereas it had some problems with the sentences missing a Dative. In the latter case, the net still predicted verbs with an obligatory Dative to a certain degree. Interestingly, the network also missed the number agreement more often here.

Discussion

This result confirms the anticipation hypothesis for SRNs. An additional dative improved the prediction of the right verb. Not only did it constrain the class of verbs, but the additional information apparently also improved the quality of the prediction in general, so that number agreement was better than without a Dative. In sentences without a Dative, a Dative could still come later in the string (although the strict word order in the training set should have made this possibility unlikely), increasing the degree of uncertainty and hence weakening the verb prediction.

Given the SRN-based simulation data, we were interested in on-line reading data to test their predictions. If SRNs are adequate models for human sentence processing, one would expect real subjects to perform similarly. Readers should therefore show an anticipation effect on clause-final verbs, which should not interact with reading span.

Experiment

The experiment was designed to test the anticipation hypothesis against the integration cost hypothesis.

Design and materials

Sentences were constructed according to the following schema: NPnom - “that” - NPnom - NPgen or dat - NPacc - PP(N-mod. or V-mod.) - verb(subord.) - verb(matrix) - NPacc. The experimental design included two factors: Case of 2nd NP: dative (3,4) vs. genitive (5,6), and Type of PP: noun-modifying (3,5) vs. verb-modifying (4,6).

(3) 2nd NP: Dative; PP: Noun-modifying
    Die Einsicht, dass / der Freund / dem Kunden / das Auto / aus Plastik / verkaufte, / erheiterte / die Anderen.
    The insight, that / the friend / the client / the car / (made) from plastic / sold, / amused / the others.
    “The insight that the friend sold the car made from plastic to the client amused the others.”

(4) 2nd NP: Dative; PP: Verb-modifying
    Die Einsicht, dass / der Freund / dem Kunden / das Auto / aus Freude / verkaufte, / erheiterte / die Anderen.
    The insight, that / the friend / the client / the car / just for fun / sold, / amused / the others.
    “The insight that the friend sold the car to the client just for fun amused the others.”

(5) 2nd NP: Genitive; PP: Noun-modifying
    Die Einsicht, dass / der Freund / des Kunden / das Auto / aus Plastik / verkaufte, / erheiterte / die Anderen.
    The insight, that / the friend / (of) the client / the car / (made) from plastic / sold, / amused / the others.
    “The insight that the friend of the client sold the car made from plastic amused the others.”

(6) 2nd NP: Genitive; PP: Verb-modifying
    Die Einsicht, dass / der Freund / des Kunden / das Auto / aus Freude / verkaufte, / erheiterte / die Anderen.
    The insight, that / the friend / (of) the client / the car / just for fun / sold, / amused / the others.
    “The insight that the friend of the client sold the car just for fun amused the others.”

Twenty sentence sets (five per condition) were constructed following the pattern in (3) to (6).

Procedure

Prior to the experiment, participants were tested on their reading span, using the German test implemented by Hacker, Handrick, and Veres (1996). Participants read blocks of sentences on a computer screen. Each sentence in a block was displayed for five seconds. At the end of a block, participants had to write down the last word and a short two-word description of each sentence. Block size varied from two to eight sentences. Each complete block was scored as a point.

After this test they were instructed about the experimental procedure and then fitted to a head-rest to prevent head movements during reading. They were told to read at a normal pace. After a brief calibration procedure they read five filler sentences to get used to the setting. Participants read a total of 152 randomly ordered sentences, twenty of which were targets. After each block of twenty sentences, the calibration was redone and the experiment continued with the next block, starting with a filler sentence. Before a sentence was presented, participants had to fixate a cross on the screen which marked the position of the first character. As soon as they did so, the cross was erased and the sentence displayed. After they finished reading, participants had to press a button, which replaced the sentence with a simple yes/no question. This could be answered by pressing one of two buttons. They answered with a high degree of accuracy (91%), which did not vary across conditions.

Apparatus. Eye movements were monitored by a Generation 5.5 Dual Purkinje Image Eye-tracker. Viewing was binocular, but eye movements were recorded only from the right eye. The eye-tracker was connected to an Intel Pentium computer which controlled the stimulus presentation and stored the output from the eye-tracker. The sampling rate for data collection was 1 kHz. The sentences were presented on a 20-inch colour monitor, beginning in the sixth line. The subject was seated 83 cm from the face of the screen, so that 3 letters equalled about 1 degree of visual angle. External distractions and light reflections were screened off by a black tube and the room was slightly darkened.

Dependent Variables and Data Analysis. The eye-movement data were summarized with respect to regions as indicated by the slashes in sentences (3) to (6). For each region we report regression path durations (RPDs) per word. RPDs represent the time between entering and going past a region for the first time. They include first pass reading times, but are extended if there is a regressive saccade after first pass reading. RPDs have been demonstrated to be most sensitive to complexity and garden-pathing effects (Konieczny, Hemforth, Scheepers, & Strube, 1997). First pass reading times below 100 milliseconds were treated as overshoots and added to the reading time on the previously fixated region. Zero RPDs were treated as missing values (conditionalized analysis).

After the experiment, participants had to answer some questions about their reading habits.

Hypotheses

If anticipation is the dominant mechanism, the prediction of a verb should benefit from processing its complements, and possibly adverbs, beforehand. Hence there should be shorter reading times on the embedded verb for more complements, with distance kept constant. Closer dependents might have a bigger impact than more distant ones, i.e. PP complements should speed up reading the verb more than datives. We expect that even low-span readers will have sufficient linguistic experience to exhibit the anticipation effect. Given the simulation results, then, there is no prediction of an interaction of span and anticipation.

If integration cost is the dominant factor, integrating verbs with more complements should be harder than integrating verbs with fewer complements. The dative should impose a particularly strong cost component, as integration has to cross three new discourse entities. DLT predicts low-span subjects to be more affected by integration cost than high-span subjects. We should therefore expect an interaction of integration cost with reading span at the subordinate verb.

Results

Twenty-three students from the University of Freiburg were paid 7,50 € or received course credits for participation. One participant had to be excluded because of too many track losses. Figure 3 illustrates mean reading times across the embedded clause in relevant regions, and at the matrix verb.
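The regression path duration measure described in the data-analysis paragraph above can be illustrated with a short sketch. It assumes that a trial is available as a chronological list of (region index, fixation duration in ms) pairs, with regions numbered left to right; this data layout and the function names are hypothetical, and the sketch does not implement the overshoot correction for first pass times below 100 ms.

```python
def regression_path_duration(fixations, region):
    """Sum fixation durations from first entering `region` until a region
    further to the right is fixated for the first time.

    Fixations on earlier regions after a regressive saccade are included.
    Returns None (missing value) if the region was skipped in first pass
    or never fixated.
    """
    total = 0
    entered = False
    for reg, duration in fixations:
        if not entered:
            if reg > region:
                return None          # region was skipped during first pass
            if reg == region:
                entered = True
                total += duration
        else:
            if reg > region:
                break                # reader has gone past the region
            total += duration        # first pass or regression time
    return total if entered else None


def rpd_per_word(fixations, region, n_words):
    """RPD divided by the number of words in the region (None if missing)."""
    rpd = regression_path_duration(fixations, region)
    return None if rpd is None else rpd / n_words


# Example trial: the reader regresses from region 1 to region 0 and back.
trial = [(0, 200), (1, 250), (0, 150), (1, 180), (2, 300)]
```

For the example trial, the RPD of region 1 covers the first fixation on it, the regression to region 0, and the re-reading of region 1, while the RPD of region 0 only covers its initial fixation.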
[Figure 3: Mean RPDs per word (msec, y-axis 0 to 900) by region (NP nom, NP dat/gen, NP acc, PP N-mod/V-mod, subord. verb, matrix verb) for the four NP2 / PP conditions: Dative N-mod, Dative V-mod, Genitive N-mod, Genitive V-mod.]
Figure 3: Average regression path durations (RPD) per word exhibit an advantage at the subordinate verb for clauses with a dative NP.

Reading times at the embedded verb were submitted to a two-factorial MANOVA for repeated measures. Table 1 shows the distribution of means. RPDs were shorter (229 ms on average) when a Dative, instead of a Genitive, was read beforehand (F1(1,21)=7.4, MSe=155590, p<.05; F2(1,19)=5.378, MSe=277240, p<.05). PP-type, however, had no reliable impact on reading the embedded verb, although there was a numerical advantage of 59 ms for noun-modifying PPs, and the two factors did not interact (all Fs<1.2).

Table 1: Average RPDs at the subordinate verb.

2nd NP      PP        RPD (msec)
Dative      N-mod.    555
Dative      V-mod.    623
Genitive    N-mod.    793
Genitive    V-mod.    843

Reading span. Three groups of about equal size (six to eight participants) were built from the reading span score. There was no reliable interaction of span group with any other factor.

Discussion

The results clearly disconfirm the integration cost hypothesis. Reading times were faster, not slower, when an additional Dative had to be integrated. Instead, this finding supports the anticipation hypothesis, as processing an additional argument facilitated verb processing.

The lack of a PP-effect is compatible with neither the integration cost nor the unrestricted anticipation hypothesis. Note, however, that both effects could have masked each other here. It is possible that facilitation by anticipation and integration cost both exist, but in varying strengths, so that one might dominate the other in one circumstance and vice versa in another. This assumption is questionable though, as integration cost should be highest, and anticipation lowest, for the more distant NP manipulation, and vice versa for the local PP variation. We should therefore have encountered an integration cost effect for Genitives vs. Datives, and an anticipation effect for noun-modifying vs. adverbial PPs. The actual data are much closer to the opposite pattern though.

Note, however, that the lack of a PP effect may be due to a number of inherent reasons. First, adverbs generally impose weaker constraints on verbs than their arguments. Second, the distance to the verb, and hence the time left for the PP to exert its impact, might have been too short. Third, the PPs in the materials may not have been modifying the verb or the noun as unambiguously as possible. PPs, as opposed to NPs, are not morphologically marked as verb-arguments in German. In many cases, attachment was merely biased towards one of the alternatives through its plausibility, rather than one alternative being strictly permissible and the other not. Subjects thus might have kept their initially preferred interpretation, regardless of the intended bias. The PP results should therefore be taken with a grain of salt.

The lack of group effects was predicted by the SRN simulation results, but group sizes are too small to draw firm conclusions from this null effect.

General Discussion

We presented results from an SRN simulation which confirmed that SRNs anticipate items based on their preceding dependents. Adding a Dative did improve the accuracy of verb predictions. The eye-tracking results confirmed the anticipation hypothesis.

The data presented here are in line with results previously reported by Konieczny (2000). Konieczny found reading times of clause-final verbs to be shorter when integration had to cross a longer distance to the verb's arguments. Distance was manipulated by adding a relative clause to the direct object, and by adding an adverbial PP (a directional locative). Konieczny interpreted the results as evidence for anticipation of verbs on the basis of information added to one of its arguments (by the RC), or by adding an argument itself (the directional locative). As this result had a potential confound, namely the position in the sentence (words at the end may be read faster anyway), it is worth mentioning that in the same study, he found that relative pronouns did not benefit from being placed closer to the end of the clause and more distant from their host. On the contrary, they were read more slowly there, indicating increased integration cost. Konieczny therefore distinguished predictable (e.g. verbs) from non-predictable items (e.g. relative pronouns). This result was confirmed by Konieczny and Bormann (2001), who added a focus particle to the host of the RC, making the relative pronoun predictable. In this setting the integration cost effect disappeared.

Furthermore, there is evidence by Vasishth (2002), who found that clause-final verbs in Hindi were read faster when an adverb was added. While this result seems to be at odds with our current finding of a lacking adverb effect, note that this lack could have been produced by various experimental factors, as discussed earlier.

Vasishth (2002) proposed an ACT-R based (re-)activation model, where the arguments and the verb prediction can be retrieved better the more they get reactivated by additional complements or modifiers. It will be hard to distinguish his concept of reactivation from our concept of anticipation on an empirical basis. However, note that in ACT-R activation decay is estimated by b − 0.5 ln(time) (with b being an arbitrary initial value). Decay is hence steepest during earlier periods and successively flattens later. Reactivation should therefore have a stronger impact if it took place recently rather than earlier in the sentence. We would therefore expect the PP adverbial adjacent to the verb to have a much more dramatic effect than the addition of an NP complement further upstream.
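The shape of the decay function b − 0.5 ln(time) can be checked with a few lines of arithmetic. The sketch below assumes b = 0 and arbitrary time units, neither of which is fixed by the text, and the function names are hypothetical; it only illustrates that the same time step costs more activation shortly after a reactivation than later on.

```python
import math

DECAY = 0.5  # decay rate from the formula b - 0.5 ln(time)


def activation(time, b=0.0):
    """Activation `time` units after the last (re)activation."""
    return b - DECAY * math.log(time)


def activation_loss(t_start, t_end, b=0.0):
    """Activation lost between two points in time (independent of b)."""
    return activation(t_start, b) - activation(t_end, b)


early_loss = activation_loss(1.0, 2.0)    # one time unit, shortly after reactivation
late_loss = activation_loss(10.0, 11.0)   # one time unit, much later
```

Under these assumptions the early loss equals 0.5 ln 2 and the late loss equals 0.5 ln 1.1, so activation drops several times faster right after a reactivation than it does ten time units later.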
The present results suggest that the opposite is the case.

We finally want to point out that the results are in line with Konieczny’s (1996) anticipation proposal. In his doctoral thesis, Konieczny proposed an incremental processor based on an HPSG grammar, where linguistic knowledge is represented in highly interconnected graphs. According to this model, each complement integrated into the sentence structure (by means of unification) adds information to the prediction of the verb. The enriched prediction constrains the class of the verb to come and facilitates later integration, possibly by providing rich cues for lexical access and retrieval of dependents. While this account is straightforward for complements, integrating pre-verbal adjuncts is less direct and hence less efficient. Note that this view is perfectly consistent with the present pattern of data: the dative complement facilitated integration, whereas the adverbial PP did not.

To sum up, the data support anticipation models such as SRNs, but in a constrained way such that arguments predict upcoming words more strongly than adverbs.

Conclusion

We have argued that clause-final verb integration can be facilitated by anticipatory mechanisms. SRNs are instances of stochastic devices that could be shown to realize anticipation of late heads based on their earlier complements. The data reported support the anticipation hypothesis and disconfirm integration cost as the main component of cognitive load. While the results are at least modestly compatible with a variety of models, a valid approach must consider that the specificity, not the distance, of a dependent determines the degree of anticipation. That is, NP arguments elicit a facilitation effect even when they are distant from the verb, whereas adverbial PPs do not necessarily, even when they are adjacent. Future research will have to clarify this issue.
Acknowledgments

We want to thank Simone Burgi, Heidi Fischer and Sven Eric Hiss for their assistance in the construction of the materials, and Sarah Schimke, Felix Schrape and Kerstin Botsch for running the experiment. We are grateful to Barbara Hemforth for her valuable comments on an earlier version of this paper, and to all the participants of the connectionist cognitive modeling class, headed by the first author in winter 2002, for their lively and fruitful discussions. All remaining errors are our own, of course.

References

Daneman, M., & Carpenter, P. A. (1980). Individual differences in working memory and reading. Journal of Verbal Learning and Verbal Behaviour, 19, 450-466.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Gibson, E. (1998). Linguistic complexity: Locality of syntactic dependencies. Cognition, 68, 1-76.
Hacker, W., Handrick, S., & Veres, T. (1996). Lesespannentest. Manuscript. University of Dresden.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: Individual differences in working memory. Psychological Review, 99, 122-149.
King, J., & Just, M. A. (1991). Individual differences in syntactic processing: The role of working memory. Journal of Memory and Language, 30, 580-602.
Konieczny, L. (1996). Human sentence processing: A semantics-oriented parsing approach. Doctoral Thesis. IIG-Berichte, 6-96. Albert-Ludwigs-University, Freiburg, Germany.
Konieczny, L. (2000). Locality and parsing complexity. Journal of Psycholinguistic Research, 29, 627-645.
Konieczny, L., & Bormann, T. (2001). Extraposition and anticipation of relative clauses. Manuscript. Centre for Cognitive Science, University of Freiburg.
Konieczny, L., Hemforth, B., Scheepers, C., & Strube, G. (1997). The role of lexical heads in parsing: Evidence from German. Language and Cognitive Processes, 12, 307-348.
Konieczny, L., & Ruh, N. (2003). What’s in an error? A reply to MacDonald & Christiansen (2002). Manuscript submitted, University of Freiburg.
MacDonald, M. C., & Christiansen, M. H. (2002). Reassessing working memory: Comment on Just and Carpenter (1992) and Waters and Caplan (1996). Psychological Review, 109, 35-54.
Rohde, D. L. T. (1999). The Simple Language Generator: Encoding complex languages with simple grammars. Technical Report CMU-CS-99-123, Carnegie Mellon University, Department of Computer Science, Pittsburgh, PA.
Vasishth, S. (2002). Working memory in sentence comprehension: Processing Hindi center embeddings. Unpublished doctoral dissertation, Ohio State University, Columbus, OH.