Anticipation of clause-final heads.
Evidence from eye-tracking and SRNs
Lars Konieczny
Center for Cognitive Science, University of Freiburg, Germany
Philipp Döring
Center for Cognitive Science, University of Freiburg, Germany
To appear in: Proceedings of the ICCS/ASCS-2003 Joint International Conference on Cognitive Science.
Sydney, Australia 13 - 17 July 2003
Abstract
In a Simple Recurrent Network simulation and an
eye-tracking study, we investigated the processing of
clause-final verbs. According to the integration cost
hypothesis (Gibson, 1998), verbs should be harder to
process the more complement integrations have to take
place. In contrast, probabilistic prediction-based models,
like Simple Recurrent Networks (SRNs; Elman, 1990),
might anticipate verbs better the more dependents have
been encountered beforehand. We trained SRNs with a
subset of the German language to establish basic
dependency relationships between verbs and their
arguments in both verb-second and verb-final
constructions. The test results established a clear
anticipation hypothesis: the more arguments precede the
verb, the lower the prediction error and hence the
predicted reading times.
The data from an eye-tracking experiment confirm the
anticipation hypothesis: clause-final verbs are read faster
when an additional Dative, instead of a noun-modifying
Genitive, is read beforehand. Adverbial PP-adjuncts,
however, did not speed up reading of the verb relative to
noun-modifying PPs. In general, the results support a
restricted anticipation hypothesis.
Introduction
Working Memory and language
In German, like in many other languages, verbs are
placed at the end of subordinate clauses (1).
(1) Jan glaubte, dass der Gast dem Onkel das Auto
empfahl.
Jan believed that the guest_nom the uncle_dat the
car_acc recommended.
“Jan believed that the guest recommended the car
to the uncle.”
Processing verb-final constructions poses a number of
challenges to the human parser. Before the verb is
actually encountered, its arguments have already been
processed, and since the syntactic and semantic
relations they participate in are not known up until the
verb is encountered, their role in the relations is vague
at best. When the verb is eventually reached, it must be
integrated with all its dependents. This is arguably a
costly process. Gibson (1998), in his Dependency
Locality Theory (DLT), made integration cost one of two
central cost components. According to DLT, integration
is more costly the more dependencies have to be
established and the longer the distance that has to be
crossed when a dependency is established. DLT adopts the
single resource view on working memory (Just and
Carpenter, 1992), where all cost components consume
energy from the same energy pool. People vary with
respect to their memory capacity, and people with a
smaller capacity are affected more by harder
constructions than people with a larger capacity (King
& Just, 1991). The individual capacity can be estimated
by the reading span score (Daneman & Carpenter,
1980). According to DLT, low span readers should be
more affected by long distance integrations than high
spans.
An alternative view. MacDonald and Christiansen
(2002, henceforth MC) proposed a strikingly different
view on working memory. They adopt Elman’s (1990)
Simple Recurrent Network (SRN) approach to language
processing, which has been demonstrated to capture
linguistic regularities, including limited recursive rules,
from mere training on predicting the next word in the
sentence. In SRNs, processing is indistinguishable from
linguistic knowledge, its acquisition, and working
memory. MC show that SRNs can be used to predict
processing data, and their pattern of results apparently
supports their claim (but see Konieczny and Ruh, 2003,
for a critical discussion). In SRNs, as in other network
architectures, complexity arises primarily from (word
order) irregularity of the input. Importantly, more
experienced networks suffer less from irregular input.
MC hence proposed that the notion of memory capacity
be replaced by the subjects’ degree of linguistic
experience.
Applied to the question of clause-final verb processing,
SRNs do not take an obvious position, as the
constructions under consideration exhibit regular word
order. However, SRNs are stochastic devices that capture
the probabilistic structure of the input. Since load is
conceived of as an epiphenomenon of predictability,
verbs should generally be easier to process when they
are preceded by more dependents rather than fewer.
Provided that the sequence of preceding dependents
carries sufficient combinatory information, such as the
number and type of arguments, case, thematic roles etc.,
each additional dependent constrains the class of
potential continuations, hence increasing the likelihood
of the verb that actually follows (cf. Konieczny, 1996).
In the remainder of the paper, we first test this
hypothesis by training and testing an SRN with input
that is generated from a suitable probabilistic grammar.
Second, we report an eye-tracking study that addresses
the question of integration vs. anticipation effects on
clause-final verbs by varying the number of dependents
that have been processed beforehand.
Simulation
We ran a simulation study to establish an SRN-based
prediction for actual reading data.
Materials
Two corpora, one for training, the other for testing,
were generated with SLG (Rohde, 1999), which takes a
probabilistic constraint grammar and generates a
sentence corpus that respects the distributional
probabilities specified in the grammar.
Training corpus. The training corpus was generated
from sixty-three words, among them intransitive,
transitive and ditransitive verbs (taking Datives and
Accusatives), where the Dative was either obligatory or
optional. Nouns were either Nominatives, Genitives,
Datives or Accusatives. Genitives could be added after
any other noun phrase. The majority of sentences in the
training corpus were simple, verb-second main clauses,
whose purpose was to train basic verb dependency
relationships. In subordinate (verb-final) clauses, the
grammar permitted NP-arguments to occur in the order
NP_nom - NP_dat - NP_acc - verb. A total of 15,000
sentences was generated and ordered randomly. A run
through all 15,000 sentences is considered an epoch.
Test corpus. Forty sentences were generated, varying
the number of arguments in the subordinate clause.
Additionally, we added a genitive attribute to the main
clause’s Subject-NP in half of the sentences. Sentences
were constructed following the sequence NP_nom - “that” -
NP_nom - (NP_gen) - (NP_dat) - NP_acc - verb_subord. -
verb_matrix - NP_acc.
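The test-sentence schema can be spelled out as a small enumeration. The following is only an illustrative sketch of the printed category sequence with its two optional slots; it is not SLG code, and the function name and condition labels are hypothetical.

```python
from itertools import product


def test_sequences():
    """Enumerate the category sequences implied by the printed schema.

    The optional NP_gen and NP_dat slots yield four variants of the
    embedded clause. Labels such as "dat+" are illustrative only.
    """
    sequences = {}
    for has_gen, has_dat in product([False, True], repeat=2):
        seq = ["NP_nom", "that", "NP_nom"]
        if has_gen:
            seq.append("NP_gen")
        if has_dat:
            seq.append("NP_dat")
        seq += ["NP_acc", "verb_subord", "verb_matrix", "NP_acc"]
        key = ("dat+" if has_dat else "dat-", "gen+" if has_gen else "gen-")
        sequences[key] = seq
    return sequences


if __name__ == "__main__":
    for condition, seq in test_sequences().items():
        print(condition, " ".join(seq))
```

Filling each category with words from the lexicon is left out here, since the lexicon itself is not listed in the text.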
Method
Simulations were run on JavaNNS (© JavaNNS group, WSI,
University of Tuebingen). We built an SRN consisting of
64 input and 64 output units for the 63 words and the
end-of-sentence marker (EOS), and 128 hidden and
context units. The architecture of the network is shown
in Figure 1.
[Figure 1 diagram. Layers from bottom to top: 64 input units, 128 hidden units, 64 output units; 128 context units are linked to the hidden layer by a copy connection.]
Figure 1: Network architecture.
Parameter setting. Initial weights were randomly set
between -0.15 and +0.15. The learning rate was initially
set to 0.05 and reduced to 0.02 after 20 epochs to make
learning more efficient.
We ran 270 epochs, which means that the net encountered
a total of 4,050,000 training sentences. About every
10 epochs, the test set was run with learning turned off.
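To make the architecture concrete, the forward pass of such a network can be sketched in a few lines. This is a minimal illustration, not the JavaNNS implementation: the layer sizes and the weight range come from the text, whereas the logistic activation, the absence of bias units, the zero-initialised context layer, the one-hot word coding and all names are assumptions made for the sketch. Training (backpropagation) is omitted.

```python
import math
import random


def logistic(x):
    """Logistic activation (an assumption; the text does not name one)."""
    return 1.0 / (1.0 + math.exp(-x))


class SimpleRecurrentNetwork:
    """Forward pass of an Elman-style SRN with a copy-back context layer."""

    def __init__(self, n_in=64, n_hidden=128, n_out=64, init_range=0.15, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            # Initial weights drawn uniformly from [-init_range, +init_range].
            return [[rng.uniform(-init_range, init_range) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)       # input -> hidden
        self.w_ctx = matrix(n_hidden, n_hidden)  # context -> hidden
        self.w_out = matrix(n_out, n_hidden)     # hidden -> output
        self.context = [0.0] * n_hidden          # assumed start state

    def step(self, word_index):
        """Process one word (one-hot coded) and return the output activations."""
        hidden = []
        for j, ctx_row in enumerate(self.w_ctx):
            # With one-hot input, only one input weight per hidden unit matters.
            net = self.w_in[j][word_index]
            net += sum(w * c for w, c in zip(ctx_row, self.context))
            hidden.append(logistic(net))
        output = [logistic(sum(w * h for w, h in zip(row, hidden)))
                  for row in self.w_out]
        self.context = hidden[:]  # copy hidden state for the next time step
        return output


if __name__ == "__main__":
    net = SimpleRecurrentNetwork()
    for idx in [3, 17, 42]:  # arbitrary word indices
        out = net.step(idx)
        print(idx, round(max(out), 3))
```

Because the context layer carries the previous hidden state, presenting the same word twice in a row generally yields different outputs, which is what allows the network to condition its next-word prediction on earlier dependents.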
Like MC, we calculated the Grammaticality Prediction
Error (GPE) for each output to evaluate word-by-word
performance. The GPE takes into account hits (i.e., the
sum of the activations of all grammatical output units),
false alarms (i.e., the sum of the activations of all
ungrammatical output units) and misses (i.e., the summed
amount by which grammatical units are activated less
than their grammatical probability calls for):
(2) $\mathrm{GPE} = 1 - \dfrac{\mathit{hits}}{\mathit{hits} + \mathit{false\ alarms} + \mathit{misses}}$
The GPE returns a value between 0 and 1. The GPE
serves as an estimate of cognitive load and hence,
reading times.
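As an illustration of (2), the GPE for a single output vector might be computed as follows. This is a sketch under stated assumptions rather than the scoring code used in the simulation: a unit counts as grammatical when the grammar assigns it a non-zero probability, and a miss is taken to be the shortfall of a grammatical unit's activation relative to that probability, without any rescaling. The function name is hypothetical.

```python
def gpe(activations, grammatical_probs):
    """Grammaticality Prediction Error for one output vector.

    activations:       output activation per unit (one unit per word)
    grammatical_probs: probability the grammar assigns to each word as the
                       next word (0 for ungrammatical continuations)
    """
    hits = false_alarms = misses = 0.0
    for a, p in zip(activations, grammatical_probs):
        if p > 0:
            hits += a
            # Assumption: shortfall relative to the unscaled probability.
            misses += max(0.0, p - a)
        else:
            false_alarms += a
    total = hits + false_alarms + misses
    if total == 0:
        raise ValueError("no activation and no grammatical continuation")
    return 1.0 - hits / total


if __name__ == "__main__":
    # Two grammatical continuations (p = 0.5 each), one ungrammatical unit.
    print(gpe([0.6, 0.2, 0.2], [0.5, 0.5, 0.0]))
```

In the example call, hits are 0.8, false alarms 0.2 and misses 0.3, so the error is 1 - 0.8/1.3; an output that reproduces the grammatical distribution exactly yields an error of 0.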
Hypotheses
SRNs learn to anticipate upcoming words based on
previous input. The more constraining the input, the
lower the error should be. The anticipation of verbs
should thus benefit from processing their complements
beforehand. We expect smaller GPEs with an
increasing number of complements, with distance kept
constant. Furthermore, if experience matters,
predictions should get better with more epochs,
especially for more distant complements.
Results
As shown in Figure 2, predictions of the subordinate
verb continuously got better over 270 epochs. After
about 45 epochs, predictions benefit from having
processed a Dative earlier in the clause. An additional
Genitive does not seem to have any substantial impact.
[Figure 2 plot: GPE (y-axis, 0 to 1) over training epochs (x-axis, 0 to 300) for the four conditions dat-/gen-, dat-/gen+, dat+/gen-, dat+/gen+.]
Figure 2: GPEs at the subordinate verb show an
advantage of dative-clauses after about 45 epochs.
The effect of an additional Dative reaches its maximum
shortly after it emerges, at about 50 epochs, and then
shrinks slowly. While the GPE still gets smaller even
after 200 epochs, anticipation does not improve with
more experience.
We tried to nail down the source of the advantage for
the dative-sentences. In order to do this, we compared
the cumulated actual activations of the optional-dative
verbs and the obligatory-dative verbs with the
predictions of the stochastic grammar of SLG. Thus, we
could examine whether the error was caused by the
crucial verbs or some other factor. The network
managed to come very close to the probability
distribution in the grammar for the sentences with a
Dative, whereas it had some problems with the
sentences missing a Dative. In the latter case, the net
still predicted verbs with an obligatory Dative to a
certain degree. Interestingly, the network also missed
the number agreement more often here.
Discussion
This result confirms the anticipation hypothesis for
SRNs. An additional dative improved the prediction of
the right verb. Not only did it constrain the class of
verbs, but the additional information apparently also
improved the quality of the prediction in general, so
that number agreement was better than without a
Dative. In sentences without a Dative, the Dative could
still come later in the string (although the strict word
order in the training set should have made this
possibility unlikely), which increases the degree of
uncertainty and hence weakens the verb prediction.
Given the SRN-based simulation data, we were
interested in on-line reading data to test their
predictions. If SRNs are adequate models for human
sentence processing, one would expect real subjects to
perform similarly. Readers should therefore show an
anticipation effect on clause-final verbs, which should
not interact with reading span.
Experiment
The experiment was designed to test the anticipation
hypothesis against the integration cost hypothesis.
Design and materials
Sentences were constructed according to the following
schema: NP_nom - “that” - NP_nom - NP_gen or NP_dat -
NP_acc - PP_N-mod or PP_V-mod - verb_subord. -
verb_matrix - NP_acc. The experimental design included
two factors: Case of 2nd NP: dative (3,4) vs. genitive
(5,6), and Type of PP: noun-modifying (3,5) vs.
verb-modifying (4,6).
(3) 2nd NP: Dative
PP: Noun-modifying
Die Einsicht, dass / der Freund / dem Kunden / das
Auto / aus Plastik / verkaufte, / erheiterte / die
Anderen.
The insight, that / the friend / the client / the car /
(made) from plastic / sold, / amused / the others.
“The insight that the friend sold the car made from
plastic to the client amused the others.”
(4) 2nd NP: Dative
PP: Verb-modifying
Die Einsicht, dass / der Freund / dem Kunden /
das Auto / aus Freude / verkaufte, / erheiterte / die
Anderen.
The insight, that / the friend / the client / the car /
just for fun / sold, / amused / the others.
“The insight that the friend sold the car to the
client just for fun amused the others.”
(5) 2nd NP: Genitive
PP: Noun-modifying
Die Einsicht, dass / der Freund / des Kunden / das
Auto / aus Plastik / verkaufte, / erheiterte / die
Anderen.
The insight, that / the friend / (of) the client / the
car / (made) from plastic / sold, / amused / the
others.
“The insight that the friend of the client sold the
car made from plastic amused the others.”
(6) 2nd NP: Genitive
PP: Verb-modifying
Die Einsicht, dass / der Freund / des Kunden / das
Auto / aus Freude / verkaufte, / erheiterte / die
Anderen.
The insight, that / the friend / (of) the client / the
car / just for fun / sold, / amused / the others.
“The insight that the friend of the client sold the
car just for fun amused the others.”
Twenty sentence sets (five per condition) were
constructed following the pattern in (3) to (6).
Procedure
Prior to the experiment participants were tested on their
reading span, using the German test implemented by
Hacker, Handrick, and Veres (1996). Participants read
blocks of sentences on a computer screen. Each sentence
of a block was displayed for five seconds. At the end of
a block, participants had to write down the last word
and a short two-word description of each sentence.
Block size varied from two to eight sentences. Each
complete block was scored as a point.
After this test they were instructed about the
experimental procedure and then fitted to a head-rest to
prevent head movements during reading. They were
told to read at a normal pace. After a brief calibration
procedure they read five filler sentences to get used to
the setting. Participants read a total of 152 randomly
ordered sentences, twenty of which were targets. After
blocks of twenty sentences each, the calibration was
redone and the experiment continued with the next
block, starting with a filler sentence. Before a sentence
was presented, participants had to fixate on a
cross-marking on the screen which indicated the position
of the first character. As soon as they did so, the
cross-marking was erased and the sentence displayed.
After they finished reading, participants had to press a
button which replaced the sentence with a simple yes/no
question. This could be answered by pressing one of
two buttons. They answered with a high degree of
accuracy (91%), which did not vary across conditions.
Apparatus. Eye movements were monitored by a
Generation 5.5 Dual Purkinje Image Eye-tracker.
Viewing was binocular, but eye movements were
recorded only from the right eye. The eye-tracker was
connected to an Intel Pentium computer which
controlled the stimulus presentation and stored the
output from the eye-tracker. The sampling rate for data
collection was 1 kHz. The sentences were presented on
a 20-inch colour monitor, beginning in the sixth line.
The subject was seated 83 cm from the face of the
screen, so that 3 letters equalled about 1 degree of
visual angle. External distractions and light reflections
were screened off by a black tube and the room was
slightly darkened.
Dependent Variables and Data Analysis. The
eye-movement data were summarized with respect to
regions as indicated by the slashes in sentences (3) to
(6). For each region we report regression path durations
(RPDs) per word. RPDs represent the time between
entering and going past a region for the first time. They
include first pass reading times, but are extended if
there is a regressive saccade after first pass reading.
RPDs have been demonstrated to be most sensitive to
complexity and garden-pathing effects (Konieczny,
Hemforth, Scheepers, & Strube, 1997). First pass
reading times below 100 milliseconds were treated as
overshoots and added to the reading time on the
previously fixated region. Zero RPDs were treated as
missing values (conditionalized analysis). After the
experiment, participants had to answer some questions
about their reading habits.
Hypotheses
If anticipation is the dominant mechanism, the
prediction of a verb should benefit from processing its
complements, and possibly adverbs, beforehand. Hence
there should be shorter reading times on the embedded
verb for more complements, with distance kept
constant. Closer dependents might have a bigger impact
than more distant ones, i.e. PP complements should
speed up reading the verb more than datives.
We expect that even low spans will have sufficient
linguistic experience to exhibit the anticipation effect.
Given the simulation results then, there is no prediction
of an interaction of span and anticipation.
If integration cost is the dominant factor, integrating
verbs with more complements should be harder than
integrating verbs with fewer complements. The dative
should impose a particularly strong cost component, as
integration has to cross three new discourse entities.
DLT predicts low span subjects to be more affected by
integration cost than high spans. We should therefore
expect an interaction of integration cost with reading
span at the subordinate verb.
Results
Twenty-three students from the University of Freiburg
were paid 7.50 € or received course credits for
participation. One participant had to be excluded
because of too many track losses.
Figure 3 illustrates mean reading times across the
embedded clause in relevant regions, and at the
matrix verb.
[Figure 3 plot: mean RPDs per word in msec (y-axis, 0 to 900) by region (NP nom, NP dat/gen, NP acc, PP N-mod/V-mod, subord. verb, matrix verb) for the four NP2 / PP conditions: Dative N-mod, Dative V-mod, Genitive N-mod, Genitive V-mod.]
Figure 3: Average regression path durations (RPD) per
word exhibit an advantage at the subordinate verb for
clauses with a dative NP.
Reading times at the embedded verb were submitted to
a two-factorial MANOVA for repeated measures. Table
1 shows the distribution of means. RPDs were shorter
(229 ms on average) when a Dative, instead of a
Genitive, was read beforehand (F1(1,21)=7.4,
MSe=155590, p<.05; F2(1,19)=5.378, MSe=277240,
p<.05). PP-type, however, had no reliable impact on
reading the embedded verb, although there was a
numerical advantage of 59 ms for noun-modifying PPs,
and the two factors did not interact (all Fs<1.2).
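The two numerical effects quoted above can be recovered from the four cell means reported in Table 1. The following sketch simply averages the cell means per factor level; it assumes unweighted averaging over cells, which reproduces the reported differences but is not the authors' analysis script, and all names are illustrative.

```python
# Cell means (msec) as reported in Table 1: (case of 2nd NP, PP type) -> RPD.
cell_means = {
    ("dative", "N-mod"): 555,
    ("dative", "V-mod"): 623,
    ("genitive", "N-mod"): 793,
    ("genitive", "V-mod"): 843,
}


def marginal_mean(cells, factor, level):
    """Unweighted mean over all cells whose key matches `level` at `factor`."""
    values = [rpd for key, rpd in cells.items() if key[factor] == level]
    return sum(values) / len(values)


# Main effect of case: genitive minus dative.
case_effect = (marginal_mean(cell_means, 0, "genitive")
               - marginal_mean(cell_means, 0, "dative"))
# Numerical effect of PP type: verb-modifying minus noun-modifying.
pp_effect = (marginal_mean(cell_means, 1, "V-mod")
             - marginal_mean(cell_means, 1, "N-mod"))

if __name__ == "__main__":
    print(case_effect, pp_effect)
```

With these cell means, the dative advantage comes out at 229 ms and the noun-modifying PP advantage at 59 ms, matching the values given in the text.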
Table 1: Average RPDs at the subordinate verb.
2nd NP      PP        RPD (msec)
Dative      N-mod.    555
Dative      V-mod.    623
Genitive    N-mod.    793
Genitive    V-mod.    843
Reading span. Three groups of about equal size (six to
eight participants) were built from the reading span
score. There was no reliable interaction of span group
with any other factor.
Discussion
The results clearly disconfirm the integration cost
hypothesis. Reading times were shorter, not longer,
when an additional Dative had to be integrated. Instead,
this finding supports the anticipation hypothesis, as
processing an additional argument facilitated verb
processing.
The lack of a PP-effect is compatible with neither the
integration cost hypothesis nor the unrestricted
anticipation hypothesis. Note, however, that both effects
could have masked each other here. It is possible that
facilitation by anticipation and integration cost both
exist, but in varying strengths, so that one might
dominate the other in one circumstance and vice versa in
another. This assumption is questionable though, as
integration cost should be highest, and anticipation
lowest, for the more distant NP manipulation, and vice
versa for the local PP variation. We should therefore
have encountered an integration cost effect for Genitives
vs. Datives, and an anticipation effect for noun-modifying
vs. adverbial PPs. The actual data are much closer to
the opposite pattern though.
Note, however, that the lack of a PP effect may be due
to a number of inherent reasons. First, adverbs
generally impose weaker constraints on verbs than their
arguments. Second, the distance to the verb and hence
the time left for actually imposing its impact might have
been too short for the PP. Third, the PPs in the materials
may not have been modifying the verb or the noun as
unambiguously as possible. PPs, as opposed to NPs, are
not morphologically marked as verb-arguments in
German. In many cases, attachment was merely biased
towards one of the alternatives through its plausibility
rather than strictly permissible or not. Subjects thus
might have kept their initially preferred interpretation,
regardless of the intended bias. The PP results should
therefore be taken with a grain of salt.
The lack of group effects was predicted by the
SRN simulation results. However, group sizes are too
small to draw firm conclusions from this null effect.
General Discussion
We presented results from an SRN simulation which
confirmed that SRNs anticipate items based on their
preceding dependents. Adding a Dative did improve the
accuracy of verb predictions. The eye-tracking results
confirmed the anticipation hypothesis.
The data presented here are in line with results
previously reported by Konieczny (2000). Konieczny
found reading times of clause-final verbs to be shorter
when integration had to cross a longer distance to their
arguments. Distance was manipulated by attaching a
relative clause to the direct object, and by adding an
adverbial PP (a directional locative). Konieczny
interpreted the results as evidence for anticipation of
verbs on the basis of information added to one of its
arguments (by the RC), or by adding an argument itself
(the directional locative). As this result had a potential
confound, namely the position in the sentence (words at
the end may be read faster anyway), it is worthwhile
mentioning that in the same study, he found that
relative pronouns did not benefit from being placed
closer to the end of the clause and more distant from
their host. On the contrary, they were read more slowly
there, indicating increased integration cost. Konieczny
therefore distinguished predictable (e.g. verbs) from
non-predictable items (e.g. relative pronouns). This
result was confirmed by Konieczny and Bormann
(2001), who added a focus particle to the host of the
RC, making the relative pronoun predictable. In this
setting the integration cost effect disappeared.
Furthermore, there is evidence by Vasishth (2002), who
found that clause-final verbs in Hindi were read faster
when an adverb was added. While this result seems to
be at odds with our current finding of no adverb
effect, note that this lack could have been produced by
various experimental factors as discussed earlier.
Vasishth (2002) proposed an ACT-R based (re-)
activation model, where the arguments and the verb
prediction can be retrieved better the more they get
reactivated by additional complements or modifiers. It
will be hard to distinguish his concept of reactivation
from our concept of anticipation on an empirical basis.
However, note that in ACT-R activation decay is
estimated by b - 0.5 ln(time) (with b being an arbitrary
initial value). Decay is hence steepest during earlier
periods and successively flattens later. Reactivation
should therefore have a stronger impact if it took place
recently rather than earlier in the sentence. We would
therefore expect the PP adverbial adjacent to the verb to
have a much more dramatic effect than the addition of
an NP complement further upstream. The present
results suggest that the opposite is the case.
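The shape of the decay function mentioned above can be checked numerically. The sketch below evaluates b - 0.5 ln(time) at a few points; the time unit, the value of b and the function name are arbitrary choices for illustration and are not taken from Vasishth's model.

```python
import math


def base_activation(time, b=0.0, decay=0.5):
    """Activation after `time` units according to b - decay * ln(time)."""
    return b - decay * math.log(time)


# Loss of activation over one time unit, early vs. late after an activation.
early_drop = base_activation(1) - base_activation(2)    # 0.5 * ln(2)
late_drop = base_activation(10) - base_activation(11)   # 0.5 * ln(1.1)

if __name__ == "__main__":
    print(round(early_drop, 4), round(late_drop, 4))
```

The drop between time 1 and 2 is several times larger than the drop between time 10 and 11, which is the sense in which decay is steepest early on and why a recent reactivation should matter more than a distant one under this account.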
We finally want to point out that the results are in line
with Konieczny’s (1996) anticipation proposal. In his
doctoral thesis, Konieczny proposed an incremental
processor based on an HPSG grammar, where linguistic
knowledge is represented in highly interconnected
graphs. According to this model, each complement
integrated into the sentence structure (by means of
unification) adds information to the prediction of the
verb. The enriched prediction constrains the class of the
verb to come and facilitates later integration, possibly
by providing rich cues for lexical access and retrieval of
dependents. While this account is straightforward for
complements, integrating pre-verbal adjuncts is less
direct and hence less efficient. Note that this view is
perfectly consistent with the present pattern of data:
The dative complement facilitated integration, whereas
the adverbial PP did not.
To sum up, the data support anticipation models such as
SRNs, but in a constrained way, such that arguments
predict upcoming words more strongly than adverbs do.
Conclusion
We have argued that clause-final verb integration can
be facilitated by anticipatory mechanisms. SRNs are
instances of stochastic devices that could be shown to
realize anticipation of late heads based on their earlier
complements. The data reported support the
anticipation hypothesis and disconfirm integration cost
as the main component of cognitive load. While the
results are at least modestly compatible with a variety
of models, a valid approach must consider that the
specificity, not the distance of a dependent, determines
the degree of anticipation. That is, NP arguments elicit
a facilitation effect even when they are distant from the
verb, whereas adverbial PPs do not necessarily do so,
even when they are adjacent. Future research will have to
clarify this issue.
Acknowledgments
We want to thank Simone Burgi, Heidi Fischer and
Sven Eric Hiss for their assistance in the construction of
the materials, and Sarah Schimke, Felix Schrape and
Kerstin Botsch for running the experiment. We are
grateful to Barbara Hemforth for her valuable
comments on an earlier version of this paper, and to
all the participants of the connectionist cognitive
modeling class, headed by the first author in winter
2002, for their lively and fruitful discussions. All
remaining errors are our own, of course.
References
Daneman, M., & Carpenter, P. A. (1980). Individual
differences in working memory and reading. Journal
of Verbal Learning and Verbal Behavior, 19,
450-466.
Elman, J. L. (1990). Finding structure in time.
Cognitive Science, 14, 179-211.
Gibson, E. (1998). Linguistic complexity: Locality of
syntactic dependencies. Cognition, 68, 1-76.
Hacker, W., Handrick, S., & Veres, T. (1996).
Lesespannentest. Manuscript, University of Dresden.
Just, M. A., & Carpenter, P. A. (1992). A capacity
theory of comprehension: Individual differences in
working memory. Psychological Review, 99, 122-149.
King, J., & Just, M. A. (1991). Individual differences in
syntactic processing: The role of working memory.
Journal of Memory and Language, 30, 580-602.
Konieczny, L. (1996). Human sentence processing: A
semantics-oriented parsing approach. Doctoral
thesis. IIG-Berichte, 6-96. Albert-Ludwigs-University,
Freiburg, Germany.
Konieczny, L. (2000). Locality and parsing complexity.
Journal of Psycholinguistic Research, 29, 627-645.
Konieczny, L., & Bormann, T. (2001). Extraposition
and anticipation of relative clauses. Manuscript,
Centre for Cognitive Science, University of Freiburg.
Konieczny, L., Hemforth, B., Scheepers, C., & Strube,
G. (1997). The role of lexical heads in parsing:
Evidence from German. Language and Cognitive
Processes, 12, 307-348.
Konieczny, L., & Ruh, N. (2003). What’s in an error? A
reply to MacDonald & Christiansen (2002).
Manuscript submitted, University of Freiburg.
MacDonald, M. C., & Christiansen, M. H. (2002).
Reassessing working memory: Comment on Just and
Carpenter (1992) and Waters and Caplan (1996).
Psychological Review, 109, 35-54.
Rohde, D.L.T. (1999). The Simple Language
Generator: Encoding complex languages with simple
grammars. Technical Report CMU-CS-99-123,
Carnegie Mellon University, Department of
Computer Science, Pittsburgh, PA.
Vasishth, S. (2002). Working memory in sentence
comprehension: Processing Hindi center embeddings.
Unpublished doctoral dissertation, Ohio State
University, Columbus, OH.