
Acta Psychologica 137 (2011) 181–189
On the temporal dynamics of language-mediated vision and
vision-mediated language
Sarah E. Anderson a,⁎, Eric Chiu b, Stephanie Huette b, Michael J. Spivey b
a Department of Psychology, Cornell University, Ithaca, NY 14853, United States
b Cognitive and Information Sciences, University of California, Merced, United States
Available online 18 October 2010
Keywords:
Psycholinguistics
Visual perception
Eye-tracking
Abstract
Recent converging evidence suggests that language and vision interact immediately in non-trivial ways,
although the exact nature of this interaction is still unclear. Not only does linguistic information influence
visual perception in real-time, but visual information also influences language comprehension in real-time.
For example, in visual search tasks, incremental spoken delivery of the target features (e.g., “Is there a red
vertical?”) can increase the efficiency of conjunction search because only one feature is heard at a time.
Moreover, in spoken word recognition tasks, the visual presence of an object whose name is similar to the
word being spoken (e.g., a candle present when instructed to “pick up the candy”) can alter the process of
comprehension. Dense sampling methods, such as eye-tracking and reach-tracking, richly illustrate the
nature of this interaction, providing a semi-continuous measure of the temporal dynamics of individual
behavioral responses. We review a variety of studies that demonstrate how these methods are particularly
promising in further elucidating the dynamic competition that takes place between underlying linguistic and
visual representations in multimodal contexts, and we conclude with a discussion of the consequences that
these findings have for theories of embodied cognition.
© 2010 Elsevier B.V. All rights reserved.
“The divisions in thought are thus given disproportionate
importance, as if they were a widespread and pervasive structure
of independently existent breaks in ‘what is,’ rather than merely
convenient features of description and analysis.”
David Bohm (Wholeness and the Implicate Order, 1980, p.34)
1. Introduction
In the late 1970s, the quantum physicist David Bohm found himself
compelled to write about how his hidden-variable field theory of
quantum mechanics implied that mind and matter must be undivided
in time and in space. Around the same time, James J. Gibson (1979) was
reminding psychologists that the “inflow of stimulus energy does not
consist of discrete inputs,” and that, “stimulation is not momentary.” Nor is
action momentary. At the core of the theoretical insights that these two
luminaries were promoting is the simple fact that our environmental
sensory input is continuously changing, and that our motor output is
continuously changing. In fact, many of those changes in environmental
sensory input are the causes of those changes in motor output, such as
when an object that has become salient draws a saccadic eye movement to
it. But it works the other way around as well. Many of those changes in
motor output are the causes of those changes in environmental sensory
input, such as when that new eye position suddenly places an alternative
object close enough to the fovea to also become salient. Determining
which is the chicken and which is the egg in this “action–perception cycle”
(Neisser, 1976) is often impossible.
There are times when those inputs and outputs may appear to change
rather abruptly, seemingly discontinuously. For example, it is traditionally
assumed that saccadic eye movements are ballistic and straight, and that
sensory input is briefly cut off during the saccade's trajectory. However, at
the time scale of dozens or hundreds of milliseconds, one readily observes
that even those sudden changes are underlyingly composed of smooth
continuous changes, with no genuine discontinuities taking place. For
example, saccades can actually curve as a result of multiple salient
locations in space (Theeuwes, Olivers, & Chizk, 2005; see also Emberson,
Weiss, Barbosa, Vatikiotis-Bateson, & Spivey, 2008), and certain aspects of
sensory input can actually get processed during the saccade itself
(Macknik, Fisher, & Bridgeman, 1991). Thus, there appears to be a
fundamental continuity in time and space (an undividedness, if you will)
associated with the action–perception cycle that drives the mutual
interaction in which the organism and environment engage (e.g.,
Chemero, 2009; Spivey, 2007; Turvey, 2007).
What exactly are the consequences that this perspective has on
how we understand vision, attention, working memory, and linguistic
processing? An answer to that question comes readily when one
considers what happens in a natural everyday conversation between
two people. At first glance, it may seem as though a sequence of
spoken words from the speaker delivers a series of discontinuous
auditory inputs to the listener's language system, and the listener's
eye movement pattern delivers a series of discontinuous visual inputs
(i.e., snapshots) to the listener's visual system. This idealized characterization has encouraged many traditional cognitive psychologists to
assume that the listener's mind is dealing with a sequence in which one
word is auditorily recognized, then the next word, and the next, and
simultaneously one fixated object is visually recognized, then the next
object, and the next. The consequence that a continuous perception–action
cycle has for our view of vision, attention, working memory, and linguistic
processing is that we have to let go of that idealized characterization of
visual and linguistic comprehension as producing a linear string of
symbolic representations. For example, spoken word recognition has such
a gradual time course to it (Allopenna, Magnuson, & Tanenhaus, 1998; see
also McMurray, Tanenhaus, Aslin, & Spivey, 2003) that a new word is often
being heard while the recognition process of the previous word has not
yet run its full course. Thus, the completion of the recognition process for a
given word can actually be contextually influenced by words that are
spoken after that word (Dahan, 2010). Similarly, visual object recognition
has a gradual time course to it (Rolls & Tovee, 1995), such that the
completed neural activation pattern associated with fully-accomplished
recognition of a foveated object (taking approximately 400 ms) is often
not allowed to quite reach completion before the next eye movement is
triggered (approximately every 300 ms). (These not-quite-completed
representations of visual objects are similar to the notion of “good-enough
representations” in sentence processing; Ferreira, Bailey, & Ferraro, 2002).
Additionally, with recent evidence of nonclassical visual receptive fields
that functionally integrate lateral and feedback projections (Gallant,
Connor, & Van Essen, 1998), it becomes clear that, like spoken word
recognition, every object recognition event is being richly contextualized
by its surrounding contours and shapes, even at the earliest stages of its
visual cortical processing (e.g., Grosof, Shapley, & Hawken, 1993; Lee &
Nguyen, 2001). Thus, the consequences that a continuous perception–
action cycle has for our view of vision, attention, working memory, and
linguistic processing are that the “sequences” of linguistic and visual
elements that a mind receives during an everyday conversation are not
sequences at all. The sensory inputs to language and vision are continuous
flowing streams of partially-overlapping, not-quite-divisible elements,
because adjacent words have co-articulation, because multiple objects
are often attended simultaneously, and because sensory processing itself
does not function in discrete time.
With a language subsystem and a vision subsystem each taking in its
input stream and producing its output stream continuously, the question
arises of how they combine these data streams for understanding
environmental situations that intermingle both linguistic and visual
properties. The pervasive modularity assumption of decades ago (Fodor,
1983) has been compromised by findings in a number of fields, time and
time again (for reviews, see Bechtel, 2003; Driver & Spence, 1998; Farah,
1994; Lalanne & Lorenceau, 2004; Spivey, 2007). However, psycholinguists and vision researchers still have a habit of treating their own
favorite mental faculty as though it operates independently of all other
mental faculties. Many psycholinguists readily accept the idea that
language comprehension can dramatically influence visual cognition,
but when they are faced with claims that vision can influence language
processing, they bristle. Inversely, many vision scientists accept with
aplomb the idea that visual perception can profoundly influence
language comprehension, but when you suggest to them that language
could profoundly influence vision, they suddenly become skeptics.
Even the researchers who report evidence of contextual interactions between language and vision tend to betray their implicit biases
with their choice of terminology. When psycholinguists study cognitive
processing in environments that contain visual and linguistic signals, they
refer to the linguistic signal as the target stimulus and the visual signal as
the context. When vision researchers study cognitive processing in
environments that contain visual and linguistic signals, they refer to the
visual signal as the target stimulus and the linguistic signal as the context.
But it is often essentially the same environment and task! Take, for
example, a case where the visual display contains multiple objects and a
spoken linguistic query or instruction is intended to guide the participant's
eyes to one particular object in the display. If we conceive of this as a
spoken language comprehension task, where a stream of coarticulated
phonemes drives the activation of multiple competing lexical representations (McClelland & Elman, 1986), then the visual signal clearly acts as an
additional constraint to influence that process (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). However, if we conceive of this
same task as one of visual selection, where a saliency map representing
several interesting objects drives eye movements to those objects (Itti &
Koch, 2001), then it is obviously the linguistic signal that is acting as an
additional contextual constraint that influences that process (Spivey,
Tyler, Eberhard, & Tanenhaus, 2001).
The brain, however, is not paying attention to these labels of “target”
and “context.” A brain that participates in a visuolinguistic experiment has
no idea that it is in a psycholinguistics lab or in a visual perception lab. That
brain doesn't know that the lab's principal investigator is chiefly interested
in language and is treating the visual stimuli as context, or vice versa. The
brain is simply being guided by situational constraints to map an array of
multimodal sensory inputs onto a limited set of afforded actions —
hopefully in a fashion similar to how it normally does that in everyday life.
To the human brain, vision and language are roughly equally relevant
signal streams (among several others) that can mutually constrain one
another continuously in real-time.
The temporal continuity from sensory to motor processing, and the
multiple partially active shared representations in language, vision, and
action, which we will discuss throughout this article, all suggest that
cognition cannot help but be embodied, and thus constrained by the
actions available to the organism (Barsalou, 2008; Chemero, 2009; Spivey,
2007). Many examples of perceptual and cognitive processing, which we
will review in more depth later, suggest that cognition is unavoidably
shaped and nuanced by the sensorimotor constraints of the organism
coupled with its environment (Gibson, 1979; Turvey, 2007). In this article,
we review some of these findings of language influencing vision and of
vision influencing language, with a special focus on the role played in
these discoveries by a broad methodological approach of “dense
sampling.” Particularly among the recent studies of vision influencing
language, researchers have used eye-tracking and reach-tracking to
record multiple samples per trial (not just a reaction time at the end of a
trial), and have found unique types of evidence consistent with multiple
competing partially active representations midway through understanding a visually-contextualized spoken instruction. Our review concludes
with a discussion of the theoretical consequences that this evidence has
for modular vs. interactive accounts of the mind, and for recent ideas
about the embodiment of cognition.
2. Language influences vision
In Fodor's (1983) modularity thesis, two of the most important
properties of modules are: 1) information encapsulation, a property
whereby separate perceptual modules, with independent and specific
functional purposes, do not share the intermediate products of their
information processing directly with one another, and 2) domain
specificity (or “functional specialization,” Barrett & Kurzban, 2006), a
property whereby modules are limited in “the range of information they
can access” (Fodor, 1983, p.47). Since the time that thesis was put forth,
there have been quite a few experimental demonstrations of real-time
perceptual interactions among vision, touch, audition, and linguistic
systems that clearly violate information encapsulation. Importantly, when
information encapsulation is compromised among modules with putatively-independent functions, those modules unavoidably begin processing one another's information sources to some degree. As a result, they
are no longer truly functionally specialized either.
Crossmodal interactions among perceptual systems have been
discussed in philosophy and psychology for some time. For example,
in the 18th century, George Berkeley suggested that visual perception
of space is deeply influenced by tactile experience. However, it wasn't
until the end of the 20th century that experimental laboratory findings
began to clearly demonstrate the powerful effect that tactile input can
have on visual perception. For example, tactile stimulation of the left or
right index finger improves speed and accuracy in a visual discrimination
task inside the corresponding hemifield of the visual display (Spence,
Nicholls, Gillespie, & Driver, 1998). Moreover, the neural activation
resulting from that tactile cuing of visual performance is detectable in
neuroimaging of visual cortex (Macaluso, Frith, & Driver, 2000).
Further evidence for non-modular functioning of the visual system
is seen with perceptual interactions between vision and audition. For
example, when a single flash of light is accompanied by two auditory
beeps, it is frequently mis-perceived as two flashes of light (Shams,
Kamitani, & Shimojo, 2000). Moreover, when a leftward-moving disk
and a rightward-moving disk are animated on a computer screen such
that they pass through each other and continue, the baseline perception of
the event is that they passed by each other on slightly different depth
planes. However, if a simple 2.5 millisecond auditory click is delivered at
the point of visual coincidence, observers routinely perceive this same
dynamic visual display as an event where two disks bounce off of each other
and reverse their directions of movement (Sekuler, Sekuler, & Lau, 1997).
There is also evidence for the interaction of vision with so-called
“high level” cognitive processes like language. For example, ascribing
linguistic meaning to an otherwise arbitrary visual stimulus facilitates
performance in visual categorization (Goldstone, Lippa, & Shiffrin,
2001; Lupyan, 2008) and in visual search (Lupyan & Spivey, 2008). In
Lupyan and Spivey's (2008) experiment, some participants were
explicitly instructed to apply a meaningful label to a novel visual stimulus
in a visual search task. As a result, participants who used that label
performed the search faster and more efficiently (i.e., shallower slope of
reaction time by number-of-distractors). Similarly, concurrent delivery of
an auditory label with a noisy visual stimulus (e.g., hearing the name of a
letter when performing a signal detection task with a very low-contrast
image of a letter) has been shown to produce a significantly greater visual
sensitivity measure (i.e., d-prime) specifically when the verbal cue
matches the visual stimulus (Lupyan & Spivey, 2010). These data suggest
that there is a top-down conceptual influence on visual perception, such
that seeing not only depends on what something looks like, but also on
what it means.
Something akin to this idea of top-down guidance of visual processing
was implemented in the form of Wolfe's (1994) Guided Search model,
which imposed some important revisions to Treisman and Gelade's
(1980) original Feature Integration Theory of visual search. As a purely
feed-forward account, the original Feature Integration Theory proposed
that a pre-attentive first-stage of visual search processed its input in
parallel with topographic maps devoted to the detection of individual
features, but that a second-stage attentional system (the integrative
master map) performed its search in a serial fashion for objects that
conjoined multiple partially-distinguishing features. With many types of
stimuli, this model correctly predicted that: a) search for a target that is
distinguished from its distractors by a single feature produces a roughly flat
slope of reaction time by set-size (hence, parallel processing of the
display), and b) search for a target that is distinguished from its distractors
by a conjunction of features produces a linearly increasing slope of reaction
time by set-size (hence, serial processing of the display).
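To make these two qualitative predictions concrete, the following minimal simulation (our own illustration, not a published implementation of the theory; the 400 ms base time, 25 ms per-item cost, and noise level are assumed values) generates the flat and linearly increasing RT-by-set-size functions predicted for feature and conjunction search, respectively:

```python
import random

def simulate_rt(set_size, search_type, base_rt=400.0, per_item_cost=25.0):
    """Toy reaction times (ms) under the classic Feature Integration Theory story.

    Assumed parameters: a 400 ms base RT and a 25 ms per-item inspection cost
    for serial conjunction search; both are illustrative, not fitted to data.
    """
    if search_type == "feature":
        # Pre-attentive parallel stage: RT is roughly independent of set size.
        return base_rt + random.gauss(0, 20)
    elif search_type == "conjunction":
        # Serial self-terminating search: on average half the items are inspected.
        items_inspected = (set_size + 1) / 2.0
        return base_rt + per_item_cost * items_inspected + random.gauss(0, 20)
    raise ValueError("search_type must be 'feature' or 'conjunction'")

for n in (4, 8, 16, 32):
    feat = sum(simulate_rt(n, "feature") for _ in range(200)) / 200
    conj = sum(simulate_rt(n, "conjunction") for _ in range(200)) / 200
    print(f"set size {n:2d}: feature ~{feat:.0f} ms, conjunction ~{conj:.0f} ms")
```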
However, a number of findings have been gradually leading to a major
overhaul of Treisman's early insights, such as: 1) Nakayama and
Silverman's (1986) evidence for parallel processing of certain conjunction
searches (see also McLeod, Driver, & Crisp, 1988; Theeuwes & Kooi, 1994);
2) Duncan and Humphreys's (1989) evidence for graded similarity effects
in visual search; 3) Wolfe's (1994) evidence for top-down guidance; 4)
McElree and Carrasco's (1999; see also Dosher, Han, & Lu, 2004) use of the
speed–accuracy tradeoff paradigm to show that both feature and
conjunction search involve parallel processing; 5) Palmer, Verghese, and
Pavel's (2000; see also Eckstein, 1998) use of signal detection theory to
account for conjunction search as essentially a problem of signal-to-noise
ratio rather than serial processing; and 6) Wolfe's (1998) demonstration
that RT×set-size functions compiled from dozens of visual search
experiments do not separate themselves into a bimodal distribution of
shallow (parallel) and steep (serial) slopes. Particularly compelling
arguments against the serial processing perspective in visual attention
come from evidence of “biased competition” in extrastriate visual cortex.
Here, both top-down and bottom-up interactions have been found to
mediate neural mechanisms of selective visual attention (Desimone,
1998; Desimone & Duncan, 1995). These findings support a parallel
processing perspective that claims visual attention is better characterized
as a function of partially active representations of objects simultaneously
contending for mappings onto motor output (Desimone & Duncan, 1995;
Mounts & Tomaselli, 2005; Reynolds & Desimone, 2001).
Inspired by the biased competition framework, Spivey and Dale
(2004) and Reali, Spivey, Tyler, and Terranova (2006) developed a parallel
competitive winner-take-all model of reaction times in visual search
(where the capacity limitation comes as a side-effect of the normalization
process that imposes the competition). In this model, representations of
features and of objects are all partially active in parallel and compete over
time until an object representation exceeds a fixed criterion of activation,
at which point a reaction time is recorded. Despite the processing in this
localist attractor network being entirely parallel, conjunction searches (as
well as triple conjunctions and high-similarity searches) in this model
produce linear increases in reaction time as more distractors are added to
the display.
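The sketch below conveys the flavor of such a model; it is not the published normalized recurrence network of Spivey and Dale (2004) or Reali et al. (2006), and its gain and response criterion are arbitrary assumptions. It illustrates how a fully parallel, normalization-based competition can nonetheless produce response latencies that grow as distractors are added:

```python
import numpy as np

def competition_rt(n_distractors, target_features=2, shared_features=1,
                   criterion=0.95, gain=0.5, max_iter=2000):
    """Iterations-to-criterion in a toy localist competition network.

    Each display item is a node whose bottom-up support reflects how many
    target features it matches (the conjunction target matches 2, each
    distractor matches 1). Activations compete through divisive
    normalization until one node exceeds the response criterion.
    The gain and criterion values are illustrative assumptions.
    """
    support = np.array([target_features] + [shared_features] * n_distractors,
                       dtype=float)
    support /= support.sum()
    act = np.ones_like(support) / support.size          # start uninformed
    for step in range(1, max_iter + 1):
        act += gain * support * act                      # support-weighted growth
        act /= act.sum()                                 # divisive normalization
        if act.max() > criterion:
            return step
    return max_iter

for n in (3, 7, 15, 31):
    print(f"{n:2d} distractors -> {competition_rt(n)} iterations to criterion")
```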
Further evidence against Treisman and Gelade's (1980) strict dichotomy between parallel processing and serial processing in visual search
comes from studies by Olds, Cowan, and Jolicoeur (2000a,b,c). They
presented single-feature visual search pop-out displays for very brief
periods of time, less than 100 ms in some conditions, before changing
them to conjunction search displays. Although participants did not report
experiencing a pure pop-out effect and their response times were not as
fast as with pure pop-out displays, facilitatory effects did emerge in
response times, due to the very brief period of time that the display
showed only single-feature distractors. This effect of a partial pop-out
process assisting the conjunction search process was called “search
assistance.”
Conceptually similar to Olds's work, there is a related instance of
the initial portion of stimulus delivery instigating a single-feature search,
with the remainder of stimulus delivery converting the search into a
conjunction search. Spivey et al. (2001) put participants in an auditory/
visual (A/V) concurrent condition, where observers are presented with
the conjunction search display concurrently with target identity delivered
via auditory linguistic query (e.g. “Is there a red vertical?”). Thus, while
viewing the display, participants heard “red” before they heard “vertical,”
and thus spent a few hundred milliseconds performing visual search on
the display with knowledge of only one feature. This was compared to a
baseline auditory-first control condition, where target identity was
provided via the same spoken query prior to visual display onset. In
both target-present and target-absent displays, the A/V concurrent
condition produced significantly shallower slopes of reaction time by
set-size, compared to the auditory-first control condition. If visual search
for a conjunction target was a process that could only take place with a
completed target template as the guide for the serial search of a master
map, then search could not commence until the full noun phrase had been
heard, and no improvement in search efficiency could have happened.
Clearly, the concurrent and continuous processing of spoken linguistic
input was able to quickly influence the efficiency with which the visual
search process took place, by allowing the linguistic processing of the first
adjective to already instigate some visual search based on that one feature.
To be sure, previous work has shown that, with practice, participants can
strategically carry out subset-biased search in a conjunction display
(Egeth, Virzi, & Garbart, 1984; Kaptein, Theeuwes, & van der Heijden,
1995). However, this finding of spoken linguistic input being the
immediate guide for that subset-biasing, over the course of a few hundred
milliseconds, provides intriguing evidence for language being able to
intervene in the real-time processing of the visual system, with little or no
strategic practice. For a concrete demonstration of how a localist attractor
network, relying solely on parallel competition, can mimic this process,
see Reali et al. (2006).
However, it has been shown that the magnitude of this improved
search efficiency is affected by the rate of speech, both in human data
(Gibson, Eberhard, & Bryant, 2005) and in model simulations (Reali et
al., 2006). With faster speech, the slope of reaction time by set-size is
not as flattened out in the A/V concurrent condition. To more
systematically explore subtle timing issues such as this, Chiu and
Spivey (in preparation) devised a semi-concurrent condition, where
observers hear the first adjective (e.g., “Is there a red…”) before onset
of the display and then hear the second adjective (e.g., “vertical?”)
simultaneously with the onset of the display. In this condition, by the
time the display is presented, the second adjective is being heard, and
thus at the point in time at which visual search can begin, the target is
known to be a conjunction of two features. As a result, the slope of
reaction time by set-size is steep and linear, as in the control baseline
condition. However, if a mere 200 ms of silence is spliced in between
the two adjectives, during which the display is being viewed but the
second adjective has not yet been delivered, suddenly the slope of
reaction time by set-size is shallow again. Thus, much like Olds et al.'s
(2000a,b,c) “search assistance,” just a tiny amount of time initially
performing a single-feature search (since only one adjective has been
heard so far) is enough to significantly improve search efficiency.
In the context of myriad perceptual interactions between various
sensory systems (see Spence & Driver, 2004, for extensive review),
and the broad prevalence of top-down neuronal projections throughout cortex (e.g., Kveraga, Ghuman, & Bar, 2007; Mumford, 1992), it
should not be surprising that something as “high level” as language is
nonetheless capable of subtly influencing visual perception at the time
scale of hundreds of milliseconds. However, examining the time course of
such interactive processing has, in the past, relied on manipulating the
independent variable, both in the speed–accuracy tradeoff paradigm and
in examining reaction time distributions (Dosher, 1976; Miller, 1982;
Ratcliff, 1985). Such paradigms use these manipulations to extract pieces
of meta-cognitive information about processing at different time points in
order to reconstruct the overall time course of
processing. Rather than running thousands of trials, which rely on
interrupting normal cognitive processing, what might better facilitate
future work on language-mediated vision is not just experimental
manipulations (independent variables) that pry open those milliseconds
during visuolinguistic interaction, but also experimental measures
(dependent variables) that pry open those milliseconds. This line of
research needs methodologies that allow not only a measure of how
accurate a participant was at the end of a trial or how long she took to get
there, but also a measure of what alternative response options were
considered (even just partially) along the way.
3. The dense sampling approach
With longer time scales of cognition, as in development and in
long-term task performance, dense sampling across the time course of
the phenomenon in question has already been extremely informative.
For example, for studying behavior over the course of hours, a time
series of a thousand or more reaction times can be analyzed as one
temporally-extended process of cognitive performance (rather than a
thousand sequential processes of little word recognition events), and
thus reveal statistical patterns of fractal structure in the variance that
are naturally predicted only by an interactive dynamical account of
cognition (Kello, Beltz, Holden, & Van Orden, 2007; Van Orden, Holden,
& Turvey, 2003). Similarly, when Stephen and Mirman (2010) analyzed
the overall distributions of saccade lengths over the course of many
trials in a visual search task, they found evidence consistent with a single
underlying process for both single-feature and conjunction search (e.g.,
Spivey & Dale, 2004), and evidence for lognormal and power-law
distributions, which imply self-organized interaction-dominant dynamics in visual cognition (Aks, Zelinsky, & Sprott, 2002), rather than
additive encapsulated components (Cavanagh, 1988). Thus, by treating
a series of cognitive events as a single durative process, and statistically
analyzing that process, the patterns of data reveal properties of the
phenomenon that are not well accommodated by traditional linear box-and-arrow accounts of cognition. The same general idea, of examining
multiple measures of one large process, applies to developmental
cognition as well. For studying the temporal dynamics of developmental
change over the course of months, Siegler and Lin (2009, p.87) suggest
that, “densely sampling changing competence during the period of rapid
change provides the temporal resolution needed to understand the
learning process.”
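As a schematic illustration of this kind of distributional analysis (not the actual procedure of Stephen and Mirman, 2010, who compared lognormal and power-law forms with more rigorous model-comparison methods), the following sketch fits two candidate distributions to synthetic "saccade amplitude" data and compares them by AIC:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for pooled saccade amplitudes (degrees); real analyses
# such as Stephen & Mirman (2010) use many trials of actual eye-tracking data.
rng = np.random.default_rng(0)
amplitudes = rng.lognormal(mean=1.0, sigma=0.6, size=5000)

def aic(logpdf_values, k):
    """Akaike information criterion from pointwise log-densities."""
    return 2 * k - 2 * logpdf_values.sum()

# Fit a lognormal and an exponential distribution and compare fit quality.
ln_shape, ln_loc, ln_scale = stats.lognorm.fit(amplitudes, floc=0)
ex_loc, ex_scale = stats.expon.fit(amplitudes, floc=0)

aic_lognorm = aic(stats.lognorm.logpdf(amplitudes, ln_shape, ln_loc, ln_scale), k=2)
aic_expon = aic(stats.expon.logpdf(amplitudes, ex_loc, ex_scale), k=1)

print(f"AIC lognormal:   {aic_lognorm:.1f}")
print(f"AIC exponential: {aic_expon:.1f}  (lower AIC = better fit)")
```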
Essentially, with every measurement of a change of state, there are
coarse time scales at which that change will look more or less
instantaneous, and there are fine time scales at which that change will
look gradual. For discovering the mechanisms or processes that
actually enact that change of state, it is crucial that our science operate
at a time scale that reveals the underlying gradualness of that change
(Spivey, 2007). In addition to studying temporally-extended cognitive
performance, dense sampling methods (such as eye-tracking and reachtracking) can also be applied to measuring individual behaviors to provide
a window into a cognitive process as it is happening. Interaction between
modalities, such as that occurring in language-mediated vision and in
vision-mediated language, necessitates that we not only understand the
offline resultant choice and latency of the behavior, but that we also
resolve how and when multiple sources of information interact on the
way toward producing the response behavior. These methods can
provide evidence for immediately available gradations of partially active
representations indexed in oculomotor and skeletomotor movements,
as well as evidence for competitive multimodal interaction evolving into
a response over several hundred milliseconds.
Some hints of these multimodal interactions were originally found
in redundant offline measures. For example, congruent auditory–visual
information is known to speed responses (Todd, 1912). However,
evidence for the time course of this process was not revealed until
scientists took up a dense sampling approach to the measurement of the
response movement itself. In a study by Giray and Ulrich (1993), the
nature of this redundant-signals effect was examined by utilizing reaction
time as well as the response force on both unimodal and bimodal trials.
Both an increase in force and a decrease in reaction times were found for
trials where two sources of information were combined, which the
authors use to argue for early and continuous influence of sensory
information on subsequent motor responses, as opposed to the traditional
assumption that this information has no influence after a response is
initiated. Balota and Abrams (1995) show a similar pattern of results with
word frequency: motor movements exhibit more force in response to high
frequency words than low frequency words. These findings suggest that
word frequency not only influences the time required to recognize a word,
but also influences the subsequent response dynamics, implying that the
motor system is not slavishly executing a single command delivered by a
cognitive system after it has completed its processing, but instead the
motor system cooperates with the cognitive system in real-time to co-generate a response.
A wave of experiments have emerged that measure kinematic features
of the motor movement during a response, providing insight into the
temporal dynamics of activation accumulation. Abrams and Balota (1991)
had participants make rapid limb movements in opposite directions in
order to indicate whether a string of letters was a word or not. In addition
to high lexical frequency speeding response and increasing force, they also
found effects in movement duration, peak acceleration, final velocity, and
initial velocity. These early effects are extremely important for distinguishing between models of perception and cognition that make unique
predictions regarding intermediate stages of processing, where an early
effect of velocity can mean the difference between an encapsulated
modular stage and a partially active distributed representation.
The temporal dynamics of motor output can be especially informative when the stimulus delivery itself is inherently extended in time as
well. One of the many concerns in investigating spoken language is the
temporal nature of acoustic events: sounds arrive in a linear order to
form words, sentences, and discourse. Methods such as eye-tracking can
reveal probabilistic activations for visual referents available in the
environment. The close time-locking of saccades to speech allows for
direct time-sensitive measurements of processing that can address fine-grained aspects of language comprehension (Tanenhaus et al., 1995). A
primary assumption in this “visual world paradigm” is that saccades are
readily driven by partially active representations. Thus, by collecting
samples of 2–4 fixations per second, one observes that during spoken
word recognition eye movements are made not only to target referents
but also to competitor objects with phonologically similar names
(Allopenna et al., 1998; Spivey-Knowlton, Tanenhaus, Eberhard, & Sedivy,
1995), semantically related properties (Huettig & Altmann, 2005; Yee &
Sedivy, 2006), and visually similar shapes (Huettig & Altmann, 2007).
Even more samples per second can be collected when one records the
temporal dynamics of a reaching movement, again revealing competition
between multiple potential movement destinations (Finkbeiner &
Caramazza, 2008; Tipper, Howard, & Jackson, 1997). Note, however,
that reach-movements are often initiated after a first eye movement, and
therefore this trade-off (denser sampling but later measurement in
reach-tracking) should encourage one to treat these
two methods as complementary, not adversarial. Spivey, Grosjean, and
Knoblich (2005) simplified the reach-tracking method by developing a
computer-mouse-tracking paradigm. This method of sampling full
mouse-movement trajectories at 60 Hz, and looking at their curvatures,
velocity and acceleration profiles, distribution of maximum deviations, as
well as measures of entropy or disorder, can aid in distinguishing between
alternative computational simulations of the temporal dynamics of
uncertainty resolution in spoken word recognition (Spivey, Dale,
Knoblich, & Grosjean, 2010; van der Wel, Eder, Mitchel, Walsh, &
Rosenbaum, 2009). Computer-mouse tracking has also been similarly
informative for studies of sentence processing (Farmer, Anderson, &
Spivey, 2007), semantic categorization (Dale, Kehoe, & Spivey, 2007),
color categorization (Huette & McMurray, 2010), decision making
(McKinstry, Dale, & Spivey, 2008), and social preferences (Freeman,
Ambady, Rule, & Johnson, 2008; Wojnowicz, Ferguson, Dale, & Spivey,
2009).
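As a minimal example of one such trajectory measure, the sketch below computes the maximum deviation of a fabricated 60 Hz trajectory from the straight line connecting its start and end points; the trajectory itself and its screen units are illustrative assumptions:

```python
import numpy as np

def max_deviation(x, y):
    """Maximum perpendicular deviation of a trajectory from the straight line
    connecting its start and end points -- one common mouse-tracking index of
    spatial attraction toward a competitor."""
    p0 = np.array([x[0], y[0]])
    p1 = np.array([x[-1], y[-1]])
    line = p1 - p0
    line_len = np.linalg.norm(line)
    pts = np.stack([x, y], axis=1) - p0
    # Perpendicular distance of each sample from the start-to-end line.
    cross = np.abs(pts[:, 0] * line[1] - pts[:, 1] * line[0]) / line_len
    return cross.max()

# A toy 60 Hz trajectory that starts at the bottom centre of the screen,
# bows toward a competitor on the left, and ends on the target at the upper right.
t = np.linspace(0, 1, 60)
x = -0.4 * np.sin(np.pi * t) + t        # leftward bow plus rightward drift
y = t                                    # steady upward movement
print(f"maximum deviation: {max_deviation(x, y):.3f} (screen units)")
```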
The reason that dense sampling of motor output during a response
is so informative is that the motor system does not patiently wait until
a cognitive process has reached completion before it begins
generating a movement associated with the results of that cognitive
process. Rather, the multifarious pattern of neural activity associated
with a given cognitive process, as it evolves over the course of several
hundred milliseconds, unavoidably influences the initial generation of
movement plans in oculomotor cortex (Gold & Shadlen, 2000) and in
primary motor cortex (Cisek & Kalaska, 2005). The presence of
reciprocal neural projections between frontal cortex and these motor
areas suggests that they may produce an undivided process of co-evolution during those several hundred milliseconds whereby
cognition and action are not quite separable (e.g., Barsalou, 2008;
Chemero, 2009; Hommel, 2004; Nazir et al., 2007; Pulvermüller,
2005; Spivey, 2007).
4. Vision influences language
The area of vision-mediated language is where the dense sampling
approach (especially eye-tracking and reach-tracking) has made a
great deal of progress lately. However, the most well-known textbook
example of vision affecting language (the McGurk Effect) was discovered
well before the dense sampling approach became widely used. In the
McGurk Effect (McGurk & MacDonald, 1976), the participant sees a video
of a face repeating the syllable “ga,” while hearing the syllable “ba”
synchronized with the face's mouth movement, but reports perceiving the
syllable “da.” Although the auditory input clearly supports a percept of “ba,”
the visual input of the lips remaining open is incongruous with this
perception. The best fit for the auditory and visual stimuli then is a percept
of “da,” because it has substantial phonetic similarity with the sound being
heard and a visual compatibility with the lip movement being seen.
Extending this work by digitally altering faces and sound files along a bah-dah continuum, Massaro (1999) provided evidence that the visual
perception of the speaker's mouth has an immediate and graded effect
on which of the two phonemes was reported.
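One way such graded audiovisual integration has been formalized in Massaro's research program is the multiplicative integration rule of his Fuzzy Logical Model of Perception; the sketch below applies that rule to made-up support values and should be read as an illustration of graded integration, not as a fit to the 1999 data:

```python
def flmp_p_da(audio_support, visual_support):
    """Multiplicative integration of graded audio and visual support for /da/,
    in the spirit of Massaro's Fuzzy Logical Model of Perception (FLMP).
    The support values (0-1) are made-up illustrations, not measured data."""
    num = audio_support * visual_support
    denom = num + (1 - audio_support) * (1 - visual_support)
    return num / denom

# Audio weakly favoring /ba/ (support 0.3 for /da/) combined with
# increasingly /da/-like visual information.
for v in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"visual support {v:.1f} -> P(/da/) = {flmp_p_da(0.3, v):.2f}")
```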
Another intriguing example of vision-mediated language is the
process of silent lip-reading, during which the auditory cortex of a
skilled lip-reader is active (Calvert et al., 1997). Even though there is
no actual auditory information arriving at the senses, the skill of lip-reading appears to be closely intertwined with auditory processing,
recruiting those linguistic representations in order to understand what
was said. While such demonstrations of interactions between language
and vision are extremely informative, eye-tracking and mouse-tracking
make it possible to develop a more detailed mechanistic understanding of
the temporal dynamics involved in how visual information immediately
impacts language processing, especially lexical and sentence processing.
There are many reasons for looking beyond outcome-based measures,
including that outcome-based measures often do not provide a complete
explanation. Take phonemic categorical perception for example. When
using outcome-based measures, distinguishing /p/ from /b/ appears
discretely categorical: if the voice onset time (VOT) is between 0 and
30 ms, then the sound is perceived as “bah,” and if VOT is between 30 and
60 ms, the sound is perceived as “pah” (Liberman, Harris, Kinney, & Lane,
1961). However, reaction times are significantly longer when the VOT is
closer to this 30 ms boundary than when it is farther away (Pisoni & Tash,
1974). Further, by tracking eye movements in addition to recording
reaction times, we can see that those longer reaction times close to the
phonetic boundary are the result of competition between partially active
representations (McMurray & Spivey, 1999). When the VOT is close to
30 ms, vacillatory eye movements between the two response options are
observed, indicating that both options are temporarily being considered.
Eye-tracking also reveals that this competition lingers long enough to
impact the processing of an entire word, like “pear” and “bear”
(McMurray, Tanenhaus, & Aslin, 2002). These data suggest that dense
sampling methods like eye- and reach-tracking may be extremely useful
additions to outcome-based and reaction-time measures (Abrams & Balota, 1991).
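For illustration only, the following sketch couples an assumed logistic mapping from VOT to category probability with an RT penalty proportional to the momentary competition between the two category representations, reproducing the qualitative pattern of slower responses near the 30 ms boundary (all numerical values are assumptions, not fitted parameters):

```python
import math

def p_voiceless(vot_ms, boundary=30.0, slope=0.4):
    """Illustrative logistic mapping from voice onset time to the probability
    of a voiceless (/p/) categorization; boundary and slope are assumed values."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary)))

def predicted_rt(vot_ms, base_rt=450.0, competition_cost=250.0):
    """Toy RT: slower when the two category representations are most nearly
    equally active (i.e., near the boundary), as in Pisoni and Tash (1974)."""
    p = p_voiceless(vot_ms)
    uncertainty = 4.0 * p * (1.0 - p)   # 0 at the extremes, 1 at the boundary
    return base_rt + competition_cost * uncertainty

for vot in (0, 10, 20, 30, 40, 50, 60):
    print(f"VOT {vot:2d} ms: P(/p/) = {p_voiceless(vot):.2f}, "
          f"predicted RT ~ {predicted_rt(vot):.0f} ms")
```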
Since the advent of the visual world paradigm (Tanenhaus et al.,
1995; see also Cooper, 1974), eye-tracking has been used extensively
to investigate the real-time interactions of visual information and
language comprehension. Earlier reaction time data suggested that as
a word unfolds over time, it is ambiguous with other words that share
similar sounding onsets (Marslen-Wilson, 1987; Zwitserlood, 1989).
This cohort theory suggests that for a brief period after the onset of a
word, all words beginning with the same phonemic input, called the
cohort, show partial activation. As more phonemic input is received, the
target is more unambiguously identified, with other words dropping out
of the cohort. However, the real-time processing of these cohort effects
and some of the theory's predictions remained untestable with reaction
time data.
Monitoring eye movements, Allopenna et al. (1998; see also
Spivey-Knowlton, 1996) presented subjects with visual displays
containing four items: the target (e.g., a beaker), an onset-competitor
(a beetle), a rhyme competitor (a speaker), and an unrelated referent
(a carriage). Eye movements were recorded as participants heard and
responded to instructions, like “Pick up the beaker.” The pattern of eye
movements revealed that during the latter half of the spoken target
word, the probabilities of fixating the target and the cohort competitor both
gradually increased equally. Around the offset of the spoken target
word, the proportion of looks to the target began to rise precipitously,
and the proportion of looks to the cohort competitor began to decline
soon thereafter. Thus, early in the auditory stimulus, before the target
has been uniquely identified, competition between the partially active
representations manifests itself in the eye movement patterns. Additionally, these data revealed a greater proportion of fixations to the rhyme
competitor than to the neutral distractor object, providing a clear
demonstration of the rhyme competitor effects naturally predicted by
McClelland and Elman's (1986) interactive neural network simulation of
speech perception.
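To make the dependent measure concrete, the sketch below computes fixation-proportion curves from hypothetical sample-by-sample eye-tracking data; the data format, sampling rate, and object codes are our own illustrative assumptions rather than a standard of the visual world paradigm:

```python
import numpy as np

def fixation_proportions(samples, labels, bin_ms=50, sample_rate_hz=250):
    """Proportion of trials fixating each object type in successive time bins.

    `samples` is a (trials x samples) array of integer codes indicating which
    object is fixated at each sample; `labels` maps codes to object names.
    This data layout is assumed for illustration, not a standard file format.
    """
    samples_per_bin = int(sample_rate_hz * bin_ms / 1000)
    n_trials, n_samples = samples.shape
    n_bins = n_samples // samples_per_bin
    props = {name: [] for name in labels.values()}
    for b in range(n_bins):
        window = samples[:, b * samples_per_bin:(b + 1) * samples_per_bin]
        for code, name in labels.items():
            props[name].append(np.mean(window == code))
    return props

# Tiny fabricated example: 0 = target, 1 = cohort, 2 = rhyme, 3 = unrelated.
rng = np.random.default_rng(1)
fake = rng.integers(0, 4, size=(20, 500))
curves = fixation_proportions(fake, {0: "target", 1: "cohort",
                                     2: "rhyme", 3: "unrelated"})
print({k: [round(v, 2) for v in vals[:4]] for k, vals in curves.items()})
```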
More recently, in the field of spoken word recognition, the graded
and continuous trajectories of computer-mouse tracking have been
used to complement more discrete fixation-based data from saccadic
eye movement patterns. The arm movements of mouse-tracking are
relatively continuous and can be smoothly redirected mid-flight, allowing
graded effects of dynamic competition between multiple partially active
representations to be concretely visible in individual trials (Spivey et al.,
2005). In the first demonstration of this, participants were presented with
an auditory stimulus such as “Click on the candy” and asked to use a
computer mouse to click on the corresponding picture on the computer
screen. The visual scene contained either two pictures of phonologically
similar items (i.e., a piece of candy and a candle) or two pictures of
phonologically dissimilar items (i.e., a ladle and a candle). Streaming x, y
coordinates were recorded, and the competition between the partially
activated lexical representations was revealed in the smooth curvature of
the hand-movement trajectories. Similar to the work of Allopenna et al.
(1998), when items in the visual scene were phonologically similar, these
trajectories showed graded spatial attraction to the competing object.
This graded spatial attraction provides evidence both of the continuous
uptake of (and interaction between) visual and linguistic information,
and of the dynamic competition between partially active representations.
Just as words are temporarily ambiguous across time, so are many
sentences. The dense sampling methods of eye- and mouse-tracking
in visual contexts have been used extensively to elucidate the processing
of these sentences. Many early experiments in temporarily ambiguous
sentence processing looked at sentences in isolation, with the data
providing a rather modular view of their processing. Take the sentence,
“Since Jay always jogs a mile doesn't seem far.” In this sentence, inflated
reading times were observed when readers encountered the disambiguating information "doesn't" (Frazier & Rayner, 1982). Researchers
explained this increased reading time by appealing to syntactic heuristics
in the syntactic parsing mechanism, thus arguing for the parser's
autonomy from other information sources, such as semantics and visual
information.
However, processing syntactically ambiguous sentences in conjunction with a visual scene produces drastically different findings (Tanenhaus
et al., 1995). When adults hear a temporarily ambiguous sentence such as
“Put the apple on the towel in the box” while viewing a scene containing
an apple, a towel (incorrect goal location), a box (correct goal location),
and a flower (neutral unrelated referent), they experience the garden-path effect, temporarily interpreting “on the towel” as the destination of
the putting event, only to later realize that this parse is ultimately
incorrect. In reading experiments, the garden-path effect manifests itself
as inflated reading times at the point of disambiguation; in visual world
eye-tracking it manifests itself as an increased proportion of looks to the
incorrect destination (in this case, the towel) when compared to trials
containing unambiguous sentences like “Put the apple that's on the towel
in the box.” Following previous reading experiments that used discourse
contexts to modulate syntactic garden-path effects (e.g., Altmann &
Steedman, 1988; Spivey-Knowlton, Trueswell, & Tanenhaus, 1993),
Tanenhaus et al. (1995) used a visual context that contained two similar
referent objects for “the apple.” Therefore, upon hearing “Put the apple,”
the listener temporarily did not know which of the two apples was being
referred to. This referential ambiguity caused the upcoming prepositional
phrase “on the towel” to be parsed as attached to the noun phrase, thus
avoiding the garden-path. Indeed, when the visual display contained a
second apple, participants no longer looked at the incorrect destination
(garden-path) location with any significant frequency. Thus, a visual
context was shown to mediate the syntactic processing of a sentence on
the time scale of hundreds of milliseconds (Chambers, Tanenhaus, &
Magnuson, 2004; Snedeker & Trueswell, 2004; Spivey, Tanenhaus,
Eberhard, & Sedivy, 2002; Tanenhaus et al., 1995; Trueswell, Sekerina,
Hill, & Logrip, 1999).
Eye-tracking has been used to explore the time course of sentence
production as well as sentence comprehension. These experiments
demonstrate the link between the production of linguistic structure
and visual attention in a visual world paradigm. In one experiment, a
participant's eye movements allowed her to extract information from
a visual scene based on comprehension of the depicted event, rather
than just on visual saliency (Griffin & Bock, 2000). Further, these eye
movements around the visual scene anticipated the order in which
agents and patients were mentioned in sentence production tasks.
Hence, visual attention, as driven by eye movements, predicted the
ordering of elements in a subsequently produced sentence (Griffin &
Bock, 2000). There is also evidence showing that directly manipulating visual attention in such a visual world paradigm elicits differences
in the way subsequently produced sentences are structured (Gleitman, January, Nappa, & Trueswell, 2007; Tomlin, 1997). These data
suggest that manipulating shifts of visual attention can influence word
order choices, predicting both passive and active sentence constructions.
Tracking the continuous motor output of mouse-movement trajectories further explicates the dynamic competition between partially active
syntactic representations in the visual world paradigm. Rather than
moving actual 3-D objects in a visual scene, participants used a computer
mouse to move objects around a computer screen in response to
sentences like, “Put the apple on the towel in the box” (Farmer et al.,
2007). As in earlier results, there was no evidence of a garden-path effect
in the two-referent visual context. However, the continuous trajectories
provided further data about the competition process between partially
active syntactic representations in the one-referent context. Across
individual computer-mouse trajectories, the garden-path effect manifested itself as graded spatial attraction toward the competing goal
destination. This smooth spatial attraction toward the competing goal
suggests the visual and linguistic inputs were continuously and smoothly
interacting in a competitive process, such that the garden-path parse was
partially active and thus the incorrect destination location was garnering
some degree of attention during visuomotor processing.
These continuous mouse-tracking results provide support for
continuous parallel processing in cognition, in a similar way to how
eye-tracking results provide such support (Magnuson, 2005). However, it has recently been suggested that the tell-tale curvatures in
mouse-movement trajectories of this type can actually be explained,
in principle, by a model in which cognitive processing is discrete and
serial (postulating one symbolic representation at a time), and the
motor output is produced by a continuous parallel processing system
(van der Wel et al., 2009). In this model, two motor movements
corresponding to a strategic upward movement and then a perceptual
decision movement are asynchronously averaged together to produce a
smoothly curved motor output (Henis & Flash, 1995). This distinction
between perceptual processing and action planning provides an existence
proof in which motor output may be continuous, but the underlying
perceptual/cognitive decisions are serial. This model then creates
problems for theories of embodied cognition that propose that perception
and cognition are dynamically coupled with action.
However, it seems unlikely that one neural system (cognition)
would behave in one way (i.e., using discrete representations in
sequence) and then feed into a second system (action) that behaves in
a qualitatively different way (i.e., using continuous representations in
parallel). In their reply to van der Wel et al. (2009), Spivey and
colleagues used the same equations that van der Wel et al. (2009)
used for their model, adding a mechanism of dynamic competition
between the multiple simultaneous cognitive representations that
drive those motor commands (Spivey et al., 2010). As there is nothing
uniquely serial about the equations used by Henis and Flash (1995), the
results of Spivey et al.'s model provide evidence that the perceptual and
motor decisions can both be made in a continuous parallel fashion. For
example, cognitive representations for two response options initiate
motor commands for both potential reach locations (Cisek & Kalaska,
2005), and the averaging weights for those two motor commands start
out equal. This instigates motor output that is initially aimed at the
midpoint between the two potential reach locations. As one cognitive
representation receives increasing perceptual support, its weight ramps
up, while the weight for the other cognitive representation ramps down.
These changing weights are used to produce a dynamically averaged
motor movement that smoothly curves in a manner identical to the
human data. Hence, a dynamic and continuous cognitive/perceptual
discrimination task flows smoothly into a similarly dynamic and
continuous motor output.
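The following sketch captures this dynamic-averaging idea in its simplest form; it is not the exact formulation of Henis and Flash (1995) or Spivey et al. (2010), and its gain and ramping parameters are arbitrary assumptions, but it shows how continuously re-weighted motor commands yield a smoothly curving movement that initially heads toward the midpoint between the two response locations:

```python
import numpy as np

def curved_trajectory(n_steps=100, gain=0.08, ramp_rate=0.05):
    """Minimal sketch of dynamic averaging of two motor commands.

    Two response locations both attract the effector; the weight on the
    eventually-chosen target ramps up as (simulated) perceptual support
    accumulates, and each step moves toward the weighted average of the
    two goals. All parameters are illustrative assumptions.
    """
    target = np.array([1.0, 1.0])        # chosen response location
    competitor = np.array([-1.0, 1.0])   # competing response location
    pos = np.array([0.0, 0.0])           # start at the bottom centre
    w = 0.5                              # weights start out equal
    path = [pos.copy()]
    for _ in range(n_steps):
        w = min(1.0, w + ramp_rate)      # perceptual support ramps up
        goal = w * target + (1.0 - w) * competitor
        pos = pos + gain * (goal - pos)  # step toward the blended goal
        path.append(pos.copy())
    return np.array(path)

path = curved_trajectory()
print("start:", path[0], " early drift:", path[10].round(2),
      " end:", path[-1].round(2))
```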
This section has reviewed a number of ways in which language is
impacted by vision, and suggests that looking beyond reaction time
data in such tasks is crucial when addressing how these information
sources interact in real-time. Monitoring eye- and reach-movements
provides a window into how partially active lexical and syntactic
representations are activated and compete over time. Similarly, these
methods also allow researchers to investigate the way that objects in a
visual scene can immediately guide language processing, stacking the
competition process in favor of the interpretations and actions that
the situation most readily affords.
5. Conclusion
How is it that the possible actions afforded by a visuolinguistic
situation can influence the way that situation is perceived in the first
place? We are accustomed to thinking of visual perception and language
comprehension as taking place before action planning, but it is becoming
clear that this is an oversimplification. As J. J. Gibson (1979) encouraged
decades ago, we need to recognize that sensory input flows continuously
into our perceptual subsystems, and therefore there is no real break in the
stimulus flow during which perception reaches a point of culmination and
then action planning could finally become operative to receive the output
of that perceptual culmination. Rather, motor output itself is often
continuously blending over time some weighted combination of multiple
motor commands (Gold & Shadlen, 2000; Tipper, Houghton, & Howard,
2000), while multiple alternative action plans are being simultaneously
prepared (Cisek & Kalaska, 2005), and feedback projections from these
frontal and motor areas of the brain are continuously influencing the
activation patterns in primary sensory areas of the brain (Kveraga et al.,
2007). This is how the temporally-extended process of perceptual
recognition (in both language and vision) co-evolves in concert with the
temporally-extended process of action planning. And this is precisely why
dense sampling of motor output is so richly informative about real-time
cognitive processes in language and vision. The patterns of data in
continuous motor output are tightly correlated with the patterns of data in
continuous cognitive processing.
This temporal continuity in signal processing, non-modularity in
information transmission, and the concomitant representational
coextensiveness among language, vision, and action, have the logical
consequence that the embodiment of cognition cannot help but be real
(Barsalou, 2008; Spivey, 2007). Every example of cognitive processing
– whether it is the recognition of a visual object (Tucker & Ellis, 2001),
the comprehension of a spoken sentence (Anderson & Spivey, 2009),
or the contemplation of an abstract concept (Richardson, Spivey,
Barsalou, & McRae, 2003) – is unavoidably shaped and nuanced by the
sensorimotor constraints of the organism coupled with its environment (Gibson, 1979; Turvey, 2007). The question remains, however, is
the shape and nuance that a sensorimotor context bestows upon a
cognitive representation fundamental to the functioning of that representation or is it merely an epiphenomenal decoration loosely attached to
that representation?
In stark contrast to fundamental embodiment claims (e.g., Barsalou,
2008; Gallese & Lakoff, 2005; Glenberg, 2007), Mahon and Caramazza
(2008) have recently suggested that potentiation of sensorimotor
properties occurs after access to a conceptual symbol has taken place,
and therefore amounts to little more than spreading activation of
associated properties. Mahon and Caramazza argue that the cognitive
operations carried out by a sequence of accessed symbolic concepts
happen in exactly the same way they always would, with or without this
spreading of activation to vestigial sensorimotor properties that subtly
influence reaction times and eye movements. Mahon and Caramazza's
proposal responsibly acknowledges the existence of a large and growing
literature base of embodiment-related findings in vision, language,
memory, and conceptual processing. To address that work, their model
postulates a centralized module for conceptual processing that does the
real work of cognition, and then activation spreads uni-directionally to
sensory and motor systems. This account simultaneously preserves the
autonomy of symbolic conceptual processing and accommodates the
majority of the experimental findings in the embodied cognition
literature.
However, there exists a small handful of embodiment-related findings
that don't fit nicely into Mahon and Caramazza's (2008) proposal.
Although many results in embodied cognition are describable as a
conceptual entity becoming active and then spreading its activation in
the direction of some associated sensory or motor processes, a few studies
have shown this spreading of activation in the other direction. That is, by
perturbing a sensory or motor process during a cognitive task, one can
alter the way that cognitive process takes place. For example, when people
hold a pencil sideways in their mouth (and thus activate some of the same
muscles used for smiling) they give more positive affective judgments of
humorous cartoons (Strack, Martin, & Stepper, 1988). Similarly, when
people's hands are immobilized, their speech production exhibits more
disfluencies (Rauscher, Krauss, & Chen, 1996). More recently, Pulvermüller, Hauk, Ilmoniemi, and Nikulin (2005) used mild transcranial magnetic
stimulation (TMS) to induce neural activation in arm- and leg-related
regions of primary motor cortex while participants carried out a lexical
decision task. When mild TMS was applied to the leg region of primary
motor cortex, participants responded faster to leg-related action verbs
(e.g., kick) than to arm-related action verbs (e.g., throw), whereas when
mild TMS was applied to the arm region of primary motor cortex, this
pattern was reversed. Meteyard, Zokaei, Bahrami, and Vigliocco (2008)
show a similar effect of visual motion perception influencing lexical
decision. While viewing a subthreshold noisy motion stimulus that drifted
upward or downward, participants exhibited slower reaction times to
words that denoted directions of motion that were incongruent with the
directionality of the irrelevant subthreshold motion stimulus. Thus, to the
degree that the ubiquitous tried-and-tested lexical decision task taps a
cognitive process, these two recent findings show that motor and sensory
properties exert their influence in the other direction, changing the way a
string of letters is evaluated as being a word or not a word.
Such data are difficult to account for in a more traditional view of
cognition, which separates the lexical decision task into two stages: one
corresponding to the cognitive/perceptual task and one corresponding to
the motor planning task. Specifically, the data of Pulvermüller et al. (2005) demonstrate an interaction between which region of primary motor cortex received TMS and the semantic content of the word. If the lexical decision task consisted of two separate stages and TMS influenced only the motor stage, then one would not expect differential effects for responses to arm- and leg-related action words, because the motor stage would not
know the semantics of the word being responded to. However, the data
show exactly such differential effects, and hence are problematic for
accounts like those of Mahon and Caramazza (2008). It may be the case
that, in order to protect the autonomy of their conceptual processing
module, Mahon and Caramazza (2008) would choose to place the lexical
decision task outside the domain of the kind of cognitive processing they
find valuable, but such a move would surely slide into the margins of
science a massive amount of literature that has been used in the past to
bolster the very foundations of their own classical cognitive science
framework.
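The logic of this argument can also be expressed as a simple additive-factors check on hypothetical mean reaction times: if TMS perturbs only a separate motor stage, its cost is added equally to every word type and the site-by-word-type interaction term is zero, whereas the crossover reported by Pulvermüller et al. (2005) corresponds to a non-zero interaction. The Python sketch below uses invented numbers purely to illustrate the contrast between the two predictions.

import numpy as np

def interaction(rt):
    """2 x 2 interaction contrast for mean RTs indexed as
    rt[tms_site][word_type], where 0 = arm and 1 = leg."""
    return (rt[0][0] - rt[0][1]) - (rt[1][0] - rt[1][1])

# Hypothetical means (ms) under a strict two-stage account: TMS to the leg
# region slows every response by the same amount, whatever the word means.
two_stage = np.array([[600.0, 600.0],    # arm-site TMS: arm words, leg words
                      [620.0, 620.0]])   # leg-site TMS: arm words, leg words

# Hypothetical crossover of the kind Pulvermüller et al. (2005) report:
# responses are faster when word meaning matches the stimulated region.
interactive = np.array([[590.0, 610.0],
                        [610.0, 590.0]])

print("two-stage interaction (ms):", interaction(two_stage))      # 0.0
print("interactive interaction (ms):", interaction(interactive))  # -40.0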
To settle this debate, more empirical research is clearly needed on both sides. The field also needs computational simulations that better explicate the various theoretical claims and make clear quantitative predictions for new experiments (e.g., Anderson, Huette, Matlock, & Spivey, 2010; Howell, Jankowicz, & Becker, 2005); a minimal sketch of one such simulation appears at the end of this section.
Based on the evidence reviewed here, in language-mediated vision and
vision-mediated language, we think that the balance is leaning away from
classical cognitive science and toward dynamic embodied accounts of
cognition. Rather than cognition being a “box in the head” in which logical
rules are applied to discrete symbols in a fashion unlike anywhere else in
the brain, the cognitive processes of language, vision, attention and
memory may be best described as emergent properties that self-organize
among the continuous neural interactions between sensory, motor, and
association cortices of a brain that is embodied with sensors and effectors
that are intimately coupled with the environment.
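As one pointer to what such simulations can look like, the following Python sketch implements a localist competition loop in the spirit of the normalized recurrence algorithm described in Spivey-Knowlton (1996) and Spivey (2007): each probabilistic constraint (here one "visual" and one "linguistic" bias over two candidate interpretations) is normalized, integrated, and then fed back a share of the integrated activation, so the system settles gradually rather than selecting a discrete symbol in a single step. The constraint vectors, weights, and settling criterion are invented for illustration and are not the parameters of any published model.

import numpy as np

def normalized_recurrence(constraints, weights, criterion=0.95, max_cycles=100):
    """Minimal localist competition loop: normalize each constraint,
    integrate them as a weighted average, feed the integrated activation
    back into each constraint, and repeat until one alternative's
    integrated activation exceeds the criterion."""
    c = [np.asarray(v, dtype=float) for v in constraints]
    for cycle in range(1, max_cycles + 1):
        c = [v / v.sum() for v in c]                          # normalize
        integrated = sum(w * v for w, v in zip(weights, c))   # integrate
        # Feedback: each alternative within each constraint is boosted in
        # proportion to that alternative's integrated activation.
        c = [v + v * integrated for v in c]
        if integrated.max() >= criterion:
            break
    return integrated, cycle

# Two candidate interpretations: the visual constraint weakly favours the
# first, while the linguistic constraint more strongly favours the second.
visual = [0.55, 0.45]
linguistic = [0.30, 0.70]
activation, cycles = normalized_recurrence([visual, linguistic], weights=[0.5, 0.5])
print(f"settled after {cycles} cycles, activations = {np.round(activation, 3)}")

With these inputs the second interpretation gradually wins, and the number of cycles needed to settle offers a natural analogue of the graded reaction-time and trajectory measures discussed above.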
References
Abrams, R. A., & Balota, D. A. (1991). Mental chronometry: Beyond reaction time.
Psychological Science, 2, 153−157.
Aks, D., Zelinsky, G., & Sprott, J. (2002). Memory across eye-movements: 1/f dynamic in
visual search. Nonlinear Dynamics, Psychology, and Life Sciences, 6, 1−25.
Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of
spoken word recognition: Evidence for continuous mapping models. Journal of
Memory and Language, 38, 419−439.
Altmann, G., & Steedman, M. (1988). Interaction with context during human sentence
processing. Cognition, 30, 191−238.
Anderson, S. E., Huette, S., Matlock, T., & Spivey, M. J. (2010). On the temporal dynamics
of negated perceptual simulations. In F. Parrill, M. Turner, & V. Tobin (Eds.),
Meaning, form, and body (pp. 1−20). Stanford: CSLI Publications.
Anderson, S. E., & Spivey, M. J. (2009). The enactment of language: Decades of
interactions between linguistic and motor processes. Language and Cognition, 1,
87−111.
Balota, D. A., & Abrams, R. A. (1995). Mental chronometry: Beyond onset latencies in the
lexical decision task. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 21, 1289−1302.
Barrett, H. C., & Kurzban, R. (2006). Modularity in cognition: Framing the debate.
Psychological Review, 113, 628−647.
Barsalou, L. W. (2008). Grounded cognition. Annual Review of Psychology, 59, 617−645.
Bechtel, W. (2003). Modules, brain parts, and evolutionary psychology. In S. Scher, & F.
Rauscher (Eds.), Evolutionary psychology: Alternative approaches (pp. 211−227).
Dordrecht, Netherlands: Kluwer Academic Publishers.
Bohm, D. (1980). Wholeness and the implicate order. London: Routledge & Kegan Paul.
Calvert, G., Bullmore, E., Brammer, M., Campbell, R., Williams, S., McGuire, P., et al.
(1997). Activation of auditory cortex during silent lipreading. Science, 276,
593−596.
Cavanagh, P. (1988). Pathways in early vision. In Z. Pylyshyn (Ed.), Computational
processes in human vision: An interdisciplinary perspective (pp. 254−289). Norwood,
NJ: Ablex Publishing.
Chambers, C., Tanenhaus, M., & Magnuson, J. (2004). Actions and affordances in
syntactic ambiguity resolution. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 30, 687−696.
Chemero, A. P. (2009). Radical embodied cognition. Cambridge, MA: MIT Press.
Chiu, E. M., & Spivey, M. J. (in preparation). Timing of speech and display affects the
linguistic mediation of visual search.
Cisek, P., & Kalaska, J. F. (2005). Neural correlates of reaching decisions in dorsal
premotor cortex: Specification of multiple direction choices and final selection of
action. Neuron, 45, 801−814.
Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language: A
new methodology for the real-time investigation of speech perception, memory,
and language processing. Cognitive Psychology, 6, 84−107.
Dahan, D. (2010). The time course of interpretation in speech comprehension. Current
Directions in Psychological Science, 19, 121−126.
Dale, R., Kehoe, C., & Spivey, M. J. (2007). Graded motor responses in the time course of
categorizing atypical exemplars. Memory and Cognition, 35, 15−28.
Desimone, R. (1998). Visual attention mediated by biased competition in extrastriate
visual cortex. Philosophical Transactions of the Royal Society B: Biological Sciences,
353, 1245.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention.
Annual Review of Neuroscience, 18, 193−222.
Dosher, B. (1976). Retrieval of sentences from memory: A speed accuracy tradeoff
study. Cognitive Psychology, 8, 291−310.
Dosher, B., Han, S., & Lu, Z. (2004). Parallel processing in visual search asymmetry.
Journal of Experimental Psychology: Human Perception and Performance, 30, 3−27.
Driver, J., & Spence, C. (1998). Crossmodal attention. Current Opinion in Neurobiology, 8,
245−253.
Duncan, J., & Humphreys, G. (1989). Visual search and stimulus similarity. Psychological
Review, 96, 433−458.
Eckstein, M. (1998). The lower visual search efficiency for conjunctions is due to noise
and not serial attentional processing. Psychological Science, 9, 111−118.
Egeth, H., Virzi, R., & Garbart, H. (1984). Searching for conjunctively defined
targets. Journal of Experimental Psychology: Human Perception and Performance,
10, 32−39.
Emberson, L., Weiss, R., Barbosa, A., Vatikiotis-Bateson, E., & Spivey, M. (2008). Crossed
hands curve saccades: Multisensory dynamics in saccade trajectories. Proceedings
of the 29th Annual Conference of the Cognitive Science Society.
Farah, M. (1994). Neuropsychological inference with an interactive brain: A critique of
the “locality” assumption. Behavioral and Brain Sciences, 17, 43−104.
Farmer, T. A., Anderson, S. E., & Spivey, M. J. (2007). Gradiency and visual context in
syntactic garden-paths. Journal of Memory & Language, 57, 570−595.
Ferreira, F., Bailey, K., & Ferraro, V. (2002). Good-enough representations in language
comprehension. Current Directions in Psychological Science, 11, 11−15.
Finkbeiner, M., & Caramazza, A. (2008). Modulating the masked congruence priming
effect with the hands and the mouth. Journal of Experimental Psychology: Human
Perception and Performance, 34, 894−918.
Fodor, J. (1983). The modularity of mind: An essay on faculty psychology. Cambridge, MA:
MIT Press.
Frazier, L., & Rayner, K. (1982). Making and correcting errors during sentence
comprehension: Eye movements in the analysis of structurally ambiguous
sentences. Cognitive Psychology, 14, 178−210.
Freeman, J. B., Ambady, N., Rule, N. O., & Johnson, K. L. (2008). Will a category cue attract
you? Motor output reveals dynamic competition across person construal. Journal of
Experimental Psychology: General, 137, 673−690.
Gallant, J., Connor, C., & Van Essen, D. (1998). Neural activity in areas V1, V2 and V4
during free viewing of natural scenes compared to controlled viewing. Neuroreport,
9, 85−89.
Gallese, V., & Lakoff, G. (2005). The brain's concepts: The role of the sensory-motor
system in reason and language. Cognitive Neuropsychology, 22, 455−479.
Gibson, B. S., Eberhard, K. M., & Bryant, T. A. (2005). Linguistically mediated visual
search: The critical role of speech rate. Psychonomic Bulletin & Review, 12, 276.
Gibson, J. (1979). The ecological approach to visual perception. Boston: Houghton Mifflin.
Giray, M., & Ulrich, R. (1993). Motor coactivation revealed by response force in divided
and focused attention. Journal of Experimental Psychology: Human Perception and
Performance, 19, 1278−1291.
Gleitman, L. R., January, D., Nappa, R., & Trueswell, J. C. (2007). On the give and take
between event apprehension and utterance formulation. Journal of Memory and
Language, 57, 544−569.
Glenberg, A. M. (2007). Language and action: Creating sensible combinations of ideas.
In G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 361−370). Oxford,
UK: Oxford University Press.
Gold, J. I., & Shadlen, M. N. (2000). Representation of a perceptual decision in
developing oculomotor commands. Nature, 404, 390−394.
Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations
through category learning. Cognition, 78, 27−43.
Griffin, Z., & Bock, K. (2000). What the eyes say about speaking. Psychological Science,
11, 274−279.
Grosof, D., Shapley, R., & Hawken, M. (1993). Macaque V1 neurons can signal “illusory”
contours. Nature, 365, 550−552.
Henis, E. A., & Flash, T. (1995). Mechanisms underlying the generation of averaged
modified trajectories. Biological Cybernetics, 72, 407−419.
Hommel, B. (2004). Event files: Feature binding in and across perception and action.
Trends in Cognitive Sciences, 8, 494−500.
Howell, S. R., Jankowicz, D., & Becker, S. (2005). A model of grounded language
acquisition: Sensorimotor features improve lexical and grammatical learning.
Journal of Memory and Language, 53, 258−276.
Huette, S., & McMurray, B. (2010). Continuous dynamics of color categorization.
Psychonomic Bulletin & Review, 17, 348−354.
Huettig, F., & Altmann, G. T. M. (2005). Word meaning and the control of eye fixation:
Semantic competitor effects and the visual world paradigm. Cognition, 96, 23−32.
Huettig, F., & Altmann, G. T. M. (2007). Visual-shape competition and the control of eye
fixation during the processing of unambiguous and ambiguous words. Visual
Cognition, 15, 985−1018.
Itti, L., & Koch, C. (2001). Computational modeling of visual attention. Nature Reviews
Neuroscience, 2, 194−203.
Kaptein, N. A., Theeuwes, J., & van der Heijden, A. H. C. (1995). Search for a
conjunctively defined target can be selectively limited to a color-defined subset of
elements. Journal of Experimental Psychology: Human Perception & Performance, 21,
1053−1069.
Kello, C. T., Beltz, B. C., Holden, J. G., & Van Orden, G. C. (2007). The emergent
coordination of cognitive function. Journal of Experimental Psychology: General, 136,
551−568.
Kveraga, K., Ghuman, A. S., & Bar, M. (2007). Top-down predictions in the cognitive
brain. Brain and Cognition, 65, 145−168.
Lalanne, C., & Lorenceau, J. (2004). Crossmodal integration for perception and action.
Journal of Physiology — Paris, 98, 265−279.
Lee, T. S., & Nguyen, M. (2001). Dynamics of subjective contour formation in the early
visual cortex. Proceedings of the National Academy of Science, 98, 1907−1911.
Liberman, A., Harris, K., Kinney, J., & Lane, H. (1961). The discrimination of relative
onset-time of the components of certain speech and nonspeech patterns. Journal of
Experimental Psychology, 61, 379−388.
Lupyan, G. (2008). The conceptual grouping effect: Categories matter (and named
categories matter more). Cognition, 108, 566−577.
Lupyan, G., & Spivey, M. J. (2008). Perceptual processing is facilitated by ascribing
meaning to novel stimuli. Current Biology, 18, 410−412.
Lupyan, G., & Spivey, M. J. (2010). Making the invisible visible: Verbal but not visual
cues enhance visual detection. PLoS ONE, 5(7), e11452.
Macaluso, E., Frith, C., & Driver, J. (2000). Modulation of human visual cortex by
crossmodal spatial attention. Science, 289, 1206−1208.
Macknik, S. L., Fisher, B. D., & Bridgeman, B. (1991). Flicker distorts visual space
constancy. Vision Research, 31, 2057−2064.
Magnuson, J. S. (2005). Moving hand reveals dynamics of thought. Proceedings of the
National Academy of Sciences, 102, 9995−9996.
Mahon, B. Z., & Caramazza, A. (2008). A critical look at the embodied cognition
hypothesis and a new proposal for grounding conceptual content. Journal of
Physiology — Paris, 102, 59−70.
Marslen-Wilson, W. (1987). Functional parallelism in spoken word recognition.
Cognition, 25, 71−102.
Massaro, D. (1999). Speechreading: Illusion or window into pattern recognition. Trends
in Cognitive Sciences, 3, 310−317.
McClelland, J., & Elman, J. (1986). The TRACE model of speech perception. Cognitive
Psychology, 18, 1−86.
McElree, B., & Carrasco, M. (1999). The temporal dynamics of visual search: Evidence
for parallel processing in feature and conjunction searches. Journal of Experimental
Psychology: Human Perception and Performance, 25, 1517−1539.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264,
746−748.
McKinstry, C., Dale, R., & Spivey, M. J. (2008). Action dynamics reveal parallel
competition in decision making. Psychological Science, 19, 22−24.
McLeod, P., Driver, J., & Crisp, J. (1988). Visual search for a conjunction of movement
and form is parallel. Nature, 332, 154−155.
McMurray, B., Tanenhaus, M., & Aslin, R. (2002). Gradient effects of within-category
phonetic variation on lexical access. Cognition, 86, B33−B42.
McMurray, B., Tanenhaus, M., Aslin, R., & Spivey, M. (2003). Probabilistic constraint
satisfaction at the lexical/phonetic interface: Evidence for gradient effects of
within-category VOT on lexical access. Journal of Psycholinguistic Research, 32,
77−97.
McMurray, B., & Spivey, M. (1999). The categorical perception of consonants: The
interaction of learning and processing. Proceedings of the Chicago Linguistic Society
Panels, 35-2 (pp. 205−221).
Meteyard, L., Zokaei, N., Bahrami, B., & Vigliocco, G. (2008). Now you see it: Visual
motion interferes with lexical decision on motion words. Current Biology, 18,
R732−R733.
Miller, J. (1982). Discrete versus continuous stage models of human information
processing: In search of partial output. Journal of Experimental Psychology: Human
Perception and Performance, 8, 273−296.
Mounts, J., & Tomaselli, R. (2005). Competition for representation is mediated by
relative attentional salience. Acta Psychologica, 118, 261−275.
Mumford, D. (1992). On the computational architecture of the neocortex II. The role of
cortico-cortical loops. Biological Cybernetics, 66, 241−251.
Nakayama, K., & Silverman, G. (1986). Serial and parallel processing of visual feature
conjunctions. Nature, 320, 264−265.
Nazir, T. A., Boulenger, V., Jeannerod, M., Paulignan, Y., Roy, A., & Silber, B. (2007).
Language-induced motor perturbations during the execution of a reaching
movement. Journal of Cognitive Neuroscience, 18, 1607−1615.
Neisser, U. (1976). Cognition and reality: Principles and implications of cognitive
psychology. San Francisco, California: W. H. Freeman.
Olds, E. S., Cowan, W. B., & Jolicoeur, P. (2000a). Partial orientation pop-out helps
difficult search for orientation. Perception & Psychophysics, 62, 1341−1347.
Olds, E. S., Cowan, W. B., & Jolicoeur, P. (2000b). The time-course of pop-out search.
Vision Research, 40, 891−912.
Olds, E. S., Cowan, W. B., & Jolicoeur, P. (2000c). Tracking visual search over space and
time. Psychonomic Bulletin & Review, 7, 292−300.
Palmer, J., Verghese, P., & Pavel, M. (2000). The psychophysics of visual search. Vision
Research, 40, 1227−1268.
Pisoni, D., & Tash, J. (1974). Reaction times to comparisons within and across phonetic
categories. Perception and Psychophysics, 15, 285−290.
Pulvermüller, F. (2005). Brain mechanisms linking language and action. Nature Reviews
Neuroscience, 6, 576−582.
Pulvermüller, F., Hauk, O., Ilmoniemi, R., & Nikulin, V. (2005). Functional links between
motor and language systems. European Journal of Neuroscience, 21, 793−797.
Ratcliff, R. (1985). Theoretical interpretations of the speed and accuracy of positive and
negative responses. Psychological Review, 92, 212−225.
Rauscher, F. B., Krauss, R. M., & Chen, Y. (1996). Gesture, speech and lexical access:
The role of lexical movements in speech production. Psychological Science, 7,
226−231.
Reali, F., Spivey, M. J., Tyler, M. J., & Terranova, J. (2006). Inefficient conjunction search
made efficient by concurrent spoken delivery of target identity. Perception and
Psychophysics, 68, 959.
Reynolds, J., & Desimone, R. (2001). Neural mechanisms of attentional selection. In J. Braun,
C. Koch, & J. Davis (Eds.), Visual attention and cortical circuits (pp. 121−135).
Cambridge, Mass: MIT Press.
Richardson, D., Spivey, M., Barsalou, L., & McRae, K. (2003). Spatial representations
activated during real-time comprehension of verbs. Cognitive Science, 27, 767−780.
Rolls, E., & Tovee, M. (1995). Sparseness of the neuronal representation of stimuli in the
primate temporal visual cortex. Journal of Neurophysiology, 73, 713−726.
Sekuler, R., Sekuler, A., & Lau, R. (1997). Sound alters visual motion perception. Nature,
385, 308.
Shams, L., Kamitani, Y., & Shimojo, S. (2000). What you see is what you hear. Nature,
408, 788.
Siegler, R. S., & Lin, X. (2009). Self-explanations promote children's learning. In H. S. Waters, &
W. Schneider (Eds.), Metacognition, strategy use, and instruction (pp. 85−113). New York:
Guilford Publications.
Snedeker, J., & Trueswell, J. (2004). The developing constraints on parsing decisions:
The role of lexical-biases and referential scenes in child and adult sentence
processing. Cognitive Psychology, 49, 238−299.
Spence, C., & Driver, J. (Eds.). (2004). Crossmodal space and crossmodal attention. Oxford:
Oxford University Press.
Spence, C., Nicholls, M. E. R., Gillespie, N., & Driver, J. (1998). Cross-modal links in
exogenous covert spatial orienting between touch, audition, and vision. Perception
& Psychophysics, 60, 544−557.
Spivey, M. J. (2007). The continuity of mind. New York: Oxford University Press.
Spivey, M. J., & Dale, R. (2004). On the continuity of mind: Toward a dynamical account
of cognition. In B. Ross (Ed.), The psychology of learning & motivation. Volume 45
(pp. 87−142). San Diego: Elsevier.
Spivey, M. J., Dale, R., Knoblich, G., & Grosjean, M. (2010). Do curved reaching
movements emerge from competing perceptions? Journal of Experimental Psychology: Human Perception and Performance, 36, 251−254.
Spivey, M. J., Grosjean, M., & Knoblich, G. (2005). Continuous attraction toward
phonological competitors. Proceedings of the National Academy of Sciences, 102,
10393−10398.
Spivey, M. J., Tanenhaus, M., Eberhard, K., & Sedivy, J. (2002). Eye movements and
spoken language comprehension: Effects of visual context on syntactic ambiguity
resolution. Cognitive Psychology, 45, 447−481.
Spivey, M. J., Tyler, M. J., Eberhard, K. M., & Tanenhaus, M. K. (2001). Linguistically
mediated visual search. Psychological Science, 12, 282−286.
Spivey-Knowlton, M. (1996). Integration of visual and linguistic information: Human data
and model simulations. Doctoral dissertation, University of Rochester, Rochester,
NY.
Spivey-Knowlton, M. J., Tanenhaus, M., Eberhard, K., & Sedivy, J. (1995). Eye
movements accompanying language and action in a visual context: Evidence
against modularity. Proceedings of the 17th Annual Conference of the Cognitive
Science Society. (pp. 25−30)Hillsdale, NJ: Erlbaum.
Spivey-Knowlton, M., Trueswell, J., & Tanenhaus, M. (1993). Context effects in syntactic
ambiguity resolution. Canadian Journal of Experimental Psychology, 47, 276−309.
Stephen, D. G., & Mirman, D. (2010). Interactions dominate the dynamics of visual
cognition. Cognition, 115, 154−165.
Strack, F., Martin, L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the
human smile: A nonobtrusive test of the facial feedback hypothesis. Journal of
Personality and Social Psychology, 54, 768−777.
Tanenhaus, M., Spivey-Knowlton, M., Eberhard, K., & Sedivy, J. (1995). Integration of
visual and linguistic information during spoken language comprehension. Science,
268, 1632−1634.
Theeuwes, J., & Kooi, F. (1994). Parallel search for a conjunction of contrast polarity and
shape. Vision Research, 34, 3013−3016.
Theeuwes, J., Olivers, C. N. L., & Chizk, C. L. (2005). Remembering a location makes the
eyes curve away. Psychological Science, 16, 196−199.
Tipper, S., Houghton, G., & Howard, L. (2000). Behavioural consequences of selection
from neural population codes. In S. Monsell, & J. Driver (Eds.), Control of cognitive
processes: Attention and performance XVIII (pp. 233−245). Cambridge, MA: MIT.
Tipper, S., Howard, L., & Jackson, S. (1997). Selective reaching to grasp: Evidence for
distractor interference effects. Visual Cognition, 4, 1−38.
Todd, J. W. (1912). Reaction to multiple stimuli. Archives of Psychology, 25, 1−65.
Tomlin, R. (1997). Mapping conceptual representations into linguistic representations:
The role of attention in grammar. In J. Nuyts, & E. Pederson (Eds.), Language and
conceptualization (pp. 162−189). Cambridge: Cambridge University Press.
Treisman, A. M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive
Psychology, 12, 97−136.
Trueswell, J., Sekerina, I., Hill, N., & Logrip, M. (1999). The kindergarten-path effect:
Studying on-line sentence processing in young children. Cognition, 73, 89−134.
Tucker, M., & Ellis, R. (2001). The potentiation of grasp types during visual object
categorization. Visual Cognition, 8, 769−800.
Turvey, M. T. (2007). Action and perception at the level of synergies. Human Movement
Science, 26, 657−697.
van der Wel, R. P. R. D., Eder, J., Mitchel, A., Walsh, M., & Rosenbaum, D. (2009).
Trajectories emerging from discrete versus continuous processing models in
phonological competitor tasks. Journal of Experimental Psychology: Human
Perception and Performance, 35, 588−594.
Van Orden, G., Holden, J., & Turvey, M. (2003). Self-organization of cognitive
performance. Journal of Experimental Psychology: General, 132, 331−350.
Wojnowicz, M., Ferguson, M., Dale, R., & Spivey, M. J. (2009). The self-organization of
explicit attitudes. Psychological Science, 20, 1428−1435.
Wolfe, J. M. (1994). Guided Search 2.0: A revised model of visual search. Psychonomic
Bulletin & Review, 1, 202−238.
Wolfe, J. M. (1998). What can 1 million trials tell us about visual search? Psychological
Science, 9, 33−39.
Yee, E., & Sedivy, J. (2006). Eye movements reveal transient semantic activation during
spoken word recognition. Journal of Experimental Psychology: Learning, Memory and
Cognition, 32, 1−14.
Zwitserlood, P. (1989). The locus of the effects of sentential-semantic context in
spoken-word processing. Cognition, 32, 25−64.