Acta Psychologica 137 (2011) 181–189
doi:10.1016/j.actpsy.2010.09.008

On the temporal dynamics of language-mediated vision and vision-mediated language

Sarah E. Anderson a,⁎, Eric Chiu b, Stephanie Huette b, Michael J. Spivey b
a Department of Psychology, Cornell University, Ithaca, NY 14853, United States
b Cognitive and Information Sciences, University of California, Merced, United States
⁎ Corresponding author. E-mail address: [email protected] (S.E. Anderson).

Available online 18 October 2010

Keywords: Psycholinguistics; Visual perception; Eye-tracking

Abstract

Recent converging evidence suggests that language and vision interact immediately in non-trivial ways, although the exact nature of this interaction is still unclear. Not only does linguistic information influence visual perception in real-time, but visual information also influences language comprehension in real-time. For example, in visual search tasks, incremental spoken delivery of the target features (e.g., “Is there a red vertical?”) can increase the efficiency of conjunction search because only one feature is heard at a time. Moreover, in spoken word recognition tasks, the visual presence of an object whose name is similar to the word being spoken (e.g., a candle present when instructed to “pick up the candy”) can alter the process of comprehension. Dense sampling methods, such as eye-tracking and reach-tracking, richly illustrate the nature of this interaction, providing a semi-continuous measure of the temporal dynamics of individual behavioral responses. We review a variety of studies that demonstrate how these methods are particularly promising in further elucidating the dynamic competition that takes place between underlying linguistic and visual representations in multimodal contexts, and we conclude with a discussion of the consequences that these findings have for theories of embodied cognition.

“The divisions in thought are thus given disproportionate importance, as if they were a widespread and pervasive structure of independently existent breaks in ‘what is,’ rather than merely convenient features of description and analysis.”
David Bohm (Wholeness and the Implicate Order, 1980, p.34)

1. Introduction

In the late 1970s, the quantum physicist David Bohm found himself compelled to write about how his hidden-variable field theory of quantum mechanics implied that mind and matter must be undivided in time and in space. Around the same time, James J. Gibson (1979) was reminding psychologists that the “inflow of stimulus energy does not consist of discrete inputs,” and that, “stimulation is not momentary.” Nor is action momentary. At the core of the theoretical insights that these two luminaries were promoting is the simple fact that our environmental sensory input is continuously changing, and that our motor output is continuously changing. In fact, many of those changes in environmental sensory input are the causes of those changes in motor output, such as when an object that has become salient draws a saccadic eye movement to it. But it works the other way around as well. Many of those changes in motor output are the causes of those changes in environmental sensory
input, such as when that new eye position suddenly places an alternative object close enough to the fovea to also become salient. Determining which is the chicken and which is the egg in this “action–perception cycle” (Neisser, 1976) is often impossible.

There are times when those inputs and outputs may appear to change rather abruptly, seemingly discontinuously. For example, it is traditionally assumed that saccadic eye movements are ballistic and straight, and that sensory input is briefly cut off during the saccade's trajectory. However, at the time scale of dozens or hundreds of milliseconds, one readily observes that even those sudden changes are underlyingly composed of smooth continuous changes, with no genuine discontinuities taking place. For example, saccades can actually curve as a result of multiple salient locations in space (Theeuwes, Olivers, & Chizk, 2005; see also Emberson, Weiss, Barbosa, Vatikiotis-Bateson, & Spivey, 2008), and certain aspects of sensory input can actually get processed during the saccade itself (Macknik, Fisher, & Bridgeman, 1991). Thus, there appears to be a fundamental continuity in time and space (an undividedness, if you will) associated with the action–perception cycle that drives the mutual interaction in which the organism and environment engage (e.g., Chemero, 2009; Spivey, 2007; Turvey, 2007).

What exactly are the consequences that this perspective has on how we understand vision, attention, working memory, and linguistic processing? An answer to that question comes readily when one considers what happens in a natural everyday conversation between two people. At first glance, it may seem as though a sequence of spoken words from the speaker delivers a series of discontinuous auditory inputs to the listener's language system, and the listener's eye movement pattern delivers a series of discontinuous visual inputs (i.e., snapshots) to the listener's visual system. This idealized characterization has encouraged many traditional cognitive psychologists to assume that the listener's mind is dealing with a sequence in which one word is auditorily recognized, then the next word, and the next, and simultaneously one fixated object is visually recognized, then the next object, and the next.

The consequence that a continuous perception–action cycle has for our view of vision, attention, working memory, and linguistic processing is that we have to let go of that idealized characterization of visual and linguistic comprehension as producing a linear string of symbolic representations. For example, spoken word recognition has such a gradual time course to it (Allopenna, Magnuson, & Tanenhaus, 1998; see also McMurray, Tanenhaus, Aslin, & Spivey, 2003) that a new word is often being heard while the recognition process of the previous word has not yet run its full course. Thus, the completion of the recognition process for a given word can actually be contextually influenced by words that are spoken after that word (Dahan, 2010). Similarly, visual object recognition has a gradual time course to it (Rolls & Tovee, 1995), such that the completed neural activation pattern associated with fully-accomplished recognition of a foveated object (taking approximately 400 ms) is often not allowed to quite reach completion before the next eye movement is triggered (approximately every 300 ms).
(These not-quite-completed representations of visual objects are similar to the notion of “good-enough representations” in sentence processing; Ferreira, Bailey, & Ferraro, 2002). Additionally, with recent evidence of nonclassical visual receptive fields that functionally integrate lateral and feedback projections (Gallant, Connor, & Van Essen, 1998), it becomes clear that, like spoken word recognition, every object recognition event is being richly contextualized by its surrounding contours and shapes, even at the earliest stages of its visual cortical processing (e.g., Grosof, Shapley, & Hawken, 1993; Lee & Nguyen, 2001). Thus, the consequences that a continuous perception–action cycle has for our view of vision, attention, working memory, and linguistic processing are that the “sequences” of linguistic and visual elements that a mind receives during an everyday conversation are not sequences at all. The sensory inputs to language and vision are continuous flowing streams of partially-overlapping, not-quite-divisible elements, because adjacent words have co-articulation, because multiple objects are often attended simultaneously, and because sensory processing itself does not function in discrete time.

With a language subsystem and a vision subsystem each taking their input streams and producing their output streams continuously, this raises the question of how they combine these data streams for understanding environmental situations that intermingle both linguistic and visual properties. The pervasive modularity assumption of decades ago (Fodor, 1983) has been compromised by findings in a number of fields, time and time again (for reviews, see Bechtel, 2003; Driver & Spence, 1998; Farah, 1994; Lalanne & Lorenceau, 2004; Spivey, 2007). However, psycholinguists and vision researchers still have a habit of treating their own favorite mental faculty as though it operates independently of all other mental faculties. Many psycholinguists readily accept the idea that language comprehension can dramatically influence visual cognition, but when they are faced with claims that vision can influence language processing, they bristle. Inversely, many vision scientists accept with aplomb the idea that visual perception can profoundly influence language comprehension, but when you suggest to them that language could profoundly influence vision, they suddenly become skeptics.

Even the researchers who report evidence of contextual interactions between language and vision tend to betray their implicit biases with their choice of terminology. When psycholinguists study cognitive processing in environments that contain visual and linguistic signals, they refer to the linguistic signal as the target stimulus and the visual signal as the context. When vision researchers study cognitive processing in environments that contain visual and linguistic signals, they refer to the visual signal as the target stimulus and the linguistic signal as the context. But it is often essentially the same environment and task! Take, for example, a case where the visual display contains multiple objects and a spoken linguistic query or instruction is intended to guide the participant's eyes to one particular object in the display.
If we conceive of this as a spoken language comprehension task, where a stream of coarticulated phonemes drives the activation of multiple competing lexical representations (McClelland & Elman, 1986), then the visual signal clearly acts as an additional constraint to influence that process (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). However, if we conceive of this same task as one of visual selection, where a saliency map representing several interesting objects drives eye movements to those objects (Itti & Koch, 2001), then it is obviously the linguistic signal that is acting as an additional contextual constraint that influences that process (Spivey, Tyler, Eberhard, & Tanenhaus, 2001).

The brain, however, is not paying attention to these labels of “target” and “context.” A brain that participates in a visuolinguistic experiment has no idea that it is in a psycholinguistics lab or in a visual perception lab. That brain doesn't know that the lab's principal investigator is chiefly interested in language and is treating the visual stimuli as context, or vice versa. The brain is simply being guided by situational constraints to map an array of multimodal sensory inputs onto a limited set of afforded actions — hopefully in a fashion similar to how it normally does that in everyday life. To the human brain, vision and language are roughly equally relevant signal streams (among several others) that can mutually constrain one another continuously in real-time.

The temporal continuity from sensory to motor processing, and the multiple partially active shared representations in language, vision, and action, which we will discuss throughout this article, all suggest that cognition cannot help but be embodied, and thus constrained by the actions available to the organism (Barsalou, 2008; Chemero, 2009; Spivey, 2007). Many examples of perceptual and cognitive processing, which we will review in more depth later, suggest that cognition is unavoidably shaped and nuanced by the sensorimotor constraints of the organism coupled with its environment (Gibson, 1979; Turvey, 2007). In this article, we review some of these findings of language influencing vision and of vision influencing language, with a special focus on the role played in these discoveries by a broad methodological approach of “dense sampling.” Particularly among the recent studies of vision influencing language, researchers have used eye-tracking and reach-tracking to record multiple samples per trial (not just a reaction time at the end of a trial), and have found unique types of evidence consistent with multiple competing partially active representations midway through understanding a visually-contextualized spoken instruction. Our review concludes with a discussion of the theoretical consequences that this evidence has for modular vs. interactive accounts of the mind, and for recent ideas about the embodiment of cognition.

2. Language influences vision

In Fodor's (1983) modularity thesis, two of the most important properties of modules are: 1) information encapsulation, a property whereby separate perceptual modules, with independent and specific functional purposes, do not share the intermediate products of their information processing directly with one another, and 2) domain specificity (or “functional specialization,” Barrett & Kurzban, 2006), a property whereby modules are limited in “the range of information they can access” (Fodor, 1983, p.47).
Since the time that thesis was put forth, there have been quite a few experimental demonstrations of real-time perceptual interactions among vision, touch, audition, and linguistic systems that clearly violate information encapsulation. Importantly, when information encapsulation is compromised among modules with putatively-independent functions, those modules unavoidably begin processing one another's information sources to some degree. As a result, they are no longer truly functionally specialized either.

Crossmodal interactions among perceptual systems have been discussed in philosophy and psychology for some time. For example, in the 18th century, George Berkeley suggested that visual perception of space is deeply influenced by tactile experience. However, it wasn't until the end of the 20th century that experimental laboratory findings began to clearly demonstrate the powerful effect that tactile input can have on visual perception. For example, tactile stimulation of the left or right index finger improves speed and accuracy in a visual discrimination task inside the corresponding hemifield of the visual display (Spence, Nicholls, Gillespie, & Driver, 1998). Moreover, the neural activation resulting from that tactile cuing of visual performance is detectable in neuroimaging of visual cortex (Macaluso, Frith, & Driver, 2000).

Further evidence for non-modular functioning of the visual system is seen with perceptual interactions between vision and audition. For example, when a single flash of light is accompanied by two auditory beeps, it is frequently misperceived as two flashes of light (Shams, Kamitani, & Shimojo, 2000). Moreover, when a leftward-moving disk and a rightward-moving disk are animated on a computer screen such that they pass through each other and continue, the baseline perception of the event is that they passed by each other on slightly different depth planes. However, if a simple 2.5 millisecond auditory click is delivered at the point of visual coincidence, observers routinely perceive this same dynamic visual display as an event where two disks bounce off of each other and reverse their directions of movement (Sekuler, Sekuler, & Lau, 1997).

There is also evidence for the interaction of vision with so-called “high level” cognitive processes like language. For example, ascribing linguistic meaning to an otherwise arbitrary visual stimulus facilitates performance in visual categorization (Goldstone, Lippa, & Shiffrin, 2001; Lupyan, 2008) and in visual search (Lupyan & Spivey, 2008). In Lupyan and Spivey's (2008) experiment, some participants were explicitly instructed to apply a meaningful label to a novel visual stimulus in a visual search task. As a result, participants who used that label performed the search faster and more efficiently (i.e., shallower slope of reaction time by number-of-distractors). Similarly, concurrent delivery of an auditory label with a noisy visual stimulus (e.g., hearing the name of a letter when performing a signal detection task with a very low-contrast image of a letter) has been shown to produce a significantly greater visual sensitivity measure (i.e., d-prime) specifically when the verbal cue matches the visual stimulus (Lupyan & Spivey, 2010). These data suggest that there is a top-down conceptual influence on visual perception, such that seeing not only depends on what something looks like, but also on what it means.
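For readers less familiar with the sensitivity measure cited above, d-prime compares the z-transformed hit rate with the z-transformed false-alarm rate, so a verbal cue can only raise d-prime by genuinely improving discrimination rather than by shifting response bias. The short Python sketch below illustrates the computation; the trial counts are hypothetical and are not data from Lupyan and Spivey (2010).

    from scipy.stats import norm

    def d_prime(hits, misses, false_alarms, correct_rejections):
        """Sensitivity index d' = z(hit rate) - z(false-alarm rate).
        A log-linear correction (add 0.5 to each cell) keeps perfect
        rates from producing infinite z-scores."""
        hit_rate = (hits + 0.5) / (hits + misses + 1.0)
        fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
        return norm.ppf(hit_rate) - norm.ppf(fa_rate)

    # Hypothetical counts: detection of a low-contrast letter with a matching
    # vs. mismatching auditory label (illustrative numbers only).
    print(d_prime(hits=42, misses=8, false_alarms=12, correct_rejections=38))   # label matches
    print(d_prime(hits=33, misses=17, false_alarms=15, correct_rejections=35))  # label mismatches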
Something akin to this idea of top-down guidance of visual processing was implemented in the form of Wolfe's (1994) Guided Search model, which imposed some important revisions to Treisman and Gelade's (1980) original Feature Integration Theory of visual search. As a purely feed-forward account, the original Feature Integration Theory proposed that a pre-attentive first-stage of visual search processed its input in parallel with topographic maps devoted to the detection of individual features, but that a second-stage attentional system (the integrative master map) performed its search in a serial fashion for objects that conjoined multiple partially-distinguishing features. With many types of stimuli, this model correctly predicted that: a) search for a target that is distinguished from its distractors by a single feature produces a roughly flat slope of reaction time by set-size (hence, parallel processing of the display), and b) search for a target that is distinguished from its distractors by a conjunction of features produces a linearly increasing slope of reaction time by set-size (hence, serial processing of the display).

However, a number of findings have been gradually leading to a major overhaul of Treisman's early insights, such as: 1) Nakayama and Silverman's (1986) evidence for parallel processing of certain conjunction searches (see also McLeod, Driver, & Crisp, 1988; Theeuwes & Kooi, 1994); 2) Duncan and Humphreys's (1989) evidence for graded similarity effects in visual search; 3) Wolfe's (1994) evidence for top-down guidance; 4) McElree and Carrasco's (1999; see also Dosher, Han, & Lu, 2004) use of the speed–accuracy tradeoff paradigm to show that both feature and conjunction search involve parallel processing; 5) Palmer, Verghese, and Pavel's (2000; see also Eckstein, 1998) use of signal detection theory to account for conjunction search as essentially a problem of signal-to-noise ratio rather than serial processing; and 6) Wolfe's (1998) demonstration that RT × set-size functions compiled from dozens of visual search experiments do not separate themselves into a bimodal distribution of shallow (parallel) and steep (serial) slopes.

Particularly compelling arguments against the serial processing perspective in visual attention come from evidence of “biased competition” in extrastriate visual cortex. Here, both top-down and bottom-up interactions have been found to mediate neural mechanisms of selective visual attention (Desimone, 1998; Desimone & Duncan, 1995). These findings support a parallel processing perspective that claims visual attention is better characterized as a function of partially active representations of objects simultaneously contending for mappings onto motor output (Desimone & Duncan, 1995; Mounts & Tomaselli, 2005; Reynolds & Desimone, 2001). Inspired by the biased competition framework, Spivey and Dale (2004) and Reali, Spivey, Tyler, and Terranova (2006) developed a parallel competitive winner-take-all model of reaction times in visual search (where the capacity limitation comes as a side-effect of the normalization process that imposes the competition). In this model, representations of features and of objects are all partially active in parallel and compete over time until an object representation exceeds a fixed criterion of activation, at which point a reaction time is recorded.
Despite the processing in this localist attractor network being entirely parallel, conjunction searches (as well as triple conjunctions and high-similarity searches) in this model produce linear increases in reaction time as more distractors are added to the display.

Further support against Treisman and Gelade's (1980) strict dichotomy between parallel processing and serial processing in visual search comes from studies by Olds, Cowan, and Jolicoeur (2000a,b,c). They presented single-feature visual search pop-out displays for very brief periods of time, less than 100 ms in some conditions, before changing them to conjunction search displays. Although participants did not report experiencing a pure pop-out effect and their response times were not as fast as with pure pop-out displays, facilitatory effects did emerge in response times, due to the very brief period of time that the display showed only single-feature distractors. This effect of a partial pop-out process assisting the conjunction search process was called “search assistance.”

Conceptually similar to Olds's work, there is a related instance of the initial portion of stimulus delivery instigating a single-feature search, with the remainder of stimulus delivery converting the search into a conjunction search. Spivey et al. (2001) placed participants in an auditory/visual (A/V) concurrent condition, in which observers were presented with the conjunction search display concurrently with target identity delivered via auditory linguistic query (e.g., “Is there a red vertical?”). Thus, while viewing the display, participants heard “red” before they heard “vertical,” and thus spent a few hundred milliseconds performing visual search on the display with knowledge of only one feature. This was compared to a baseline auditory-first control condition, where target identity was provided via the same spoken query prior to visual display onset. In both target-present and target-absent displays, the A/V concurrent condition produced significantly shallower slopes of reaction time by set-size, compared to the auditory-first control condition.

If visual search for a conjunction target was a process that could only take place with a completed target template as the guide for the serial search of a master map, then search could not commence until the full noun phrase had been heard, and no improvement in search efficiency could have happened. Clearly, the concurrent and continuous processing of spoken linguistic input was able to quickly influence the efficiency with which the visual search process took place, by allowing the linguistic processing of the first adjective to already instigate some visual search based on that one feature. To be sure, previous work has shown that, with practice, participants can strategically carry out subset-biased search in a conjunction display (Egeth, Virzi, & Garbart, 1984; Kaptein, Theeuwes, & van der Heijden, 1995). However, this finding of spoken linguistic input being the immediate guide for that subset-biasing, over the course of a few hundred milliseconds, provides intriguing evidence for language being able to intervene in the real-time processing of the visual system, with little or no strategic practice. For a concrete demonstration of how a localist attractor network, relying solely on parallel competition, can mimic this process, see Reali et al. (2006).
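To make the architecture concrete, the Python sketch below implements a toy localist network in the spirit of the Spivey and Dale (2004) and Reali et al. (2006) models: every display item accumulates evidence in parallel in proportion to how many of the currently heard target features it matches, a divisive normalization step forces the items to compete for a fixed pool of activation, and the number of cycles needed for one item to reach criterion stands in for reaction time. This is an illustrative caricature rather than the published model (the support function, accumulation rate, and criterion are arbitrary choices made here), and it reproduces only the qualitative pattern of steeper conjunction slopes, not the published quantitative fits.

    import numpy as np

    def cycles_to_criterion(target_features, item_features, criterion=0.95,
                            rate=0.05, max_cycles=5000):
        """Parallel, normalization-based competition among localist item nodes.
        Returns the number of cycles until one item's activation exceeds criterion."""
        n = len(item_features)
        act = np.full(n, 1.0 / n)                       # uniform starting activation
        match = np.array([len(target_features & f) for f in item_features], dtype=float)
        support = (match + 0.1) / (len(target_features) + 0.1)   # graded bottom-up support
        for cycle in range(1, max_cycles + 1):
            act = act + rate * support * act            # all items accumulate in parallel
            act = act / act.sum()                       # normalization imposes the competition
            if act.max() >= criterion:
                return cycle                            # stands in for reaction time
        return max_cycles

    # Conjunction search ("red vertical" among red-horizontal and green-vertical
    # distractors) versus single-feature search (the only red item among green ones).
    for n in (5, 10, 15, 20):
        conj_items = [{"red", "vertical"}] + [
            {"red", "horizontal"} if i % 2 else {"green", "vertical"} for i in range(n - 1)]
        feat_items = [{"red", "vertical"}] + [{"green", "vertical"}] * (n - 1)
        print(n, cycles_to_criterion({"red", "vertical"}, conj_items),
              cycles_to_criterion({"red"}, feat_items))

In this toy version, the conjunction condition yields cycle counts that climb with set size while the single-feature condition stays comparatively flat, even though nothing in the loop is serial; the set-size cost falls out of the normalization step, which is the point such models are meant to illustrate.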
However, it has been shown that the magnitude of this improved search efficiency is affected by the rate of speech, both in human data (Gibson, Eberhard, & Bryant, 2005) and in model simulations (Reali et al., 2006). With faster speech, the slope of reaction time by set-size is not as flattened out in the A/V concurrent condition. To more systematically explore subtle timing issues such as this, Chiu and Spivey (in preparation) devised a semi-concurrent condition, where observers hear the first adjective (e.g., “Is there a red…”) before onset of the display and then hear the second adjective (e.g., “vertical?”) simultaneously with the onset of the display. In this condition, by the time the display is presented, the second adjective is being heard, and thus at the point in time at which visual search can begin, the target is known to be a conjunction of two features. As a result, the slope of reaction time by set-size is steep and linear, as in the control baseline condition. However, if a mere 200 ms of silence is spliced in between the two adjectives, during which the display is being viewed but the second adjective has not yet been delivered, suddenly the slope of reaction time by set-size is shallow again. Thus, much like Olds et al.'s (2000a,b,c) “search assistance,” just a tiny amount of time initially performing a single-feature search (since only one adjective has been heard so far) is enough to significantly improve search efficiency.

In the context of myriad perceptual interactions between various sensory systems (see Spence & Driver, 2004, for extensive review), and the broad prevalence of top-down neuronal projections throughout cortex (e.g., Kveraga, Ghuman, & Bar, 2007; Mumford, 1992), it should not be surprising that something as “high level” as language is nonetheless capable of subtly influencing visual perception at the time scale of hundreds of milliseconds. However, examining the time course of such interactive processing has, in the past, relied on manipulating the independent variable, as in the speed–accuracy tradeoff paradigm and in analyses of reaction time distributions (Dosher, 1976; Miller, 1982; Ratcliff, 1985). Such paradigms extract pieces of information about processing at different time points in order to reconstruct the overall time course of processing, typically by running thousands of trials and interrupting normal cognitive processing. What might better facilitate future work on language-mediated vision is not just experimental manipulations (independent variables) that pry open those milliseconds during visuolinguistic interaction, but also experimental measures (dependent variables) that pry open those milliseconds. This line of research needs methodologies that allow not only a measure of how accurate a participant was at the end of a trial or how long she took to get there, but also a measure of what alternative response options were considered (even just partially) along the way.

3. The dense sampling approach

With longer time scales of cognition, as in development and in long-term task performance, dense sampling across the time course of the phenomenon in question has already been extremely informative.
For example, for studying behavior over the course of hours, a time series of a thousand or more reaction times can be analyzed as one temporally-extended process of cognitive performance (rather than a thousand sequential processes of little word recognition events), and thus reveal statistical patterns of fractal structure in the variance that are naturally predicted only by an interactive dynamical account of cognition (Kello, Beltz, Holden, & Van Orden, 2007; Van Orden, Holden, & Turvey, 2003). Similarly, when Stephen and Mirman (2010) analyzed the overall distributions of saccade lengths over the course of many trials in a visual search task, they found evidence consistent with a single underlying process for both single-feature and conjunction search (e.g., Spivey & Dale, 2004), and evidence for lognormal and power-law distributions, which imply self-organized interaction-dominant dynamics in visual cognition (Aks, Zelinsky, & Sprott, 2002), rather than additive encapsulated components (Cavanagh, 1988). Thus, by treating a series of cognitive events as a single durative process, and statistically analyzing that process, the patterns of data reveal properties of the phenomenon that are not well accommodated by traditional linear box-and-arrow accounts of cognition.

The same general idea, of examining multiple measures of one large process, applies to developmental cognition as well. For studying the temporal dynamics of developmental change over the course of months, Siegler and Lin (2009, p.87) suggest that, “densely sampling changing competence during the period of rapid change provides the temporal resolution needed to understand the learning process.” Essentially, with every measurement of a change of state, there are coarse time scales at which that change will look more or less instantaneous, and there are fine time scales at which that change will look gradual. For discovering the mechanisms or processes that actually enact that change of state, it is crucial that our science operate at a time scale that reveals the underlying gradualness of that change (Spivey, 2007).

In addition to studying temporally-extended cognitive performance, dense sampling methods (such as eye-tracking and reach-tracking) can also be applied to measuring individual behaviors to provide a window into a cognitive process as it is happening. Interaction between modalities, such as that occurring in language-mediated vision and in vision-mediated language, necessitates that we not only understand the offline resultant choice and latency of the behavior, but that we also resolve how and when multiple sources of information interact on the way toward producing the response behavior. These methods can provide evidence for immediately available gradations of partially active representations indexed in oculomotor and skeletomotor movements, as well as evidence for competitive multimodal interaction evolving into a response over several hundred milliseconds.

Some hints of these multimodal interactions were originally found in redundant offline measures. For example, congruent auditory–visual information is known to speed responses (Todd, 1912). However, evidence for the time course of this process was not revealed until scientists took up a dense sampling approach to the measurement of the response movement itself. In a study by Giray and Ulrich (1993), the nature of this redundant-signals effect was examined by utilizing reaction time as well as the response force on both unimodal and bimodal trials.
Both an increase in force and a decrease in reaction times were found for trials where two sources of information were combined, which the authors use to argue for early and continuous influence of sensory information on subsequent motor responses, as opposed to the traditional assumption that this information has no influence after a response is initiated. Balota and Abrams (1995) show a similar pattern of results with word frequency: motor movements exhibit more force in response to high frequency words than low frequency words. These findings suggest that word frequency not only influences the time required to recognize a word, but also influences the subsequent response dynamics, implying that the motor system is not slavishly executing a single command delivered by a cognitive system after it has completed its processing, but instead the motor system cooperates with the cognitive system in real-time to co-generate a response.

A wave of experiments has emerged that measure kinematic features of the motor movement during a response, providing insight into the temporal dynamics of activation accumulation. Abrams and Balota (1991) had participants make rapid limb movements in opposite directions in order to indicate whether a string of letters was a word or not. In addition to high lexical frequency speeding response and increasing force, they also found effects in movement duration, peak acceleration, final velocity, and initial velocity. These early effects are extremely important for distinguishing between models of perception and cognition that make unique predictions regarding intermediate stages of processing, where an early effect of velocity can mean the difference between an encapsulated modular stage and a partially active distributed representation.

The temporal dynamics of motor output can be especially informative when the stimulus delivery itself is inherently extended in time as well. One of the many concerns in investigating spoken language is the temporal nature of acoustic events: sounds arrive in a linear order to form words, sentences, and discourse. Methods such as eye-tracking can reveal probabilistic activations for visual referents available in the environment. Because saccades are closely time-locked to speech, such methods allow for direct time-sensitive measurements of processing that can address fine-grained aspects of language comprehension (Tanenhaus et al., 1995). A primary assumption in this “visual world paradigm” is that saccades are readily driven by partially active representations. Thus, by collecting samples of 2–4 fixations per second, one observes that during spoken word recognition eye movements are made not only to target referents but also to competitor objects with phonologically similar names (Allopenna et al., 1998; Spivey-Knowlton, Tanenhaus, Eberhard, & Sedivy, 1995), semantically related properties (Huettig & Altmann, 2005; Yee & Sedivy, 2006), and visually similar shapes (Huettig & Altmann, 2007). Even more samples per second can be collected when one records the temporal dynamics of a reaching movement, again revealing competition between multiple potential movement destinations (Finkbeiner & Caramazza, 2008; Tipper, Howard, & Jackson, 1997).
Note, however, that reach-movements are often initiated after a first eye movement, and therefore this compensatory strength and weakness (denser sampling but later measurement in reach-tracking) should encourage one to treat these two methods as complementary, not adversarial.

Spivey, Grosjean, and Knoblich (2005) simplified the reach-tracking method by developing a computer-mouse-tracking paradigm. This method of sampling full mouse-movement trajectories at 60 Hz, and looking at their curvatures, velocity and acceleration profiles, distribution of maximum deviations, as well as measures of entropy or disorder, can aid in distinguishing between alternative computational simulations of the temporal dynamics of uncertainty resolution in spoken word recognition (Spivey, Dale, Knoblich, & Grosjean, 2010; van der Wel, Eder, Mitchel, Walsh, & Rosenbaum, 2009). Computer-mouse tracking has also been similarly informative for studies of sentence processing (Farmer, Anderson, & Spivey, 2007), semantic categorization (Dale, Kehoe, & Spivey, 2007), color categorization (Huette & McMurray, 2010), decision making (McKinstry, Dale, & Spivey, 2008), and social preferences (Freeman, Ambady, Rule, & Johnson, 2008; Wojnowicz, Ferguson, Dale, & Spivey, 2009).

The reason that dense sampling of motor output during a response is so informative is that the motor system does not patiently wait until a cognitive process has reached completion before it begins generating a movement associated with the results of that cognitive process. Rather, the multifarious pattern of neural activity associated with a given cognitive process, as it evolves over the course of several hundred milliseconds, unavoidably influences the initial generation of movement plans in oculomotor cortex (Gold & Shadlen, 2000) and in primary motor cortex (Cisek & Kalaska, 2005). The presence of reciprocal neural projections between frontal cortex and these motor areas suggests that they may produce an undivided process of coevolution during those several hundred milliseconds whereby cognition and action are not quite separable (e.g., Barsalou, 2008; Chemero, 2009; Hommel, 2004; Nazir et al., 2007; Pulvermüller, 2005; Spivey, 2007).

4. Vision influences language

The area of vision-mediated language is where the dense sampling approach (especially eye-tracking and reach-tracking) has made a great deal of progress lately. However, the most well-known textbook example of vision affecting language (the McGurk Effect) was discovered well before the dense sampling approach became widely used. In the McGurk Effect (McGurk & MacDonald, 1976), the participant sees a video of a face repeating the syllable “ga,” while hearing the syllable “ba” synchronized with the face's mouth movement, but reports perceiving the syllable “da.” Although the auditory input clearly supports a percept of “ba,” the visual input of the lips remaining open is incongruous with this perception. The best fit for the auditory and visual stimuli then is a percept of “da,” because it has substantial phonetic similarity with the sound being heard and a visual compatibility with the lip movement being seen. Extending this work by digitally altering faces and sound files along a “ba”–“da” continuum, Massaro (1999) provided evidence that the visual perception of the speaker's mouth has an immediate and graded effect on which of the two phonemes was reported.
Another intriguing example of vision-mediated language is the process of silent lip-reading, during which the auditory cortex of a skilled lip-reader is active (Calvert et al., 1997). Even though there is no actual auditory information arriving at the senses, the skill of lip-reading appears to be closely intertwined with auditory processing, recruiting those linguistic representations in order to understand what was said.

While such demonstrations of interactions between language and vision are extremely informative, eye-tracking and mouse-tracking make it possible to develop a more detailed mechanistic understanding of the temporal dynamics involved in how visual information immediately impacts language processing, especially lexical and sentence processing. There are many reasons for looking beyond outcome-based measures, including that outcome-based measures often do not provide a complete explanation. Take phonemic categorical perception for example. When using outcome-based measures, distinguishing /p/ from /b/ appears discretely categorical: if the voice onset time (VOT) is between 0 and 30 ms, then the sound is perceived as “bah,” and if VOT is between 30 and 60 ms, the sound is perceived as “pah” (Liberman, Harris, Kinney, & Lane, 1961). However, reaction times are significantly longer when the VOT is closer to this 30 ms boundary than when it is farther away (Pisoni & Tash, 1974). Further, by tracking eye movements in addition to recording reaction times, we can see that those longer reaction times close to the phonetic boundary are the result of competition between partially active representations (McMurray & Spivey, 1999). When the VOT is close to 30 ms, vacillatory eye movements between the two response options are observed, indicating that both options are temporarily being considered. Eye-tracking also reveals that this competition lingers long enough to impact the processing of an entire word, like “pear” and “bear” (McMurray, Tanenhaus, & Aslin, 2002). These data suggest that dense sampling methods like eye- and reach-tracking are extremely useful measures to add to outcome-based and reaction time methods (Abrams & Balota, 1991).

Since the advent of the visual world paradigm (Tanenhaus et al., 1995; see also Cooper, 1974), eye-tracking has been used extensively to investigate the real-time interactions of visual information and language comprehension. Earlier reaction time data suggested that as a word unfolds over time, it is ambiguous with other words that share similar-sounding onsets (Marslen-Wilson, 1987; Zwitserlood, 1989). This cohort theory suggests that for a brief period after the onset of a word, all words beginning with the same phonemic input, called the cohort, show partial activation. As more phonemic input is received, the target is more unambiguously identified, with other words dropping out of the cohort. However, the real-time processing of these cohort effects and some of the theory's predictions remained untestable with reaction time data. Monitoring eye movements, Allopenna et al. (1998; see also Spivey-Knowlton, 1996) presented subjects with visual displays containing four items: the target (e.g., a beaker), an onset-competitor (a beetle), a rhyme competitor (a speaker), and an unrelated referent (a carriage).
Eye movements were recorded as participants heard and responded to instructions, like “Pick up the beaker.” The pattern of eye movements revealed that during the latter half of the spoken target word, the probabilities of fixating the target and the cohort competitor both increased gradually and equally. Around the offset of the spoken target word, the proportion of looks to the target began to rise precipitously, and the proportion of looks to the cohort competitor began to decline soon thereafter. Thus, early in the auditory stimulus, before the target has been uniquely identified, competition between the partially active representations manifests itself in the eye movement patterns. Additionally, these data revealed a greater proportion of fixations to the rhyme competitor than to the neutral distractor object, providing a clear demonstration of the rhyme competitor effects naturally predicted by McClelland and Elman's (1986) interactive neural network simulation of speech perception.

More recently, in the field of spoken word recognition, the graded and continuous trajectories of computer-mouse tracking have been used to complement more discrete fixation-based data from saccadic eye movement patterns. The arm movements of mouse-tracking are relatively continuous and can be smoothly redirected mid-flight, allowing graded effects of dynamic competition between multiple partially active representations to be concretely visible in individual trials (Spivey et al., 2005). In the first demonstration of this, participants were presented with an auditory stimulus such as “Click on the candy” and asked to use a computer mouse to click on the corresponding picture on the computer screen. The visual scene contained either two pictures of phonologically similar items (i.e., a piece of candy and a candle) or two pictures of phonologically dissimilar items (i.e., a ladle and a candle). Streaming x, y coordinates were recorded, and the competition between the partially activated lexical representations was revealed in the smooth curvature of the hand-movement trajectories. Similar to the work of Allopenna et al. (1998), when items in the visual scene were phonologically similar, these trajectories showed graded spatial attraction to the competing object. This graded spatial attraction provides evidence both of the continuous uptake of (and interaction between) visual and linguistic information, and of the dynamic competition between partially active representations.

Just as words are temporarily ambiguous across time, so are many sentences. The dense sampling methods of eye- and mouse-tracking in visual contexts have been used extensively to elucidate the processing of these sentences. Many early experiments in temporarily ambiguous sentence processing looked at sentences in isolation, with the data providing a rather modular view of their processing. Take the sentence, “Since Jay always jogs a mile doesn't seem far.” In this sentence, inflated reading times were observed when readers encountered the disambiguating information “doesn't” (Frazier & Rayner, 1982). Researchers explained this increased reading time by appealing to syntactic heuristics in the parsing mechanism, thus arguing for the parser's autonomy from other information sources, such as semantics and visual information. However, processing syntactically ambiguous sentences in conjunction with a visual scene produces drastically different findings (Tanenhaus et al., 1995).
When adults hear a temporarily ambiguous sentence such as “Put the apple on the towel in the box” while viewing a scene containing an apple, a towel (incorrect goal location), a box (correct goal location), and a flower (neutral unrelated referent), they experience the garden-path effect, temporarily interpreting “on the towel” as the destination of the putting event, only to later realize that this parse is ultimately incorrect. In reading experiments, the garden-path effect manifests itself as inflated reading times at the point of disambiguation; in visual world eye-tracking it manifests itself as an increased proportion of looks to the incorrect destination (in this case, the towel) when compared to trials containing unambiguous sentences like “Put the apple that's on the towel in the box.” Following previous reading experiments that used discourse contexts to modulate syntactic garden-path effects (e.g., Altmann & Steedman, 1988; Spivey-Knowlton, Trueswell, & Tanenhaus, 1993), Tanenhaus et al. (1995) used a visual context that contained two similar referent objects for “the apple.” Therefore, upon hearing “Put the apple,” the listener temporarily did not know which of the two apples was being referred to. This referential ambiguity caused the upcoming prepositional phrase “on the towel” to be parsed as attached to the noun phrase, thus avoiding the garden-path. Indeed, when the visual display contained a second apple, participants no longer looked at the incorrect destination (garden-path) location with any significant frequency. Thus, a visual context was shown to mediate the syntactic processing of a sentence on the time scale of hundreds of milliseconds (Chambers, Tanenhaus, & Magnuson, 2004; Snedeker & Trueswell, 2004; Spivey, Tanenhaus, Eberhard, & Sedivy, 2002; Tanenhaus et al., 1995; Trueswell, Sekerina, Hill, & Logrip, 1999).

Eye-tracking has been used to explore the time course of sentence production as well as sentence comprehension. These experiments demonstrate the link between the production of linguistic structure and visual attention in a visual world paradigm. In one experiment, a participant's eye movements allowed her to extract information from a visual scene based on comprehension of the depicted event, rather than just on visual saliency (Griffin & Bock, 2000). Further, these eye movements around the visual scene anticipated the order in which agents and patients were mentioned in sentence production tasks. Hence, visual attention, as driven by eye movements, predicted the ordering of elements in a subsequently produced sentence (Griffin & Bock, 2000). There is also evidence showing that directly manipulating visual attention in such a visual world paradigm elicits differences in the way subsequently produced sentences are structured (Gleitman, January, Nappa, & Trueswell, 2007; Tomlin, 1997). These data suggest that manipulating shifts of visual attention can influence word order choices, predicting both passive and active sentence constructions.

Tracking the continuous motor output of mouse-movement trajectories further explicates the dynamic competition between partially active syntactic representations in the visual world paradigm. Rather than moving actual 3-D objects in a visual scene, participants used a computer mouse to move objects around a computer screen in response to sentences like, “Put the apple on the towel in the box” (Farmer et al., 2007). As in earlier results, there was no evidence of a garden-path effect in the two-referent visual context.
However, the continuous trajectories provided further data about the competition process between partially active syntactic representations in the one-referent context. Across individual computer-mouse trajectories, the garden-path effect manifested itself as graded spatial attraction toward the competing goal destination. This smooth spatial attraction toward the competing goal suggests the visual and linguistic inputs were continuously and smoothly interacting in a competitive process, such that the garden-path parse was partially active and thus the incorrect destination location was garnering some degree of attention during visuomotor processing.

These continuous mouse-tracking results provide support for continuous parallel processing in cognition, in a similar way to how eye-tracking results provide such support (Magnuson, 2005). However, it has recently been suggested that the tell-tale curvatures in mouse-movement trajectories of this type can actually be explained, in principle, by a model in which cognitive processing is discrete and serial (postulating one symbolic representation at a time), and the motor output is produced by a continuous parallel processing system (van der Wel et al., 2009). In this model, two motor movements corresponding to a strategic upward movement and then a perceptual decision movement are asynchronously averaged together to produce a smoothly curved motor output (Henis & Flash, 1995). This distinction between perceptual processing and action planning provides an existence proof in which motor output may be continuous, but the underlying perceptual/cognitive decisions are serial. This model then creates problems for theories of embodied cognition that propose that perception and cognition are dynamically coupled with action. However, it seems unlikely that one neural system (cognition) would behave in one way (i.e., using discrete representations in sequence) and then feed into a second system (action) that behaves in a qualitatively different way (i.e., using continuous representations in parallel).

In their reply to van der Wel et al. (2009), Spivey and colleagues used the same equations that van der Wel et al. (2009) used for their model, adding a mechanism of dynamic competition between the multiple simultaneous cognitive representations that drive those motor commands (Spivey et al., 2010). As there is nothing uniquely serial about the equations used by Henis and Flash (1995), the results of Spivey et al.'s model provide evidence that the perceptual and motor decisions can both be made in a continuous parallel fashion. For example, cognitive representations for two response options initiate motor commands for both potential reach locations (Cisek & Kalaska, 2005), and the averaging weights for those two motor commands start out equal. This instigates motor output that is initially aimed at the midpoint between the two potential reach locations. As one cognitive representation receives increasing perceptual support, its weight ramps up, while the weight for the other cognitive representation ramps down. These changing weights are used to produce a dynamically averaged motor movement that smoothly curves in a manner identical to the human data. Hence, a dynamic and continuous cognitive/perceptual discrimination task flows smoothly into a similarly dynamic and continuous motor output.
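The flavor of this dynamic-averaging account can be conveyed with a short Python sketch. It is a caricature of the Spivey et al. (2010) simulation rather than a reimplementation (the logistic ramp, step size, and target coordinates are arbitrary choices made here for illustration): at every time step the effector moves toward a weighted average of the two potential goal locations, and as the weight shifts continuously from one goal to the other, the resulting path curves smoothly. The final lines also read off the maximum perpendicular deviation from a straight path, one of the trajectory measures mentioned earlier.

    import numpy as np

    def curved_reach(start, chosen_goal, competitor_goal, n_steps=120,
                     ramp_onset=40, ramp_rate=0.15, gain=0.05):
        """Dynamically averaged motor output: the momentary goal is a weighted
        blend of two reach targets, with the weight on the eventually chosen
        target starting near 0.5 and ramping up as perceptual support accumulates."""
        pos = np.array(start, dtype=float)
        a = np.array(chosen_goal, dtype=float)
        b = np.array(competitor_goal, dtype=float)
        path = [pos.copy()]
        for t in range(n_steps):
            # weight on the chosen goal: ~0.5 early on, ramping toward 1.0 after ramp_onset
            w = 0.5 + 0.5 / (1.0 + np.exp(-ramp_rate * (t - ramp_onset)))
            blended_goal = w * a + (1.0 - w) * b     # weighted average of the two commands
            pos = pos + gain * (blended_goal - pos)  # move a small fraction toward it
            path.append(pos.copy())
        return np.array(path)

    # Two response boxes in the upper corners of a notional screen; the left one "wins".
    path = curved_reach(start=(0.0, 0.0), chosen_goal=(-1.0, 1.0), competitor_goal=(1.0, 1.0))

    # Maximum deviation: largest perpendicular distance from the straight start-to-goal line.
    direction = np.array([-1.0, 1.0]) / np.linalg.norm([-1.0, 1.0])
    relative = path - np.array([0.0, 0.0])
    perp = relative[:, 0] * direction[1] - relative[:, 1] * direction[0]
    print("maximum deviation:", np.abs(perp).max())

Because the weights start out equal and shift gradually rather than switching at a single moment, the simulated movement heads toward the midpoint between the two boxes before bending into the chosen goal, which is the qualitative signature reported for the human mouse-movement data.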
This section has reviewed a number of ways in which language is impacted by vision, and suggests that looking beyond reaction time data in such tasks is crucial when addressing how these information sources interact in real-time. Monitoring eye- and reach-movements provides a window into how partially active lexical and syntactic representations are activated and compete over time. Similarly, these methods also allow researchers to investigate the way that objects in a visual scene can immediately guide language processing, stacking the competition process in favor of the interpretations and actions that the situation most readily affords.

5. Conclusion

How is it that the possible actions afforded by a visuolinguistic situation can influence the way that situation is perceived in the first place? We are accustomed to thinking of visual perception and language comprehension as taking place before action planning, but it is becoming clear that this is an oversimplification. As J. J. Gibson (1979) encouraged decades ago, we need to recognize that sensory input flows continuously into our perceptual subsystems, and therefore there is no real break in the stimulus flow during which perception reaches a point of culmination and then action planning could finally become operative to receive the output of that perceptual culmination. Rather, motor output itself is often continuously blending over time some weighted combination of multiple motor commands (Gold & Shadlen, 2000; Tipper, Houghton, & Howard, 2000), while multiple alternative action plans are being simultaneously prepared (Cisek & Kalaska, 2005), and feedback projections from these frontal and motor areas of the brain are continuously influencing the activation patterns in primary sensory areas of the brain (Kveraga et al., 2007). This is how the temporally-extended process of perceptual recognition (in both language and vision) co-evolves in concert with the temporally-extended process of action planning. And this is precisely why dense sampling of motor output is so richly informative about real-time cognitive processes in language and vision. The patterns of data in continuous motor output are tightly correlated with the patterns of data in continuous cognitive processing.

This temporal continuity in signal processing, non-modularity in information transmission, and the concomitant representational coextensiveness among language, vision, and action, have the logical consequence that the embodiment of cognition cannot help but be real (Barsalou, 2008; Spivey, 2007). Every example of cognitive processing – whether it is the recognition of a visual object (Tucker & Ellis, 2001), the comprehension of a spoken sentence (Anderson & Spivey, 2009), or the contemplation of an abstract concept (Richardson, Spivey, Barsalou, & McRae, 2003) – is unavoidably shaped and nuanced by the sensorimotor constraints of the organism coupled with its environment (Gibson, 1979; Turvey, 2007). The question remains, however: is the shape and nuance that a sensorimotor context bestows upon a cognitive representation fundamental to the functioning of that representation, or is it merely an epiphenomenal decoration loosely attached to that representation?
In stark contrast to fundamental embodiment claims (e.g., Barsalou, 2008; Gallese & Lakoff, 2005; Glenberg, 2007), Mahon and Caramazza (2008) have recently suggested that potentiation of sensorimotor properties occurs after access to a conceptual symbol has taken place, and therefore amounts to little more than spreading activation of associated properties. Mahon and Caramazza argue that the cognitive operations carried out by a sequence of accessed symbolic concepts happen exactly the same way they always would, with or without this spreading of activation to vestigial sensorimotor properties that subtly influence reaction times and eye movements. Mahon and Caramazza's proposal responsibly acknowledges the existence of a large and growing literature base of embodiment-related findings in vision, language, memory, and conceptual processing. To address that work, their model postulates a centralized module for conceptual processing that does the real work of cognition, and then activation spreads uni-directionally to sensory and motor systems. This account simultaneously preserves the autonomy of symbolic conceptual processing and accommodates the majority of the experimental findings in the embodied cognition literature.

However, there exists a small handful of embodiment-related findings that don't fit nicely into Mahon and Caramazza's (2008) proposal. Although many results in embodied cognition are describable as a conceptual entity becoming active and then spreading its activation in the direction of some associated sensory or motor processes, a few studies have shown this spreading of activation in the other direction. That is, by perturbing a sensory or motor process during a cognitive task, one can alter the way that cognitive process takes place. For example, when people hold a pencil sideways in their mouth (and thus activate some of the same muscles used for smiling) they give more positive affective judgments of humorous cartoons (Strack, Martin, & Stepper, 1988). Similarly, when people's hands are immobilized, their speech production exhibits more disfluencies (Rauscher, Krauss, & Chen, 1996). More recently, Pulvermüller, Hauk, Ilmoniemi, and Nikulin (2005) used mild transcranial magnetic stimulation (TMS) to induce neural activation in arm- and leg-related regions of primary motor cortex while participants carried out a lexical decision task. When mild TMS was applied to the leg region of primary motor cortex, participants responded faster to leg-related action verbs (e.g., kick) than to arm-related action verbs (e.g., throw), whereas when mild TMS was applied to the arm region of primary motor cortex, this pattern was reversed. Meteyard, Zokaei, Bahrami, and Vigliocco (2008) show a similar effect of visual motion perception influencing lexical decision. While viewing a subthreshold noisy motion stimulus that drifted upward or downward, participants exhibited slower reaction times to words that denoted directions of motion that were incongruent with the directionality of the irrelevant subthreshold motion stimulus. Thus, to the degree that the ubiquitous tried-and-tested lexical decision task taps a cognitive process, these two recent findings show that motor and sensory properties exert their influence in the other direction, changing the way a string of letters is evaluated as being a word or not a word.
Such data are difficult to accommodate within a more traditional view of cognition, which separates the lexical decision task into two stages: one corresponding to the cognitive/perceptual task and one corresponding to motor planning. Specifically, the data of Pulvermüller et al. (2005) demonstrate differential results depending on which part of primary motor cortex received TMS and on the semantic content of the word. If the lexical decision task consisted of two separate stages and TMS influenced only the motor stage, one would not expect differential effects for responses to arm-related and leg-related action words, because the motor stage would have no access to the semantics of the word being responded to. However, the data show exactly such differential effects, and hence are problematic for accounts like Mahon and Caramazza's (2008). It may be that, in order to protect the autonomy of their conceptual processing module, Mahon and Caramazza (2008) would choose to place the lexical decision task outside the domain of the kind of cognitive processing they find valuable, but such a move would relegate to the margins of science a massive body of literature that has previously been used to bolster the very foundations of their own classical cognitive science framework.

To settle this debate, it is clear that more empirical research is necessary on both sides. Additionally, the field needs computational simulations to better explicate the various theoretical claims and to make clear quantitative predictions for new experiments (e.g., Anderson, Huette, Matlock, & Spivey, 2010; Howell, Jankowicz, & Becker, 2005). Based on the evidence reviewed here, in language-mediated vision and vision-mediated language, we think that the balance is leaning away from classical cognitive science and toward dynamic, embodied accounts of cognition. Rather than cognition being a “box in the head” in which logical rules are applied to discrete symbols in a fashion unlike anywhere else in the brain, the cognitive processes of language, vision, attention, and memory may be best described as emergent properties that self-organize among the continuous neural interactions between sensory, motor, and association cortices of a brain that is embodied with sensors and effectors that are intimately coupled with the environment.

References

Abrams, R. A., & Balota, D. A. (1991). Mental chronometry: Beyond reaction time. Psychological Science, 2, 153–157.
Aks, D., Zelinsky, G., & Sprott, J. (2002). Memory across eye-movements: 1/f dynamic in visual search. Nonlinear Dynamics, Psychology, and Life Sciences, 6, 1–25.
Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition: Evidence for continuous mapping models. Journal of Memory and Language, 38, 419–439.
Altmann, G., & Steedman, M. (1988). Interaction with context during human sentence processing. Cognition, 30, 191–238.
Anderson, S. E., Huette, S., Matlock, T., & Spivey, M. J. (2010). On the temporal dynamics of negated perceptual simulations. In F. Parrill, M. Turner, & V. Tobin (Eds.), Meaning, form, and body (pp. 1–20). Stanford: CSLI Publications.
Anderson, S. E., & Spivey, M. J. (2009). The enactment of language: Decades of interactions between linguistic and motor processes. Language and Cognition, 1, 87–111.
Balota, D. A., & Abrams, R. A. (1995). Mental chronometry: Beyond onset latencies in the lexical decision task. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 1289–1302.
Barrett, H. C., & Kurzban, R. (2006). Modularity in cognition: Framing the debate. Psychological Review, 113, 628–647.
Barsalou, L. W. (2008). Grounded cognition. Annual Review of Psychology, 59, 617–645.
Bechtel, W. (2003). Modules, brain parts, and evolutionary psychology. In S. Scher, & F. Rauscher (Eds.), Evolutionary psychology: Alternative approaches (pp. 211–227). Dordrecht, Netherlands: Kluwer Academic Publishers.
Bohm, D. (1980). Wholeness and the implicate order. London: Routledge & Kegan Paul.
Calvert, G., Bullmore, E., Brammer, M., Campbell, R., Williams, S., McGuire, P., et al. (1997). Activation of auditory cortex during silent lipreading. Science, 276, 593–596.
Cavanagh, P. (1988). Pathways in early vision. In Z. Pylyshyn (Ed.), Computational processes in human vision: An interdisciplinary perspective (pp. 254–289). Norwood, NJ: Ablex Publishing.
Chambers, C., Tanenhaus, M., & Magnuson, J. (2004). Actions and affordances in syntactic ambiguity resolution. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30, 687–696.
Chemero, A. P. (2009). Radical embodied cognition. Cambridge, MA: MIT Press.
Chiu, E. M., & Spivey, M. J. (in preparation). Timing of speech and display affects the linguistic mediation of visual search.
Cisek, P., & Kalaska, J. F. (2005). Neural correlates of reaching decisions in dorsal premotor cortex: Specification of multiple direction choices and final selection of action. Neuron, 45, 801–814.
Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6, 84–107.
Dahan, D. (2010). The time course of interpretation in speech comprehension. Current Directions in Psychological Science, 19, 121–126.
Dale, R., Kehoe, C., & Spivey, M. J. (2007). Graded motor responses in the time course of categorizing atypical exemplars. Memory and Cognition, 35, 15–28.
Desimone, R. (1998). Visual attention mediated by biased competition in extrastriate visual cortex. Philosophical Transactions of the Royal Society B: Biological Sciences, 353, 1245.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193–222.
Dosher, B. (1976). Retrieval of sentences from memory: A speed accuracy tradeoff study. Cognitive Psychology, 8, 291–310.
Dosher, B., Han, S., & Lu, Z. (2004). Parallel processing in visual search asymmetry. Journal of Experimental Psychology: Human Perception and Performance, 30, 3–27.
Driver, J., & Spence, C. (1998). Crossmodal attention. Current Opinion in Neurobiology, 8, 245–253.
Duncan, J., & Humphreys, G. (1989). Visual search and stimulus similarity. Psychological Review, 96, 433–458.
Eckstein, M. (1998). The lower visual search efficiency for conjunctions is due to noise and not serial attentional processing. Psychological Science, 9, 111–118.
Egeth, H., Virzi, R., & Garbart, H. (1984). Searching for conjunctively defined targets. Journal of Experimental Psychology: Human Perception and Performance, 10, 32–39.
Emberson, L., Weiss, R., Barbosa, A., Vatikiotis-Bateson, E., & Spivey, M. (2008). Crossed hands curve saccades: Multisensory dynamics in saccade trajectories. Proceedings of the 29th Annual Conference of the Cognitive Science Society.
Farah, M. (1994). Neuropsychological inference with an interactive brain: A critique of the “locality” assumption. Behavioral and Brain Sciences, 17, 43–104.
Farmer, T. A., Anderson, S. E., & Spivey, M. J. (2007). Gradiency and visual context in syntactic garden-paths. Journal of Memory and Language, 57, 570–595.
Ferreira, F., Bailey, K., & Ferraro, V. (2002). Good-enough representations in language comprehension. Current Directions in Psychological Science, 11, 11–15.
Finkbeiner, M., & Caramazza, A. (2008). Modulating the masked congruence priming effect with the hands and the mouth. Journal of Experimental Psychology: Human Perception and Performance, 34, 894–918.
Fodor, J. (1983). The modularity of mind: An essay on faculty psychology. Cambridge, MA: MIT Press.
Frazier, L., & Rayner, K. (1982). Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14, 178–210.
Freeman, J. B., Ambady, N., Rule, N. O., & Johnson, K. L. (2008). Will a category cue attract you? Motor output reveals dynamic competition across person construal. Journal of Experimental Psychology: General, 137, 673–690.
Gallant, J., Connor, C., & Van Essen, D. (1998). Neural activity in areas V1, V2 and V4 during free viewing of natural scenes compared to controlled viewing. Neuroreport, 9, 85–89.
Gallese, V., & Lakoff, G. (2005). The brain's concepts: The role of the sensory-motor system in reason and language. Cognitive Neuropsychology, 22, 455–479.
Gibson, B. S., Eberhard, K. M., & Bryant, T. A. (2005). Linguistically mediated visual search: The critical role of speech rate. Psychonomic Bulletin & Review, 12, 276.
Gibson, J. (1979). The ecological approach to visual perception. Boston: Houghton Mifflin.
Giray, M., & Ulrich, R. (1993). Motor coactivation revealed by response force in divided and focused attention. Journal of Experimental Psychology: Human Perception and Performance, 19, 1278–1291.
Gleitman, L. R., January, D., Nappa, R., & Trueswell, J. C. (2007). On the give and take between event apprehension and utterance formulation. Journal of Memory and Language, 57, 544–569.
Glenberg, A. M. (2007). Language and action: Creating sensible combinations of ideas. In G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 361–370). Oxford, UK: Oxford University Press.
Gold, J. I., & Shadlen, M. N. (2000). Representation of a perceptual decision in developing oculomotor commands. Nature, 404, 390–394.
Goldstone, R. L., Lippa, Y., & Shiffrin, R. M. (2001). Altering object representations through category learning. Cognition, 78, 27–43.
Griffin, Z., & Bock, K. (2000). What the eyes say about speaking. Psychological Science, 11, 274–279.
Grosof, D., Shapley, R., & Hawken, M. (1993). Macaque V1 neurons can signal “illusory” contours. Nature, 365, 550–552.
Henis, E. A., & Flash, T. (1995). Mechanisms underlying the generation of averaged modified trajectories. Biological Cybernetics, 72, 407–419.
Hommel, B. (2004). Event files: Feature binding in and across perception and action. Trends in Cognitive Sciences, 8, 494–500.
Howell, S. R., Jankowicz, D., & Becker, S. (2005). A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning. Journal of Memory and Language, 53, 258–276.
Huette, S., & McMurray, B. (2010). Continuous dynamics of color categorization. Psychonomic Bulletin & Review, 17, 348–354.
Huettig, F., & Altmann, G. T. M. (2005). Word meaning and the control of eye fixation: Semantic competitor effects and the visual world paradigm. Cognition, 96, 23–32.
Huettig, F., & Altmann, G. T. M. (2007). Visual-shape competition and the control of eye fixation during the processing of unambiguous and ambiguous words. Visual Cognition, 15, 985–1018.
Itti, L., & Koch, C. (2001). Computational modeling of visual attention. Nature Reviews Neuroscience, 2, 194–203.
Kaptein, N. A., Theeuwes, J., & van der Heijden, A. H. C. (1995). Search for a conjunctively defined target can be selectively limited to a color-defined subset of elements. Journal of Experimental Psychology: Human Perception and Performance, 21, 1053–1069.
Kello, C. T., Beltz, B. C., Holden, J. G., & Van Orden, G. C. (2007). The emergent coordination of cognitive function. Journal of Experimental Psychology: General, 136, 551–568.
Kveraga, K., Ghuman, A. S., & Bar, M. (2007). Top-down predictions in the cognitive brain. Brain and Cognition, 65, 145–168.
Lalanne, C., & Lorenceau, J. (2004). Crossmodal integration for perception and action. Journal of Physiology-Paris, 98, 265–279.
Lee, T. S., & Nguyen, M. (2001). Dynamics of subjective contour formation in the early visual cortex. Proceedings of the National Academy of Sciences, 98, 1907–1911.
Liberman, A., Harris, K., Kinney, J., & Lane, H. (1961). The discrimination of relative onset-time of the components of certain speech and nonspeech patterns. Journal of Experimental Psychology, 61, 379–388.
Lupyan, G. (2008). The conceptual grouping effect: Categories matter (and named categories matter more). Cognition, 108, 566–577.
Lupyan, G., & Spivey, M. J. (2008). Perceptual processing is facilitated by ascribing meaning to novel stimuli. Current Biology, 18, 410–412.
Lupyan, G., & Spivey, M. J. (2010). Making the invisible visible: Verbal but not visual cues enhance visual detection. PLoS ONE, 5(7), e11452.
Macaluso, E., Frith, C., & Driver, J. (2000). Modulation of human visual cortex by crossmodal spatial attention. Science, 289, 1206–1208.
Macknik, S. L., Fisher, B. D., & Bridgeman, B. (1991). Flicker distorts visual space constancy. Vision Research, 31, 2057–2064.
Magnuson, J. S. (2005). Moving hand reveals dynamics of thought. Proceedings of the National Academy of Sciences, 102, 9995–9996.
Mahon, B. Z., & Caramazza, A. (2008). A critical look at the embodied cognition hypothesis and a new proposal for grounding conceptual content. Journal of Physiology-Paris, 102, 59–70.
Marslen-Wilson, W. (1987). Functional parallelism in spoken word recognition. Cognition, 25, 71–102.
Massaro, D. (1999). Speechreading: Illusion or window into pattern recognition. Trends in Cognitive Sciences, 3, 310–317.
McClelland, J., & Elman, J. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.
McElree, B., & Carrasco, M. (1999). The temporal dynamics of visual search: Evidence for parallel processing in feature and conjunction searches. Journal of Experimental Psychology: Human Perception and Performance, 25, 1517–1539.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
McKinstry, C., Dale, R., & Spivey, M. J. (2008). Action dynamics reveal parallel competition in decision making. Psychological Science, 19, 22–24.
McLeod, P., Driver, J., & Crisp, J. (1988). Visual search for a conjunction of movement and form is parallel. Nature, 332, 154–155.
McMurray, B., Tanenhaus, M., & Aslin, R. (2002). Gradient effects of within-category phonetic variation on lexical access. Cognition, 86, B33–B42.
McMurray, B., Tanenhaus, M., Aslin, R., & Spivey, M. (2003). Probabilistic constraint satisfaction at the lexical/phonetic interface: Evidence for gradient effects of within-category VOT on lexical access. Journal of Psycholinguistic Research, 32, 77–97.
McMurray, B., & Spivey, M. (1999). The categorical perception of consonants: The interaction of learning and processing. Proceedings of the Chicago Linguistic Society Panels, 35-2 (pp. 205–221).
Meteyard, L., Zokaei, N., Bahrami, B., & Vigliocco, G. (2008). Now you see it: Visual motion interferes with lexical decision on motion words. Current Biology, 18, R732–R733.
Miller, J. (1982). Discrete versus continuous stage models of human information processing: In search of partial output. Journal of Experimental Psychology: Human Perception and Performance, 8, 273–296.
Mounts, J., & Tomaselli, R. (2005). Competition for representation is mediated by relative attentional salience. Acta Psychologica, 118, 261–275.
Mumford, D. (1992). On the computational architecture of the neocortex II. The role of cortico-cortical loops. Biological Cybernetics, 66, 241–251.
Nakayama, K., & Silverman, G. (1986). Serial and parallel processing of visual feature conjunctions. Nature, 320, 264–265.
Nazir, T. A., Boulenger, V., Jeannerod, M., Paulignan, Y., Roy, A., & Silber, B. (2007). Language-induced motor perturbations during the execution of a reaching movement. Journal of Cognitive Neuroscience, 18, 1607–1615.
Neisser, U. (1976). Cognition and reality: Principles and implications of cognitive psychology. San Francisco, California: W. H. Freeman.
Olds, E. S., Cowan, W. B., & Jolicoeur, P. (2000a). Partial orientation pop-out helps difficult search for orientation. Perception & Psychophysics, 62, 1341–1347.
Olds, E. S., Cowan, W. B., & Jolicoeur, P. (2000b). The time-course of pop-out search. Vision Research, 40, 891–912.
Olds, E. S., Cowan, W. B., & Jolicoeur, P. (2000c). Tracking visual search over space and time. Psychonomic Bulletin & Review, 7, 292–300.
Palmer, J., Verghese, P., & Pavel, M. (2000). The psychophysics of visual search. Vision Research, 40, 1227–1268.
Pisoni, D., & Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Perception & Psychophysics, 15, 285–290.
Pulvermüller, F. (2005). Brain mechanisms linking language and action. Nature Reviews Neuroscience, 6, 576–582.
Pulvermüller, F., Hauk, O., Ilmoniemi, R., & Nikulin, V. (2005). Functional links between motor and language systems. European Journal of Neuroscience, 21, 793–797.
Ratcliff, R. (1985). Theoretical interpretations of the speed and accuracy of positive and negative responses. Psychological Review, 92, 212–225.
Rauscher, F. B., Krauss, R. M., & Chen, Y. (1996). Gesture, speech and lexical access: The role of lexical movements in speech production. Psychological Science, 7, 226–231.
Reali, F., Spivey, M. J., Tyler, M. J., & Terranova, J. (2006). Inefficient conjunction search made efficient by concurrent spoken delivery of target identity. Perception & Psychophysics, 68, 959.
Reynolds, J., & Desimone, R. (2001). Neural mechanisms of attentional selection. In J. Braun, C. Koch, & J. Davis (Eds.), Visual attention and cortical circuits (pp. 121–135). Cambridge, MA: MIT Press.
Richardson, D., Spivey, M., Barsalou, L., & McRae, K. (2003). Spatial representations activated during real-time comprehension of verbs. Cognitive Science, 27, 767–780.
Rolls, E., & Tovee, M. (1995). Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex. Journal of Neurophysiology, 73, 713–726.
Sekuler, R., Sekuler, A., & Lau, R. (1997). Sound alters visual motion perception. Nature, 385, 308.
Shams, L., Kamitani, Y., & Shimojo, S. (2000). What you see is what you hear. Nature, 408, 788.
Siegler, R. S., & Lin, X. (2009). Self-explanations promote children's learning. In H. S. Waters, & W. Schneider (Eds.), Metacognition, strategy use, and instruction (pp. 85–113). New York: Guilford Publications.
Snedeker, J., & Trueswell, J. (2004). The developing constraints on parsing decisions: The role of lexical-biases and referential scenes in child and adult sentence processing. Cognitive Psychology, 49, 238–299.
Spence, C., & Driver, J. (Eds.). (2004). Crossmodal space and crossmodal attention. Oxford: Oxford University Press.
Spence, C., Nicholls, M. E. R., Gillespie, N., & Driver, J. (1998). Cross-modal links in exogenous covert spatial orienting between touch, audition, and vision. Perception & Psychophysics, 60, 544–557.
Spivey, M. J. (2007). The continuity of mind. New York: Oxford University Press.
Spivey, M. J., & Dale, R. (2004). On the continuity of mind: Toward a dynamical account of cognition. In B. Ross (Ed.), The psychology of learning & motivation, Vol. 45 (pp. 87–142). San Diego: Elsevier.
Spivey, M. J., Dale, R., Knoblich, G., & Grosjean, M. (2010). Do curved reaching movements emerge from competing perceptions? Journal of Experimental Psychology: Human Perception and Performance, 36, 251–254.
Spivey, M. J., Grosjean, M., & Knoblich, G. (2005). Continuous attraction toward phonological competitors. Proceedings of the National Academy of Sciences, 102, 10393–10398.
Spivey, M. J., Tanenhaus, M., Eberhard, K., & Sedivy, J. (2002). Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology, 45, 447–481.
Spivey, M. J., Tyler, M. J., Eberhard, K. M., & Tanenhaus, M. K. (2001). Linguistically mediated visual search. Psychological Science, 12, 282–286.
Spivey-Knowlton, M. (1996). Integration of visual and linguistic information: Human data and model simulations. Doctoral dissertation, University of Rochester, Rochester, NY.
Spivey-Knowlton, M. J., Tanenhaus, M., Eberhard, K., & Sedivy, J. (1995). Eye movements accompanying language and action in a visual context: Evidence against modularity. Proceedings of the 17th Annual Conference of the Cognitive Science Society (pp. 25–30). Hillsdale, NJ: Erlbaum.
Spivey-Knowlton, M., Trueswell, J., & Tanenhaus, M. (1993). Context effects in syntactic ambiguity resolution. Canadian Journal of Experimental Psychology, 47, 276–309.
Stephen, D. G., & Mirman, D. (2010). Interactions dominate the dynamics of visual cognition. Cognition, 115, 154–165.
Strack, F., Martin, L., & Stepper, S. (1988). Inhibiting and facilitating conditions of the human smile: A nonobtrusive test of the facial feedback hypothesis. Journal of Personality and Social Psychology, 54, 768–777.
Tanenhaus, M., Spivey-Knowlton, M., Eberhard, K., & Sedivy, J. (1995). Integration of visual and linguistic information during spoken language comprehension. Science, 268, 1632–1634.
Theeuwes, J., & Kooi, F. (1994). Parallel search for a conjunction of contrast polarity and shape. Vision Research, 34, 3013–3016.
Theeuwes, J., Olivers, C. N. L., & Chizk, C. L. (2005). Remembering a location makes the eyes curve away. Psychological Science, 16, 196–199.
Tipper, S., Houghton, G., & Howard, L. (2000). Behavioural consequences of selection from neural population codes. In S. Monsell, & J. Driver (Eds.), Control of cognitive processes: Attention and performance XVIII (pp. 233–245). Cambridge, MA: MIT Press.
Tipper, S., Howard, L., & Jackson, S. (1997). Selective reaching to grasp: Evidence for distractor interference effects. Visual Cognition, 4, 1–38.
Todd, J. W. (1912). Reaction to multiple stimuli. Archives of Psychology, 25, 1–65.
Tomlin, R. (1997). Mapping conceptual representations into linguistic representations: The role of attention in grammar. In J. Nuyts, & E. Pederson (Eds.), Language and conceptualization (pp. 162–189). Cambridge: Cambridge University Press.
Treisman, A. M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12, 97–136.
Trueswell, J. C., Sekerina, I., Hill, N. M., & Logrip, M. L. (1999). The kindergarten-path effect: Studying on-line sentence processing in young children. Cognition, 73, 89–134.
Tucker, M., & Ellis, R. (2001). The potentiation of grasp types during visual object categorization. Visual Cognition, 8, 769–800.
Turvey, M. T. (2007). Action and perception at the level of synergies. Human Movement Science, 26, 657–697.
van der Wel, R. P. R. D., Eder, J., Mitchel, A., Walsh, M., & Rosenbaum, D. (2009). Trajectories emerging from discrete versus continuous processing models in phonological competitor tasks. Journal of Experimental Psychology: Human Perception and Performance, 32, 588–594.
Van Orden, G., Holden, J., & Turvey, M. (2003). Self-organization of cognitive performance. Journal of Experimental Psychology: General, 132, 331–350.
Wojnowicz, M., Ferguson, M., Dale, R., & Spivey, M. J. (2009). The self-organization of explicit attitudes. Psychological Science, 20, 1428–1435.
Wolfe, J. M. (1994). Guided Search 2.0: A revised model of visual search. Psychonomic Bulletin & Review, 1, 202–238.
Wolfe, J. M. (1998). What can 1 million trials tell us about visual search? Psychological Science, 9, 33–39.
Yee, E., & Sedivy, J. (2006). Eye movements reveal transient semantic activation during spoken word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 1–14.
Zwitserlood, P. (1989). The locus of the effects of sentential-semantic context in spoken-word processing. Cognition, 32, 25–64.