MULTIMODAL INPUT IN SIMULTANEOUS INTERPRETING: AN EYE-TRACKING EXPERIMENT

Kilian G. Seeber, Geneva (Switzerland)

1. Introduction

One corollary of the rapid and pervasive spread of information and communication technologies is a world in which, as individuals, we are rarely exposed to isolated stimuli. Instead, we are usually confronted with, and simultaneously tend to, sensory information on different channels (Lewandowski / Kobus 1993). The same might hold true for professional conference interpreters, who face a wide range of scenarios involving different input channels. At a time when content is delivered using an array of multimedia devices, speakers enrich their discourse with multimedia slide presentations, the effects of which on the simultaneous interpreting process have not yet been studied comprehensively (Moser-Mercer 2005). AIIC's Technical and Health Committee postulates that interpreters need to see the speaker's movements, expressions and gestures, any kind of projection (slides, overhead, video), and what is going on in the room, especially on the rostrum and at the speaker's podium (AIIC). These recommendations are backed up by many practitioners, who intuitively perceive the need for visual input to interpret well (Rennert 2008), but also by research into remote interpreting suggesting a perceived earlier onset of fatigue, a perceived increase in stress, as well as feelings of alienation and difficulty concentrating (e.g., Kurz 1999; Moser-Mercer 2003; European Parliament 2005).

The bulk of this research has focused on the effects of a very complex task on the actor (the interpreter) and the product (the interpreter's output), while little progress has been made in understanding its effect on the process (the interpreting task). Without the latter, however, it is impossible to comprehend the reasons for the impact on the former and to change the parameters necessary to mitigate or eliminate some of its negative effects. Among the questions that have yet to be addressed, for example, is what sources of input interpreters tend to while performing the simultaneous interpreting task. Although it stands to reason that different interpreters might attempt to obtain extra-linguistic information from different visual sources at different times (Moser-Mercer 2002), it is equally conceivable that in an experimental set-up it might be possible to discern patterns across participants. While the eye-tracking paradigm might hold the potential to generate the data necessary to inform the debate about remote interpreting, this project's objective is more modest: it attempts to answer the arguably simpler question of what it is that simultaneous interpreters look at. If the methodology turns out to be particularly suitable for the exploration of these phenomena, attempts should be made to apply it more systematically.

2. The interaction between auditory and visual input

Already halfway through the last century, Broadbent (1956) found evidence suggesting an interaction between auditory and visual stimuli. When presented at the same time, bisensory stimuli were found to enhance memory recall. The body of literature suggesting facilitation effects attributable to the bisensory presentation of auditory-verbal stimuli had increased substantially by the early seventies (Halpern / Lantz 1974). Evidence was collected suggesting that a mixed mode of presentation can increase the amount of information processed by working memory (Penney 1989).
Furthermore, a redundant signal effect, i.e., faster reaction times when participants respond to simultaneously presented, redundant signals rather than to a single signal, was replicated for bimodal divided-attention tasks (Lewandowski / Kobus 1993). One explanation for improved performance during dual-modality presentation of stimuli is the conjectured existence of separate working memory processors for auditory and visual information (Mousavi et al. 1995), a theoretical notion developed by Wickens (1984), reflected in his multiple resource model and conflict matrix, and applied to different language processing tasks by Seeber (2007). These observations, along with the notion that the listener's understanding of a message is influenced by nonverbal components, even though they might not be perceived or decoded consciously, will constitute the theoretical foundation for the present experiment.

3. Multimodal input in simultaneous interpreting

Professional conference interpreters are regularly confronted with multimodal input, be it because speakers use facial expressions and gestures while they are speaking, or because they resort to visual aids like slides with text and images to complement or emphasize what they are saying. Although we do not yet know how multiple sources of information are treated by the interpreting brain, which of them are integrated and how that integration process operates, there is considerable evidence suggesting that language processing is influenced by multiple sources of information (Jesse et al. 2000/01). The issue of multimodal input, therefore, is not limited to the realm of remote interpreting; rather, it applies to most ordinary conference interpreting scenarios. The assumption that simultaneous conference interpreters' gaze is closely correlated with the visual information needed to help with the processing of the meaning the interpreter is constructing (Moser-Mercer 2002) is a logical, albeit hitherto unsubstantiated, extrapolation of early findings (Cooper 1974) supporting the notion that listeners move their eyes to the visual elements most closely related to the unfolding discourse.

If paralanguage and kinesics provide information associated with the verbally coded message by repeating, supporting or contradicting that message (Poyatos 1987), then it is conceivable that conference interpreters will visually tend to those stimuli. In the simplest scenario, when the visual channel is used to repeat what is being expressed on the auditory channel, the interpreter could benefit from the aforementioned redundancy effect to process the bimodally presented information. This scenario was chosen for the present experiment. More specifically, we will take a closer look at situations in which verbally expressed numbers are accompanied by non-verbally coded message components¹. Numbers, which constitute a well-documented source of difficulty in interpreting (Alessandrini 1990), can be expressed using different modalities: auditory-verbal (numbers expressed in spoken discourse), visual-spatial (numbers expressed as hand gestures), and visual-verbal (numbers expressed as names or numerals). An eye-tracking experiment was designed to explore the extent to which simultaneous interpreters tend to the visual (-verbal or -spatial) channel when numbers are presented on the auditory-verbal channel.

¹ See Poyatos (1987) for a comprehensive overview.
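To make this three-way distinction concrete, the number events in a stimulus of this kind could be annotated along the lines of the following sketch. The scheme is purely illustrative: the class and field names are assumptions introduced for this example, not the coding actually used in the study, and the sample values are invented.

    from dataclasses import dataclass, field
    from enum import Enum, auto

    class Modality(Enum):
        """Channels on which a number can be presented."""
        AUDITORY_VERBAL = auto()  # number spoken in the discourse
        VISUAL_SPATIAL = auto()   # number gestured by the speaker's hands
        VISUAL_VERBAL = auto()    # number written as a name or numeral (e.g., on a slide)

    @dataclass
    class NumberEvent:
        """One verbally expressed number and any co-occurring visual coding."""
        carrier_sentence: str              # e.g., "The organization has X members"
        value: int                         # the number itself
        spoken_onset_s: float              # onset of the spoken number in the sound track
        extra_channels: set = field(default_factory=set)  # visual channels repeating the number

    # Invented examples: a small number accompanied by a gesture, and a large
    # number that is simultaneously visible as a numeral on the slide.
    small_number = NumberEvent("The organization has X members", 7, 95.2,
                               {Modality.VISUAL_SPATIAL})
    large_number = NumberEvent("The organization has X members", 7000, 132.8,
                               {Modality.VISUAL_VERBAL})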
4. The experiment

The purpose of this experiment was to determine the extent to which professional simultaneous conference interpreters tend to visually-spatially and visually-numerally presented numbers when interpreting discourse containing numbers. Eye movements were recorded while interpreters simultaneously interpreted a video presentation containing numbers which were either gestured by the speaker or shown on a screen next to the speaker.

5. Method

Participants

The experimental group consisted of 10 professional conference interpreters (all female, mean age 44, σ 7) with their professional domicile in Geneva. They all had a minimum of five years of professional experience (mean 16 years, σ 6), worked regularly as conference interpreters (mean 160 days worked in the previous year, σ 103), and all had English as one of their passive languages. They interpreted simultaneously into their respective mother tongues (2 Arabic, 4 German and 4 Spanish).

Materials

The materials consisted of a 6'20'' split-screen video recording (16:9 format, showing the speaker on the left and the slide presentation on the right half of the screen) about the International Labour Organization. A 20-second introduction was followed by four 1'30'' segments. The discourse for each segment contained three small numbers (between 1 and 10) and three large numbers (between 20 and 7,000), all embedded in sentences of the type "The organization has X members" (mean duration 3.45 sec, σ .66). Small numbers were gestured by the speaker as they were spoken; the stroke, i.e., the part of the gesture encoding the message, temporally coincided with the verbal production of the number. Large numbers were shown on the slides on the right side of the screen; the three large numbers for each segment were visible from its onset. Small and large numbers appeared in a fixed random order throughout the discourse.

Figure 1: Still of split-screen video materials with areas of interest

Apparatus

Eye-gaze patterns were recorded using a Tobii 120 Hz remote eye-tracker. The sound was delivered over Bosch LBB 3443 headphones.

Procedure

The experiment was carried out at LaborInt, the FTI's research laboratory. Participants were seated at a desk at a distance of approximately 60 cm from the eye-tracker. The sound was delivered over headphones; volume and treble controls were available. After a 9-point calibration, the video sequence was shown and simultaneously interpreted. Three areas of interest (AOIs) were identified on the screen: the speaker's head (for facial expressions), the speaker's torso (for gestures) and the numerals on the slides. Participation was voluntary, and all participants took part in a prize draw of CHF 250.
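As a rough illustration of how the recorded fixations could, in principle, be reduced to per-participant gaze durations for each AOI and compared across the two number conditions with paired t-tests (as reported in the next section), consider the following sketch. It is a minimal example under assumed data formats: the file name, column names and AOI labels are illustrative and do not correspond to the actual Tobii export or to the analysis pipeline used in the study.

    import pandas as pd
    from scipy import stats

    # Hypothetical long-format export: one row per fixation, with the participant,
    # the AOI hit (e.g., "face", "hands", "slide_numbers"), the fixation duration
    # in seconds, and the number condition ("small" or "large") of the sentence
    # during which the fixation occurred.
    fixations = pd.read_csv("fixations_by_sentence.csv")

    # Total gaze duration per participant, AOI and condition.
    gaze = (fixations
            .groupby(["participant", "aoi", "condition"])["duration_s"]
            .sum()
            .unstack("condition")   # columns: "small", "large"
            .fillna(0.0))

    # Paired t-test (small- vs. large-number sentences) for each AOI.
    for aoi, per_aoi in gaze.groupby(level="aoi"):
        t, p = stats.ttest_rel(per_aoi["small"], per_aoi["large"])
        print(f"{aoi}: t({len(per_aoi) - 1}) = {t:.3f}, p = {p:.3f}")

Summing fixation durations within each AOI in this way corresponds to the overall gaze duration measure used in the analysis that follows.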
6. Results and discussion

Overall gaze duration was used to measure processing of the AOIs (see Rayner 1998). Paired t-tests were carried out to compare the means between the two conditions. On average, participants looked at the speaker's face significantly longer during sentences containing small numbers (M = 27.47, SE = 4.82) than during sentences containing large numbers (M = 21.06, SE = 4.01), t(9) = -3.587, p < .05. Furthermore, they looked at the numbers shown on the slides significantly longer during sentences containing large numbers (M = 9.59, SE = 2.37) than during sentences containing small numbers (M = 4.93, SE = 1.78), t(9) = 3.697, p < .05. Finally, no significant difference was found between the time participants looked at the speaker's hands during sentences containing small numbers (M = 2.10, SE = 1.22) and sentences containing large numbers (M = 1.72, SE = 1.68), t(9) = -.635, p = .54.

Figure 2: Gaze time for the three areas of interest (error bars: SE)

These results support the notion that interpreters tend to the speaker's face, be it as a means to receive cues facilitating verbal processing (see Jesse et al. 2000/01) or as an early behavioral mechanism (Perlman et al. 2009). The eye-gaze patterns directed at the information on the slides suggest that interpreters scan the slide for additional information to complement the auditory input during the presentation of both small and large numbers. This finding corroborates the notion that simultaneous interpreters actively search for information on the visual-spatial channel to complement the information on the auditory-verbal channel. When complementary information is available on the visual-verbal channel, interpreters spend twice as long gazing at it (9.53 sec) as when it is not available (4.93 sec). Surprisingly, interpreters tend to the visual-spatial channel, i.e., the speaker's hands, for equally long during the verbal presentation of large and small numbers. This result might be explained by the fact that the very motion of the gesture, rather than the code it communicates, attracts the interpreter's attention (Rayner 1998), as it is impractical if not impossible to gesture numbers up to 7,000 on one's hands. It is also possible, however, that eye movements towards the location of gestures are triggered by the beginning of the number name expressed on the auditory-verbal channel (e.g., "seven thousand"), which could have been expressed as a gesture. While not within the scope of this experiment, the difference in gaze duration on gestures and slides raises the question of inherent task difficulty. It is well documented that the processing of large numbers increases cognitive load more than the processing of small numbers (Alessandrini 1990). This factor needs to be considered during replication and deserves further investigation.

7. Conclusion

The importance of visual input during simultaneous conference interpreting has been stressed by practitioners and echoed by researchers. While many of the latter have attempted to discern the effect of multimodal processing, to date no attempt has been made to measure what it is interpreters actually look at while they perform the simultaneous interpreting task. Anderson (1994) concedes that in her experiment participants did not always look at the screen containing the visual information. Similarly, Kurz (1999) admits that it is impossible to know to what extent participants in her experiment actually made use of the visually presented information. The purpose of this experiment was to measure what interpreters look at during the task. Eye-gaze patterns suggest that under experimental conditions simultaneous interpreters tend to visually presented complementary information, but that there might be modality-related differences. Using the appropriate design, eye-tracking technology might hold the potential to answer the questions of why interpreters want visual input, what kind of visual input they use, and when.
References

AIIC (n.d.): What about monitors in SI booths? Online: http://www.aiic.net/ViewPage.cfm/article90 (02.07.2011)
Alessandrini, M.S. (1990): Translating numbers in consecutive interpretation: An empirical experimental study. In: The Interpreters' Newsletter 3, 77-80
Anderson, L. (1994): Simultaneous interpretation: Contextual and translation aspects. In: Lambert, S. / Moser-Mercer, B. (eds.): Bridging the Gap: Empirical research in simultaneous interpretation. Amsterdam, 101-120
Broadbent, D.E. (1956): Successive responses to simultaneous stimuli. In: Quarterly Journal of Experimental Psychology 8, 145-152
Cooper, R.M. (1974): The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. In: Cognitive Psychology 6/1, 84-107
European Parliament (2005): Study concerning the constraints arising from remote interpreting. Report of the 3rd remote interpretation test. Online: www.euractiv.com/31/images/EPremoteinterpretingreportexecutive_summery_tcm31151942.pdf (02.07.2011)
Halpern, J. / Lantz, A.E. (1974): Learning to utilize information presented over two sensory channels. In: Perception and Psychophysics 16/2, 321-328
Jesse, A. / Vrignaud, N. / Massaro, D.W. (2000/01): The processing of information from multiple sources in simultaneous interpreting. In: Interpreting 5, 95-115
Kurz, I. (1999): Tagungsort Genf / Nairobi / Wien: Zu einigen Aspekten des Teledolmetschens. In: Kadric, M. / Kaindl, K. / Pöchhacker, F. (eds.): Festschrift für Mary Snell-Hornby zum 60. Geburtstag. Tübingen, 291-302
Lewandowski, L.J. / Kobus, D.A. (1993): The effects of redundancy in bimodal word processing. In: Human Performance 6/3, 229-239
Moser-Mercer, B. (2002): Situation models: The cognitive relation between interpreter, speaker and audience. In: Israel, F. (ed.): Identité, altérité, équivalence? La traduction comme relation. Actes du Colloque international tenu à l'ESIT les 24, 25 et 26 mai 2000 en hommage à Marianne Lederer. Paris: Lettres Modernes Minard, 163-187
___ (2003): Remote interpreting: Assessment of human factors and performance parameters. Online: http://aiic.net/ViewPage.cfm/page1125.htm (02.07.2011)
___ (2005): Remote interpreting: Issues of multi-sensory integration in a multilingual task. In: Meta 50/2, 727-738
Mousavi, S.Y. / Low, R. / Sweller, J. (1995): Reducing cognitive load by mixing auditory and visual presentation modes. In: Journal of Educational Psychology 87/2, 319-334
Penney, C. (1989): Modality effects and the structure of short-term verbal memory. In: Memory and Cognition 17, 398-422
Perlman, S.B. / Morris, J.P. / Vander Wyk, B.C. / Green, S.R. / Doyle, J. / Pelphrey, K.A. (2009): Individual differences in personality predict how people look at faces. In: PLoS ONE 4: e5952. doi:10.1371/journal.pone.0005952
Poyatos, F. (1987): Nonverbal communication in simultaneous and consecutive interpretation: A theoretical model and new perspectives. In: Textcontext 2-2/3, 73-108
Rayner, K. (1998): Eye movements in reading and information processing: 20 years of research. In: Psychological Bulletin 124/3, 372-422
Rennert, S. (2008): Visual input in simultaneous interpreting. In: Meta 53/1, 204-217
Seeber, K.G. (2007): Thinking outside the cube: Modeling language processing tasks in a multiple resource paradigm. In: Proceedings of Interspeech 2007, Antwerp, 1382-1385
Wickens, C.D. (1984): Processing resources in attention. In: Parasuraman, R. / Davies, D.R. (eds.): Varieties of attention. New York, 63-102