MULTIMODAL INPUT IN SIMULTANEOUS INTERPRETING: AN EYE-TRACKING EXPERIMENT

Kilian G. Seeber, Geneva (Switzerland)

1. Introduction

One corollary of the rapid and pervasive spread of information and communication technologies is a world in which, as individuals, we are rarely exposed to
isolated stimuli. Instead, we are usually confronted with and simultaneously tend
to sensory information on different channels (Lewandowski / Kobus 1993). The
same might hold true for professional conference interpreters, who face a wide
range of scenarios involving different input channels. At a time when content is
delivered using an array of multimedia devices, speakers enrich their discourse
with multimedia slide presentations, the effects of which (on the simultaneous
interpreting process) have not yet been studied comprehensively (Moser-Mercer
2005). AIIC’s Technical and Health Committee postulates that interpreters need
to see the speaker’s movements, expressions, gestures, any kind of projection
(slides, overhead, video), and what’s going on in the room, especially on the
rostrum and at the speaker’s podium (AIIC). These recommendations are backed
up by many practitioners, who intuitively perceive the need for visual input to
interpret well (Rennert 2008), but also by research into remote interpreting suggesting an earlier perceived onset of fatigue and a perceived increase in stress, as well as feelings of alienation and difficulty concentrating (e.g., Kurz 1999; Moser-Mercer 2003; European Parliament 2005). The bulk of this research has focused
on the effects of a very complex task on the actor (the interpreter) and the product (the interpreter’s output), while little progress has been made in understanding its effect on the process (the interpreting task). Without the latter, however, it
is impossible to comprehend the reasons for the impact on the former and to
change the parameters necessary to mitigate or eliminate some of its negative
effects. Among the questions that have yet to be addressed, for example, is what
sources of input interpreters tend to while performing the simultaneous interpreting task. Although it stands to reason that different interpreters might attempt to
obtain extra-linguistic information from different visual sources at different times
(Moser-Mercer 2002), it is equally conceivable that in an experimental set-up it
might be possible to discern patterns across participants. While the eye-tracking
paradigm might hold the potential to generate the data necessary to inform the
debate about remote interpreting, this project’s objective is more modest: it attempts to answer the arguably simpler question of what it is that simultaneous
interpreters look at. If the methodology turns out to be particularly suitable for
the exploration of these phenomena, attempts should be made to apply it more
systematically.

2. The interaction between auditory and visual input

Already halfway through the last century, Broadbent (1956) found evidence suggesting an interaction between auditory and visual stimuli. When presented at the
same time, bisensory stimuli were found to enhance memory recall. The body of
literature suggesting facilitation effects that are attributable to bisensory presentation of auditory-verbal stimuli had increased substantially by the early seventies (Halpern / Lantz 1974). Evidence was collected suggesting that a mixed
mode of presentation can increase the amount of information processed by working memory (Penney 1989). Furthermore, a redundant signal effect showing
faster reaction times when participants respond to simultaneously presented,
redundant signals, rather than a single signal, was replicated for bimodal divided-attention tasks (Lewandowski / Kobus 1993). One explanation for improved
performance during dual-modality presentation of stimuli is the conjectured
existence of separate working memory processors for auditory and visual information (Mousavi et al. 1995), a theoretical notion developed by Wickens (1984),
reflected in his multiple resource model and conflict matrix and applied to different language processing tasks by Seeber (2007). These observations, along with
the notion that the listener’s understanding of a message is influenced by nonverbal components, even though they might not be perceived or decoded consciously, will constitute the theoretical foundation for the present experiment.

3. Multimodal input in simultaneous interpreting

Professional conference interpreters are regularly confronted with multimodal
input, be it because speakers use facial expressions and gestures while they are
speaking, or because they resort to visual aids like slides with text and images to
complement or emphasize what they are saying. Although we do not yet know
how multiple sources of information are treated by the interpreting brain, which
of them are integrated and how that integration process operates, there is considerable evidence suggesting that language processing is influenced by multiple
sources of information (Jesse et al. 2000/1). The issue of multimodal input,
therefore, is not limited to the realm of remote interpreting; rather, it applies to
most ordinary conference interpreting scenarios. The assumption that simultaneous conference interpreters’ gaze is closely correlated with the visual information
needed to help with the processing of the meaning the interpreter is constructing
(Moser-Mercer 2002) is a logical, albeit hitherto unsubstantiated, extrapolation
of early findings (Cooper 1974), supporting the notion that listeners move their
eyes to the visual elements most closely related to the unfolding discourse. If
paralanguage and kinesics provide information associated with the verbally coded message by repeating, supporting or contradicting that message (Poyatos
1987), then it is conceivable that conference interpreters will visually tend to
those stimuli. In the simplest scenario, when the visual channel is used to repeat
what is being expressed on the auditory channel, the interpreter could benefit
from the aforementioned redundancy effect to process the bimodally presented
information. This scenario was chosen for the present experiment. More specifically, we will take a closer look at situations in which verbally expressed numbers are accompanied by non-verbally coded message components1. Numbers,
which constitute a well-documented source of difficulty in interpreting (Alessandrini 1990), can be expressed using different modalities: auditory-verbal (numbers expressed in spoken discourse), visual-spatial (numbers expressed as hand
gestures), and visual-verbal (numbers expressed as names or numerals). An eye-tracking experiment was designed to explore the extent to which simultaneous
interpreters tend to the visual (-verbal or -spatial) channel when numbers are
presented on the auditory-verbal channel.

4. The experiment

The purpose of this experiment was to determine the extent to which professional simultaneous conference interpreters tend to visually-spatially and visually-numerally presented numbers when interpreting discourse containing numbers.
Eye movements were recorded while interpreters simultaneously interpreted a
video presentation containing numbers which were either gestured by the speaker or shown on a screen next to the speaker.

5. Method

Participants
The experimental group consisted of 10 professional conference interpreters (all
female, mean age 44, σ 7) with their professional domicile in Geneva. They all
had a minimum of five years of professional experience (mean experience 16, σ
6), work regularly as conference interpreters (mean days worked last year 160, σ
103) and they all had English as one of their passive languages. They interpreted
simultaneously into their respective mother tongue (2 Arabic, 4 German and 4
Spanish).
1 See Poyatos (1987) for a comprehensive overview.
Materials
The materials consisted of a split-screen video recording (16:9 format, showing
the speaker on the left and the slide presentation on the right half of the screen)
of 6’20’’ on the International Labour Organization. A 20-second introduction was
followed by four 1’30’’ segments. The discourse for each segment contained
three small numbers (between 1 and 10) and three large numbers (between 20
and 7,000), all embedded in sentences of the type “The organization has X members” (mean duration 3.45 sec, σ .66). Small numbers were gestured together
with spoken numbers. The stroke, i.e., the part of the gesture encoding the message, temporally coincided with the verbal production of the number. Large
numbers were shown on the slides on the right side of the screen. The three large
numbers for each segment were visible from its onset. Small and large numbers appeared in a fixed random order throughout the discourse.
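By way of illustration only, the following sketch (in Python) shows how a stimulus schedule of this kind could be generated; the concrete numbers, the carrier sentence and the random seed are assumptions and not the actual materials used in the study.

```python
# Illustrative sketch of the stimulus structure described above. The concrete
# numbers, the carrier sentence and the seed are assumptions, not the actual materials.
import random

random.seed(1)  # "fixed random order": every participant sees the same shuffled order

SEGMENTS = 4
CARRIER = "The organization has {n} members."

def build_segment() -> list[dict]:
    small = random.sample(range(1, 11), 3)      # gestured together with the spoken number
    large = random.sample(range(20, 7001), 3)   # shown as numerals on the slide
    items = [{"sentence": CARRIER.format(n=n), "size": "small", "modality": "gesture"} for n in small]
    items += [{"sentence": CARRIER.format(n=n), "size": "large", "modality": "slide"} for n in large]
    random.shuffle(items)                       # small and large numbers interleaved
    return items

schedule = [build_segment() for _ in range(SEGMENTS)]  # four 1'30'' segments
```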
Figure 1: Still of split-screen video materials with areas of interest
Apparatus
Eye-gaze patterns were recorded using a Tobii 120 Hz remote eye-tracker. The sound was delivered through Bosch LBB 3443 headphones.
Procedure
The experiment was carried out at LaborInt, the FTI's research laboratory. Participants were seated at a desk at a distance of approximately 60 cm from the eye-tracker. The sound was delivered over headphones; volume and treble controls were available. After a 9-point calibration, the video sequence was shown
and simultaneously interpreted. Three areas of interest (AOIs) were identified on
the screen: the speaker’s head (for facial expressions), the speaker’s torso (for
gestures) and the numerals on the slides. Participation was voluntary. All participants took part in a prize draw of CHF 250.
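To make the AOI-based measure more concrete, the sketch below illustrates one way fixation samples could be assigned to such areas of interest and summed into overall gaze durations; the pixel coordinates, data structures and function names are hypothetical and do not represent the Tobii analysis pipeline actually used.

```python
# Illustrative sketch only: AOI coordinates, fixation format and names are
# assumptions, not the actual analysis pipeline used in the study.
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float          # horizontal gaze position in pixels
    y: float          # vertical gaze position in pixels
    duration: float   # fixation duration in seconds

# Hypothetical bounding boxes (x_min, y_min, x_max, y_max) for the three AOIs
AOIS = {
    "speaker_head":  (80, 40, 280, 220),     # facial expressions
    "speaker_torso": (60, 220, 320, 540),    # gestures
    "slide_numbers": (680, 120, 1180, 480),  # numerals on the slides
}

def classify(fix: Fixation) -> str | None:
    """Return the AOI a fixation falls into, or None if it hits no AOI."""
    for name, (x0, y0, x1, y1) in AOIS.items():
        if x0 <= fix.x <= x1 and y0 <= fix.y <= y1:
            return name
    return None

def gaze_duration_per_aoi(fixations: list[Fixation]) -> dict[str, float]:
    """Sum fixation durations per AOI (overall gaze duration, cf. Rayner 1998)."""
    totals = {name: 0.0 for name in AOIS}
    for fix in fixations:
        aoi = classify(fix)
        if aoi is not None:
            totals[aoi] += fix.duration
    return totals
```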

6. Results and discussion

Overall gaze duration was used to measure processing of AOIs (see Rayner
1998). Paired t-tests were carried out to compare the means between the two
conditions. On average, participants looked at the speaker’s face significantly
longer during sentences containing small numbers (M= 27.47, SE= 4.82) than
during sentences containing large numbers (M= 21.06, SE= 4.01), t(9)= -3.587,
p<.05. Furthermore, they looked at the numbers shown on the slides significantly
longer during sentences containing large numbers (M= 9.59, SE= 2.37) than
during sentences containing small numbers (M= 4.93, SE= 1.78), t(9)= 3.697,
p<.05. Finally, no significant difference was found between the time participants
looked at the speaker’s hands during sentences containing small numbers (M=
2.10, SE= 1.22) and sentences containing large numbers (M= 1.72, SE= 1.68),
t(9)= -.635, p=.54.
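For readers less familiar with the procedure, a minimal sketch of such a paired comparison is given below; the per-participant values are placeholders and not the data reported above.

```python
# Minimal sketch of a paired t-test on per-participant gaze durations (seconds).
# The values below are placeholders, NOT the data collected in this study.
from scipy import stats

# Gaze duration on the speaker's face for each of the 10 participants,
# once for sentences with small numbers and once for sentences with large numbers.
face_small = [30.1, 22.4, 35.0, 18.9, 27.7, 31.2, 24.5, 29.8, 26.3, 28.8]
face_large = [24.0, 17.1, 28.3, 15.2, 21.0, 25.4, 18.7, 23.9, 20.1, 22.9]

t_stat, p_value = stats.ttest_rel(face_small, face_large)
print(f"t(9) = {t_stat:.3f}, p = {p_value:.3f}")
```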
Figure 2: Gaze time for the three areas of interest, error bars: SE
These results support the notion that interpreters tend to the speaker’s face, be it
as a means to receive cues facilitating verbal processing (see Jesse et al. 2000/1),
or as an early behavioral mechanism (Perlman et al. 2009). The eye-gaze patterns
directed at the information on the slides seem to suggest that interpreters scan the
slide for additional information to complement the auditory input during the
presentation of both small and large numbers. This finding corroborates the notion that simultaneous interpreters actively search for information on the visual-spatial channel to complement the information on the auditory-verbal channel.
When complementary information is available on the visual-verbal channel,
interpreters spend roughly twice as long gazing at it (9.59 sec) as when it is not available
(4.93 sec). Surprisingly, interpreters tend to the visual-spatial channel, i.e., the
speaker’s hands, equally long during the verbal presentation of large and small
numbers. This result might be explained by the fact that the very motion of the
gesture, rather than its communicated code, attracts the interpreter’s attention
(Rayner 1998), as it is impractical if not impossible to gesture numbers up to
7,000 on one’s hands. It is also possible, however, that eye movements towards
the location of gestures are triggered by the beginning of the number name expressed on the auditory-verbal channel (e.g., “seven-thousand”), which could
have been expressed as a gesture.
While not within the scope of this experiment, the difference in gaze duration on
gestures and slides raises the question of inherent task difficulty. It is well
documented that the processing of large numbers increases cognitive load more
than the processing of small numbers (Alessandrini 1990). This factor needs to
be considered during replication and deserves further investigation.

7. Conclusion

The importance of visual input during simultaneous conference interpreting has
been stressed by practitioners and echoed by researchers. While many of the
latter have attempted to discern the effect of multimodal processing, to date no
attempt has been made to measure what it is interpreters actually look at while
they perform the simultaneous interpreting task. Anderson (1994) concedes that
in her experiment participants did not always look at the screen containing the
visual information. Similarly, Kurz (1999) admits that it is impossible to know to what extent participants in her experiment actually made use of the visually
presented information.
The purpose of this experiment was to measure what interpreters look at during
the task. Eye-gaze patterns suggest that under experimental conditions simultaneous interpreters tend to visually presented complementary information but that
there might be modality-related differences. Using the appropriate design, eye-tracking technology might hold the potential to answer the questions of why interpreters want visual input, what kind of visual input they use, and when.
References
AIIC (n.d.): What about monitors in SI booths? Online: http://www.aiic.net/ViewPage.cfm/article90 (02.07.2011)
Alessandrini, M.S. (1990): Translating numbers in consecutive interpretation: An empirical experimental study. In: The Interpreters’ Newsletter 3, 77-80
Anderson, L. (1994): Simultaneous Interpretation: Contextual and Translation Aspects. In:
Lambert, S. / Moser-Mercer, B. (eds.): Bridging the Gap: Empirical research in simultaneous interpretation. Amsterdam, 101-120
Broadbent, D.E. (1956): Successive responses to simultaneous stimuli. In: Quarterly
Journal of Experimental Psychology 8, 145-152
Cooper, R.M. (1974): The control of eye fixation by the meaning of spoken language: A
new methodology for the real-time investigation of speech perception. In: Memory
and Language Processing 6/1, 84-107
European Parliament (2005): Study concerning the constraints arising from remote interpreting. Report of the 3rd remote interpretation test. Online:
www.euractiv.com/31/images/EPremoteinterpretingreportexecutive_summery_tcm31151942.pdf (02.07.2011)
Halpern, J. / Lantz, A.E. (1974): Learning to utilize information presented over two sensory channels. In: Perception and Psychophysics 16/2, 321-238
Jesse, A. / Vrignaud, N. / Massaro, D.W. (2000/01): The processing of information from
multiple sources in simultaneous interpreting. In: Interpreting 5, 95-115
Kurz, I. (1999): Tagungsort Genf / Nairobi / Wien: Zu einigen Aspekten des Teledolmetschens. In: Kadric, M. / Kaindl, K. / Pöchhacker, F. (eds.): Festschrift für Mary Snell-Hornby zum 60. Geburtstag. Tübingen, 291-302
Lewandowski, L.J. / Kobus, D.A. (1993): The effects of redundancy in bimodal word
processing. In: Human Performance 6/3, 229-239
Moser-Mercer, B. (2002): Situation Models: the cognitive relation between interpreter,
speaker and audience. In: Israel, F. (ed.): Identité, altérité, équivalence? La traduction
comme relation. Actes du Colloque international tenu à l’ESIT les 24, 25 et 26 mai
2000 en hommage à Marianne Lederer. Paris: Lettres Modernes Minard, 163-187
___ (2003): Remote interpreting: Assessment of human factors and performance parameters. Online: http://aiic.net/ViewPage.cfm/page1125.htm (02.07.2011)
___ (2005): Remote interpreting: Issues of multi-sensory integration in a multilingual
task. In: Meta 50/2, 727-738
Mousavi, S.Y. / Low, R. / Sweller, J. (1995): Reducing cognitive load by mixing auditory
and visual presentation modes. In: Journal of Educational Psychology 87/2, 319-334
Penney, C. (1989): Modality effects and the structure of short-term verbal memory. In:
Memory and Cognition 17, 398-422
Perlman, S.B. / Morris, J.P. / Vander Wyk, B.C. / Green, S.R. / Doyle, J. / Pelphrey, K.A.
(2009): Individual differences in personality predict how people look at faces. In:
PLoS ONE 4: e5952. doi:10.1371/journal.pone.0005952
Poyatos, F. (1987): Nonverbal communication in simultaneous and consecutive interpretation: A theoretical model and new perspectives. In: Textcontext 2-2/3, 73-108
Rayner, K. (1998): Eye movements in reading and information processing: 20 years of
research. In: Psychological Bulletin 124/3, 372-422
Rennert, S. (2008): Visual input in simultaneous interpreting. In: Meta 53/1, 204-217
Seeber, K.G. (2007): Thinking outside the cube: Modeling language processing tasks in a
multiple resource paradigm. In: Proceedings of Interspeech 2007, Antwerp, 1382-1385
Wickens, C.D. (1984): Processing resources in attention. In: Parasuraman, R. / Davies,
D.R. (eds.): Varieties of attention. New York, 63-102