
Exp Brain Res
DOI 10.1007/s00221-013-3674-2
Research Article
Effect of pitch–space correspondence on sound‑induced visual
motion perception
Souta Hidaka · Wataru Teramoto · Mirjam Keetels · Jean Vroomen
Received: 14 March 2013 / Accepted: 2 August 2013
© Springer-Verlag Berlin Heidelberg 2013
Abstract The brain tends to associate specific features
of stimuli across sensory modalities. The pitch of a sound
is for example associated with spatial elevation such that
higher-pitched sounds are felt as being “up” in space and
lower-pitched sounds as being “down.” Here we investigated whether changes in the pitch of sounds could drive visual motion perception in the same way as changes in the location of sounds. We demonstrated that only sounds
that alternate in up/down location induced illusory vertical
motion of a static visual stimulus, while sounds that alternate in higher/lower pitch did not induce this illusion. The
pitch of a sound did not even modulate the visual motion
perception induced by sounds alternating in up/down location. Interestingly, though, sounds alternating in higher/
lower pitch could become a driver for visual motion if they
were paired in a previous exposure phase with vertical visual apparent motion. Thus, only after prolonged exposure,
the pitch of a sound became an inducer for upper/lower
visual motion. This occurred even if during exposure the
pitch and location of the sounds were paired in an incongruent fashion. These findings indicate that pitch–space correspondence is not strong enough to drive or modulate visual motion perception. However, associative exposure could increase the saliency of pitch–space relationships, and then the pitch could induce visual motion perception by itself.

Electronic supplementary material The online version of this article (doi:10.1007/s00221-013-3674-2) contains supplementary material, which is available to authorized users.

S. Hidaka (*) Department of Psychology, Rikkyo University, 1‑2‑26, Kitano, Niiza‑shi, Saitama 352‑8558, Japan
e-mail: [email protected]

W. Teramoto Department of Computer Science and Systems Engineering, Muroran Institute of Technology, 27‑1 Mizumoto‑cho, Muroran 050‑8585, Japan

M. Keetels · J. Vroomen Department of Cognitive Neuropsychology, Tilburg University, Warandelaan 2, 5000 LE Tilburg, The Netherlands
Keywords Crossmodal correspondence ·
Multisensory perception · Auditory space ·
Pitch · Visual motion perception
Introduction
People receive large amounts of sensory input
from the surrounding environment. Our brain automatically and efficiently integrates these multisensory inputs
to establish coherent and robust perceptions and cognitions
(Ernst and Bülthoff 2004). While an influential cue for the
integration is spatiotemporal consistency (Calvert et al.
2004), more abstract “correspondences” between sensory
inputs can also serve as a factor to associate or bind them.
One example is the pitch of a sound, which can provide not
only a high–low sensation of pitch, but also an up–down
(or high/low) impression in space (Bernstein and Edelstein
1971; Evans and Treisman 2010; Mudd 1963; Pratt 1930;
Roffler and Butler 1968; Rusconi et al. 2006) (see Marks
2004; Spence 2011 for review).
At least three different “cues” may be responsible for
establishing these crossmodal correspondences: Correspondence in the “magnitude” of brain activity (bright
lights and loud sounds both induce “more” brain activity),
natural statistics including simple co-occurrence of events
(bigger objects produce bass sounds), and semantic consistency (the word “high” is common for high-pitched sounds and upper spatial elevation) (Spence 2011). In many circumstances, these cues co-occur, making it difficult to tease
them apart and to establish their contribution in isolation.
To demonstrate these crossmodal associations, many
studies have adopted tasks like speeded classification. For
example, the reaction time to an upper (or lower) visual target is faster when a higher-pitched (or lower-pitched) sound
is concurrently presented than when a lower-pitched (or
higher-pitched) sound is presented (Bernstein and Edelstein
1971; Spence 2011). Some studies have also demonstrated
correspondence effects using more “indirect” response or
attentional tasks (Evans and Treisman 2010; Chiou and
Rich 2012; Klapetek et al. 2012; Mossbridge et al. 2011;
Parise and Spence 2008, 2012). Evans and Treisman (2010)
showed that higher- or lower-pitched sounds shorten reaction times not only for judgments of a visual target’s position but also for judgments of a visual target’s feature (orientation) when the target is presented at a congruent spatial
position (i.e., upper or lower) relative to an incongruent
situation (i.e., lower or upper). Chiou and Rich (2012) also
reported that higher- or lower-pitched sounds worked as
a “spatial” cue such that these sounds shifted participants’ attention to an upper or lower visual location. These findings suggest that attentional/response levels of processing
could be involved in crossmodal correspondences.
Recently, several studies also demonstrated crossmodal
correspondence effects by using unspeeded perceptual
tasks (e.g., temporal order judgment (TOJ), methods of
adjustment) for pitch–size, pitch–shape (Parise and Spence
2009), voice–size (Sweeny et al. 2012), and auditory amplitude–visual spatial frequency (Guzman-Martinez et al.
2012) pairs. These studies thus indicate the involvement of
perceptual processing in crossmodal correspondences. With
regard to pitch–space correspondence, a study reported
that continuous changes (glides) in pitch affected visual
motion perception (Maeda et al. 2004). However, since continuous pitch changes can induce a motion impression by themselves (Walker 1987), crossmodal interaction in motion processing, rather than pitch–space correspondence, is assumed to mainly contribute to this finding.
Therefore, although pitch–space correspondence is considered to be one of the most typical examples of crossmodal
correspondence, the effects of this correspondence in a
perceptual domain had not been investigated in a manner that allows direct comparison between the pitch and the spatial information of sounds.
The aim of the present study was thus to investigate
whether changes in the pitch of sounds could drive visual motion perception in the same way as changes in the location of sounds. Moreover, to introduce discrete pitch changes
rather than continuous ones, we adopted sound-induced
illusory visual motion as a tool. Sounds alternating in
space can induce strong illusory visual motion perception
of static visual stimuli (Hidaka et al. 2009) [see Supplementary Information S1A (movie)]. Relying on signal-detection theory (Macmillan and Creelman 2004), this
phenomenon has been found to affect sensitivity (d-prime)
(Hidaka et al. 2011b). In line with previous studies, we
here report that sounds that alternate in vertical space (up/
down) also induce vertical illusory motion (Teramoto et al.
2010b). The new finding in our current study is that sounds
that alternate in pitch (high/low) do not induce this illusory
visual motion, so that they are not necessarily comparable
to those that alternate in spatial location (up/down) [see
Supplementary Information S1B (movie)] (Experiment
1). Moreover, the driving effect of the auditory up–down
spatial information is not affected by whether the pitch of
the sounds changes in a congruent (higher-pitched sound in
upper location, lower-pitched sound in lower location) or
incongruent (higher-pitched sound in lower location, lower-pitched sound in upper location) fashion (Experiment 2).
These findings indicate that pitch–space correspondence
is not so strong as to drive or modulate visual motion perception. Importantly, though, we also report that high- and
low-pitched sounds can acquire a driving effect for visual
motion if they have been associated with vertical visual
apparent motion for a few minutes (Teramoto et al. 2010a).
After prolonged exposure, higher/lower-pitched sounds
thus induced up–down visual motion, even if these sounds
were paired in an incongruent fashion with visual motion
during exposure (Experiment 3). This may indicate that
associative exposure can increase the saliency of pitch–
space corresponding relationships, and then the pitch can
induce visual motion perception by itself.
Experiments 1 and 2
Experiments 1 and 2 tested whether the alternation of
higher- and lower-pitched sounds induced visual motion
similar to sounds that alternate in upper and lower location
(Hidaka et al. 2009; Teramoto et al. 2010b; Hidaka et al.
2011b). In Experiment 1, we investigated driving effects
of pitch alone for visual motion perception; sounds were
presented either with or without alternation in pitch (alternating pitch and constant pitch conditions, respectively),
while the spatial elevation of the sounds was kept constant
(thus perceived as coming from the center) (Fig. 1a) [see
also Supplementary Information S1B (movie)]. In Experiment 2, we further investigated modulatory effects of pitch
on auditory spatial information; sounds always alternated
in upper and lower locations, while the pitch alternated
either in a congruent fashion (i.e., higher pitch in upper
location, lower pitch in lower location) or in an incongruent
fashion (i.e., higher pitch in lower location, lower pitch in
upper location) (Fig. 1b).
Fig. 1 Absence of driving and
modulatory effects of pitch
information in sound-induced
visual motion perception
(Experiments 1 and 2). a, b
Schematic illustrations of
stimuli and auditory conditions of Experiments 1 and 2,
respectively. In Experiment 1,
sounds were presented either
with or without pitch alternation, with fixed auditory spatial
information. In Experiment 2,
sounds were presented alternating in upper and lower locations, while pitch of the sounds
alternated either in a congruent
or an incongruent fashion. c, d
Results of Experiments 1 and 2,
respectively. Error bars denote
the standard error of the mean
(N = 8). Asterisks indicate statistically significant differences
(p < .05)
Methods
Participants and apparatus
Written consent was obtained from each participant prior to
experiments. The experiments were approved by the local
ethics committee of Rikkyo University. Each of the 16 participants (eight in each experiment) had normal
or corrected-to-normal vision and normal hearing. The participants were naïve to the purpose of the experiment. A
customized PC and MATLAB (The Mathworks Inc.) with
the Psychophysics Toolbox (Brainard 1997; Pelli 1997)
were used to control the experiment. Visual stimuli were
presented on a CRT display with a resolution of 800 × 600
pixels and a refresh rate of 60 Hz. The viewing distance
was 45 cm. Auditory stimuli were generated digitally
(sampling frequency 44.1 kHz) and delivered through loudspeakers. The upper speaker was set 50 cm above, and the lower speaker 50 cm below, the center of the
display. The horizontal position of the speakers was aligned
with that of the visual stimuli. A numeric keypad was used
for recording responses. We confirmed that the onset of
the visual and auditory stimuli was synchronized using a
digital oscilloscope. The observers were instructed to place
their heads on a chin rest. All the experiments were conducted in a dark room.
Stimuli
We presented a red circle (0.4° in diameter; 17.43 cd/m2) as
a fixation point on a black background. A sequence of
white bars (3° × 0.2°; 5.08 cd/m2) was presented as visual
stimuli in the right visual field at an eccentricity of either
10 or 20°. Each bar was presented for 400 ms with an inter-stimulus interval (ISI) of 100 ms. For auditory stimuli,
two white noise bursts were created and filtered with a
1-octave frequency band; the center frequency was either
3 kHz (higher) or 1.2 kHz (lower). These sounds were presented for 50 ms with a cosine ramp of 5 ms at the onset
and offset. The amplitude was adjusted such that the two
sounds were equally loud in our experimental situation.
The sound pressure levels of the lower and higher tones
were 69 dB SPL and 73 dB SPL, respectively, with monaural presentation and 70 dB SPL and 73 dB SPL, respectively, with binaural presentations.1 We confirmed that
these stimuli could induce a pitch–space correspondence effect in the response domain by adopting a speeded classification task (Rusconi et al. 2006) (see Fig. 2). The onset timing of each noise burst was synchronized with that of the
visual stimulus.
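For illustration, a noise burst with these parameters (50 ms duration, 5 ms raised-cosine on/off ramps, a 1-octave band around a 3 kHz or 1.2 kHz center, 44.1 kHz sampling) could be synthesized as in the sketch below. The Butterworth design and the filter order are our assumptions; the paper does not specify the filter.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 44100  # sampling frequency (Hz), as in the paper

def noise_burst(center_hz, dur_s=0.050, ramp_s=0.005, fs=FS, order=4):
    """White-noise burst band-pass filtered to a 1-octave band around
    center_hz, with raised-cosine onset/offset ramps.
    The filter order is an assumption (not given in the paper)."""
    # 1-octave band: half an octave below to half an octave above center
    lo, hi = center_hz / np.sqrt(2.0), center_hz * np.sqrt(2.0)
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    n = int(round(dur_s * fs))
    burst = sosfiltfilt(sos, np.random.randn(n))
    # raised-cosine (half-Hann) ramps at onset and offset
    nr = int(round(ramp_s * fs))
    ramp = 0.5 * (1.0 - np.cos(np.pi * np.arange(nr) / nr))
    burst[:nr] *= ramp
    burst[-nr:] *= ramp[::-1]
    return burst / np.max(np.abs(burst))  # normalize peak amplitude

higher = noise_burst(3000.0)   # "higher" sound, 3 kHz center
lower  = noise_burst(1200.0)   # "lower" sound, 1.2 kHz center
```

In practice the two bursts would then be rescaled to the loudness-matched levels reported above rather than to a common peak.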
Procedure
Each experiment consisted of training and main sessions. In
each session, the participants were asked to judge whether
the visual stimulus was perceived as static or moving. During the main session, the participants were asked to make
the judgments while trying to ignore the sounds.

¹ We confirmed that the stimuli could be discriminated only by pitch. We sequentially presented higher- and lower-pitched tones, or vice versa, with an ISI of 1,000 ms and asked 10 participants to judge which tone was perceived as higher in pitch or larger in amplitude (these response domains were randomly assigned in each trial). Pitch discrimination performance was nearly perfect (percentages of correct responses (standard errors of the mean) were 94.5 % (1.9 %) and 93 % (2.1 %) in the monaural and binaural presentations, respectively). In contrast, amplitude discrimination performance was not significantly different from chance (54.5 % (9.0 %) and 55.0 % (8.9 %), t(9) = 0.50 and 0.58, in the monaural and binaural presentations, respectively).

Fig. 2 We confirmed that our auditory stimuli had pitch–space correspondence effects in a speeded classification task. After the 500 ms presentation of the fixation point, participants (N = 8) were presented with a single high- or low-pitched band-pass noise (center frequency of 3 or 1.2 kHz, respectively) and were asked to judge, as quickly and as accurately as possible, either the location (upper or lower; location discrimination task) or the pitch (higher or lower; pitch discrimination task) of the sounds, ignoring the irrelevant dimension. The sound was presented from either the upper or lower loudspeaker in the location judgment task and from both loudspeakers (i.e., without spatial elevation) in the pitch judgment task. a The stimulus–response mapping was either congruent (i.e., the upper response key for upper-location or higher-pitched sounds) or incongruent (i.e., the lower response key for upper-location or higher-pitched sounds). In the congruent response key assignment, the “8” and “2” keys on the numeric keypad served as the upper/higher and lower response buttons, respectively. In the incongruent response key assignment, the relationship was reversed. Reaction time (RT) and accuracy were recorded. The experiment consisted of 160 trials: Judgment (2) × Key assignment (2) × Pitch (2) × Repetitions (20). Each judgment type and key assignment was introduced in a blocked design, and the order of these conditions was counterbalanced among the participants. While the pitch was fixed in each block and counterbalanced among the blocks for the location judgment, the pitch was randomly varied among the trials for the pitch judgment. RTs shorter than 200 ms or longer than 1,200 ms were excluded from analysis (Location-congruent task: 2.19 %, Location-incongruent task: 6.56 %, Pitch-congruent task: 0.31 %, Pitch-incongruent task: 5.00 %). Mean error rates of the location and pitch judgments were as follows: Location-congruent task: 10.94 %, Location-incongruent task: 17.50 %, Pitch-congruent task: 0.94 %, Pitch-incongruent task: 3.75 %. b For the RT data, a two-way repeated measures ANOVA with Congruency × Judgment type found a main effect of Congruency (F(1, 7) = 9.49, p < .05): RTs in the congruent condition were significantly shorter than those in the incongruent condition. A main effect of Judgment type (F(1, 7) = 8.16, p < .05) revealed that RTs for location judgments were longer than those for pitch judgments. The interaction between the factors was not significant (F(1, 7) = 1.38, p = .29). These results thus indicate that our higher/lower-pitched sounds, like sounds from upper and lower locations, were associated with an upper/lower response space. Error bars denote the standard error of the mean (N = 8). Asterisks indicate statistically significant differences (p < .05)

The training session consisted of 40 trials: Visual stimulus (2; static/
moving) × Eccentricity (2) × Repetition (10). The white
bar was presented 6 times without the sounds. The bar was
vertically displaced back and forth by 0.2° for the moving
condition and at a fixed location for the static condition.
The training session was repeated until discrimination performance exceeded 75 % for each eccentricity.
This session was introduced because the visual stimuli presented at relatively large eccentricities were sometimes perceived as moving without sounds (e.g., Hidaka et al. 2009),
and this effect should be dissociated from the auditory driving effect. In the main session of Experiment 1, the sounds
were presented from both upper and lower loudspeakers
(i.e., giving the impression that the sound came from the
center). This session consisted of 240 trials: Visual stimulus (2) × Eccentricity (2) × Sound (3) × Repetition (20).
The sounds were presented either with pitch changes (alternating pitch condition) or without pitch changes (constant
pitch condition). A silent condition was also included as a baseline. In Experiment 2, the main session consisted of
320 trials: Visual stimulus (2) × Eccentricity (2) × Sound
(4) × Repetition (20). The sounds were presented from
the upper and lower loudspeakers in either the congruent
(higher-pitched sound in upper location) or incongruent
(lower-pitched sounds in upper location) pitch-visual position assignment. The conditions with the constant pitch
sounds (constant pitch condition) and without the sounds
(silent condition) were also introduced. In each experiment,
the order of the conditions was randomly assigned in each
trial and counterbalanced among the participants.
Results and discussion
We calculated d-prime and β values (Macmillan and Creelman 2004) as indices of perceptual sensitivity and response/
decisional biases, respectively (Hidaka et al. 2011b). We
regarded “static” responses as a “hit” on the static trials and as a “false alarm” on the moving trials (Supplementary Information S2). Thus, lower d-prime
values in conditions where sounds were present indicate
sound-induced illusory visual motion.
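For concreteness, these indices can be computed from hit and false-alarm rates in the standard signal-detection way; the rates in the example below are illustrative, not data from the paper.

```python
import math
from statistics import NormalDist

def sdt_indices(hit_rate, fa_rate):
    """d-prime and beta from hit and false-alarm rates.
    As defined in the text, a 'hit' is a 'static' response on a static
    trial and a 'false alarm' is a 'static' response on a moving trial."""
    z = NormalDist().inv_cdf                   # inverse standard normal CDF
    zh, zf = z(hit_rate), z(fa_rate)
    d_prime = zh - zf                          # perceptual sensitivity
    beta = math.exp((zf ** 2 - zh ** 2) / 2)   # likelihood ratio at the criterion
    return d_prime, beta

# Illustrative rates: 90 % hits, 20 % false alarms
d, b = sdt_indices(0.90, 0.20)
```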
With regard to Experiment 1, a two-way repeated measures analysis of variance (ANOVA) with Sound × Eccentricity on d-prime values revealed a significant main effect
of Eccentricity (F(1, 7) = 18.28, p < .005). However, the
crucial effect of Sound (F(2, 14) = 0.89, p = .43) and their
interaction (F(2, 14) = 0.44, p = .65) were not significant
(Fig. 1c). The ANOVA on the β values revealed a significant
main effect of Sound (F(2, 14) = 6.13, p < .05). The post
hoc test (p < .05) revealed that the β value of the alternating
pitch condition was smaller than the other conditions. This
indicates that, consistent with the results of the speeded
classification task (Fig. 2), changes in pitch could be effective in the response/decisional domain. These results thus demonstrate that sounds that alternate in high/low pitch do not
induce illusory visual motion perception.
In Experiment 2, the two-way repeated measures
ANOVA in d-prime values revealed main effects for
Sound (F(3, 21) = 6.64, p < .005) and Eccentricity (F(1,
7) = 15.47, p < .01), as well as a significant interaction effect between these factors (F(3, 21) = 3.32, p < .05)
(Fig. 1d). Regarding a simple main effect of Sound at 20°
of Eccentricity (F(3, 42) = 6.68, p < .001), post hoc tests
(Tukey’s HSD, p < .05) revealed that the d-prime for the
silent condition was always higher than the other conditions
and that there was no difference among the sound-present
conditions. Sounds alternating in upper/lower locations
thus always induced illusory visual motion of static stimuli
at 20°, irrespective of the presence or absence of pitch–
space congruency. The corresponding simple main effect
at 10° of Eccentricity was not significant (F(3, 42) = 2.81,
p = .06). The ANOVA on the β values revealed a significant main effect of Sound (F(3, 21) = 3.87, p < .05).
The post hoc test showed that the β value of the congruent condition was smaller than that for the constant pitch
condition. Again, this suggests that the correspondence
between pitch and spatial information would be effective,
particularly in the response/decisional domain (see also Fig. 2).
In line with previous studies (Hidaka et al. 2009, 2011b;
Teramoto et al. 2010b), these results show that auditory
spatial shifts induced illusory motion perception of static
visual stimuli especially at a far peripheral eccentricity.
Most importantly, congruency between sound location and
pitch did not modulate that perceptual effect.
Taken together, the results of Experiments 1 and 2 show
that the alternation of pitch neither induces illusory visual motion perception nor modulates the driving effect of auditory spatial information.
Experiment 3
In Experiment 3, we examined whether the saliency of
pitch for spatial information could be changed by associating pitch with visual apparent motion. Previous studies
have shown that arbitrary pitch information can become associated with, and subsequently induce, horizontal visual motion within a
few minutes of exposure to a paired presentation of these
stimuli (Teramoto et al. 2010a). This sound-contingent
visual motion perception has been confirmed to occur at a perceptual level (Hidaka et al. 2011a; Kobayashi et al. 2012a, b).
Here, we used a 9-minute exposure phase in which sounds
alternating in pitch (higher/lower) without spatial elevation
(thus perceived as appearing from the center) were paired
with visual stimuli (white bars) alternately displaced by
5° in the vertical direction. Pitch and visual stimulus locations were either congruent (e.g., a higher-pitched sound
paired with the upper visual stimulus) or incongruent
(e.g., a higher-pitched sound paired with the lower visual
stimulus) (Fig. 3a) [see also Supplementary Information
S1C (movie)]. Test sessions were held before and after the
exposure sessions to quantify the exposure effects on visual
motion perception.
Fig. 3 Effects of associative
exposure between vertical
visual motion and alternating
pitch sounds (Experiment 3).
a Schematic illustrations of
exposure and test sessions and
an example of session flow. In
a 9-minute exposure session,
sounds alternating in pitch
appearing from central location
were paired with visual stimuli
alternating in vertical locations.
The pitch of the sound was
either congruent or incongruent
with the visual stimulus locations. Test sessions were held
before (pre-test) and after (posttest) the exposure sessions. In
each test session, the visual
stimuli shifted in upward or
downward direction with pitch
changes from higher-to-lower
or vice versa. b Psychometric
functions. On the horizontal
axis, negative values indicate
downward visual motion, and
positive values indicate upward
motion. The 50 % point of the responses was estimated as the point of subjective stationarity (PSS).
c Amount of PSS shifts. Error
bars denote one standard error
of the mean (N = 8). Asterisks
indicate statistically significant
differences (p < .05)
Methods
Eight participants were recruited for this experiment. They
had normal or corrected-to-normal vision and normal hearing and were naïve to the purpose of the experiment. In the
display with a resolution of 1,600 × 1,200 pixels, the white bars were presented in the left and right visual fields at 10° of eccentricity. This experiment consisted of three sessions,
a pre-test, an exposure, and a post-test session. In the
exposure session, the white bar continuously moved up and down over a distance of 5° for 9 min (each bar was presented for 400 ms with an ISI of 100 ms between successive bars, amounting to 540 up–down exposure cycles in total). The onset
of the visual stimuli was synchronized with that of the
sounds. The pitch and the location of the visual stimulus
were either congruent (i.e., higher-pitched sound with
upper visual stimulus) or incongruent (i.e., higher-pitched
sound with lower visual stimulus). Participants were asked
to keep looking at the fixation point. The visual stimuli
were presented in either the right or the left visual field
for each exposure type because the contingency effect has
sharp spatial selectivity (~5°) at the exposed visual field
(Teramoto et al. 2010a; Hidaka et al. 2011a). In pre- and
post-test sessions, the points of subjective stationarity for
motion direction were measured by using a motion-nulling
procedure with the method of constant stimuli (Teramoto
et al. 2010a; Hidaka et al. 2011a; Kobayashi et al. 2012a,
b). The estimation of the PSS with a motion-nulling procedure
is a traditional, reliable, and direct measurement of motion
perception (e.g., Arman et al. 2006; Cavanagh and Favreau
1985; Mateeff et al. 1985). Typically, in this procedure, the
magnitude of illusory motion perception is measured by
presenting physical motion in the same or opposite direction of illusory motion with various displacement sizes. If
illusory motion is not perceived, observers’ responses and
the resulting PSS are firmly consistent with the perception
of physical motion. However, if illusory motion occurs, the
PSS would shift because illusory motion perception boosts
the percept of a consistent physical motion signal and even
cancels out an inconsistent physical motion signal. In each
trial, two visual stimuli were sequentially presented, producing vertical apparent motion. The moving distances
(0.06, 0.12, 0.24, and 0.48°) and directions (upward/
downward) were randomly assigned. The onsets of the
two visual stimuli were synchronized with the higher and
lower sounds (H–L) or vice versa (L–H). The silent condition was also introduced as a baseline. The participants
were asked to judge the perceived motion direction of the
visual stimuli while ignoring the sounds, if presented. Each
pre- and post-test session consisted of 240 trials: Distance (4) × Direction (2) × Sound (3) × Repetition (10).
In each session, the conditions were randomly assigned
in each trial and counterbalanced among the participants.
While the pre-tests were successively completed in one of
the visual fields (left or right), post-tests were introduced
after each exposure type in the exposed visual field. The
exposure types and exposed visual fields (Congruent—
Left and Incongruent—Right or Incongruent—Left and
Congruent—Right) were randomly assigned in each participant and counterbalanced among the participants. The
experiment lasted about an hour. Otherwise, the
apparatus, stimuli, and procedures were identical to those
of Experiment 1.
Results and discussion
We plotted the proportion of upward motion perception as a function of moving distance (Fig. 3b). To estimate the points of subjective stationarity (PSSs), we obtained
the 50 % point by fitting a cumulative normal-distribution
function to each individual’s psychometric function. In
order to compare the PSSs (Supplementary Information
S3) between the sound conditions and between exposure
types, we calculated the amount of PSS shifts by subtracting the PSSs of the silent condition from those of the
sound conditions in each test session and exposure type
(Kobayashi et al. 2012b) (Fig. 3c). A three-way repeated
measures ANOVA on the PSS with Test session × Exposure × Sound found a significant interaction effect between
Test session and Sound (F(1, 7) = 5.87, p < .05). While
the simple main effects of Sound were not significant in
the pre-test session (F(1, 14) = 1.79, p = .20), the effect
was significant in the post-test session (F(1, 14) = 13.18,
p < .005). The PSSs shifted to the negative (downward
motion) direction in the H–L condition and to the positive (upward motion) direction in the L–H condition, irrespective of exposure type: The congruent condition thus
induced illusory visual motion in the same direction as the one the participants had been exposed to (i.e., exposure to H–L sounds induced frequent downward motion perception), whereas the incongruent condition induced illusory visual motion in the direction opposite to the exposed one (i.e., exposure to L–H sounds also induced frequent downward motion perception).
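The PSS estimation used in this analysis (the 50 % point of a cumulative normal fitted to each psychometric function) can be sketched as follows; the response proportions below are illustrative, not the paper’s data.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def fit_pss(displacement_deg, p_upward):
    """Fit a cumulative normal to 'proportion upward' responses and
    return its 50 % point (the PSS) and the slope parameter sigma."""
    def cum_normal(x, mu, sigma):
        return norm.cdf(x, loc=mu, scale=sigma)
    (mu, sigma), _ = curve_fit(cum_normal, displacement_deg, p_upward,
                               p0=[0.0, 0.2])
    return mu, sigma  # mu is the PSS; sigma relates to the JND

# Illustrative data (not the paper's): negative = downward displacement (deg)
x = np.array([-0.48, -0.24, -0.12, -0.06, 0.06, 0.12, 0.24, 0.48])
p = np.array([0.02, 0.10, 0.25, 0.40, 0.60, 0.75, 0.90, 0.98])
pss, sigma = fit_pss(x, p)  # a PSS near 0 means no illusory motion bias
```

A PSS shifted away from zero in a sound condition, relative to silence, is the signature of sound-induced motion quantified in Fig. 3c.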
Consistent with Experiment 1, we found that pitch did
not affect visual motion perception before exposure. However, after exposure, a change in pitch did alter visual
motion perception. In line with previous studies (Teramoto
et al. 2010a; Hidaka et al. 2011a; Kobayashi et al. 2012a,
b), these results indicate that a new pitch–space association can be formed, which then induces sound-contingent visual
motion. What is new is that the effect of exposure occurred
in a congruent manner irrespective of the exposure types:
H–L/L–H sounds consistently induced downward/upward
visual motion perception. A previous study suggested that
a top–down association between pitch and brightness could serve to make their correspondence relationship salient to the participants and then elicit the correspondence effect in a visual search task (Klapetek et al. 2012). In line with this
idea, one explanation for the current findings would be that
the pitch–space correspondence existing in higher processing levels somehow modulates perceptual association in a
congruent manner by top–down control such as through
attentional guidance to a specific relationship (Ahissar and
Hochstein 1993; Chiou and Rich 2012). An alternative
explanation would be that the representation of the pitch–
space correspondence is present, but only weakly, at the perceptual processing level. In fact, there was a slight
(albeit non-significant) trend for a pitch–space correspondence in the pre-test. If the representation of the congruent
pitch–space association is weakly established, the congruent exposure might activate such representations and
directly induce the congruency effect. The representation
of an incongruent pitch–space association would be newly
shaped by the prolonged exposure in the current study and
activate the counterpart congruent representation by the
enhancement of crossmodal connectivity (Zangenehpour
and Zatorre 2010) so that the congruency effect might
appear. In either explanation, the findings may suggest that
the prolonged exposure can increase the saliency of pitch–
space corresponding relationships, and then the pitch can
induce visual motion perception.
General discussion
The present study investigated whether the discrete changes
in the pitch of sounds, which is typically associated with
spatial location information (pitch–space correspondence),
could be effective for visual motion perception. We adopted
sound-induced illusory visual motion as a tool for investigating the perceptual effects of pitch–space correspondence on visual motion perception in a comparable manner
between the pitch and spatial information of sounds. We
found that, contrary to the changes in location of sounds,
the pitch changes did not have a driving or modulating
effect on visual motion perception (Experiments 1 and 2).
We further found that, after the pitch changes were associated with visual motion, they did become inducers of visual
motion in a congruent manner even after an incongruent
paired association (Experiment 3). These findings suggest
that pitch–space correspondence is originally not strong enough to drive or modulate visual motion perception. However,
associative exposure could increase the saliency of pitch–
space corresponding relationships and then the pitch could
induce visual motion perception.
In Experiments 1 and 2, we focused on differences in d-primes as an index of perceptual sensitivity to visual motion. We confirmed that the alternation of pitch information did not induce changes in d-prime values, although it did have some effect on β values, an index of response/decisional bias. In Experiment 3, in addition to the changes in PSSs that demonstrated the driving effect of pitch on visual motion, we also confirmed that the slope of the psychometric functions (the just noticeable difference; JND) did not differ among the auditory conditions even in the post-test session [see Supplementary Information S4]. If response/decisional biases existed, the JNDs should have differed, because the upward/downward sound presentations would induce frequent upward/downward responses, especially when motion direction was uncertain. These results suggest that the findings of the current experiments cannot simply be explained by response/decisional biases.
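The signal-detection and psychometric-function measures discussed above can be made concrete with a short sketch. The following Python fragment is illustrative only and is not the analysis code used in the study; in particular, the JND convention assumed here (the 75%–50% half-width of a cumulative-Gaussian fit) is one common choice among several.

```python
from math import exp
from statistics import NormalDist

# Illustrative sketch (not the authors' analysis code): standard
# signal-detection measures computed from hit and false-alarm rates.
# d' indexes perceptual sensitivity; beta indexes response/decisional bias.
def sdt_measures(hit_rate, fa_rate):
    z = NormalDist().inv_cdf                # inverse standard-normal CDF (z-transform)
    z_h, z_f = z(hit_rate), z(fa_rate)
    d_prime = z_h - z_f                     # sensitivity
    beta = exp((z_f ** 2 - z_h ** 2) / 2)   # likelihood-ratio bias
    return d_prime, beta

# For a cumulative-Gaussian psychometric function Phi((x - mu) / sigma),
# the PSS is mu, and the JND (under the assumed 75%-50% half-width
# convention) is sigma * z(0.75).
def pss_jnd(mu, sigma):
    return mu, sigma * NormalDist().inv_cdf(0.75)
```

On this scheme, a manipulation that shifts only the observer's response criterion moves β while leaving d′ unchanged, which is why the two indices can dissociate as they did in Experiments 1 and 2.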
Pitch–space correspondence effects have often been demonstrated with speeded classification tasks (Bernstein and Edelstein 1971) (see also Fig. 2). Recently, such correspondence has also been demonstrated using indirect response tasks (Evans and Treisman 2010; Klapetek et al. 2012; Parise and Spence 2008, 2012) or attentional tasks (Chiou and Rich 2012; Mossbridge et al. 2011). These findings indicate that attentional/response levels of processing could be involved in crossmodal correspondences. Recently, some studies have successfully demonstrated crossmodal correspondence effects using unspeeded perceptual tasks for pitch–size and pitch–shape (Parise and Spence 2009), voice–size (Sweeny et al. 2012), and auditory amplitude–visual spatial frequency (Guzman-Martinez et al. 2012) pairs. However, no such effects have been reported for pitch–space correspondence. Maeda et al. (2004) also reported that changes in pitch induced visual motion perception. However, considering that they presented continuous pitch changes (glides), which are more likely to elicit a vertical motion impression than the two discrete pitches used in the current research (Walker 1987), a key element in Maeda et al. (2004) may be audiovisual interaction in motion processing rather than pitch–space correspondence. Thus, our results provide direct evidence, based on an unspeeded perceptual task (sound-induced visual motion perception), that pitch–space correspondence is limited at a perceptual level.
It has been debated how crossmodal correspondences are acquired in the brain. Recently, it has been reported that pitch–space and pitch–shape correspondence effects are observed even in pre-linguistic 4-month-old infants (Walker et al. 2010) and that not only humans but also chimpanzees exhibit pitch–luminance correspondence (Ludwig et al. 2011). These findings suggest that crossmodal correspondences may be innate, because linguistic- or conceptually based acquisition processes are inapplicable. However, it has also been argued that there is linguistic diversity, and not all languages use the same spatial metaphor for pitch (Dolscheid et al. 2011). Also, both humans and chimpanzees could learn the relationships of multimodal inputs/events from natural statistics, including the simple co-occurrence of events (Adams et al. 2004; Ernst 2005, 2007). Thus, one could also consider that crossmodal correspondences are empirically acquired after birth (Mossbridge et al. 2011; Spence and Deroy 2012). The current findings echo both of these ideas: Experiment 3 demonstrated that the pitched sounds did come to induce visual motion after paired exposure to pitch and visual spatial information. Moreover, the correspondence relationships were observed irrespective of the exposure type. These findings indicate that representations of pitch–space correspondence originally exist and that they can be activated via associative exposure so as to become effective at a perceptual level.
In the current study, we investigated the perceptual reality of pitch–space correspondence by adopting a behavioral task that is highly compatible with pitch–space associations in sounds (sound-induced visual motion). Recently, it was reported that, while continuous pitch changes had biasing effects on visual motion perception equivalent to those of natural auditory motion signals in a behavioral task, the underlying neural responses during the task were different: whereas natural auditory motion signals modulated responses in the hMT area, continuous pitch changes had no such effect but instead modulated responses in relatively higher brain regions (superior/intraparietal sulcus) (Sadaghiani et al. 2009). Based on these findings, brain imaging techniques will be necessary in the near future to investigate where the representations of pitch–space correspondence originally exist and how they are activated and associated with perceptual processing during associative exposure in the brain.
Acknowledgments We thank Wouter D.H. Stumpel for his technical support. We are grateful to the anonymous reviewers for their valuable and insightful comments and suggestions on early versions of the manuscript. This research was supported by the Ministry of Education, Culture, Sports, Science and Technology, Grant-in-Aid for Specially Promoted Research (No. 19001004), and the Rikkyo University Special Fund for Research.
References
Adams WJ, Graf EW, Ernst MO (2004) Experience can change the
“light-from-above” prior. Nat Neurosci 7:1057–1058
Ahissar M, Hochstein S (1993) Attentional control of early perceptual
learning. Proc Natl Acad Sci USA 90:5718–5722
Arman AC, Ciaramitaro VM, Boynton GM (2006) Effects of feature-based attention on the motion aftereffect at remote locations. Vision Res 46:2968–2976
Bernstein IH, Edelstein BA (1971) Effects of some variations in
auditory input upon visual choice reaction time. J Exp Psychol
87:241–247
Brainard DH (1997) The psychophysics toolbox. Spat Vis 10:433–436
Calvert GA, Spence C, Stein BE (eds) (2004) The handbook of multisensory processes. MIT Press, Cambridge
Cavanagh P, Favreau OE (1985) Color and luminance share a common motion pathway. Vision Res 25:1595–1601
Chiou R, Rich AN (2012) Cross-modality correspondence between
pitch and spatial location modulates attentional orienting. Perception 41:339–353
Dolscheid S, Shayan S, Majid A, Casasanto D (2011) The thickness of musical pitch: psychophysical evidence for the Whorfian
hypothesis. In: Proceedings of the 33rd Annual Conference of the
Cognitive Science Society, pp 537–542
Ernst MO (2005) A Bayesian view on multimodal cue integration. In:
Knoblich G, Thornton I, Grosjean M, Shiffrar M (eds) Perception
of the human body perception from the inside out. Oxford University Press, New York, pp 105–131
Ernst MO (2007) Learning to integrate arbitrary signals from vision
and touch. J Vis 7(7):1–14
Ernst MO, Bülthoff HH (2004) Merging the senses into a robust percept. Trends Cogn Sci 8:162–169
Evans KK, Treisman A (2010) Natural cross-modal mappings
between visual and auditory features. J Vis 10(1):6, 1–12
Guzman-Martinez E, Ortega L, Grabowecky M, Mossbridge J, Suzuki
S (2012) Interactive coding of visual spatial frequency and auditory amplitude-modulation rate. Curr Biol 22:383–388
Hidaka S, Manaka Y, Teramoto W, Sugita Y, Miyauchi R, Gyoba J,
Suzuki Y, Iwaya Y (2009) Alternation of sound location induces
visual motion perception of a static object. PLoS ONE 4:e8188
Hidaka S, Teramoto W, Kobayashi M, Sugita Y (2011a) Sound-contingent visual motion aftereffect. BMC Neurosci 12:44
Hidaka S, Teramoto W, Sugita Y, Manaka Y, Sakamoto S, Suzuki Y
(2011b) Auditory motion information drives visual motion perception. PLoS ONE 6:e17499
Klapetek A, Ngo MK, Spence C (2012) Does crossmodal correspondence modulate the facilitatory effect of auditory cues on visual
search? Atten Percept Psychophys 74:1154–1167
Kobayashi M, Teramoto W, Hidaka S, Sugita Y (2012a) Indiscriminable sounds determine the direction of visual motion. Sci Rep
2:365
Kobayashi M, Teramoto W, Hidaka S, Sugita Y (2012b) Sound frequency and aural selectivity in sound-contingent visual motion
aftereffect. PLoS ONE 7:e36803
Ludwig VU, Adachi I, Matsuzawa T (2011) Visuoauditory mappings
between high luminance and high pitch are shared by chimpanzees (Pan troglodytes) and humans. Proc Natl Acad Sci USA
108:20661–20665
Macmillan NA, Creelman CD (2004) Detection theory: a user’s guide,
2nd edn. Lawrence Erlbaum Associates Inc, New Jersey
Maeda F, Kanai R, Shimojo S (2004) Changing pitch induced visual
motion illusion. Curr Biol 14:R990–R991
Marks LE (2004) Cross-modal interactions in speeded classification.
In: Calvert GA, Spence C, Stein BE (eds) Handbook of multisensory processes. MIT Press, Cambridge, pp 85–105
Mateeff S, Hohnsbein J, Noack T (1985) Dynamic visual capture:
apparent auditory motion induced by a moving visual target. Perception 14:721–727
Mossbridge JA, Grabowecky M, Suzuki S (2011) Changes in
auditory frequency guide visual-spatial attention. Cognition
121:133–139
Mudd SA (1963) Spatial stereotypes of four dimensions of pure tone.
J Exp Psychol 66:347–352
Parise C, Spence C (2008) Synesthetic congruency modulates the
temporal ventriloquism effect. Neurosci Lett 442:257–261
Parise CV, Spence C (2009) “When birds of a feather flock together”:
synesthetic correspondences modulate audiovisual integration in
non-synesthetes. PLoS ONE 4:e5664
Parise CV, Spence C (2012) Audiovisual crossmodal correspondences
and sound symbolism: a study using the implicit association test.
Exp Brain Res 220:319–333
Pelli DG (1997) The VideoToolbox software for visual psychophysics: transforming numbers into movies. Spat Vis 10:437–442
Pratt CC (1930) The spatial character of high and low tones. J Exp
Psychol 13:278–285
Roffler SK, Butler RA (1968) Factors that influence the localization
of sound in the vertical plane. J Acoust Soc Am 43:1255–1259
Rusconi E, Kwan B, Giordano BL, Umiltà C, Butterworth B (2006)
Spatial representation of pitch height: the SMARC effect. Cognition 99:113–129
Sadaghiani S, Maier JX, Noppeney U (2009) Natural, metaphoric,
and linguistic auditory direction signals have distinct influences
on visual motion processing. J Neurosci 29:6490–6499
Spence C (2011) Crossmodal correspondences: a tutorial review.
Atten Percept Psychophys 73:971–995
Spence C, Deroy O (2012) Crossmodal correspondences: innate or
learned? Iperception 3:316–318
Sweeny TD, Guzman-Martinez E, Ortega L, Grabowecky M, Suzuki
S (2012) Sounds exaggerate visual shape. Cognition 124:194–200
Teramoto W, Hidaka S, Sugita Y (2010a) Sounds move a static visual
object. PLoS ONE 5:e12255
Teramoto W, Manaka Y, Hidaka S, Sugita Y, Miyauchi R, Sakamoto S, Gyoba J, Iwaya Y, Suzuki Y (2010b) Visual motion perception induced by sounds in vertical plane. Neurosci Lett 479:221–225
Walker R (1987) The effects of culture, environment, age, and musical training on choices of visual metaphors for sound. Percept
Psychophys 42:491–502
Walker P, Bremner JG, Mason U, Spring J, Mattock K, Slater A,
Johnson SP (2010) Preverbal infants’ sensitivity to synaesthetic
cross-modality correspondences. Psychol Sci 21:21–25
Zangenehpour S, Zatorre RJ (2010) Crossmodal recruitment of primary visual cortex following brief exposure to bimodal audiovisual stimuli. Neuropsychologia 48:591–600