Dichotic speech recognition in noise using
reduced spectral cues
Philipos C. Loizou a) and Arunvijay Mani
Department of Electrical Engineering, University of Texas at Dallas, P.O. Box 830688, EC 33, Richardson, Texas 75083-0688

Michael F. Dorman
Arizona Biomedical Institute, Arizona State University, Tempe, Arizona 85287

a) Electronic mail: [email protected]
(Received 30 August 2002; revised 25 April 2003; accepted 25 April 2003)

It is generally accepted that the fusion of two speech signals presented dichotically is affected by the relative onset time. This study investigated the hypothesis that spectral resolution might be an additional factor influencing spectral fusion when the spectral information is split and presented dichotically to the two ears. To produce speech with varying degrees of spectral resolution, speech materials embedded in +5 dB S/N speech-shaped noise were processed through 6–12 channels and synthesized as a sum of sine waves. Two different methods of splitting the spectral information were investigated. In the first method, the odd-index channels were presented to one ear and the even-index channels to the other ear. In the second method, the low-frequency channels were presented to one ear and the high-frequency channels to the other ear. Results indicated that spectral resolution did affect spectral fusion, and the effect differed across speech materials, with the sentences being affected the most. Sentences processed through six or eight channels and presented dichotically in the low–high frequency condition were not fused as accurately as when presented monaurally. Sentences presented dichotically in the odd–even frequency condition were identified more accurately than when presented in the low–high condition. © 2003 Acoustical Society of America. [DOI: 10.1121/1.1582861]

PACS numbers: 43.71.Es, 43.71.Ky [KWG]
I. INTRODUCTION
Since the seminal paper by Cherry in the early 1950s (Cherry, 1953) on the "cocktail party" effect and dichotic listening, much work has been done to understand dichotic speech perception (Cutting, 1976). Several researchers have demonstrated the remarkable ability of the brain to fuse two different signals, presented simultaneously to the two ears (i.e., presented dichotically), into a single auditory percept. Fusion of speech signals can occur in many forms (Cutting, 1976). Broadbent and Ladefoged (1957) demonstrated that when listeners were presented simultaneously with a signal containing the F1 of /da/ to the left ear and a signal containing the F2 of /da/ to the right ear, subjects heard the stimulus as /da/. Others (e.g., Halwes, 1970; Cutting, 1976) have demonstrated that if a particular /ba/ is presented to one ear and a particular /ga/ is presented to the other ear, listeners often report hearing a single item, /da/. Sound localization is yet another form of fusion, since the acoustic signals received at each ear differ in amplitude and phase (time difference). Cutting (1976) identified and analyzed six different types of fusion that occur at possibly three different levels of auditory processing. Cutting termed the fusion observed in the Broadbent and Ladefoged (1957) study "spectral" fusion, and that is the focus of this paper. Spectral fusion occurs when different, but complementary, spectral regions of the same signal are presented to opposite ears.
Much work has been done to understand the various factors influencing spectral fusion (Cutting, 1976). Three factors were found to be most important: relative onset time, relative intensity, and fundamental frequency (F0). Cutting (1976) delayed the information reaching one ear with respect to the other, and found that spectral fusion decreased significantly when the delay exceeded 40 ms. Rand (1974) found no effect on fusion when the signal presented to one ear was attenuated by as much as 40 dB, suggesting that spectral fusion is immune to relative intensity variations. Similar experiments were carried out to examine the effect of differences in fundamental frequency (F0) on spectral fusion. When listeners were presented with signals differing in F0, they reported hearing two sounds; the identity of the fused percept, however, was maintained independent of the differences in F0 (Darwin, 1981; Cutting, 1976). Darwin (1981) demonstrated that the identification of ten three-formant vowels was unaffected by formants excited at different fundamentals or starting at 100-ms intervals.
In summary, of the three factors investigated in the literature, the relative onset-timing difference seemed to have the largest effect on dichotic speech perception. In this paper, we investigate whether spectral resolution could be considered another factor that could potentially influence spectral fusion. Spectral resolution needs to be taken into account when dealing with cochlear implant listeners, who are known to receive a limited amount of spectral information. The recent introduction of bilateral cochlear implants has raised the question of whether cochlear implant listeners would be able to fuse speech information presented dichotically.
TABLE I. The 3-dB cutoff frequencies (Hz) of the bandpass filters used in this study. FL and FH indicate the low and high cutoff frequencies, respectively, of the bandpass filters.

            6 channels         8 channels         12 channels
Channel      FL       FH        FL       FH        FL       FH
   1       300.0    487.2     261.9    526.6     191.6    356.8
   2       487.2    791.1     526.6    857.3     356.8    549.5
   3       791.1   1284.5     857.3   1270.4     549.5    774.3
   4      1284.5   2085.8    1270.4   1786.5     774.3   1036.6
   5      2085.8   3387.1    1786.5   2431.2    1036.6   1342.5
   6      3387.1   5500.0    2431.2   3236.7    1342.5   1699.4
   7                         3236.7   4242.9    1699.4   2115.8
   8                         4242.9   5500.0    2115.8   2601.5
   9                                            2601.5   3168.2
  10                                            3168.2   3829.2
  11                                            3829.2   4600.4
  12                                            4600.4   5500.0
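The 6-channel cutoffs in Table I can be recovered from simple logarithmic spacing of the 300–5500 Hz range. The short Python sketch below illustrates the computation; the function name is ours, and the mel-like spacing used for the 8- and 12-channel conditions (linear up to 1000 Hz, logarithmic thereafter; see the signal-processing section below) is not reproduced here because the exact formula is not given in the text.

```python
import numpy as np

def log_spaced_edges(f_low: float, f_high: float, n_channels: int) -> np.ndarray:
    """Return n_channels + 1 logarithmically spaced band edges in Hz."""
    return np.geomspace(f_low, f_high, n_channels + 1)

# Six-channel condition spanning 300-5500 Hz, as in Table I.
edges = log_spaced_edges(300.0, 5500.0, 6)
for ch, (fl, fh) in enumerate(zip(edges[:-1], edges[1:]), start=1):
    print(f"channel {ch}: FL = {fl:6.1f} Hz, FH = {fh:6.1f} Hz")
```

Running the sketch reproduces the 6-channel column of Table I to within about 0.1–0.2 Hz of rounding.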
A possible advantage of dichotic (electrical) stimulation is a reduction in channel interaction, since the electrodes can be stimulated alternately across the two ears (e.g., electrode 1 in the left ear followed by electrode 2 in the right ear, and so on).
To examine the effect of spectral resolution on dichotic speech recognition, noisy speech was processed into a small number (6–12) of channels and presented dichotically to normal-hearing listeners. Two different conditions were considered. In the first condition, low-frequency information was presented to one ear and high-frequency information was presented to the other ear. In the second condition, the frequency information was interleaved between the two ears, with the odd-index frequency channels fed to one ear and the even-index channels fed to the other ear. At issue is whether spectral fusion is affected by (1) poor spectral resolution and/or (2) the way the spectral information is split (low–high versus interleaved) and presented to the two ears. We hypothesize that both the spectral resolution and the type of spectral information presented to the two ears will affect spectral fusion.
II. DICHOTIC LISTENING IN NOISE
A. Method
1. Subjects
Nine normal-hearing listeners (20 to 30 years of age) participated in this experiment. All subjects were native speakers of American English and were undergraduate students at the University of Texas at Dallas; they were paid for their participation.
2. Speech material
Subjects were tested on sentence, vowel, and consonant recognition. The vowel test included the syllables "heed, hid, hayed, head, had, hod, hud, hood, hoed, who'd, heard" produced by male and female talkers. A total of 22 vowel tokens was used for testing: 11 produced by seven male speakers and 11 produced by six female speakers (not all speakers produced all 11 vowels). These tokens were a subset of the vowels used in Loizou et al. (1998) and were selected from a large vowel database (Hillenbrand et al., 1995) to represent the complete area of the vowel space. The consonant test used 16 consonants in /aCa/ context taken from the Iowa consonant test (Tyler et al., 1987). All /aCa/ syllables were produced by a male speaker.

The sentence test used sentences from the HINT database (Nilsson et al., 1994). Two lists of 20 sentences each were used for each condition, and different lists were used in all conditions.
3. Signal processing
The speech material was first low-pass filtered using a sixth-order elliptic filter with a cutoff frequency of 6000 Hz. The filtered speech was then passed through a preemphasis (high-pass) filter with a cutoff frequency of 2000 Hz. This was followed by band-pass filtering into n (n = 6, 8, 12) frequency bands using sixth-order Butterworth filters. The cutoff frequencies of the bandpass filters are given in Table I. Logarithmic frequency spacing was used for n = 6, and mel frequency spacing (linear spacing up to 1000 Hz and logarithmic thereafter) was used for n = 8, 12. The output of each channel was passed through a full-wave rectifier followed by a second-order Butterworth low-pass filter with a cutoff frequency of 400 Hz to obtain the envelope of each channel output. For each channel, a sinusoid was generated with frequency set to the center frequency of the channel and with amplitude set to the root-mean-squared (rms) energy of the channel envelope, estimated every 4 ms. The phases of the sinusoids were estimated from the FFT of the speech segment, as per Loizou et al. (1999). No interpolation was done on the amplitudes or phases to smooth out any discontinuities across the 4-ms segments.
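The analysis-synthesis chain just described can be sketched in a few lines of Python with scipy. This is a minimal illustration, not the authors' code: the sampling rate, the elliptic filter's ripple and stopband specifications, the pre-emphasis filter order, the definition of the channel center frequency, and the use of zero starting phase (the paper derives phases from an FFT of each segment) are all our assumptions.

```python
import numpy as np
from scipy.signal import butter, ellip, lfilter

FS = 22050                 # sampling rate (assumed; not stated in the paper)
FRAME = int(0.004 * FS)    # 4-ms amplitude-update interval

def channel_envelopes(speech, edges):
    """Band-pass analysis and envelope extraction, per the text above."""
    # Sixth-order elliptic low-pass at 6000 Hz (ripple/stopband assumed).
    b, a = ellip(6, 0.5, 40.0, 6000.0, btype="low", fs=FS)
    x = lfilter(b, a, speech)
    # Pre-emphasis high-pass at 2000 Hz (first-order assumed).
    b, a = butter(1, 2000.0, btype="high", fs=FS)
    x = lfilter(b, a, x)
    envs = []
    for fl, fh in zip(edges[:-1], edges[1:]):
        # An order-3 design yields a sixth-order Butterworth bandpass.
        b, a = butter(3, [fl, fh], btype="band", fs=FS)
        band = lfilter(b, a, x)
        # Full-wave rectifier followed by a second-order low-pass at 400 Hz.
        b, a = butter(2, 400.0, btype="low", fs=FS)
        envs.append(lfilter(b, a, np.abs(band)))
    return np.array(envs)

def sinewave_channels(envs, edges):
    """One sinusoid per channel at the band center, with piecewise-constant
    amplitude equal to the rms of the envelope in each 4-ms segment."""
    n = envs.shape[1]
    t = np.arange(n) / FS
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric centers (assumed)
    waves = np.zeros((len(centers), n))
    for k, (env, fc) in enumerate(zip(envs, centers)):
        amp = np.zeros(n)
        for s in range(0, n, FRAME):
            amp[s:s + FRAME] = np.sqrt(np.mean(env[s:s + FRAME] ** 2))
        # Zero starting phase for brevity; the paper estimates the phases
        # from the FFT of each segment (Loizou et al., 1999).
        waves[k] = amp * np.sin(2.0 * np.pi * fc * t)
    return waves
```

A monaural stimulus is then the sum of all rows of `waves`; the dichotic conditions route subsets of the rows to each ear, as described next.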
Two sets of sine waves were synthesized for dichotic presentation, one corresponding to the left ear and one corresponding to the right ear. The sine waves with frequencies corresponding to the left-ear channels were summed to produce the left-ear signal and, similarly, the sine waves corresponding to the right-ear channels were summed to produce the right-ear signal. In the condition, for instance, in which the low-frequency information was presented to the left ear and the high-frequency information was presented to the right ear, the envelope amplitudes corresponding to channels 1 through n/2 were used to synthesize the left-ear signal and the amplitudes corresponding to channels n/2 + 1 through n were used to synthesize the right-ear signal. The levels of the two synthesized signals (left and right) were adjusted so that the sum of the two levels was equal to the rms value of the original speech segment. This was done by multiplying the left and right signals by the same energy-normalization value. Hence, no imbalance was introduced, in terms of level differences or spectral tilt, between the left and right envelope amplitudes.
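Continuing the sketch above, the routing of channels to the two ears and the common-gain level normalization can be written as follows. The function names and the division guard are ours; the key point, per the text, is that one gain is applied to both ears so no interaural level difference or spectral tilt is introduced.

```python
import numpy as np

def ear_indices(n_channels: int, mode: str):
    """Channel indices routed to the (left, right) ears."""
    ch = np.arange(n_channels)
    if mode == "odd-even":
        # Channels 1, 3, 5, ... to one ear; 2, 4, 6, ... to the other.
        return ch[0::2], ch[1::2]
    # "low-high": channels 1..n/2 to one ear, n/2+1..n to the other.
    return ch[: n_channels // 2], ch[n_channels // 2:]

def dichotic_pair(waves, original, mode="odd-even"):
    """Sum the per-channel sinewaves for each ear, then scale both signals
    by one common gain so the two ear levels sum to the rms of the original."""
    left_idx, right_idx = ear_indices(len(waves), mode)
    left = waves[left_idx].sum(axis=0)
    right = waves[right_idx].sum(axis=0)
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    gain = rms(original) / max(rms(left) + rms(right), 1e-12)  # guard vs. silence
    return gain * left, gain * right
```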
4. Procedure
The experiments were performed on a PC equipped with a Creative Labs SoundBlaster sound card. Stimuli were played to the listeners either monaurally or dichotically through Sennheiser HD 250 Linear II circumaural headphones. For the vowel and consonant tests, a graphical user interface enabled the subjects to indicate their response by clicking a button corresponding to the syllable played. For the sentence test, subjects were asked to write down the words they heard. The sentences were scored in terms of percent words identified correctly (all words were scored). During the practice session, the identity of the test syllables (vowels or consonants) and sentences was displayed on the screen.

At the beginning of each test, the subjects were given a practice session in which the speech materials were processed through the same number of channels used in the test and presented monaurally in quiet and in +5 dB speech-shaped noise. For further practice, vowel and consonant tests were administered with feedback to each subject; three repetitions were used in the feedback session. The practice session lasted approximately 2 h. After the practice and feedback sessions, the subjects were tested in the various dichotic and monaural conditions. The vowels and consonants were completely randomized and presented to the listeners six times. No feedback was provided during the test, and all tests were done with speech embedded in +5 dB speech-shaped noise taken from the HINT database.
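Setting the +5 dB S/N is a matter of scaling the masker relative to the speech before the processing described above. A minimal sketch follows; drawing a random segment of the noise file for each stimulus is our assumption, as the paper only states that the speech-shaped noise came from the HINT database.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db=5.0, rng=None):
    """Add a segment of `noise` to `speech` at the requested S/N (dB),
    where S/N = 20*log10(rms_speech / rms_noise)."""
    if rng is None:
        rng = np.random.default_rng()
    start = rng.integers(0, len(noise) - len(speech) + 1)
    seg = noise[start:start + len(speech)].astype(float)
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    seg *= rms(speech) / (rms(seg) * 10.0 ** (snr_db / 20.0))
    return speech + seg
```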
Two different dichotic conditions were considered. In the first condition, which we refer to as the low–high dichotic condition, the low-frequency information (consisting of half of the total number of channels) was presented to one ear, and the high-frequency information (consisting of the remaining high-frequency channels) was presented to the other ear. In the second dichotic condition, which we call the odd–even (or interleaved) dichotic condition, the odd-index frequency channels were presented to one ear, while the even-index channels were presented to the other ear. In the monaural condition, the signal was presented to either the left or the right ear of the subject (chosen randomly). The order in which the conditions and numbers of channels were presented was partially counterbalanced across subjects to avoid order effects. In the vowel and consonant tests, there were six repetitions of each vowel and each consonant, presented in completely random order. A different set of 20-sentence lists was used for each condition.
Pilot data showed that the 12-channel condition in +5 dB S/N yielded performance close to ceiling. Hence, for the 12-channel condition, we performed additional listening experiments to assess whether subjects were indeed integrating the information from the two ears, or whether they were receiving sufficient information in each ear alone. For comparison with the odd–even stimuli presented dichotically, two additional conditions were created: in the first, the odd-index channels were presented to the left ear alone, and in the second, the even-index channels were presented to the right ear alone. Similarly, for comparison with the low–high stimuli presented dichotically, the low-frequency channels (the lower half of the channels) were presented to the left ear alone and the high-frequency channels were presented to the right ear alone. Sentences, vowels, and consonants were processed through the one-ear conditions for the odd–even comparison. Vowels and consonants were also processed through the one-ear conditions for the low–high comparison. No sentences were processed for the low–high comparison, since there was an insufficient number of unique sentences in the HINT database.
B. Results
The mean percent correct scores for sentence, vowel, and consonant recognition are shown in Fig. 1 as a function of the number of channels for the different presentation modes.
1. Sentences
A two-way ANOVA with repeated measures, using spectral resolution (number of channels) and presentation mode (monaural, dichotic odd–even, and dichotic low–high) as within-subject factors, showed a significant main effect of spectral resolution [F(2,16) = 82.05, p < 0.0005], a significant effect of presentation mode [F(2,16) = 10.57, p = 0.001], and a significant interaction [F(4,32) = 3.36, p = 0.02] between spectral resolution and presentation mode. Post hoc Tukey tests (at alpha = 0.05) showed a significant (p < 0.05) difference between the performance obtained with the two dichotic conditions for 12 and 8 channels, but not for 6 channels. There was also a significant difference (p < 0.05) between the performance obtained with dichotic (low–high) presentation and monaural presentation for the 6- and 8-channel conditions, but not for the 12-channel condition.
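For readers who wish to replicate this analysis, a two-way repeated-measures ANOVA of this form can be run with statsmodels. The sketch below assumes a long-format table of per-subject scores; the file and column names are ours.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# One row per subject x channels x mode cell mean, e.g.:
#   subject, channels, mode, score
#   S1,      6,        monaural, 62.1
df = pd.read_csv("sentence_scores.csv")

# Spectral resolution and presentation mode as within-subject factors.
res = AnovaRM(df, depvar="score", subject="subject",
              within=["channels", "mode"]).fit()
print(res)   # F and p values for the two main effects and their interaction
```

Pairwise post hoc comparisons could then be run on the per-subject means within each channel count (e.g., with statsmodels' `pairwise_tukeyhsd`), though that convenience function does not model the repeated-measures structure.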
Figure 2 (top panel) compares the dichotic performance (12 channels) obtained on sentence recognition with the performance obtained when the odd channels were presented to the left ear alone and the even channels were presented to the right ear alone. Post hoc Fisher's LSD tests showed a significant difference (p < 0.05) between the one-ear performance and the dichotic performance on sentence recognition, suggesting that subjects were able to integrate the information from the two ears. That is, the information presented in each ear alone was not sufficient to recognize sentences in +5 dB S/N with high (>90%) accuracy.
2. Vowels
A two-way ANOVA with repeated measures showed a significant main effect of spectral resolution [F(2,16) = 46.05, p < 0.0005], a significant effect of presentation mode [F(2,16) = 4.79, p = 0.023], and a significant interaction [F(4,32) = 7.16, p < 0.0005] between spectral resolution and presentation mode on vowel recognition. Post hoc Tukey tests showed a significant (p < 0.05) difference between each of the dichotic conditions and the monaural condition for eight channels. There was also a significant difference between the two dichotic conditions for six channels.
Figure 2 compares the performance obtained dichotically (for both conditions) with the performance obtained with the left and right ears alone. Post hoc Fisher's LSD tests showed a significant difference (p < 0.05) between the one-ear performance and the dichotic performance on vowel recognition, suggesting that subjects were able to integrate the information from the two ears and obtain a vowel score higher than the score obtained with either ear alone.
3. Consonants
A two-way ANOVA with repeated measures showed a significant main effect of spectral resolution [F(2,16) = 46.05, p < 0.0005], a nonsignificant effect of presentation mode [F(2,16) = 4.79, p = 0.34], and a significant interaction [F(4,32) = 7.16, p = 0.002] between spectral resolution and presentation mode on consonant recognition. As shown in Fig. 1, the presentation mode (dichotic versus monaural) had no effect on consonant recognition.
Figure 2 compares the consonant performance obtained dichotically against the performance obtained with either the left or the right ear alone. Post hoc Fisher's LSD tests showed a significant difference (p < 0.05) between the one-ear performance and the dichotic performance on consonant recognition.
The consonant confusion matrices were also analyzed in terms of percent information transmitted, as per Miller and Nicely (1955). The feature analysis is shown in Fig. 3. A two-way ANOVA performed on each feature separately showed a marginally significant effect of presentation mode for manner (p = 0.04) and voicing (p = 0.047) and a nonsignificant effect for place (p = 0.31). There was a significant interaction for place and voicing (p < 0.005), but not for manner (p = 0.13). There was a significant effect (p < 0.005) of spectral resolution for all three features.
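The Miller and Nicely (1955) measure can be computed directly from a stimulus-response confusion matrix: the transmitted information T(x;y) = sum over (x,y) of p(x,y) log2[p(x,y) / (p(x) p(y))], expressed as a percentage of the input entropy H(x). A sketch follows (the function name is ours); for the feature analysis, the 16 x 16 consonant matrix is first collapsed into the classes of each feature.

```python
import numpy as np

def percent_info_transmitted(confusions: np.ndarray) -> float:
    """Relative transmitted information, 100 * T(x;y) / H(x), from a
    confusion matrix with rows = stimuli and columns = responses."""
    p = confusions / confusions.sum()          # joint probabilities p(x, y)
    px, py = p.sum(axis=1), p.sum(axis=0)      # stimulus / response marginals
    nz = p > 0
    t = np.sum(p[nz] * np.log2(p[nz] / np.outer(px, py)[nz]))
    h = -np.sum(px[px > 0] * np.log2(px[px > 0]))   # input entropy H(x)
    return 100.0 * t / h

# Example: collapse the consonant matrix into a binary voicing feature by
# summing rows and columns within the voiced and voiceless classes, then
# call percent_info_transmitted on the resulting 2 x 2 matrix.
```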
C. Discussion
The above results indicate that the effect of presentation mode (dichotic versus monaural) differed across speech materials and degrees of spectral resolution. The recognition of sentences was affected the most; recognition of vowels was also affected, but to a lesser degree. As is evident from the mean performance on sentences processed through six and eight channels, subjects were not able to fuse sentences presented dichotically in the low–high condition with the same accuracy as when presented monaurally. Subjects were also not able to fuse vowels processed through eight channels and presented dichotically (in either the low–high or odd–even condition) with the same accuracy as when presented monaurally. It should be noted that there was large subject variability in performance (see Fig. 4) for vowels and sentences processed through six channels (see, for instance, subjects S1, S2, and S7 on sentence recognition, and subjects S4, S5, and S7 on vowel recognition). In contrast, subjects were able to fuse vowels and sentences processed through 12 channels very accurately and more consistently. These results suggest that spectral resolution may have a significant effect on spectral fusion, depending on the dichotic presentation mode (low–high versus interleaved). No such effect was found for consonants: consonants were fused accurately with both dichotic presentations regardless of the spectral resolution. The difference in effect on spectral fusion between consonants and the other speech materials suggests that vowels and sentences might be better (more sensitive) speech materials to use in studies of dichotic speech perception. This conclusion must be viewed with caution, however, since the performance variability might be partially due to the variability in the speech material used in this study, with the vowel material being more variable (produced by both male and female talkers, with multiple productions of each vowel) and the consonant material being less variable (produced by a single male talker, with a single production of each consonant).

FIG. 1. Mean identification of sentences (scored in percent words correct), vowels, and consonants as a function of the number of channels and presentation mode (monaural versus dichotic). Error bars indicate standard errors of the mean.

FIG. 2. (Top panel) Mean speech recognition obtained when the odd- and even-index channels were presented dichotically (hatched bars), the odd channels were presented to the left ear only (diagonally filled bars), and the even channels were presented to the right ear only (dark bars). (Bottom panel) Mean vowel and consonant recognition obtained when the low- and high-frequency channels were presented dichotically (hatched bars), the low-frequency channels were presented to the left ear only (diagonally filled bars), and the high-frequency channels were presented to the right ear only (dark bars). Speech materials were processed through 12 channels. Error bars indicate standard errors of the mean.
When interpreting the results on vowel recognition, two confounding factors need to be considered. First, the filter spacing was logarithmic for the 6-channel condition and mel-like for the 8- and 12-channel conditions. We cannot exclude the possibility that a different outcome might have been obtained had a different filter spacing been used, and this warrants further investigation. Second, the 8- and 12-channel conditions had a slightly larger (by about 38–108 Hz) overall bandwidth than the 6-channel condition. Given that the roll-off of the sixth-order filters used was not very sharp, and that the results from our previous study (Loizou et al., 2000a) indicated no significant difference in vowel recognition between different signal bandwidths, we do not believe that the additional bandwidth in the 8- and 12-channel conditions affected the outcome of this study.
No effect of presentation mode on spectral fusion was found in this study in the identification of consonants processed through a small number of channels. This outcome is consistent with the findings of Lawson et al. (1999) with bilateral cochlear implant listeners. Lawson et al. presented consonants in /aCa/ context to two bilateral implant users. The consonants were presented dichotically in the same two conditions used in our study, odd–even and low–high. The bilateral subjects were fitted with six- and eight-channel processors. Results showed a small advantage of dichotic stimulation; however, the difference was not statistically significant. No experiments were done in Lawson et al. (1999) with other speech materials. Identification of consonants, in general, is known to be robust to extremely low spectral-resolution conditions (Shannon et al., 1995; Dorman et al., 1997) and even to conditions in which large segments of the consonant spectra are missing (Lippmann, 1996; Kasturi et al., 2002). Dichotic identification of consonants does not seem to be an exception.
The fact that the listeners were not able to fuse with high accuracy sentences processed through six and eight channels and presented in the low–high dichotic condition cannot be easily explained, given that we did not manipulate the relative onset time, the F0, or the relative intensity of the two signals presented to each ear in this study. One possible explanation is that the information presented in each ear was so impoverished, due to the high noise level and low spectral resolution (three or four channels in each ear), that it was not perceived as speech by the auditory system and hence was not integrated centrally. This explanation rests on the assumption that the ability to fuse a sound presented to one ear with another sound presented to the other ear depends on recognizing that the two components come from the same speech utterance (and probably the same talker). In contrast, in the 12-channel condition the subjects had access to six channels of information in each ear, and we know from Fig. 2 that moderate levels of performance can be achieved in +5 dB S/N with only six channels presented to each ear. A similar outcome was also reported by Dorman et al. (2001) with HINT sentences presented dichotically in quiet, with even channels fed to one ear and odd channels fed to the other ear. Sentences were processed through six channels in a manner similar to this study. Sentence scores obtained monaurally were significantly higher than the scores obtained dichotically.

FIG. 3. Mean identification of the consonant features "manner," "place," and "voicing" as a function of the number of channels and presentation mode (monaural versus dichotic). Error bars indicate standard errors of the mean.
In addition to the significant difference between the dichotic and monaural conditions, a significant difference was also observed between the two dichotic conditions. The way the spectral information was split and presented to the two ears was important for vowel and sentence recognition, but not for consonant recognition. For sentences processed through 8 and 12 channels, the mean scores obtained with the interleaved (odd–even) condition were significantly higher than the scores obtained with the low–high condition. For vowels processed through six channels, the scores obtained with the low–high condition were significantly higher than the scores obtained with the interleaved condition. The higher scores obtained with the low–high condition in vowel recognition may be attributed to the fact that in this condition F1 information (low-frequency information) was presented to one ear and F2 information (higher-frequency information) was presented to the other ear. This condition is presumably easier to deal with than the more challenging interleaved condition, in which pieces of the F1 and F2 information are distributed between the two ears. Subjects did not have difficulty piecing together the F1/F2 information, however, when the vowels were processed through 8 and 12 channels. The above explanation cannot easily be extended to sentences, because the situation with sentences differs from that of vowels: listeners rely on other cues, besides F1/F2 information, for word recognition.
One possible explanation for the higher scores obtained in sentence recognition with the interleaved condition is that it provides greater dichotic release from masking [a phenomenon first reported by Rand (1974)] compared to the low–high condition. Rand (1974) presented the F1 of CV syllables to one ear and the F2, attenuated by 40 dB, to the other ear, and observed that subjects were able to identify the consonants accurately despite the large relative intensity differences between the two formants. However, when he attenuated the upper formants by 30 dB and presented the stimuli to both ears (i.e., diotically), the listeners were unable to identify the consonants accurately. He attributed the advantage of dichotic presentation to release from spectral masking. Others (e.g., Lunner et al., 1993; Pandey et al., 2001; Lyregaard, 1982; Franklin, 1975) have attempted to exploit the dichotic release from masking in bilateral hearing aids and have advocated the use of dichotic presentation as a means of compensating for the poor frequency selectivity of listeners with sensorineural hearing loss. Lunner et al. (1993) fitted three hearing-impaired listeners with eight-channel hearing aids and presented sentences dichotically, with the odd-numbered channels fed to one ear and the even-numbered channels fed to the other ear. An improvement of 2 dB in speech reception threshold (SRT) was found compared to the condition in which the sentences were presented binaurally. Franklin (1975) investigated the effect of presenting a low-frequency band (240–480 Hz) and a high-frequency band (1020–2040 Hz) on consonant recognition in six hearing-impaired listeners with moderate to severe sensorineural hearing loss. The scores obtained when the low- and high-frequency bands were presented to opposite ears were significantly higher than the scores obtained when the two bands were presented to the same ear. Unlike the above two studies, a few other studies (e.g., Lyregaard, 1982; Turek et al., 1980) found no benefit of dichotic presentation with hearing-impaired listeners tested on synthetic /b d g/ identification (Turek et al., 1980) or vowel/consonant identification (Lyregaard, 1982).

FIG. 4. Individual subject performance for speech materials processed through six channels.
In the present study, speech was synthesized as a sum of a small number (6–12) of sine waves. Sinewave speech was advocated in our earlier work (Dorman et al., 1997) as an alternative [to noise-band simulations (Shannon et al., 1995)] acoustic model for cochlear implants. This acoustic model is by no means meant to mimic the percept elicited by electrical stimulation, but rather to serve as a tool for assessing CI listeners' performance in the absence of the confounding factors (e.g., neuron survival, electrode insertion depth, etc.) associated with cochlear implants. We previously demonstrated on tests of reduced intensity resolution (Loizou et al., 2000b), reduced spectral contrast (Loizou and Poroy, 2001), and number of channels of stimulation (Dorman and Loizou, 1998) that the performance of normal-hearing subjects listening to speech processed through the sinewave acoustic model is similar to that of at least the better-performing patients with cochlear implants. In that context, the results of the present study might provide valuable insights into bilateral cochlear implants. The performance obtained with the interleaved dichotic condition was not significantly different from the performance obtained monaurally when the speech materials were processed through 12 channels. For bilateral patients receiving a large number of channels (>12), dichotic stimulation might therefore provide an advantage over unilateral stimulation, in that it can potentially reduce channel interactions, since the electrodes can be stimulated in an interleaved fashion across the ears without sacrificing performance. For patients receiving only a small number of channels of information, dichotic (electric) stimulation might not produce the same level of performance as unilateral (monaural) stimulation, at least for sentences presented in the low–high dichotic condition. The large variability in performance among subjects should be noted, however (Fig. 4). Lastly, of the two methods that can be used to present spectral information dichotically, the interleaved method is recommended, since it consistently outperformed the low–high method in sentence recognition.
III. SUMMARY AND CONCLUSIONS
This study investigated the hypothesis that spectral resolution might be a factor, in addition to relative onset time, influencing spectral fusion when the spectral information is split and presented dichotically to the two ears. Two different methods of splitting the spectral information were investigated. In the first method, which we refer to as the odd–even method, the odd-index channels were presented to one ear and the even-index channels to the other ear. In the second method, which we refer to as the low–high method, the lower-frequency channels were presented to one ear and the high-frequency channels to the other ear. The results of the present study suggest the following.
(i) Spectral resolution did affect spectral fusion, and the effect differed across speech materials and the two dichotic presentation modes considered. The recognition of sentences presented in the low–high dichotic condition was affected the most. Sentences processed through six and eight channels and presented dichotically in the low–high condition were not fused as accurately as when presented monaurally. Vowels processed through eight channels and presented dichotically (in either the interleaved or low–high condition) were not fused as accurately as when presented monaurally. In contrast, vowels and sentences processed through 12 channels were fused very accurately.

(ii) Dichotically presented consonants were fused as well as monaurally presented consonants, independent of spectral resolution (number of channels) and of the way the spectral information was split and presented to the two ears. That is, the performance obtained monaurally was the same as the performance obtained dichotically in all conditions. The difference in outcomes between the consonants and the other speech materials suggests that perhaps the vowel and sentence materials are better (more sensitive) materials to use in studies of dichotic speech recognition.

(iii) A significant difference in performance was found in vowel and sentence recognition between the two methods used to split the spectral information for dichotic presentation. For sentence recognition, the scores obtained with the odd–even (interleaved) dichotic method were significantly higher than the scores obtained with the low–high dichotic method. One possible explanation for this difference is that the odd–even method provides a greater dichotic release from masking compared to the low–high method.
ACKNOWLEDGMENTS
The authors would like to thank Dr. Richard Tyler and Dr. Ken Grant for providing valuable comments on this paper. This research was supported by Grant No. R01 DC03421 from the National Institute on Deafness and Other Communication Disorders, NIH. This project was the basis for the Master's thesis of the second author (AM) in the Department of Electrical Engineering at the University of Texas at Dallas.
Broadbent, D. E., and Ladefoged, P. (1957). "On the fusion of sounds reaching different sense organs," J. Acoust. Soc. Am. 29, 708–710.
Cherry, E. C. (1953). "Some experiments on the recognition of speech, with one and with two ears," J. Acoust. Soc. Am. 25, 975–979.
Cutting, J. E. (1976). "Auditory and linguistic process in speech perception: Inferences from six fusions in dichotic listening," Psychol. Rev. 83, 114–140.
Darwin, C. J. (1981). "Perceptual grouping of speech components differing in fundamental frequency and onset-time," Q. J. Exp. Psychol. A 33A, 185–207.
Dorman, M., and Loizou, P. (1998). "The identification of consonants and vowels by cochlear implant patients using a 6-channel CIS processor and by normal hearing listeners using simulations of processors with two to nine channels," Ear Hear. 19, 162–166.
Dorman, M., Loizou, P., and Rainey, D. (1997). "Speech intelligibility as a function of the number of channels of stimulation for signal processors using sine-wave and noise-band outputs," J. Acoust. Soc. Am. 102, 2403–2411.
Dorman, M., Loizou, P., Spahr, A. J., Maloff, E. S., and Wie, S. V. (2001). "Speech understanding with dichotic presentation of channels: Results from acoustic models of bilateral cochlear implants," 2001 Conference on Implantable Auditory Prostheses, Asilomar, Monterey, CA.
Franklin, B. (1975). "The effect of combining low and high-frequency bands on consonant recognition in the hearing impaired," J. Speech Hear. Res. 18, 719–727.
Halwes, T. G. (1970). "Effects of dichotic fusion on the perception of speech," doctoral dissertation, University of Minnesota, Dissertation Abstracts International 31, p. 1565B (University Microfilms No. 70-15,736).
Hillenbrand, J., Getty, L., Clark, M., and Wheeler, K. (1995). "Acoustic characteristics of American English vowels," J. Acoust. Soc. Am. 97, 3099–3111.
Kasturi, K., Loizou, P., and Dorman, M. (2002). "The intelligibility of speech with holes in the spectrum," J. Acoust. Soc. Am. 112, 1102–1111.
Lawson, D., Wilson, B., Zerbi, M., and Finley, C. (1999). "Speech processors for auditory prostheses," Fourth Quarterly Progress Report, Center for Auditory Prosthesis Research, Research Triangle Institute.
Lippmann, R. P. (1996). "Accurate consonant perception without mid-frequency speech energy," IEEE Trans. Speech Audio Process. 4(1), 66–69.
Loizou, P., and Poroy, O. (2001). "Minimum spectral contrast needed for vowel identification by normal-hearing and cochlear implant listeners," J. Acoust. Soc. Am. 110, 1619–1627.
Loizou, P., Dorman, M., and Powell, V. (1998). "The recognition of vowels produced by men, women, boys and girls by cochlear implant patients using a six-channel CIS processor," J. Acoust. Soc. Am. 103, 1141–1149.
Loizou, P., Dorman, M., and Tu, Z. (1999). "On the number of channels needed to understand speech," J. Acoust. Soc. Am. 106, 2097–2103.
Loizou, P., Poroy, O., and Dorman, M. (2000a). "The effect of parametric variations of cochlear implant processors on speech understanding," J. Acoust. Soc. Am. 108, 790–802.
Loizou, P., Dorman, M., Poroy, O., and Spahr, T. (2000b). "Speech recognition by normal-hearing and cochlear implant listeners as a function of intensity resolution," J. Acoust. Soc. Am. 108, 2377–2388.
Lunner, T., Arlinger, S., and Hellgren, J. (1993). "8-channel digital filter bank for hearing aid use: Preliminary results in monaural, diotic and dichotic modes," Scand. Audiol. Suppl. 38, 75–81.
Lyregaard, P. E. (1982). "Frequency selectivity and speech intelligibility in noise," Scand. Audiol. Suppl. 15, 113–122.
Miller, G. A., and Nicely, P. E. (1955). "An analysis of perceptual confusions among some English consonants," J. Acoust. Soc. Am. 27, 338–352.
Nilsson, M., Soli, S., and Sullivan, J. (1994). "Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise," J. Acoust. Soc. Am. 95, 1085–1099.
Pandey, P. C., Jangamashetti, D. S., and Cheeran, A. N. (2001). "Binaural dichotic presentation to reduce the effect of temporal and spectral masking in sensorineural hearing impairment," 142nd Meeting of the Acoustical Society of America, Ft. Lauderdale, FL.
Rand, T. C. (1974). "Dichotic release from masking for speech," J. Acoust. Soc. Am. 55, 678–680.
Shannon, R., Zeng, F.-G., Kamath, V., Wygonski, J., and Ekelid, M. (1995). "Speech recognition with primarily temporal cues," Science 270, 303–304.
Turek, S., Dorman, M., Franks, J., and Summerfield, Q. (1980). "Identification of synthetic /b d g/ by hearing impaired listeners under monotic and dichotic formant presentation," J. Acoust. Soc. Am. 67, 1031–1040.
Tyler, R., Preece, J., and Lowder, M. (1987). "The Iowa audiovisual speech perception laser videodisc," Laser Videodisc and Laboratory Report, Dept. of Otolaryngology, Head and Neck Surgery, University of Iowa Hospital and Clinics, Iowa City.