SPEECH IS AT LEAST 4-DIMENSIONAL: RECEPTIVE FIELDS IN
TIME-FREQUENCY
Jeff A. Bilmes† and Dan Ellis†
{bilmes,dpwe}@icsi.berkeley.edu
Department of Electrical Engineering and Computer Sciences
University of California at Berkeley
Berkeley, CA 94704, USA
† International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704, USA
1. INTRODUCTION
The successful integration of temporal information is crucial for speech recognition: for example, dynamic time warping and more recently hidden Markov models (HMMs) have been critical in speech recognition technology, and both these methods amount to time-aligning feature vectors to stored word and sentence templates. The correlogram [14] relies on the temporal processing of individual outputs of a cochlea-inspired filter bank. Furui [6] posited the existence of perceptual critical points located in time and containing information crucial for the perception of speech, and this idea inspired the statistical speech recognition model SPAM [11], which focuses modeling power on points of transition or maximal spectral change. This suggests that certain time points of a speech signal are in some sense more "important" than others for a recognition task. With HMMs, the locations of the state transitions influence the particular realization of a model and therefore its likelihood; states between transitions are of secondary importance. The SPAM results [1, 12] in fact show that we can group these "non-transitional" states together into a single broad category and still achieve a good recognition error rate.
All the above methods, however, neglect another potential axis of discrimination, namely the cross-spectro-temporal co-information. Non-convex or disjoint patterns with significant temporal and spectral extent cannot be detected directly. Instead, existing methods focus either across frequency on a small slice of time (e.g., LPC cepstral features of a single speech frame) or across time on a small slice of frequency (e.g., the correlogram).
There are in fact several results suggesting that the utilization of cross-spectro-temporal co-information can have a beneficial effect on speech processing and recognition. In the artificial neural network (ANN) speech community [10], it has been shown that using multi-frame context windows can improve recognition scores. The loss of independent feature vectors notwithstanding, ANNs with such "wide" context windows have the potential to learn time- and frequency-like patterns, depending on the features used. In [2], it was shown that a cross-channel correlation algorithm can be used to find formants in voiced speech in high-noise situations. In [5], it was shown that cross-channel correlation can be used to identify individual sound sources in a mixed auditory scene. Also, in [8], it was suggested that long-term cross-channel correlation could be used as a measure of speech quality.
It can also be argued that the use of cross-spectro-temporal information is biologically plausible. Echoic memory is a temporary buffer in the auditory system that holds pre-attentive information for a brief period of time before subsequent, more detailed, and more taxing processing takes place [13]. It is likely that this storage occurs at the post-cochlear level, as we have no evidence for such memory before or during cochlear processing. Therefore, the echoic storage can plausibly be thought of as a form of processed spectro-temporal buffer. Thus assumed, it would be surprising if subsequent processing did not attempt to find patterns utilizing not just the temporal or spectral axes alone, but shaped regions spanning both time and frequency.
Therefore, it may be postulated that the auditory system has the capability, over a 200 ms time-span comparable to echoic store, to observe the co-occurrence of information in different spectro-temporal regions. Similar to the cells in the visual system that respond to particular shapes, one may consider receptive fields over a form of post-cochlear spectro-temporal plane. Later stages of the auditory system could derive arbitrarily shaped regions that perhaps dynamically scale, shift, and transform according to a variety of control mechanisms.
In this paper, we consider a new representation of speech that attempts to explicitly represent non-convex spectro-temporal co-information. Section 2 discusses the computational aspects of our representation. Section 3 illustrates with an example. Finally, section 4 discusses current and future work.
2. THE MODCROSSGRAM
The most general (and most naive) way of computationally
encoding the information described above is brute force.
That is, given a time-frequency grid, we could derive features based on the co-information among all pairs, triples,
quadruples, etc. of grid elements. This clearly would be an
infeasible representation for current speech recognition systems. To mitigate this combinatorial explosion of features,
we henceforth consider only pairs of grid points within a
limited spectro-temporal region and resolution.
The modcrossgram (modulation envelopes cross-correlated) is a new feature extraction method that may be used to compute such spectro-temporal co-information. The processing is as follows (see Figure 1): We first compute the modulation envelopes in each channel of a critical-band-like filterbank by rectifying and band-pass filtering. As early as 1939 [4], modulation envelopes have been shown to carry the crucial phonetic information of a speech signal. Also, since the envelopes are narrow-band, they may be down-sampled to recover full bandwidth and reduce subsequent computational demands. Note that we band-pass rather than low-pass filter to remove low-frequency modulation energy known to be of little importance to the speech signal [3, 7].
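As an illustrative sketch only (not the implementation used for this work), the following Python fragment computes one channel's modulation envelope by rectification, band-pass filtering, and down-sampling; the modulation band edges and the down-sampling factor are assumed values chosen for the example.

    import numpy as np
    from scipy.signal import butter, lfilter, decimate

    def modulation_envelope(channel, fs, band=(2.0, 16.0), down=8):
        # channel: output of one critical-band-like filterbank channel,
        # sampled at fs Hz.  band and down are illustrative assumptions.
        rectified = np.abs(channel)  # rectification
        # Band-pass (not low-pass) so that DC and very slow modulation
        # energy, of little importance to speech, is removed.
        b, a = butter(2, band, btype="band", fs=fs)
        envelope = lfilter(b, a, rectified)
        # The envelope is narrow-band, so it may be down-sampled to
        # reduce the cost of the subsequent cross-correlations.
        return decimate(envelope, down)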
The modulation envelopes are then processed by short-term cross-correlation, defined as follows:

$$R_{i,j}(t, \ell) = \sum_{k=0}^{N} x_i(t+k)\, x_j(t+k+\ell)\, w_k \qquad (1)$$
where $x_i$ is the $i$th envelope channel, $t$ is the starting offset within the signals, $\ell$ is the correlation lag, $N$ is the number of points used to compute the correlation, and $w_k$ are windowing coefficients. All pairs of channels are processed by the above for each desired time step. The result is, for each $t$, a rectangular prism whose 3 axes are indexed respectively by the first frequency channel, the second frequency channel, and the correlation lag (see Figure 1). The resulting representation provides an estimate of the co-information between two frequency channels separated by a given lag $\ell$ starting at absolute position $t$.
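As a hypothetical rendering of Equation 1 (the function name, data layout, and explicit loops are ours, chosen for clarity rather than speed), the sketch below fills the rectangular prism for a single starting offset t:

    import numpy as np

    def modcrossgram_slice(envelopes, t, N, max_lag, window):
        # envelopes: array of shape (channels, samples); window holds the
        # N+1 coefficients w_k, since the sum in Equation 1 runs k = 0..N.
        C, S = envelopes.shape
        assert t - max_lag >= 0 and t + max_lag + N + 1 <= S  # keep slices valid
        lags = np.arange(-max_lag, max_lag + 1)  # bipolar lag axis
        R = np.zeros((C, C, lags.size))
        for i in range(C):
            xi = envelopes[i, t : t + N + 1]
            for j in range(C):
                for n, lag in enumerate(lags):
                    xj = envelopes[j, t + lag : t + lag + N + 1]
                    R[i, j, n] = np.sum(xi * xj * window)
        return R

Stacking such prisms over successive values of t yields the full 4-dimensional representation.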
General cross-correlation between signals $x$ and $y$ is a function of two lags. If we assume joint stationarity between the signals, the correlation is only dependent on one variable, the difference between the two lags, and possesses the property $R_{xy}(\ell) = R_{yx}(-\ell)$. Because Equation 1 is a short-term cross-correlation (i.e., because we always use $N$ points to compute the correlation for each $t$ and $\ell$), this property no longer holds. Therefore, we must potentially consider both $R_{i,j}(t, \ell)$ and $R_{j,i}(t, \ell)$ for all $\ell$.
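This can be checked numerically with arbitrary synthetic signals (all names and values below are illustrative): because the analysis window stays anchored at t, the two orderings of a channel pair sample different stretches of signal and generally disagree.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(400)  # stand-ins for two envelope channels
    y = rng.standard_normal(400)
    N, t, lag = 32, 100, 5
    w = np.hanning(N + 1)

    def R(a, b, t, lag):
        # Equation 1 for a single (t, lag) pair.
        return np.sum(a[t : t + N + 1] * b[t + lag : t + lag + N + 1] * w)

    # R(x, y, t, lag) windows x over [t, t+N] and y over [t+lag, t+lag+N],
    # while R(y, x, t, -lag) windows y over [t, t+N] and x over [t-lag, t-lag+N];
    # these cover different samples, so the two values differ in general.
    print(R(x, y, t, lag), R(y, x, t, -lag))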
The modcrossgram computes more than the co-occurrence of energy between two spectro-temporal regions. Because our envelopes have been band-pass rather than low-pass filtered, the DC components have been removed. Therefore, Equation 1 may find correlation of the envelopes' spectral components, within limits determined by the band-pass filters, without being dominated by positive correlation from DC offsets.
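A small synthetic demonstration of this point (the signals are stand-ins, not speech data): two independent envelopes riding on positive DC offsets correlate strongly through the offsets alone, whereas after mean removal, which the band-pass filtering accomplishes, the correlation is near zero.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 256
    w = np.hanning(n)
    a = 1.0 + 0.1 * rng.standard_normal(n)  # independent envelopes with
    b = 1.0 + 0.1 * rng.standard_normal(n)  # positive DC offsets

    dc_dominated = np.sum(a * b * w)                          # large and positive
    dc_removed = np.sum((a - a.mean()) * (b - b.mean()) * w)  # near zero
    print(dc_dominated, dc_removed)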
Given a representation as described above, a subsequent pattern recognition algorithm can form receptive fields based on disconnected spectro-temporal regions. In addition, deltas computed from such receptive fields can determine spectral change on non-convex, temporally or spectrally separated regions, rather than just collapsed across frequency. By discovering the $(i, j, \ell)$ positions that contain the information crucial to speech, features can potentially be formed that are preferentially sensitive to speech-like sounds.
3. EXAMPLES
The modcrossgram is inherently 4-dimensional: for each time point $t$, the result is a 3-dimensional rectangular prism indexed by channel numbers $i$ and $j$, and a bipolar lag time $\ell$. It is difficult to visualize all of its information simultaneously.
Figure 2 shows 3 plots. The top plot is the normal spectrogram of the utterance "ka ga." As the spectrogram shows, these two syllables differ principally in the voicing onset time (the delay between the stop-burst and the start of the periodic voicing), which is about 80 ms in "ka" but nearly simultaneous in "ga" [9].
The middle plot shows the complete modcrossgram from a 22-channel filter bank for the utterance, time-aligned to the spectrogram. This picture shows the top 60 dB of the positive correlation, a maximum bipolar lag span of 250 ms (20 frames), and uses 37.5 ms (3 frames) for each correlation point. At each frame number and frequency channel, a small matrix shows the correlation between that channel and all other channels (vertical axis) and the lag from -10 frames to 10 frames (horizontal axis). Because it is difficult to digest such a large quantity of data, we provide a third plot.
The bottom plot shows the top 60 dB of the positive cross-correlation between channels 17 and 8. Each horizontal stripe corresponds to a receptive field over the co-information between these two frequency channels at the corresponding lag. Observe that for "ka", at around frame 70, we see significant correlation at positive lags; a little later, we see significant correlation at negative lags. This reflects the timing difference between the initial stop release and the subsequent voiced onset. As expected, these correlations are not observed for "ga", which exhibits quite different patterns around frame 130.
There are of course many other receptive fields available for observation in the modcrossgram. Our belief is that some combination of these will be useful for a pattern recognizer to discriminate between speech sounds.
4. CURRENT AND FUTURE WORK
In this paper, we have introduced the modcrossgram, a new representation of speech signals. We are currently in the process of integrating features derived from the modcrossgram into a standard speech recognition system. In the near future, we plan to use the modcrossgram directly to derive receptive fields that will be appropriate to speech, and then use these as features for a speech recognition system.
REFERENCES
[1] J. Bilmes, N. Morgan, S.-L. Wu, and H. Bourlard. Stochastic perceptual speech models with durational dependence. ICSLP, November 1996.
[2] L. Deng and I. Kheirallah. Dynamic formant tracking of noisy speech using temporal analysis on outputs from a nonlinear cochlear model. IEEE Trans. Biomedical Engineering, 40(5):456-467, May 1993.
[3] R. Drullman, J. M. Festen, and R. Plomp. Effect of reducing slow temporal modulations on speech reception. JASA, 95(5):2670-2680, May 1994.
[4] H. Dudley. Remaking speech. JASA, 11(2):169-177, October 1939.
[5] D. P. W. Ellis. A simulation of vowel segregation based on across-channel glottal-pulse synchrony. Technical Report 252, MIT Media Lab Perceptual Computing, Cambridge MA, 02139, 1993. Pres. 126th meeting of the Acoustical Society of America, Denver, Oct. 1993.
[6] S. Furui. On the role of spectral transition for speech perception. JASA, 80(4):1016-1025, October 1986.
[7] H. Hermansky and N. Morgan. RASTA processing of speech. IEEE Trans. on Speech and Audio Processing, 2(4):578-589, October 1994.
[8] T. Houtgast and J. A. Verhave. A physical approach to speech quality assessment: Correlation patterns in the speech spectrogram. In EUROSPEECH 91, volume 1, pages 285-288. Istituto Int. Comunicazioni, 1991.
[9] R. D. Kent, J. Dembowski, and N. J. Lass. The acoustic characteristics of American English. In N. J. Lass, editor, Principles of Experimental Phonetics, chapter 5. Mosby, 1996.
[Figure 1: block diagram. The input sound passes through a filterbank; each channel is rectified by an envelope follower (|x|); pairs of envelope channels i and j are cross-correlated over N-point windows, producing, for each time step (t = 1, t = 2, ...), a prism indexed by i, j, and lag.]
Figure 1. The process used to compute the ModCrossGram, resulting in a 4-dimensional representation.
[Figure 2: three time-aligned panels. Top: spectrogram, freq (Hz) 0-4000 vs. time (secs) 0-1.2. Middle: modcrossgram, frequency channels (CFs 195-2625 Hz) vs. 12.5 ms frame number 60-160, 0-60 dB scale. Bottom: lag (frames) -10 to 10 vs. 12.5 ms frame number 60-160, 0-60 dB scale.]
Figure 2. Top: Spectrogram; Middle: Modcrossgram; Bottom: Correlation between channels 17 (CF 1560 Hz) and 8 (CF 328 Hz).
[10] R. P. Lippmann. Review of neural networks for speech recognition. In A. Waibel and K.-F. Lee, editors, Readings in Speech Recognition, pages 374-392. Morgan Kaufmann, 1990.
[11] N. Morgan, H. Bourlard, S. Greenberg, and H. Hermansky. Stochastic perceptual auditory-event-based models for speech recognition. Proc. Intl. Conf. on Spoken Language Processing, pages 1943-1946, September 1994.
[12] N. Morgan, S.-L. Wu, and H. Bourlard. Digit recognition with stochastic perceptual models. Proc. Eurospeech'95, September 1995.
[13] U. Neisser. Cognitive Psychology. Appleton-Century-Crofts, 1967.
[14] M. Slaney and R. F. Lyon. On the importance of time: a temporal representation of sound. In M. Cooke, S. Beet, and M. Crawford, editors, Visual Representations of Speech Signals, pages 95-116. John Wiley and Sons Ltd., 1994.