The pitch of short-duration fundamental frequency glissandos

The pitch of short-duration fundamental frequency glissandos
Christophe d’Alessandro, Sophie Rosset, and Jean-Pierre Rossi
LIMSI–CNRS, BP 133 F-91403 Orsay, France
~Received 9 October 1997; accepted for publication 7 July 1998!
Pitch perception for short-duration fundamental frequency (F0) glissandos was studied. In the first
part, new measurements using the method of adjustment are reported. Stimuli were F0 glissandos
centered at 220 Hz. The parameters under study were: F0 glissando extents ~0, 0.8, 1.5, 3, 6, and
12 semitones, i.e., 0, 10.17, 18.74, 38.17, 76.63, and 155.56 Hz!, F0 glissando durations ~50, 100,
200, and 300 ms!, F0 glissando directions ~rising or falling!, and the extremity of F0 glissandos
matched ~beginning or end!. In the second part, the main results are discussed: ~1! perception seems
to correspond to an average of the frequencies present in the vicinity of the extremity matched; ~2!
the higher extremities of the glissando seem more important; ~3! adjustments at the end are closer
to the extremities than adjustments at the beginning. In the third part, numerical models accounting
for the experimental data are proposed: a time-average model and a weighted time-average model.
Optimal parameters for these models are derived. The weighted time-average model achieves a 94%
accurate prediction rate for the experimental data. The numerical model is successful in predicting
the pitch of short-duration F0 glissandos. © 1998 Acoustical Society of America.
@S0001-4966~98!03810-7#
PACS numbers: 43.66.Hg, 43.75.Bc @DWG#
INTRODUCTION
In natural audio signals like speech or music, fundamental frequency (F0) is almost never constant. The main psychological correlate of F0 is pitch. It is well known that the
physical aspect (F0) and the psychological aspect ~pitch!
can differ significantly, particularly for sounds whose fundamental frequency changes. The aim of the present article is
to report new measurements and to propose a numerical
model for pitch perception of short-duration F0 glissandos.
Short-duration F0 glissandos are thought of as prototypes or
building blocks for the F0 contours that can be encountered
in many natural signals, like notes in music or syllables in
speech.
Several theories accounting for human instantaneous
pitch detection are available, together with the corresponding
pitch detection algorithms @see Hess ~1983! or Hermes
~1993! for reviews#. These theories and techniques focus on
instantaneous, or at least short-term, attributes of the acoustic
signal. Only a few periods of signals are considered in order
to compute the time-varying instantaneous fundamental frequency. However, the link between instantaneous F0 contours and pitch perception for larger units, like F0 glissandos, is by no means straightforward. It is therefore important
to distinguish between instantaneous F0, which is a relatively well-known problem, and pitch perception for shortduration F0 glissandos.
Perceptual thresholds for predicting under what conditions F0 changes are audible have long been studied. A first
perceptual threshold is the glissando threshold, or absolute
threshold of pitch change. It is expressed in frequency unit
per unit of time. Below the threshold, a F0 glissando is
perceived with constant pitch. Above the threshold, a pitch
glissando is perceived. Data on the glissando threshold for
speechlike sounds have been obtained by Klatt ~1973!,
Rossi1 ~1971, 1978a!, and Schouten ~1985!, following the
2339
J. Acoust. Soc. Am. 104 (4), October 1998
pioneering work of Sergeant and Harris ~1962! for pure
tones. A unified view of this problem was presented by ’t
Hart ~1976! and ’t Hart et al. ~1990!, who conclude that the
threshold can be approached by G t 50.16/T 2 , where G t is
the glissando threshold ~in semitones/s! and T the tone duration. When a glissando is above the glissando threshold, a
separate perception of the high and low extremities of the F0
patterns appears. Nabelek et al. ~1970! first studied this effect in a work on the pitch for tone bursts of changing frequency. They coined the term ‘‘separation’’ for when only
the final part of the burst contributes to the pitch judgment,
and the term ‘‘fusion’’ for when the overall F0 pattern contributes to the pitch judgment. Thresholds for separation and
fusion are consistent with the glissando threshold.
Thresholds can predict whether a given F0 glissando
~with a certain duration and a certain extent! will be perceived as a constant pitch tone or as a pitch glissando. However, the data on perceptual thresholds for pitch changes do
not address the problem of what pitch is actually perceived,
i.e., what the subjective pitch value is at the beginning and
end of the F0 glissando. The first aim of this study is to give
explicit measurements of the pitch perceived at the beginning and end of F0 glissandos. One of the few studies addressing this problem can be found in Rossi ~1971, 1978a!,
where natural and synthetic vowels were used. Unfortunately, the experimental conditions used by Rossi were not
likely to give accurate results, because of the choice of subjects ~a group of students in phonetics whose individual capabilities in pitch discrimination were not tested! and because of the stimulus presentation method ~loudspeakers in a
classroom!. Therefore, new and more accurate measurements
were needed.
Another issue which needs to be reinvestigated is
whether there are asymmetries in the perception of supraliminar rising and falling F0 glissandos. Several articles
0001-4966/98/104(4)/2339/10/$15.00
© 1998 Acoustical Society of America
2339
have noted such asymmetries, but the results have not always
been consistent. Hombert ~1975! and Rump and Hermes
~1996! found that the high register has more weight in intonation perception. Brady ~1961! and Wieringen and Pols
~1995! reported that for pure tone glides, the end of the tone
has more perceptual weight than the beginning. For pure
tones and complex tones, Demany and McAnally ~1994! and
Demany and Clément ~1995, 1997!, found an asymmetry
between peaks and troughs for large frequency modulations.
However, Rossi ~1978a! reported that no differences occurred in the perception of rising and falling glissandos. In
this article, we compare the perception of rises and falls over
a wide range of conditions.
The final purpose of this work is to propose a numerical
model of pitch perception for short-duration F0 glissandos.
A first model addressing this problem was proposed by Rossi
~1971, 1978a!, using natural and synthetic vowels. Rossi
postulated that for vowels with F0 glissandos ~either rising
or falling!, the pitch perceived at the end of vowels corresponds to an averaging of F0 toward the end of the glissando. This is expressed in Rossi’s so-called ‘‘2/3 rule,’’
which can be stated as follows:
For dynamic tones in a vowel, the pitch perceived corresponds to a point between the second
and the third third of the vowel.
For example, if a linear glissando between 100 and 200
Hz lasts 150 ms, this rule predicts that the pitch perceived is
somewhere in the F0 values of the glissando in the time
interval 100–150 ms, i.e., between 166.6 and 200 Hz. In
Rossi’s terminology, a dynamic tone is a tone with perceived
pitch change. Of course, this rule applies only to the experimental situation envisaged in Rossi’s experiments, namely
pitch perception measured at the end of tones in vowels. This
model was also applied to the pitch of glidelike F0 curves in
folk songs. Ross ~1987! reported good agreement between
his experimental results and Rossi’s rule. In mathematical
terms, Rossi’s rule is an example of a time-average ~TA!
model of pitch perception. Pitch is computed by averaging
F0 on the time interval of the tone:
p5
* t2
t1 F0 ~ t ! d t
* t2
t1 d t
,
~1!
where p is the pitch perceived, F0( t ) is the time-varying
fundamental frequency, t1 and t2 are the averaging time
limits. In systems for automatic intonation analysis, several
authors used ‘‘pragmatic’’ TA models. For instance, Mertens
~1987! and House ~1990! computed pitch as the average of
F0 on 20–30 ms at the end of vowels in syllables.
In a study on pitch perception for short-duration vibrato
tones in singing, d’Alessandro and Castellengo ~1994!
showed that TA models were not suitable in this situation.
The main problem was that TA models were not able to give
more perceptual weight to the end of tones. They proposed
instead a weighted time-average ~WTA! model of pitch perception. Pitch is computed as the time average of F0 viewed
through a time window:
2340
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
FIG. 1. The four parts ABCD of the experiments.
p5
a ~ t 2t2 !
* t2
1 b ! F0 ~ t ! d t
t1 ~ e
a ~ t 2t2 !
* t2
1b !dt
t1 ~ e
,
~2!
where the constants b and a define the raised exponential
window. The constant a is the factor of the exponent in the
window. It gives more or less weight to the final part of the
tone. The constant b is the height of the raised exponential
window, i.e., the height of a rectangular window, whose
width is the tone duration. Contrary to a, the constant b
gives equal weight to all parts of the tone. Thus b controls
the weight given to the whole tone, while a controls the
weight given to the end of the tone. The optimal parameters
were a 522 and b 50.20. This WTA model was also successfully applied to intonation stylization in speech in
d’Alessandro and Mertens ~1995!.
In summary, the aim of the present study is to address
the following matters: ~1! to give new measurements of the
pitch perceived at the beginning and end of short-duration
F0 glissandos; ~2! to study the asymmetries in perception of
rises and falls; ~3! to evaluate a WTA model of F0 integration for pitch perception. This model should be able to predict the pitch perceived at the beginning and end of an F0
glissando. In Sec. I, the experimental method is described.
Then the results are presented and analyzed in Sec. II. A
numerical model accounting for our data is proposed and
discussed in Sec. III. Section IV concludes.
I. METHOD
A. Conditions
The aim of these experiments is to measure the pitch
perceived for short-duration F0 glissandos. Linear F0 glissandos are of the form F0( t )5a3 t 1b, where b is the
frequency ~in Hz! at the beginning of the tone, a is the slope
of the tone ~if d is the duration of the tone, a3d1b is the
frequency at the end of the tone!, and t represents time. The
stimulus are pairs of synthetic vowels, with a 300-ms silent
interval between each member of a stimulus pair. One member of a pair is a glissando ~either a rise or a fall!. The other
member of a pair is flat ~constant F0! with the same duration. The test is divided into four parts ABCD, summarized
in Fig. 1.
d’Alessandro et al.: Pitch of glissandos
2340
TABLE I. Experimental conditions. R: rise. F: fall. B: adjustment at the beginning. E: adjustment at the end. b:
below the glissando threshold. g: glissando threshold. a: above the glissando threshold. ¯ : condition not tested.
50 ms
B
0 ST
0.8 ST
1.5 ST
3 ST
6 ST
12 ST
100 ms
E
B
E
B
300 ms
E
B
E
R
F
R
F
R
F
R
F
R
F
R
F
R
F
R
R
b
b
b
g
a
a
¯
b
b
g
a
a
b
b
b
g
a
a
¯
b
b
g
a
a
b
b
g
a
a
a
¯
b
g
a
a
a
b
b
g
a
a
a
¯
b
g
a
a
a
b
g
a
a
a
a
¯
g
a
a
a
a
b
g
a
a
a
a
¯
g
a
a
a
a
b
a
a
a
a
a
¯
a
a
a
a
a
b
a
a
a
a
a
¯
a
a
a
a
a
In ~A! the first member of a stimulus pair is a rise, and
the second member is flat. The subjects are asked to adjust a
flat tone to the end of a rise.
In ~B! the first member of a stimulus pair is flat, and the
second member is a rise. The subjects are asked to adjust a
flat tone to the beginning of a rise.
In ~C! the first member of a stimulus pair is a fall, and
the second member is flat. The subjects are asked to adjust a
flat tone to the end of a fall.
In ~D! the first member of a stimulus pair is flat, and the
second member is a fall. The subjects are asked to adjust a
flat tone to the beginning of a fall.
All the stimuli are centered at 220 Hz, a typical F0 for a
female voice. Four parameters are studied: the glissando direction ~rise/fall!, the extremity matched ~beginning/end!,
duration ~50, 100, 200, and 300 ms! and extent. The glissando extents are defined according to the glissando threshold, using data reported in ’t Hart et al. ~1990!. For the 50
ms duration condition, G t 50.16/(0.05) 2 564 semitones/s. The extent of the glissando at threshold (G tr) is 64
30.0553.2 semitones. For this condition, the extents studied in the experiments are 0, 0.8, 1.5, 3, 6 and 12 semitones
~or 0, 10.17, 18.74, 38.17, 76.63, and 155.56 Hz!. This corresponds approximately to 0.253G tr , 0.53G tr , G tr , 2
3G tr , 43G tr , in addition to the 0-Hz extent condition. The
same values are used for the 100, 200, and 300 ms duration
conditions. The extents are symmetric around the center
value 220 Hz on a semitone ~logarithmic! scale: F0 is within
220 Hz6N/2 semitones, where N is the extent of the stimuli
in semitones. As the glissandos are linear functions of time
in Hz, the value of 220 Hz is not reached at the center of the
stimuli in time because of the wider interval in Hz for higher
frequencies. In summary, there are 24 conditions for rises
~including four ‘‘0 semitone’’ conditions, which are considered as rises! and 20 conditions for falls. The resulting 88
conditions are summarized in Table I, together with the glissando threshold. Glissandos above and below the glissando
threshold are considered in the experiments.
B. Stimuli
The stimuli were synthetic posterior vowels /a/, computed by a digital parallel formant synthesizer. The center
frequencies, bandwidth and relative amplitudes of the first
four formants (F1, F2, F3, F4) were typical values for a
female
speaker:
F15(650 Hz, 80 Hz, 0 dB),
F2
2341
200 ms
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
5(1100 Hz, 90 Hz, 28 dB),
F35(2900 Hz, 140 Hz,
211 dB), F45(3300 Hz, 130 Hz, 220 dB). Synthetic
voice stimuli were chosen because they were fully controlled, and because they were closer to natural sounds ~like
speech and singing! than pure tones or square waves. The
signals corresponding to all the experimental conditions were
synthesized and stored on a computer disk. Signals were
sampled at 16 kHz, using 16 bits per sample. Prestored examples were played at test time, following the procedure
described below.
The signal peak amplitude was the same for all the
stimuli. Sound intensity variation within all the stimuli was
less than 2.4 dB. This was small enough to avoid consideration of the effect of intensity changes on pitch perception
@see Rossi ~1978b! for a discussion of interaction of amplitude and fundamental frequency glissandos in pitch perception#.
C. Procedure
Previous data published on the pitch perceived for shortduration F0 glissandos were generally obtained using a
forced-choice procedure ~Rossi, 1971; ’t Hart et al., 1990!.
However, we preferred a method of adjustment similar to the
method used in a study of pitch perception for short-duration
vibrato tones in d’Alessandro and Castellengo ~1994!. The
method of adjustment has a number of advantages: ~1! each
response of a subject gives an estimation of the pitch perceived; ~2! the method helps concentration, as the subject is
actively engaged in the adjustment process; ~3! the data obtained are less variable than forced-choice data in frequency
discrimination tasks, as shown by Wier et al. ~1976!.
Subjects were asked to adjust the frequency of the flat
tone so that it was equal to the frequency at the beginning or
at the end of the glissando. To make an adjustment, the subject pointed to a place on a one-octave scale on the computer
screen, pressed a button, and heard a stimulus pair. If she/he
was satisfied with this pair, the subject reported a match. If
she/he was not satisfied with this pair, the subject moved the
mouse pointer on the scale and listened to the corresponding
stimulus pair. Low frequencies were on the left of the scale,
and high frequencies on the right. For each match, the subject had no time limit for giving an answer, and could play
any stimulus pair for the same condition. Subjects typically
listened to about ten stimulus pairs while manipulating the
frequency of the flat tone before reporting a match. The subd’Alessandro et al.: Pitch of glissandos
2341
TABLE II. Experimental results. Means and standard deviations for the matches reported by the subjects, as a
function of extent, duration, rise/fall, and begin/end ~see parts ABCD in the text!. Means and standard deviations are in Hz.
Extent ST ~Hz!
50 ms
100 ms
200 ms
300 ms
221.261.0
226.268.7
229.869.1
241.1611.6
255.266.8
291.0613.7
221.161.0
227.069.5
231.4610.8
242.0612.7
256.565.3
292.0612.7
219.162.5
218.162.9
216.1610.7
213.3611.6
209.5620.0
208.2635.7
219.561.4
218.164.0
215.2610.5
212.4611.3
203.8619.4
201.8629.1
217.367.2
215.364.9
210.966.3
201.0612.2
193.7620.5
216.964.0
214.665.4
210.466.6
198.0610.0
193.5620.5
223.164.4
225.367.2
234.0611.4
251.6611.6
275.9622.0
222.563.0
226.765.2
235.066.7
254.369.5
282.0616.6
Part A
0.0~220.9!
0.8~215.0–225.1!
1.5~210.6–229.4!
3.0~201.7–239.9!
6.0~185.0–261.6!
12.0~155.6–311.1!
221.561.8
224.067.1
222.865.4
234.7615.5
255.1613.7
281.9612.4
221.161.0
225.566.9
227.7610.6
238.6612.4
257.069.7
283.7612.8
Part
0.0~220.9!
0.8~215.0–225.1!
1.5~210.6–229.4!
3.0~201.7–239.9!
6.0~185.0–261.6!
12.0~155.6–311.1!
218.163.6
219.062.0
215.569.9
213.9612.4
215.7624.3
215.9639.7
219.061.5
218.363.4
216.3610.3
216.0613.1
213.3625.4
213.8641.1
Part
0.8~215.0–225.1!
1.5~210.6–229.4!
3.0~201.7–239.9!
6.0~185.0–261.6!
12.0~155.6–311.1!
217.667.2
217.962.0
215.264.1
208.1615.9
214.1631.2
220.262.0
223.767.7
228.367.8
246.6616.5
267.4618.8
jects were asked to always give an answer. For each test, a
fixed scale spanning one octave was proposed to the subjects
for adjustment of the flat tone. This scale was divided into
steps of 2.25 Hz. This step size corresponded to about 1.4
times the difference limens reported by Moore ~1973! for
short-duration tones ~about 1.6 Hz for 50 ms at 250 Hz!. The
signals were presented to both ears at a level of 80 dB SPL
through Beyer DT48 headphones. The subjects participated
in five test sessions divided into 20 min periods, with a comfortable pause between periods. A typical session lasted between 2 and 4 h. The first test session was considered as a
training session, and was not used in the results.
D. Subjects
A group of eight musically trained subjects participated
in the experiments. One of the subjects was one of the authors ~CdA!, and the others were music students who were
paid for the experiments. Results are likely to be accurate
with trained and selected subjects. However, interpretation of
the results as representative of the general population may be
questionable, because of the superior skills of our subjects in
pitch discrimination. With 8 subjects and 4 sessions, 32
matches were made for each of the 88 experimental conditions.
2342
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
C
217.168.3
217.562.6
213.565.1
205.7613.8
202.0622.0
Part
0.8~215.0–225.1!
1.5~210.6–229.4!
3.0~201.7–239.9!
6.0~185.0–261.6!
12.0~155.6–311.1!
B
D
221.162.8
224.165.3
231.169.5
250.3614.3
272.1618.9
II. RESULTS AND DISCUSSION
A. Results
Means and standard deviations averaged for all subjects
are reported in Table II and plotted in Fig. 2. For each condition in Fig. 2, horizontal sides of the box represent duration
and vertical sides of the box represent extent. The four empty
circles near the corners of the boxes represent the means
obtained for the four parts ABCD of the test. The coordinates
of a circle represent the mean frequency ~along the y axis!
reported by the subjects, and the time ~along the x axis! that
corresponds to the instant when the glissando frequency was
equal to the mean frequency. Thus the F0 rises can be represented in the boxes by a line between the bottom left corner and the top right corner. The falls can be represented in
the boxes by a line between the top left corner and the bottom right corner. If the subjects were able to perfectly perceive the time-varying fundamental frequency, matches
would place the four points in the four corners. Part A ~rise/
end! is generally in the right/top corner, part B ~rise/
beginning! is generally in the left/bottom corner, part C ~fall/
end! is generally in the right/bottom corner, and part D ~fall/
beginning! is generally in the left/top corner. Responses are
always close to the extremity matched. This indicates that
subjects did not take all the glissando into account, but rather
made an average of the frequencies present near the extremd’Alessandro et al.: Pitch of glissandos
2342
~1! the high register has more weight in perception ~because
matches are closer to the higher extremity of the glissando!.
~2! the integration interval is shorter at the end and longer at
the beginning ~because matches are closer to the extremity at the end than at the beginning!.
FIG. 2. Experimental results. Means reported in Table II. Each circle represents the mean obtained for each condition. Each box corresponds to the
four parts ABCD of the experiments. For part A, rises adjusted at the end,
results are generally in the right and top corner. For part B, rises adjusted at
the beginning, results are generally in the left and bottom corner. For part C,
falls adjusted at the end, results are generally in the right and bottom corner.
For part D, falls adjusted at the beginning, results are generally in the left
and top corner. The underlined conditions represent infraliminar glissandos,
glissandos that are below the glissando threshold ~they are marked ‘‘b’’ in
Table I!. The other conditions represents supraliminar glissandos.
ity. The underlined conditions in Fig. 2 represent infraliminar glissandos, glissandos that are below the glissando
threshold ~they are marked ‘‘b’’ in Table I!. The other conditions represent supraliminar glissandos. For supraliminar
glissandos, a ‘‘butterfly’’ pattern seems consistent across
duration/extent conditions. Most butterflies share a similar
form: part A is the nearest to the corner, and in several cases
is matched even out of the box ~average differences between
means and extremities for part A are: 6.35 Hz!; part B is the
farthest from the corner ~B: 20.12 Hz!; point C is farther
from the corner than point D ~C: 16.43 Hz, D: 12.74 Hz!.
The parts can be classified in the order A,D,C,B as a
function of their distances to the corresponding extremity.
The butterfly pattern shows a possible asymmetry between rises and falls. Pitch for rises ~compared to falls! is
closer to the end when matched at the end ~compare parts A
and C!, and farther off the beginning when matched at the
beginning ~compare parts B and D!. This might indicate different intervals for perceptual time averaging that may depend on the factor ‘‘rise/fall.’’ It seems that subjects are
more influenced by the higher extremity than by the lower
extremity. Typically, falls are adjusted closer to the beginning, i.e., the higher extremity ~compare C and D!, whereas
rises are adjusted closer to the end, i.e., the higher extremity
~compare A and B!. Heuristic rules for drawing the butterfly
patterns in Fig. 2 can be stated as follows:
2343
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
Rule ~1! states that A,B and C.D. Rule ~2! states that
A,D and B.C. Thus, one can deduce that A,D,C,B.
The same tendency can be seen for standard deviations.
The standard deviations reported in Table II are an indication
of the consistency of the adjustments. Generally, standard
deviations are smaller for adjustments at the end ~parts A and
C!, and higher for adjustments at the beginning ~parts B and
D!. This could indicate that it was easier to make consistent
adjustments at the end. Also, standard deviations are smaller
for adjustment of the higher extremity ~compare A and D,
and B and C!. For the large extent conditions ~6 semitones
and 12 semitones!, the standard deviations can be ordered as:
A,D,C,B. Thus it seems that when adjustments are farther from the extremity matched, they are also less consistent. For the smaller extent conditions, it is difficult to interpret the differences in standard deviation. For the 0 semitone
condition, the standard deviations are small when adjusted at
the end. In this case, they are close to the difference limens
for short tones ~Moore, 1973!.
An analysis of variance was carried out with factors
‘‘duration’’ ~four values!, ‘‘rise/fall’’ ~two values!,
‘‘beginning/end’’ ~two values!, ‘‘extent’’ ~five values!, ‘‘session’’ ~four values!. The factor ‘‘duration’’ was not significant @F(3,560)520.055; p50.9830#. As a matter of fact,
only a slight dependence on duration is noticeable in Fig. 2.
Means followed a slightly rising or falling line as duration
increased for a same extent, but this tendency was not significant. This showed that duration played only a minor role
compared to other factors. No significant effect was found
for the factor ‘‘beginning/end’’ @F(1,560)51.936; p
50.1646#; subject responses averaged for parts A and C
were not different from the subject responses averaged for
parts B and D. The effect of the factor ‘‘rise/fall’’ was significant @F(1,560)526.94; p,0.0001#. This indicates that
the asymmetry observed in perception of rises and falls is
significant. The factor ‘‘extent’’ was significant @F(4,560)
572.67; p,0.0001#. This was likely to occur, as for most
conditions the extents were close to or above the glissando
threshold. Interactions between ‘‘rise/fall’’ and ‘‘beginning/
end’’ were very strong @F(1,560)51092.80; p,0.0001#
This indicates different behaviors for the different parts A, B,
C and D. The interaction between ‘‘rise/fall’’ and ‘‘extent’’
was significant @F(4,560)53.07; p50.0162#. Again, this
confirms an asymmetry in perception of rises and falls,
which depends on extent. All other interactions were not
significant.
B. Asymmetry between higher and lower extremities
Rule ~1! states that the higher extremity is ‘‘better’’ perceived than the lower extremity. ‘‘Better’’ perceived means
that adjustments are closer to the higher extremity, when it is
the extremity matched, compared to the lower extremity,
d’Alessandro et al.: Pitch of glissandos
2343
again when it is the extremity matched. This experimental
result suggests that the high register might be perceptually
more important than the low register. Similar findings have
already been reported in other works. A difference between
rises and falls was noticed by Rump and Hermes ~1996! in
an experiment on prominence in intonation. The stimuli were
/mamama/ utterances, with fundamental frequency around
100 Hz. They found that high pitch levels contributed to
prominence more than low pitch levels. Demany and
McAnally ~1994! and Demany and Clément ~1995, 1997!
conducted a number of studies on perception of wide frequency modulations. In these experiments, pure tones or
complex tones were used, between 250 and 4000 Hz. They
consistently found that frequency peaks are better perceived
than frequency troughs. This effect seems robust against factors like the carrier signal, modulation waveform, frequency
register and intensity. In a study on perception of contour
tones, Hombert ~1975! asked subjects to adjust a static tone
at the beginning of a rise or a fall. The results are very
similar to ours: the beginning of a gliding tone is better perceived for falls than for rises. In other words, for a task rather
similar to the present experiments, Hombert ~1975! found the
same type of effect: The higher extremity is better perceived
than the lower extremity. This indicates that rule ~1! might
be a rather general effect for perception of supraliminar ~or
wide! frequency modulations.
C. Asymmetry between beginning and end
The second type of asymmetry found in our data, rule
~2!, is an asymmetry between adjustments at the beginning
and at the end of tones. This type of asymmetry has been
reported in several studies of wide frequency modulations, as
in formant transitions between consonants and vowels in
speech. Brady et al. ~1961! suggest that a higher perceptual
weight could be assigned to the most recent information.
More recently, Porter et al. ~1991!, Wieringen and Pols
~1995!, Wieringen ~1995!, and Cullen et al. ~1992! reported
differences in sensitivity between initial and final transitions
of short formant glides. All these studies conclude that the
end is better perceived than the beginning. This indicates that
rule ~2! might be a rather general effect for perception of
time-varying frequency modulation.
D. Asymmetries between rises and falls
The possible asymmetry in perception of rises and falls
has been the subject of controversial discussion in many papers on tone or intonation perception. Dooley and Moore
~1988! discussed asymmetry in difference limens for glide
detection. In a two-alternative, forced-choice procedure, they
measured frequency difference limens ~DLF! for rises and
falls. Rises and falls were presented randomly before and
after a constant tone signal. Signals were merged, corresponding to parts A and B on the one hand and C and D on
the other, and no effects between beginning and end were
likely to occur. Pure tones were used, with center frequencies
ranging from 500 to 8000 Hz, and durations ranging from 50
to 800 ms. Dooley and Moore ~1988! found that UP–DL
~DLF for rises! were consistently higher than DOWN–DL
2344
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
~DLF for falls!. Thus it seems that rises are more difficult to
perceive than falls. The opposite effect is reported by Gardner and Wilson ~1979!. Dooley and Moore conclude that
these differences are difficult to interpret. The differences
between rises and falls went unnoticed in many previous
works on the glissando threshold, as ’t Hart et al. ~1990!
pointed out. They also noted the large dispersion of the data
on the glissando threshold found in the literature. All these
studies were for difference limens. In this case, it is difficult
to conclude about possible asymmetries between rises and
falls. On the contrary, in the case of wide and supraliminar
frequency modulations, it seems that the difference between
rises and falls is consistent in many experiments ~Hombert,
1975; Rump and Hermes, 1996; Demany and McAnally,
1994; Demany and Clément, 1995, 1997!. In this case, the
combination of rules ~1! and ~2! can shed some light on the
possible asymmetry in perception of rises and falls, as it is
discussed above.
E. Summary
In summary: ~1! the mean frequency perceived for shortduration F0 glissandos is in the direction of the extremity
matched; ~2! distances of adjustments to the extremity differ
for rises and falls; ~3! adjustments are closer to the extremity
for adjustments at the end, compared to adjustments at the
beginning; ~4! adjustments are closer to the higher extremity
compared to the lower extremity; ~5! ‘‘duration’’ plays a
minor role; ~6! ‘‘extent’’ is highly significant. As a consequence, the four parts ABCD for a given condition of extent
and duration have distinct behaviors, which can be seen in
the asymmetric butterfly patterns of Fig. 2.
Time averaging of F0 could explain pitch perception for
F0 glissandos. It also seems that not all the glissando is
taken into account for time averaging. These results are compatible with Rossi’s 2/3 rule. One can easily check that most
of our experimental data are comprised between the second
and the third third of the glissandos.2 However, this rule is
not really a model, as it gives only an indication of where
pitch is. It is of little use in the case of large extent. For
example, if one considers the 12 semitone conditions, Rossi’s rule predicts that the pitch perceived is between the second and the third third of the glissando, and one third of the
glissando represents more than 50 Hz. Of course the result is
correct, but not very accurate. In the next section, a more
accurate model is discussed.
III. NUMERICAL MODEL
A. Equation of the model
A numerical model able to predict experimental data is
searched for. The aim of this model is to predict the pitch
perceived at the beginning and end of a given F0 glissando.
The model should be able to give a response which is close
to the subjects’ mean response. Both TA and WTA models
are studied. The TA model integrates F0 over a time interval
of the glissando. The integration interval is near the extremity matched. Assuming a linear glissando F05a t 1b, a TA
model is
d’Alessandro et al.: Pitch of glissandos
2344
p5
* t2
t1 ~ a t 1b ! d t
* t2
t1 d t
5a
t11t2
1b.
2
~3!
The model has only one parameter, d, defined as follows. Let d be the glissando duration. Then d is the length of
the averaging interval for integration, expressed as a proportion of d. The time limits of integration are (t150, t25 d
3d! for an adjustment at the beginning, and (t15d2 d
3d, t25d! for an adjustment at the end. The TA model is
computed as
p~ d !5
H
ad ~ 22 d !
1b
2
ad d
1b
2
adjustment at the end
~4!
adjustment at the beginning.
Like the TA model, the WTA model integrates F0 on a
time interval of the glissando, but F0 is modulated by a
weighing function. The aim of this weighing function is to
concentrate the integration of frequency on a part of the glissando. It is reasonable to assume that time plays a role in the
averaging process, so more attention must be paid to the end
in case of adjustment at the end, and to the beginning in case
of adjustment at the beginning. This is the reason that the
weighing function chosen is an exponential function. The
model has two parameters. d represents the duration of the
averaging interval expressed as a proportion of the tone duration, and a is the exponential factor:
p~ a,d !5
5
* t2
t1 exp„a ~ t 2t2 ! …~ a t 1b ! d t
~5!
Let d be the glissando duration. Then the time limits of
integration are (t150, t25 d 3d! for an adjustment at the
beginning, and (t15d2 d 3d, t25d! for an adjustment at
the end. The WTA model is computed as:
p~ a,d !5
5
a2a ~ d2 d d ! exp~ 2 a d d !
a
1b2
12exp~ 2 a d d !
a
adjustment at the end,
a
add
1b2
12exp~ 2 a d d !
a
~6!
adjustment at the beginning.
B. Estimation of optimal parameters
The TA model depends on one free parameter, and the
WTA model depends on two free parameters. Optimal parameters of the models must be estimated. A classical means
for estimation of optimal parameters is to minimize the rootmean-square ~rms! distance between the model response and
the experimental data.
Let m(n) be the experimental mean ~reported in Table
II! for a particular condition n of factors ‘‘duration,’’ ‘‘extent,’’ ‘‘rise/fall,’’ and ‘‘beginning/end.’’ Let p d , a (n) be the
response of a model with parameters d,a for condition n.
The normalized rms distance is computed by
2345
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
A(
N
n51
„p d , a ~ n ! 2m ~ n ! …2 ,
~7!
where n is between 1 and N, and where N is the number of
conditions considered.
The minimum rms distance between the model and data,
using an iterative minimization procedure, gives the optimal
parameters. However, the raw distance between the model
and the data is not very informative by itself. A good way to
gain further insight into the model quality is to evaluate the
fit to the data, by computing the number Q of conditions for
which the model response is not statistically different from
the subject’s response: The larger the Q, the better the fit. Q
is computed as follows. Let e(n) be the experimental standard deviation ~reported in Table II! for a particular condition n of factors ‘‘duration,’’ ‘‘extent,’’ ‘‘rise/fall,’’ and
‘‘beginning/end.’’ The statistical equivalence between this
predicted value and the experimental data is computed by the
Student t-test ~recall that 32 responses are available for each
test condition!: t(n)5 @ „m(n)2 p d , a (n)…A31# /e(n). Using a
level of significance of 0.05, the critical value is 2.03 for a
two-tailed test with df531. The statistical equivalence between the predicted value and the experimental data for a
condition n is computed using a function q d , a (n) defined by
q d,a~ n ! 5
H
1,
if 22.03,t ~ n ! ,2.03,
0,
otherwise.
~8!
The quality Q( d , a ) of the model with parameters d,a is the
number of conditions for which the model and the data are
statistically equivalent:
* t2
t1 exp„a ~ t 2t2 ! …d t
a
at22at1 exp„a ~ t12t2 ! …
1b2 .
12exp„a ~ t12t2 ! …
a
1
rms~ d , a ! 5
N
N
Q~ d,a !5
(
n51
q d,a~ n ! ,
~9!
n is between 1 and N, where N is the number of conditions
considered. For instance, if the model is searched for all the
conditions, then N580; but if it is searched only for rises
matched at the end, then N520. The zero-semitone conditions are not considered, because every model will give the
same answer in this case.
Discussion of the results reported above suggests that
asymmetries between beginning versus end, higher versus
lower extremities, and rises versus falls have to be considered. This indicates that each part ABCD should be considered separately for modeling. Thus four sets of parameters
are considered for modeling, one for each part ABCD, with
N520 for each set. The optimization procedure described
above is applied to each part. In addition, the optimization
procedure is applied to the entire set of data ~coined ‘‘all
data’’ condition!. Optimal parameters, and the corresponding
number of conditions for which the predicted data are not
significantly different from the experimental data, are reported in Table III. Results for each condition ~using four
sets of parameters! are displayed in Fig. 3. An empty circle
indicates a condition for which the model response is statistically different from the data. A filled circle indicates a condition for which the model response is not statistically different from experimental data.
d’Alessandro et al.: Pitch of glissandos
2345
TABLE III. Optimal parameters for the TA and the WTAM models. ‘‘Q’’ represents the ‘‘quality’’ of the
model: the number of conditions for which the predicted data are not significantly different from the experimental data ~see text!, and ‘‘#’’ represents the number of conditions for each part. ‘‘%’’ is the corresponding
percentage (1003Q/#). ‘‘All data’’ means that a same parameter setting is used for all data. ‘‘ABCD’’ means
that a different parameter setting is used for each part ABCD.
TA d
All data
Part A
Part B
Part C
Part D
ABCD
0.48
0.18
0.69
0.59
0.45
¯
Q ~#!
49
17
20
16
15
68
~80!
~20!
~20!
~20!
~20!
~80!
%
61
85
100
80
75
85
C. Discussion
The TA and WTA models are able to predict 85% and
94% of the data, provided that the four parts ABCD are
considered separately. The percentages are reduced to only
61% and 65% with the same parameter setting for all parts.
This result is not surprising, given the asymmetries observed
in the experimental data. Thus for modeling the four parts
ABCD must be considered separately. In this case, good results are obtained for both models, though the WTA model
performs better than the TA model.
The optimal a and d parameters for the WTA model are
different for the four parts of the test. The interpretation of d
is straightforward: It represents the proportion of the glissan-
FIG. 3. Weighted time-average models. Each circle represents the value
obtained by four weighted time-average models corresponding to the four
parts of the test. A filled circle represents a conditions which is not statistically different from the experimental data. The underlined conditions represent infraliminar glissanndos, glissandos that are below the glissando threshold ~they are marked ‘‘b’’ in Table I!. The other conditions represent
supraliminar glissandos.
2346
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
WTA d
WTA a
Q ~#!
0.54
0.26
0.82
0.75
0.56
¯
7.2
24.7
27.5
13.5
219.4
¯
52
17
20
20
18
75
~80!
~20!
~20!
~20!
~20!
~80!
%
65
85
100
100
90
94
dos over which integration occurs. Interpretation of a is twofold. On the one hand, the sign of a indicates if more attention was paid to the beginning or end of the glissando. A
positive a indicates that a higher perceptual weight is assigned to the most recent information ~end!. A negative a
indicates that a higher perceptual weight is assigned to the
most distant information ~beginning!. On the other hand, the
magnitude of a represents the amount of damping introduced
in averaging. A large magnitude means that a higher perceptual weight is assigned to the information close to the extremity matched. A small magnitude means that all the parts
of the time interval used for averaging have almost the same
weight. The averaging interval ( d 50.26) is very short for
part A: about 26% of the end of the glissando. The parameter
a 524.7 is positive, which means that the end of the glissando has more weight. The averaging window length in part
B ( d 50.82) is long: about 82% of the glide. The parameter
a 527.5 is negative. This indicates that more attention was
paid to the beginning of the glissando. The same remark
holds for part D, where a 5219.4 is negative, though for
part D the averaging window ( d 50.56) is shorter than for
part B. For part C, the parameter a 513.5 is positive, which
indicates that more attention was paid to the end of the glissando. The averaging window ( d 50.75) is long, about 75%
of the tone duration. The same remarks hold for the TA
model. The averaging window is shorter for part A ~18%!,
longer for part B ~69%!. It is also long for parts C and D.
The combination of the magnitude of a and d can be
used for computing the distance between the model responses and the extremity matched. Recall that in the results
section, it was found that this distance is ordered as
A,D,C,B for the experimental means. The constant a
magnitude represents the damping factor. When it is high,
damping is high and then the model response is close to the
extremity. The distance of the model response and the extremity can be ordered as follows: A,D,C,B because the
corresponding a are 24.7~A!.19.4~D!.13.5~C!.7.5~B!.
The parameter d represents the proportion of the glissando
used for integration. When d is small, the model response is
close to the extremity. We obtain the same order:
A,D,C,B, because 0.26~A!,0.56~D!,0.75~C!,0.82~B!.
Therefore, the optimal parameters obtained for the WTA
model are in good agreement with the heuristic rules derived
d’Alessandro et al.: Pitch of glissandos
2346
from the experimental results: The same order is found for
the four parts ABCD of the test.
In summary, it seems that both the TA and the WTA
models are able to predict a significant amount of the experimental data. However, the WTA model performs better than
the TA model. This is not surprising, since the WTA model
has two parameters compared to only one parameter for the
TA model. The model’s parameters give some insight into
the experimental data, and are compatible with the butterfly
patterns found in the experimental data. It must be pointed
out that it is necessary to know the glissando direction and
whether the match is at the beginning or end in order to
select a set of parameters. This means that the WTA model
in itself is not able to explain the asymmetries, but must be
adapted to each situation. In this case it is able to predict the
pitch perceived. The WTA model is a functional model, and
not a model of the physiological or psychological processes
that take place in pitch perception. It may help to compute
the pitch perceived for an F0 glissando, but is not able to
explain how this pitch is perceived. In particular, the explanation of experimental asymmetries in pitch perception is
still unclear.
IV. CONCLUSION
In this study, we studied perception of short-duration
linear F0 glissandos. Perception of rising and falling glissandos, matched at the beginning and at the end, was measured.
Significant differences between the four main parts ABCD of
the experiments ~rises adjusted at the end, rises adjusted at
the beginning, falls adjusted at the end, falls adjusted at the
beginning! were found. This indicates that there is a difference between perception of rises and falls and perception of
the beginning and the end of tones. This might be explained
by a greater perceptual weight for the high register, compared to the low register. The experimental results showed
that some sort of averaging of the F0 contours was performed by the subjects. Frequencies near the extremity
matched ~beginning or end! seemed more important. Simple
numerical models for prediction of the experimental data
were tried: a time-average model and a weighted timeaverage model of pitch perception. Optimal parameters for
these models were estimated using a statistical distance criterion. The models performed well in predicting experimental data: 94% of the data computed with the best WTA models were not significantly different from experimental data.
The optimal model’s parameter settings varied significantly
among parts ABCD ~rises versus falls, adjustment at the end
versus adjustment at the beginning!. This was consistent with
the finding of significant differences in the perception of the
four parts. It can be concluded from our experiments that
pitch perception for short-duration F0 glissandos corresponds to a weighted time average of F0. The WTA model
proposed in the present study is close to the WTA model that
was proposed for perception of short-duration vibrato tones
in d’Alessandro and Castellengo ~1994!. However, the parameters obtained in both studies are different, as in this
study they are also different among different parts.
2347
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
ACKNOWLEDGMENTS
We are grateful to John P. Madden and Adrian Houtsma
for their detailed constructive comments on an earlier version of this article, and to the editorial assistance and criticisms provided by D. Wesley Grantham. Thanks also to
Olivier Piot, who helped in the experimental data collection.
1
Mario Rossi, not to be confused with Jean-Pierre Rossi, one of the authors.
Rossi’s rule was stated for adjustment at the end of stimuli. The same type
of rule can be formulated in the case of adjustment at the beginning, as
follows: For dynamic tones in a vowel, the pitch perceived at the beginning
corresponds to a point between the first and the second third of the vowel.
2
Brady, P. T., House, A. S., and Stevens, K. N. ~1961!. ‘‘Perception of
sounds characterized by a rapidly changing resonant frequency,’’ J.
Acoust. Soc. Am. 33, 1357–1362.
Cullen, J. K., Houtsma, A. M. J., and Collier, R. ~1992!. ‘‘Discrimination of
brief tone-glides with high rate of frequency changes,’’ Abstract of the
Fifteenth Midwinter Meeting of the Association for Research in Otolaryngology, 67.
d’Alessandro, C., and Castellengo, M. ~1994!. ‘‘The pitch of short duration
vibrato tones,’’ J. Acoust. Soc. Am. 95, 1617–1630.
d’Alessandro, C., and Mertens, P. ~1995!. ‘‘Automatic pitch contour stylization using a model of tonal perception,’’ Comput. Speech Lang. 9, 257–
288.
Dooley, G. J., and Moore, B. C. J. ~1988!. ‘‘Detection of linear frequency
glides as a function of frequency and duration,’’ J. Acoust. Soc. Am. 84,
2045–2057.
Demany, L., and Clément, S. ~1995!. ‘‘The perception of frequency peaks
and troughs in wide frequency modulation. III. Complex carriers,’’ J.
Acoust. Soc. Am. 98, 2515–2523.
Demany, L., and Clément, S. ~1997!. ‘‘The perception of frequency peaks
and troughs in wide frequency modulation. IV. Effects of modulation
waveform,’’ J. Acoust. Soc. Am. 102, 2935–2944.
Demany, L., and McAnally, K. I. ~1994!. ‘‘The perception of frequency
peaks and troughs in wide frequency modulation,’’ J. Acoust. Soc. Am.
96, 706–715.
Gardner, R. B., and Wilson, J. P. ~1979!. ‘‘Evidence for direction-specific
channels in the processing of frequency modulations,’’ J. Acoust. Soc.
Am. 66, 704–709.
Hermes, D. J. ~1993!. ‘‘Pitch analysis,’’ in Visual Representation of Speech
Signals, edited by M. Cooke, S. Beet, and M. Crawford ~Wiley, Chichester, UK!.
Hess, W. ~1983!. Pitch Determination of Speech Signals. Algorithms and
Devices ~Springer-Verlag, Berlin!.
Hombert, J. M. ~1975!. ‘‘Perception of contour tones: an experimental
study,’’ Proc. 1st Annu. Meet. Berkeley Ling. Soc., pp. 221–232.
House, D. ~1990!. Tonal Perception in Speech ~Lund U.P., Lund, Sweden!.
Klatt, D. H. ~1973!. ‘‘Discrimination of fundamental frequency contours in
synthetic speech: Implications for models of pitch perception,’’ J. Acoust.
Soc. Am. 53, 8–16.
Mertens, P. ~1987!. ‘‘L’intonation du français. De la description linguistique
á la reconnaissance automatique,’’ Ph.D. thesis, Catholic University of
Leuven, Belgium.
Moore, B. C. J. ~1973!. ‘‘Frequency difference limens for short-duration
tones,’’ J. Acoust. Soc. Am. 54, 610–619.
Nabelek, I. V., Nabelek, A. K., and Hirsh, I. J. ~1970!. ‘‘Pitch of tone bursts
of changing frequency,’’ J. Acoust. Soc. Am. 48, 536–553.
Porter, R. J., Cullen, J. K., Collins, M. J., and Jackson, D. F. ~1991!. ‘‘Discrimination of formant transition onset frequency: Psychoacoustic cues at
short, moderate, and long durations,’’ J. Acoust. Soc. Am. 90, 1298–1308.
Ross, J. ~1987!. ‘‘The pitch of glide-like F0 curves in Votic folk songs,’’
Proceedings of XIth International Congress of Phonetic Sciences, XIth
ICPhS, Tallin, USSR, pp. 174–177.
Rossi, M. ~1971!. ‘‘Le seuil de glissando ou seuil de perception des variations tonales pour les sons de la parole,’’ Phonetica 23, 1–33.
Rossi, M. ~1978!. ‘‘La perception des glissando descendants dans les contours prosodiques,’’ Phonetica 35, 11–40.
d’Alessandro et al.: Pitch of glissandos
2347
Rossi, M. ~1978!. ‘‘Interaction of intensity glides and frequency glissandos,’’ Language and Speech 21, 384–396.
Rump, H. H., and Hermes, D. J. ~1996!. ‘‘Prominence lent by rising and
falling pitch movements: Testing two models,’’ J. Acoust. Soc. Am. 100,
1122–1131.
Schouten, H. E. M. ~1985!. ‘‘Identification and discrimination of sweep
tones,’’ Percept. Psychophys. 37, 369–376.
Sergeant, R. L., and Harris, J. D. ~1962!. ‘‘Sensitivity to unidirectional
frequency modulation,’’ J. Acoust. Soc. Am. 34, 1625–1628.
’t Hart, J. ~1976!. ‘‘Psychoacoustic backgrounds of pitch contour stylization,’’ IPO Ann. Prog. Rep. 11, Eindhoven, The Netherlands, 11–19.
2348
J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998
’t Hart, J., Collier, R., and Cohen, A. ~1990!. A Perceptual Study of Intonation ~Cambridge U.P., London!.
van Wieringen, A. ~1995!. ‘‘Perceiving dynamic speechlike sounds,’’ Ph.D.
thesis, Amsterdam University.
van Wieringen, A., and Pols, L. C. V. ~1995!. ‘‘Discrimination of single and
complex consonant-vowel- and vowel-consonant-like formant transitions,’’ J. Acoust. Soc. Am. 98, 1304–1312.
Wier, C. C., Jesteadt, W., and Green, D. M. ~1976!. ‘‘A comparison of
method-of-adjustment and forced-choice procedures in frequency discrimination,’’ Percept. Psychophys. 19, 75–79.
d’Alessandro et al.: Pitch of glissandos
2348