The pitch of short-duration fundamental frequency glissandos Christophe d’Alessandro, Sophie Rosset, and Jean-Pierre Rossi LIMSI–CNRS, BP 133 F-91403 Orsay, France ~Received 9 October 1997; accepted for publication 7 July 1998! Pitch perception for short-duration fundamental frequency (F0) glissandos was studied. In the first part, new measurements using the method of adjustment are reported. Stimuli were F0 glissandos centered at 220 Hz. The parameters under study were: F0 glissando extents ~0, 0.8, 1.5, 3, 6, and 12 semitones, i.e., 0, 10.17, 18.74, 38.17, 76.63, and 155.56 Hz!, F0 glissando durations ~50, 100, 200, and 300 ms!, F0 glissando directions ~rising or falling!, and the extremity of F0 glissandos matched ~beginning or end!. In the second part, the main results are discussed: ~1! perception seems to correspond to an average of the frequencies present in the vicinity of the extremity matched; ~2! the higher extremities of the glissando seem more important; ~3! adjustments at the end are closer to the extremities than adjustments at the beginning. In the third part, numerical models accounting for the experimental data are proposed: a time-average model and a weighted time-average model. Optimal parameters for these models are derived. The weighted time-average model achieves a 94% accurate prediction rate for the experimental data. The numerical model is successful in predicting the pitch of short-duration F0 glissandos. © 1998 Acoustical Society of America. @S0001-4966~98!03810-7# PACS numbers: 43.66.Hg, 43.75.Bc @DWG# INTRODUCTION In natural audio signals like speech or music, fundamental frequency (F0) is almost never constant. The main psychological correlate of F0 is pitch. It is well known that the physical aspect (F0) and the psychological aspect ~pitch! can differ significantly, particularly for sounds whose fundamental frequency changes. The aim of the present article is to report new measurements and to propose a numerical model for pitch perception of short-duration F0 glissandos. Short-duration F0 glissandos are thought of as prototypes or building blocks for the F0 contours that can be encountered in many natural signals, like notes in music or syllables in speech. Several theories accounting for human instantaneous pitch detection are available, together with the corresponding pitch detection algorithms @see Hess ~1983! or Hermes ~1993! for reviews#. These theories and techniques focus on instantaneous, or at least short-term, attributes of the acoustic signal. Only a few periods of signals are considered in order to compute the time-varying instantaneous fundamental frequency. However, the link between instantaneous F0 contours and pitch perception for larger units, like F0 glissandos, is by no means straightforward. It is therefore important to distinguish between instantaneous F0, which is a relatively well-known problem, and pitch perception for shortduration F0 glissandos. Perceptual thresholds for predicting under what conditions F0 changes are audible have long been studied. A first perceptual threshold is the glissando threshold, or absolute threshold of pitch change. It is expressed in frequency unit per unit of time. Below the threshold, a F0 glissando is perceived with constant pitch. Above the threshold, a pitch glissando is perceived. Data on the glissando threshold for speechlike sounds have been obtained by Klatt ~1973!, Rossi1 ~1971, 1978a!, and Schouten ~1985!, following the 2339 J. Acoust. Soc. Am. 104 (4), October 1998 pioneering work of Sergeant and Harris ~1962! for pure tones. A unified view of this problem was presented by ’t Hart ~1976! and ’t Hart et al. ~1990!, who conclude that the threshold can be approached by G t 50.16/T 2 , where G t is the glissando threshold ~in semitones/s! and T the tone duration. When a glissando is above the glissando threshold, a separate perception of the high and low extremities of the F0 patterns appears. Nabelek et al. ~1970! first studied this effect in a work on the pitch for tone bursts of changing frequency. They coined the term ‘‘separation’’ for when only the final part of the burst contributes to the pitch judgment, and the term ‘‘fusion’’ for when the overall F0 pattern contributes to the pitch judgment. Thresholds for separation and fusion are consistent with the glissando threshold. Thresholds can predict whether a given F0 glissando ~with a certain duration and a certain extent! will be perceived as a constant pitch tone or as a pitch glissando. However, the data on perceptual thresholds for pitch changes do not address the problem of what pitch is actually perceived, i.e., what the subjective pitch value is at the beginning and end of the F0 glissando. The first aim of this study is to give explicit measurements of the pitch perceived at the beginning and end of F0 glissandos. One of the few studies addressing this problem can be found in Rossi ~1971, 1978a!, where natural and synthetic vowels were used. Unfortunately, the experimental conditions used by Rossi were not likely to give accurate results, because of the choice of subjects ~a group of students in phonetics whose individual capabilities in pitch discrimination were not tested! and because of the stimulus presentation method ~loudspeakers in a classroom!. Therefore, new and more accurate measurements were needed. Another issue which needs to be reinvestigated is whether there are asymmetries in the perception of supraliminar rising and falling F0 glissandos. Several articles 0001-4966/98/104(4)/2339/10/$15.00 © 1998 Acoustical Society of America 2339 have noted such asymmetries, but the results have not always been consistent. Hombert ~1975! and Rump and Hermes ~1996! found that the high register has more weight in intonation perception. Brady ~1961! and Wieringen and Pols ~1995! reported that for pure tone glides, the end of the tone has more perceptual weight than the beginning. For pure tones and complex tones, Demany and McAnally ~1994! and Demany and Clément ~1995, 1997!, found an asymmetry between peaks and troughs for large frequency modulations. However, Rossi ~1978a! reported that no differences occurred in the perception of rising and falling glissandos. In this article, we compare the perception of rises and falls over a wide range of conditions. The final purpose of this work is to propose a numerical model of pitch perception for short-duration F0 glissandos. A first model addressing this problem was proposed by Rossi ~1971, 1978a!, using natural and synthetic vowels. Rossi postulated that for vowels with F0 glissandos ~either rising or falling!, the pitch perceived at the end of vowels corresponds to an averaging of F0 toward the end of the glissando. This is expressed in Rossi’s so-called ‘‘2/3 rule,’’ which can be stated as follows: For dynamic tones in a vowel, the pitch perceived corresponds to a point between the second and the third third of the vowel. For example, if a linear glissando between 100 and 200 Hz lasts 150 ms, this rule predicts that the pitch perceived is somewhere in the F0 values of the glissando in the time interval 100–150 ms, i.e., between 166.6 and 200 Hz. In Rossi’s terminology, a dynamic tone is a tone with perceived pitch change. Of course, this rule applies only to the experimental situation envisaged in Rossi’s experiments, namely pitch perception measured at the end of tones in vowels. This model was also applied to the pitch of glidelike F0 curves in folk songs. Ross ~1987! reported good agreement between his experimental results and Rossi’s rule. In mathematical terms, Rossi’s rule is an example of a time-average ~TA! model of pitch perception. Pitch is computed by averaging F0 on the time interval of the tone: p5 * t2 t1 F0 ~ t ! d t * t2 t1 d t , ~1! where p is the pitch perceived, F0( t ) is the time-varying fundamental frequency, t1 and t2 are the averaging time limits. In systems for automatic intonation analysis, several authors used ‘‘pragmatic’’ TA models. For instance, Mertens ~1987! and House ~1990! computed pitch as the average of F0 on 20–30 ms at the end of vowels in syllables. In a study on pitch perception for short-duration vibrato tones in singing, d’Alessandro and Castellengo ~1994! showed that TA models were not suitable in this situation. The main problem was that TA models were not able to give more perceptual weight to the end of tones. They proposed instead a weighted time-average ~WTA! model of pitch perception. Pitch is computed as the time average of F0 viewed through a time window: 2340 J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 FIG. 1. The four parts ABCD of the experiments. p5 a ~ t 2t2 ! * t2 1 b ! F0 ~ t ! d t t1 ~ e a ~ t 2t2 ! * t2 1b !dt t1 ~ e , ~2! where the constants b and a define the raised exponential window. The constant a is the factor of the exponent in the window. It gives more or less weight to the final part of the tone. The constant b is the height of the raised exponential window, i.e., the height of a rectangular window, whose width is the tone duration. Contrary to a, the constant b gives equal weight to all parts of the tone. Thus b controls the weight given to the whole tone, while a controls the weight given to the end of the tone. The optimal parameters were a 522 and b 50.20. This WTA model was also successfully applied to intonation stylization in speech in d’Alessandro and Mertens ~1995!. In summary, the aim of the present study is to address the following matters: ~1! to give new measurements of the pitch perceived at the beginning and end of short-duration F0 glissandos; ~2! to study the asymmetries in perception of rises and falls; ~3! to evaluate a WTA model of F0 integration for pitch perception. This model should be able to predict the pitch perceived at the beginning and end of an F0 glissando. In Sec. I, the experimental method is described. Then the results are presented and analyzed in Sec. II. A numerical model accounting for our data is proposed and discussed in Sec. III. Section IV concludes. I. METHOD A. Conditions The aim of these experiments is to measure the pitch perceived for short-duration F0 glissandos. Linear F0 glissandos are of the form F0( t )5a3 t 1b, where b is the frequency ~in Hz! at the beginning of the tone, a is the slope of the tone ~if d is the duration of the tone, a3d1b is the frequency at the end of the tone!, and t represents time. The stimulus are pairs of synthetic vowels, with a 300-ms silent interval between each member of a stimulus pair. One member of a pair is a glissando ~either a rise or a fall!. The other member of a pair is flat ~constant F0! with the same duration. The test is divided into four parts ABCD, summarized in Fig. 1. d’Alessandro et al.: Pitch of glissandos 2340 TABLE I. Experimental conditions. R: rise. F: fall. B: adjustment at the beginning. E: adjustment at the end. b: below the glissando threshold. g: glissando threshold. a: above the glissando threshold. ¯ : condition not tested. 50 ms B 0 ST 0.8 ST 1.5 ST 3 ST 6 ST 12 ST 100 ms E B E B 300 ms E B E R F R F R F R F R F R F R F R R b b b g a a ¯ b b g a a b b b g a a ¯ b b g a a b b g a a a ¯ b g a a a b b g a a a ¯ b g a a a b g a a a a ¯ g a a a a b g a a a a ¯ g a a a a b a a a a a ¯ a a a a a b a a a a a ¯ a a a a a In ~A! the first member of a stimulus pair is a rise, and the second member is flat. The subjects are asked to adjust a flat tone to the end of a rise. In ~B! the first member of a stimulus pair is flat, and the second member is a rise. The subjects are asked to adjust a flat tone to the beginning of a rise. In ~C! the first member of a stimulus pair is a fall, and the second member is flat. The subjects are asked to adjust a flat tone to the end of a fall. In ~D! the first member of a stimulus pair is flat, and the second member is a fall. The subjects are asked to adjust a flat tone to the beginning of a fall. All the stimuli are centered at 220 Hz, a typical F0 for a female voice. Four parameters are studied: the glissando direction ~rise/fall!, the extremity matched ~beginning/end!, duration ~50, 100, 200, and 300 ms! and extent. The glissando extents are defined according to the glissando threshold, using data reported in ’t Hart et al. ~1990!. For the 50 ms duration condition, G t 50.16/(0.05) 2 564 semitones/s. The extent of the glissando at threshold (G tr) is 64 30.0553.2 semitones. For this condition, the extents studied in the experiments are 0, 0.8, 1.5, 3, 6 and 12 semitones ~or 0, 10.17, 18.74, 38.17, 76.63, and 155.56 Hz!. This corresponds approximately to 0.253G tr , 0.53G tr , G tr , 2 3G tr , 43G tr , in addition to the 0-Hz extent condition. The same values are used for the 100, 200, and 300 ms duration conditions. The extents are symmetric around the center value 220 Hz on a semitone ~logarithmic! scale: F0 is within 220 Hz6N/2 semitones, where N is the extent of the stimuli in semitones. As the glissandos are linear functions of time in Hz, the value of 220 Hz is not reached at the center of the stimuli in time because of the wider interval in Hz for higher frequencies. In summary, there are 24 conditions for rises ~including four ‘‘0 semitone’’ conditions, which are considered as rises! and 20 conditions for falls. The resulting 88 conditions are summarized in Table I, together with the glissando threshold. Glissandos above and below the glissando threshold are considered in the experiments. B. Stimuli The stimuli were synthetic posterior vowels /a/, computed by a digital parallel formant synthesizer. The center frequencies, bandwidth and relative amplitudes of the first four formants (F1, F2, F3, F4) were typical values for a female speaker: F15(650 Hz, 80 Hz, 0 dB), F2 2341 200 ms J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 5(1100 Hz, 90 Hz, 28 dB), F35(2900 Hz, 140 Hz, 211 dB), F45(3300 Hz, 130 Hz, 220 dB). Synthetic voice stimuli were chosen because they were fully controlled, and because they were closer to natural sounds ~like speech and singing! than pure tones or square waves. The signals corresponding to all the experimental conditions were synthesized and stored on a computer disk. Signals were sampled at 16 kHz, using 16 bits per sample. Prestored examples were played at test time, following the procedure described below. The signal peak amplitude was the same for all the stimuli. Sound intensity variation within all the stimuli was less than 2.4 dB. This was small enough to avoid consideration of the effect of intensity changes on pitch perception @see Rossi ~1978b! for a discussion of interaction of amplitude and fundamental frequency glissandos in pitch perception#. C. Procedure Previous data published on the pitch perceived for shortduration F0 glissandos were generally obtained using a forced-choice procedure ~Rossi, 1971; ’t Hart et al., 1990!. However, we preferred a method of adjustment similar to the method used in a study of pitch perception for short-duration vibrato tones in d’Alessandro and Castellengo ~1994!. The method of adjustment has a number of advantages: ~1! each response of a subject gives an estimation of the pitch perceived; ~2! the method helps concentration, as the subject is actively engaged in the adjustment process; ~3! the data obtained are less variable than forced-choice data in frequency discrimination tasks, as shown by Wier et al. ~1976!. Subjects were asked to adjust the frequency of the flat tone so that it was equal to the frequency at the beginning or at the end of the glissando. To make an adjustment, the subject pointed to a place on a one-octave scale on the computer screen, pressed a button, and heard a stimulus pair. If she/he was satisfied with this pair, the subject reported a match. If she/he was not satisfied with this pair, the subject moved the mouse pointer on the scale and listened to the corresponding stimulus pair. Low frequencies were on the left of the scale, and high frequencies on the right. For each match, the subject had no time limit for giving an answer, and could play any stimulus pair for the same condition. Subjects typically listened to about ten stimulus pairs while manipulating the frequency of the flat tone before reporting a match. The subd’Alessandro et al.: Pitch of glissandos 2341 TABLE II. Experimental results. Means and standard deviations for the matches reported by the subjects, as a function of extent, duration, rise/fall, and begin/end ~see parts ABCD in the text!. Means and standard deviations are in Hz. Extent ST ~Hz! 50 ms 100 ms 200 ms 300 ms 221.261.0 226.268.7 229.869.1 241.1611.6 255.266.8 291.0613.7 221.161.0 227.069.5 231.4610.8 242.0612.7 256.565.3 292.0612.7 219.162.5 218.162.9 216.1610.7 213.3611.6 209.5620.0 208.2635.7 219.561.4 218.164.0 215.2610.5 212.4611.3 203.8619.4 201.8629.1 217.367.2 215.364.9 210.966.3 201.0612.2 193.7620.5 216.964.0 214.665.4 210.466.6 198.0610.0 193.5620.5 223.164.4 225.367.2 234.0611.4 251.6611.6 275.9622.0 222.563.0 226.765.2 235.066.7 254.369.5 282.0616.6 Part A 0.0~220.9! 0.8~215.0–225.1! 1.5~210.6–229.4! 3.0~201.7–239.9! 6.0~185.0–261.6! 12.0~155.6–311.1! 221.561.8 224.067.1 222.865.4 234.7615.5 255.1613.7 281.9612.4 221.161.0 225.566.9 227.7610.6 238.6612.4 257.069.7 283.7612.8 Part 0.0~220.9! 0.8~215.0–225.1! 1.5~210.6–229.4! 3.0~201.7–239.9! 6.0~185.0–261.6! 12.0~155.6–311.1! 218.163.6 219.062.0 215.569.9 213.9612.4 215.7624.3 215.9639.7 219.061.5 218.363.4 216.3610.3 216.0613.1 213.3625.4 213.8641.1 Part 0.8~215.0–225.1! 1.5~210.6–229.4! 3.0~201.7–239.9! 6.0~185.0–261.6! 12.0~155.6–311.1! 217.667.2 217.962.0 215.264.1 208.1615.9 214.1631.2 220.262.0 223.767.7 228.367.8 246.6616.5 267.4618.8 jects were asked to always give an answer. For each test, a fixed scale spanning one octave was proposed to the subjects for adjustment of the flat tone. This scale was divided into steps of 2.25 Hz. This step size corresponded to about 1.4 times the difference limens reported by Moore ~1973! for short-duration tones ~about 1.6 Hz for 50 ms at 250 Hz!. The signals were presented to both ears at a level of 80 dB SPL through Beyer DT48 headphones. The subjects participated in five test sessions divided into 20 min periods, with a comfortable pause between periods. A typical session lasted between 2 and 4 h. The first test session was considered as a training session, and was not used in the results. D. Subjects A group of eight musically trained subjects participated in the experiments. One of the subjects was one of the authors ~CdA!, and the others were music students who were paid for the experiments. Results are likely to be accurate with trained and selected subjects. However, interpretation of the results as representative of the general population may be questionable, because of the superior skills of our subjects in pitch discrimination. With 8 subjects and 4 sessions, 32 matches were made for each of the 88 experimental conditions. 2342 J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 C 217.168.3 217.562.6 213.565.1 205.7613.8 202.0622.0 Part 0.8~215.0–225.1! 1.5~210.6–229.4! 3.0~201.7–239.9! 6.0~185.0–261.6! 12.0~155.6–311.1! B D 221.162.8 224.165.3 231.169.5 250.3614.3 272.1618.9 II. RESULTS AND DISCUSSION A. Results Means and standard deviations averaged for all subjects are reported in Table II and plotted in Fig. 2. For each condition in Fig. 2, horizontal sides of the box represent duration and vertical sides of the box represent extent. The four empty circles near the corners of the boxes represent the means obtained for the four parts ABCD of the test. The coordinates of a circle represent the mean frequency ~along the y axis! reported by the subjects, and the time ~along the x axis! that corresponds to the instant when the glissando frequency was equal to the mean frequency. Thus the F0 rises can be represented in the boxes by a line between the bottom left corner and the top right corner. The falls can be represented in the boxes by a line between the top left corner and the bottom right corner. If the subjects were able to perfectly perceive the time-varying fundamental frequency, matches would place the four points in the four corners. Part A ~rise/ end! is generally in the right/top corner, part B ~rise/ beginning! is generally in the left/bottom corner, part C ~fall/ end! is generally in the right/bottom corner, and part D ~fall/ beginning! is generally in the left/top corner. Responses are always close to the extremity matched. This indicates that subjects did not take all the glissando into account, but rather made an average of the frequencies present near the extremd’Alessandro et al.: Pitch of glissandos 2342 ~1! the high register has more weight in perception ~because matches are closer to the higher extremity of the glissando!. ~2! the integration interval is shorter at the end and longer at the beginning ~because matches are closer to the extremity at the end than at the beginning!. FIG. 2. Experimental results. Means reported in Table II. Each circle represents the mean obtained for each condition. Each box corresponds to the four parts ABCD of the experiments. For part A, rises adjusted at the end, results are generally in the right and top corner. For part B, rises adjusted at the beginning, results are generally in the left and bottom corner. For part C, falls adjusted at the end, results are generally in the right and bottom corner. For part D, falls adjusted at the beginning, results are generally in the left and top corner. The underlined conditions represent infraliminar glissandos, glissandos that are below the glissando threshold ~they are marked ‘‘b’’ in Table I!. The other conditions represents supraliminar glissandos. ity. The underlined conditions in Fig. 2 represent infraliminar glissandos, glissandos that are below the glissando threshold ~they are marked ‘‘b’’ in Table I!. The other conditions represent supraliminar glissandos. For supraliminar glissandos, a ‘‘butterfly’’ pattern seems consistent across duration/extent conditions. Most butterflies share a similar form: part A is the nearest to the corner, and in several cases is matched even out of the box ~average differences between means and extremities for part A are: 6.35 Hz!; part B is the farthest from the corner ~B: 20.12 Hz!; point C is farther from the corner than point D ~C: 16.43 Hz, D: 12.74 Hz!. The parts can be classified in the order A,D,C,B as a function of their distances to the corresponding extremity. The butterfly pattern shows a possible asymmetry between rises and falls. Pitch for rises ~compared to falls! is closer to the end when matched at the end ~compare parts A and C!, and farther off the beginning when matched at the beginning ~compare parts B and D!. This might indicate different intervals for perceptual time averaging that may depend on the factor ‘‘rise/fall.’’ It seems that subjects are more influenced by the higher extremity than by the lower extremity. Typically, falls are adjusted closer to the beginning, i.e., the higher extremity ~compare C and D!, whereas rises are adjusted closer to the end, i.e., the higher extremity ~compare A and B!. Heuristic rules for drawing the butterfly patterns in Fig. 2 can be stated as follows: 2343 J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 Rule ~1! states that A,B and C.D. Rule ~2! states that A,D and B.C. Thus, one can deduce that A,D,C,B. The same tendency can be seen for standard deviations. The standard deviations reported in Table II are an indication of the consistency of the adjustments. Generally, standard deviations are smaller for adjustments at the end ~parts A and C!, and higher for adjustments at the beginning ~parts B and D!. This could indicate that it was easier to make consistent adjustments at the end. Also, standard deviations are smaller for adjustment of the higher extremity ~compare A and D, and B and C!. For the large extent conditions ~6 semitones and 12 semitones!, the standard deviations can be ordered as: A,D,C,B. Thus it seems that when adjustments are farther from the extremity matched, they are also less consistent. For the smaller extent conditions, it is difficult to interpret the differences in standard deviation. For the 0 semitone condition, the standard deviations are small when adjusted at the end. In this case, they are close to the difference limens for short tones ~Moore, 1973!. An analysis of variance was carried out with factors ‘‘duration’’ ~four values!, ‘‘rise/fall’’ ~two values!, ‘‘beginning/end’’ ~two values!, ‘‘extent’’ ~five values!, ‘‘session’’ ~four values!. The factor ‘‘duration’’ was not significant @F(3,560)520.055; p50.9830#. As a matter of fact, only a slight dependence on duration is noticeable in Fig. 2. Means followed a slightly rising or falling line as duration increased for a same extent, but this tendency was not significant. This showed that duration played only a minor role compared to other factors. No significant effect was found for the factor ‘‘beginning/end’’ @F(1,560)51.936; p 50.1646#; subject responses averaged for parts A and C were not different from the subject responses averaged for parts B and D. The effect of the factor ‘‘rise/fall’’ was significant @F(1,560)526.94; p,0.0001#. This indicates that the asymmetry observed in perception of rises and falls is significant. The factor ‘‘extent’’ was significant @F(4,560) 572.67; p,0.0001#. This was likely to occur, as for most conditions the extents were close to or above the glissando threshold. Interactions between ‘‘rise/fall’’ and ‘‘beginning/ end’’ were very strong @F(1,560)51092.80; p,0.0001# This indicates different behaviors for the different parts A, B, C and D. The interaction between ‘‘rise/fall’’ and ‘‘extent’’ was significant @F(4,560)53.07; p50.0162#. Again, this confirms an asymmetry in perception of rises and falls, which depends on extent. All other interactions were not significant. B. Asymmetry between higher and lower extremities Rule ~1! states that the higher extremity is ‘‘better’’ perceived than the lower extremity. ‘‘Better’’ perceived means that adjustments are closer to the higher extremity, when it is the extremity matched, compared to the lower extremity, d’Alessandro et al.: Pitch of glissandos 2343 again when it is the extremity matched. This experimental result suggests that the high register might be perceptually more important than the low register. Similar findings have already been reported in other works. A difference between rises and falls was noticed by Rump and Hermes ~1996! in an experiment on prominence in intonation. The stimuli were /mamama/ utterances, with fundamental frequency around 100 Hz. They found that high pitch levels contributed to prominence more than low pitch levels. Demany and McAnally ~1994! and Demany and Clément ~1995, 1997! conducted a number of studies on perception of wide frequency modulations. In these experiments, pure tones or complex tones were used, between 250 and 4000 Hz. They consistently found that frequency peaks are better perceived than frequency troughs. This effect seems robust against factors like the carrier signal, modulation waveform, frequency register and intensity. In a study on perception of contour tones, Hombert ~1975! asked subjects to adjust a static tone at the beginning of a rise or a fall. The results are very similar to ours: the beginning of a gliding tone is better perceived for falls than for rises. In other words, for a task rather similar to the present experiments, Hombert ~1975! found the same type of effect: The higher extremity is better perceived than the lower extremity. This indicates that rule ~1! might be a rather general effect for perception of supraliminar ~or wide! frequency modulations. C. Asymmetry between beginning and end The second type of asymmetry found in our data, rule ~2!, is an asymmetry between adjustments at the beginning and at the end of tones. This type of asymmetry has been reported in several studies of wide frequency modulations, as in formant transitions between consonants and vowels in speech. Brady et al. ~1961! suggest that a higher perceptual weight could be assigned to the most recent information. More recently, Porter et al. ~1991!, Wieringen and Pols ~1995!, Wieringen ~1995!, and Cullen et al. ~1992! reported differences in sensitivity between initial and final transitions of short formant glides. All these studies conclude that the end is better perceived than the beginning. This indicates that rule ~2! might be a rather general effect for perception of time-varying frequency modulation. D. Asymmetries between rises and falls The possible asymmetry in perception of rises and falls has been the subject of controversial discussion in many papers on tone or intonation perception. Dooley and Moore ~1988! discussed asymmetry in difference limens for glide detection. In a two-alternative, forced-choice procedure, they measured frequency difference limens ~DLF! for rises and falls. Rises and falls were presented randomly before and after a constant tone signal. Signals were merged, corresponding to parts A and B on the one hand and C and D on the other, and no effects between beginning and end were likely to occur. Pure tones were used, with center frequencies ranging from 500 to 8000 Hz, and durations ranging from 50 to 800 ms. Dooley and Moore ~1988! found that UP–DL ~DLF for rises! were consistently higher than DOWN–DL 2344 J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 ~DLF for falls!. Thus it seems that rises are more difficult to perceive than falls. The opposite effect is reported by Gardner and Wilson ~1979!. Dooley and Moore conclude that these differences are difficult to interpret. The differences between rises and falls went unnoticed in many previous works on the glissando threshold, as ’t Hart et al. ~1990! pointed out. They also noted the large dispersion of the data on the glissando threshold found in the literature. All these studies were for difference limens. In this case, it is difficult to conclude about possible asymmetries between rises and falls. On the contrary, in the case of wide and supraliminar frequency modulations, it seems that the difference between rises and falls is consistent in many experiments ~Hombert, 1975; Rump and Hermes, 1996; Demany and McAnally, 1994; Demany and Clément, 1995, 1997!. In this case, the combination of rules ~1! and ~2! can shed some light on the possible asymmetry in perception of rises and falls, as it is discussed above. E. Summary In summary: ~1! the mean frequency perceived for shortduration F0 glissandos is in the direction of the extremity matched; ~2! distances of adjustments to the extremity differ for rises and falls; ~3! adjustments are closer to the extremity for adjustments at the end, compared to adjustments at the beginning; ~4! adjustments are closer to the higher extremity compared to the lower extremity; ~5! ‘‘duration’’ plays a minor role; ~6! ‘‘extent’’ is highly significant. As a consequence, the four parts ABCD for a given condition of extent and duration have distinct behaviors, which can be seen in the asymmetric butterfly patterns of Fig. 2. Time averaging of F0 could explain pitch perception for F0 glissandos. It also seems that not all the glissando is taken into account for time averaging. These results are compatible with Rossi’s 2/3 rule. One can easily check that most of our experimental data are comprised between the second and the third third of the glissandos.2 However, this rule is not really a model, as it gives only an indication of where pitch is. It is of little use in the case of large extent. For example, if one considers the 12 semitone conditions, Rossi’s rule predicts that the pitch perceived is between the second and the third third of the glissando, and one third of the glissando represents more than 50 Hz. Of course the result is correct, but not very accurate. In the next section, a more accurate model is discussed. III. NUMERICAL MODEL A. Equation of the model A numerical model able to predict experimental data is searched for. The aim of this model is to predict the pitch perceived at the beginning and end of a given F0 glissando. The model should be able to give a response which is close to the subjects’ mean response. Both TA and WTA models are studied. The TA model integrates F0 over a time interval of the glissando. The integration interval is near the extremity matched. Assuming a linear glissando F05a t 1b, a TA model is d’Alessandro et al.: Pitch of glissandos 2344 p5 * t2 t1 ~ a t 1b ! d t * t2 t1 d t 5a t11t2 1b. 2 ~3! The model has only one parameter, d, defined as follows. Let d be the glissando duration. Then d is the length of the averaging interval for integration, expressed as a proportion of d. The time limits of integration are (t150, t25 d 3d! for an adjustment at the beginning, and (t15d2 d 3d, t25d! for an adjustment at the end. The TA model is computed as p~ d !5 H ad ~ 22 d ! 1b 2 ad d 1b 2 adjustment at the end ~4! adjustment at the beginning. Like the TA model, the WTA model integrates F0 on a time interval of the glissando, but F0 is modulated by a weighing function. The aim of this weighing function is to concentrate the integration of frequency on a part of the glissando. It is reasonable to assume that time plays a role in the averaging process, so more attention must be paid to the end in case of adjustment at the end, and to the beginning in case of adjustment at the beginning. This is the reason that the weighing function chosen is an exponential function. The model has two parameters. d represents the duration of the averaging interval expressed as a proportion of the tone duration, and a is the exponential factor: p~ a,d !5 5 * t2 t1 exp„a ~ t 2t2 ! …~ a t 1b ! d t ~5! Let d be the glissando duration. Then the time limits of integration are (t150, t25 d 3d! for an adjustment at the beginning, and (t15d2 d 3d, t25d! for an adjustment at the end. The WTA model is computed as: p~ a,d !5 5 a2a ~ d2 d d ! exp~ 2 a d d ! a 1b2 12exp~ 2 a d d ! a adjustment at the end, a add 1b2 12exp~ 2 a d d ! a ~6! adjustment at the beginning. B. Estimation of optimal parameters The TA model depends on one free parameter, and the WTA model depends on two free parameters. Optimal parameters of the models must be estimated. A classical means for estimation of optimal parameters is to minimize the rootmean-square ~rms! distance between the model response and the experimental data. Let m(n) be the experimental mean ~reported in Table II! for a particular condition n of factors ‘‘duration,’’ ‘‘extent,’’ ‘‘rise/fall,’’ and ‘‘beginning/end.’’ Let p d , a (n) be the response of a model with parameters d,a for condition n. The normalized rms distance is computed by 2345 J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 A( N n51 „p d , a ~ n ! 2m ~ n ! …2 , ~7! where n is between 1 and N, and where N is the number of conditions considered. The minimum rms distance between the model and data, using an iterative minimization procedure, gives the optimal parameters. However, the raw distance between the model and the data is not very informative by itself. A good way to gain further insight into the model quality is to evaluate the fit to the data, by computing the number Q of conditions for which the model response is not statistically different from the subject’s response: The larger the Q, the better the fit. Q is computed as follows. Let e(n) be the experimental standard deviation ~reported in Table II! for a particular condition n of factors ‘‘duration,’’ ‘‘extent,’’ ‘‘rise/fall,’’ and ‘‘beginning/end.’’ The statistical equivalence between this predicted value and the experimental data is computed by the Student t-test ~recall that 32 responses are available for each test condition!: t(n)5 @ „m(n)2 p d , a (n)…A31# /e(n). Using a level of significance of 0.05, the critical value is 2.03 for a two-tailed test with df531. The statistical equivalence between the predicted value and the experimental data for a condition n is computed using a function q d , a (n) defined by q d,a~ n ! 5 H 1, if 22.03,t ~ n ! ,2.03, 0, otherwise. ~8! The quality Q( d , a ) of the model with parameters d,a is the number of conditions for which the model and the data are statistically equivalent: * t2 t1 exp„a ~ t 2t2 ! …d t a at22at1 exp„a ~ t12t2 ! … 1b2 . 12exp„a ~ t12t2 ! … a 1 rms~ d , a ! 5 N N Q~ d,a !5 ( n51 q d,a~ n ! , ~9! n is between 1 and N, where N is the number of conditions considered. For instance, if the model is searched for all the conditions, then N580; but if it is searched only for rises matched at the end, then N520. The zero-semitone conditions are not considered, because every model will give the same answer in this case. Discussion of the results reported above suggests that asymmetries between beginning versus end, higher versus lower extremities, and rises versus falls have to be considered. This indicates that each part ABCD should be considered separately for modeling. Thus four sets of parameters are considered for modeling, one for each part ABCD, with N520 for each set. The optimization procedure described above is applied to each part. In addition, the optimization procedure is applied to the entire set of data ~coined ‘‘all data’’ condition!. Optimal parameters, and the corresponding number of conditions for which the predicted data are not significantly different from the experimental data, are reported in Table III. Results for each condition ~using four sets of parameters! are displayed in Fig. 3. An empty circle indicates a condition for which the model response is statistically different from the data. A filled circle indicates a condition for which the model response is not statistically different from experimental data. d’Alessandro et al.: Pitch of glissandos 2345 TABLE III. Optimal parameters for the TA and the WTAM models. ‘‘Q’’ represents the ‘‘quality’’ of the model: the number of conditions for which the predicted data are not significantly different from the experimental data ~see text!, and ‘‘#’’ represents the number of conditions for each part. ‘‘%’’ is the corresponding percentage (1003Q/#). ‘‘All data’’ means that a same parameter setting is used for all data. ‘‘ABCD’’ means that a different parameter setting is used for each part ABCD. TA d All data Part A Part B Part C Part D ABCD 0.48 0.18 0.69 0.59 0.45 ¯ Q ~#! 49 17 20 16 15 68 ~80! ~20! ~20! ~20! ~20! ~80! % 61 85 100 80 75 85 C. Discussion The TA and WTA models are able to predict 85% and 94% of the data, provided that the four parts ABCD are considered separately. The percentages are reduced to only 61% and 65% with the same parameter setting for all parts. This result is not surprising, given the asymmetries observed in the experimental data. Thus for modeling the four parts ABCD must be considered separately. In this case, good results are obtained for both models, though the WTA model performs better than the TA model. The optimal a and d parameters for the WTA model are different for the four parts of the test. The interpretation of d is straightforward: It represents the proportion of the glissan- FIG. 3. Weighted time-average models. Each circle represents the value obtained by four weighted time-average models corresponding to the four parts of the test. A filled circle represents a conditions which is not statistically different from the experimental data. The underlined conditions represent infraliminar glissanndos, glissandos that are below the glissando threshold ~they are marked ‘‘b’’ in Table I!. The other conditions represent supraliminar glissandos. 2346 J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 WTA d WTA a Q ~#! 0.54 0.26 0.82 0.75 0.56 ¯ 7.2 24.7 27.5 13.5 219.4 ¯ 52 17 20 20 18 75 ~80! ~20! ~20! ~20! ~20! ~80! % 65 85 100 100 90 94 dos over which integration occurs. Interpretation of a is twofold. On the one hand, the sign of a indicates if more attention was paid to the beginning or end of the glissando. A positive a indicates that a higher perceptual weight is assigned to the most recent information ~end!. A negative a indicates that a higher perceptual weight is assigned to the most distant information ~beginning!. On the other hand, the magnitude of a represents the amount of damping introduced in averaging. A large magnitude means that a higher perceptual weight is assigned to the information close to the extremity matched. A small magnitude means that all the parts of the time interval used for averaging have almost the same weight. The averaging interval ( d 50.26) is very short for part A: about 26% of the end of the glissando. The parameter a 524.7 is positive, which means that the end of the glissando has more weight. The averaging window length in part B ( d 50.82) is long: about 82% of the glide. The parameter a 527.5 is negative. This indicates that more attention was paid to the beginning of the glissando. The same remark holds for part D, where a 5219.4 is negative, though for part D the averaging window ( d 50.56) is shorter than for part B. For part C, the parameter a 513.5 is positive, which indicates that more attention was paid to the end of the glissando. The averaging window ( d 50.75) is long, about 75% of the tone duration. The same remarks hold for the TA model. The averaging window is shorter for part A ~18%!, longer for part B ~69%!. It is also long for parts C and D. The combination of the magnitude of a and d can be used for computing the distance between the model responses and the extremity matched. Recall that in the results section, it was found that this distance is ordered as A,D,C,B for the experimental means. The constant a magnitude represents the damping factor. When it is high, damping is high and then the model response is close to the extremity. The distance of the model response and the extremity can be ordered as follows: A,D,C,B because the corresponding a are 24.7~A!.19.4~D!.13.5~C!.7.5~B!. The parameter d represents the proportion of the glissando used for integration. When d is small, the model response is close to the extremity. We obtain the same order: A,D,C,B, because 0.26~A!,0.56~D!,0.75~C!,0.82~B!. Therefore, the optimal parameters obtained for the WTA model are in good agreement with the heuristic rules derived d’Alessandro et al.: Pitch of glissandos 2346 from the experimental results: The same order is found for the four parts ABCD of the test. In summary, it seems that both the TA and the WTA models are able to predict a significant amount of the experimental data. However, the WTA model performs better than the TA model. This is not surprising, since the WTA model has two parameters compared to only one parameter for the TA model. The model’s parameters give some insight into the experimental data, and are compatible with the butterfly patterns found in the experimental data. It must be pointed out that it is necessary to know the glissando direction and whether the match is at the beginning or end in order to select a set of parameters. This means that the WTA model in itself is not able to explain the asymmetries, but must be adapted to each situation. In this case it is able to predict the pitch perceived. The WTA model is a functional model, and not a model of the physiological or psychological processes that take place in pitch perception. It may help to compute the pitch perceived for an F0 glissando, but is not able to explain how this pitch is perceived. In particular, the explanation of experimental asymmetries in pitch perception is still unclear. IV. CONCLUSION In this study, we studied perception of short-duration linear F0 glissandos. Perception of rising and falling glissandos, matched at the beginning and at the end, was measured. Significant differences between the four main parts ABCD of the experiments ~rises adjusted at the end, rises adjusted at the beginning, falls adjusted at the end, falls adjusted at the beginning! were found. This indicates that there is a difference between perception of rises and falls and perception of the beginning and the end of tones. This might be explained by a greater perceptual weight for the high register, compared to the low register. The experimental results showed that some sort of averaging of the F0 contours was performed by the subjects. Frequencies near the extremity matched ~beginning or end! seemed more important. Simple numerical models for prediction of the experimental data were tried: a time-average model and a weighted timeaverage model of pitch perception. Optimal parameters for these models were estimated using a statistical distance criterion. The models performed well in predicting experimental data: 94% of the data computed with the best WTA models were not significantly different from experimental data. The optimal model’s parameter settings varied significantly among parts ABCD ~rises versus falls, adjustment at the end versus adjustment at the beginning!. This was consistent with the finding of significant differences in the perception of the four parts. It can be concluded from our experiments that pitch perception for short-duration F0 glissandos corresponds to a weighted time average of F0. The WTA model proposed in the present study is close to the WTA model that was proposed for perception of short-duration vibrato tones in d’Alessandro and Castellengo ~1994!. However, the parameters obtained in both studies are different, as in this study they are also different among different parts. 2347 J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 ACKNOWLEDGMENTS We are grateful to John P. Madden and Adrian Houtsma for their detailed constructive comments on an earlier version of this article, and to the editorial assistance and criticisms provided by D. Wesley Grantham. Thanks also to Olivier Piot, who helped in the experimental data collection. 1 Mario Rossi, not to be confused with Jean-Pierre Rossi, one of the authors. Rossi’s rule was stated for adjustment at the end of stimuli. The same type of rule can be formulated in the case of adjustment at the beginning, as follows: For dynamic tones in a vowel, the pitch perceived at the beginning corresponds to a point between the first and the second third of the vowel. 2 Brady, P. T., House, A. S., and Stevens, K. N. ~1961!. ‘‘Perception of sounds characterized by a rapidly changing resonant frequency,’’ J. Acoust. Soc. Am. 33, 1357–1362. Cullen, J. K., Houtsma, A. M. J., and Collier, R. ~1992!. ‘‘Discrimination of brief tone-glides with high rate of frequency changes,’’ Abstract of the Fifteenth Midwinter Meeting of the Association for Research in Otolaryngology, 67. d’Alessandro, C., and Castellengo, M. ~1994!. ‘‘The pitch of short duration vibrato tones,’’ J. Acoust. Soc. Am. 95, 1617–1630. d’Alessandro, C., and Mertens, P. ~1995!. ‘‘Automatic pitch contour stylization using a model of tonal perception,’’ Comput. Speech Lang. 9, 257– 288. Dooley, G. J., and Moore, B. C. J. ~1988!. ‘‘Detection of linear frequency glides as a function of frequency and duration,’’ J. Acoust. Soc. Am. 84, 2045–2057. Demany, L., and Clément, S. ~1995!. ‘‘The perception of frequency peaks and troughs in wide frequency modulation. III. Complex carriers,’’ J. Acoust. Soc. Am. 98, 2515–2523. Demany, L., and Clément, S. ~1997!. ‘‘The perception of frequency peaks and troughs in wide frequency modulation. IV. Effects of modulation waveform,’’ J. Acoust. Soc. Am. 102, 2935–2944. Demany, L., and McAnally, K. I. ~1994!. ‘‘The perception of frequency peaks and troughs in wide frequency modulation,’’ J. Acoust. Soc. Am. 96, 706–715. Gardner, R. B., and Wilson, J. P. ~1979!. ‘‘Evidence for direction-specific channels in the processing of frequency modulations,’’ J. Acoust. Soc. Am. 66, 704–709. Hermes, D. J. ~1993!. ‘‘Pitch analysis,’’ in Visual Representation of Speech Signals, edited by M. Cooke, S. Beet, and M. Crawford ~Wiley, Chichester, UK!. Hess, W. ~1983!. Pitch Determination of Speech Signals. Algorithms and Devices ~Springer-Verlag, Berlin!. Hombert, J. M. ~1975!. ‘‘Perception of contour tones: an experimental study,’’ Proc. 1st Annu. Meet. Berkeley Ling. Soc., pp. 221–232. House, D. ~1990!. Tonal Perception in Speech ~Lund U.P., Lund, Sweden!. Klatt, D. H. ~1973!. ‘‘Discrimination of fundamental frequency contours in synthetic speech: Implications for models of pitch perception,’’ J. Acoust. Soc. Am. 53, 8–16. Mertens, P. ~1987!. ‘‘L’intonation du français. De la description linguistique á la reconnaissance automatique,’’ Ph.D. thesis, Catholic University of Leuven, Belgium. Moore, B. C. J. ~1973!. ‘‘Frequency difference limens for short-duration tones,’’ J. Acoust. Soc. Am. 54, 610–619. Nabelek, I. V., Nabelek, A. K., and Hirsh, I. J. ~1970!. ‘‘Pitch of tone bursts of changing frequency,’’ J. Acoust. Soc. Am. 48, 536–553. Porter, R. J., Cullen, J. K., Collins, M. J., and Jackson, D. F. ~1991!. ‘‘Discrimination of formant transition onset frequency: Psychoacoustic cues at short, moderate, and long durations,’’ J. Acoust. Soc. Am. 90, 1298–1308. Ross, J. ~1987!. ‘‘The pitch of glide-like F0 curves in Votic folk songs,’’ Proceedings of XIth International Congress of Phonetic Sciences, XIth ICPhS, Tallin, USSR, pp. 174–177. Rossi, M. ~1971!. ‘‘Le seuil de glissando ou seuil de perception des variations tonales pour les sons de la parole,’’ Phonetica 23, 1–33. Rossi, M. ~1978!. ‘‘La perception des glissando descendants dans les contours prosodiques,’’ Phonetica 35, 11–40. d’Alessandro et al.: Pitch of glissandos 2347 Rossi, M. ~1978!. ‘‘Interaction of intensity glides and frequency glissandos,’’ Language and Speech 21, 384–396. Rump, H. H., and Hermes, D. J. ~1996!. ‘‘Prominence lent by rising and falling pitch movements: Testing two models,’’ J. Acoust. Soc. Am. 100, 1122–1131. Schouten, H. E. M. ~1985!. ‘‘Identification and discrimination of sweep tones,’’ Percept. Psychophys. 37, 369–376. Sergeant, R. L., and Harris, J. D. ~1962!. ‘‘Sensitivity to unidirectional frequency modulation,’’ J. Acoust. Soc. Am. 34, 1625–1628. ’t Hart, J. ~1976!. ‘‘Psychoacoustic backgrounds of pitch contour stylization,’’ IPO Ann. Prog. Rep. 11, Eindhoven, The Netherlands, 11–19. 2348 J. Acoust. Soc. Am., Vol. 104, No. 4, October 1998 ’t Hart, J., Collier, R., and Cohen, A. ~1990!. A Perceptual Study of Intonation ~Cambridge U.P., London!. van Wieringen, A. ~1995!. ‘‘Perceiving dynamic speechlike sounds,’’ Ph.D. thesis, Amsterdam University. van Wieringen, A., and Pols, L. C. V. ~1995!. ‘‘Discrimination of single and complex consonant-vowel- and vowel-consonant-like formant transitions,’’ J. Acoust. Soc. Am. 98, 1304–1312. Wier, C. C., Jesteadt, W., and Green, D. M. ~1976!. ‘‘A comparison of method-of-adjustment and forced-choice procedures in frequency discrimination,’’ Percept. Psychophys. 19, 75–79. d’Alessandro et al.: Pitch of glissandos 2348
© Copyright 2026 Paperzz