Full Paper

ITRW on
Speech and Emotion
Newcastle, Northern Ireland, UK
September 5-7, 2000
ISCA Archive
http://www.isca-speech.org/archive
ATTITUDES AND YES-NO QUESTIONS IN STANDARD FRENCH:
TESTING TWO HYPOTHESES
Olivier Piot
Institut de Phonétique de l'Université Paris III-La Sorbonne Nouvelle,
CNRS UPRESA 7018, FRANCE
e-mail: [email protected]
ABSTRACT
Two hypotheses are tested here: one concerns the use of F0 to
convey an attitude of ignorance, the other the expression
of motivation by means of F0, vocal intensity and speech rate.
Sixteen Klatt-synthesised prosodic variations of the yes-no
question « Natacha? » (in French) are used for this test. Results
show that greater ignorance is expressed by higher final F0 (and
not by speech rate, final /a/ initial F0 or intensity rise), and greater
desire to know by both higher speech rate and higher final
vowel pitch register. Both hypotheses are satisfactorily verified,
and this experiment allows them to be stated more precisely,
so it may contribute to the theoretical study of the expression of
emotions and attitudes by prosody.
INTRODUCTION
There are many studies on the expression of emotions by
prosody, and also many attempts at understanding the linguistic
role of intonation: the former generally take into account
only average prosodic values (average pitch, loudness, speech
rate, etc.), while the latter very often merely mention the
paralinguistic aspect of prosody. Few of them, however,
consider the interplay between pitch contours and attitudes or
emotions. The aim of this study is to test two hypotheses
regarding the expression of attitudes and emotions by prosody,
applied to a yes-no question rising pitch contour in Standard
French: the first is derived from the issue of size-sound
symbolism, the second comes from the classical concept of
arousal.
1. FIRST HYPOTHESIS
In the field of the expression of attitudes in speech, there are
few theoretical proposals trying to give a biologically based
explanation for some prosodic phenomena which are, however,
broadly believed to be of a universal kind. One of them comes
from Ohala (Ohala, 1984), and is named by him the "frequency
code" hypothesis: based on ethological studies by Morton
(1977), it claims that man and animals use a lower pitch as a
way to convey « the primary meaning of ‘large vocalizer’ and
such secondary meanings as ‘dominant, aggressive, threatening,
etc.’ », and conversely a higher pitch to convey « the primary
meaning of ‘small vocalizer’ and such secondary meanings as
‘subordinate, submissive, non threatening, desirous of the
receiver’s goodwill, etc.’ ».
Thus the often-reported cross-language similarities in the
intonation contours of yes-no questions¹ (cf. Bolinger, 1978,
who reported that about 70% of a sample of nearly 250
languages were said to use a rising terminal to signal questions,
and that the remaining 30% use a higher overall pitch for
questions than for non-questions) are explained as a
consequence of the universal use of high pitch to express one’s
« desire of the receiver’s goodwill ». More precisely, questioning
means asking someone for information, and thus raising pitch is
a way of doing so with some well-adapted politeness: expressing
inferiority (and therefore a need for assistance) on this one
particular point allows the speaker to make it (more) acceptable
for the receiver to provide information without any comparable
« reward » in return.
Recently we proposed (Piot, 1999) another psychologically
grounded explanation for the same linguistic data, based on an
intra-individual axis instead of the inter-species one proposed
by Ohala. It is a widely accepted and well-documented fact that the
average pitch of each individual decreases from birth to
adulthood, mainly because of the growth in length and weight of
the vocal folds, and that it stabilises at the age of about thirty
(cf. for instance Titze, 1993). This age axis thus constitutes an
intra-individual scale that we believe is « symbolically » at stake
in the final rise of yes-no questions, to express a feeling of low
« maturity » on one particular point.
This view takes subjective experience as a basic morphogenetic
principle, rather than inferred knowledge of the acoustico-physical
properties of vibrating folds, as is implied by Ohala’s
proposal. Ohala’s view is the more difficult of the two to defend, as
body size alone doesn’t allow one to predict with any accuracy the
pitch of animals across species, or even within a species.
By contrast, the relationship between age and pitch doesn’t
suffer from any such disparity, and the curves of men and
women show great similarities (Titze, 1993). This is consistent
with the observation that pitch excursions in general (and those
of yes-no questions in particular) are described as independent
of the average pitch of the speaker, while Ohala’s
hypothesis would predict similar pitch targets, hence greater
excursions (toward the high pitch values) for men than for
women.
As a consequence, what we propose is that high pitch basically
conveys a feeling of ignorance or helplessness. Hence in the
case of a yes-no question, a greater final rise in pitch is expected
to convey greater ignorance. To be more precise, great
ignorance means having no clue for guessing what the answer
is, while low ignorance means having an opinion. In his paper,
Ohala describes two experiments showing that both a lower
pitch register and a sharper terminal pitch fall are judged « more
dominant » (Ohala, 1984). These results are in agreement with
our proposal, but they just provide an evaluation of a gross
psychological parameter, not of the pragmatic use of pitch for
the purpose of communication. We deal with this aspect in the
experiment presented here.
¹ i.e. questions whose answer is basically (and may be limited
to) yes or no.
2. SECOND HYPOTHESIS
The concept of arousal, coming from research on emotions, is
now widely considered to be the primary parameter in the field of
their expression in speech. It can be glossed as the automatic
preparation of the body for action in response to « pregnant »
stimuli (i.e. stimuli of importance for the survival of the
individual and his kind). Among other autonomic responses
(increase in blood pressure, respiratory activity, etc.), one of its
main skeletal motor consequences is an increase in tonus. It is
involved in the vocal expression of anger or joy for instance,
increasing the mean values of pitch, intensity, speech rate, pitch
range, etc. (cf. Scherer, 1986).
This idea was already suggested by Aristotle, who claimed that
fear and attraction were the two basic emotions at the source of
movement. But Darwin is undoubtedly the first author to put it
into a clear and deterministic form, that of his third principle
(Darwin, 1872): the excitation of the sensory system, due to a
painful sensation or to the perception of a pregnant stimulus, is
redirected to the motor system in a diffuse and undifferentiated
way.
The motor aspect of arousal is described, in its rapidly reacting
component, as a tonic response of the somatic nervous system
(Scherer, 1986). Arousal thus facilitates movement, and therefore
tends to increase speech rate (greater articulatory force means
greater acceleration of the articulators, and so a greater overall
speed of articulation), vocal loudness (through an increase in the
subglottal pressure Ps, thanks to the increase in respiratory and
laryngeal activity, cf. Titze, 1993), and pitch (an increase in
global laryngeal activity, or even in Ps alone, raises pitch, cf.
Farley, 1994). This doesn’t mean, of course, that greater arousal
necessarily increases all three prosodic parameters, because
there is a process of control behind the realisation of any motor
program (cf. Fonagy, 1983, who provides examples of high-arousal
emotions such as anger expressed with lower pitch and
loudness through active laryngeal constriction), even though
tonus itself may escape voluntary control. But an increase
in one or several of these parameters constitutes a possible index
of greater arousal, and thus may appear (either spontaneously
or through mimicry) in the expression of communicative
attitudes such as a desire to get information on an important
topic. This is also what we tested in our experiment.
3. YES-NO QUESTIONS IN FRENCH
All stimuli used in this study are yes-no question rising
contours. The rising contour is the most common and unambiguous
one (see Grundstrom, 1973) for communicating that a « syntactic
assertion » is in fact a question (e.g. in English: he is gone?). It
has been shown that the only important feature for this contour
to be perceived as a question is a high rise on the last syllable
(see for instance Faure, 1973), and that before it pitch may
be rising, falling or staying at a constant value. We chose
the last of these (see figure 1), which may be considered as
unmarked, for the following reasons: 1) it is one of the most
studied, simple and unambiguous contours in Standard French,
2) both our hypotheses can be tested in a simple manner on this
contour, by modifying either speech rate or final vowel loudness
and pitch contours, 3) a great many intonation languages use a
final syllable pitch rise for yes-no questions (see part 1, this
paper), 4) this latter observation is certainly the most immediate
issue to be accounted for by theories of a universal meaning of
intonation, 5) ignorance and desire to know are two important
psychological features of the speaker’s situation with regard to the
subject of his questioning. Applied to the one-word phrase
« Natacha » (see below for a justification of this choice), it
could have the meaning « Is it Natacha that you’re talking
about? », or « Who has won this time, Natacha? » when the
context and centre of interest is clearly the counting of points at
the end of a game.
4. EXPERIMENTAL PROCEDURE
We thus limited ourselves here to synthesising one particular
type of yes-no question contour: the pitch stays at a flat level
until the last syllable, where it rises sharply. We saw in part 3
above that this is the most representative contour for yes-no
questions in French, and that it is particularly well suited to
testing both our hypotheses. We used the « Compost » software,
an ergonomic and high-quality Klatt synthesiser, which is the
only known way to modify one speech parameter while leaving
all the others unchanged. We used the one-word phrase
« Natacha », which was chosen for the following reasons: 1) 3
syllables are enough for the variations of all three prosodic
parameters to be clearly perceived (which is confirmed by the
results), 2) the use of 3 /a/ vowels neutralises intrinsic pitch and
intensity effects, 3) the use of a small number of syllables reduces
the non-informative input to the subjects’ ears, while allowing them
to listen to the stimuli several times in a shorter space of time.
Figure 1 shows the method and values we chose for the
synthesis of our stimuli²: they are based on the cross-variation
of two different speech rates (the lower of the two corresponds
to the initial synthesis, which was made after a recording at a
rather low speech rate), two different loudness contours on
« cha » (the lower of the two corresponds to an unmarked yes-no
question, see Fonagy & Bérard, 1973, p. 78, and the higher one
to a version perceived as emphatic), and four final pitch
contours: the four values 154, 192, 240 and 300 Hz constitute a
geometric progression, which means that they are perceptually
equally spaced. This implies here only two different
final /a/ pitch rises from a perceptual point of view: one going
from 154 to 192 and from 192 to 240, and a steeper one going
from 154 to 240 and from 192 to 300. This was designed in order to
allow the comparison of as many influences as possible: that of final
syllable mean pitch (the perceptual steepness of the rise being
held constant), initial value of the rise (with a constant final
value), final value of the rise (with a constant initial value), and
hence steepness of the rise (with a constant initial or final
value). In a preliminary experiment, all 16 stimuli sounded like
possible yes-no questions to our subjects, which was later
confirmed during the main experiment (see also for example
Faure, 1973, p. 14, for an earlier experimental justification).
² This choice is based on recordings: we recorded numerous
realisations of the yes-no question « Natacha? » as described
in the text, varying all three prosodic values while remaining
within the scope of a perceived yes-no question.
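The statement that 154, 192, 240 and 300 Hz form a geometric progression can be checked with a few lines of Python. This is only an illustrative sketch: the conversion to semitones (12 times the base-2 logarithm of the frequency ratio) is a standard measure that the paper itself does not use, and the function name `step_sizes` is hypothetical.

```python
import math

# Final-syllable pitch values used for the stimuli (Hz), taken from the text.
PITCH_VALUES_HZ = [154, 192, 240, 300]


def step_sizes(values):
    """Return (ratio, semitones) for each pair of successive values."""
    steps = []
    for low, high in zip(values, values[1:]):
        ratio = high / low
        semitones = 12 * math.log2(ratio)  # standard semitone conversion
        steps.append((ratio, semitones))
    return steps


for ratio, semitones in step_sizes(PITCH_VALUES_HZ):
    print(f"ratio = {ratio:.3f}, step = {semitones:.2f} semitones")
```

Each step corresponds to a ratio of about 1.25, i.e. a little under four semitones, so on a logarithmic scale the two-step rises (154 to 240 Hz, 192 to 300 Hz) are about twice as large as the one-step rises.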
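[Figure 1: schematic of the synthesis parameters. Pitch: 123 Hz
before the final syllable, then a rise on the final /a/ starting at
154 or 192 Hz and ending at 192, 240 or 300 Hz. Vocal intensity:
levels labelled + 2 dB, + 4 dB and + 7 dB. Speech rate: original
phoneme durations, or durations multiplied by 0.8.]
figure 1. Prosodic cross-variation of pitch, loudness and speech
rate, used to synthesise 16 yes-no question stimuli on the one-word
phrase « Natacha? ».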
The output of the synthesiser is at a 5 kHz - 16 bits format, so
the stimuli were bandpass filtered between 75 and 5700 Hz. Silences
of 200 and 0 milliseconds were placed at the beginning and
end (resp.) of each file, the 200 ms ensuring that the mouse click
didn’t interfere with the hearing of the stimuli, and the final 0
ms allowing Ss to listen to the stimuli again as quickly as they
liked. The experiment was designed as a HyperCard stack³,
containing general instructions on the first card, followed by 8
test cards (4 for each of the two tested hypotheses) for
preliminary practice (ensuring that the instructions were clearly
understood, and displaying the entire range of prosodic
variations appearing in the stimuli); then came two series (a)
and (b) (one for each tested hypothesis) of 32 test cards, each
series being made of two successive randomised series of all 16
stimuli, and being preceded by one additional instruction card.
³ A stack is a series of cards following one another.
Each test card contained one stimulus (which could be listened
to one or several times, by pressing a button), one attitude, and a
popup box for the ratings. In part (a) of the experiment the
attitude was « ignorance EXPRIMEE » (EXPRESSED
ignorance), and in part (b) it was « envie de savoir
EXPRIMEE » (EXPRESSED desire to know). The word
« exprimée » was highlighted because we wanted the subjects to
make their judgement only on what was directly expressed, and
so to prevent them from making any inference on the speaker’s
thoughts. For instance we had to avoid such underlying
interpretations as « the speaker is restraining his anger »: the
prosody was the target of the judgement. In both parts (a) and
(b) the rating choice was between « très grande » (very high),
« grande » (high), « moyenne » (medium), and « faible » (low).
This « literal » judgement was found, in a preliminary
experiment, to be easier for the subjects to use than the numerical
one (rating from 1 to 4). The lowest rating level was
chosen not to be « absent » (or 0), because both attitudes are
implicitly present in this particular context of a yes-no question.
Each additional instruction card had two roles: the first was to
indicate that part (a) or part (b) was about to start, and the
second was to warn the subjects against the usual positive
correlation between ignorance and desire to know, by giving
them examples where the former is high while the latter is low,
and vice versa. The division of the experiment into two
successive parts was made to facilitate both the experimental task
and the consistency of the judgements, by favouring the
subjects’ elaboration of their own rating scale. They could go
back to the additional instruction card whenever they wanted to,
by just pressing a button. On each card of each part, a number
(decrementing from 32 to 1) showed them their progress.
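The presentation order described above (for each part, two successive randomised series of all 16 stimuli, i.e. 32 test cards) can be sketched as follows. The original experiment was a HyperCard stack; this Python version is only an illustration, and the function name, the `seed` parameter and the stimulus labels are assumptions.

```python
import random


def build_part_order(stimuli, seed=None):
    """Return one part's card order: two successive shuffles of all stimuli."""
    rng = random.Random(seed)  # seed is an assumption, for reproducibility
    order = []
    for _ in range(2):  # two successive randomised series
        series = list(stimuli)
        rng.shuffle(series)
        order.extend(series)
    return order


# Hypothetical labels standing in for the 16 synthesised stimuli.
stimuli = [f"stimulus_{i:02d}" for i in range(1, 17)]
part_a = build_part_order(stimuli, seed=1)
print(len(part_a))  # 32 test cards
```

Shuffling each series separately guarantees that every stimulus is rated once in the first 16 cards and once in the last 16, which a single shuffle of 32 items would not.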
The experiment took place in an anechoic room, using
professional-quality earphones. The loudness (vocal intensity) at
the subjects’ ears was between 65 and 72 dBA,
which provided comfortable listening without causing
noticeable auditory fatigue, especially as the experiment was
rather short (between 10 and 20 minutes). The
subjects were 20 students in linguistics, all native speakers
of Standard French. They were between 18 and 25 years old; 14
were female and 6 male.
5. RESULTS AND DISCUSSION
The results, averaged over all 20 Ss, are shown in table 1 for
part (a) and in table 2 for part (b), where the « très grande » (very
high), « grande » (high), « moyenne » (medium) and « faible »
(low) ratings were converted into 4, 3, 2 and 1 (resp.). The
stimuli are coded in the following way: the first letter stands for the
speech rate (R: rapid, L: slow), and the following digits stand,
from left to right, for final /a/ initial pitch (1: 154 Hz, 2: 192
Hz), final pitch (2: 192 Hz, 3: 240 Hz, 4: 300 Hz), and final
loudness (0: rise from 0 to 2 dB, 1: rise from 4 to 7 dB).
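The coding scheme just described can be made explicit with a short Python sketch that enumerates the 16 stimulus codes and decodes one of them. The dictionary and function names are hypothetical; the values are those given in the text.

```python
from itertools import product

RATES = {"L": "slow", "R": "rapid"}
INITIAL_PITCH_HZ = {"1": 154, "2": 192}
FINAL_PITCH_HZ = {"2": 192, "3": 240, "4": 300}
LOUDNESS_RISE_DB = {"0": (0, 2), "1": (4, 7)}
# The four pitch contours actually used (initial digit, final digit).
CONTOURS = ["12", "13", "23", "24"]


def all_codes():
    """Enumerate the 16 stimulus codes: 2 rates x 4 contours x 2 loudness."""
    return [r + c + l for r, c, l in product(RATES, CONTOURS, LOUDNESS_RISE_DB)]


def decode(code):
    """Split a code such as 'R241' into its four prosodic settings."""
    rate, initial, final, loudness = code
    return {
        "speech_rate": RATES[rate],
        "initial_pitch_hz": INITIAL_PITCH_HZ[initial],
        "final_pitch_hz": FINAL_PITCH_HZ[final],
        "loudness_rise_db": LOUDNESS_RISE_DB[loudness],
    }


print(all_codes())
print(decode("R241"))
```

For instance, R241 is the rapid stimulus whose final /a/ rises from 192 to 300 Hz with the loudness rise from 4 to 7 dB.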
Part (a): « ignorance exprimée » (expressed ignorance)

stimulus    mean value    st. dev.
L120        1.88          1.02
L121        1.88          0.99
L130        2.20          0.72
L131        2.25          0.71
L230        2.35          0.74
L231        2.15          0.77
L240        3.10          0.78
L241        3.20          0.65
R120        1.88          0.82
R121        1.85          0.74
R130        2.23          0.73
R131        2.38          0.81
R230        2.15          0.74
R231        2.60          0.87
R240        3.10          0.84
R241        3.18          0.93

table 1: Mean ratings and standard deviation (expressed
ignorance) for each yes-no question stimulus, averaged on
twice-repeated ratings from 20 Ss (for a description of the
stimuli’s naming, see part 5 above). Rating choice was between
1 (lowest notation), 2, 3 and 4 (highest notation).

Part (b): « envie de savoir exprimée » (expressed desire to know)

stimulus    mean value    st. dev.
L120        1.40          0.55
L121        1.48          0.64
L130        2.05          0.68
L131        2.08          0.69
L230        1.95          0.64
L231        2.20          0.72
L240        3.23          0.77
L241        3.38          0.67
R120        1.83          0.75
R121        1.98          0.77
R130        2.33          0.73
R131        2.58          0.71
R230        2.63          0.74
R231        2.83          0.71
R240        3.50          0.55
R241        3.75          0.44

table 2: Mean ratings and standard deviation (expressed desire
to know) for each yes-no question stimulus, averaged on
twice-repeated ratings from 20 Ss (for a description of the stimuli’s
naming, see part 5). Rating choice was between 1 (lowest
notation), 2, 3 and 4 (highest notation).
Part (a): « ignorance EXPRIMEE » (expressed ignorance)
For all our statistical comparisons we used the two-tailed
paired t-test. We didn’t find any influence of
speech rate, either globally (p=0.44) or considering each of the
8 R/L pairs of stimuli (p>0.23, except for L/R231: p=0.012). As
we have p=0.23 for L/R230, which sounds very much like
L/R231, we decided not to give too much importance to the
L/R231 exception, and used p=0.01 as the threshold value, i.e.
we considered only « very significant » results.
Therefore we grouped together the results of R/L stimuli pairs.
In the same way, no influence of the loudness contour could be
found, either globally (p=0.14) or considering each of the 4
pitch contours (p>0.28). We therefore constituted 4 groups of 4
stimuli, each group differing from the others by its pitch
contour. We then couldn’t find any influence of final /a/
initial pitch, by comparing the « 13 » and « 23 » pitch contours
(which have the same final pitch value): p=0.50. But the influence
of final pitch is highly significant, stimuli being judged to
express more ignorance when final pitch is higher (p=1.16E-06
for « 12 » vs. « 13 », and p=7.83E-20 for « 23 » vs. « 24 »).
Part (b): « envie de savoir EXPRIMEE » (expressed desire to know)
Loudness has a positive influence on all mean values (see table
2), but none of the individual comparisons is very significant
(p>0.02), while the global comparison is (p=3.13E-04). This
suggests that loudness could have a small influence, which would
need more Ss to appear as significant.
We nevertheless grouped together the results of all pairs of stimuli
differing by their intensity contour. The global influence of
speech rate was high (p=6.87E-16), and was attested in all 4
cases (p<2E-03). Final /a/ initial pitch had no influence at low
speech rate (p=0.90), but had a slight one at high speech rate
(p=9.66E-03): a stimulus whose final rise begins higher
was judged to express a little more desire to know at high
speech rate (2.33 and 2.58 vs. 2.63 and 2.83 resp.). This
small effect could perhaps be explained by the need for a fast
increase in laryngeal tension, at high speech rate, in order to go
from 123 Hz before the « ch » sound to 192 Hz immediately after it.
This fast increase could therefore be perceived as an indication of
higher arousal. The influence of final pitch was highly
significant, both globally (p=4.43E-46) and for each of the 4
pairs (p<1E-06).
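The comparisons reported above rely on the paired t-test. As a minimal sketch of how its statistic is obtained, the following Python code computes t and the degrees of freedom from two lists of paired ratings. The ratings are hypothetical values invented for the illustration (they are not the experimental data), and the function name `paired_t` is an assumption; the two-tailed p-value would then be read from Student's t distribution, for instance with SciPy's `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev


def paired_t(ratings_x, ratings_y):
    """Return (t, degrees of freedom) for a paired comparison."""
    if len(ratings_x) != len(ratings_y):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(ratings_x, ratings_y)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # sample st. dev. (n - 1)
    return mean(diffs) / standard_error, n - 1


# Hypothetical ratings (NOT the experimental data), on the 1-4 scale.
higher_final_pitch = [2, 3, 3, 4, 2, 3]
lower_final_pitch = [1, 2, 3, 3, 2, 2]
t, df = paired_t(higher_final_pitch, lower_final_pitch)
print(f"t = {t:.3f}, df = {df}")
```

The test is paired because each subject rated both members of every stimulus pair, so only the within-subject differences enter the statistic.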
As predicted, the final pitch value has a great influence on
ratings for « ignorance EXPRIMEE ». Final /a/ average pitch
cannot represent this tendency as well as final pitch value,
because final /a/ initial pitch alone has no influence on the
ratings. The purpose of the additional instructions, that is the
dissociation between ignorance and desire to know, has been
successfully achieved: parts (a) and (b) show very
different results. Thus vocal intensity has a small positive effect
on « envie de savoir EXPRIMEE », although it is significant
only in a global comparison. Speech rate also has an influence,
but a greater one (with the prosodic variations we used in this
experiment), and it is always highly significant. The same is true of
the influence of final pitch but, given the influence of final /a/
initial pitch at high speech rate, the question is which
of final pitch or final /a/ pitch register best accounts
for this tendency. If we further compare the « 12 » vs. « 23 »
and the « 13 » vs. « 24 » pitch contours, for both high and low
speech rate, we find that the positive influence of final /a/ pitch
register is even more significant than that of final pitch,
both globally (p=2.90E-57, to be compared to p=4.43E-46)
and for each of the 4 pairs (p<1E-10, to be compared to
p<1E-06), while mean ratings show a comparable effect for « L »
stimuli, but a greater effect of final /a/ pitch register for « R »
ones⁴. It could be that a high speech rate reveals, maybe for the
reason we suggested above, the critical influence of final /a/
initial pitch, and hence that final /a/ pitch register and speech
rate are the two key parameters explaining the results of part (b).
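The comparison between the effect of final pitch and that of final /a/ pitch register can be retraced from the mean ratings of table 2. The sketch below is only a descriptive illustration based on the published (rounded) means, not a reproduction of the t-tests; the function and variable names are hypothetical.

```python
# Mean ratings for part (b), copied from table 2 (decimal points).
TABLE_2 = {
    "L120": 1.40, "L121": 1.48, "L130": 2.05, "L131": 2.08,
    "L230": 1.95, "L231": 2.20, "L240": 3.23, "L241": 3.38,
    "R120": 1.83, "R121": 1.98, "R130": 2.33, "R131": 2.58,
    "R230": 2.63, "R231": 2.83, "R240": 3.50, "R241": 3.75,
}


def contour_mean(rate, contour):
    """Average the two loudness versions of one rate/contour combination."""
    return (TABLE_2[rate + contour + "0"] + TABLE_2[rate + contour + "1"]) / 2


def mean_effect(rate, pairs):
    """Mean rating difference over (lower, higher) contour pairs."""
    return sum(contour_mean(rate, hi) - contour_mean(rate, lo)
               for lo, hi in pairs) / len(pairs)


FINAL_PITCH = [("12", "13"), ("23", "24")]  # same initial pitch
REGISTER = [("12", "23"), ("13", "24")]     # same perceptual rise
INITIAL_PITCH = [("13", "23")]              # same final pitch

for rate in "LR":
    print(rate,
          round(mean_effect(rate, FINAL_PITCH), 2),
          round(mean_effect(rate, REGISTER), 2),
          round(mean_effect(rate, INITIAL_PITCH), 2))
```

With these means the two effects are nearly equal for « L » stimuli (about 0.93 vs. 0.94), while for « R » stimuli the register effect is larger (about 1.00 vs. 0.72). Algebraically, the register effect minus the final pitch effect equals the « 13 » vs. « 23 » difference, i.e. the effect of final /a/ initial pitch.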
CONCLUSION
Both hypotheses tested here are quite satisfactorily
verified: in yes-no questions, greater ignorance may be
expressed by a higher final pitch, and greater desire to
know by a higher final vowel pitch register and a faster speech rate.
In the latter case final vocal loudness has a systematic positive
influence on the ratings, but it doesn’t reach significance except
in a global comparison. This study, lying at the interface
between attitudes and emotions, may contribute to theoretical
work on the expression of emotions in speech, as well as to a
better understanding of the interplay between linguistic and
paralinguistic aspects of intonation. It may be followed by a
comparable study based on assertive contour(s).
⁴ A simple calculation shows that this follows directly from the
difference we found in the influence of final /a/ initial pitch.
REFERENCES
Bolinger, D. (1978): « Intonation across languages », in J. H.
Greenberg et al. (Eds.) Universals of human language,
Phonology, 2: 471-523.
Darwin, C. (1872): The expression of the emotions in man
and animals, London: Murray.
Farley, G. R. (1994): "A quantitative model of voice F0
control", JASA 95 (2): 1017-1029.
Faure, G. (1973): « La description phonologique des
systèmes prosodiques », in A. Grundstrom & P. Léon
(Eds.) Interrogation et intonation en Français standard et
Français Canadien (Studia Phonetica, 8), Montreal:
Didier.
Fonagy, I., Bérard, E. (1973): « Questions totales simples et
implicatives en français parisien », in A. Grundstrom &
P. Léon (Eds.) Interrogation et intonation en Français
standard et Français Canadien (Studia Phonetica, 8),
Montreal: Didier.
Fonagy, I. (1983): La vive voix, bibliothèque scientifique
Payot.
Morton, E. W. (1977): "On the occurrence and significance
of motivation-structural rules in some birds and
mammal sounds", Am. Nat., 111: 855-869.
Ohala, J. J. (1984): "An ethological perspective on common
cross-language utilization of F0 of voice", Phonetica,
41: 1-16.
Piot, O. (1999): « Une approche morphogénétique des
‘clichés mélodiques’ du français standard », Faits de
Langue, 13: 26-34.
Scherer, K. R. (1986): "Vocal affect expression: a review
and a model for future research", Psychol. Bulletin, 99:
141-165.
Titze, I. R. (1993): Principles of voice production, Prentice
Hall.