
8th ISCA Speech Synthesis Workshop • August 31 – September 2, 2013 • Barcelona, Spain
Expression of Speaker’s Intentions
through Sentence-Final Particle/Intonation Combinations
in Japanese Conversational Speech Synthesis
Kazuhiko Iwata, Tetsunori Kobayashi
Perceptual Computing Laboratory, Waseda University, Japan
Abstract
Aiming to provide synthetic speech with the ability to express a speaker’s intentions and subtle nuances, we investigated
the relationship between the speaker’s intentions that the listener perceived and sentence-final particle/intonation combinations in Japanese conversational speech. First, we classified F0
contours of sentence-final syllables in actual speech and found
various distinctive contours, namely, not only simple rising and
falling ones but also rise-and-fall and fall-and-rise ones. Next,
we conducted subjective evaluations to clarify what kind of intentions the listeners perceived depending on the sentence-final
particle/intonation combinations. Results showed that adequate
sentence-final particle/intonation combinations should be used
to convey the intention to the listeners precisely. Whether the
sentence was positive or negative also affected the listeners’
perception. For example, a sentence-final particle ‘yo’ with a
falling intonation conveyed the intention of an “order” in a positive sentence but “blame” in a negative sentence. Furthermore,
it was found that some specific nuances could be added to some
major intentions by subtle differences in intonation. The different intentions and nuances could be conveyed just by controlling the sentence-final intonation in synthetic speech.
Index Terms: speech synthesis, speaker’s intention, sentence-final particle, sentence-final intonation, conversational speech
1. Introduction
Speech synthesis technology has made remarkable progress in synthetic voice quality, and natural-sounding synthetic speech has recently become available. Moreover, various approaches to synthesizing expressive and conversational speech have also been reported in the past decade [1, 2, 3, 4]. However,
due to the diversity of conversational speech, numerous problems still need to be solved to build a useful speech synthesis
system for robots and speech-enabled agents that communicate
with humans through synthetic speech. We communicate with
each other through linguistic and paralinguistic information [5].
An utterance is affected by the speaker’s intentions, attitudes,
feelings, personal relationship with the listeners, and so forth.
The paralinguistic features of the utterance vary a great deal.
In spoken Japanese, a speaker’s intention is usually conveyed at the end of a sentence by sentence-final particles or
auxiliary verbs [6]. The functions of sentence-final particles
have been extensively studied in the field of linguistics [7, 8, 9,
10, 11, 12, 13]. For example, a sentence-final particle ‘yo’ indicates a strong assertion, and ‘ne’ indicates a request for listeners’ agreement. In addition, the intonation of the sentence-final
particle, namely, the sentence-final intonation plays a significant
role in expressing the intention and has also been studied over
the years [14, 15, 16, 17, 18, 19]. For instance, a sentence with a
sentence-final particle ‘ka’ becomes a declarative sentence with
a falling intonation, whereas it becomes an interrogative sentence with a rising intonation. Moreover, it can express various additional nuances such as surprise, admiration, concern,
doubt, or anger through different intonations. The functions
of the sentence-final and phrase-final intonation have also been discussed in terms of turn-taking [20, 21]. However, not only
the sentence-final intonation but also the intonation of the whole
sentence varies depending on the speaker’s intention or attitude.
In some languages other than Japanese, it has been reported that
the listeners could identify the speaker’s attitudes before the end
of the sentences [22, 23]. In contrast, an experiment that used two utterances of the same sentence uttered with different intentions demonstrated the importance of the sentence-final intonation in Japanese. When the speech segments of the sentence-final particle were cut off and swapped, the listeners perceived the intention expressed by the sentence-final particle segment, even though the overall F0 contours of the utterances differed from each other [24].
On the other hand, there are as yet few approaches in the field of speech synthesis technology to expressing the speaker’s intention by controlling the sentence-final intonation.
Boundary pitch movements in Tokyo Japanese have been analyzed and modeled [25]. Five boundary pitch movements were
chosen and their meanings were examined through association
with eight semantic scales. A prosody control method has been
proposed [26], which was based on the analysis of intonations
of the one-word utterance ‘n’. This study revealed that speaking attitudes were expressed by the average height and dynamic pattern of the F0 of the word ‘n’, since it has no particular lexical meaning. However, none of the models above considered the association of the expression of the speaker’s intention with the sentence-final particles, despite the fact that the
intention is verbally expressed by the sentence-final particle in
Japanese spoken dialogue. Although we investigated the relationship between the speaker’s intentions and sentence-final intonation contours in a previous study [27], we did not consider
the relevance of the sentence-final particles.
In this study, we focus on listeners’ perception of the speaker’s intention in association with sentence-final particles and their intonation contours, in order to enable synthetic speech to express various intentions. In Section 2, we classify sentence-final intonation contours to find what kinds of sentence-final
intonation were used in actual speech. In Section 3, we select several distinctive intonation contours from among the classification results and conduct a subjective evaluation to find
suitable sentence-final particle/intonation combinations for conveying some specific intentions (hereafter “major intentions”).
Furthermore, in Section 4, we investigate sentence-final particle/intonation combinations that can add subtle nuances to the major intentions, aiming to provide synthetic speech with wider expressiveness. In Section 5, we conclude the paper with
some remarks.
2. Classification of sentence-final intonation
2.1. Previous research in linguistics
As mentioned above, the functions of the sentence-final intonation have been discussed and various views have been proposed by linguists. Table 1 shows some examples of the categorization of the sentence-final intonation in terms of the contours. In most of them, the sentence-final intonation contours were classified into two categories, or into up to five categories at most. However, there seems to be no generally accepted categorization.
Table 1: Examples of categorizations of sentence-final intonation in linguistics.

Number of categories   Categories
2                      Rise, Fall [15]
4                      Rise, Fall-and-rise, Fall, Rise-and-fall [16]
5                      Interrogative rise, Prominent rise, Fall, Rise-and-fall, Flat [17]
2.2. Speech data
We used speech data that were created with the aim of developing an HMM-based speech synthesis system that had multiple HMMs depending on situations of conversation [4]. To
build the HMMs, we designed several situations and more than
2000 sentences derived from dialogues that our communication
robot [28] performed. These sentences were uttered by a voice
actress, to whom we did not indicate any specific intentions for
each sentence but explained the situations in which the robot
was supposed to utter each sentence. The F0 contours were
extracted using STRAIGHT [29], and the phonemes were manually segmented. The intonation contours at the end of these
utterances varied a great deal and expressed subtle nuances and
connotations. Of these data, 2092 utterances whose sentence-final vowel was not devoiced were used for the analysis.
The sentence-final intonation contour, that is, the F0 contour in the sentence-final syllable, was extracted by referring to the phoneme boundaries. Because the actual F0 values of the utterances differed from each other and were difficult to classify directly, the time and frequency axes were normalized. To remove F0 perturbations caused by jitter and microprosody, the logarithmic F0 contour was approximated by a third-order least-squares curve. The approximated curve was sampled at 11 points that divided the duration into 10 equal intervals. Finally, the starting point
of the sampled curve was parallel-translated to the origin [27].
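For concreteness, this normalization can be sketched in a few lines of Python with NumPy. This is our reading of the procedure rather than the authors’ code; the function name and the skipping of unvoiced frames are assumptions.

    import numpy as np

    def normalize_final_contour(f0_hz, n_points=11, order=3):
        """Normalize a sentence-final F0 contour (Sec. 2.2): fit a
        third-order least-squares curve to log F0, sample it at 11
        points dividing the duration into 10 equal intervals, and
        translate the starting point to the origin."""
        f0_hz = np.asarray(f0_hz, dtype=float)
        voiced = f0_hz > 0                      # skip unvoiced frames (assumption)
        t = np.linspace(0.0, 1.0, len(f0_hz))   # normalized time axis
        coeffs = np.polyfit(t[voiced], np.log(f0_hz[voiced]), order)
        samples = np.polyval(coeffs, np.linspace(0.0, 1.0, n_points))
        return samples - samples[0]             # parallel-translate to the origin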
2.3. Classification of sentence-final intonation contours
The normalized F0 contours obtained by the above process were
classified by Ward’s clustering algorithm [30], a hierarchical clustering method that repeatedly merges the pair of clusters yielding the smallest increase in within-cluster variance.
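With the 2092 normalized 11-point contours stacked in a matrix, this clustering step could be reproduced roughly as follows with SciPy (our choice of tooling, not the paper’s; the variable contours is assumed to hold the normalized curves):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    # contours: (2092, 11) array of normalized sentence-final F0 contours
    Z = linkage(contours, method="ward")               # Ward's hierarchical clustering
    labels = fcluster(Z, t=32, criterion="maxclust")   # cut the tree into 32 clusters
    centroids = np.stack([contours[labels == k].mean(axis=0)
                          for k in range(1, 33)])      # mean contour per cluster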
Figure 1 shows an example of the clustering results when
the number of clusters was set to 32. The F0 contours denoted
by thick circles and a thick line are the centroids of each cluster. The numbers in square brackets (e.g., [C2]) are expedient
cluster IDs corresponding to the clustering sequence. Note that
the lengths of the vertical lines do not represent the distance between the clusters due to the limitation of the page layout. Various sentence-final intonation contours were found, including
not only simple rising and falling intonations but also rise-and-fall and fall-and-rise intonations.
2.4. Perceptual discrimination of intonation contours by
centroids
We found that the sentence-final intonation contours were classified into distinctive clusters. However, we predicted that not all pairs of cluster centroids would be perceptually distinguishable from each other, because the clustering was based only on the
shapes of the F0 contours. Therefore, a preliminary evaluation
was conducted.
A back-channel utterance ‘haa’, which had no specific
linguistic meaning, was resynthesized by the STRAIGHT
vocoder [29], and its F0 contour was replaced with each centroid of the 127 centroid pairs obtained in the process of classifying the F0 contours into
128 clusters. Fifteen listeners were randomly presented with
254 ‘haa’ pairs (each centroid pair presented in both orders) and were then asked whether they perceived the two intonations to be the same or different. The results of the evaluation are shown in Figure 2. The numbers in parentheses indicate the number of responses in which the intonations generated by the centroids on both sides were perceived as different. These counts served as the criteria for sifting through the F0 contours and selecting those used in the next experiments.
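Since the 127 centroid pairs are simply the two children of each merge above the 128-cluster level, the 254 stimulus presentations can be enumerated directly. A sketch, reusing the linkage matrix Z from the clustering sketch above:

    import random

    # The last 127 rows of Z are the merges that combine the 128 clusters
    # up to the root; each row joins two child clusters, i.e. one centroid pair.
    pairs = [(int(a), int(b)) for a, b, _, _ in Z[-127:]]
    # Present every pair in both orders, randomly shuffled: 2 * 127 = 254 trials.
    stimuli = [p for ab in pairs for p in (ab, ab[::-1])]
    random.shuffle(stimuli)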
3. Major intentions conveyed by
sentence-final particle/intonation
combinations
3.1. Selection of representative intonation contours
We consulted previous studies prior to conducting the subjective evaluation to investigate what kind of speaker’s intentions
could be conveyed by sentence-final particles and their intonation contours. Referring to the results of the preliminary evaluation (Figure 2), we stopped dividing clusters whose child clusters received 28 or fewer (out of 30) responses that their intonations were
different. The selected clusters were C5, C20, C6, C15, C16,
and C19.
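This stopping rule amounts to a perception-driven cut of the cluster tree. A minimal sketch, assuming a hypothetical mapping diff_counts from each merge node id to its number of ‘different’ responses; only the 29-of-30 threshold comes from the paper:

    from scipy.cluster.hierarchy import to_tree

    def select_representatives(Z, diff_counts, threshold=29):
        # Walk the tree from the root; keep splitting a cluster only while
        # its two children were heard as different at least `threshold`
        # times out of 30, otherwise keep the cluster whole.
        selected = []
        def visit(node):
            if node.is_leaf() or diff_counts.get(node.get_id(), 0) < threshold:
                selected.append(node.get_id())   # stop dividing here
            else:
                visit(node.get_left())
                visit(node.get_right())
        visit(to_tree(Z))
        return selected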
Compared with the categorization in the previous research
shown in Table 1, the centroid of the cluster C5 seems to correspond to the interrogative rise intonation, C20 to the fall-and-rise, C6 to the fall, C15 to the rise-and-fall, C16 to the prominent rise, and C19 to the flat. These results indicate that specific
intonation contours corresponding to the categories in the previous research could be obtained from the speech database.
3.2. Experimental setup
A subjective evaluation was conducted to clarify what kind of
intentions the listener could perceive through the sentence-final
intonations produced by the selected six centroids.
We prepared 31 short sentences consisting of a verb
‘taberu’ (“eat” ) followed by a sentence-final particle (‘yo’, ‘na’,
‘ne’, ‘datte’, etc.), an auxiliary verb (‘daroo’), or one of their
concatenations (‘yone’, ‘yona’, ‘datteyo’, ‘daroone’, etc.). Synthetic voices of these sentences were generated by our HMM-based speech synthesis system [4]. The duration of the last
vowel of each sentence was fixed to 313 ms, which was the
mean duration of the last vowels of the sentences in the speech
data. Then, the sentence-final F0 contour was replaced with each of the six centroids. We also designed 11 speaker’s intentions (“request”, “order”, “blame”, “hearsay”, “guess”, “question”, etc.) and situations in which these intentions could be indicated. We informed 20 listeners of the situations and speaker’s intentions and asked them to evaluate whether both the lexical and intonational expressions of each stimulus were suitable for conveying the intention on a five-level scale: –2 (unsuitable; suitable for a different intention), –1 (rather unsuitable), +1 (rather suitable), +2 (suitable), and 0 (none of the above).

[Figure 1 omitted: the cluster tree, with each cluster annotated by its centroid F0 contour and a cluster ID such as [C2], [C5], or [C20].]
Figure 1: Result of clustering sentence-final intonation contours when the number of clusters was set to 32.

[Figure 2 omitted: the upper part of the cluster tree, with each centroid pair annotated by the number of ‘different’ responses.]
Figure 2: Preliminary evaluation results of intonations generated by cluster centroids. The numbers in parentheses indicate the number of responses when the intonations by the centroids on both sides were perceived as different.

3.3. Results and discussion
Figure 3 shows the key results of the subjective evaluation, with a particular focus on a “request”, an “order”, and “blame”.
• Sentence-final particle ‘ne’
Generally, the use of ‘ne’ signals a polite “request”. This was endorsed with the rising intonations C5 and C20 (Figures 3(a) and 3(f)). In the positive sentence ‘Tabete ne’, a “request” was also conveyed with the rising C16 intonation. On the other hand, in the negative sentence ‘Tabenaide ne’ with the rising C16 and flat C19 intonations, an “order” was perceived more clearly than in the positive sentence.
• Sentence-final particle ‘yo’
In the positive sentences ‘Tabete yo’ (Figure 3(b)) with the falling intonations C6 and C15 and ‘Tabero yo’ (Figure 3(d)) with C6, an “order” (“Eat.”) was conveyed. In contrast, in the negative sentences ‘Tabenaide yo’ (Figure 3(g)) and ‘Taberuna yo’ (Figure 3(i)), which expressed prohibition, “blame” (“Why did you eat even though I told you not to?”) was strongly conveyed with the falling intonations C6 and C15. With the rising and flat intonations C5, C16, and C19, these sentences instead gave the impression of an “order”.
• Sentence-final particles ‘yone’ and ‘yona’
‘Yone’ is known to have lexical functions different from those of ‘ne’ and ‘yo’. “Blame”, which was little perceived in the sentences with ‘ne’, was conveyed with the rising C16 and flat C19 intonations (Figures 3(c) and 3(h)). This tendency differs from the case with ‘yo’, where “blame” was conveyed with the falling intonations C6 and C15. “Blame” could also be conveyed with ‘yona’ with C16 and C19 (Figures 3(e) and 3(j)).
To summarize the results, we can compile Table 2, which shows the highest-scored intention for each combination. We can then consult this table when generating synthetic speech.
For example, we can express “request” in a positive sentence
with a sentence-final particle ‘ne’ by using C5 or C20 as the
sentence-final intonation. When we need to express “blame”
in a negative sentence with a sentence-final particle ‘yo’, we
should use C6 or C15.
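In a synthesis front end, Table 2 can be consulted as a simple lookup from particle, sentence polarity, and intended meaning to candidate contours. A toy sketch (the dictionary layout and names are ours; only the two combinations spelled out above are filled in):

    # Map (particle, polarity, intention) -> suitable sentence-final contours
    # from Table 2 (illustration only; not the authors' code).
    INTONATION_TABLE = {
        ("ne", "positive", "request"): ["C5", "C20"],
        ("yo", "negative", "blame"): ["C6", "C15"],
    }

    def choose_contour(particle, polarity, intention):
        candidates = INTONATION_TABLE.get((particle, polarity, intention))
        if not candidates:
            raise KeyError("no suitable contour in Table 2 for this combination")
        return candidates[0]  # e.g. pick the first suitable centroid

    print(choose_contour("ne", "positive", "request"))  # -> C5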
[Figure 3 omitted: bar charts of mean suitability scores, with 95% confidence intervals, for “request”, “order”, and “blame” under each of the intonations C5, C20, C6, C15, C16, and C19.]
Figure 3: Subjective evaluation results of major intentions depending on sentence-final particle/intonation combinations. The sentences in (a) ‘Tabete ne’, (b) ‘Tabete yo’, (c) ‘Tabete yone’, (d) ‘Tabero yo’, and (e) ‘Tabero yona’ are positive (roughly, “Please eat.”); those in (f) ‘Tabenaide ne’, (g) ‘Tabenaide yo’, (h) ‘Tabenaide yone’, (i) ‘Taberuna yo’, and (j) ‘Taberuna yona’ are negative (“Please don’t eat.”).
Table 2: Speaker’s intention conveyed by sentence-final particle/intonation combination. The intentions (“request”, “order”, and
“blame”) that received the highest positive score are shown by the initial letters and underlined when their scores are higher than or
equal to 1.0 (rather suitable). *** p<0.001, ** p<0.01, * p<0.05, + p<0.1, where p is the maximum p-value among two comparisons.
Phrase              Sentence-final intonation
                    C5     C20    C6     C15    C16    C19
‘Tabete ne’         R***   R**    –      –      R*     O
‘Tabenaide ne’      R      R      –      O/B    O*     O*
‘Tabete yo’         R**    R**    O+     O      R      B
‘Tabenaide yo’      O+     R*     B**    B***   O*     O*
‘Tabete yone’       R      O      B      B      O      O
‘Tabenaide yone’    O      R      –      B**    B      B
‘Tabero yo’         R      R**    O      B      –      R/O
‘Taberuna yo’       O      O      B***   B***   O+     O***
‘Tabero yona’       O      B      O      B      O/B    B
‘Taberuna yona’     B      O*     B**    B      B      B*
4. Additional nuances conveyed by
sentence-final particle/intonation
combinations
As the next step of this study, we investigated whether the listeners could perceive additional intentions, attitudes, or feelings
(hereinafter collectively called “additional nuances”) through
the sentence-final intonations.
4.1. Selection of representative intonation contours
We increased the number of representative intonation contours to be used for the subjective evaluation of additional nuances. The six clusters selected in Section 3.1 were divided into several subclusters. This time, we merged clusters pairwise, working up from the 128 leaf nodes toward the six selected clusters. When two clusters received 27 or more out of 30 responses (at least 90%) that their intonations were different, they were not merged; additionally, their parent cluster was not merged with its paired cluster either. Thus, the
cluster C5 was ultimately divided into 8 subclusters, C20 into
3, C6 into 11, C15 into 5, C16 into 7, and C19 into 7, as listed
in Table 3.
4.2. Experimental setup
We defined the intentions of a “request”, an “order”, and
“blame” as the major intentions for this evaluation. Considering
the results of the previous section (Table 2), we chose sentence-final particle/intonation combinations from among those that received significantly different scores (p < 0.05): ‘Tabete ne’
with the sentence-final intonations by the subclusters of C5,
C20, and C16 for a “request”; ‘Tabenaide yo’ with the subclusters of C16 and C19 for an “order”; and ‘Tabenaide yo’ with the
subclusters of C6 and C15 for “blame”. We generated 18 synthetic utterances with different intonations for the major intention “request”, 14 utterances for the “order”, and 16 utterances
for the “blame” as the stimuli.
Table 3: Selected subclusters as representative intonation contours to be used for the evaluation of additional nuances. Their centroids are shown in Figure 4.

Parent cluster   #    Subcluster IDs
C5               8    C12, C36, C61, C30, C43, C130, C184, C23
C20              3    C100, C135, C41
C6               11   C9, C70, C334, C282, C232, C139, C151, C233, C125, C65, C22
C15              5    C24, C329, C117, C74, C155
C16              7    C123, C346, C248, C71, C131, C104, C52
C19              7    C214, C269, C229, C101, C56, C112, C202

Table 4: Evaluated additional nuances for each major intention.

Major intention   Additional nuance (The speaker seems ...)
Request           (r1) to sincerely request the listener to eat.
                  (r2) not to really want the listener to eat.
                  (r3) to be slightly in a bad mood.
Order             (o1) to be afraid the listener will definitely eat.
                  (o2) not to be afraid the listener will definitely eat.
                  (o3) to be in a slightly bad mood.
Blame             (b1) to be in a hurry to stop the listener from eating.
                  (b2) to be in a rage after the listener has eaten.
                  (b3) to be disheartened after the listener has eaten.

[Figure 4 omitted: bar charts of mean nuance ratings, with 95% confidence intervals, for the centroid of each subcluster. Panels: (a) ‘Tabete ne’ with major intention “request”; (b) ‘Tabenaide yo’ with major intention “order”; (c) ‘Tabenaide yo’ with major intention “blame”.]
Figure 4: Subjective evaluation results of additional nuances to major intentions. The centroids of the subclusters of C5, C20, and C16 were used for expressing the major intention “request”, those of C16 and C19 for “order”, and those of C6 and C15 for “blame”.

Three additional nuances for each major intention (Table 4) were designed; one of the authors had perceived these nuances from the stimuli in advance. The stimuli were presented to 22 listeners,
along with the situation of the dialogue and the additional nuances for each of the three major intentions. The listeners were
asked to evaluate whether they felt the additional nuances on a
four-level scale: 3 (strongly felt), 2 (felt), 1 (slightly felt), and
0 (not felt). In addition, they were asked to freely describe any
other nuances they felt and evaluate the stimuli in the same way.
4.3. Results and discussion
The results are shown in Figure 4. Several intonation contours
were found to be able to convey some additional nuances.
• “Request” (Figure 4(a))
The additional nuances (r1) and (r2) have mutually exclusive
meanings. C61, C23, C43, and C131 conveyed the nuance
(r1) well. They all rise toward a considerably high frequency
at the end of a sentence, which is a characteristic that can be
considered to convey “sincerity” or “cordiality”. In contrast,
C248 and C123, which were slightly rising and rather flat overall, implied (r2). C135 and C41, which were fall-and-rise contours with extreme lowering, insinuated (r3). In the free descriptions, some listeners noted that C248 and C346 created a sense of familiarity and gave the impression that the speaker was a familiar elder (such as a friend’s parent). However, whether other listeners perceive this similarly needs to be investigated further.
• “Order” (Figure 4(b))
The additional nuances (o1) and (o2) are mutually exclusive.
C269, C101, and C229, which were undulating contours, conveyed (o1). C71, which was slightly rising and undulating, also conveyed (o1). On the other hand, (o2) was conveyed only weakly, and only by C248. C229, C214, and C101 conveyed
(o3) in addition to (o1).
• “Blame” (Figure 4(c))
The additional nuance (b1) seemed to be expressed by a large
rise-and-fall movement, as in C117 and C155. C24 and C329, which had a steep fall after a slight rise, expressed (b2)
clearly. C151 and C233, which had a slight fall, expressed
(b3).
5. Conclusions
We investigated the expression of speaker’s intentions through
sentence-final particle/intonation combinations in Japanese
conversational speech. Results showed that the sentence-final intonation contours varied a great deal and that adequate
sentence-final particle/intonation combinations should be used
to convey the intention to the listeners precisely. Furthermore,
it was found that some specific nuances could be added to some
major intentions by subtle differences in intonation.
We indicated that not only major intentions but also
subtle nuances could be expressed by sentence-final particle/intonation combinations. However, in actual speech, other prosodic features, such as the duration, power, and voice quality of the sentence-final syllable, also vary and contribute to conveying the intentions. They need to be adequately controlled in order to express the intentions more precisely and
more expressively. We intend to elucidate the relationship between these features and the intentions in our future work.
6. Acknowledgements
This work was supported in part by Global COE Program
“Global Robot Academia”.
7. References
[1] E. Eide, A. Aaron, R. Bakis, W. Hamza, M. Picheny, and
J. Pitrelli, “A corpus-based approach to expressive speech synthesis,” in Proc. 5th ISCA ITRW on Speech Synthesis, 2004, pp.
79–84.
[2] Y. Sagisaka, T. Yamashita, and Y. Kokenawa, “Generation and
perception of F0 markedness for communicative speech synthesis,” Speech Communication, vol. 46, no. 3–4, pp. 376–384, July
2005.
[3] M. Schröder, “Expressive speech synthesis: Past, present, and
possible futures,” in Affective Information Processing, J. Tao and
T. Tan, Eds. London: Springer-Verlag, 2009, pp. 111–126.
[4] K. Iwata and T. Kobayashi, “Conversational speech synthesis system with communication situation dependent HMMs,” in
Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop, R.-C. Delgado and
T. Kobayashi, Eds. New York: Springer, 2011, pp. 113–123.
[5] K. Maekawa, “Production and perception of ‘paralinguistic’ information,” in Proc. Speech Prosody, 2004, pp. 367–374.
[6] S. Makino and M. Tsutsui, A Dictionary of Basic Japanese Grammar. Tokyo: The Japan Times, Ltd., 1986.
[7] The National Language Research Institute, Bound Forms (‘Zyosi’
and ‘Zyodôsi’) in Modern Japanese: Uses and Examples. Tokyo:
Shuei Shuppan, 1951 [in Japanese].
[8] M. Tsuchihashi, “The speech act continuum: An investigation of
Japanese sentence final particles,” J. Pragmatics, vol. 7, no. 4, pp.
361–387, August 1983.
[9] H. M. Cook, “The sentence-final particle ne as a tool for cooperation in Japanese conversation,” in Japanese/Korean Linguistics,
H. Hoji, Ed. Stanford: The Stanford Linguistics Association,
1990, vol. 1, pp. 29–44.
[10] A. Kamio, “The theory of territory of information: The case of
Japanese,” J. Pragmatics, vol. 21, no. 1, pp. 67–100, January
1994.
[11] S. K. Maynard, Japanese Communication: Language and
Thought in Context. Honolulu: University of Hawai‘i Press,
1997.
[12] H. Saigo, The Japanese Sentence-Final Particles in Talk-in-Interaction. Amsterdam: John Benjamins Publishing Co., 2011.
[13] Y. Asano-Cavanagh, “An analysis of three Japanese tags: Ne,
yone, and daroo,” Pragmatics & Cognition, vol. 19, no. 3, pp.
448–475, 2011.
[14] N. Yoshizawa, “Intoneeshon (Intonation),” in A Research for Making Sentence Patterns in Colloquial Japanese 1: On Materials in
Conversation. Tokyo: Shuei Shuppan, 1960 [in Japanese], pp.
249–288.
[15] T. Moriyama, “Bun no imi to intoneeshon (Sentence meaning and
intonation),” in Kooza Nihongo To Nihongo Kyooiku 1: Nihongogaku Yoosetsu, Y. Miyaji, Ed. Tokyo: Meiji Shoin, 1989 [in
Japanese], pp. 172–196.
[16] T. Koyama, “Bunmatsushi to bunmatsu intoneeshon (Sentence-final particles and sentence-final intonation),” in Speech and
Grammar, Spoken Language Working Group, Ed. Tokyo: Kurosio Publishers, 1997 [in Japanese], pp. 97–119.
[17] S. Kori, “Intoneeshon (Intonation),” in Asakura Nihongo Kooza 3:
Onsei On’in (Asakura Japanese Series 3: Phonetics, Phonology),
Z. Uwano, Ed. Tokyo: Asakura Publishing Co., Ltd., 2003 [in
Japanese], pp. 109–131.
[18] Y. Katagiri, “Dialogue functions of Japanese sentence-final particles ‘yo’ and ‘ne’,” J. Pragmatics, vol. 39, no. 7, pp. 1313–1323,
July 2007.
[19] E. Ofuka, J. D. McKeown, M. G. Waterman, and P. J. Roach,
“Prosodic cues for rated politeness in Japanese speech,” Speech
Communication, vol. 32, no. 3, pp. 199–217, October 2000.
[20] H. Tanaka, Turn-Taking in Japanese Conversation: A study in
grammar and interaction. Amsterdam: John Benjamins Publishing Co., 1999.
[21] C. T. Ishi, “The functions of phrase final tones in Japanese: Focus
on turn-taking,” J. Phonetic Soc. Japan, vol. 10, no. 3, pp. 18–28,
December 2006.
[22] V. Aubergé, T. Grépillat, and A. Rilliard, “Can we perceive attitudes before the end of sentences? The gating paradigm for
prosodic contours,” in Proc. EUROSPEECH, 1997, pp. 871–874.
[23] V. J. van Heuven, J. Haan, E. Janse, and E. J. van der Torre, “Perceptual identification of sentence type and the time-distribution of
prosodic interrogativity markers in Dutch,” in Proc. ESCA Workshop on Intonation, 1997, pp. 317–320.
[24] M. Sugito, “Shuujoshi ‘ne’ no imi, kinoo to intoneeshon (Meanings, functions and intonation of sentence-final particle ‘ne’),” in
Speech and Grammar III, Spoken Language Working Group, Ed.
Tokyo: Kurosio Publishers, 2001 [in Japanese], pp. 3–16.
[25] J. J. Venditti, K. Maeda, and J. P. H. van Santen, “Modeling
Japanese boundary pitch movements for speech synthesis,” in
Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis, 1998,
pp. 317–322.
[26] Y. Greenberg, N. Shibuya, M. Tsuzaki, H. Kato, and Y. Sagisaka,
“A trial of communicative prosody generation based on control
characteristic of one word utterance observed in real conversational speech,” in Proc. Speech Prosody, PS8–8–37, 2006.
[27] K. Iwata and T. Kobayashi, “Expressing speaker’s intentions
through sentence-final intonations for Japanese conversational
speech synthesis,” in Proc. Interspeech, Mon.P2b.03, 2012.
[28] S. Fujie, Y. Matsuyama, H. Taniyama, and T. Kobayashi, “Conversation robot participating in and activating a group communication,” in Proc. Interspeech, 2009, pp. 264–267.
[29] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0
extraction: Possible role of a repetitive structure in sounds,”
Speech Communication, vol. 27, no. 3–4, pp. 187–207, April
1999.
[30] J. H. Ward, Jr., “Hierarchical grouping to optimize an objective
function,” J. Am. Statistical Assoc., vol. 58, no. 301, pp. 236–244,
March 1963.