8th ISCA Speech Synthesis Workshop • August 31 – September 2, 2013 • Barcelona, Spain

Expression of Speaker's Intentions through Sentence-Final Particle/Intonation Combinations in Japanese Conversational Speech Synthesis

Kazuhiko Iwata, Tetsunori Kobayashi
Perceptual Computing Laboratory, Waseda University, Japan

Abstract

Aiming to provide synthetic speech with the ability to express a speaker's intentions and subtle nuances, we investigated the relationship between the speaker's intentions as perceived by the listener and sentence-final particle/intonation combinations in Japanese conversational speech. First, we classified the F0 contours of sentence-final syllables in actual speech and found various distinctive contours: not only simple rising and falling ones but also rise-and-fall and fall-and-rise ones. Next, we conducted subjective evaluations to clarify what kinds of intentions listeners perceived depending on the sentence-final particle/intonation combination. The results showed that adequate sentence-final particle/intonation combinations must be used to convey an intention to listeners precisely. Whether the sentence was positive or negative also affected the listeners' perception. For example, the sentence-final particle 'yo' with a falling intonation conveyed the intention of an "order" in a positive sentence but "blame" in a negative sentence. Furthermore, we found that specific nuances could be added to some major intentions by subtle differences in intonation. Different intentions and nuances could thus be conveyed just by controlling the sentence-final intonation in synthetic speech.

Index Terms: speech synthesis, speaker's intention, sentence-final particle, sentence-final intonation, conversational speech

1. Introduction

Speech synthesis technology has made remarkable progress in voice quality, and natural-sounding synthetic speech has recently become available. Moreover, various approaches to synthesizing expressive and conversational speech have also been reported in the past decade [1, 2, 3, 4]. However, due to the diversity of conversational speech, numerous problems still need to be solved to build a useful speech synthesis system for robots and speech-enabled agents that communicate with humans through synthetic speech.

We communicate with each other through linguistic and paralinguistic information [5]. An utterance is affected by the speaker's intentions, attitudes, feelings, personal relationship with the listeners, and so forth, so its paralinguistic features vary a great deal. In spoken Japanese, a speaker's intention is usually conveyed at the end of a sentence by sentence-final particles or auxiliary verbs [6]. The functions of sentence-final particles have been extensively studied in linguistics [7, 8, 9, 10, 11, 12, 13]. For example, the sentence-final particle 'yo' indicates a strong assertion, and 'ne' indicates a request for the listener's agreement. In addition, the intonation of the sentence-final particle, namely the sentence-final intonation, plays a significant role in expressing the intention and has also been studied over the years [14, 15, 16, 17, 18, 19]. For instance, a sentence with the sentence-final particle 'ka' becomes a declarative sentence with a falling intonation, whereas it becomes an interrogative sentence with a rising intonation. Moreover, it can express various additional nuances such as surprise, admiration, concern, doubt, or anger through different intonations.
The functions of sentence-final and phrase-final intonation have also been discussed in terms of turn-taking [20, 21]. However, not only the sentence-final intonation but also the intonation of the whole sentence varies depending on the speaker's intention or attitude. For some languages other than Japanese, it has been reported that listeners can identify the speaker's attitude before the end of a sentence [22, 23]. In contrast, an experiment that used two utterances of the same sentence spoken with different intentions demonstrated the importance of the sentence-final intonation in Japanese: when the speech segments of the sentence-final particle were cut off and swapped, listeners perceived the intention expressed by the sentence-final particle segment even though the overall F0 contours of the two utterances differed from each other [24].

On the other hand, there have so far been few approaches to expressing the speaker's intention by controlling the sentence-final intonation in the field of speech synthesis. Boundary pitch movements in Tokyo Japanese have been analyzed and modeled [25]: five boundary pitch movements were chosen and their meanings were examined through association with eight semantic scales. A prosody control method based on an analysis of the intonation of the one-word utterance 'n' has also been proposed [26]; this study revealed that speaking attitudes were expressed by the average F0 height and dynamic pattern of the word 'n', which has no particular lexical meaning. However, none of these models considered the association between expressions of the speaker's intention and the sentence-final particles, despite the fact that the intention is verbally expressed by the sentence-final particle in Japanese spoken dialogue. Although we investigated the relationship between speakers' intentions and sentence-final intonation contours in a previous study [27], we did not consider the relevance of the sentence-final particles.

In this study, we focus on listeners' perception of a speaker's intention in association with sentence-final particles and their intonation contours, in order to enable synthetic speech to express various intentions. In Section 2, we classify sentence-final intonation contours to find what kinds of sentence-final intonation are used in actual speech. In Section 3, we select several distinctive intonation contours from among the classification results and conduct a subjective evaluation to find suitable sentence-final particle/intonation combinations for conveying some specific intentions (hereafter "major intentions"). Furthermore, in Section 4, we investigate sentence-final particle/intonation combinations that can add subtle nuances to the major intentions, aiming to provide synthetic speech with wide expressiveness. In Section 5, we conclude the paper with some remarks.

2. Classification of sentence-final intonation

2.1. Previous research in linguistics

As mentioned above, the functions of sentence-final intonation have been discussed and various views have been proposed by linguists. Table 1 shows some examples of how sentence-final intonation contours have been categorized. Most of these categorizations distinguish between two and, at most, five categories, but there seems to be no generally accepted one.

Table 1: Examples of categorizations of sentence-final intonation in linguistics.
Number of categories   Categories
2                      Rise, Fall [15]
4                      Rise, Fall-and-rise, Fall, Rise-and-fall [16]
5                      Interrogative rise, Prominent rise, Fall, Rise-and-fall, Flat [17]

2.2. Speech data

We used speech data originally created to develop an HMM-based speech synthesis system with multiple HMMs depending on the situation of conversation [4]. To build the HMMs, we designed several situations and more than 2000 sentences derived from dialogues performed by our communication robot [28]. These sentences were uttered by a voice actress; we did not indicate any specific intention for each sentence but explained the situation in which the robot was supposed to utter it. The F0 contours were extracted using STRAIGHT [29], and the phonemes were manually segmented. The intonation contours at the ends of these utterances varied a great deal and expressed subtle nuances and connotations. Of these data, the 2092 utterances whose sentence-final vowel was not devoiced were used for the analysis.

The sentence-final intonation contour, that is, the F0 contour of the sentence-final syllable, was extracted by referring to the phoneme boundaries. Because the actual F0 values of the utterances differed from each other and were difficult to classify directly, the time and frequency axes were normalized. To remove F0 perturbations caused by jitter and microprosody, the logarithmic F0 contour was approximated by a third-order least-squares curve. The approximated curve was sampled at 11 points that divided the duration into 10 equal parts. Finally, the starting point of the sampled curve was translated to the origin [27].

2.3. Classification of sentence-final intonation contours

The normalized F0 contours obtained by the above process were classified by Ward's clustering algorithm [30], a hierarchical clustering algorithm that merges clusters so as to minimize the within-cluster variance. Figure 1 shows an example of the clustering results when the number of clusters was set to 32. The F0 contours denoted by thick circles and a thick line are the centroids of each cluster. The numbers in square brackets (e.g., [C2]) are expedient cluster IDs corresponding to the clustering sequence. Note that the lengths of the vertical lines do not represent the distances between the clusters, due to the limitations of the page layout. Various sentence-final intonation contours were found, including not only simple rising and falling intonations but also rise-and-fall and fall-and-rise intonations.

Figure 1: Result of clustering sentence-final intonation contours when the number of clusters was set to 32.

2.4. Perceptual discrimination of intonation contours by centroids

We found that the sentence-final intonation contours were classified into distinctive clusters. However, we predicted that not all pairs of cluster centroids would differ notably from each other perceptually, because the clustering was based only on the shapes of the F0 contours. Therefore, a preliminary evaluation was conducted. A back-channel utterance 'haa', which has no specific linguistic meaning, was resynthesized by the STRAIGHT vocoder [29], and its F0 contour was replaced with each of the 127 centroid pairs obtained in the process of classifying the F0 contours into 128 clusters. Fifteen listeners were randomly presented with 254 'haa' pairs, including the reverse order of each centroid pair, and asked whether they perceived the two intonations to be the same or different. The results of the evaluation are shown in Figure 2. The numbers in parentheses indicate the number of responses for which the intonations generated by the two centroids were perceived as different. These results provided the criteria for sifting through and selecting the F0 contours to be used in the next experiments.

Figure 2: Preliminary evaluation results of intonations generated by cluster centroids. The numbers in parentheses indicate the number of responses for which the intonations generated by the centroids on both sides were perceived as different.
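As a concrete summary of the normalization and clustering procedure of Sections 2.2 and 2.3, the following is a minimal Python sketch of ours (not part of the original paper). All function and variable names are our own, and the random placeholder contours merely stand in for the 2092 log-F0 tracks extracted with STRAIGHT.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def normalize_contour(log_f0, n_points=11):
    """Normalize one sentence-final log-F0 contour as in Section 2.2:
    fit a third-order least-squares polynomial (removing jitter and
    microprosody), sample it at 11 points dividing the duration into
    10 equal parts, and translate the starting point to the origin."""
    t = np.linspace(0.0, 1.0, len(log_f0))         # normalized time axis
    coeffs = np.polyfit(t, log_f0, deg=3)          # third-order LS fit
    samples = np.polyval(coeffs, np.linspace(0.0, 1.0, n_points))
    return samples - samples[0]                    # start at the origin

# Placeholder random contours stand in for the log-F0 tracks of the
# sentence-final syllables (in the paper, extracted with STRAIGHT and cut
# at manually segmented phoneme boundaries).
rng = np.random.default_rng(0)
contours = [rng.normal(5.0, 0.1, size=int(rng.integers(20, 80)))
            for _ in range(200)]
X = np.stack([normalize_contour(c) for c in contours])

# Ward's algorithm merges, at each step, the pair of clusters whose fusion
# gives the smallest increase in within-cluster variance; cut the tree at
# 32 clusters as in Figure 1 and take each cluster's mean as its centroid.
Z = linkage(X, method='ward')
labels = fcluster(Z, t=32, criterion='maxclust')
centroids = np.stack([X[labels == k].mean(axis=0) for k in np.unique(labels)])
```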
3. Major intentions conveyed by sentence-final particle/intonation combinations

3.1. Selection of representative intonation contours

We consulted previous studies prior to conducting the subjective evaluation of what kinds of speaker's intentions can be conveyed by sentence-final particles and their intonation contours. Referring to the results of the preliminary evaluation (Figure 2), we stopped dividing clusters whose child clusters received 28 or fewer responses that their intonations were different. The selected clusters were C5, C20, C6, C15, C16, and C19. Compared with the categorizations from previous research shown in Table 1, the centroid of cluster C5 seems to correspond to the interrogative rise intonation, C20 to the fall-and-rise, C6 to the fall, C15 to the rise-and-fall, C16 to the prominent rise, and C19 to the flat. These results indicate that specific intonation contours corresponding to the categories in previous research could be obtained from the speech database.

3.2. Experimental setup

A subjective evaluation was conducted to clarify what kinds of intentions listeners could perceive through the sentence-final intonations produced by the six selected centroids. We prepared 31 short sentences consisting of the verb 'taberu' ("eat") followed by a sentence-final particle ('yo', 'na', 'ne', 'datte', etc.), an auxiliary verb ('daroo'), or one of their concatenations ('yone', 'yona', 'datteyo', 'daroone', etc.). Synthetic voices for these sentences were generated by our HMM-based speech synthesis system [4]. The duration of the last vowel of each sentence was fixed to 313 ms, the mean duration of the last vowels of the sentences in the speech data. Then, the sentence-final F0 contour was replaced with each of the six centroids. We also designed 11 speaker's intentions ("request", "order", "blame", "hearsay", "guess", "question", etc.) and situations in which these intentions could be indicated. We informed 20 listeners of the situations and speaker's intentions and asked them to evaluate whether both the lexical and intonational expressions of each stimulus were suitable for conveying the intention on a five-level scale: –2 (unsuitable; suitable for a different intention), –1 (rather unsuitable), +1 (rather suitable), +2 (suitable), and 0 (none of the above).
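The paper does not detail how a normalized centroid is mapped back onto an utterance, so the following is only a plausible sketch of that step under our assumptions: linear interpolation of the 11-point centroid over the final syllable and anchoring at the utterance's log-F0 value at the syllable onset (undoing the translation-to-origin of Section 2.2). The actual stimuli were produced with the STRAIGHT vocoder and the HMM-based synthesizer of [4], whose interfaces are not reproduced here.

```python
import numpy as np

def apply_centroid(log_f0, final_start, centroid):
    """Replace the log-F0 contour from frame `final_start` to the end of
    the utterance with a cluster centroid.  The centroid is an 11-point
    contour normalized as in Section 2.2 (time rescaled, start translated
    to the origin), so it is linearly stretched over the final syllable
    and anchored at the log-F0 value at the syllable onset.  The anchoring
    is our assumption; the paper does not spell out this step."""
    out = log_f0.copy()
    n = len(out) - final_start                 # frames in the final syllable
    t = np.linspace(0.0, 1.0, n)
    grid = np.linspace(0.0, 1.0, len(centroid))
    out[final_start:] = out[final_start] + np.interp(t, grid, centroid)
    return out

# Toy usage: a flat log-F0 track of 200 five-millisecond frames; the last
# 313 ms (about 63 frames, the fixed final-vowel duration of Section 3.2)
# receive a falling contour standing in for a C6-like centroid.
log_f0 = np.full(200, 5.0)
falling = np.linspace(0.0, -0.4, 11)           # hypothetical C6-like fall
modified = apply_centroid(log_f0, final_start=200 - 63, centroid=falling)
```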
3.3. Results and discussion

Figure 3 shows the key results of the subjective evaluation, with a particular focus on "request", "order", and "blame".

• Sentence-final particle 'ne'
Generally, the use of 'ne' signals a polite "request". This was endorsed by the results for the rising intonations C5 and C20 (Figures 3(a) and 3(f)). In the positive sentence 'Tabete ne', a "request" was also conveyed with the rising C16 intonation. On the other hand, in the negative sentence 'Tabenaide ne' with the rising C16 and flat C19 intonations, an "order" was perceived more clearly than in the positive sentence.

• Sentence-final particle 'yo'
In the positive sentences 'Tabete yo' (Figure 3(b)) with the falling intonations C6 and C15 and 'Tabero yo' (Figure 3(d)) with C6, an "order" ("Eat.") was conveyed. In contrast, in the negative sentences 'Tabenaide yo' (Figure 3(g)) and 'Taberuna yo' (Figure 3(i)), which express prohibition, "blame" ("Why did you eat even though I told you not to?") was strongly conveyed with the falling intonations C6 and C15. With the rising and flat intonations C5, C16, and C19, these sentences instead gave the impression of an "order".

• Sentence-final particles 'yone' and 'yona'
'Yone' is known to have lexical functions different from those of 'ne' and 'yo'. "Blame", which was not much perceived in the sentences with 'ne', was conveyed with the rising C16 and flat C19 intonations (Figures 3(c) and 3(h)). This tendency differs from the case with 'yo', where "blame" was conveyed with the falling intonations C6 and C15. "Blame" could also be conveyed with 'yona' combined with C16 and C19 (Figures 3(e) and 3(j)).

Summarizing these results yields Table 2, which shows the highest-scored intention for each combination. We can consult this table when generating synthetic speech. For example, we can express a "request" in a positive sentence with the sentence-final particle 'ne' by using C5 or C20 as the sentence-final intonation. When we need to express "blame" in a negative sentence with the sentence-final particle 'yo', we should use C6 or C15.

Figure 3: Subjective evaluation results of major intentions depending on sentence-final particle/intonation combinations: (a) 'Tabete ne', (b) 'Tabete yo', (c) 'Tabete yone', (d) 'Tabero yo', (e) 'Tabero yona', (f) 'Tabenaide ne', (g) 'Tabenaide yo', (h) 'Tabenaide yone', (i) 'Taberuna yo', (j) 'Taberuna yona'. The sentences in (a) through (e) are positive (roughly, "Please eat."), and the others are negative ("Please don't eat."). Error bars show 95% confidence intervals.

Table 2: Speaker's intention conveyed by each sentence-final particle/intonation combination. The intention that received the highest positive score is shown by its initial letter: "request" (R), "order" (O), or "blame" (B). *** p<0.001, ** p<0.01, * p<0.05, + p<0.1, where p is the maximum p-value among two comparisons.

Sentence           C5    C20   C6    C15   C16   C19
'Tabete ne'        R***  R**   –     –     R*    O
'Tabenaide ne'     R     R     –     O/B   O*    O*
'Tabete yo'        R**   R**   O+    O     R     B
'Tabenaide yo'     O+    R*    B**   B***  O*    O*
'Tabete yone'      R     O     B     B     O     O
'Tabenaide yone'   O     R     –     B**   B     B
'Tabero yo'        R     R**   O     B     –     R/O
'Taberuna yo'      O     O     B***  B***  O+    O***
'Tabero yona'      O     B     O     B     O/B   B
'Taberuna yona'    B     O*    B**   B     B     B*
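As an illustration of how Table 2 could be consulted at synthesis time, the following sketch (ours, not the authors' implementation; the identifiers are hypothetical) encodes a few of the combinations discussed above as a simple lookup.

```python
# Strongest particle/intonation findings of Table 2 encoded as a lookup
# for synthesis-time control.  Only the combinations explicitly discussed
# in Section 3.3 are included here.
INTONATION_FOR_INTENTION = {
    ('ne',   'positive', 'request'): ['C5', 'C20'],
    ('yo',   'positive', 'order'):   ['C6', 'C15'],
    ('yo',   'negative', 'blame'):   ['C6', 'C15'],
    ('yo',   'negative', 'order'):   ['C5', 'C16', 'C19'],
    ('yone', 'negative', 'blame'):   ['C16', 'C19'],
}

def choose_contours(particle, polarity, intention):
    """Return candidate sentence-final intonation clusters for conveying
    `intention`, or None if the combination was not found suitable."""
    return INTONATION_FOR_INTENTION.get((particle, polarity, intention))

print(choose_contours('yo', 'negative', 'blame'))   # ['C6', 'C15']
```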
4. Additional nuances conveyed by sentence-final particle/intonation combinations

As the next step of this study, we investigated whether listeners could perceive additional intentions, attitudes, or feelings (hereinafter collectively called "additional nuances") through the sentence-final intonations.

4.1. Selection of representative intonation contours

We increased the number of representative intonation contour types used for the subjective evaluation of additional nuances by dividing the six clusters selected in Section 3.1 into subclusters. This time, we merged cluster pairs step by step from the 128 leaf nodes up toward the six clusters. When two clusters received 27 or more out of 30 responses (at least 90%) that their intonations were different, they were not merged; additionally, their parent cluster was not merged with its paired cluster either. Thus, cluster C5 was ultimately divided into 8 subclusters, C20 into 3, C6 into 11, C15 into 5, C16 into 7, and C19 into 7, as listed in Table 3.
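This merge-stopping rule can be sketched as follows (our illustration with hypothetical toy data; the real input would be the clustering tree of Section 2.3 and the response counts from the preliminary evaluation).

```python
def select_subclusters(children, n_different, threshold=27):
    """Bottom-up merging with the perceptual stopping rule of Section 4.1.

    children:    dict mapping each internal node of the clustering tree to
                 its two child node ids, in bottom-up merge order (as given,
                 e.g., by the rows of a scipy linkage matrix).
    n_different: dict mapping a frozenset of two sibling ids to the number
                 of listeners (out of 30) who heard the two centroid
                 intonations as different in the preliminary evaluation.
    Returns the node ids kept as separate subclusters.
    """
    blocked = set()   # nodes excluded from any further merge
    kept = set()
    for parent, (a, b) in children.items():
        if a in blocked or b in blocked or \
                n_different.get(frozenset((a, b)), 0) >= threshold:
            blocked.add(parent)   # the parent may not merge either
            kept.update(x for x in (a, b) if x not in blocked)
    return kept

# Toy tree: P1 = {A, B} merges normally, but C and D sounded different to
# 28 of 30 listeners, so P2 = {C, D} is not formed and P2 is also barred
# from merging with P1 at the root.
children = {'P1': ('A', 'B'), 'P2': ('C', 'D'), 'ROOT': ('P1', 'P2')}
n_different = {frozenset(('A', 'B')): 10, frozenset(('C', 'D')): 28,
               frozenset(('P1', 'P2')): 15}
print(sorted(select_subclusters(children, n_different)))  # ['C', 'D', 'P1']
```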
4.2. Experimental setup

We defined "request", "order", and "blame" as the major intentions for this evaluation. Considering the results of the previous section (Table 2), we chose sentence-final particle/intonation combinations from among those that received significantly different (p < 0.05) scores: 'Tabete ne' with the sentence-final intonations of the subclusters of C5, C20, and C16 for "request"; 'Tabenaide yo' with the subclusters of C16 and C19 for "order"; and 'Tabenaide yo' with the subclusters of C6 and C15 for "blame". We generated, as stimuli, 18 synthetic utterances with different intonations for the major intention "request", 14 for "order", and 16 for "blame". Three additional nuances for each major intention (Table 4) were designed, which one of the authors had perceived from these stimuli in advance.

Table 3: Selected subclusters used as representative intonation contours for the evaluation of additional nuances. Their centroids are shown in Figure 4.

Parent cluster   #    Subcluster IDs
C5               8    C12, C36, C61, C30, C43, C130, C184, C23
C20              3    C100, C135, C41
C6               11   C9, C70, C334, C282, C232, C139, C151, C233, C125, C65, C22
C15              5    C24, C329, C117, C74, C155
C16              7    C123, C346, C248, C71, C131, C104, C52
C19              7    C214, C269, C229, C101, C56, C112, C202

Table 4: Evaluated additional nuances for each major intention.

Major intention   Additional nuance: the speaker seems ...
Request           (r1) to sincerely request the listener to eat. (r2) not to really want the listener to eat. (r3) to be slightly in a bad mood.
Order             (o1) to be afraid the listener will definitely eat. (o2) not to be afraid the listener will definitely eat. (o3) to be in a slightly bad mood.
Blame             (b1) to be in a hurry to stop the listener from eating. (b2) to be in a rage after the listener has eaten. (b3) to be disheartened after the listener has eaten.

Figure 4: Subjective evaluation results of additional nuances to major intentions: (a) 'Tabete ne' with major intention "request"; (b) 'Tabenaide yo' with major intention "order"; (c) 'Tabenaide yo' with major intention "blame". The centroids of the subclusters of C5, C20, and C16 were used for expressing the major intention "request", those of C16 and C19 for "order", and those of C6 and C15 for "blame". Error bars show 95% confidence intervals.

The stimuli were presented to 22 listeners,
along with the situation of the dialogue and the additional nuances for each of the three major intentions. The listeners were asked to evaluate whether they felt each additional nuance on a four-level scale: 3 (strongly felt), 2 (felt), 1 (slightly felt), and 0 (not felt). In addition, they were asked to freely describe any other nuances they felt and to evaluate the stimuli in the same way.

4.3. Results and discussion

The results are shown in Figure 4. Several intonation contours were found to be able to convey additional nuances.

• "Request" (Figure 4(a))
The additional nuances (r1) and (r2) have mutually exclusive meanings. C61, C23, C43, and C131 conveyed the nuance (r1) well. They all rise toward a considerably high frequency at the end of the sentence, a characteristic that can be considered to convey "sincerity" or "cordiality". In contrast, C248 and C123, which were slightly rising and rather flat overall, implied (r2). C135 and C41, which were fall-and-rise contours with an extreme lowering, insinuated (r3). In the free descriptions, some listeners noted that C248 and C346 created a sense of familiarity and gave the impression that the speaker was like a familiar senior (such as a friend's parent). Whether others perceive this similarly needs to be investigated further.

• "Order" (Figure 4(b))
The additional nuances (o1) and (o2) are mutually exclusive. C269, C101, and C229, which were undulating, conveyed (o1), as did C71, which was slightly rising and undulating. On the other hand, (o2) was not conveyed very clearly by any contour except, to some extent, C248. C229, C214, and C101 conveyed (o3) in addition to (o1).

• "Blame" (Figure 4(c))
The additional nuance (b1) seemed to be expressed by a large rise-and-fall movement such as in C117 and C155. C24 and C329, which had a steep fall after a slight rise, expressed (b2) clearly. C151 and C233, which had a slight fall, expressed (b3).

5. Conclusions

We investigated the expression of speakers' intentions through sentence-final particle/intonation combinations in Japanese conversational speech. The results showed that sentence-final intonation contours vary a great deal and that adequate sentence-final particle/intonation combinations should be used to convey an intention to listeners precisely. Furthermore, we found that specific nuances could be added to some major intentions by subtle differences in intonation. We thus showed that not only major intentions but also subtle nuances can be expressed by sentence-final particle/intonation combinations. However, other prosodic features, such as the duration, power, and voice quality of the sentence-final syllable, also change in actual speech and contribute to conveying intentions. They need to be adequately controlled in order to express intentions more precisely and more expressively. We intend to elucidate the relationship between these features and the intentions in future work.

6. Acknowledgements

This work was supported in part by the Global COE Program "Global Robot Academia".

7. References

[1] E. Eide, A. Aaron, R. Bakis, W. Hamza, M. Picheny, and J. Pitrelli, "A corpus-based approach to expressive speech synthesis," in Proc. 5th ISCA ITRW on Speech Synthesis, 2004, pp. 79–84.
[2] Y. Sagisaka, T. Yamashita, and Y. Kokenawa, "Generation and perception of F0 markedness for communicative speech synthesis," Speech Communication, vol. 46, no. 3–4, pp. 376–384, July 2005.
[3] M. Schröder, "Expressive speech synthesis: Past, present, and possible futures," in Affective Information Processing, J. Tao and T. Tan, Eds. London: Springer-Verlag, 2009, pp. 111–126.
[4] K. Iwata and T. Kobayashi, "Conversational speech synthesis system with communication situation dependent HMMs," in Proceedings of the Paralinguistic Information and its Integration in Spoken Dialogue Systems Workshop, R.-C. Delgado and T. Kobayashi, Eds. New York: Springer, 2011, pp. 113–123.
[5] K. Maekawa, "Production and perception of 'paralinguistic' information," in Proc. Speech Prosody, 2004, pp. 367–374.
[6] S. Makino and M. Tsutsui, A Dictionary of Basic Japanese Grammar. Tokyo: The Japan Times, Ltd., 1986.
[7] The National Language Research Institute, Bound Forms ('Zyosi' and 'Zyodôsi') in Modern Japanese: Uses and Examples. Tokyo: Shuei Shuppan, 1951 [in Japanese].
[8] M. Tsuchihashi, "The speech act continuum: An investigation of Japanese sentence final particles," J. Pragmatics, vol. 7, no. 4, pp. 361–387, August 1983.
[9] H. M. Cook, "The sentence-final particle ne as a tool for cooperation in Japanese conversation," in Japanese/Korean Linguistics, H. Hoji, Ed. Stanford: The Stanford Linguistics Association, 1990, vol. 1, pp. 29–44.
[10] A. Kamio, "The theory of territory of information: The case of Japanese," J. Pragmatics, vol. 21, no. 1, pp. 67–100, January 1994.
[11] S. K. Maynard, Japanese Communication: Language and Thought in Context. Honolulu: University of Hawai'i Press, 1997.
[12] H. Saigo, The Japanese Sentence-Final Particles in Talk-in-Interaction. Amsterdam: John Benjamins Publishing Co., 2011.
[13] Y. Asano-Cavanagh, "An analysis of three Japanese tags: Ne, yone, and daroo," Pragmatics & Cognition, vol. 19, no. 3, pp. 448–475, 2011.
[14] N. Yoshizawa, "Intoneeshon (Intonation)," in A Research for Making Sentence Patterns in Colloquial Japanese 1: On Materials in Conversation. Tokyo: Shuei Shuppan, 1960 [in Japanese], pp. 249–288.
[15] T. Moriyama, "Bun no imi to intoneeshon (Sentence meaning and intonation)," in Kooza Nihongo To Nihongo Kyooiku 1: Nihongogaku Yoosetsu, Y. Miyaji, Ed. Tokyo: Meiji Shoin, 1989 [in Japanese], pp. 172–196.
[16] T. Koyama, "Bunmatsushi to bunmatsu intoneeshon (Sentence-final particles and sentence-final intonation)," in Speech and Grammar, Spoken Language Working Group, Ed. Tokyo: Kurosio Publishers, 1997 [in Japanese], pp. 97–119.
[17] S. Kori, "Intoneeshon (Intonation)," in Asakura Nihongo Kooza 3: Onsei On'in (Asakura Japanese Series 3: Phonetics, Phonology), Z. Uwano, Ed. Tokyo: Asakura Publishing Co., Ltd., 2003 [in Japanese], pp. 109–131.
[18] Y. Katagiri, "Dialogue functions of Japanese sentence-final particles 'yo' and 'ne'," J. Pragmatics, vol. 39, no. 7, pp. 1313–1323, July 2007.
[19] E. Ofuka, J. D. McKeown, M. G. Waterman, and P. J. Roach, "Prosodic cues for rated politeness in Japanese speech," Speech Communication, vol. 32, no. 3, pp. 199–217, October 2000.
[20] H. Tanaka, Turn-Taking in Japanese Conversation: A Study in Grammar and Interaction. Amsterdam: John Benjamins Publishing Co., 1999.
[21] C. T. Ishi, "The functions of phrase final tones in Japanese: Focus on turn-taking," J. Phonetic Soc. Japan, vol. 10, no. 3, pp. 18–28, December 2006.
[22] V. Aubergé, T. Grépillat, and A. Rilliard, "Can we perceive attitudes before the end of sentences? The gating paradigm for prosodic contours," in Proc. EUROSPEECH, 1997, pp. 871–874.
[23] V. J. van Heuven, J. Haan, E. Janse, and E. J. van der Torre, "Perceptual identification of sentence type and the time-distribution of prosodic interrogativity markers in Dutch," in Proc. ESCA Workshop on Intonation, 1997, pp. 317–320.
[24] M. Sugito, "Shuujoshi 'ne' no imi, kinoo to intoneeshon (Meanings, functions and intonation of sentence-final particle 'ne')," in Speech and Grammar III, Spoken Language Working Group, Ed. Tokyo: Kurosio Publishers, 2001 [in Japanese], pp. 3–16.
[25] J. J. Venditti, K. Maeda, and J. P. H. van Santen, "Modeling Japanese boundary pitch movements for speech synthesis," in Proc. 3rd ESCA/COCOSDA Workshop on Speech Synthesis, 1998, pp. 317–322.
[26] Y. Greenberg, N. Shibuya, M. Tsuzaki, H. Kato, and Y. Sagisaka, "A trial of communicative prosody generation based on control characteristic of one word utterance observed in real conversational speech," in Proc. Speech Prosody, PS8-8-37, 2006.
[27] K. Iwata and T. Kobayashi, "Expressing speaker's intentions through sentence-final intonations for Japanese conversational speech synthesis," in Proc. Interspeech, Mon.P2b.03, 2012.
[28] S. Fujie, Y. Matsuyama, H. Taniyama, and T. Kobayashi, "Conversation robot participating in and activating a group communication," in Proc. Interspeech, 2009, pp. 264–267.
[29] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3–4, pp. 187–207, April 1999.
[30] J. H. Ward, Jr., "Hierarchical grouping to optimize an objective function," J. Am. Statistical Assoc., vol. 58, no. 301, pp. 236–244, March 1963.