Title: The effect of anchors and training on the reliability of perceptual voice evaluation
Author(s): Chan, Man-kei, Karen
Other Contributor(s): University of Hong Kong
Issued Date: 2000
URL: http://hdl.handle.net/10722/56318
Rights: The author retains all proprietary rights, such as patent rights and the right to use in future works. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

The Effect of Anchors and Training on the Reliability of Perceptual Voice Evaluation

CHAN, Man Kei Karen

A dissertation submitted in partial fulfillment of the requirements for the Bachelor of Science (Speech and Hearing Sciences), The University of Hong Kong, May 10, 2000.

ABSTRACT

The reliability of perceptual voice evaluation is important in voice research and clinical practice. The literature shows that inter-rater and intra-rater reliability in perceptual voice evaluation can be very variable. Kreiman, Gerratt, Kempster, Erman and Berke (1993) proposed a framework suggesting that listeners recall their own internal standards when asked to judge voice qualities. These internal standards are unreliable because they can be affected by different factors. Therefore, external reference standards (anchors) were proposed to replace these internal ones in order to improve the reliability of perceptual voice judgments. The aim of this study was to investigate whether training and the provision of anchors would improve the reliability of perceptual voice evaluation. A further aim was to determine whether natural or synthesized anchors would be more effective. Twenty-eight naive listeners with no training in voice quality judgment were included in this study. All of them attended two rating sessions (pre-training and post-training) and a training session. They were required to rate a set of 56 stimuli under three different conditions (with no external standards, with natural stimuli as anchors and with synthesized stimuli as anchors). 
The samples covered the roughness and breathiness qualities at three different levels of severity (mild, moderate and severe). Results showed that training and the use of anchors improved intra-rater agreement and reduced inter-rater variability in perceptual voice judgments. This supported previous findings that using anchors in perceptual voice judgments is more reliable than relying on individuals' variable internal standards. Furthermore, synthesized anchors were more effective than natural anchors in improving intra-rater agreement and reducing inter-rater variability.

INTRODUCTION

Perceptual voice evaluation is an important assessment tool in both voice research and clinical practice. A number of studies on the validity of instrumental measurements of voice, such as acoustic and aerodynamic measurements, used perceptual voice analysis as the gold standard (e.g., de Krom, 1995; Moran & Gilbert, 1984; Yumoto, Sasaki, & Okamura, 1984). Clinicians also use perceptual evaluation as an outcome measure in voice assessment and voice therapy. Hence, the reliability of perceptual voice evaluation is fundamental to the study and assessment of voice disorders.

As perceptual voice evaluation is a subjective process, its reliability needs to be established. Studies showed that inter-rater and intra-rater reliability vary widely in perceptual voice evaluation. Inter-rater reliability is often a concern (see Kreiman, Gerratt, Kempster, Erman, & Berke, 1993). Kreiman and her associates (Kreiman, Gerratt, Precoda, & Berke, 1992; Kreiman et al., 1993) reviewed factors that could have contributed to this low reliability and proposed a framework for the perception of voice quality. They proposed that some form of internal standard is recalled when listeners are required to judge voice qualities. These internal standards are believed to be formed from the listener's experience with voices and stored in memory. 
Studies showed that these internal standards were unreliable and could be affected by internal and external factors, such as memory and acoustic context (Gerratt, Kreiman, Antonanzas-Barroso, & Berke, 1993; Kreiman et al., 1992; Kreiman et al., 1993). This could lead to high variability in perceptual voice judgments.

As these internal standards were unreliable, external standards have been suggested to replace the internal ones in making perceptual voice judgments (Gerratt et al., 1993; Kreiman et al., 1993). External standards are selected voice samples that act as anchors representing different voice qualities of varying degree. Studies using anchors of natural or synthesized voice samples showed that listeners could improve their reliability in perceptual voice evaluation (e.g., Gerratt et al., 1993). However, no comparison has been made to determine which of these two types of anchors provides better reference points. Although natural voice samples are closer in nature to the voice samples to be rated, it is difficult to vary their parameters systematically. The parameters of synthesized signals, on the other hand, can be manipulated systematically. The disadvantage of synthesized signals is their unnatural quality, which sometimes makes them difficult to compare with natural voice samples. One of the objectives of this study was to compare the effect of these two types of anchors on the reliability of perceptual voice evaluation.

Another possible way of improving the reliability of perceptual voice evaluation is through training. Many studies provided training to improve the reliability of perceptual voice evaluation (e.g., 
Bassich & Ludlow, 1987; Martin & Wolfe, 1996; Shewell, 1998). However, it is difficult to conclude whether training improves the reliability of perceptual voice evaluation (Kreiman et al., 1993) because few of the previous studies described what was involved in their training (e.g., Anders, Hollien, Hurme, Sonninen, & Wendler, 1988). It is important to understand how training affects the reliability of perceptual voice evaluation. This would provide information on how the skill of perceptual voice evaluation is acquired and how listeners form their internal standards. With this information, more effective training and more reliable perceptual voice evaluation protocols can be designed. Another objective of this study was therefore to investigate the effect of training on the reliability of perceptual voice evaluation. Participants with no experience in assessing and treating voice disorders were included in this study. It is reasonable to assume that these participants had internal standards of voice qualities formed from their daily experience. If these listeners showed any improvement in the reliability of their perceptual voice ratings after training, the improvement could be attributed to the training effect.

The reliability of perceptual voice evaluation is also related to the types of voice quality to be rated. There are many different voice qualities and a wide range of terminology used to describe them, which makes it difficult to compare between studies in research settings and between clinicians in clinical settings. In the current study, roughness and breathiness were used because they have been commonly investigated in previous studies (e.g., Hammarberg, Fritzell, Gauffin, Sundberg, & Wedin, 1980; Kreiman, Gerratt, & Berke, 1994; Martin & Wolfe, 1996). The definitions of breathiness and roughness adopted in this study are listed in Table 1. 
Table 1. Definitions of roughness and breathiness adopted in this study.

Roughness (harshness or hoarseness)
  Perceptual correlates: Irregular quality; random fluctuations of glottal pulse; lack of clarity; uneven quality.
  Acoustic correlates: Aperiodic mode of vibration; perturbation of the spectrum.
  Physiological correlates: (Believed to be) due to irregular vibration of the vocal folds.

Breathiness (whispery voice or whisperiness)
  Perceptual correlates: Audible sound of expiration; audible air escape; audible friction noise.
  Acoustic correlates: Related to a significant component of noise due to turbulence.
  Physiological correlates: Incomplete closure of vocal folds/glottis during phonation.

The current study used the visual analogue (VA) scale because it has been found to be more reliable than another commonly used scale, the equal-appearing interval (EAI) scale (e.g., Kreiman et al., 1993). The EAI scale is an equidistant scale with a fixed number of points (usually five or seven), and listeners have to rate the voice sample on one of these points. The VA scale is an unmarked line, usually 10 cm long. The voice sample is rated by marking a point along the line.

Objectives

The first objective of this study was to test the effect of training on the reliability and variability of perceptual voice evaluation. Training should shape the listeners' internal standards for perceptual voice qualities. Hence, it was hypothesized that with training, the listeners would show improvement in the reliability of perceptual voice evaluation. The second objective was to test the effect of natural and synthesized anchors on the reliability and variability of perceptual voice evaluation. It was hypothesized that if the listeners used the natural or synthesized samples as external standards in making perceptual voice judgments, the reliability would be higher than if they used their internal standards. 
It was further hypothesized that the natural anchors would be more effective than the synthesized anchors, as listeners could better match the natural anchors with the natural voice samples they were required to rate.

METHOD

Stimuli

Three sets of stimuli, namely testing stimuli, anchor stimuli and training stimuli, were used in this study.

Preparation of the testing stimuli. The testing stimuli were used in the rating tests to assess the participants' reliability in perceptual voice evaluation. The testing stimuli were made up of two gender sets: male and female natural voices. The Cantonese sentence /pa pa ta po/ was recorded in a sound-treated room from 14 female speakers (12 dysphonic speakers and two speakers with normal voice) and 13 male speakers (10 dysphonic speakers and three with normal voice; one of the speakers with normal voice was also required to simulate two types of dysphonic voice). The mean age of the female speakers was 33 (SD = 7.79, range = 20 - 40), while that of the male speakers was 36.38 (SD = 13.91, range = 21 - 73). Each gender set of stimuli (male and female) covered three levels of severity (mild, moderate and severe) and two types of quality (roughness and breathiness), resulting in six categories in each set of voice samples. There were two voice samples in each of these six categories, resulting in a total of 12 dysphonic samples. The severity and quality type were determined by a speech therapist with over two years of experience in perceptual voice evaluation. Together with two normal voice samples, there were a total of 14 voice samples. Each voice sample was duplicated, resulting in a total of 28 voice stimuli in each gender set of testing stimuli. As it was difficult to identify dysphonic speakers with the desired severity of roughness or breathiness, two male dysphonic samples were recorded from one normal speaker simulating mild breathiness and severe roughness.

Preparation of anchor stimuli. 
Two types of anchor stimuli were used as the external reference stimuli during the testing and training sessions. The natural anchor stimuli included a normal voice and four dysphonic voices, and the synthesized anchor stimuli included a normal voice and six dysphonic voices. The dysphonic stimuli covered two dysphonic qualities (breathiness and roughness) at two severity levels (mild and moderate). The stimuli were recordings or syntheses of the Cantonese sentence /pa pa ta po/.

The female natural anchor stimuli set consisted of two rough voice samples from two female speakers with hyperfunctional disorder; two breathy samples from two female speakers with normal voice who deliberately produced breathiness; and one normal voice produced by a female speaker with normal voice. The mean age of these five female speakers was 30.4 (SD = 12.01, range = 21 - 45). The breathy voice anchors were produced by speakers with normal voice because it was difficult to identify participants whose voice was purely breathy.

For the male natural anchor stimuli set, it was difficult to identify dysphonic speakers with the desired severity of roughness and breathiness. Therefore, a male speaker with normal voice was asked to produce roughness and breathiness at mild and moderate levels of severity. Another male speaker was asked to record a normal voice anchor stimulus. The mean age of the male speakers was 32 (SD = 14.14, range = 22 - 42). None of the anchor stimuli were used as testing stimuli.

The synthesized anchor stimuli consisted of 14 anchors (seven female and seven male) developed and validated by Yiu (1999). For each gender set, there was one anchor representing normal voice; two for breathiness (mild and moderate); two for Type I roughness (mild and moderate); and two for Type II roughness (mild and moderate). Two types of roughness were included as anchors because listeners perceived two sets of perceptually distinct synthesized stimuli as roughness (Yiu, 1999). 
Preparation of training stimuli. The training stimuli were used as practice items by the participants. The training stimuli were made up of two gender sets: male and female natural voices. The voice stimuli were recordings of the Cantonese sentence /pa pa ta po/ made in a sound-treated room from ten female speakers (eight speakers with dysphonia and two speakers with normal voice) and three male speakers (one speaker with dysphonia and two speakers with normal voice). The mean age of the ten female speakers was 32 (SD = 15.26, range = 20 - 66) and that of the male speakers was 33 (SD = 10.15, range = 22 - 42). The voice samples covered three levels of severity (mild, moderate and severe) and two types of quality (roughness and breathiness). For each gender set of training stimuli, there were one mild, two moderate and one severe voice sample for each type of quality, resulting in eight dysphonic samples. As it was difficult to identify dysphonic speakers with the desired severity of roughness or breathiness, seven male dysphonic samples were recorded from one normal speaker deliberately producing voices with the two dysphonic qualities at mild, moderate and severe levels. Half of the training stimuli were used as the anchor stimuli. The severity and quality type were determined by five speech therapists, each with over two years of experience in perceptual voice evaluation. Together with the two normal voice samples, there were ten voice samples in each gender set of training stimuli. Each voice sample was duplicated, resulting in 20 voice stimuli in each gender set of training stimuli.

Participants

Twenty-six females and two males (mean age = 20.71, SD = 1.08, range = 19 - 22) participated in this study. All were native Cantonese speakers undergoing training to be speech therapists. None of them had received training in voice disorders or perceptual voice evaluation at the time of testing. 
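The stimulus counts given in the Stimuli section can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and constant names are hypothetical, and the numbers are taken from the descriptions of the testing and training stimuli above.

```python
# Illustrative check of the stimulus counts described in the Stimuli section.
# All names here are hypothetical; only the counts come from the text.

QUALITIES = ("roughness", "breathiness")
SEVERITIES = ("mild", "moderate", "severe")


def testing_set_size(samples_per_category=2, normal_samples=2, repetitions=2):
    """Number of testing stimuli in one gender set."""
    dysphonic = len(QUALITIES) * len(SEVERITIES) * samples_per_category  # 12
    return (dysphonic + normal_samples) * repetitions  # (12 + 2) * 2


def training_set_size(normal_samples=2, repetitions=2):
    """Number of training stimuli in one gender set."""
    per_quality = {"mild": 1, "moderate": 2, "severe": 1}
    dysphonic = len(QUALITIES) * sum(per_quality.values())  # 8
    return (dysphonic + normal_samples) * repetitions  # (8 + 2) * 2


per_gender_testing = testing_set_size()
total_testing = 2 * per_gender_testing  # male set + female set
per_gender_training = training_set_size()

print(per_gender_testing, total_testing, per_gender_training)
```

The totals agree with the 28 testing stimuli per gender set, the 56 testing stimuli overall and the 20 training stimuli per gender set reported in the text.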
Procedures

Each participant took part in a pre-training rating session, a training session and a post-training rating session. The three sessions were one week apart, with the training session carried out in the second week. In each session, the participants were required to rate the severity of breathiness and roughness of the testing or training stimuli using a 10-cm long VA scale. The rating tests and training program were presented through a computerized program designed for this study. The voice stimuli were presented in free field through two loudspeakers (Philips, MMS100) placed approximately 50 cm away from the listeners in a sound-treated room. The loudness level at which the voice stimuli were played was kept at around 80 dB SPL.

Pre- and post-training rating sessions. Each rating session was made up of six rating tests with different types of anchor stimuli provided (Table 2). The presentation order of the tests was randomized across the participants and across the two rating sessions to counterbalance any learning effect from the order of presentation.

Table 2: Details of the six rating tests.

Gender of testing stimuli    Anchor stimuli
Female                       Female natural voice
Female                       Female synthesized voice
Female                       No anchor
Male                         Male natural voice
Male                         Male synthesized voice
Male                         No anchor

Definitions and descriptions of breathiness and roughness were provided for each participant in all rating sessions (Table 1). For those tests with anchor stimuli provided, the participants could choose to listen to the anchor stimuli as frequently as they liked. However, the computer program was designed so that the participants were required to review all anchor stimuli at least once every four testing stimuli. Participants could also choose to listen to the rating stimuli as many times as they wanted but were not allowed to return to items already rated. They were asked to rate the severity of each quality (breathiness and roughness) using a 10-cm long scroll bar. 
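The two constraints built into the rating program, a randomized order of the six rating tests and a compulsory anchor review at least once every four stimuli, can be sketched as follows. This is not the program used in the study. The seeding scheme, the function names and the choice to place the compulsory review before the first stimulus of each block of four are all assumptions made for illustration.

```python
import random

# The six rating tests of Table 2 as (gender of stimuli, anchor type) pairs.
RATING_TESTS = [
    ("female", "natural"), ("female", "synthesized"), ("female", "none"),
    ("male", "natural"), ("male", "synthesized"), ("male", "none"),
]


def rating_test_order(participant_id, session, seed=0):
    """Return a shuffled copy of the six tests for one participant and session.

    Seeding with participant and session (an assumption) makes the order
    reproducible while still differing across participants and sessions.
    """
    rng = random.Random(f"{seed}-{participant_id}-{session}")
    order = list(RATING_TESTS)
    rng.shuffle(order)
    return order


def presentation_plan(n_stimuli, review_every=4):
    """Interleave a compulsory anchor review before every block of stimuli."""
    plan = []
    for i in range(n_stimuli):
        if i % review_every == 0:
            plan.append("review_anchors")
        plan.append(f"stimulus_{i + 1}")
    return plan


print(rating_test_order(participant_id=1, session="pre"))
print(presentation_plan(8))
```

In a no-anchor test the review step would simply be skipped; participants in the anchor conditions were also free to listen to the anchors more often than this minimum.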
Each rating session lasted for approximately one hour.

Training session. The training program was made up of two parts, one part with male voice stimuli and the other part with female voice stimuli. Each part was made up of 20 practice stimuli. Natural anchor stimuli were included as external reference points. At the beginning of the training program, four anchor stimuli and two training stimuli, covering three severity levels (mild, moderate and severe) and two quality types (roughness and breathiness), were presented together with the definitions of the two qualities (Table 1) through the computer program. The participants were required to listen to all the anchor stimuli at least once after every four practice items. They could, however, choose to listen to the anchors more frequently. They were also allowed to listen to the practice items as frequently as they wanted. The participants were asked to rate each practice stimulus using a 10-cm long scroll bar with the anchor stimuli as references. Suggested ratings (breathiness and roughness) for each practice stimulus were provided after each rating. The suggested ratings for each practice item were the average of the ratings determined by five expert speech therapists, each with more than two years of experience in perceptual voice evaluation. Before the participants could proceed with the training program, they were required to listen to each practice item again after the suggested ratings were presented. Each training session lasted for approximately half an hour.

Data Analysis

The breathiness and roughness ratings of the two sets of testing stimuli (male and female) were analyzed to determine the reliability and agreement of the participants under different conditions (provision of training and/or anchors). 
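The intra-rater agreement measure used in this analysis treats two ratings of the same stimulus as agreeing when they lie within 1 cm of one another on the 10-cm scale, and reports the percentage of repeated stimuli on which a listener agrees with himself or herself. A minimal sketch is given below. The function name is hypothetical, the example ratings are invented, and treating a difference of exactly 1 cm as agreement is an assumption.

```python
def intra_rater_agreement(first, second, tolerance_cm=1.0):
    """Percentage of repeated stimuli rated within `tolerance_cm` both times.

    `first` and `second` hold one listener's ratings (in cm on the 10-cm
    scale) for the first and second presentation of the same stimuli.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two equally long, non-empty rating lists")
    agreements = sum(
        1 for a, b in zip(first, second) if abs(a - b) <= tolerance_cm
    )
    return 100.0 * agreements / len(first)


# Invented ratings for four repeated stimuli (not data from the study).
first_pass = [2.0, 5.5, 7.0, 9.0]
second_pass = [2.4, 7.0, 6.2, 9.9]
print(intra_rater_agreement(first_pass, second_pass))  # 75.0
```

Here three of the four pairs differ by less than 1 cm, so the listener's agreement is 75%.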
Although the participants rated each stimulus for both breathiness and roughness, only the relevant quality was analyzed for the designated stimuli, i.e., breathiness ratings were analyzed for the designated breathy stimuli and roughness ratings were analyzed for the designated rough stimuli.

As each testing stimulus was presented twice in each set of testing stimuli, intra-rater agreement was calculated to determine the agreement in rating two identical stimuli. As the probability of two ratings falling within 1 cm of one another by chance on a 10-cm long scale is 0.01 (0.1 x 0.1), the present study considered two ratings that were within 1 cm of one another as reaching agreement. This intra-rater agreement reflected the stability of each participant's performance in judging perceptual voice qualities.

The variance of the ratings was used to determine the variability of the participants in judging the testing stimuli. A low variance suggested that the judges were cohesive in judging the stimuli. The inter-rater variability reflected the relationship among the participants' standards for perceptual voice qualities. The variability analyses were carried out separately for the stimuli at the three levels of severity (mild, moderate and severe) of each quality type (roughness and breathiness) and for the normal stimuli.

RESULTS

The breathiness and roughness ratings of the two sets of testing stimuli (male and female) were analyzed to determine the agreement and variability of the participants' perceptual voice evaluation under different conditions (provision of training and/or anchors).

Intra-rater Agreement

The mean percentages of agreement for the different conditions are listed in Table 3. The mean percentages of agreement varied between 49% and 76%, depending on the condition of the tests.

Table 3: The mean percentage of intra-rater agreement in each rating test.

                                                 Roughness                Breathiness
Anchor provided     Rating session  Gender       Mean    SD     Range     Mean    SD     Range
Natural anchor      Pre-training    Male         55.80   17.17  25-88     69.20   18.78  25-100
                                    Female       50.45   19.69  25-100    49.11   16.64  13-75
                    Post-training   Male         63.39   20.95  13-100    70.09   19.35  13-100
                                    Female       61.16   16.44  25-88     62.95   20.83  25-88
Synthesized anchor  Pre-training    Male         62.95   18.78  25-88     64.29   14.32  38-100
                                    Female       58.48   15.23  38-88     56.25   18.48  13-88
                    Post-training   Male         75.89   16.64  38-100    70.98   18.96  25-100
                                    Female       70.54   21.30  13-100    63.39   17.32  13-88
No anchor           Pre-training    Male         64.29   19.16  25-100    60.27   20.43  13-88
                                    Female       63.39   17.65  13-88     62.95   16.83  38-100
                    Post-training   Male         58.93   26.32  0-100     55.80   25.57  13-100
                                    Female       69.64   19.96  13-100    71.88   16.54  25-100

SD = Standard deviation

Effects of training (training versus no training). Separate ANOVAs were performed on the intra-rater agreement for each set of stimuli (female and male) and ratings (roughness and breathiness), with pre- and post-training as the within-subject factor, to test whether there were significant training effects. Results showed that there were some significant training effects. When the natural anchors were provided, intra-rater agreement increased after training for both the female rough and breathy stimuli. When the synthesized anchors were provided, intra-rater agreement increased after training for both the male and female rough stimuli. When no anchor was used, training did not show any significant effect. Table 4 lists the pair-wise contrasts for each testing condition.

Table 4: Statistical results of within subject contrasts tests of the effect of training on intra-rater agreement.
Training effect (post-training > pre-training)

Quality of      Gender of
voice stimuli   voice stimuli   Natural anchor       Synthesized anchor    No anchor
Rough           Male            F(1, 27) = 3.45      F(1, 27) = 11.43*     F(1, 27) = 1.68
                Female          F(1, 27) = 5.70*     F(1, 27) = 6.70*      F(1, 27) = 2.08
Breathy         Male            F(1, 27) = 0.06      F(1, 27) = 2.97       F(1, 27) = 0.76
                Female          F(1, 27) = 10.00*    F(1, 27) = 2.40       F(1, 27) = 2.93

* indicates significantly higher intra-rater agreement after training (p < 0.05)

Although the intra-rater agreements were in general higher in the post-training session than in the pre-training session for most stimuli (except for the male breathy and rough stimuli when no anchor was provided), not all of the differences were statistically significant (Table 4).

Effect of anchors. A series of ANOVAs was performed on the intra-rater agreement to determine whether there was a significant difference among the three anchor conditions (natural anchors, synthesized anchors and no anchor). Table 5 lists the Mauchly test statistics and the associated F values indicating the effect of anchors on rating different types of stimuli before and after training.

Table 5: ANOVA results testing for effect of anchor (natural, synthesized, no anchor) on intra-rater agreement with the associated results of Mauchly tests.

Anchor effect

Quality of      Gender of
voice stimuli   voice stimuli   Pre-training                       Post-training
Rough           Male            F(2, 48) = 3.45*   M(2) = 0.82     F(2, 26) = 4.61*   M(2) = 0.76*
                Female          F(2, 49) = 6.83*   M(2) = 0.82     F(2, 51) = 3.45*   M(2) = 0.87
Breathy         Male            F(2, 54) = 3.11    M(2) = 0.97     F(2, 54) = 7.22*   M(2) = 1.00
                Female          F(2, 54) = 7.21*   M(2) = 0.92     F(2, 54) = 3.00    M(2) = 0.95

Note: The Mauchly test (M) was used to test whether the data violated the sphericity assumption. If the p value for M was larger than 0.05, univariate F statistics with Huynh-Feldt epsilon adjusted degrees of freedom were reported. If the p value for M was less than 0.05, F values from Wilks' Lambda test were reported. 
* indicates significance at p < 0.05

The anchor effect was significant for most stimuli in both rating sessions, except for the male breathy stimuli in the pre-training session and the female breathy stimuli in the post-training session (Table 5). Two types of planned contrast were carried out for those stimuli that demonstrated significant effects of the anchors. The first series of planned contrasts was carried out to determine whether the provision of anchors (natural or synthesized) would result in better intra-rater agreement than no provision of anchors. Table 6 lists the statistical results of the planned contrasts.

Table 6: Statistical results of Bonferroni t-tests between the use of different anchors (natural, synthesized and no anchors). Entries are t(1, 27) values.

                                Pre-training                        Post-training
Quality of      Gender of       Natural vs.    Synthesized vs.     Natural vs.    Synthesized vs.
voice stimuli   voice stimuli   No anchor      No anchor           No anchor      No anchor
Rough           Male            2.14^          0.35                1.00           3.00*
                Female          2.89^          1.58                1.99           0.28
Breathy         Male            N/A            N/A                 3.00*          3.14*
                Female          3.91^          1.83                N/A            N/A

Key:
* Significantly lower intra-rater agreement when no anchor was used (p < 0.05).
^ Significantly higher intra-rater agreement when no anchor was used (p < 0.05).
N/A - planned contrast was not carried out as no effect of anchors was found for that set of stimuli.

Before training, the intra-rater agreements in rating the male and female rough stimuli and the female breathy stimuli were significantly higher when no anchor was given than when natural anchors were provided. No significant differences were found when the synthesized anchor condition was compared with the no anchor condition. After training, the intra-rater agreements in rating male rough and breathy stimuli were significantly higher when synthesized anchors were provided than when no anchor was provided. 
The intra-rater agreements in rating male breathy stimuli were also significantly higher when natural anchors were provided than when no anchor was provided.

A second series of planned contrasts was carried out to determine whether natural or synthesized anchors facilitated a higher intra-rater agreement. Table 7 lists the statistical results of the second series of planned contrasts.

Table 7: Planned contrasts between natural and synthesized anchors on intra-rater agreement. Entries are t(1, 27) values.

Quality of      Gender of
voice stimuli   voice stimuli   Pre-training   Post-training
Rough           Male            2.66*          3.11*
                Female          1.61           2.18*
Breathy         Male            N/A            0.23
                Female          3.69           N/A

* indicates significantly higher intra-rater reliability when synthesized anchors were provided (p < 0.05)
N/A - planned contrast was not carried out as no effect of anchors was found for that set of stimuli.

Before training, the intra-rater agreement for rating male rough stimuli was significantly higher when synthesized anchors were provided. After training, the intra-rater agreement for rating rough stimuli (male and female) was significantly higher when synthesized anchors were provided than when natural anchors were provided. No significant differences were found in rating the breathy stimuli.

Inter-rater Variability

The variance of the ratings was used to determine the variability of perceptual judgment. The variability in judging normal, mild, moderate and severe (rough and breathy) stimuli was determined separately. These variances are listed in Table 8. The variances varied between 0.19 and 11.49, depending on the condition of the tests. 
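The inter-rater variability measure is simply the variance of the listeners' ratings for a given category of stimuli, with a lower variance indicating that the listeners agreed more closely with one another. The sketch below assumes the sample variance (denominator n - 1); the text does not state which estimator was used. The function name and the example ratings are invented for illustration.

```python
from statistics import variance  # sample variance, denominator n - 1


def inter_rater_variance(ratings_by_category):
    """Variance of the pooled ratings (in cm) for each stimulus category.

    `ratings_by_category` maps a label such as "mild rough" to the list of
    ratings given to stimuli of that category by all listeners.
    """
    return {
        category: variance(ratings)
        for category, ratings in ratings_by_category.items()
    }


# Invented ratings from three listeners (not data from the study).
example = {
    "normal": [0.5, 0.5, 0.5],      # listeners agree perfectly
    "mild rough": [2.0, 3.0, 4.0],  # listeners disagree more
}
print(inter_rater_variance(example))
```

In this example the normal category has a variance of 0 and the mild rough category a variance of 1, so the listeners were less cohesive on the mild rough stimuli.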
Table 8: Variances of the ratings in each rating test.

                                                   Roughness          Breathiness
Anchors provided    Rating session  Severity       Male    Female     Male    Female
Natural anchor      Pre-training    Normal         1.75    1.27       0.83    1.38
                                    Mild           3.62    11.49      4.76    3.06
                                    Moderate       4.86    8.17       11.45   6.72
                                    Severe         6.36    8.99       3.06    8.29
                    Post-training   Normal         2.93    1.05       2.17    [illegible]
                                    Mild           2.83    7.78       4.20    2.51
                                    Moderate       4.28    7.15       6.97    5.91
                                    Severe         5.56    6.02       4.03    5.44
Synthesized anchor  Pre-training    Normal         1.21    1.00       0.71    0.65
                                    Mild           2.22    8.60       5.43    2.10
                                    Moderate       3.13    7.90       11.03   6.46
                                    Severe         7.01    9.04       4.18    7.87
                    Post-training   Normal         1.63    0.89       1.02    1.41
                                    Mild           2.70    7.58       3.28    2.13
                                    Moderate       3.80    7.00       7.81    5.56
                                    Severe         5.11    6.64       2.37    5.32
No anchor           Pre-training    Normal         0.92    0.19       0.38    0.67
                                    Mild           3.00    8.75       3.92    1.17
                                    Moderate       4.66    8.17       11.23   5.15
                                    Severe         5.38    8.08       5.43    7.70
                    Post-training   Normal         [illegible]  0.58  8.11    1.75
                                    Mild           5.41    8.17       6.72    2.26
                                    Moderate       4.18    6.93       9.50    5.74
                                    Severe         8.38    6.60       11.11   6.03

Effect of training (training versus no training). Separate ANOVAs were performed on the variances of each set of stimuli (female and male) and ratings (roughness and breathiness), with pre-training and post-training as the within-subject factor, to test whether there was a significant training effect. The statistical results are listed in Table 9.

Table 9: ANOVA results on the variance of inter-rater agreement to test for training effects.
Entries are F(1, 111) values.

                                     Roughness           Breathiness
Anchors provided    Severity level   Male     Female     Male     Female
Natural anchor      Normal           1.39     0.12       2.47     0.06
                    Mild             1.15     11.16*     0.29     0.79
                    Moderate         0.49     1.17       21.75*   0.78
                    Severe           0.49     7.30*      1.86     9.46*
Synthesized anchor  Normal           0.51     0.07       0.38     2.57
                    Mild             0.28     0.94       6.87*    0.00
                    Moderate         0.60     1.16       11.69*   31.95*
                    Severe           5.14*    6.09*      7.83*    6.17*
No anchor           Normal           23.91^   3.58       35.04^   2.61
                    Mild             4.13^    0.36       1.71     0.41
                    Moderate         0.38     1.80       6.59^    30.19^
                    Severe           8.92^    2.34       3.99^    3.07

Key:
* Significantly lower variance after training (p < 0.05)
^ Significantly higher variance after training (p < 0.05)

The results showed that there were some significant training effects. There were more significant improvements in variability (i.e., lower variance) after training when synthesized anchors were used than when natural anchors or no anchor were used. It should be noted that when no anchor was used, the variances became higher after training.

Effect of anchors. Separate ANOVAs were performed on the variances to determine whether there was a significant difference among the three anchor conditions (natural anchors, synthesized anchors and no anchor). Table 10 lists the F values and the corresponding values of the Mauchly tests indicating the effect of anchors in rating different types of stimuli before and after training. Anchor effects were found in both rating sessions. No effect of anchors was found in rating the female stimuli after training. 
Table 10: ANOVA results testing for effect of anchor (natural, synthesized, no anchor) on inter-rater agreement with the associated results of Mauchly tests.

Quality  Gender  Severity   Pre-training                         Post-training
Rough    Male    Normal     F(2, 110) = 1.76    M(2) = 0.88*     F(2, 110) = 15.35*  M(2) = 0.40*
                 Mild       F(2, 221) = 2.04    M(2) = 0.98      F(2, 110) = 4.37*   M(2) = 0.91*
                 Moderate   F(2, 110) = 4.67*   M(2) = 0.95*     F(2, 110) = 0.27    M(2) = 0.88*
                 Severe     F(2, 110) = 2.52    M(2) = 0.88*     F(2, 110) = 7.18*   M(2) = 0.78*
         Female  Normal     F(2, 219) = 5.38*   M(2) = 0.97      F(1, 135) = 0.39    M(2) = 0.35*
                 Mild       F(2, 218) = 7.59*   M(2) = 0.97      F(2, 222) = 0.39    M(2) = 0.99
                 Moderate   F(2, 110) = 0.09    M(2) = 0.94*     F(2, 222) = 0.07    M(2) = 0.99
                 Severe     F(2, 222) = 0.62    M(2) = 0.56      F(2, 222) = 0.63    M(2) = 0.99
Breathy  Male    Normal     F(2, 110) = 4.19*   M(2) = 0.80*     F(2, 110) = 16.13*  M(2) = 0.88*
                 Mild       F(2, 222) = 1.90    M(2) = 0.99      F(2, 110) = 5.67*   M(2) = 0.50*
                 Moderate   F(2, 222) = 0.10    M(2) = 0.99      F(2, 110) = 4.84*   M(2) = 0.71*
                 Severe     F(2, 215) = 6.31*   M(2) = 0.95      F(2, 110) = 42.42*  M(2) = 0.73*
         Female  Normal     F(2, 110) = 1.74    M(2) = 0.76*     F(2, 110) = 0.76    M(2) = 0.34*
                 Mild       F(2, 110) = 6.08*   M(2) = 0.72      F(2, 110) = 0.26    M(2) = 0.59*
                 Moderate   F(2, 110) = 20.67*  M(2) = 0.69*     F(2, 219) = 0.14    M(2) = 0.97
                 Severe     F(2, 222) = 0.20    M(2) = 0.99      F(2, 110) = 0.79    M(2) = 0.86*

Note: The Mauchly test (M) was used to test whether the data violated the sphericity assumption. If the p value for M was larger than 0.05, univariate F statistics with Huynh-Feldt epsilon adjusted degrees of freedom were reported. If the p value for M was less than 0.05, F values from Wilks' Lambda test were reported.
* significance level p < 0.05

Two types of planned contrast were carried out on the data that demonstrated significant anchor effects. 
The first series of planned contrasts was carried out to determine which stimulus sets demonstrated lower variance when the anchor conditions were compared with the no-anchor condition. Table 11 lists the statistical results of the planned contrasts. Before training, only the male severe breathy stimuli demonstrated significantly lower variance with natural anchors than with no anchor. Some of the remaining stimuli demonstrated significantly lower variance with no anchor than with anchors. After training, most of the male stimuli had significantly lower variance with anchors than with no anchor. However, none of the female stimuli were rated differently whether or not anchors were provided (Table 11).

Table 11: Statistical results of planned contrast between the uses of different anchors (natural, synthesized and no anchors).

Rating session  Anchors provided  Severity  Roughness Male    Roughness Female  Breathiness Male  Breathiness Female
Pre-training    Natural           Normal    N/A               F(1,111)=10.29^   F(1,111)=7.58^    N/A
Pre-training    Natural           Mild      N/A               F(1,111)=10.84^   N/A               F(1,111)=10.23^
Pre-training    Natural           Moderate  F(1,111)=0.08     N/A               N/A               F(1,111)=2.69
Pre-training    Natural           Severe    N/A               N/A               F(1,111)=14.63*   N/A
Pre-training    Synthesized       Normal    N/A               F(1,111)=6.43^    F(1,111)=2.12     N/A
Pre-training    Synthesized       Mild      N/A               F(1,111)=0.04     N/A               F(1,111)=6.92^
Pre-training    Synthesized       Moderate  F(1,111)=1.03     N/A               N/A               F(1,111)=41.71^
Pre-training    Synthesized       Severe    N/A               N/A               F(1,111)=2.84     N/A
Post-training   Natural           Normal    F(1,111)=7.20*    N/A               F(1,111)=27.06*   N/A
Post-training   Natural           Mild      F(1,111)=6.27*    N/A               F(1,111)=4.43*    N/A
Post-training   Natural           Moderate  N/A               N/A               F(1,111)=9.14*    N/A
Post-training   Natural           Severe    F(1,111)=9.04*    N/A               F(1,111)=49.72*   N/A
Post-training   Synthesized       Normal    F(1,111)=20.16*   N/A               F(1,111)=27.76*   N/A
Post-training   Synthesized       Mild      F(1,111)=8.30*    N/A               F(1,111)=9.45*    N/A
Post-training   Synthesized       Moderate  N/A               N/A               F(1,111)=4.09*    N/A
Post-training   Synthesized       Severe    F(1,111)=14.50*   N/A               F(1,111)=85.12*   N/A

Key:
* Significantly higher variance when no anchor was used (p < 0.05).
^ Significantly lower variance when no anchor was used (p < 0.05).
N/A - planned contrast was not carried out as no effect of anchors was found for that set of stimuli.
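Each entry in Table 11 is a contrast with one numerator degree of freedom, F(1,111). One common way to obtain such a statistic when two conditions are compared within the same observations is to square the paired t statistic computed on the difference scores, which gives F(1, n - 1). The Python sketch below illustrates that idea only. Whether the contrasts in this study were computed this way is an assumption, and the function name and the variance values are hypothetical.

```python
import math
from statistics import mean, stdev


def planned_contrast_f(with_anchor, no_anchor):
    """Return (F, error df, mean difference) for a two-condition contrast.

    Both lists hold one variance score per observation, in the same order.
    F equals the squared paired t statistic, with df = (1, n - 1).
    """
    if len(with_anchor) != len(no_anchor):
        raise ValueError("both conditions need the same number of scores")
    diffs = [a - b for a, b in zip(with_anchor, no_anchor)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0.0:
        raise ValueError("difference scores have zero variance")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t * t, n - 1, mean(diffs)


# Hypothetical variances for four stimuli rated with and without anchors.
anchor_scores = [1.0, 2.0, 1.5, 2.5]
no_anchor_scores = [2.0, 3.5, 2.0, 4.5]
f_value, df_error, mean_diff = planned_contrast_f(anchor_scores, no_anchor_scores)
print(f_value, df_error, mean_diff)
```

A negative mean difference here would indicate lower variance with anchors than without; the sign has to be read alongside F because squaring the t statistic discards the direction of the effect, which is why the table key distinguishes the two directions with separate symbols.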
A second series of planned contrasts was carried out to determine which anchors, natural or synthesized, facilitated a lower variance. Table 12 lists the statistical results of the planned contrasts. Only four instances of significant difference were found when natural and synthesized anchors were compared. Three of them showed variances that were significantly lower when synthesized anchors were provided. Three instances of significant difference were found before training (Table 12).

Table 12: Statistical results of planned contrast between natural and synthesized anchors.

Quality  Gender  Severity  Pre-training      Post-training
Rough    Male    Normal    N/A               F(1,111)=3.16
Rough    Male    Mild      N/A               F(1,111)=0.03
Rough    Male    Moderate  F(1,111)=6.24*    N/A
Rough    Male    Severe    N/A               F(1,111)=0.56
Rough    Female  Normal    N/A               N/A
Rough    Female  Mild      F(1,111)=10.29*   N/A
Rough    Female  Moderate  N/A               N/A
Rough    Female  Severe    F(1,111)=0.00     N/A
Breathy  Male    Normal    F(1,111)=0.26     F(1,111)=1.31
Breathy  Male    Mild      N/A               F(1,111)=2.86
Breathy  Male    Moderate  N/A               F(1,111)=2.75
Breathy  Male    Severe    F(1,111)=3.10     F(1,111)=7.50*
Breathy  Female  Normal    N/A               N/A
Breathy  Female  Mild      F(1,111)=3.06     N/A
Breathy  Female  Moderate  F(1,111)=28.78^   N/A
Breathy  Female  Severe    N/A               N/A

Key:
* Significantly lower variance when synthesized anchors were provided.
^ Significantly higher variance when synthesized anchors were provided.
N/A - planned contrast was not carried out as no effect of anchors was found for that set of stimuli.

Summary of results

The results showed that both intra-rater agreement and inter-rater variability improved with training and when anchors were provided. When no anchor was provided, training did not have an effect on intra-rater agreement, but caused an increase in inter-rater variability. When anchors were provided, the intra-rater agreement for rough stimuli improved more than that for breathy stimuli, while the anchors were equally effective in reducing the inter-rater variability for breathy and rough stimuli. The use of anchors affected both the intra-rater agreement and inter-rater variability before and after training.
Generally, before training, the intra-rater agreement and inter-rater variability were better when no anchor was provided than when anchors were provided. Although the natural anchors only improved the intra-rater agreement for the male breathy stimuli, they helped to reduce the inter-rater variability for the male breathy and rough stimuli. The synthesized anchors were effective in improving the intra-rater agreement and inter-rater variability for male breathy and rough stimuli after training. No anchor effect was noticed for rating the female stimuli.

DISCUSSION

The first objective was to determine whether training improved the reliability and variability of perceptual voice evaluation. As predicted, training generally improved the reliability and reduced the variability of perceptual voice evaluation. The training program used in this study involved a stimulus-response-feedback-stimulus paradigm: the listeners were presented with a stimulus, gave a rating (response), received a modal answer (feedback), and finally heard the stimulus again. This paradigm required the listeners to compare and eventually replace their internal standards with the external standards.

Although training was generally effective in improving the reliability and reducing the variability of perceptual voice evaluation, this improvement was limited to the evaluations made with anchors. This showed that training alone was not sufficient to improve the reliability and reduce the variability of perceptual voice evaluation; anchors were also needed to ensure better reliability and lower variability. After training, the listeners rated the voice qualities with more reliability and less variability with anchors than with no anchor. This supported the view that the internal standards for pathological voice qualities were unstable.
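To make the two outcome measures discussed above concrete, the Python sketch below computes an intra-rater agreement percentage and an inter-rater variance. It assumes that intra-rater agreement is the percentage of repeated stimuli given an identical rating on both presentations, and that inter-rater variability is the sample variance of the ratings that different listeners give to one stimulus. These definitions, the function names and all ratings shown are hypothetical and may differ from the exact measures used in this study.

```python
from statistics import variance


def intra_rater_agreement(first_ratings, second_ratings):
    """Percentage of repeated stimuli that one listener rated identically twice."""
    if not first_ratings or len(first_ratings) != len(second_ratings):
        raise ValueError("need two equally long, non-empty rating lists")
    matches = sum(1 for a, b in zip(first_ratings, second_ratings) if a == b)
    return 100.0 * matches / len(first_ratings)


def inter_rater_variance(ratings_across_listeners):
    """Sample variance of the ratings several listeners gave to one stimulus."""
    return variance(ratings_across_listeners)


# Hypothetical data: one listener rating four repeated stimuli twice,
# and four listeners rating one stimulus without and with anchors.
print(intra_rater_agreement([3, 5, 7, 2], [3, 4, 7, 2]))
print(inter_rater_variance([4, 6, 5, 9]))   # no anchor
print(inter_rater_variance([5, 6, 5, 6]))   # with anchors
```

Under these definitions, higher agreement percentages and lower variances both indicate more reliable judgments, which is the direction of improvement described in the text.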
Contrary to the hypothesis of this study, the use of anchors actually led to lower reliability and higher variability in the pre-training session. However, as predicted, reliability improved when anchors were provided in the post-training session. This could be attributed to the adjustment that the listeners had to make to their internal standards when the external standards (anchors) were provided. It seemed that training provided an opportunity for the listeners to adjust to the external standards. Although previous studies emphasized the importance of using external standards in perceptual voice evaluation, this study showed that establishing the internal standards was equally important. If the listeners' internal standards are not stable or not well formed, the listeners are less likely to make use of the external standards because they may not be able to pick up the relevant and distinctive features from the external standards.

The results of this study showed that training generally improved the intra-rater agreement and the inter-rater variability in perceptual voice evaluation. Given that the training program took only around 30 minutes to complete and needed only minimal guidance, it is recommended for teaching perceptual voice evaluation to naive listeners.

The second objective was to test the effect of natural and synthesized anchors on the reliability and variability of perceptual voice evaluation. Generally, the results suggested that the synthesized anchors and the natural anchors were both effective in improving the intra-rater agreement and inter-rater variability. However, the synthesized anchors were better than the natural anchors in achieving intra-rater agreement. This result did not support the hypothesis that natural anchors would be better than synthesized anchors in improving intra-rater agreement and reducing inter-rater variability.
This could be because, for the natural voice anchors, it was difficult to find samples with isolated abnormal voice qualities (roughness or breathiness only). The abnormal voice qualities often co-existed, and this occurred in the natural voice anchors used in this study. With the synthesized voice anchors, however, the two qualities were created independently and were therefore distinguishable in isolation. The listeners might have found it more difficult to focus on just one quality with the natural anchors, whereas with the synthesized anchors they might have found it easier to distinguish between the voice qualities. Nevertheless, natural anchors were still effective in improving inter-rater variability: even though the listeners may not have focused on the same voice quality of the natural anchors each time they listened to them, the information drawn from the natural anchors would be more stable than that drawn from the listeners' internal standards. In addition, all listeners used the same set of external standards (the anchors); thus, the listeners' mental representations of the voice qualities (roughness and breathiness) should be similar.

The anchors did not seem to improve the intra-rater agreement for rating breathy voice stimuli, although they were useful in reducing the inter-rater variability. However, the anchors were effective in improving the intra-rater agreement and inter-rater variability for rating rough voice stimuli. One interpretation is that, because breathiness is perceptually more distinctive than roughness, the listeners could rate the breathiness severity consistently with or without anchors or training. However, when no anchors or training were provided, the judgment of breathiness severity varied from listener to listener. It was training and anchors that reduced the variability between the listeners' breathiness ratings.
This suggested that even with well-developed internal standards, the external standards are effective for calibrating the internal standards between the listeners.

Although training and external standards improved the listeners' agreement and variability in perceptual voice evaluation, the intra-rater agreements were still between 61% and 76%, while the variances were between 0.89 and 7.78. This suggests that the amount of training included in this study may not have been enough to shape the listeners' internal standards for pathological voice qualities. In addition, the number of voice stimuli included in the anchors may not have been enough. Further studies are recommended to test whether additional training or anchors can improve the intra-rater agreement. If both fail to improve the intra-rater agreement, it would reflect a limitation of perceptual voice evaluation.

In summary, training was found to improve the intra-rater agreement and inter-rater variability when anchors were provided. This suggested that improving the stability of the internal standards was as important as providing external standards in order to improve the intra-rater agreement and inter-rater variability. Listeners showed better intra-rater agreement and inter-rater variability when anchors were provided than when no anchor was provided. This supported the idea that the listeners' internal standards for pathological voice qualities were unstable. Synthesized anchors were more effective than natural anchors in improving the intra-rater agreement and inter-rater variability in this study. This suggested that listeners found it easier to distinguish between the voice qualities in synthesized signals than in natural voice samples.

LIMITATION OF THE PRESENT STUDY

It was difficult to compare the responses to male and female stimuli, as male and female voices may involve different perceptual frameworks.
In addition, male and female listeners may have different perceptual frameworks; however, as the gender of the listeners was not balanced in this study (26 females and 2 males), no comparison could be made.

No conclusion could be drawn on the nature of the internal standards. However, it is hypothesized that the listeners form a prototype of each quality before they build up their representations of different severity levels. Further analysis of the present data may reflect the nature of the internal standards by examining whether the listeners are better at detecting the presence of the qualities than at determining the severity level under different conditions.

CONCLUSION

Based on the results of this study, it is recommended that synthesized voice stimuli be provided as external standards in perceptual voice evaluation protocols. An evaluation protocol similar to the one used in this study is recommended as an assessment tool for future voice research and clinical practice.

Further study is recommended to test whether the intra-rater agreement and inter-rater variability can be further improved when additional severity levels are included in the anchors and when the training is more extensive. This is important for ensuring that perceptual voice evaluation is a valid assessment tool for research and clinical practice.

ACKNOWLEDGEMENTS

Special thanks to Dr. Edwin Yiu, Ms. Emily Chan and Mr. Miroslav Kiimmel for their guidance and support throughout the development of this study. In addition, I would like to express my sincere gratitude to the following people for participating in my study: Ms. Sabina Chan, Ms. Pance Kung, Ms. Polly Lau, Ms. Elsa Wong, Ms. Cynthia Woo and all Year I and III students at the Department of Speech and Hearing Sciences.

REFERENCES

ANDERS, L., HOLLIEN, H., HURME, P., SONNINEN, A., & WENDLER, J. (1988). Perception of hoarseness by several classes of listeners. Folia Phoniatrica, 40, 91-100.
BASSICH, C.J., & LUDLOW, C.L. (1986).
The use of perceptual methods by new clinicians for assessing voice quality. Journal of Speech and Hearing Disorders, 51, 125-
DE KROM, G. (1995). Some spectral correlates of pathological breathy and rough voice quality for different types of vowel fragments. Journal of Speech and Hearing Research, 35, 794-811.
GELFER, M.P. (1988). Perceptual attributes of voice: Development and use of rating scales. Journal of Voice, 2, 320-326.
GERRATT, B.R., KREIMAN, J., ANTONANZAS-BARROSO, K., & BERKE, G. (1993). Comparing internal and external standards in voice quality judgements. Journal of Speech and Hearing Research, 36, 14-20.
HAMMARBERG, B., FRITZELL, B., GAUFFIN, L., SUNDBERG, J., & WEDIN, L. (1980). Perceptual and acoustic correlates of abnormal voice qualities. Acta Otolaryngologica, 90, 441-451.
KREIMAN, J., GERRATT, B.R., & BERKE, G.S. (1994). The multidimensional nature of pathological voice quality. Journal of the Acoustical Society of America, 96, 1291-1302.
KREIMAN, J., GERRATT, B.R., KEMPSTER, G.B., ERMAN, A., & BERKE, G.S. (1993). Perceptual evaluation of voice quality: Review, tutorial, and a framework for future research. Journal of Speech and Hearing Research, 36, 21-40.
KREIMAN, J., GERRATT, B.R., PRECODA, K., & BERKE, G.S. (1992). Individual differences in voice quality perception. Journal of Speech and Hearing Research, 35, 512-520.
MARTIN, D.P., & WOLFE, V.I. (1996). Effects of perceptual training based upon synthesized voice signals. Perceptual and Motor Skills, 83, 1291-1298.
MORAN, M.J., & GILBERT, H.R. (1984). Relation between voice profile ratings and aerodynamic and acoustic parameters. Journal of Communication Disorders, 17, 245-260.
SHEWELL, C. (1998). The effect of perceptual training on ability to use the vocal profile analysis scheme. International Journal of Language and Communication Disorders, 33, S322-326.
YIU, E. (1999). Perceptual voice properties of synthesized voice signals. Project in progress.
YUMOTO, E., SASAKI, Y., & OKAMURA, H. (1984).
Harmonics-to-noise ratio and psychophysical measurement of the degree of hoarseness. Journal of Speech and Hearing Research 27, 2-6.