Title: The effect of anchors and training on the reliability of perceptual voice evaluation
Author(s): Chan, Man-kei, Karen
Other Contributor(s): University of Hong Kong
Issued Date: 2000
URL: http://hdl.handle.net/10722/56318
Rights: The author retains all proprietary rights, such as patent rights
and the right to use in future works. This work is licensed under
a Creative Commons Attribution-NonCommercial-NoDerivatives
4.0 International License.
The Effect of Anchors and Training
on the Reliability of Perceptual Voice Evaluation
CHAN, Man Kei Karen
A dissertation submitted in partial fulfillment of the requirements for the Bachelor of
Science (Speech and Hearing Sciences), The University of Hong Kong, May 10, 2000.
ABSTRACT
The reliability of perceptual voice evaluation is important in voice research and
clinical practice. The literature shows that inter-rater and intra-rater reliability in
perceptual voice evaluation can be very variable. Kreiman, Gerratt, Kempster, Erman
and Berke (1993) proposed a framework that suggested that listeners recall their own
internal standard when asked to judge voice qualities. These internal standards are
unreliable as they could be affected by different factors. Therefore, external referenced
standards (anchors) were proposed to replace these internal ones in order to improve the
reliability of perceptual voice judgments.
The aim of this study was to investigate whether training and provision of anchors
would improve the reliability of perceptual voice evaluation, and whether natural
or synthesized anchors would be more effective. Twenty-eight naive listeners with no
training in voice quality judgment were included in this study. All of them attempted
two rating sessions (pre-training and post-training) and a training session. They were
required to rate a set of 56 stimuli under three different conditions (with no external
standards, with natural stimuli as anchors and with synthesized stimuli as anchors). The
samples covered the roughness and breathiness qualities at three different levels of
severity (mild, moderate and severe).
Results showed that training and the use of anchors increased intra-rater
agreement and reduced the inter-rater variability of perceptual voice judgments. This supported
previous findings that using anchors in perceptual voice judgments is more reliable than
just relying on individuals' variable internal standards. Furthermore, synthesized anchors
were more effective than natural anchors in increasing intra-rater agreement and
reducing inter-rater variability.
INTRODUCTION
Perceptual voice evaluation is an important assessment tool in both voice research
and clinical practice. A number of studies on the validity of instrumental measurements
of voice, such as acoustic and aerodynamics measurements, used perceptual voice
analysis as the gold standard (e.g., de Krom, 1995; Moran & Gilbert, 1984; Yumoto,
Sasaki, & Okamura, 1984). Clinicians also use perceptual evaluation in voice
assessment and voice therapy as outcome measures. Hence, the reliability of perceptual
voice evaluation is fundamental to the study and assessment of voice disorders. As
perceptual voice evaluation is a subjective process, its reliability needs to be established.
Studies have shown that inter-rater and intra-rater reliability vary widely in perceptual voice
evaluation. Inter-rater reliability is often a concern (see Kreiman, Gerratt, Kempster,
Erman, & Berke, 1993).
Kreiman and her associates (Kreiman, Gerratt, Precoda & Berke, 1992; Kreiman
et al., 1993) reviewed factors that could have contributed to this low reliability and
proposed a framework for the perception of voice quality. They proposed that some
form of internal standards is recalled when listeners are required to judge voice qualities.
These internal standards are believed to form from the listener's experience with voice
and are stored in memory. Studies showed that these internal standards were unreliable
and could be affected by internal and external factors, such as memory and acoustic
context (Gerratt, Kreiman, Antonanzas-Barroso & Berke, 1993; Kreiman et al., 1992;
Kreiman et al., 1993). This could lead to a high variability in making perceptual voice
judgments.
As these internal standards were unreliable, external standards have been suggested
to replace the internal ones in making perceptual voice judgments (Gerratt et al., 1993;
Kreiman et al., 1993). External standards are selected voice samples that act as anchors
representing different voice qualities of varying degree. Studies using anchors of natural
or synthesized voice samples showed listeners could improve their reliability in
perceptual voice evaluation (e.g. Gerratt et al., 1993). However, no comparison has been
done to determine which of these two types of anchors is more effective in providing
better referencing points. Although natural voice samples are closer in nature to the
voice samples to be rated, it is difficult to vary their parameters systematically. On the
other hand, the parameters of the synthesized signals can be systematically manipulated.
The disadvantage of using synthesized signals is their unnatural quality that sometimes
makes them difficult to be compared with natural voice samples. One of the objectives
of this study was to compare the effect of these two types of anchors on the reliability of
perceptual voice evaluation.
Another possible way of improving the reliability of perceptual voice evaluation is
through training. Many studies provided training to improve the reliability of perceptual
voice evaluation (e.g. Bassich & Ludlow, 1987; Martin & Wolfe, 1996; Shewell, 1998).
However, it is difficult to conclude whether training improves the reliability of perceptual
voice evaluation (Kreiman et al., 1993) because few of the previous studies
described what was involved in their training (e.g., Anders, Hollien, Hurme,
Sonninen, & Wendler, 1988).
It is important to understand how training would affect the reliability of perceptual
voice evaluation. This would provide information on how the skill of perceptual voice
evaluation is acquired and how listeners form their internal standards. With this
information, more effective training and more reliable perceptual voice evaluation
protocol can be designed. Another objective of this study was to investigate the effect of
training on the reliability of perceptual voice evaluation. Participants with no experience
in assessing and treating voice disorders were included in this study. It is reasonable to
assume that these participants had internal standards of voice qualities formed from their
daily experience. If these listeners showed any improvement in the reliability of their
perceptual voice ratings after training, it would suggest that the improvement could be
attributed to the training effect.
The reliability of perceptual voice evaluation is also related to the types of voice
qualities to be rated in the evaluation. There are different voice qualities and a wide
range of terminology used to describe them, making it difficult to compare
between studies in research settings and between clinicians in
clinical settings. In the current study, roughness and breathiness were used because
these two qualities have been commonly investigated by researchers
(e.g. Hammarberg, Fritzell, Gauffin, Sundberg, & Wedin, 1980; Kreiman,
Gerratt, & Berke, 1994; Martin & Wolfe, 1996). The definitions of breathiness and
roughness adopted in this study are listed in Table 1.
Table 1. Definitions of roughness and breathiness adopted in this study.

Quality: Roughness (harshness or hoarseness)
  Perceptual correlates:    Irregular quality; random fluctuations of glottal pulse;
                            lack of clarity; uneven quality.
  Acoustic correlates:      Aperiodic mode of vibration; perturbation of the spectrum.
  Physiological correlates: (Believed to be) due to irregular vibration of the vocal folds.

Quality: Breathiness (whispery voice or whisperiness)
  Perceptual correlates:    Audible sound of expiration; audible air escape;
                            audible friction noise.
  Acoustic correlates:      Related to a significant component of noise due to turbulence.
  Physiological correlates: Incomplete closure of vocal folds/glottis during phonation.

The current study used the visual analogue (VA) scale because it has been found to
be more reliable than another commonly used scale, the equal-appearing interval
(EAI) scale (e.g. Kreiman et al., 1993). The EAI scale uses an equidistant scale with a fixed
number of points (usually five or seven points), and listeners have to rate the voice
sample on these points. The VA scale uses an unmarked line, usually 10 cm long. The voice
sample is rated by marking a point along the line.
Objectives
The first objective of this study was to test the effect of training on the reliability
and variability of perceptual voice evaluation. Training should shape the listeners'
internal standards in perceptual voice qualities. Hence, it was hypothesized that with
training, the listeners would show improvement in the reliability of perceptual voice
evaluation.
The second objective was to test the effect of natural and synthesized anchors on
the reliability and variability of perceptual voice evaluation. It was hypothesized that if
the listeners used the natural or synthesized samples as the external standards in making
perceptual voice judgments, the reliability would be higher than if they used their
internal standards. It was further hypothesized that the natural anchors would be more
effective than the synthesized anchors as listeners could better match the natural anchors
with the natural voice samples they were required to rate.
METHOD
Stimuli
Three sets of stimuli, namely testing stimuli, anchor stimuli and training stimuli,
were used in this study.
Preparation of the Testing Stimuli.
The testing stimuli were used in the rating tests to assess the participants'
reliability of perceptual voice evaluation. The testing stimuli were made up of two
gender sets: male and female natural voices. Cantonese sentences /pa pa ta po/ were
recorded in a sound-treated room from 14 female speakers (12 dysphonic speakers and
two speakers with normal voice) and 13 male speakers (10 dysphonic speakers and three
with normal voice; one speaker with normal voice was required to simulate two types of
dysphonic voice). The mean age of the female speakers was 33 (SD = 7.79, range = 20 - 40),
while that of the male speakers was 36.38 (SD = 13.91, range = 21 - 73).
Each gender set of stimuli (male and female) covered three levels of severity (mild,
moderate and severe) and two types of quality (roughness and breathiness), resulting in
six categories in each set of voice samples. There were two voice samples in each of
these six categories, resulting in a total of 12 dysphonic samples. The severity and
quality type were determined by a speech therapist with over two years of experience in
perceptual voice evaluation. Together with two normal voice samples, there were a total
of 14 voice samples. Each voice sample was duplicated, resulting in a total of 28 voice
stimuli in each gender set of testing stimuli.
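
The stimulus counts described above can be checked with a short calculation. The sketch below only restates the arithmetic given in the text; the variable names are illustrative and are not taken from the study's materials.

```python
# Stimulus counts for the testing stimuli, restating the arithmetic in the text.
# Variable names are illustrative only.
severity_levels = ["mild", "moderate", "severe"]
qualities = ["roughness", "breathiness"]
samples_per_category = 2   # two voice samples in each severity-by-quality category
normal_samples = 2         # normal voice samples in each gender set
presentations = 2          # each voice sample was duplicated
gender_sets = 2            # male and female

categories = len(severity_levels) * len(qualities)
dysphonic_samples = categories * samples_per_category
samples_per_set = dysphonic_samples + normal_samples
stimuli_per_set = samples_per_set * presentations
total_stimuli = stimuli_per_set * gender_sets

print(categories, dysphonic_samples, samples_per_set, stimuli_per_set, total_stimuli)
```

The final figure agrees with the set of 56 stimuli reported in the abstract.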
As it was difficult to identify dysphonic speakers with the desired roughness or
breathiness severity, two male dysphonic samples were recorded from one normal
speaker simulating mild breathiness and severe roughness.
Preparation of anchor stimuli.
Two types of anchor stimuli were used as the external reference stimuli during the
testing and training sessions. The natural anchor stimuli included a normal voice and
four dysphonic voices, and the synthesized anchor stimuli included a normal voice and six
dysphonic voices. The dysphonic stimuli covered two dysphonic qualities (breathiness
and roughness) at two severity levels (mild and moderate). The stimuli were recordings
or syntheses of Cantonese sentences /pa pa ta po/.
The female natural anchor stimuli set consisted of two rough voice samples from
two female speakers with hyperfunctional disorder; two breathy samples from two
female speakers with normal voice but deliberately producing breathiness; and one
normal voice produced by a female speaker with normal voice. The mean age of these
five female speakers was 30.4 (SD = 12.01, range = 21 - 45). The breathy voice anchors
were produced by speakers with normal voice because it was difficult to identify
participants whose voice was purely breathy.
With the male natural anchor stimuli set, it was difficult to identify dysphonic
speakers with the desired roughness and breathiness severity. Therefore, a male
speaker with normal voice was asked to produce roughness and breathiness at mild and
moderate levels of severity. Another male speaker was asked to record a normal voice
anchor stimulus. The mean age of the male speakers was 32 (SD = 14.14, range = 22 - 42).
None of the anchor stimuli were used as the testing stimuli.
The synthesized anchor stimuli consisted of 14 anchors (seven female and seven
male) developed and validated by Yiu (1999). For each gender set, there was an anchor
representing normal voice; two for breathiness (mild and moderate); two for Type I
roughness (mild and moderate) and two for Type II roughness (mild and moderate). Two
types of roughness were included as anchors because the listeners perceived two sets of
perceptually distinct synthesized stimuli as roughness (Yiu, 1999).
Preparation of training stimuli.
The training stimuli were used as practice items by the participants. The training
stimuli were made up of two gender sets: male and female natural voices. The voice
stimuli were recordings of Cantonese sentences /pa pa ta pa/ made in a sound-treated
room from ten female speakers (eight speakers with dysphonia and two speakers with
normal voice) and three male speakers (one speaker with dysphonia and two speakers
with normal voice). The mean age of these ten female speakers was 32 (SD = 15.26,
range = 20 - 66) and that of the male speakers was 33 (SD = 10.15, range = 22 - 42).
The voice samples covered three levels of severity (mild, moderate and severe) and two
types of quality (roughness and breathiness). For each gender set of training stimuli,
there were one mild, two moderate and one severe voice sample for each type of quality,
resulting in eight dysphonic samples. As it was difficult to identify dysphonic speakers
with the desired roughness or breathiness severity, seven male dysphonic samples were
recorded from one normal speaker deliberately producing voices with the two dysphonic
qualities at mild, moderate and severe levels. Half of the training stimuli were used as the
anchor stimuli. The severity and quality type were determined by five speech therapists,
each with over two years of experience in perceptual voice evaluation. Together with the
two normal voice samples, there were ten voice samples in each gender set of training
stimuli. Each voice sample was duplicated, resulting in 20 voice stimuli in each gender
set of training stimuli.
Participants
Twenty-six females and two males (mean age = 20.71, SD = 1.08, range = 19 - 22),
participated in this study. All were native Cantonese speakers undergoing training to be
speech therapists. None of them had received training in voice disorders or perceptual
voice evaluation at the time of testing.
Procedures
Each participant participated in a pre-training rating session, a training session and
a post-training rating session. The three sessions were one week apart from each other,
with the training session carried out in the second week. In each session, the participants
were required to rate the severity of breathiness and roughness of the testing or training
stimuli using a 10-cm long VA scale.
The rating tests and training program were presented through a computerized
program designed for this study. The voice stimuli were presented in free field through
two loudspeakers (Philips, MMS100) placed approximately 50 cm away from the
listeners in a sound-treated room. The loudness level, at which the voice stimuli were
played, was kept at around 80 dB SPL.
Pre- and post-training rating sessions.
Each rating session was made up of six rating tests with different types of anchor
stimuli provided (Table 2). The presentation order of the tests was randomized across the
participants and across the two rating sessions to counterbalance any learning effect
from the order of the presentations.
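
The randomization of test order described above could be sketched as follows. This is an illustration only: the original computer program is not described in detail, so the function name `presentation_order`, the seeding scheme and the tuple labels are all assumptions made for this sketch.

```python
import random

# The six rating tests: gender of testing stimuli crossed with anchor condition.
RATING_TESTS = [
    ("female", "natural anchor"),
    ("female", "synthesized anchor"),
    ("female", "no anchor"),
    ("male", "natural anchor"),
    ("male", "synthesized anchor"),
    ("male", "no anchor"),
]


def presentation_order(participant_id, session, seed=0):
    """Return a shuffled copy of the six tests for one participant and session.

    Seeding with the participant and session (a hypothetical choice) makes the
    order reproducible while still differing across participants and sessions.
    """
    rng = random.Random(f"{seed}-{participant_id}-{session}")
    order = list(RATING_TESTS)
    rng.shuffle(order)
    return order


if __name__ == "__main__":
    for test in presentation_order(participant_id=1, session="pre-training"):
        print(test)
```

Each call returns all six tests exactly once, so every participant still completes every condition in each rating session.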
Table 2: Details of the six rating tests.

Gender of testing stimuli    Anchor stimuli
Female                       Female natural voice
Female                       Female synthesized voice
Female                       No anchor
Male                         Male natural voice
Male                         Male synthesized voice
Male                         No anchor

Definitions and descriptions of breathiness and roughness were provided for each
participant in all rating sessions (Table 1). For those tests with anchor stimuli provided,
the participants could choose to listen to the anchor stimuli as frequently as they liked.
However, the computer program was designed so that the participants were required to
review all anchor stimuli at least once every four testing stimuli. Participants could also
choose to listen to the rating stimuli as many times as they wanted but were not allowed
to return to items already rated. They were asked to rate the severity of each quality
(breathiness and roughness) using a 10-cm long scroll bar. Each rating session lasted for
approximately one hour.
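
The rule that all anchors must be reviewed at least once every four testing stimuli can be expressed as a small counter. The class below is a hypothetical sketch of such a rule; `AnchorReviewGate` and its method names are invented here and do not describe how the original program was written.

```python
class AnchorReviewGate:
    """Tracks whether the anchors must be replayed before rating can continue.

    Hypothetical helper: the limit of four items comes from the text, the rest
    of the design is an assumption.
    """

    def __init__(self, max_items_between_reviews=4):
        self.max_items = max_items_between_reviews
        self.rated_since_review = 0

    def review_required(self):
        """True once the listener has rated the maximum number of items."""
        return self.rated_since_review >= self.max_items

    def record_review(self):
        """Call after the listener has heard all anchor stimuli."""
        self.rated_since_review = 0

    def record_rating(self):
        """Call after each rated stimulus; refuses if a review is overdue."""
        if self.review_required():
            raise RuntimeError("anchors must be reviewed before rating continues")
        self.rated_since_review += 1
```

Listeners may call `record_review` at any time, which matches the freedom to listen to the anchors more often than the minimum.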
Training session.
The training program was made up of two parts, one part with male voice stimuli
and the other part with female voice stimuli. Each part was made up of 20 practice
stimuli. Natural anchor stimuli were included as external reference points. At the
beginning of the training program, four anchor stimuli and two training stimuli, covering
three severity levels (mild, moderate and severe) and two quality types (roughness and
breathiness), were presented together with the definitions of the two qualities (Table 1)
through the computer program. The participants were required to listen to all the anchor
stimuli at least once after every four practice items. They could, however, choose to
listen to the anchors more frequently. They were also allowed to listen to the practice
items as frequently as they wanted.
The participants were asked to rate each practice stimulus using a 10-cm long
scroll bar with the anchor stimuli as references. Suggested ratings (breathiness and
roughness) for each practice stimulus were provided after each rating. The suggested
ratings for each practice item were the average of the ratings determined by five expert
speech therapists, each with more than two years of experience in perceptual voice
evaluation. Before the participants could proceed with the training program, they were
required to listen to each practice item again after the suggested ratings were presented.
Each training session lasted for approximately half an hour.
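
As a worked illustration of how a suggested rating is obtained, the sketch below averages five expert ratings on the 10-cm scale. The function name and the example values are hypothetical and are not data from the study.

```python
from statistics import mean


def suggested_rating(expert_ratings_cm):
    """Average the expert ratings (in cm on the 10-cm scale) for one practice item."""
    if not expert_ratings_cm:
        raise ValueError("at least one expert rating is required")
    return mean(expert_ratings_cm)


if __name__ == "__main__":
    # Hypothetical roughness ratings from five experts for one practice item.
    print(suggested_rating([6.0, 7.0, 6.5, 7.5, 8.0]))
```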
Data Analysis
The breathiness and roughness ratings of the two sets of testing stimuli (male and
female) were analyzed to determine the reliability and agreement of the participants
under different conditions (provision of training and/or anchors). Although the
participants rated each stimulus for both breathiness and roughness, only the relevant
quality was analyzed for the designated stimuli, i.e., breathiness ratings were analyzed
for the designated breathy stimuli and roughness ratings were analyzed for the
designated rough stimuli.
As each testing stimulus was presented twice in each set of testing stimuli, intra-rater
agreement was calculated to determine the agreement in rating two identical stimuli.
As the probability of two ratings both falling by chance within a given 1-cm segment of a
10-cm long scale is 0.01 (0.1 x 0.1), the present study considered two ratings that were
within 1 cm of one another as reaching agreement. This intra-rater agreement reflected
the stability of each participant's performance in judging voice qualities.
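
The agreement criterion can be illustrated with a minimal sketch. It assumes that ratings are recorded in centimetres and that a difference of exactly 1 cm counts as agreement; the text does not state how the boundary case was treated, and the function name and example values are hypothetical.

```python
def intra_rater_agreement(rating_pairs, tolerance_cm=1.0):
    """Percentage of repeated stimuli whose two ratings lie within the tolerance.

    rating_pairs: list of (first_rating, second_rating) tuples, in cm, for one
    participant. Treating a difference equal to the tolerance as agreement is
    an assumption of this sketch.
    """
    if not rating_pairs:
        raise ValueError("at least one pair of ratings is required")
    agreed = sum(
        1 for first, second in rating_pairs if abs(first - second) <= tolerance_cm
    )
    return 100.0 * agreed / len(rating_pairs)


if __name__ == "__main__":
    # Hypothetical ratings of four stimuli, each heard twice by one listener.
    pairs = [(2.0, 2.5), (5.0, 7.2), (8.1, 7.4), (0.3, 0.0)]
    print(intra_rater_agreement(pairs))
```

In the example, three of the four pairs differ by no more than 1 cm, so the listener's agreement is 75%.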
The variance of the ratings was used to determine the variability of the participants
in judging the testing stimuli. A low variance suggested the judges were cohesive in
judging the stimuli. The inter-rater variability reflected the relationship among the
participants' standards in perceptual voice qualities. The variability analyses were
carried out separately for the stimuli at the three levels of severity (mild, moderate and
severe) of each quality type (roughness and breathiness) and for the normal stimuli.
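
A minimal sketch of the variability measure follows. It groups ratings by a category label and computes the variance within each group. The use of the sample variance (dividing by n - 1) is an assumption, since the text does not say which estimator was used, and the labels and values are hypothetical.

```python
from collections import defaultdict
from statistics import variance


def variance_by_category(ratings):
    """Return the variance of the ratings within each category.

    ratings: iterable of (category, rating_cm) tuples pooled over participants,
    e.g. ("mild rough", 3.2). Uses the sample variance, which is an assumption.
    """
    grouped = defaultdict(list)
    for category, rating_cm in ratings:
        grouped[category].append(rating_cm)
    return {category: variance(values) for category, values in grouped.items()}


if __name__ == "__main__":
    # Hypothetical ratings from three listeners for two categories.
    data = [
        ("normal", 0.5), ("normal", 0.5), ("normal", 0.5),
        ("mild rough", 2.0), ("mild rough", 3.0), ("mild rough", 4.0),
    ]
    print(variance_by_category(data))
```

A lower value for a category indicates that the listeners' ratings were closer to one another for those stimuli.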
RESULTS
The breathiness and roughness ratings of the two sets of testing stimuli (male and
female) were analyzed to determine the agreement and variability of the participants'
perceptual voice evaluation under different conditions (provision of training and/or
anchors).
Intra-rater Agreement
The mean percentages of agreement for the different conditions are listed in Table
3. They ranged from 49% to 76%, depending on the
condition of the tests.
Table 3: The mean percentage of intra-rater agreement in each rating test.

Anchor        Rating          Gender of        Roughness                  Breathiness
provided      session         voice stimuli    Mean    SD      Range      Mean    SD      Range
Natural       Pre-training    Male             55.80   17.17   25-88      69.20   18.78   25-100
Natural       Pre-training    Female           50.45   19.69   25-100     49.11   16.64   13-75
Natural       Post-training   Male             63.39   20.95   13-100     70.09   19.35   13-100
Natural       Post-training   Female           61.16   16.44   25-88      62.95   20.83   25-88
Synthesized   Pre-training    Male             62.95   18.78   25-88      64.29   14.32   38-100
Synthesized   Pre-training    Female           58.48   15.23   38-88      56.25   18.48   13-88
Synthesized   Post-training   Male             75.89   16.64   38-100     70.98   18.96   25-100
Synthesized   Post-training   Female           70.54   21.30   13-100     63.39   17.32   13-88
No anchor     Pre-training    Male             64.29   19.16   25-100     60.27   20.43   13-88
No anchor     Pre-training    Female           63.39   17.65   13-88      62.95   16.83   38-100
No anchor     Post-training   Male             58.93   26.32   0-100      55.80   25.57   13-100
No anchor     Post-training   Female           69.64   19.96   13-100     71.88   16.54   25-100

SD = Standard deviation
Effects of training (training versus no training).
Separate ANOVAs were performed on the intra-rater agreement for each set of
stimuli (female and male) and ratings (roughness and breathiness), with pre- and
post-training as the within-subject factor to test whether there were significant training effects.
Results showed that there were some significant training effects. When the natural
anchors were provided, intra-rater agreement increased after training for both the female
rough and breathy stimuli. When the synthesized anchors were provided, intra-rater
agreement increased after training for both the male and female rough stimuli. When no
anchor was used, training did not seem to show any significant effect. Table 4 lists the
pair-wise contrasts for each testing condition.
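
For a within-subject factor with only two levels (pre- and post-training), the repeated-measures F statistic equals the square of the paired t statistic, with degrees of freedom (1, n - 1). The sketch below computes it this way as an illustration; it is not the statistical software used in the study, and the function name and example percentages are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def training_effect_f(pre, post):
    """F statistic for a two-level within-subject factor, via the paired t test.

    pre, post: agreement percentages for the same participants, in the same
    order. Returns (F, (df1, df2)) where F = t ** 2 and df = (1, n - 1).
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need paired scores from at least two participants")
    diffs = [after - before for before, after in zip(pre, post)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("the differences show no variability")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t * t, (1, len(diffs) - 1)


if __name__ == "__main__":
    # Hypothetical agreement percentages for four listeners.
    pre = [50.0, 62.5, 37.5, 75.0]
    post = [62.5, 75.0, 62.5, 75.0]
    print(training_effect_f(pre, post))
```

With 28 participants the degrees of freedom are (1, 27), which matches the values reported for the training contrasts.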
Table 4: Statistical results of within-subject contrast tests of the effect of training on
intra-rater agreement.

Quality of      Gender of       Training effect (post-training > pre-training)
voice stimuli   voice stimuli   Natural anchor      Synthesized anchor   No anchor
Rough           Male            F(1, 27) = 3.45     F(1, 27) = 11.43*    F(1, 27) = 1.68
Rough           Female          F(1, 27) = 5.70*    F(1, 27) = 6.70*     F(1, 27) = 2.08
Breathy         Male            F(1, 27) = 0.06     F(1, 27) = 2.97      F(1, 27) = 0.76
Breathy         Female          F(1, 27) = 10.00*   F(1, 27) = 2.40      F(1, 27) = 2.93

* indicates significantly higher intra-rater agreement after training (p < 0.05)
Although the intra-rater agreements were in general higher in the post-training
session than in the pre-training rating session for most stimuli (except with the male
breathy and rough stimuli when no anchor was provided), not all of them were
statistically significant (Table 4).
Effect of Anchors.
A series of ANOVAs were performed on the intra-rater agreement to determine if
there was a significant difference among the three anchor conditions (natural anchors,
synthesized anchors and no anchor). Table 5 lists the Mauchly test statistics and the
associated F values indicating the effect of anchors on rating different types of stimuli
before and after training.
Table 5: ANOVA results testing for effect of anchor (natural, synthesized, no anchor) on
intra-rater agreement with the associated results of Mauchly tests.

Quality of      Gender of       Anchor effect
voice stimuli   voice stimuli   Pre-training          Post-training
Rough           Male            F(2, 48) = 3.45*      F(2, 26) = 4.61*
                                M(2) = 0.82           M(2) = 0.76*
Rough           Female          F(2, 49) = 6.83*      F(2, 51) = 3.45*
                                M(2) = 0.82           M(2) = 0.87
Breathy         Male            F(2, 54) = 3.11       F(2, 54) = 7.22*
                                M(2) = 0.97           M(2) = 1.00
Breathy         Female          F(2, 54) = 7.21*      F(2, 54) = 3.00
                                M(2) = 0.92           M(2) = 0.95

Note: The Mauchly test (M) was used to test if the data violated the sphericity
assumption. If the p value for M is larger than 0.05, univariate F statistics with
Huynh-Feldt epsilon adjusted degrees of freedom were reported. If the p value for
M is less than 0.05, F values from Wilks' Lambda test were reported.
* indicates significance (p < 0.05)
The anchor effect was significant for most stimuli in both rating sessions, except
for the male breathy stimuli in the pre-training session and the female breathy stimuli in
the post-training session (Table 5).
Two types of planned contrasts were carried out for those stimuli that demonstrated
significant effects of the anchors. The first series of planned contrasts was
carried out to determine whether the provision of anchors (natural or synthesized) would
result in better intra-rater agreement than no provision of anchors. Table 6 lists the
statistical results of the planned contrasts.
Table 6: Statistical results of Bonferroni t-tests between the use of different anchors
(natural, synthesized and no anchors)

Quality of      Gender of       Pre-training                             Post-training
voice stimuli   voice stimuli   Natural vs.         Synthesized vs.      Natural vs.         Synthesized vs.
                                No anchor           No anchor            No anchor           No anchor
Rough           Male            t(1, 27) = 2.14^    t(1, 27) = 0.35      t(1, 27) = 1.00     t(1, 27) = 3.00*
Rough           Female          t(1, 27) = 2.89^    t(1, 27) = 1.58      t(1, 27) = 1.99     t(1, 27) = 0.28
Breathy         Male            N/A                 N/A                  t(1, 27) = 3.00*    t(1, 27) = 3.14*
Breathy         Female          t(1, 27) = 3.91^    t(1, 27) = 1.83      N/A                 N/A

Key: * Significantly lower intra-rater agreement when no anchor was used (p < 0.05).
     ^ Significantly higher intra-rater agreement when no anchor was used (p < 0.05).
     N/A - planned contrast was not carried out as no effect of anchors was found for
     that set of stimuli.
Before training, the intra-rater agreements in rating male and female rough stimuli,
and the female breathy stimuli were significantly higher when no anchor was given than
when natural anchors were provided. No significant differences were found when the
synthesized anchor condition was compared with the no anchor condition.
After training, the intra-rater agreements in rating male rough and breathy stimuli
were significantly higher when synthesized anchors were provided than when no anchor
was provided. The intra-rater agreements in rating male breathy stimuli were
significantly higher when natural anchors were provided than when no anchor was
provided.
A second series of planned contrast procedures were carried out to determine
whether natural or synthesized anchors facilitate a higher intra-rater agreement. Table 7
lists the statistical results of the second series of planned contrasts.
Table 7: Statistical results of the second series of planned contrasts (natural versus
synthesized anchors).

Quality of      Gender of       Pre-training         Post-training
voice stimuli   voice stimuli
Rough           Male            t(1, 27) = 2.66*     t(1, 27) = 3.11*
Rough           Female          t(1, 27) = 1.61      t(1, 27) = 2.18*
Breathy         Male            N/A                  t(1, 27) = 0.23
Breathy         Female          t(1, 27) = 3.69      N/A

* indicates significantly higher intra-rater reliability when synthesized anchors were
provided (p < 0.05)
N/A - planned contrast was not carried out as no effect of anchors was found for that
set of stimuli.
Before training, the intra-rater agreement for rating male rough stimuli was
significantly higher when synthesized anchors were provided. After training, the
intra-rater agreement for rating rough stimuli (male and female) was significantly higher
when synthesized anchors were provided than when natural anchors were provided. No
significant differences were found in rating the breathy stimuli.
Inter-rater Variability
The variance of the ratings was used to determine the variability of perceptual
judgment. The variability in judging normal, mild, moderate and severe (rough and
breathy) stimuli was determined separately. These variances are listed in Table 8. The
variances ranged from 0.19 to 11.49, depending on the condition of the tests.
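Table 8: Variances of the ratings in each rating test.

Anchors       Rating          Severity of      Roughness                   Breathiness
provided      session         voice stimuli    Male          Female        Male          Female
Natural       Pre-training    Normal           1.75          1.27          0.83          1.38
Natural       Pre-training    Mild             3.62          11.49         4.76          3.06
Natural       Pre-training    Moderate         4.86          8.17          11.45         6.72
Natural       Pre-training    Severe           6.36          8.99          3.06          8.29
Natural       Post-training   Normal           2.93          1.05          2.17          [illegible]
Natural       Post-training   Mild             2.83          7.78          4.20          2.51
Natural       Post-training   Moderate         4.28          7.15          6.97          5.91
Natural       Post-training   Severe           5.56          6.02          4.03          5.44
Synthesized   Pre-training    Normal           1.21          1.00          0.71          0.65
Synthesized   Pre-training    Mild             2.22          8.60          5.43          2.10
Synthesized   Pre-training    Moderate         3.13          7.90          11.03         6.46
Synthesized   Pre-training    Severe           7.01          9.04          4.18          7.87
Synthesized   Post-training   Normal           1.63          0.89          1.02          1.41
Synthesized   Post-training   Mild             2.70          7.58          3.28          2.13
Synthesized   Post-training   Moderate         3.80          7.00          7.81          5.56
Synthesized   Post-training   Severe           5.11          6.64          2.37          5.32
No anchor     Pre-training    Normal           0.92          0.19          0.38          0.67
No anchor     Pre-training    Mild             3.00          8.75          3.92          1.17
No anchor     Pre-training    Moderate         4.66          8.17          11.23         5.15
No anchor     Pre-training    Severe           5.38          8.08          5.43          7.70
No anchor     Post-training   Normal           [illegible]   0.58          8.11          1.75
No anchor     Post-training   Mild             5.41          8.17          6.72          2.26
No anchor     Post-training   Moderate         4.18          6.93          9.50          5.74
No anchor     Post-training   Severe           8.38          6.60          11.11         6.03
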
Effect of training (training versus no training).
Separate ANOVAs were performed on the variances of each set of stimuli (female
and male) and ratings (roughness and breathiness) with pre-training and post-training
as the within-subject factor to test if there was a significant training effect. The statistical
results are listed in Table 9.
Table 9: ANOVA results on the variance of inter-rater agreement to test for training effects.
All values are F(1, 111).

Anchors       Severity     Roughness               Breathiness
provided      level        Male        Female      Male        Female
Natural       Normal       1.39        0.12        2.47        0.06
Natural       Mild         1.15        11.16*      0.29        0.79
Natural       Moderate     0.49        1.17        21.75*      0.78
Natural       Severe       0.49        7.30*       1.86        9.46*
Synthesized   Normal       0.51        0.07        0.38        2.57
Synthesized   Mild         0.28        0.94        6.87*       0.00
Synthesized   Moderate     0.60        1.16        11.69*      31.95*
Synthesized   Severe       5.14*       6.09*       7.83*       6.17*
No anchor     Normal       23.91^      3.58        35.04^      1.71
No anchor     Mild         4.13^       0.36        6.59^       3.99^
No anchor     Moderate     0.38        1.80        2.61        0.41
No anchor     Severe       8.92^       2.34        30.19^      3.07

Key: * Significantly lower variance after training (p < 0.05)
     ^ Significantly higher variance after training (p < 0.05)
The results showed that there were some significant training effects. There were
more significant improvements in the variability (i.e. lower variance) after training when
synthesized anchors were used than with natural anchors or no anchor. It should be
noted that when no anchor was used, the variances became higher after training.
Effect of Anchors.
Separate ANOVAs were performed on the variances to determine if there was a
significant difference among the three anchor conditions (natural anchors, synthesized
anchors and no anchor). Table 10 lists the F values and the corresponding values of the
Mauchly tests indicating the effect of anchors in rating different types of stimuli before
and after training. Anchor effects were found in both rating sessions. No effect of
anchors was found in rating the female stimuli after training.
19
Table 10: ANOVA results testing for effect of anchor (natural, synthesized, no anchor)
on inter-rater agreement with the associated results of Mauchly tests

Quality  Gender  Severity  Pre-training                       Post-training
Rough    Male    Normal    F(2, 110) = 1.76;   M(2) = 0.88*   F(2, 110) = 15.35*; M(2) = 0.40*
Rough    Male    Mild      F(2, 221) = 2.04;   M(2) = 0.98    F(2, 110) = 4.37*;  M(2) = 0.91*
Rough    Male    Moderate  F(2, 110) = 4.67*;  M(2) = 0.95*   F(2, 110) = 0.27;   M(2) = 0.88*
Rough    Male    Severe    F(2, 110) = 2.52;   M(2) = 0.88*   F(2, 110) = 7.18*;  M(2) = 0.78*
Rough    Female  Normal    F(2, 219) = 5.38*;  M(2) = 0.97    F(1, 135) = 0.39;   M(2) = 0.35*
Rough    Female  Mild      F(2, 218) = 7.59*;  M(2) = 0.97    F(2, 222) = 0.39;   M(2) = 0.99
Rough    Female  Moderate  F(2, 110) = 0.09;   M(2) = 0.94*   F(2, 222) = 0.07;   M(2) = 0.99
Rough    Female  Severe    F(2, 222) = 0.62;   M(2) = 0.56    F(2, 222) = 0.63;   M(2) = 0.99
Breathy  Male    Normal    F(2, 110) = 4.19*;  M(2) = 0.80*   F(2, 110) = 16.13*; M(2) = 0.88*
Breathy  Male    Mild      F(2, 222) = 1.90;   M(2) = 0.99    F(2, 110) = 5.67*;  M(2) = 0.50*
Breathy  Male    Moderate  F(2, 222) = 0.10;   M(2) = 0.99    F(2, 110) = 4.84*;  M(2) = 0.71*
Breathy  Male    Severe    F(2, 215) = 6.31*;  M(2) = 0.95    F(2, 110) = 42.42*; M(2) = 0.73*
Breathy  Female  Normal    F(2, 110) = 1.74;   M(2) = 0.76*   F(2, 110) = 0.76;   M(2) = 0.34*
Breathy  Female  Mild      F(2, 110) = 6.08*;  M(2) = 0.72    F(2, 110) = 0.26;   M(2) = 0.59*
Breathy  Female  Moderate  F(2, 110) = 20.67*; M(2) = 0.69*   F(2, 219) = 0.14;   M(2) = 0.97
Breathy  Female  Severe    F(2, 222) = 0.20;   M(2) = 0.99    F(2, 110) = 0.79;   M(2) = 0.86*
Note: The Mauchly test (M) was used to test if the data violated the sphericity
assumption. If the p value for M is larger than 0.05, univariate F statistics with
Huynh-Feldt epsilon adjusted degrees of freedom were reported. If the p value for M is less than
0.05, F values from Wilks' Lambda test were reported.
* significance level p < 0.05
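The degrees of freedom listed in Table 10 are consistent with this decision rule. The sketch below is an illustration only. It assumes three anchor conditions and 112 observations per condition (a figure inferred from the reported degrees of freedom, not stated in the note), and the function names are hypothetical.

```python
def univariate_df(n_obs, n_levels, epsilon=1.0):
    """Degrees of freedom of the univariate repeated-measures F test.

    epsilon is a sphericity correction factor such as Huynh-Feldt
    (1.0 means no correction); both df values are scaled by it.
    """
    df1 = epsilon * (n_levels - 1)
    df2 = epsilon * (n_levels - 1) * (n_obs - 1)
    return df1, df2


def multivariate_df(n_obs, n_levels):
    """Degrees of freedom of the exact F form of Wilks' Lambda
    for a single within-subject factor."""
    return n_levels - 1, n_obs - n_levels + 1


def reported_df(n_obs, n_levels, mauchly_p, epsilon, alpha=0.05):
    """Apply the rule in the table note: univariate F when sphericity
    is not rejected, Wilks' Lambda otherwise."""
    if mauchly_p > alpha:
        return univariate_df(n_obs, n_levels, epsilon)
    return multivariate_df(n_obs, n_levels)


# Assumed design: 112 observations, 3 anchor conditions.
print(univariate_df(112, 3))
print(multivariate_df(112, 3))
```

Under these assumptions the uncorrected univariate test has 2 and 222 degrees of freedom, an epsilon slightly below 1 gives entries such as F(2, 219), and the multivariate test has 2 and 110 degrees of freedom.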
Two types of planned contrasts were carried out on the data that demonstrated
significant anchor effects. The first series of planned contrasts was carried out
to determine, for each stimulus set, whether the variance was lower in the
anchor conditions or in the no-anchor condition. Table 11 lists the statistical results of the
planned contrasts.
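A planned contrast between two within-subject conditions can be sketched as follows. This is a generic illustration and not necessarily the procedure used in this study. It assumes one value per observation in each condition and relies on the equivalence between a two-condition contrast and a squared paired t statistic, which gives an F with 1 and n - 1 degrees of freedom. The function name is hypothetical.

```python
def planned_contrast_f(cond_a, cond_b):
    """Return (F, (df1, df2)) for a paired two-condition contrast.

    F is the squared paired t statistic: n * mean(d)^2 / var(d),
    where d holds the paired differences between the two conditions.
    """
    if len(cond_a) != len(cond_b):
        raise ValueError("conditions must have the same number of observations")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two paired observations are needed")
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("differences have zero variance; F is undefined")
    return n * mean_d ** 2 / var_d, (1, n - 1)


# Hypothetical variances for four observations under two conditions.
f_value, df = planned_contrast_f([1, 2, 3, 4], [0, 0, 1, 1])
print(round(f_value, 2), df)
```

With 112 paired observations the degrees of freedom would be (1, 111), which is the form reported in Table 11.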
Before training, only the male severe breathy stimuli demonstrated significantly
lower variance with natural anchors than with no anchor. Some of the remaining stimuli
demonstrated significantly lower variance with no anchor than with anchors. After training,
most of the male stimuli had significantly lower variance with anchors than with no
anchor. However, none of the female stimuli were rated differently whether anchors
were provided or not (Table 11).
Table 11: Statistical results of planned contrast between the uses of different anchors
(natural, synthesized and no anchors).

Rating         Anchors      Severity  Roughness                          Breathiness
session        provided               Male             Female            Male             Female
Pre-training   Natural      Normal    N/A              F(1,111)=10.29^   F(1,111)=7.58^   N/A
Pre-training   Natural      Mild      N/A              F(1,111)=10.84^   N/A              F(1,111)=10.23^
Pre-training   Natural      Moderate  F(1,111)=0.08    N/A               N/A              F(1,111)=2.69
Pre-training   Natural      Severe    N/A              N/A               F(1,111)=14.63*  N/A
Pre-training   Synthesized  Normal    N/A              F(1,111)=6.43^    F(1,111)=2.12    N/A
Pre-training   Synthesized  Mild      N/A              F(1,111)=0.04     N/A              F(1,111)=6.92^
Pre-training   Synthesized  Moderate  F(1,111)=1.03    N/A               N/A              F(1,111)=41.71^
Pre-training   Synthesized  Severe    N/A              N/A               F(1,111)=2.84    N/A
Post-training  Natural      Normal    F(1,111)=7.20*   N/A               F(1,111)=27.06*  N/A
Post-training  Natural      Mild      F(1,111)=6.27*   N/A               F(1,111)=4.43*   N/A
Post-training  Natural      Moderate  N/A              N/A               F(1,111)=9.14*   N/A
Post-training  Natural      Severe    F(1,111)=9.04*   N/A               F(1,111)=49.72*  N/A
Post-training  Synthesized  Normal    F(1,111)=20.16*  N/A               F(1,111)=27.76*  N/A
Post-training  Synthesized  Mild      F(1,111)=8.30*   N/A               F(1,111)=9.45*   N/A
Post-training  Synthesized  Moderate  N/A              N/A               F(1,111)=4.09*   N/A
Post-training  Synthesized  Severe    F(1,111)=14.50*  N/A               F(1,111)=85.12*  N/A

Key: * Significantly higher variance when no anchor was used (p < 0.05).
^ Significantly lower variance when no anchor was used (p < 0.05).
N/A - planned contrast was not carried out as no effect of anchors was found for
that set of stimuli.
A second series of planned contrasts was carried out to determine which anchors,
natural or synthesized, yielded a lower variance. Table 12 lists the statistical results
of these planned contrasts. Only four instances of significant difference were found when
comparing natural and synthesized anchors. Three of them showed variances
that were significantly lower when synthesized anchors were provided. Three instances
of significant difference were found before training (Table 12).
Table 12: Statistical results of planned contrast between natural and synthesized anchors.

Quality  Gender  Severity  Pre-training     Post-training
Rough    Male    Normal    N/A              F(1,111)=3.16
Rough    Male    Mild      N/A              F(1,111)=0.03
Rough    Male    Moderate  F(1,111)=6.24*   N/A
Rough    Male    Severe    N/A              F(1,111)=0.56
Rough    Female  Normal    F(1,111)=0.00    N/A
Rough    Female  Mild      F(1,111)=10.29*  N/A
Rough    Female  Moderate  N/A              N/A
Rough    Female  Severe    N/A              N/A
Breathy  Male    Normal    F(1,111)=0.26    F(1,111)=1.31
Breathy  Male    Mild      N/A              F(1,111)=2.86
Breathy  Male    Moderate  N/A              F(1,111)=2.75
Breathy  Male    Severe    F(1,111)=3.10    F(1,111)=7.50*
Breathy  Female  Normal    N/A              N/A
Breathy  Female  Mild      F(1,111)=3.06    N/A
Breathy  Female  Moderate  F(1,111)=28.78^  N/A
Breathy  Female  Severe    N/A              N/A

Key: * Significantly lower variance when synthesized anchors were provided
^ Significantly higher variance when synthesized anchors were provided
N/A - planned contrast was not carried out as no effect of anchors was found for
that set of stimuli.
Summary of results
The results showed that both intra-rater agreement and inter-rater variability
improved with training and when anchors were provided. When no anchor was provided,
training did not have an effect on intra-rater agreement, but caused an increase in
inter-rater variability. When anchors were provided, the intra-rater agreement for rough
stimuli improved more than that for breathy stimuli, while the anchors were equally
effective in reducing the inter-rater variability for breathy and rough stimuli.
The use of anchors affected both the intra-rater agreement and inter-rater
variability before and after training. Generally, before training, the intra-rater agreement
and inter-rater variability were better when no anchor was provided than when anchors
were provided. Although the natural anchors only improved the intra-rater agreement for
the male breathy stimuli, they helped to reduce the inter-rater variability for the male
breathy and rough stimuli. The synthesized anchors were effective in improving the
intra-rater agreement and inter-rater variability for male breathy and rough stimuli after
training. No anchor effect was noticed for rating the female stimuli.
DISCUSSION
The first objective was to determine whether training improved the reliability and
reduced the variability of perceptual voice evaluation. As predicted, training generally
improved the reliability and reduced the variability. The training program used in
this study involved a stimulus-response-feedback-stimulus paradigm. The listeners were
presented with a stimulus and were then asked to give a rating (response). This was followed by
a modal answer (feedback), and lastly the stimulus was presented again. This paradigm
required the listeners to compare and eventually replace their internal standards with the
external standards.
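The order of events in one training trial can be sketched as below. The callback names (play, get_rating, show_feedback) and the data layout are hypothetical placeholders, since the software used in the training session is not described here.

```python
def run_training(stimuli, modal_ratings, play, get_rating, show_feedback):
    """Run one pass of a stimulus-response-feedback-stimulus paradigm.

    stimuli: sequence of stimulus identifiers (assumed layout).
    modal_ratings: dict mapping each stimulus to its modal (reference) rating.
    play, get_rating, show_feedback: caller-supplied callbacks (hypothetical).
    Returns a log of (stimulus, listener_rating, modal_rating) tuples.
    """
    log = []
    for stimulus in stimuli:
        play(stimulus)                    # 1. stimulus
        rating = get_rating(stimulus)     # 2. response
        modal = modal_ratings[stimulus]
        show_feedback(modal)              # 3. feedback (modal answer)
        play(stimulus)                    # 4. stimulus presented again
        log.append((stimulus, rating, modal))
    return log


# Example with stand-in callbacks that only record what happened.
events = []
trial_log = run_training(
    ["voice_01"],
    {"voice_01": 3},
    play=lambda s: events.append(("play", s)),
    get_rating=lambda s: 5,
    show_feedback=lambda m: events.append(("feedback", m)),
)
print(trial_log)
```

Replaying the stimulus after the feedback gives the listener a chance to compare the sound directly with the modal rating.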
Although training was generally effective in improving the reliability and
reducing the variability of perceptual voice evaluation, this improvement was limited to
evaluations made with anchors. This showed that training alone was not sufficient to
improve the reliability and reduce the variability of perceptual voice evaluation. Anchors
were also needed to ensure better reliability and lower variability. After training,
the listeners rated the voice qualities with more reliability and less variability with
anchors than with no anchor. This supported the view that the internal standards for
pathological voice qualities were unstable.
Contrary to the hypothesis of this study, instead of improving the reliability and
variability, the use of anchors actually led to lower reliability and higher variability in
the pre-training session. However, reliability was improved when anchors were provided
in the post-training session as predicted. This could be attributed to the adjustment that
the listeners had to make to their internal standards when the external standards (anchors)
were provided. It seemed that training provided an opportunity for the listeners to adjust
to the external standards.
Although previous studies emphasized the importance of using external standards
in perceptual voice evaluation, this study showed that establishing the internal standards
was equally important. If the listeners' internal standards are not stable or not well
formed, the listeners are less likely to make use of the external standards because they
may not be able to pick up the relevant and distinctive features from the external
standards. The results of this study showed that training generally improved the
intra-rater agreement and the inter-rater variability in perceptual voice evaluation. Given that the
training program took only around 30 minutes to complete and needed only minimal
guidance, it can be recommended for teaching perceptual voice evaluation to naive
listeners.
The second objective was to test the effect of natural and synthesized anchors on
the reliability and variability of perceptual voice evaluation. Generally, the results
suggested that the synthesized anchors and the natural anchors were both effective in
improving the intra-rater agreement and inter-rater variability. However, the synthesized
anchors were better in achieving intra-rater agreement than the natural anchors. This
result did not support the hypothesis that natural anchors were better than the
synthesized anchors in achieving better intra-rater agreement and reducing inter-rater
variability. This could be because for the natural voice anchors, it was difficult to find
samples with isolated abnormal voice qualities (roughness or breathiness only). The
abnormal voice qualities often co-existed and this occurred in the natural voice anchors
used in this study. However, with the synthesized voice anchors, the two qualities were
created independently and were therefore distinguishable in isolation. The listeners
might have found it more difficult to focus on just one quality with the use of the natural
anchors, whereas with the synthesized anchors, the listeners might have found it easier
to distinguish between different voice qualities. However, natural anchors were still
effective in improving inter-rater variability because even though the listeners may not
focus on the same voice quality of the natural anchors each time they listened to them,
the information drawn from the natural anchors would be more stable than that drawn
from the listeners' internal standards. In addition, all listeners used the same set of
external standards (the anchors); thus, the listeners' mental representations of the voice
qualities (roughness and breathiness) should be similar.
The anchors did not seem to improve the intra-rater agreement for rating breathy
voice stimuli, although they were useful in reducing the inter-rater variability. However,
the anchors were effective in improving the intra-rater agreement and inter-rater
variability for rating rough voice stimuli. A possible interpretation is that, because
breathiness is perceptually more distinctive than roughness, the listeners could rate the
breathiness severity consistently with or without anchors or training. However, when no anchors or
training were provided, the judgments of breathiness severity varied from
listener to listener. It was training and anchors that reduced the variability between
the listeners' judgments of breathiness. This suggested that even with
well-developed internal standards, the external standards are effective for calibrating the
internal standards between the listeners.
Although training and external standards improved the listeners' agreement and
variability in perceptual voice evaluation, the intra-rater agreements were still between
61% and 76% while the variances were between 0.89 and 7.78. This suggests that the amount
of training included in this study may not be enough to shape the listeners' internal
standards for pathological voice qualities. In addition, the number of voice stimuli
included in the anchors may not be enough. Further studies are recommended to test if
additional training or anchors can improve the intra-rater agreement. If both fail to
improve the intra-rater agreement, it would reflect the limitation of perceptual voice
evaluation.
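For reference, a percentage agreement figure of this kind can be computed as in the sketch below. It assumes that intra-rater agreement is the percentage of repeated stimuli that a listener rated identically on both presentations. The optional tolerance parameter and the function name are hypothetical, and the exact criterion used in this study may differ.

```python
def intra_rater_agreement(first_ratings, second_ratings, tolerance=0):
    """Percentage of repeated stimuli rated consistently by one listener.

    first_ratings, second_ratings: the listener's ratings of the same
    stimuli on two presentations, in the same order (assumed layout).
    tolerance: largest rating difference still counted as agreement
    (0 means exact agreement only).
    """
    if len(first_ratings) != len(second_ratings):
        raise ValueError("both presentations must cover the same stimuli")
    if not first_ratings:
        raise ValueError("at least one repeated stimulus is needed")
    matches = sum(
        1 for a, b in zip(first_ratings, second_ratings)
        if abs(a - b) <= tolerance
    )
    return 100.0 * matches / len(first_ratings)


# One hypothetical listener, four repeated stimuli, one mismatch.
print(intra_rater_agreement([1, 2, 3, 4], [1, 2, 4, 4]))
```

Allowing a tolerance of one scale point would count near misses as agreement and would therefore raise the figure.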
In summary, training was found to improve the intra-rater agreement and inter-rater
variability when anchors were provided. This suggested that improving the stability
of the internal standards was as important as providing external standards in order
to improve the intra-rater agreement and inter-rater variability. It was found that listeners
showed better intra-rater agreement and lower inter-rater variability when anchors were
provided than when no anchor was provided. This supported the idea that the listeners'
internal standards for pathological voice qualities were unstable. Synthesized anchors
were more effective in improving the intra-rater agreement and inter-rater variability
than natural anchors in this study. This suggested that listeners found it easier to
distinguish between the voice qualities in synthesized signals than in natural voice
samples.
LIMITATIONS OF THE PRESENT STUDY
It was difficult to compare the responses to male and female stimuli, as
they may involve different perceptual frameworks for male and female voices. In addition,
male and female listeners may have different perceptual frameworks; however, as the
gender of the listeners was not balanced in this study (26 females and 2 males), no
comparison could be made.
No conclusion could be drawn on the nature of the internal standards. However, it
is hypothesized that the listeners form a prototype of each quality before they build up
their representations of different severity levels. Further analysis of the present data may
reflect the nature of the internal standards by examining whether the listeners are better
in the detection of the presence of the qualities than in determining the severity level
under different conditions.
CONCLUSION
Based on the results of this study, it is recommended that synthesized voice
stimuli be provided as external standards in perceptual voice evaluation protocols. An evaluation
protocol similar to the one used in this study is recommended as an assessment tool for
future voice research and clinical practice.
Further study is recommended to test whether the intra-rater agreement and inter-rater
variability can be further improved when additional severity levels are included in the
anchors and when the training is more extensive. This is important for ensuring that
perceptual voice evaluation is a valid assessment tool for research and clinical practice.
ACKNOWLEDGEMENTS
Special thanks to Dr. Edwin Yiu, Ms. Emily Chan and Mr. Miroslav Kümmel for their
guidance and support throughout the development of this study. In addition, I would like
to express my sincere gratitude to the following people for participating in my study: Ms.
Sabina Chan, Ms. Pance Kung, Ms. Polly Lau, Ms. Elsa Wong, Ms. Cynthia Woo and all
Year I and III students at the Department of Speech and Hearing Sciences.
REFERENCES
ANDERS, L., HOLLIEN, H., HURME, P., SONNINEN, A., & WENDLER, J. (1988).
Perception of hoarseness by several classes of listeners. Folia Phoniatrica, 40, 91-100.
BASSICH, C.J., & LUDLOW, C.L. (1986). The use of perceptual methods by new
clinicians for assessing voice quality. Journal of Speech and Hearing Disorders, 51, 125-
DE KROM, G. (1995). Some spectral correlates of pathological breathy and rough voice
quality for different types of vowel fragments. Journal of Speech and Hearing Research,
38, 794-811.
GELFER, M.P. (1988). Perceptual attributes of voice: Development and use of rating
scales. Journal of Voice, 2, 320-326.
GERRATT, B.R., KREIMAN, J., ANTONANZAS-BARROSO, K., & BERKE, G. (1993).
Comparing internal and external standards in voice quality judgements. Journal of Speech
and Hearing Research, 36, 14-20.
HAMMARBERG, B., FRITZELL, B., GAUFFIN, J., SUNDBERG, J., & WEDIN, L.
(1980). Perceptual and acoustic correlates of abnormal voice qualities. Acta
Otolaryngologica, 90, 441-451.
KREIMAN, J., GERRATT, B.R., & BERKE, G.S. (1994). The multidimensional nature
of pathological voice quality. Journal of the Acoustical Society of America, 96, 1291-1302.
KREIMAN, J., GERRATT, B.R., KEMPSTER, G.B., ERMAN, A., & BERKE, G.S.
(1993). Perceptual evaluation of voice quality: review, tutorial, and a framework for future
research. Journal of Speech and Hearing Research, 36, 21-40.
KREIMAN, J., GERRATT, B.R., PRECODA, K., & BERKE, G.S. (1992). Individual
differences in voice quality perception. Journal of Speech and Hearing Research, 35,
512-520.
MARTIN, D.P., & WOLFE, V.I. (1996). Effects of perceptual training based upon
synthesized voice signals. Perceptual and Motor Skills, 83, 1291-1298.
MORAN, M.J., & GILBERT, H.R. (1984). Relation between voice profile ratings and
aerodynamic and acoustic parameters. Journal of Communication Disorders,
17, 245-260.
SHEWELL, C. (1998). The effect of perceptual training on ability to use the vocal
profile analysis scheme. International Journal of Language and Communication Disorders, 33,
S322-326.
YIU, E. (1999). Perceptual voice properties of synthesized voice signals. Project in
progress.
YUMOTO, E., SASAKI, Y., & OKAMURA, H. (1984). Harmonics-to-noise ratio and
psychophysical measurement of the degree of hoarseness. Journal of Speech and Hearing
Research, 27, 2-6.