Regression Tree Approach to Studying Factors Influencing Acoustic

Original Paper
Folia Phoniatr Logop 2006;58:274–288
DOI: 10.1159/000093184
Regression Tree Approach to Studying
Factors Influencing Acoustic Voice Analysis
Dimitar D. Deliyski a Heather S. Shaw a Maegan K. Evans a
Roumen Vesselinov b
a
Department of Communication Sciences and Disorders, and b Department of Statistics,
University of South Carolina, Columbia, S.C., USA
Key Words
Acoustic voice analysis Voice assessment Fundamental frequency Perturbation Data acquisition Environmental noise
Abstract
Multiple factors influence voice quality measurements (VQM) obtained during an
acoustic voice assessment including: gender, intrasubject variability, microphone, environmental noise (type and level), data acquisition (DA) system, and analysis software.
This study used regression trees to investigate the order and relative importance of
these factors on VQM including interaction effects of the factors and how the outcome
differs when the acoustic environment is controlled for noise. Twenty normophonic
participants provided 20 voice samples each, which were recorded synchronously on
five DA systems combined with six different microphones. The samples were mixed
with five noise types at eight signal-to-noise ratio (SNR) levels. The resulting 80,000 audio samples were analyzed for fundamental frequency (F0), jitter and shimmer using
three software analysis systems: MDVP, PRAAT, and TF32 (CSpeech). Fifteen regression
trees and their Variable Importance Measures were utilized to analyze the data. The
analyses confirmed that all of the factors listed above were influential. The results suggest that gender, intrasubject variability, and microphone were significant influences
on F0. Software systems and gender were highly influential on measurements of jitter
and shimmer. Environmental noise was shown to be the prominent factor that affects
VQM when SNR levels are below 30 dB.
Copyright © 2006 S. Karger AG, Basel
© 2006 S. Karger AG, Basel
1021–7762/06/0584–0274$23.50/0
Fax +41 61 306 12 34
E-Mail [email protected] Accessible online at:
www.karger.com/fpl
www.karger.com
Dimitar Deliyski, PhD
Communication Sciences and Disorders
University of South Carolina, 1621 Greene Street
Columbia, SC 29208 (USA), Tel. +1 803 777 2245
Fax +1 803 777 3081, E-Mail [email protected]
Introduction
Accuracy of acoustic voice analysis is essential when used to assist clinicians in
the appropriate diagnosis and treatment of vocal pathologies. Multiple factors influence voice quality measurements (VQM) obtained during an acoustic voice assessment. These factors include: gender [1, 2], intrasubject variability (token) [3, 4],
microphone [5], environmental noise [6–9], data acquisition (DA) system [8, 10],
and analysis software [11]. Knowledge related to the order and extent of influence
of these factors has an immediate significance for speech-language pathologists who
utilize computer-based acoustic VQM. It is important for clinicians to be aware of
these factors and the degree of their influence in order to reduce their impact on
VQM.
Gender
It is widely accepted that male and female voices are fundamentally different
as a result of basic physiological characteristics, which cause significant differences
in fundamental frequency (F0) [2]. Fluctuations in F0 and increased noise in female
phonations are also related to anatomical characteristics [12]. According to Baken
and Orlikoff [2], the relationship between gender and shimmer measurements is not
clear.
Intrasubject Variability
A technique utilized to control for intrasubject variability is to obtain the mean
VQM from multiple tokens. A study by Scherer et al. [4] investigated the number of
tokens needed to acquire a representative perturbation value. For stable voices, at
least 6 tokens were recommended and for voices with normal to high levels of instability, at least 15 tokens were recommended. The National Center for Voice and
Speech (NCVS) recommends that 10 tokens are needed to obtain reliable perturbation measures [3].
Microphone
A study by Titze and Winholtz [5] demonstrated that the type of microphone
used has a significant impact on acoustic voice analysis. The results demonstrated
that: condenser microphones and microphones with a balanced output performed
the best. Microphone sensitivity and distance were found to have the largest effect
on perturbation measures. NCVS [3] recommends that a professional grade condenser microphone with a minimum sensitivity of –60 dB should be used; specifically, a miniature head-mounted microphone with balanced output and a mouth-tomicrophone distance less than 10 cm. A recent study by Deliyski et al. [10], which
investigated the effects of DA hardware on VQM, was in agreement with Titze and
Winholtz [5] and Titze [3].
Environmental Noise
Noise added by the acoustic environment and data acquisition hardware becomes inseparable from noise originating from the larynx in perturbation measurements. According to NCVS [3], recordings should be made in a sound-treated room
with ‘ambient noise less than 50 dB’. Ingrisano et al. [6] investigated the effect of
noise on computer-based analyses of voice samples and found that jitter and shim-
Regression Tree Approach in Acoustic Voice
Analysis
Folia Phoniatr Logop
2006;58:274–288
275
mer measurements increased as noise floors increased. Carson et al. [8] examined
the effect of noise on computer-based analyses of voice samples and suggested that
appropriate recording standards are needed to obtain valid and reliable results relative to voice production samples, recording processes, and analysis systems. Deliyski et al. [9] suggested that an instrumental measurement of the noise present in the
acoustic environment is imperative when acoustic analysis is used for clinical voice
assessment. A recommended signal-to-noise ratio (SNR) level of 42 dB was established, which allowed for errors due to noise in up to 1% of voice measurements. A
level to produce reliable results was set at 25 dB SNR maintaining all noise-contributing factors in an acoustic environment within an error rate of 5%.
Data Acquisition Systems
Deliyski et al. [10] investigated the relative performance of different DA environments (microphone-to-DA-system combinations) and the relationship between
their technical characteristics and VQM performance. The effective dynamic range,
discretization error, and differences in VQM parameters as well as the relationship
between these factors for 18 DA environments were compared. The effective dynamic range of the DA environments was the main factor influencing shimmer and
had a significant influence on jitter. Discretization error was found to be the main
factor influencing the accuracy of F0 across DA environments. The importance of
selecting high-quality DA systems combined with microphones that provide high
SNR was emphasized.
Software
The analysis software is another factor that significantly affects VQM, but research on this topic is very limited and inconclusive. The average F0 is a time-calibrated measurement. However, jitter and shimmer differences due to software are
difficult to assess in the absence of a calibrated reference given that jitter and shimmer are acoustic estimates of the frequency and amplitude perturbations of the glottal velocity waveform. Depending on the pitch extraction algorithm these estimates
can differ substantially [11].
Purpose and Research Questions
Regression trees derived from the Classification and Regression Trees statistical method (CART) [13] allowed for the representation of the interaction of the
aforementioned factors and the order in which the factors influence F0, frequency
perturbation (jitter), and amplitude perturbation (shimmer), as measured by three
software systems. As a logical extension of previous findings [9], the investigators
were interested in whether the factors would differentially influence VQM over
three noise level ranges. Range A included recommended noise levels only (from 66
to 42 dB SNR). Range B consisted of the recommended plus reliable levels (from
66 to 26 dB SNR). Range C included all noise levels studied (from 66 to 10 dB
SNR).
The specific research questions were as follows:
(1) What are the interaction effects of gender, intrasubject variability, microphone, noise type, noise level, DA system, and analysis software on VQM? This
276
Folia Phoniatr Logop
2006;58:274–288
Deliyski /Shaw /Evans /Vesselinov
question was answered by building an optimal and validated model for each measurement: F0, jitter and shimmer, using regression trees.
(2) What is the order and relative importance of gender, token, microphone,
noise type, noise level, DA system and analysis software as factors influencing VQM?
This question was answered by analyzing the results of the Variable Importance
Measures (VIM) for each regression tree.
(3) How do these influences differ depending on whether the acoustic environment was controlled for noise or not? This question was answered by comparing the
VIM for the three SNR ranges.
Method
The data utilized to answer the abovementioned research questions were obtained using the
instrumentation and procedures previously detailed in Deliyski et al. [9].
Instrumentation and Procedures
The instrumentation included: computers, soundcards, microphones, and VQM software.
Five DA systems, comprised of computers and soundcards, were utilized in this study. The DA
systems included: three desktop computers, one with Computerized Speech Lab Model 4400 by
Kay Elemetrics Corp. (CSL), one with an enhanced music soundcard (DT1), and one with a builtin soundcard (DT1), and two laptop computers, one of which (LT1) was coupled to a preamplifier, and the second one (LT2) utilized a built-in soundcard. The DA systems were coupled to six
microphones and rotated to balance for microphone-to-DA-system interactions. The microphones included: three desktop condenser microphones (C0, C1, C2), of which C0 had balanced
output, two desktop dynamic microphones (D1, D2), and a head-mount condenser microphone
with balanced output (H0). Three voice analysis software systems were used in this study: MDVP
[14, 15], PRAAT (version 4.1.5) [16, 17], and TF32 (previously known as CSpeech) [18, 19]. The
DA hardware, microphones, and voice analysis software were selected based on their frequency
of use worldwide for clinical and research voice applications.
Twenty normophonic individuals (10 men, 10 women) participated in the study. They represented a large age range (25–61 years for females and 23–64 years for males) with equivalent
mean ages of 39 years. The participants seated in a sound-attenuated booth provided 20 sustained
phonations of the vowel /a/ in a habitual pitch for up to 10 s. All participants were positioned at
a distance of 30 cm from the desktop microphones. The head-mount microphone, when used,
was positioned at a distance of 4 cm and an angle of 45° from the participant’s mouth. The intensity was measured at 30 cm distance using a digital sound level meter. The investigators maintained the intensity within 88 dB (A) 8 3 dB by cueing the participants to raise or lower their
volume. This intensity was utilized to secure an acoustic environment with an invariant SNR of
66 dB at 22 dB (A) ambient noise level within the sound booth. These phonations were recorded
in four sessions, 30 min apart, between which the microphones were rotated amongst the DA
systems. The two microphones with balanced output, C0 and H0, were alternated for each rotation on CSL, the only DA system with balance input. The other four microphones, C1, C2, D1,
and D2, were rotated amongst the remaining four DA systems, DT1, DT2, LT1, and LT2. Rotations allowed for producing all 18 possible microphone-to-DA-system combinations. Five tokens
(1, 2, 3, 4, 5) were obtained from each participant during each recording session. In order to reduce the learning effect, a preliminary session was recorded and discarded. All recordings were
made simultaneously on all five DA systems at a sampling rate of 44.1 kHz and a 16-bit quantization. Pulses were presented at the beginning and at the end of each recording session in order
to synchronize the five DA systems. All tokens were viewed and a 4-second portion of each type
1 token, according to NCVS recommendations [3], was selected.
Five noise types, chosen to represent the most common noise interference possibilities in a
realistic assessment environment, were recorded with a high-quality condenser desktop micro-
Regression Tree Approach in Acoustic Voice
Analysis
Folia Phoniatr Logop
2006;58:274–288
277
phone. The noise types were inclusive of: fan noise from a computer (Fan), 60-Hz penetration
noise from the A/C power (60 Hz), band-limited white noise (White), background speech (Talk),
and noise from the city (City). The five types of noise were digitally mixed, using Matlab version
6.5, at precise SNR with the 66 dB (Clean) audio samples, resulting in a total of eight noise levels:
66, 58, 50, 42, 34, 26, 18, and 10 dB SNR.
As a result of recording 5 tokens in four separate rotation sessions for 20 participants simultaneously on five DA systems, a total of 2,000 Clean 4-second audio samples were obtained.
These 2,000 audio samples were then mixed with five types of noise at eight SNR levels producing a total of 80,000 samples. Three types of VQM, F0, jitter (RAP, %jit and jitter (rap)), and
shimmer (Shim, %shm, and shimmer (local)) were measured using the three software analysis
systems, MDVP, TF32, and PRAAT, respectively. The specific jitter and shimmer measures
across the three programs were chosen based on the similarity of order of the perturbation function. The nine parameters resulted in a total of 720,000 output values for statistical analysis. The
dataset was divided into three noise level ranges to answer the research questions. For range A,
40,000 acoustic samples were analyzed at four noise levels (66, 58, 50, 42 dB SNR). For range B,
60,000 acoustic samples were analyzed resulting from six noise levels (66, 58, 50, 42, 34, 26 dB
SNR) and range C included 80,000 acoustic samples that resulted from all eight noise levels (66,
58, 50, 42, 34, 26, 18, 10 dB SNR).
Analysis
The CART approach was selected since it is a binary recursive partitioning method, which
is a powerful discovery tool for data with a complex structure [13, 20, 21]. It has been shown that
recursive partitioning improves the accuracy of conventional models by an average of 10–15%
[20, 22]. The goal in a regression tree approach is ‘to partition the data into relatively homogeneous (low standard deviation) terminal nodes, and obtain the mean value observed in each node
as the node’s predicted value’ [20]. Tree building is initiated with the first binary split of the most
important variable termed the ‘parent’ node. Subsequently, this ‘parent’ node is split repetitively
until a terminal node is reached. For the purposes of this study, the terminal node was forced
when the next split included less than 200 tokens. For each node, the variable that provides the
best split, as determined by the split with the minimum error, is included in the tree. The splitting criterion utilized for this study was the Gini impurity coefficient. For each split, a list of
‘competitors’ and ‘surrogates’ are generated. Competitors might replace the ‘best’ variable at a
particular split with a change in the structure of the split. A surrogate is a variable that might replace or ‘mimic’ the ‘best’ variable without much change in the split. At any split, the model attempts to predict the dependent variable, and the mean for each node (within each node in
fig. 1–3) is the predicted value.
Since trees may be sensitive to random noise/error in the data, a validation process is commonly employed when building regression trees. The validation technique used by CART, crossvalidation, allows for building very robust models that are superior to other tree and standard
regression models [20]. For this study, ‘10-fold’ cross-validation was employed. The data was divided randomly into 10 equal subsamples (each containing 10% of the data) and the tree building
process was repeated 10 times. First, a tree was built using the first 9/10 of the data, while the
remaining 1/10 (subsample No. 1) was used to estimate the error rate. Then, the process was repeated on another 9/10 of the data, until each of the 10 subsamples had been utilized to provide
an error rate. The resulting subsample error rates for the 10-fold sample iterations were combined
to form the error rates for each tree. This procedure ensured the estimation of the independent
predictive accuracy of the tree and the confidence that the resulting tree could be generalized to
a completely different set of data.
Once the tree is built and validated, the importance of each variable can be ascertained. At
each splitting point, both the most influential variable and its surrogates are determined by
CART. With conventional models, the importance of one variable is often ‘masked’ by another
variable. For example, in a model using a stepwise procedure, the surrogates would be dropped
out of the equation and their actual importance would be obscured. CART solves this ‘masking’
problem by taking into account the improvement measure not only for the primary splitter, but
also for the surrogates. The variable importance score is calculated by ‘looking at the improve-
278
Folia Phoniatr Logop
2006;58:274–288
Deliyski /Shaw /Evans /Vesselinov
Fig. 1. Schematic of the regression tree for F0 based on range B with the reported mean F0 values
and number of samples at each node.
Regression Tree Approach in Acoustic Voice
Analysis
Folia Phoniatr Logop
2006;58:274–288
279
131.1
1,800
CSL
132.7
16,200
C0, C1, C2,
D1, D2
132.9
14,400
DT1, DT2,
LT1, LT2
134.7
1,800
133.6
1,800
CSL
136.1
1,800
H0
134.8
18,000
132.9
18,000
H0
2
1
Microphone
Gender
239.3
1,200
Token
237.2
10,800
C0
Top of tree
134.8
14,400
237.4
12,000
MDVP,
PRAAT
H0, C1, C2,
D1, D2
137.6
18,000
5
DT1, DT2,
LT1, LT2
C0, C1, C2,
D1, D2
136.3
36,000
3, 4
134.7
16,200
135.6
90,000
Male
187.8
180,000
238.8
6,000
Noise level
238.1
1,800
H0
240.8
1,800
C0
239.4
18,000
2
Software
DA system
C0, C1, C2,
D1, D2
239.1
5,400
TF32
Noise type
236.5
600
H0
237.9
18,000
1
Female
240.0
90,000
239.4
14,400
C1, C2,
D1, D2
240.0
18,000
3
242.8
3,600
C0
241.4
36,000
4, 5
241.3
32,400
H0, C1, C2,
D1, D2
Fig. 2. Schematic of the regression tree for jitter based on range B with the reported mean jitter
values and number of samples at each node.
280
Folia Phoniatr Logop
2006;58:274–288
Deliyski /Shaw /Evans /Vesselinov
0.145
600
C0
0.235
3,000
Token
Microphone
Gender
0.134
10,800
0.147
600
H0
C0, C1,
C2, D1, D2
Top of tree
0.130
1,200
2, 5
0.133
12,000
H0
1, 3, 4
Noise level
Noise type
PRAAT
0.144
60,000
0.253
2,000
0.124
600
Software
DA system
0.140
600
C0
3
0.291
4,800
0.146
5,400
0.153
30,000
CSL, DT2,
LT1, LT2
0.166
3,000
0.150
8,400
0.355
6,000
0.162
3,600
0.149
1,800
LT1
0.397
6,000
CSL, DT1,
DT2, LT2
0.158
6,600
C0, C2,
D1, D2
0.361
9,000
0.156
8,400
1, 5
0.158
12,000
C1, H0
0.144
600
DT1
C0, C2,
D1, D2
0.319
9,000
CSL, DT1, LT1
Female
0.328
16,500
2, 4
0.151
12,000
C1, H0
0.154
3,600
0.304
7,500
C0, C1,
C2, D1
0.375
15,000
2–4
1, 5
C0, C2, D2
Female
0.354
30,000
0.333
15,000
DT2, LT2
H0, D1, C1
MDVP
0.335
60,000
CSL, DT2,
LT1, LT2
0.321
24,000
H0, D2
H0, C1,
C2, D1, D2
1–3, 5
0.316
30,000
Male
0.146
6,000
0.299
6,000
DT1
0.328
1,200
4
C0, C1,
C2, D1, D2
0.128
5,400
4
0.128
6,000
0.331
2,000
0.292
4,000
Female
Male
2, 5
C0, C1,
H0
C2, D1, D2
0.141
5,400
3
0.142
6,000
0.257
3,000
66
0.264
10,000
0.246
6,000
Female
Male
0.135
30,000
Male
0.251
15,000
2, 4, 5
Male
0.256
25,000
0.262
10,000
1, 3
H0, C1,
C2, D1, D2
0.139
5,400
0.140
6,000
0.231
20,000
0.257
5,000
1
2–5
1
0.236
25,000
Female
0.246
50,000
58–26
TF32
0.249
60,000
0.243
180,000
Fig. 3. Schematic of the regression tree for shimmer based on range B with the reported mean
shimmer values and number of samples at each node.
Regression Tree Approach in Acoustic Voice
Analysis
Folia Phoniatr Logop
2006;58:274–288
281
2.136
200
1
1.950
1,000
26
2.709
5,400
C0, C1, C2,
D1, D2
2.688
6,000
Fan,
60 Hz,
City
1.904
800
2, 3,
4, 5
1.603
6,000
CSL
2.498
600
H0
3.144
2,000
White
2.824
10,000
26
Microphone
Gender
1.645
2,500
C0
1.422
2,500
H0
2.117
4000
66–34
1
2.168
4,800
2.612
9,000
1
2.425
800
26
2.200
3,200
66
1.911
19,200
1.853
16,000
58–26
2, 3, 4, 5
1.736
1,500
1, 2, 3
2.003
500
26
1.745
3,000
H0
3.539
3,600
26
66–34
2.516
2,700
2.582
4,500
26
Male
2.371
30,000
2.681
1,800
3.505
27,000
66–34
2.484
10,000
C2,
D2
2.413
22,500
TF 32
2.356
12,500
C0, C1,
D1
2.131
60,000
66–34
CO, C1, C2,
D1, D2
2.441
27,000
White,
Talk
3.744
5,400
26
3, 4, 5
3.545
32,400
C0, C1, C2,
D1, D2
3.462
54,000
3.298
18,000
3.338
21,600
1, 2
Fan,
4, 5 60 Hz,
City
1.630
1,000
1.693
2,500
66–34
2, 3, 4
3.397
60,000
Male
2.680
180,000
2.808
3,000
66–34
2.750
5,000
2.665
2,000
White,
1, 5
Talk
2.807
6,000
H0
3.256
400
3.092
1,000
26
DT1, DT2,
LT1, LT2
2.983
600
Fan,
60 Hz,
City
1.962
24,000
2.442
36,000
2, 3,
4, 5
C0, C1, C2,
D1, D2
2.476
45,000
66–34
DA system
Software
2.954
120,000
MDVP, PRAAT
Noise level
Noise type
2.449
50,000
1.890
30,000
Female
2.125
3,000
2, 3, 4
2.206
5,000
H0
66–34
1.533
5,000
2.329
2,000
1, 5
2.910
2,000
Talk
2.511
60,000
Female
Token
Top of tree
ment measure attributable to each variable in its role as a surrogate to the primary split. The
values of these improvements are summed over each node, totaled for the tree, and scaled relative to the best performing variable’ [20] forming the VIM. That is, the variable with the highest
score is set to 100% and the remaining variables are scaled relative to this variable. The ranking
is relative to the tree structure, thus, it is very important to validate each tree to obtain results
that can be generalized.
The 15 regression trees discussed below were generated by the CART program version 5
[21], and provide a hierarchy of the influential factors. Three noise level ranges for the three types
of VQM parameters resulted in nine trees. Six additional trees were utilized to compare measurements of F0 across gender for each range. The factors included in the first nine trees were: gender,
token, microphone, noise type, noise level, DA system, and software system. The factors included in the six trees for F0 were the same with the exception of gender. After the tree was completed, VIM were obtained for all factors included in the tree. The tree structure and VIM allowed
for the assessment of the relative influence of each factor on VQM in the presence of the other
variables in the model.
Results and Discussion
Regression Trees
The interpretations of the CART statistical models provide valuable information related to the interaction effects of the factors on VQM. The regression trees for
each VQM parameter are shown in figures 1–3. These trees are based on results utilizing range B noise levels. This range was shown to provide information on the impact of noise level and type, while restricting the possible dominance of these factors
due to invalid VQM [9].
Fundamental Frequency. The optimal regression tree for F0 range B resulted in
13 levels and 130 end nodes. This tree was reduced to five levels to provide a schematic and to limit the complexity of interactions investigated (fig. 1). For males,
there were no visible variations in the interaction terms amongst the factors with a
standard order: gender, token, microphone, and DA system. For females, a greater
variation in the interaction terms was observed. The first two factors were gender
and token. Then, the difference between tokens influenced the next factor. For tokens 2 through 5, the following and final factor was microphone. However for token
1, the next factor was software followed by microphone. Apparent differences between token 1 and the rest of the tokens for female samples caused TF32 to measure
F0 differently from MDVP and PRAAT, thus software became an important variable. Software might be an important factor for measuring F0 of female voices because of the higher probability of pitch extraction gross errors compared to males.
Some pitch extraction algorithms may be more vulnerable to gender than others. The
cause of the different interactions following token 1 for females remains unexplained.
Jitter. The optimal regression tree for jitter range B resulted in 13 levels and
175 end nodes. As a dependent variable, jitter appears 35% more complex than F0
when measured by end nodes. This tree was reduced to five levels to provide a schematic and to increase the relevance of interactions investigated (fig. 2). Software
system was the first variable. For PRAAT, no variation in the interaction terms was
observed. The factors sequentially were gender, token, microphone, and DA system.
For TF32, the next factor was noise level. For the clean samples (66 dB SNR), the
282
Folia Phoniatr Logop
2006;58:274–288
Deliyski /Shaw /Evans /Vesselinov
next factor was token followed by gender. For the other levels, the order was reversed, gender was the next factor followed by token. This difference between the
order of factors for 66 dB SNR versus the other SNRs highlights the impact that even
small amounts of noise may have on VQM. For MDVP, the second factor was gender. Microphone was the next factor for females, while DA system was next in importance for males. Thus, gender has an impact on the importance of microphone
and DA system. For females, half of the microphones (C0, C2, D2) influenced the
occurrence of token as the next factor, while the other half (H0, D1, C1) was followed
by DA system. Thus, for females the type of microphone is a factor in determining
the contribution of intrasubject variability and DA system. For males, DT1 was followed by token. The remaining DA systems were followed by microphone. For
males, when the DT1 hardware with the MDVP software was utilized, microphone
did not appear to be a significant factor. However, when other DA systems were
used, intrasubject variability was not significant. The interpretation of this regression tree suggests that the data structure of jitter is highly dependent on the software
used. This finding is due to the significant differences in the algorithms of the three
programs, which likely do not measure the same voice feature. Therefore, frequency
perturbation values obtained with different analysis software cannot be compared.
Shimmer. The optimal regression tree for shimmer range B resulted in 14 levels
and 392 end nodes. The number of levels and end nodes illustrates the complexity
of shimmer when compared to F0 (202% more complex) and jitter (124% more complex), concurring with previous findings that shimmer is the VQM most vulnerable
to influences. This tree was reduced to five levels to provide a schematic and to limit the complexity of interactions investigated (fig. 3). The first factor for shimmer,
like jitter, was software system. However, unlike jitter, MDVP and PRAAT followed
the same course throughout the tree, while TF32 diverged. All software systems had
the same second factor, gender. For TF32, females split on DA system. The CSL differed from the rest of the systems and was followed by noise level. For SNR levels
between 34 and 66 dB, microphone was the following and final factor. However, for
26 dB SNR, token was the final factor. Thus, gender influences the importance of
DA systems, noise level, and microphone. For TF32, males split on microphone. The
H0 microphone was followed by noise level and token as factors. The remaining microphones split on noise level. For SNRs 34 dB and above, microphone was the final
factor with C2 and D2 differing from the other three. For 26 dB SNR, noise type
emerged as a final factor with White and Talk noises splitting from Fan, 60 Hz and
City. For MDVP and PRAAT, females split on noise level. For SNR 34 dB and
above, microphone and then token were the following factors. For 26 dB SNR, noise
type and microphone were the following factors. For MDVP and PRAAT, males
split on microphone. For H0 noise level was the next factor, while for the desktop
microphones, token and noise level were the following factors. These results suggest
that amplitude perturbation data obtained with MDVP and PRAAT is fully compatible, while shimmer computed by TF32 has significantly different values and data
structure. The regression tree model also provided an independent confirmation of
earlier findings [9] that: (1) shimmer is most sensitive to environmental noise influences in comparison to F0 and jitter; (2) noise can invalidate shimmer values at SNR
below 30 dB, and that (3) noise type influences VQM differentially at noise levels
below 30 dB SNR.
Regression Tree Approach in Acoustic Voice
Analysis
Folia Phoniatr Logop
2006;58:274–288
283
Variable Importance Measures
The results of the CART statistical models provide valuable information related to the overall importance of the factors that affect VQM. The following results
indicate the order of influence of the factors for each VQM. The VIM of each factor
is shown in figures 4 and 5. The factors are arranged from most influential to least
influential.
Fundamental Frequency. Figure 4a demonstrates how the VIM change for F0
across the three different noise level ranges. Gender remains the most influential
factor across all three noise level ranges. When environmental noise is controlled, at
least within recommended or acceptable levels, token (intrasubject variability), microphone, and software are the most important factors. The factors listed after gender have VIM values close to zero. The most prominent changes for all three models
occur in the transition from range B to range C. As lower SNRs are included in the
data to be analyzed, noise type, software, and noise level become the most influential
factors, with the exception of males (fig. 4c) where an increase in importance for
noise type was not shown. In agreement with earlier findings [9], the results shown
for range C reconfirm the importance of controlling for the type and level of environmental noise. That is, SNR levels below 25 dB are not adequate to perform reliable and valid acoustic voice analyses. These results stress the importance of carefully selecting software systems and microphones as well as using multiple tokens in
order to obtain accurate results. To further explore the importance of these factors
when gender was not included in the model, F0 was separately analyzed for males
and females. For females (fig. 4b) and males (fig. 4c), there is no change in the order
of relative importance of factors between range A and range B. However, for females
the VIM values change significantly between range A and range B, demonstrating
that microphone, software, and DA system become increasingly important as SNRs
decrease from 42 to 26 dB. Results for males suggest that token, microphone, and
DA system are the main factors to consider when measuring F0 in an environment
that has an SNR above 26 dB. As shown in figure 4c, token remains the most influential factor for males across all ranges. When the lower SNRs are included (range
C), the results show that software and noise level become the most important factors
to consider. For females, when SNR decreases below 26 dB, the model illustrates
that noise type, software, and noise level become the most important factors. These
results further demonstrate the importance of controlling the noise in the environment and selecting appropriate software and hardware when measuring F0.
Jitter. As seen in figure 5a, for measurements of jitter, the model reveals that
software is the most influential factor and remains so across range B. Similarly, gender remains the second most influential factor. For range B, environmental noise
(noise level and noise type) does not yet have a significant effect. In fact, DA system
and microphone have increased relative importance for range B, suggesting that
these factors have more influence on VQM within this SNR range. When data from
SNR levels deemed as unacceptable are included in the analysis (range C), noise
level, noise type, and software system become the most prominent factors. These
results further suggest that when performing an acoustic voice analysis, noise level,
noise type, software system, and microphone significantly affect the reliability and
validity of the jitter values obtained.
Shimmer. As shown in figure 5b, software, gender, and microphone are the most
influential factors when obtaining shimmer measurements within acceptable SNR
284
Folia Phoniatr Logop
2006;58:274–288
Deliyski /Shaw /Evans /Vesselinov
a
b
c
Gender
100.000
Gender
100.000
Gender
100.000
Token
0.081
Token
0.081
Noise type
2.448
Microphone
0.008
Microphone
0.008
Software
1.595
Software
0.002
Software
0.002
Noise level
1.583
DA system
0.001
DA system
0.002
Token
0.079
Noise level
0.001
Noise type
0.000
Microphone
0.003
Noise type
0.000
Noise level
0.000
DA system
0.001
Range A:
66–42 dB SNR
Range B:
66–26 dB SNR
Range C:
66–10 dB SNR
Token
100.000
Token
100.000
Noise type
100.000
Microphone
0.000
Microphone
13.916
Software
64.202
Software
0.000
Software
7.595
Noise level
61.923
DA system
0.000
DA system
1.359
Token
1.170
Noise type
0.000
Noise type
0.000
Microphone
0.000
Noise level
0.000
Noise level
0.000
DA system
0.000
Range A:
66–42 dB SNR
Range B:
66–26 dB SNR
Range C:
66–10 dB SNR
Token
100.000
Token
100.000
Token
100.000
Microphone
9.582
Microphone
9.321
Software
45.101
DA system
3.313
DA system
3.078
Noise level
24.666
Software
0.000
Software
0.000
Microphone
3.768
Noise type
0.000
Noise type
0.000
Noise type
2.493
Noise level
0.000
Noise level
0.000
DA system
0.949
Range A:
66–42 dB SNR
Range B:
66–26 dB SNR
Range C:
66–10 dB SNR
Fig. 4. Block diagram of the factor order and specific VIM for F0 across the three SNR ranges for
all (a), female (b), and male (c) participants.
Regression Tree Approach in Acoustic Voice
Analysis
Folia Phoniatr Logop
2006;58:274–288
285
a
b
Software
100.000
Software
100.000
Noise level
100.000
Gender
3.651
Gender
3.435
Software
77.801
Token
2.066
Microphone
1.502
Noise type
11.380
Microphone
1.569
Token
1.426
Microphone
3.091
Noise level
0.389
DA system
0.659
Gender
2.187
DA system
0.242
Noise level
0.250
DA system
0.063
Noise type
0.000
Noise type
0.000
Token
0.056
Range A:
66–42 dB SNR
Range B:
66–26 dB SNR
Range C:
66–10 dB SNR
Software
100.000
Software
100.000
Noise level
100.000
Gender
97.389
Gender
99.514
Software
9.161
Microphone
20.753
Microphone
17.509
Noise type
6.177
Token
7.679
Noise level
8.763
Gender
5.253
DA system
3.425
Token
4.079
Microphone
0.610
Noise level
1.977
DA system
2.283
DA system
0.088
Noise type
0.000
Noise type
1.392
Token
0.004
Range A:
66–42 dB SNR
Range B:
66–26 dB SNR
Range C:
66–10 dB SNR
Fig. 5. Block diagram of the factor order and specific VIM for jitter (a) and shimmer (b), across
the three SNR ranges.
levels. Shimmer is affected by noise level as demonstrated in the transition from
range A to range B, which displays an increase in the importance of noise level. The
most prominent changes overall occur between range B and range C, where the inclusion of lower SNR causes the factors of noise level and noise type to significantly
increase in importance. All other factors, with the exception of DA system, decreased
in relative importance, demonstrating the degree of influence that noise has on measurements of shimmer.
286
Folia Phoniatr Logop
2006;58:274–288
Deliyski /Shaw /Evans /Vesselinov
Conclusion
The results of the study confirm that all of the factors included were significant
at varying degrees. Gender strongly influences VQM, which demonstrates the importance of considering gender differences when comparing male and female voices
to normative data. Intra-subject variability was shown to be a significant influence
for all measurements, especially for F0 measurements, concurring with the results of
Scherer et al. [4] and with the NCVS requirement of implementing multiple tokens
[3]. In agreement with Titze and Winholtz [5], it was shown that for range B, the effect of microphone never fell below the third place of importance for all three measurements. The effect of environmental noise, including noise level and noise type,
was shown to be the most prominent factor that affects VQM when SNR levels drop
below 30 dB. This result concurs with earlier findings [9] that: (1) environmental
noise below 30 dB SNR distorts voice samples to the point where normal voices are
recognized as abnormal, and (2) the impact of environmental noise below 25 dB
SNR exceeds that due to intrasubject variability also causing pitch extraction algorithms to incur gross errors. Although DA system was not shown to be one of the
most important factors, it was certainly shown to become highly significant when
environmental noise was controlled within acceptable SNR levels. The results demonstrated the effect of differing computational algorithms of the software systems on
VQM and further support the observation that more research is needed on this topic. The results reinforce the role of the software system as a highly influential factor
making it a critical variable when performing acoustic voice analysis and comparing
results across software systems, particularly for measurements of jitter and shimmer.
Other factors, such as sampling rate, may have a significant effect on comparing
VQM and should be further studied. The results from this experiment demonstrate
the importance of controlling for factors that affect VQM in order to obtain accurate
and reliable VQM necessary to appropriately diagnose and treat individuals with
vocal pathologies.
Acknowledgments
The authors express their appreciation to the University of South Carolina faculty, staff,
students, and family members who volunteered to participate in this study; to the Department
of Communication Sciences and Disorders for supporting the project; to Dr. Paul Milenkovic of
the University of Wisconsin-Madison and Dr. Paul Boersma of the University of Amsterdam for
contributing the TF32 and PRAAT software systems, respectively.
References
1
2
3
4
5
Kent RD, Ball MJ: Voice Quality Measurement. San Diego, Singular Publishing Group, 2000.
Baken RJ, Orlikoff RF: Clinical Measurement of Speech and Voice, ed 2. San Diego, Singular Publishing Group, 2000.
Titze IR: Workshop on Acoustic Voice Analysis: Summary Statement. Iowa City, National Center for
Voice and Speech, 1995.
Scherer RC, Vail VJ, Guo CG: Required number of tokens to determine representative voice perturbation values. J Speech Hear Res 1995; 38: 1260–1269.
Titze IR, Winholtz WS: Effects of microphone type and placement on voice perturbation measurements.
J Speech Hear Res 1993; 36: 1177–1190.
Regression Tree Approach in Acoustic Voice
Analysis
Folia Phoniatr Logop
2006;58:274–288
287
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
288
Ingrisano DR, Perry CK, Jepson KR: Environmental noise: a threat to automatic voice analysis. Am J
Speech Lang Pathol 1998; 7: 91–96.
Perry CK, Ingrisano DR, Palmer MA, McDonald EJ: Effects of environmental noise on computer-derived voice estimates from female speakers. J Voice 2000; 14: 146–153.
Carson CP, Ingrisano DR, Eggleston KD: The effect of noise on computer aided measures of voice: a
comparison of CSpeechSP and the Multi-Dimensional Voice Program software using the CSL 4300B
module and Multi-Speech for Windows. J Voice 2003; 17: 12–20.
Deliyski DD, Shaw HS, Evans MK: Adverse effects of environmental noise on acoustic voice quality
measurements. J Voice 2005; 19: 15–28.
Deliyski DD, Evans MK, Shaw HS: Influence of data acquisition environment on accuracy of acoustic
voice quality measurements. J Voice 2005; 19: 176–186.
Titze IR, Liang H: Comparison of F0 extraction methods for high-precision voice perturbation measurements. J Speech Hear Res 1993; 36: 1120–1133.
Kent RD: The Speech Sciences. San Diego, Singular Publishing Group, 1997.
Breiman L, Friedman J, Olshen R, Stone C: Classification and Regression Trees. Pacific Grove, Wadsworth, 1984.
Deliyski DD: Acoustic model and evaluation of pathological voice production. Proc 3rd Eur Conf on
Speech Communication and Technol, EuroSpeech, Berlin, 1993, vol 3, pp 1969–1972.
Kay Elemetrics Corp: Multi-Dimensional Voice Program (MDVP) model 5105, version 2.0: Software
Instruction Manual. Lincoln Park, Kay Elemetrics Corp, 1999.
Boersma P: Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise
ratio of a sampled sound. Proc Inst Phonet Sci, Amsterdam, 1993, vol 17, pp 97–110.
Boersma P, van Heuven V: PRAAT, a system for doing phonetics by computer. Glot Int 2001; 5: 341–
345.
Milenkovic P: Least mean square measures of voice perturbation. J Speech Hear Res 1987; 30: 529–
538.
Milenkovic P: TF32 User’s Manual. Madison, Milenkovic, 2001.
Steinberg D, Colla P: CART: Tree-Structured Non-Parametric Data Analysis. San Diego, Salford Systems, 1995.
Steinberg D, Colla P: CART: Classification and Regression Trees. San Diego, Salford Systems, 1997.
Steadman H, Silver E, Monahan J, Appelbaum P, Robbins P, Mulvey E, Grisso T, Roth L, Banks S: A
classification tree approach to the development of actuarial violence risk assessment tools. Law Hum
Behav 2000; 24: 83–100.
Folia Phoniatr Logop
2006;58:274–288
Deliyski /Shaw /Evans /Vesselinov