
J Am Acad Audiol 16:54–62 (2005)
A Comparison of Two Approaches to
Assessment of Speech Recognition Ability
in Single Cases
Rochelle Cherry*
Adrienne Rubinstein*
Abstract
Single-case design with the randomization test (RT) has been proposed as
an alternative to the binomial distribution (BD) tables of Thornton and Raffin
(1978) to assess changes in speech recognition performance in individual subjects. The present study investigated whether data analyzed using both
approaches would result in similar outcomes. Sixty-two adults with normal
hearing were evaluated using phoneme scoring and a restricted alternating
treatments design under two signal-to-noise conditions. Results revealed a
significant correlation between the RT and a BD analysis using at least 50-word
lists, although the BD analysis was a more sensitive measure. The absence of
correlation between the RT and a BD analysis using 25-word lists challenges
the common clinical practice of using reduced list size, and supports the use
of phoneme scoring and other attempts to find clinically acceptable yet
evidence-based solutions to overcome the conflict between time and accuracy.
Key Words: Binomial model, phoneme scoring, randomization test, single-case
design, speech recognition tests
Abbreviations: AT = alternating treatment; BD = binomial distribution; RT =
randomization test
Sumario
Se ha propuesto el diseño de caso único con prueba de aleatoriedad (RT) como
una alternativa a las tablas de distribución binomial (BD) de Thornton y Raffin
(1978), para evaluar los cambios en el desempeño del reconocimiento del
lenguaje en sujetos individuales. El presente estudio investigó si los datos
analizados utilizando ambos enfoques rinden resultados similares. Se evaluaron
sesenta y dos adultos con audición normal usando puntuación de fonemas y
un diseño de tratamientos restringidos alternantes bajo dos condiciones de
tasa de señal-ruido. Los resultados revelaron una correlación significativa entre
el RT y un análisis BD utilizando listas de al menos 50 palabras, aunque el
análisis BD constituyó una medida más sensible. La ausencia de correlación
entre el RT y un análisis BD usando listas de 25 palabras desafía la
práctica clínica común de utilizar listas de tamaño reducido, y apoya el uso
de puntuación de fonemas y otros intentos de encontrar soluciones clínicamente
aceptables y basadas en evidencia, para superar el conflicto entre el tiempo
y la exactitud.
Palabras Clave: Modelo binomial, puntuación de fonemas, diseño de caso
único con prueba de aleatoriedad, pruebas de reconocimiento del lenguaje
Abreviaturas: AT = tratamiento alternante; BD = distribución binomial; RT =
prueba de aleatoriedad
*Department of Speech Communication Arts and Sciences, Brooklyn College
Adrienne Rubinstein, Department of Speech Communication Arts and Sciences, 2900 Bedford Avenue, Brooklyn, New York
11210; Phone: 718-951-5186; Fax: 718-951-4363; E-mail: [email protected]
Earlier portions of this paper were presented at the Annual Convention of the American Academy of Audiology, San Diego, CA,
April 19–20, 2001.
This study was supported by PSC-CUNY Research Grant #61653-00-30.
Word recognition testing is used in
audiology to evaluate receptive
communication abilities of
individuals, to verify hearing aid and cochlear
implant fittings, and to differentially diagnose
auditory pathologies (Wiley et al, 1995;
Martin et al, 2000). According to Martin and
colleagues (Martin et al, 1998), 91% of
audiologists routinely use word recognition
tests, with the most common materials being
the CID Auditory Test W-22 (Hirsh et al,
1952) and the Northwestern University
Auditory Test No. 6, or NU-6 word lists
(Tillman and Carhart, 1966).
One problem commonly associated with
word recognition testing is the time
consuming nature of its administration.
Numerous attempts have been made to
address this issue (Stach et al, 1995; Olsen
et al, 1997; Gelfand, 1998, 2003; Hurley and
Sells, 2003). One popular technique used
clinically is to administer only 25-word lists
instead of the 50-word lists originally designed
(Martin et al, 1998). Reducing the number of
items, however, increases test variability,
thus making the test less reliable (Thornton
and Raffin, 1978; Dillon, 1982). One solution
proposed that reduces variability without
increasing administration time is to
incorporate phoneme instead of whole-word
scoring (Boothroyd, 1968; Olsen et al, 1997;
Gelfand, 1998, 2003). The AB Short
Isophonemic word lists, for example, were
specifically developed by Boothroyd (1968) to
incorporate phoneme scoring using 10
monosyllabic words per list in order to reduce
the number of words needed.
Issues of variability and reliability are a
concern when evaluating group data, but can
be even more problematic when attempting
to evaluate the results of individual clients.
The typical method for establishing
significant differences in speech recognition
scores in a clinical setting involves the use of
the binomial distribution tables (BD) of
Thornton and Raffin (1978). Limitations in
the use of the binomial distribution to predict
test-retest variability for speech recognition
assessment have been discussed by Dillon
(1982, 1983). He concluded that the
predictions of the binomial distribution should
fall close to empirical values for the average
subject but may not lead to valid results for
a particular individual.
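The binomial model's central prediction, that test-retest variability shrinks as item number grows, can be sketched numerically. The sketch below is only a normal approximation of the binomial standard error; Thornton and Raffin (1978) tabulated exact binomial critical ranges, so the numbers here illustrate the trend rather than reproduce their tables.

```python
import math

def critical_difference(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% critical difference (in proportion units) between
    two speech recognition scores obtained on n-item lists under the
    binomial model. A normal-approximation sketch, not the exact
    Thornton and Raffin (1978) critical ranges."""
    se_diff = math.sqrt(2.0 * p * (1.0 - p) / n)  # SE of a difference of two scores
    return z * se_diff

# Difference (in percentage points) needed for significance when the
# first score is 70% correct, for increasing list sizes
for n in (25, 50, 100):
    print(n, round(100 * critical_difference(0.70, n), 1))
```

With a 25-item list the required difference is roughly a quarter of the scale, which is why shortened lists are so insensitive.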
An alternative to the BD approach is the
use of single-subject, or single-case, design.
In this design, intrasubject variability is
highlighted through repeated measurement
of the dependent variable under both
conditions of the independent variable. The
sequence in which conditions are presented
can vary and produce different types of single-case designs. The alternating treatment (AT)
design, for example, may be distinguished
from the more familiar ABA design (Tawney
and Gast, 1984). The ABA design typically
compares a no-treatment (A) baseline and a
treatment (B) condition, followed by another
no-treatment (A) phase. Several data points
are collected for each condition before moving
to the next condition. In the AT design, single
data points may be collected as one alternates
between the two conditions. Thus, results
may be obtained with fewer data points than
in the ABA design because one need not
employ a formal baseline phase or multiple
measures within each phase.
Proposed methods for establishing
significant differences in single-subject
designs include visual inspection and
statistical analysis. Whereas visual inspection
has the advantage of being less complex and
time-consuming, critics point to the element
of subjectivity in the judgments, which has
been shown to affect agreement among raters
(Ottenbacher, 1993). Chmiel and Jerger
(1995) employed the randomization test (RT),
a nonparametric statistical measure, to
evaluate a variety of treatments using speech
measures in single-subject designs. They
recommended the use of the RT over the BD.
Since the RT uses information on the
performance stability for the specific
individual being tested, it can overcome two
potential pitfalls of the BD: (1) it relies on
averaged data and (2) the data on which the
tables were generated could be based on a
population different from the subject. Another
shortcoming of the BD analysis (although
not an issue in traditional speech recognition
testing) is that it requires a dependent
variable that is specified as percentage
measure, which prevents its application to
other types of data. One disadvantage of the
RT with single-case design, however, is that
the analysis of the data is more complex and
time-consuming (Rubinstein, 1996). This is
especially problematic in the clinical
environment and is what makes the BD so
appealing.
The RT uses the variability of scores for
each subject to determine the estimate of
error variance which, in turn, serves to assess
the treatment effect for that individual
(Chmiel and Jerger, 1995). The mean
difference in scores between the two
conditions is compared to all other possible
mean differences based on all possible
assignments of the individual scores to the
two conditions. When all mean differences are
placed in rank order, those great enough to
occur less often than five times out of one
hundred can be established. If the mean
difference obtained falls within that range,
conditions are judged to be significantly
different. Chmiel and Jerger (1995) provide
a detailed description of this test.
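The logic of the RT just described can be sketched in a few lines. The scores below are hypothetical, and for brevity the sketch enumerates all equal-sized assignments of scores to conditions; a full restricted alternating treatments RT (Onghena and Edgington, 1994) would enumerate only the presentation orders admissible under the randomization scheme.

```python
from itertools import combinations
from statistics import mean

def randomization_test(scores_a, scores_b):
    """Two-sided randomization test on the difference between condition
    means: the observed mean difference is compared with the mean
    differences under every possible assignment of the pooled scores
    to the two conditions. Returns the proportion of assignments with
    a difference at least as extreme as the one observed."""
    pooled = scores_a + scores_b
    observed = abs(mean(scores_a) - mean(scores_b))
    n_a = len(scores_a)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        if abs(mean(group_a) - mean(group_b)) >= observed:
            extreme += 1
        total += 1
    return extreme / total

# Hypothetical phoneme scores (% correct), four lists per noise condition
p = randomization_test([80, 78, 82, 79], [68, 70, 66, 69])
print(p)  # 2 of the 70 assignments are as extreme: p ≈ .029
```

Note that with four scores per condition the smallest attainable two-sided p value is 2/70, so the test can reach the .05 level but not by a wide margin; this limited resolution is one cost of the small number of trials.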
In addition to the problem of variability,
word-recognition testing has also been
criticized for its lack of sensitivity (Walden et
al, 1983; Mendel and Danhauer, 1997). The
use of a person’s own measure of performance
stability could conceivably improve the
sensitivity of the test. Rubinstein et al (1998)
analyzed data using phoneme scoring and a
BD versus a single-subject visual inspection
approach to determine if the findings
comparing two aided conditions would result
in similar outcomes. The BD analysis used a
25-word list for each condition, according to
common clinical practice (Martin et al, 1998); the
single-subject analysis compared four 25-word
lists (100 words) for each condition in
order to arrive at individual performance
variability estimates.
Rubinstein et al (1998) found that
different subjects were identified as being
affected by the experimental variable using
the two types of analyses. However, when
the data were reanalyzed and the item
number was held constant at 100 words for
both approaches instead of using a 25-word
list for the BD analysis, both analyses arrived
at the same outcome. This suggests that the
performance stability information from the
single subject analysis did not provide any
significant advantage over the information
provided by the BD as long as a sufficient
number of items was used. It also suggests
that use of 25-word lists reduces the validity
of the speech recognition test. These
conclusions must be treated with caution,
however, because so few subjects were shown
to be affected by the treatment variable. To
address these questions, it is necessary to
incorporate a more powerful independent
variable, one that would result in significant
differences for more than a few subjects. One
variable known to affect word recognition
performance is signal-to-noise ratio (Gengel
et al, 1981; Beattie, 1989; Sperry et al, 1997).
The purpose of the present study was to
determine if speech recognition test data,
analyzed using both the BD tables of
Thornton and Raffin (1978), and the
randomization test with a single-case design,
would result in similar outcomes in the
presence of two different signal-to-noise
ratios. If the two analyses result in similar
outcomes, it would then be valuable to
determine whether one method is more
sensitive in detecting differences.
METHOD
Subjects
Sixty-two adults (mean age of 25.1 with
standard deviation of 6.35 years) with hearing
within normal limits served as subjects and
met the following selection criteria: (a)
speaking English as the first language, (b)
passing a pure-tone screening at 20 dB HL
at 250–8000 Hz bilaterally (using a GSI
16 audiometer), (c) having normal
tympanograms bilaterally (peak pressure
more positive than -50 daPa and static
immittance between 0.3 and 1.6 ml) and
present ipsilateral acoustic reflexes at 1000
Hz at 100 dB HL (using a GSI 33 Middle-Ear
Analyzer).
Procedure
All speech and noise materials were
commercially produced on compact disc by
Auditec of St. Louis, played on a Sony CDP-211 CD player and delivered via a GSI 16
audiometer. Materials included (a) the NU-6 word lists (due to their common clinical
usage and availability of multiple lists) and
(b) the AB word lists (due to their shorter
length and because they were specifically
developed to be used with phoneme scoring).
All subjects received both NU-6 and AB lists,
and the test protocol was carried out in a
single session.
NU-6 Lists
For the NU-6 word lists, each subject
listened under earphones in the preferred
ear (as determined by the subject) and
responded verbally to eight 25-word lists
presented at 50 dB HL (normal
conversational level) under two conditions
(0 versus -2 dB signal-to-noise ratios, four lists
per condition). Cafeteria noise was used
because it is dominated by low-frequency
information but also contains some high-frequency components and simulates a real-life listening environment (Humes et al,
1997). These noise levels were chosen (a) to
avoid ceiling and floor effects and (b) to assure
a sufficiently powerful independent variable
so that differences in speech recognition
ability could be expected (Rubinstein et al,
1998). Data from Sperry et al (1997) had
suggested that using signal-to-noise ratios
around 0 dB and monosyllabic words with this
subject pool would prevent ceiling and floor
effects. Prior to the onset of the study, five
subjects were run at various signal-to-noise
ratios around 0 dB, and it was determined
that 0 dB and -2 dB signal-to-noise ratios
provided appropriate levels for the
independent variable.
In order to accommodate a single-case
analysis of the data, the order of the two
noise conditions over the eight trials was
randomized for each subject according to a
restricted alternating treatments design
(Onghena and Edgington, 1994; Rubinstein,
1996). This may be differentiated from a
completely randomized design where any
order of presentation of the two conditions
may be randomly chosen. In such a case, it
might have been possible to randomly choose
an order in which all administrations of
one treatment (e.g., four presentations of a
0 dB signal-to-noise ratio) precede all
administrations of the second (e.g., -2 dB
signal-to-noise ratio), thus introducing the
risk of an order effect. In a restricted
alternating treatments design, certain orders
of administration are eliminated as
possibilities, thus preventing certain
undesirable outcomes. In the present study,
no subject received more than three
successive presentations for the NU-6 lists
under the same noise condition. The order of
the lists was counterbalanced.
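The restriction used here can be made concrete with a short sketch (a hypothetical illustration; the condition labels are arbitrary and list counterbalancing is omitted). For eight trials with four lists per condition, the no-more-than-three-in-a-row rule excludes every order in which all four lists of one condition run consecutively.

```python
import random
from itertools import combinations

def restricted_orders(n_per_condition: int = 4, max_run: int = 3) -> list[str]:
    """Enumerate the admissible presentation orders of a restricted
    alternating treatments design: n trials per condition ('A' and 'B'
    standing for the two noise conditions), with no condition presented
    more than max_run times in succession."""
    n = 2 * n_per_condition
    valid = []
    for positions in combinations(range(n), n_per_condition):
        order = ''.join('A' if i in positions else 'B' for i in range(n))
        longest = run = 1
        for prev, cur in zip(order, order[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        if longest <= max_run:
            valid.append(order)
    return valid

orders = restricted_orders()
print(len(orders))            # admissible orders out of the 70 possible 4-vs-4 splits
print(random.choice(orders))  # one admissible order, drawn at random for a subject
```

Of the 70 possible 4-versus-4 orders, only the 8 containing a run of four identical conditions are excluded, so the randomization remains rich enough for the RT while the order-effect risk is removed.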
Phoneme recognition scores were
calculated in percent correct for each of the
eight presentations (four lists times two
conditions) with 75 phonemes (25 words times
three phonemes) per list. Phoneme scoring
was adopted to increase the sensitivity of
the measure by increasing sample size
without having to increase test time (Olsen
et al, 1997). The BD (using a 95% critical
difference level) was used to establish which
subjects were differentially affected by the two
noise conditions, using only the first list of each
condition. The binomial tables were adjusted
because it is incorrect to assume 75
independent pieces of information when using
meaningful words, and thus, j values
(Boothroyd and Nittrouer, 1988) were
calculated and applied to the analysis of the
data. The effect of phonemic constraint was
accounted for by using the equations derived
by Boothroyd and Nittrouer (1988), who
compared recognition probability scores of
the whole word to the probability scores of the
individual phonemes, as follows:
pw = pp^j
where pw is the probability of recognition of
the whole word and pp is the probability of
recognition of an individual phoneme when
j is known. The j value is calculated by
j = log(pw) / log(pp)
Using the above calculation, the j value
will vary depending on such factors as test
material and subject pool but will always be
a number less than three. In the present
study, j values were calculated from the data
obtained from each subject and multiplied by
25 to obtain the correct number of
independent sources of information for
subsequent analysis by the BD method. In
addition, analysis of the single-case data was
performed using the RT (Onghena and
Edgington, 1994) to determine which subjects
were differentially affected by the treatment
variable.
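The j-value adjustment can be sketched numerically; the proportions below are hypothetical and are not data from this study.

```python
import math

def j_value(p_word: float, p_phoneme: float) -> float:
    """j factor of Boothroyd and Nittrouer (1988), from pw = pp**j.
    With three phonemes per CNC word, j = 3 would mean the phonemes
    carry fully independent information; lexical context pushes j
    below 3."""
    return math.log(p_word) / math.log(p_phoneme)

# Hypothetical subject: 52% of whole words and 78% of phonemes correct
j = j_value(0.52, 0.78)
effective_n = j * 25  # independent items per 25-word list for the BD analysis
print(round(j, 2), round(effective_n, 1))  # about 2.63 and 65.8
```

The adjusted item count (here roughly 66 rather than the nominal 75 phonemes) is what enters the binomial tables, so ignoring the j factor would overstate the test's reliability.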
AB Lists
The Isophonemic (AB) Word Lists
(Boothroyd, 1968) consist of 15 lists of 10
consonant-nucleus-consonant monosyllabic
words each, with every list drawing on the
same 10 vowels and 20 consonants. A more thorough
description of this test can be found in
Boothroyd (1968). In the case of the AB word
lists, the procedure was the same except for
the following: (1) ten 10-word lists were used,
five for each condition (allowing for a BD
analysis using 10 words, as well as 50 words
for comparison with the NU-6), (2) no subject
received more than four successive lists for
the same condition (to avoid all trials for one
condition from occurring before all trials of
the other), (3) phoneme recognition scores
were calculated for each of ten presentations
(five lists times two conditions) with 30
phonemes (10 words times three phonemes)
per list, and (4) binomial tables were adjusted
accordingly.
RESULTS AND DISCUSSION
The purpose of the study was to determine
if the BD and RT analyses would result
in similar outcomes, that is, the results of a
subject who demonstrated a significant
difference between noise conditions on the RT
would also show a significant difference when
the BD analysis was used. It may be recalled
that the BD analysis was based on 25 words
for the NU-6 material (as performed
clinically) and 10 words for the AB lists. Table
1 lists the number of subjects demonstrating
similar outcomes when comparing the results
of the BD and RT analyses. Using NU-6
materials, the BD and RT analyses for 38 of
the 62 subjects resulted in similar outcomes:
for 13 subjects both the BD and RT analyses
resulted in statistical differences between
noise conditions, and for 25, neither analysis
showed a statistical difference. In the case of
the AB lists, there were similar outcomes for
the two analyses for 44 subjects, the numbers
being 6 and 38, respectively.
To determine if the BD and RT analyses
were statistically correlated, a Spearman
Rank order correlation coefficient (Zar, 1996)
was calculated on the results of each list,
and it revealed that the results from the BD
and RT were not significantly correlated
(NU-6: r = .2, p > .05; AB: r = .23, p> .05). In
other words, a subject who ranked high on the
BD did not necessarily show a high ranking
on the RT. These results are consistent with
those of Rubinstein et al (1998), who also
found that subjects were not judged to be
similarly affected by the treatment variable
when comparing the two analyses.
As in the earlier study (Rubinstein et al,
1998), the question arises whether these
different outcomes were due to the different
types of analyses performed or to the
differences in number of items used (using
NU-6: BD-25 words, RT-100 words; using A B:
BD-10 words, RT-50 words). Thus, further
analyses were carried out, with the BD
Table 1. Number (and Percentage) of Subjects (n = 62) Demonstrating Similar Outcomes from BD and RT Analyses When Determining If the Two Noise Conditions (Treatment Effect) Were Significantly Different

Test       BD List   Number (and %) of subjects with similar outcomes
Material   Size      Treatment effect     No treatment effect     Total
                     with BD and RT       with BD or RT
NU-6       25        13 (21%)             25 (40.3%)              38 (61.3%)
           50        21 (33.9%)           26 (41.9%)              47 (75.8%)**
           100       23 (37.1%)           19 (30.6%)              42 (67.7%)**
AB         10        6 (9.7%)             38 (61.3%)              44 (71%)
           50        18 (29%)             32 (51.6%)              50 (80.6%)**

*p < .01; **p < .001
analysis including the entire data set and
based on adjusted BD tables (NU-6: 100 x 3;
AB: 50 x 3; j values calculated based on
Boothroyd and Nittrouer, 1988). Table 1 also
lists the number of subjects achieving
statistical significance when item number
was held constant. Results both from the
NU-6 and the AB lists revealed a statistically
significant positive correlation between the
BD and RT analyses (NU-6: r = .48, p < .001;
AB: r = .64, p < .001). In other words, when
item number was constant, the BD and RT
analyses tended to identify the same subjects
as affected by the noise conditions. These
results are in agreement with those of
Rubinstein et al (1998) and support the
conclusion that lists of 25 words or less reduce
the validity of the speech recognition test
results, even with the addition of phoneme
scoring.
Since the NU-6 word lists originally were
designed to be administered using 50 words,
another correlation test was carried out
comparing the RT (100 words) and the first
two 25-word lists (50 words) for the BD
analysis with adjusted BD tables (see Table
1). Results revealed a statistically significant
correlation in this case as well (r = .55, p <
.001), also supporting the use of 50-word lists
with phoneme scoring.
Figure 1 presents the speech recognition
scores for a sample subject in each noise
condition and each trial, for the NU-6 lists.
Analysis using the BD on the first two data
points (25 words each) revealed no
statistically significant difference between
conditions. When the analysis was based on
the first 50 words per condition, or all the data
(100 words each), results were statistically
significant. The same pattern of findings was
noted for the results of this subject with the
AB lists, shown in Figure 2 (10 words, not
statistically significant; 50 words, statistically
significant). The results of the RT analysis for
this subject were also statistically significant
for both types of material. These figures
provide pictorial representation of the
variability of the data with 10- and 25-word
lists, and evidence of the problem in using
shortened word lists.
Although the two analyses resulted in
statistically significant agreement when 50- and 100-word lists were used for the BD
analysis, they did not always arrive at the
Figure 1. Speech recognition scores of Subject 12 for both signal-to-noise ratio conditions using the NU-6 lists
Figure 2. Speech recognition scores of Subject 12 for both signal-to-noise ratio conditions using the AB lists
same conclusion for every subject. Table 2
shows the number of subjects for whom the
treatment effects of noise were significant
for one analysis, but not the other (BD only
vs. RT only).
For example, in Table 1, in the case of the
50-word AB lists, 50 out of 62 subjects
demonstrated agreement between the BD
and RT. In Table 2, we see that of the
remaining 12 subjects where the analyses
did not agree, it was only the BD analysis that
was sensitive enough to demonstrate
significant differences between noise
conditions. Thus, although the results from
the BD and RT analyses were significantly
correlated (as demonstrated by the Spearman
Test), in cases where their results did not
agree, one analysis (in this case the BD) was
more sensitive and able to identify differences
between conditions. To address this issue,
the data were analyzed using the McNemar
P Test of Correspondence (Zar, 1996). Results
are shown in Table 2. In all cases where a
significant correlation had been found, the BD
analysis was statistically more sensitive. In
other words, for subjects where the BD and
Table 2. Number (and Percentage) of Subjects (n = 62) Demonstrating Different Outcomes from the BD and RT Analyses When Determining If the Two Noise Conditions (Treatment Effect) Were Significantly Different

Test       BD List   Number of subjects with different outcomes
Material   Size      Treatment effect     Treatment effect     Total
                     with BD only         with RT only
NU-6       25        13 (21%)             11 (17.7%)           24 (38.7%)
           50        12 (19.4%)           3 (4.8%)             15 (24.2%)**
           100       19 (30.6%)           1 (1.6%)             20 (32.2%)**
AB         10        6 (9.7%)             12 (19.4%)           18 (29%)
           50        12 (19.4%)           0 (0%)               12 (19.4%)**

*p < .01; **p < .001
RT analyses resulted in different findings, the
BD test was the analysis more likely to detect
the difference between the two noise
conditions.
CONCLUSION
The purpose of the present study was to
determine if speech recognition test data,
analyzed using both the BD tables of
Thornton and Raffin (1978), and the RT with
single-case design would result in similar
outcomes. Results revealed that there was a
significant correlation between the two types
of analyses when phoneme scoring was used
and the BD analysis was based on at least 50-word lists. These findings were consistent
across both types of materials (NU-6 and AB
word lists). Furthermore, the RT did not add
to the sensitivity of the measure, and actually
was found to be significantly less sensitive.
Therefore, using the individual’s own
information on performance stability from
the single-case design combined with the RT
did not provide any advantage over the use
of the BD in the clinical setting. This does not
rule out the potential value of the RT in other
circumstances. As noted earlier, it can
demonstrate treatment effects based on any
measures on an interval scale including
measures not suitable to the BD tables (e.g.,
reaction time measures, amplitudes or
latencies of evoked potentials) (Chmiel and
Jerger, 1995).
The fact that the 10- and 25-word lists
arrived at different findings as compared to
the longer lists adds further evidence to the
literature that questions the legitimacy of
their use. The concern over the use of half lists
continues to be acknowledged in the literature
(Runge and Hosford-Dunn, 1985; Stach et
al, 1995; Mendel and Danhauer, 1997; Olsen
et al, 1997; Gelfand, 1998; Studebaker et al,
1999; Hurley and Sells, 2003), yet it continues
to be used clinically (Wiley et al, 1995;
Studebaker et al, 1999; Hurley and Sells,
2003). Further research is needed to find a
solution acceptable to clinicians that can
enable accuracy and reliability while limiting
expenditure of time. In summary, the present
study revealed that results using a BD and
RT analysis were statistically correlated
when using phoneme scoring and at least
50-word lists. Although the two analyses
were correlated, the BD analysis was found
to be a more powerful measure. The absence
of correlation with 10- or 25-word lists even
when phoneme scoring is used provides
further support challenging the common
clinical practice of using reduced list size.
Further, it supports the use of phoneme
scoring and other attempts to find clinically
acceptable yet evidence-based (i.e., research-driven; Wiley et al, 1995) solutions to
overcome the conflict between time and
accuracy.
Acknowledgments. The authors would like to
acknowledge the assistance of K. Gerow and M. Rosing
in the statistical treatment of the data, and M.
Bendavid for her comments.
REFERENCES
Beattie RC. (1989) Word recognition functions for
the CID W-22 Test in multitalker noise for normally
hearing and hearing-impaired subjects. J Speech
Hear Disord 54:20–32.
Boothroyd A. (1968) Developments in speech
audiometry. Sound 2:3–10.
Boothroyd A, Nittrouer S. (1988) Mathematical
treatment of context effects in phoneme and word
recognition. J Acoust Soc Am 84:101–114.
Chmiel R, Jerger J. (1995) Quantifying improvement
with amplification. Ear Hear 16:166–175.
Dillon H. (1982) A quantitative examination of the
sources of speech discrimination test score variability.
Ear Hear 3:51–58.
Dillon H. (1983) The effect of test difficulty on the
sensitivity of speech recognition tests. J Acoust Soc
Am 73:336–344.
Gelfand SA. (1998) Optimizing the reliability of speech
recognition scores. J Speech Lang Hear Res
41:1088–1102.
Gelfand SA. (2003) Tri-word presentations with
phonemic scoring for practical high reliability speech
recognition assessment. J Speech Lang Hear Res
41:1088–1102.
Gengel RW, Miller L, Rosenthal E. (1981) Between- and
within-listener variability in response to CID W-22
presented in noise. Ear Hear 2:78–81.
Hirsh IJ, Davis H, Silverman SR, Reynolds EG,
Eldert E, Benson RW. (1952) Development of materials
for speech audiometry. J Speech Hear Disord
17:321–327.
Humes L, Christensen L, Bess F, Hedley-Williams A.
(1997) A comparison of the benefit provided by well-fit
linear hearing aids and instruments with automatic
reductions of low-frequency gain. J Speech Lang Hear
Res 40:666–685.
Hurley RM, Sells JP. (2003) An abbreviated word
recognition protocol based on item difficulty. Ear
Hear 24:111–118.
Martin F, Champlin C, Chambers J. (1998) Seventh
survey of audiometric practices in the United States.
J Am Acad Audiol 9:95–104.
Martin F, Champlin C, Perez D. (2000) The question
of phonetic balance in word recognition testing. J Am
Acad Audiol 11:489–493.
Mendel LL, Danhauer JL. (1997) Test administration
and interpretation. In: Mendel LL, Danhauer JL, eds.
Audiologic Evaluation and Management and Speech
Perception Assessment. San Diego, CA: Singular
Publishing Group, 15–43.
Olsen W, Van Tasell D, Speaks C. (1997) Phoneme
and word recognition for words in isolation and in
sentences. Ear Hear 18:175–188.
Onghena P, Edgington E. (1994) Randomization tests
for restricted alternating treatments designs. Behav
Res Ther 32:783–786.
Ottenbacher K. (1993) Interrater agreement of visual
analysis in single-subject decisions: quantitative
review and analysis. Am J Ment Retard
98:135–142.
Rubinstein A. (1996) Letter to the editor: Complete
and restricted alternating treatment designs. Ear
Hear 17:78.
Rubinstein A, Cherry R, Schur K. (1998) Comparing
two methods of evaluating aided speech recognition
performance for single cases. J Acad Rehabilitative
Audiol 31:31–44.
Runge CA, Hosford-Dunn H. (1985) Word recognition
performance with modified CID W-22 word lists. J
Speech Hear Res 28:355–362.
Sperry J, Wiley T, Chial M. (1997) Word recognition
performance in various background competitions. J
Am Acad Audiol 8:71–80.
Stach B, Davis-Thaxton ML, Jerger J. (1995)
Improving the efficiency of speech audiometry: computer-based approach. J Am Acad Audiol 6:330–333.
Studebaker G, Gray G, Branch W. (1999) Prediction
and statistical evaluation of speech recognition test
scores. J Am Acad Audiol 10:355–370.
Tawney J, Gast D. (1984) Single Subject Research in
Special Education. Columbus, OH: Charles E. Merrill
Publishing.
Thornton A, Raffin M. (1978) Speech discrimination
scores modeled as a binomial variable. J Speech Hear
Res 21:507–518.
Tillman TW, Carhart R. (1966) An Expanded Test for
Speech Discrimination Utilizing CNC Monosyllabic
Words. Northwestern University Auditory Test 6. USAF
School of Aerospace Medicine Technical Report. Brooks
Air Force Base, TX: USAF School of Aerospace
Medicine.
Walden B, Schwartz D, Williams D, Holum-Hardegen
L, Crowley J. (1983) Test of the assumptions underlying comparative hearing aid evaluations. J Speech
Hear Disord 48:264–273.
Wiley T, Stoppenbach D, Feldhake L, Moss K,
Thordarottir E. (1995) Audiologic practices: what is
popular versus what is supported by the evidence.
Am J Audiol 4:26–34.
Zar JH. (1996) Biostatistical Analysis. Upper Saddle
River, NJ: Prentice Hall.