Muhlbaier, L.H.; (1981) "Enhancement of Precision of Estimates of Prevalence by Multiple Observation of Individuals."

ENHANCEMENT OF PRECISION OF ESTIMATES OF PREVALENCE
BY MULTIPLE OBSERVATION OF INDIVIDUALS
by
Lawrence H. Muhlbaier
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 1371
December 1981
ENHANCEMENT OF PRECISION OF ESTIMATES OF
PREVALENCE BY MULTIPLE OBSERVATION
OF INDIVIDUALS
by
Lawrence H. Muhlbaier
A Dissertation submitted to the faculty of The
University of North Carolina at Chapel Hill in partial
fulfillment of the requirements for the degree of
Doctor of Philosophy in the Department of Biostatistics
in the School of Public Health.
Chapel Hill
1981
Approved by:
ABSTRACT

LAWRENCE HENRY MUHLBAIER. Enhancement of Precision of Estimates of Prevalence by Multiple Observation of Individuals. (Under the direction of DANA E. A. QUADE.)
In screening for a rare event, even with a highly specific test, one usually declares many more individuals to be positive than truly are positive, causing an inflated estimate of the prevalence of the event. We show how a more accurate estimate of the prevalence can be obtained by using a maximum likelihood estimation technique based on multiple applications of the test. This maximum likelihood technique also provides estimates of the sensitivity and specificity of the test, but does not classify each individual as positive or not for the event.

We compare the mean squared error of estimators based on different designs for multiple observation with that of the estimate obtained from single observation on an equally costly larger sample. These are found to be lower in most cases, indicating improved efficiency due to the multiple observation estimation procedure. We also indicate certain extensions to more than two classifications of an individual, and to non-independent tests.

We illustrate the analysis using data on repeated measurements of blood pressure.
CONTENTS

ILLUSTRATIONS ............................................ ii
TABLES .................................................. iii
ACKNOWLEDGEMENTS ......................................... iv

Chapter
1  INTRODUCTION AND LITERATURE REVIEW ..................... 1
   1.1 Elementary Terminology of Testing .................. 1
   1.2 Literature Review .................................. 9
2  A PROBLEM IN ESTIMATION ............................... 16
   2.1 A General Model ................................... 16
   2.2 Maximum Likelihood Estimation ..................... 18
   2.3 Questions to be Answered .......................... 22
3  RESULTS ............................................... 24
   3.1 Analysis of Design 2 .............................. 24
   3.2 Model and Analysis -- Design 3 .................... 47
   3.3 A Problem of Estimability ......................... 66
4  EXAMPLE ............................................... 68
5  SUMMARY AND RECOMMENDATIONS ........................... 74
APPENDIX A ............................................... 76
REFERENCES ............................................... 81
ILLUSTRATIONS

1.  Sensitivity and Specificity of Design 2 as a Function of Design 1 ... 7
2.  Empirical and Cumulative Distribution Function for Prevalence at n=500 ... 29
3.  Empirical and Cumulative Distribution Function for Prevalence at n=1,000 ... 30
4.  Empirical and Cumulative Distribution Function for Prevalence at n=5,000 ... 31
5.  Empirical and Cumulative Distribution Function for Prevalence at n=10,000 ... 32
6.  Empirical and Cumulative Distribution Function for Sensitivity at n=500 ... 33
7.  Empirical and Cumulative Distribution Function for Sensitivity at n=1,000 ... 34
8.  Empirical and Cumulative Distribution Function for Sensitivity at n=5,000 ... 35
9.  Empirical and Cumulative Distribution Function for Sensitivity at n=10,000 ... 36
10. Empirical and Cumulative Distribution Function for Specificity at n=500 ... 37
11. Empirical and Cumulative Distribution Function for Specificity at n=1,000 ... 38
12. Empirical and Cumulative Distribution Function for Specificity at n=5,000 ... 39
13. Empirical and Cumulative Distribution Function for Specificity at n=10,000 ... 40
TABLES

1.  Expected Classifications by a Test with Error ... 2
2.  Expected Classifications by a Test with Error -- Design 2 ... 5
3.  Lilliefors Test of Normality for Estimates of Prevalence, Sensitivity, and Specificity Among Different Sample Sizes for Design 2 (Prevalence = .05, Sensitivity = .85, Specificity = .95) ... 27
4.  Wilcoxon Signed-Rank Test for Unbiasedness of Estimates (Prevalence = .05, Sensitivity = .85, Specificity = .95) ... 28
5.  Lilliefors Test of Normality of Estimates of Prevalence (p) for N=100 Monte Carlo Runs for Design 2 ... 42
6.  Lilliefors Test of Normality of Estimates of Sensitivity (u) and Specificity (v) for N=100 Monte Carlo Runs for Design 2 ... 43
7.  Maximum Likelihood and Multinomial Estimates of Prevalence (p) for Design 2 (n=1,000) Based on 100 Monte Carlo Runs ... 45
8.  Maximum Likelihood and Multinomial Estimates of Prevalence (p) for Design 2 (n=5,000) Based on 100 Monte Carlo Runs ... 46
9.  Maximum Likelihood Estimates of Sensitivity (u) and Specificity (v) for Design 2 (n=1,000) Based on 100 Monte Carlo Runs ... 48
10. Maximum Likelihood Estimates of Sensitivity (u) and Specificity (v) for Design 2 (n=5,000) Based on 100 Monte Carlo Runs ... 49
11. Lilliefors Test of Normality for Design 3 Estimates of Prevalence (p) Based on 100 Monte Carlo Runs ... 56
12. Lilliefors Test of Normality for Design 3 Estimates of Sensitivity (u), Specificity (v), and Pr(Obvious) (t) Based on 100 Monte Carlo Runs ... 57
13. Design 3 Estimates of Prevalence (p) and their RMSE's for n=1,000 Based on 100 Monte Carlo Runs ... 58
14. Design 3 Estimates of Prevalence (p) and their RMSE's for n=5,000 Based on 100 Monte Carlo Runs ... 59
15. Design 3 Estimates of Prevalence (p) and their RMSE's for n=10,000 Based on 100 Monte Carlo Runs ... 60
16. Design 3 Estimates of Sensitivity, Specificity, and Pr(Obvious) and their RMSE's for n=1,000 per Monte Carlo Run ... 63
17. Design 3 Estimates of Sensitivity, Specificity, and Pr(Obvious) and their RMSE's for n=5,000 per Monte Carlo Run ... 64
18. Design 3 Estimates of Sensitivity, Specificity, and Pr(Obvious) and their RMSE's for n=10,000 per Monte Carlo Run ... 65
19. Blood Pressure Categorizations for 7074 White Subjects Aged 20-54 from All Lipids Research Clinics ... 70
20. Design 3 Hypertension Parameter Estimates for Lipids Research Clinics Data ... 71
21. Design 3 Hypertension Parameter ML Estimates for Lipids Research Clinics Data with Pr(Obvious) Forced to be Zero ... 73
ACKNOWLEDGEMENTS

Dana Quade's guidance and assistance as my academic advisor and as my dissertation advisor have been invaluable. He focused my studies, made valuable suggestions for my dissertation, and devoted many hours to reading (and re-reading) this paper. Many thanks.

My thanks also to William E. Wilkinson, my division chief at Duke University, for allowing the flexibility in my work schedule to pursue these studies. Basil Rifkind of the National Heart, Lung, and Blood Institute and the Lipids Research Clinics provided the data for my example.

Jo Ann Lutz, my wife, was instrumental in the completion of this dissertation. Without her support and counsel, it would never have been done.
Chapter 1. INTRODUCTION AND LITERATURE REVIEW

1.1 Elementary Terminology of Testing

In a simple classification procedure or test, the object is to classify individuals as positive ("+") or negative ("-") for the characteristic under study. Let

    p  = probability that an individual has the characteristic,
    n  = number of individuals tested,

and let

    n1 = number that are classified "+" (i.e., with the characteristic),
    n2 = number that are classified "-" (i.e., without the characteristic).

Then a naive estimate of p is p̂1 = n1/n.

Unfortunately, few tests are error free; they misclassify some individuals. Let

    u = sensitivity of the test (i.e., the probability of correctly
        classifying a truly positive individual),

and

    v = specificity of the test (i.e., the probability of correctly
        classifying a truly negative individual).
Then the expected results of the test can be summarized by Table 1.

                        TABLE 1
        Expected Classifications by a Test with Error

                       True +     True -         Total
    Classified "+"     npu        n(1-p)(1-v)    n1
    Classified "-"     np(1-u)    n(1-p)v        n2
    Total              np         n(1-p)         n
The probability of classifying an individual as positive is

    Q1 = pu + (1-p)(1-v),

and p̂1 = n1/n has this expectation. Thus p̂1 is generally biased, with the bias equal to (Q1 - p). The variance of p̂1 is Q1(1 - Q1)/n and thus the mean squared error (MSE) is

    MSE(p̂1) = Q1(1 - Q1)/n + (Q1 - p)^2

(MSE is variance plus squared bias). Even with fairly good sensitivity and specificity, the MSE is usually dominated by a large bias term and thus greatly exceeds the variance p(1 - p)/n of an unbiased binomial estimate of the parameter. For example, if u = .8, v = .9, p = .15, and n = 1,000, then

    Q1 = (.15)(.8) + (.85)(.1) = .205
and

    MSE(p̂1) = .205(1 - .205)/1000 + (.205 - .15)^2 = .0031879.

This may be compared to

    MSE(p̂0) = .15(1 - .15)/1000 = .0001275,

which we would obtain using a perfect test which (correctly) classifies np individuals as positive. MSE(p̂1) is over 25 times as large as MSE(p̂0).
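The bias and MSE computations above are easy to reproduce numerically. The following sketch (with variable names of our own choosing) checks the worked example:

```python
# Bias and MSE of the naive prevalence estimate for the worked
# example in the text: u = .8, v = .9, p = .15, n = 1,000.
p, u, v, n = 0.15, 0.80, 0.90, 1000

Q1 = p * u + (1 - p) * (1 - v)           # Pr(classified "+") ≈ .205
bias = Q1 - p                            # ≈ .055
mse_naive = Q1 * (1 - Q1) / n + bias**2  # variance + squared bias
mse_perfect = p * (1 - p) / n            # error-free binomial MSE

print(Q1, mse_naive, mse_perfect)        # ≈ .205, ≈ .0031880, ≈ .0001275
print(mse_naive / mse_perfect)           # over 25
```

Note that even this modest misclassification rate makes the bias term, not the sampling variance, the dominant component of the MSE.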
With characteristics of low prevalence (small p) among the population (and presumably among the individuals sampled), the naive estimator makes many errors (in absolute numbers) of classifying as having the characteristic those who truly do not have it. The conditional probability that an individual is truly positive (+) given that he is classified as negative ("-") is

    P(+|"-") = p(1-u) / (p(1-u) + (1-p)v),

and the conditional probability that the individual is truly positive (+) given that he is classified as positive ("+") is

    P(+|"+") = pu / (pu + (1-p)(1-v)).

For the example above,

    P(+|"-") = (.15)(.2) / ((.15)(.2) + (.85)(.9)) = .03/.795 = .037736

and

    P(+|"+") = (.15)(.8) / ((.15)(.8) + (.85)(.1)) = .12/.205 = .585366,
substantiating the claim made. More than one-third of the individuals classified as positive are truly negative.

One way to reduce this problem is to apply another test (independent of the first) to some or all of the individuals. This test may or may not be the same as the original. Let us assume, however, that it has the same sensitivity, u, and specificity, v, as the original test. Suppose both classifications are positive. This happens with probability u^2 for individuals who are truly positive and with probability (1-v)^2 for individuals who are truly negative. Thus the conditional probability that an individual is truly positive, given two independent positive classifications, is

    P(+|"++") = pu^2 / (pu^2 + (1-p)(1-v)^2).

For the example above, this is

    P(+|"++") = (.15)(.64) / ((.15)(.64) + (.85)(.01)) = .096/.1045 = .9187,

increasing our confidence considerably about the classification of the individual as positive for the characteristic.
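The three conditional probabilities derived above can be verified directly (a sketch using the example values; the variable names are ours):

```python
# Posterior probability that an individual is truly positive, given
# the classification(s), for p = .15, u = .8, v = .9.
p, u, v = 0.15, 0.80, 0.90

pos_given_neg = p * (1 - u) / (p * (1 - u) + (1 - p) * v)
pos_given_pos = p * u / (p * u + (1 - p) * (1 - v))
pos_given_two_pos = p * u**2 / (p * u**2 + (1 - p) * (1 - v)**2)

print(pos_given_neg)       # ≈ .037736
print(pos_given_pos)       # ≈ .585366
print(pos_given_two_pos)   # ≈ .9187
```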
With a rare characteristic, it is generally not economically possible to put all members of the sample through all of the tests. As an alternative, we could select a subset of the sample to be categorized more than once. One method that would attempt to minimize the total number of tests and simultaneously increase the specificity is to re-examine only those who were classed as positive by the first test. On these individuals, we could perform an additional one or two tests to reach a majority of two positive or two negative classifications. Call this Design 2, with Design 1 being the original plan of one test per individual. The expected results of applying this design to a sample of size n are shown in Table 2.
                        TABLE 2
    Expected Classifications by a Test with Error, Design 2

                  True +         True -           Total
    "+-+"         npu^2(1-u)     n(1-p)v(1-v)^2
    "++"          npu^2          n(1-p)(1-v)^2
    "-"           np(1-u)        n(1-p)v
    "+--"         npu(1-u)^2     n(1-p)(1-v)v^2
    Total         np             n(1-p)           n
For the previous example, these expectations can be calculated as

                  True +     True -      Total
    "+-+"          19.20       7.65      26.85
    "++"           96.00       8.50     104.50
    "-"            30.00     765.00     795.00
    "+--"           4.80      68.85      73.65
    Total         150.00     850.00    1000.00

For this design, the estimate of p is now p̂2 = (n1 + n2)/n;
for the example, its expectation is

    Q2 = (26.85 + 104.5)/1000 = .13135,

so that the bias has been reduced from +.055 to -.01865. Then

    MSE(p̂2) = Q2(1 - Q2)/n + (Q2 - p)^2 = .00046192,

which is also much smaller than for Design 1. It is still, however, 3.6 times as large as MSE(p̂0).
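The Table 2 expectations and the Design 2 MSE can be checked numerically (a sketch; the variable names are ours):

```python
# Expected Design 2 outcome counts and the MSE of the Design 2
# multinomial estimate, for p = .15, u = .8, v = .9, n = 1,000.
p, u, v, n = 0.15, 0.80, 0.90, 1000

e1 = n * (p * u**2 * (1 - u) + (1 - p) * v * (1 - v)**2)     # "+-+": 26.85
e2 = n * (p * u**2 + (1 - p) * (1 - v)**2)                   # "++" : 104.50
e3 = n * (p * (1 - u) + (1 - p) * v)                         # "-"  : 795.00
e4 = n * (p * u * (1 - u)**2 + (1 - p) * (1 - v) * v**2)     # "+--": 73.65

Q2 = (e1 + e2) / n                        # expectation of the estimate
mse_design2 = Q2 * (1 - Q2) / n + (Q2 - p)**2
print(Q2, mse_design2)                    # ≈ .13135, ≈ .00046192
```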
The sensitivity and specificity for Design 2 can also be obtained from Table 2. We have

    sensitivity = u^2(1-u) + u^2 = u^2(2 - u),
    specificity = v + v^2(1-v).

Sensitivity has decreased and specificity has increased, since some of the individuals that were at first classified as "+" are re-examined and are now classified as "-". In the example, the sensitivity drops from .8 to .768 and the specificity rises from .9 to .981. Note also that the sensitivity and specificity are functions of the design chosen and do not involve the underlying prevalence of the characteristic. Figure 1 shows the relationship of Design 2's sensitivity and specificity to the sensitivity and specificity of a single classification (Design 1).
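The design-level operating characteristics can be written as small functions (a sketch; the function names are ours):

```python
def design2_sensitivity(u):
    # Pr("++" or "+-+" | truly +) = u^2 + u^2(1-u) = u^2(2 - u)
    return u**2 * (2 - u)

def design2_specificity(v):
    # Pr("-" or "+--" | truly -) = v + v^2(1-v)
    return v + v**2 * (1 - v)

print(round(design2_sensitivity(0.80), 3))  # 0.768
print(round(design2_specificity(0.90), 3))  # 0.981
```

As the text notes, re-examining only initial positives trades a small loss in sensitivity for a substantial gain in specificity.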
This comparison of designs is not complete, since Design 2 is more costly in the sense that the total number of examinations for a given sample size is greater than for Design 1. In total, Design 2 requires

    T = 3n1 + 2n2 + n3 + 3n4

tests. From Table 2, the total expected number of tests to be performed is
FIGURE 1. Sensitivity and Specificity of Design 2 as a Function of Design 1. [Figure not reproduced; both axes run from 0.0 to 1.0.]
    E(T) = n(3(pu^2(1-u) + (1-p)v(1-v)^2) + 2(pu^2 + (1-p)(1-v)^2)
             + (p(1-u) + (1-p)v) + 3(pu(1-u)^2 + (1-p)(1-v)v^2)).

For the example this expected number of tests is

    3(26.85) + 2(104.5) + 795 + 3(73.65) = 1305.5.
The question then arises whether the increase in cost due to the extra tests is balanced by the decrease in MSE. To answer this requires a suitable cost/utility function. One such function in the literature (Chernoff and Moses 1959) for use in estimation problems is a multiplicative function of the MSE and the effort expended in obtaining the estimates, viz.,

    CRIT = MSE * C(n),

where C(n) is a "cost of sampling" function. This cost function assumes that the value of the estimate itself is inversely proportional to the MSE. Quade and McClish (1977) consider other cost/utility functions. If we ignore costs of general program administration and costs of computation, then C(n) is proportional to T. For our example, with u = .8, v = .9, p = .15, and n = 1,000, C(n) = E(T), and the criterion for Design 1 is 3.1880 and for Design 2 is .6030.

Alternately, if cost of sampling is held constant, then the cost/utility is directly proportional to the MSE. The MSE for Design 1 based on an expected sample size of E(T) = 1305 is .003149, which compares unfavorably with the MSE of .00046192 for Design 2.
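Putting the pieces together, the expected number of tests and the multiplicative criterion for both designs can be computed in one pass (a sketch; the names are ours):

```python
# Cost/utility criterion CRIT = MSE * C(n) for Designs 1 and 2,
# with C(n) taken proportional to the (expected) number of tests.
p, u, v, n = 0.15, 0.80, 0.90, 1000

# Design 1: one test per individual.
Q1 = p * u + (1 - p) * (1 - v)
mse1 = Q1 * (1 - Q1) / n + (Q1 - p)**2
crit1 = mse1 * n                                      # ≈ 3.188

# Design 2: expected outcome counts and expected total tests E(T).
e1 = n * (p * u**2 * (1 - u) + (1 - p) * v * (1 - v)**2)   # "+-+", 3 tests
e2 = n * (p * u**2 + (1 - p) * (1 - v)**2)                 # "++",  2 tests
e3 = n * (p * (1 - u) + (1 - p) * v)                       # "-",   1 test
e4 = n * (p * u * (1 - u)**2 + (1 - p) * (1 - v) * v**2)   # "+--", 3 tests
ET = 3 * e1 + 2 * e2 + e3 + 3 * e4                         # 1305.5

Q2 = (e1 + e2) / n
mse2 = Q2 * (1 - Q2) / n + (Q2 - p)**2
crit2 = mse2 * ET                                     # ≈ .6030
print(ET, crit1, crit2)
```

Despite requiring about 30% more tests, Design 2 has the smaller criterion because its MSE is nearly seven times smaller.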
1.2 Literature Review
Although there has been much written about screening and the
notions of sensitivity and specificity, there has been little written
about sequential screening procedures (i.e., ones in which each
screening test after the first is conditional on the results of the
previous test(s)), and even less dealing with the problem of having no
(definitive) standard procedure at all.
Furthermore, most of the
literature is concerned with finding cases rather than with estimating
prevalence.
Yerushalmy (1947) introduced sensitivity and specificity as
standard measures in the evaluation of a diagnostic test.
Berkson
(1947) approached the same problem as "cost utility", where utility is
the same as sensitivity and cost is the complement of specificity.
Neyman (1947) proposed a simple probability model as a basis for
evaluating the efficiency of a test.
Several other models having the
same assumptions as those of Neyman were examined and discussed by
Chiang (1951).
The distinction between the reliability of a test and
its validity was pointed out by Chiang, Hodges, and Yerushalmy (1955).
In this paper they point out specifically that Neyman's model is a
measure of reliability, a fact to which Neyman alludes.
In a now classic study of the use of a screening technique,
Yerushalmy (1953) found large variations among the results of different
readers and among the results of the same reader at different times in
the evaluation of chest x-rays for tuberculosis.
Similar results were found by Berkson et al. (1960).
Chiang (1951) deals very comprehensively with the enhancement of
specificity or sensitivity by the use of multiple readings in which all
subjects screened are given independent tests.
Chiang postulates that
for each individual being tested there is a fixed number
p which is
the probability that the individual will be termed "positive" by any
one test.
In this context, Chiang discusses different probability
models for the screening test.
Nissen-Meyer (1964) expands Chiang's
work by allowing different probability distributions for
p among the
mutually exhaustive subpopulations of true positives and true
negatives.
Both approaches, however, assume that all individuals
presented for testing are multiply screened and that a true
categorization can be established for all individuals.
Chiang, Hodges, and Yerushalmy (1955) point out that Berkson
(1947) introduced the notion that two types of error are involved in
the diagnosis, and that consideration of their probabilities is an
essential step in the statistical analysis of screening tests.
"However, this approach does not take into account what we feel to be
essential elements of the mass survey problem:
the prevalence of the
disease in the population, and the vital necessity of keeping within
reasonable limits the ratio of false positives to true positives"
(Chiang, Hodges, and Yerushalmy 1955, p. 124).
To deal with this
problem, they discuss what they term "multistage procedures," in which
subsequent stages of testing are entered following "positive" results
at earlier stages.
This is the first mention of a procedure
approaching the one that we will be considering here.
These multistage
procedures assume, however, that the final stage is definitive.
Youden (1950) introduced an index of efficiency of a screening test (J = sensitivity + specificity - 1), but the index suffers both from weighting false negative and false positive results equally and
from not taking into account the prevalence of the attribute being
screened.
Blumberg (1957) compares various indices for screening
evaluation, but none of them account for the prevalence of disease in
the evaluation.
Thorner and Remein (1961) produced a handbook for
administrators and applied health researchers that discusses screening
sensitivity and specificity in elementary terms.
They include
discussion of the enhancement of sensitivity or specificity by multiple
testing of all subjects, but do not include any evaluation of screening
procedures with respect to the prevalence of disease or the cost of
different types of errors.
Bross (1954) discusses the effect of misclassification on the
estimates of the proportion with a disease and its effect on tests
comparing these estimates between two populations.
He showed that,
assuming population membership is known without error, the
misclassification generally biases the estimate and causes the odds
ratio for comparing two populations to be reduced.
He also found that,
with constant misclassification rates in both populations, the size of
the chi-squared test was not affected, but its power was reduced.
Subsequently, Diamond and Lilienfeld (1962a, 1962b) found that the
observed association could be increased or even reversed in direction.
Newell (1962) also looked at the problem, but the three papers
generally clouded the issue with proof-by-example type arguments.
The
assumption ignored both by Diamond and Lilienfeld and by Newell was
that the two populations are known without error.
The three papers
referenced above did generate considerable interest in the problem.
Mote and Anderson (1965) discuss the effect of misclassification
on the general chi-squared test and found that to allow unrestricted
misclassification to any other category produces a model with more
unknowns to be estimated than degrees of freedom for the test.
They
consider two simplified models in detail, one of constant error
probability and another of errors allowed only between adjacent cells.
Neither of these is appropriate to screening problems, where the event
of interest is rare and the categories are generally nominal.
Assakul and Proctor (1967) examine tests of independence with
classification errors in general two-way tables and allow correlated
errors.
However, they assume that the error probabilities are known
constants.
Bryson (1965), Gart and Buck (1965), and Press (1968)
examine classification error models in which all individuals are
classified more than once.
Koch (1969) looks at the effect of non-sampling errors (including classification errors) on measures of association in 2 x 2 tables using a random response error model.
Tenenbein (1970, 1971) and
Hochberg (1977) examine double sampling schemes to deal with estimating
the error probabilities.
Their procedures have a "true" but expensive
classification applied to a subsample of those individuals screened.
They thus develop unbiased error estimates with which to adjust the
estimates of the prevalence of the characteristic of interest.
Rogan and Gladen (1978) present a model for estimating prevalence
using three independent samples, one to estimate each of sensitivity,
specificity, and prevalence.
Their procedure is straightforward, but
may be somewhat wasteful of subjects.
Vecchio (1966) defines the "predictive value" of a test as the proportion of true positives among positive test results. His model, a simple application of Bayes' Theorem, is useful in mass screening; it
accounts for the proportion of "false positives" that a test will
declare, conditional on the prevalence of the disease.
Unfortunately,
one must know the prevalence of the disease to use the model.
Sunderman and Van Soestbergen (1971) apply a compartment model to the
problem that Vecchio presents.
The presentation seems to have some
face value, but the conceptual validity of their approach has a problem
in that the term "incidence" is consistently substituted where
"prevalence" is called for.
Sunderman (1972) furthers his compartment model approach to
"multiphasic" screening and describes the use of multiple low-specific
tests to enhance specificity.
He continues with the
"incidence"l"prevalence" problem noted above.
There are several other
papers discussing the use of "multiphasic" or automated clinical
chemistry techniques in screening (Schoen and Brooks (1970), Wilson
(1973), Werner, Brooks, and Wette (1973), Galen and Gambino (1975),
Galen (1975)).
Galen and Gambino is the most clearly written of these.
All of these clinical chemistry oriented articles assume a complete set
of all measures for all screened individuals.
Galen (1975) and Galen
and Gambino (1975) also consider the implications of non-independent
tests.
Several other articles relate health screening to various
statistical and administrative problems.
Cochran and Holland (1971)
discuss the sensitivity and specificity of a test as it affects the
decision to use it in screening.
Hartz (1973) describes the use of a
multiple logistic model to determine an index from which a
classification of positive and negative can be made.
His approach
requires that all individuals be given all screening tests and does not
assume independence of tests.
Grant (1974) discusses the practical
problems of dollar-cost associated with specificity in mass screening.
Lusted (1971) discusses the use of signal detection theory and receiver
operating curves to combine sensitivity and specificity information in
the reading of x-rays.
Cochran (1968) has a survey article discussing the general effects of errors of measurement in statistics. Besides discussing errors of classification (binomial, hypergeometric, and multinomial), he discusses errors in the linear models context and the relationships between the two types of errors. Fleiss (1973) devotes two chapters to an elementary treatment of the effect of classification errors and their measurement and control.
Federer (1963) provides a large bibliography of screening
procedures for selection and allocation in a variety of disciplines.
Of the over 500 references in the bibliography, few are clearly related
to screening in medical diagnosis, some may be useful, and many do not
appear to be useful to this study.
Section 1.1 has summarized the problem of estimating p as viewed
by Quade (1976), and by Quade and McClish (1977).
Quade (1976)
examines other designs for up to three independent classification
procedures per individual screened.
He also examines randomized
designs under ML estimation and concludes that though the best possible
design has slightly lower criterion values than does the best
non-randomized design, they are not sufficiently superior to warrant
the administrative problems that they would produce.
His paper is
restricted to two results per classification procedure, two categories
of interest, and constant misclassification probabilities across all
classification procedures (i.e., the same sensitivity and specificity
for each test).
Quade and McClish (1977) developed a notational
approach and a general model which avoid all three of the restrictions.
Besides the multiplicative cost/utility function of Section 1.1, they
also examine the behavior of a linear cost/utility function,

    CRIT' = C(n) + G * MSE.
Chapter 2: A PROBLEM IN ESTIMATION

2.1 A General Model

Motivated by the previous discussion, we now present the most general model possible subject to the following constraints:

    i)  Nominal scale of measurement for each test.
    ii) Deterministic model -- the decision of whether to apply the
        next test in the sequence is determined in a fixed,
        algorithmic manner.

Let us assume that an individual can fall into any one of I >= 2 categories. We define

    θ_i = Pr(individual is in the i-th category) for i = 1, 2, ..., I,

and write the set of θ's as θ = (θ_1, θ_2, ..., θ_I)'. The goal of the analysis is to estimate θ or some function of θ.

There are available tests T1, T2, ... (some or all of which may be identical). Test T_a may produce m_a possible results t_am, for m = 1, 2, ..., m_a. These results may or may not have a one-to-one correspondence with the categories of interest. For instance, the categories may be "positive" and "negative," but the possible results for a test may be "positive," "negative," and "indeterminate." The classification procedures are applied to each individual, with the results of the previous tests being used to determine whether or not to continue testing. The final outcome of testing is a sequence of results such as {t_12, t_33, t_23}.

A design for multiple testing is the set of all possible outcomes of the testing procedure, or alternately, the set of sequences of results that cause cessation of testing. Let K = K(D) denote the number of outcomes in the design D, and index them by k = 1, 2, ..., K. Suppose, for example, that each test has 2 results, viz., t_a1 = 1 ("+") and t_a2 = 2 ("-"). Then Design 1 (classify each individual once only) has D1 = {"+", "-"} with K(D1) = 2, and Design 2 has D2 = {"+-+", "++", "-", "+--"} with K(D2) = 4.

Suppose that n individuals are available for testing. Let n_k be the number of these for whom testing yields the k-th outcome. The estimate of θ is then a function t(n), where n = (n1, n2, ..., nK)'.

To examine the multinomial estimators, we use the testing procedure to classify the K design outcomes into the I categories of interest, and then take the proportions that they define to be the estimate of θ. As already shown for Design 1 in Chapter 1, with the K = 2 design outcomes {"+", "-"} mapped onto the I = 2 categories of interest {+, -}, the multinomial estimate may be biased and no unbiased multinomial estimator may be available. As also discussed in Chapter 1, taking a larger sample size will not remove the bias, so the multinomial estimate is not consistent as n approaches infinity.
2.2 Maximum Likelihood Estimation

To define maximum likelihood estimators (MLE's), some more notation is necessary. Define

    ψ_aim = probability that application of the a-th test to an
            individual in the i-th category will yield the m-th result,

and

    Ψ = (three-dimensional) array of ψ's.

Ψ is the generalization of sensitivity and specificity. Then denote the probability that applying the entire testing procedure (the design) to a randomly chosen individual will yield the k-th outcome by P_k(θ, Ψ), so that the likelihood function derived from this model is

    L(n; θ, Ψ) = n! ∏_{k=1}^{K} [P_k(θ, Ψ)]^{n_k} / n_k!.

This is maximized with respect to θ and Ψ to produce the estimate t(n) of θ and, incidentally, an estimate of Ψ.

The maximum likelihood estimation method is asymptotically unbiased and efficient for large n provided, minimally, that the number of outcomes K is larger than the number of parameters in θ and Ψ to be estimated. The elements of θ and Ψ must also be linearly independent to prevent the estimate from being indeterminate. The maximum likelihood method is complicated to apply, however, and does not classify individuals into categories.
Using the notation in the general model described above for Design 2,

    ψ_111 = u        (sensitivity),
    ψ_112 = 1 - u,
    ψ_121 = 1 - v,
    ψ_122 = v        (specificity);

ψ_2im and ψ_3im are the same as ψ_1im for all i and m, since the same testing procedure is applied each time.

In general, the probability of obtaining the k-th outcome for an individual who is in the i-th category cannot be obtained directly from the elements of Ψ, since we do not require that the sequence of tests be independent. If, in fact, the tests in the sequence are independent according to the model (as in Designs 1 and 2), then the probability of obtaining the k-th outcome for an individual who is in the i-th category is the product of the probabilities of the sequence of results in that outcome, conditional upon the i-th category.
The probability of obtaining the k-th outcome for individuals who are in the i-th category is given below for Design 2:

    Outcome  Sequence of        Pr(k-th outcome | i-th category)
       k     Results        i = 1 (true +)                i = 2 (true -)
       1     "+-+"          ψ_111 ψ_212 ψ_311 = u(1-u)u   ψ_121 ψ_222 ψ_321 = (1-v)v(1-v)
       2     "++"           ψ_111 ψ_211 = uu              ψ_121 ψ_221 = (1-v)(1-v)
       3     "-"            ψ_112 = (1-u)                 ψ_122 = v
       4     "+--"          ψ_111 ψ_212 ψ_312 = u(1-u)(1-u)   ψ_121 ψ_222 ψ_322 = (1-v)vv

These are the same probabilities that are shown in Table 2.
With I = 2, we can set θ_1 = p and θ_2 = 1 - p. Then

    P_1(p,u,v) = pu^2(1-u) + (1-p)v(1-v)^2,
    P_2(p,u,v) = pu^2 + (1-p)(1-v)^2,
    P_3(p,u,v) = p(1-u) + (1-p)v,
    P_4(p,u,v) = pu(1-u)^2 + (1-p)(1-v)v^2,

and

    L((n1,n2,n3,n4)'; p,u,v) = n! ∏_{k=1}^{4} [P_k(p,u,v)]^{n_k} / n_k!

      = n! * (pu^2(1-u) + (1-p)v(1-v)^2)^{n1} / n1!
           * (pu^2 + (1-p)(1-v)^2)^{n2} / n2!
           * (p(1-u) + (1-p)v)^{n3} / n3!
           * (pu(1-u)^2 + (1-p)(1-v)v^2)^{n4} / n4!.
The log-likelihood is

    log(L((n1,n2,n3,n4)'; p,u,v))
      = Σ_{i=1}^{n} log(i) + Σ_{k=1}^{4} [ n_k log(P_k(p,u,v)) - Σ_{i=1}^{n_k} log(i) ].
Using the method of Nelder and Mead (1965) and the expected values for n1, n2, n3, and n4 (see Table 2 in Chapter 1), the log-likelihood was maximized to obtain the MLE of p. This MLE is somewhat artificial in that n uses expected values based on the parameters that are being estimated. The MSE of p̂ was computed as the variance of p̂ plus its squared bias. For our example in Chapter 1, with p = .15, u = .80, and v = .90, the MSE of p̂ is .0007078 and the cost/utility criterion (MSE * number-of-tests) is .9238, which compares unfavorably with the multinomial estimate's criterion for Design 2 of .6030. (Since the number of tests is the same for both the MLE and the multinomial estimate on Design 2, the comparison of the MSE's is the same as comparison of cost/utility.) This may not always be so, since the MSE of the MLE may do better than that of the multinomial estimate for different values of the parameters or other sample sizes.
2.3 Questions to be Answered

To this point we have summarized and slightly expanded the work by
Quade (1976) and Quade and McClish (1977) to allow for dependence.
McClish and Quade are currently working on a more general dependence
structure of the model.  There are several possible directions in which
to continue.

Within the general framework of the model stated above, we wish to
examine two specific models addressing these two questions in more
detail:

1)  For a given design, compare various sample sizes to
    determine the range of sample sizes for which the MLE
    provides a better estimate (i.e., smaller MSE) of θ
    than does the multinomial estimate.

2)  For a given sample size, compare the cost/utility of the
    more complicated design to that of a simpler design to
    determine their relative performance characteristics
    over a range of parameter values (θ and ψ).

Problem 1) involves a question of the precision and/or accuracy
of the estimates.  Since the MLE's use asymptotic theory to derive
their estimates, Monte Carlo procedures are necessary to properly
answer this question of "how big is big enough."  Problem 2) depends
on problem 1) to determine the sample sizes at which to make comparisons of cost/utility.
In Chapter 3, Design 2 is examined at a fairly large number of
different sample sizes for one set of parameters to determine the
initial range for further comparisons.  Design 2 is then examined for
problems 1) and 2) over a range of parameters that might be
expected in a biomedical prevalence measuring situation.  Design 3,
which involves more categories of interest and more outcomes, is
introduced and examined with respect to the MSE and cost/utility
questions described above.
Chapter 4 provides several related examples that fit the
assumptions of Design 3.
In Chapter 5, the significance of the results that have been
obtained and suggestions for further research are presented.
Chapter 3.
RESULTS
This chapter is divided into three parts:
1) the analysis by
Monte Carlo methods of Design 2, with two results and four outcomes; 2)
the analysis by Monte Carlo methods of a more complicated design,
having four results and five outcomes (Design 3); and 3) a discussion
of the problem of estimability.
The Monte Carlo methods are based on
descriptions in Hammersley and Handscomb (1964) and Kleijnen (1974).
In performing general Monte Carlo studies, data are generated according
to a known distribution of parameters and then the estimation procedure
is applied to obtain estimates of these parameters.
Finally, these
estimates are examined to see if they have the desirable properties
(such as unbiasedness or asymptotic normality) that would make the
estimates and estimation procedure useful in applications.
Unless otherwise specified, 100 replicates were performed for each
combination of parameter values and sample size, since Kleijnen (1974)
recommends a relatively small number of replicates (such as 100) to
examine the behavior of a design under different conditions (as opposed
to a relatively large number of replicates when a precise estimate for
a particular condition is desired).
Maximum likelihood estimation was
performed using the EM method (Dempster, Laird, and Rubin (1979)) for
Design 2 and a modified Newton-Raphson method of function minimization
(Subroutine ZXMIN from the International Mathematical and Statistical
Library (IMSL) - 1979) for Design 3.
In both Design 2 and Design 3 the
function to be maximized was the log-likelihood as described in
Chapter 2.
Data for each replicate were generated using a subroutine
(GGMLT) from the IMSL.
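The EM iteration for Design 2 can be sketched by treating each individual's true category as the missing datum.  The following is a minimal stdlib-Python reconstruction of that approach (the starting values, iteration count, and names are illustrative, not taken from the original programs):

```python
# Outcomes of Design 2 as (number of "+" results, number of "-" results).
OUTCOMES = {"+_+": (2, 1), "++": (2, 0), "_": (0, 1), "+__": (1, 2)}

def em_design2(counts, p=0.5, u=0.7, v=0.7, iters=2000):
    """EM for (prevalence p, sensitivity u, specificity v) from outcome counts.

    counts maps each outcome sequence to its observed (or expected) frequency.
    E-step: posterior probability w that an individual is truly positive.
    M-step: weighted proportion of positives, and of correct test results.
    """
    for _ in range(iters):
        num_p = den_p = num_u = den_u = num_v = den_v = 0.0
        for seq, nk in counts.items():
            plus, minus = OUTCOMES[seq]
            a = u**plus * (1 - u)**minus         # Pr(sequence | truly positive)
            b = (1 - v)**plus * v**minus         # Pr(sequence | truly negative)
            w = p * a / (p * a + (1 - p) * b)    # E-step posterior
            num_p += nk * w
            den_p += nk
            num_u += nk * w * plus
            den_u += nk * w * (plus + minus)
            num_v += nk * (1 - w) * minus
            den_v += nk * (1 - w) * (plus + minus)
        p, u, v = num_p / den_p, num_u / den_u, num_v / den_v
    return p, u, v
```

Run on the expected outcome frequencies for p=.05, u=.85, v=.95, the iteration recovers the generating parameters, which is the behavior the Monte Carlo study examines with sampled data.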
3.1 Analysis of Design 2
Recall from Chapter 2 that in Design 2 there are four outcomes of
testing {"+_+", "++", "_", "+ __ "} and two (2) results of interest
{+,-}.
The characteristics of the estimators are examined for
population parameter values that might be encountered in a health
screening study.
In addition, since maximum likelihood estimators are
asymptotically normally distributed, various sample sizes are examined
to determine reasonable sample sizes for using this method.
The population parameters p (prevalence), u (sensitivity), and
v (specificity) for which samples were generated are all possible
combinations of

      p       u       v
     .05     .70     .90
     .10     .85     .95
     .15    1.00    1.00 .

The sample sizes initially examined were n=500, 1,000, 5,000, and
10,000.
For p=.05,
u=.85, and v=.95,
the four sample sizes ranging
from 500 to 10,000 were examined using the Monte Carlo method with
100 replicates at each sample size.
This was to determine which
sample sizes would yield adequate fit of the estimated frequencies to
the underlying frequency distribution.
This fit for each sample was measured in two ways, one to test the
normality of the estimates of prevalence, sensitivity, and specificity,
and one to test the unbiasedness of the estimates.
The normality of
the estimates (Table 3) was tested using Lilliefors's modification of
the Kolmogorov-Smirnov goodness of fit test (Conover 1971).  The
Kolmogorov-Smirnov test statistic D-MAX is the maximum distance between
the empirical distribution function (EDF) and the hypothesized
cumulative distribution function (CDF).  The associated probability
level shown with it is for the Lilliefors test of a normal
distribution.  For n=500, only specificity is anywhere near fitting
the normal distribution (P < .10).  Only with n=10,000 do all three
parameters show a very good fit to the normal distribution
(all P > .20).
The mean and standard deviation are also shown for each
set of trials.
As the sample size increases, the D-MAX for each
estimate decreases, indicating that the estimate's distribution is
approaching normality.
The unbiasedness of the estimates is tested using the Wilcoxon
Signed-Rank test (Conover 1971) (Table 4).
Strictly speaking, the
Wilcoxon Signed-Rank test is for a given median rather than mean, and
also assumes a symmetric distribution.
The symmetry of the
distribution is partially tested by the Lilliefors test (the test will
reject for asymmetry as well as for other reasons), and if the distribution is symmetric, the median and mean are the same.
The probability
levels (P >= .33 for all estimates) for the t-test approximation to the
Wilcoxon Signed-Rank test show that any sample size of the four
considered is adequate for unbiased estimates.
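The D-MAX statistic underlying the Lilliefors procedure is computed by standardizing the sample with its own mean and standard deviation and taking the largest gap between the EDF and the fitted normal CDF.  A stdlib-Python sketch (Lilliefors's special critical values would still be needed to convert D-MAX into the probability levels reported in Table 3):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def d_max(sample):
    """Kolmogorov-Smirnov distance between the EDF and a normal CDF fitted
    with the sample mean and standard deviation (the Lilliefors setup)."""
    n = len(sample)
    m = sum(sample) / n
    s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    xs = sorted(sample)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf((x - m) / s)
        # The EDF steps from i/n to (i+1)/n at x; check the gap on both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d
```

Because the null parameters are estimated from the same sample, the standard Kolmogorov-Smirnov tables are too conservative here, which is exactly why Lilliefors's modification is used in the text.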
To make visual comparisons of the hypothesized (CDF) and empirical
(EDF) distributions examined in Table 3, graphs of the EDF's and their
associated CDF's for each estimate of prevalence, p, sensitivity, u,
and specificity, v, and each sample size are shown in Figures 2 - 13
TABLE 3
Lilliefors Test of Normality for Estimates of Prevalence,
Sensitivity, and Specificity
Among Different Sample Sizes for Design 2
(Prevalence=.05, Sensitivity=.85, Specificity=.95)

                         N*    D-MAX    Prob    Mean of     S.D.
                                                Estimates
Sample Size=500
  Prevalence            100     .17      .01      .0536     .0226
  Sensitivity           100     .13      .01      .8348     .1342
  Specificity           100     .08      .10      .9508     .0126
Sample Size=1,000
  Prevalence            100     .12      .01      .0511     .0132
  Sensitivity           100     .10      .01      .8456     .0888
  Specificity           100     .06     >.20      .9498     .0090
Sample Size=5,000
  Prevalence            100     .10      .05      .0503     .0057
  Sensitivity           100     .05     >.20      .8493     .0423
  Specificity           100     .05     >.20      .9501     .0036
Sample Size=10,000
  Prevalence            100     .05     >.20      .0502     .0036
  Sensitivity           100     .07     >.20      .8489     .0279
  Specificity           100     .05     >.20      .9500     .0028

NOTE
*  N = Number of Monte Carlo runs.
TABLE 4
Wilcoxon Signed-Rank Test for Unbiasedness of Estimates
(Prevalence=.05, Sensitivity=.85, Specificity=.95)

                         N*    Median of    Sum of**     t     Pr > |t|
                               Estimates    Ranks
Sample Size=500
  Prevalence            100      .051        2640       .39      .69
  Sensitivity           100      .858        2496      -.10      .92
  Specificity           100      .953        2809       .98      .33
Sample Size=1,000
  Prevalence            100      .049        2495      -.10      .92
  Sensitivity           100      .852        2601       .26      .80
  Specificity           100      .950        2510      -.05      .96
Sample Size=5,000
  Prevalence            100      .049        2591       .23      .82
  Sensitivity           100      .852        2540       .05      .96
  Specificity           100      .950        2580       .19      .85
Sample Size=10,000
  Prevalence            100      .050        2623       .34      .74
  Sensitivity           100      .852        2551       .09      .93
  Specificity           100      .950        2568       .15      .88

NOTES
*  N = Number of Monte Carlo runs.
** The expected value of the sum is 2525.
[FIGURE 2: Empirical and Cumulative Distribution Function for p=.05 at n=500; x-axis: Estimated Prevalence]
[FIGURE 3: Empirical and Cumulative Distribution Function for p=.05 at n=1,000; x-axis: Estimated Prevalence, 0.00 to 0.12]
[FIGURE 4: Empirical and Cumulative Distribution Function for p=.05 at n=5,000; x-axis: Estimated Prevalence]
[FIGURE 5: Empirical and Cumulative Distribution Function for p=.05 at n=10,000; x-axis: Estimated Prevalence]
[FIGURE 6: Empirical and Cumulative Distribution Function for u=.85 at n=500; x-axis: Estimated Sensitivity]
[FIGURE 7: Empirical and Cumulative Distribution Function for u=.85 at n=1,000; x-axis: Estimated Sensitivity]
[FIGURE 8: Empirical and Cumulative Distribution Function for u=.85 at n=5,000; x-axis: Estimated Sensitivity]
[FIGURE 9: Empirical and Cumulative Distribution Function for u=.85 at n=10,000; x-axis: Estimated Sensitivity]
[FIGURE 10: Empirical and Cumulative Distribution Function for v=.95 at n=500; x-axis: Estimated Specificity, 0.91 to 0.98]
[FIGURE 11: Empirical and Cumulative Distribution Function for v=.95 at n=1,000; x-axis: Estimated Specificity, 0.92 to 0.98]
[FIGURE 12: Empirical and Cumulative Distribution Function for v=.95 at n=5,000; x-axis: Estimated Specificity, 0.936 to 0.960]
[FIGURE 13: Empirical and Cumulative Distribution Function for v=.95 at n=10,000; x-axis: Estimated Specificity, 0.940 to 0.958]
(The means and standard deviations used in obtaining the CDF's are
those shown in Table 3).
On the basis of these analyses, the sample
size of n=500 was rejected as too small for further consideration.
Since n=1,000 and n=5,000 appear to yield adequate results,
n=10,000 was not considered further for Design 2.
The maximum likelihood estimation (MLE) of the prevalence, p,
provides simultaneous estimates of the sensitivity, u, and
specificity, v.  If we additionally define outcomes "+_+" and "++"
as "positive" outcomes, a multinomial estimator of p can also be
defined.  The sensitivity and specificity associated with the
multinomial estimator are not estimable from the observable margin of
the data table.
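Because the multinomial estimator is simply the observed proportion of "positive" outcomes, its expectation is P1 + P2 from Chapter 2, which makes its bias easy to see directly.  A small illustrative check (the function name is hypothetical):

```python
def design2_multinomial_expectation(p, u, v):
    """Expected value of the multinomial estimator of prevalence for Design 2:
    the probability of ending with outcome "+_+" or "++"."""
    p1 = p * u**2 * (1 - u) + (1 - p) * v * (1 - v)**2   # "+_+"
    p2 = p * u**2 + (1 - p) * (1 - v)**2                 # "++"
    return p1 + p2
```

With p=.05, u=.70, v=.95 the expectation is about .0365, well away from the true prevalence of .05; only with a perfect test (u=v=1) is the estimator unbiased.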
To test the normality of estimators for the complete range of
possibilities of prevalence, p, sensitivity, u, and specificity, v,
the Lilliefors test was used.  For n=1,000 and n=5,000, some
hypotheses of normality were accepted (P > .20) and some were rejected
(Tables 5 and 6).  For prevalence estimates (Table 5), the maximum
likelihood estimates were accepted as normally distributed 9 times
(out of 27) when n=1,000 and 12 times when n=5,000; the
multinomial estimates were accepted as normally distributed 12 times
when n=1,000 and 9 times when n=5,000.  The multinomial estimate
was accepted as normal when the MLE was not in five instances with
n=1,000 and in three instances with n=5,000.  The MLE was accepted as
normal when the multinomial was not in two instances when n=1,000 and
five times when n=5,000.
For sensitivity estimates (Table 6), the
hypothesis of normality was accepted 7 (of 24) times for n=1,000
and 12 (of 24) times for n=5,000.  For both the n=1,000 and
TABLE 5
Lilliefors Test of Normality of Estimates of Prevalence (p)
for N=100 Monte Carlo Runs for Design 2
n
Significance Level
.85
.90
1. 00
.95
.90
.95
1.00
.01
>.20
.05
>.20
.01
.20
.01
.01
.05
>.20
.01
.01
>.20
>.20
.15
.20
>.20
>.20
.10
.15
>.20
>.20
. 15
>.20
>.20
>.20
.01
.01
.05
>.20
.10
.10
>.20
.15
>.20
>.20
.20
.15
.20
>.20
.10
>.20
.10
>.20
MLE
.01
.05
.05
.05
.05
.20
>.20
>.20
.05
MULT
.05
>.20
.05
.05
.05
.20
>.20
>.20
.05
MLE
.05
>.20
> .20
.20
>.20
>.20
.01
>.20
.05
MULT
.01
>.20
> .20
.05
.05
.05
>.20
.15
.05
MLE
.10
>.20
> .20
.15
>.20
.10
>.20
.05
.20
.15
.05
.05
.15
>.20
>.20
>.20
.05
.20
u:
.70
v: .90
.95
1. 00
MLE
.01
.01
.01
.01
MULT
.01
.01
> .20
MLE
.01
.01
MULT
>.20
MLE
MULT
p
.05
1000
.10
. 15
.05
5000
.10
.15
MULT
1. 00
TABLE 6
Lilliefors Test of Normality of Estimates of Sensitivity (u) and
Specificity (v) for N=100 Monte Carlo Runs for Design 2
n
p
u:
.70
v: .90
.95
Significance Level
.85
1. 00
.90
1. 00
.95
1. 00
.90
.95
1. 00
SENSITIVITY
1000
.05
.10
.15
.05
. 10
>.20
.05
>.20
.10
.01
>.20
.20
.01
>.20
.10
.01
>.20
. 15
.05
>.20
>.20
.01
.01
.01
.01
.01
.01
5000
.05
.10
.15
>.20
>.20
>.20
>.20
>.20
. 15
>.20
>.20
>.20
.15
>.20
.20
>.20
>.20
>.20
.05
.20
.20
.01
.01
.01
.01
.01
.01
SPECIFICITY
1000
.05
.10
.15
>.20
.01
>.20
.01
.10
.10
.01
.01
.01
>.20
.15
.15
>.20
>.20
>.20
.05
. 10
>.20
> .20
>.20
> .20
>.20
>.20
>.20
5000
.05
.10
.15
>.20
>.20
>.20
>.20
>.20
.15
.01
>.20
>.20
>.20
>.20
>.20
>.20
>.20
>.20
>.20
>.20
>.20
> .20
> .20
> .20
>.20
>.20
>.20
n=5,000
groups, acceptance of normality was more common among the
moderate levels of underlying sensitivity.
For specificity estimates
(Table 6), the hypothesis of normality was accepted
for
n=1,000
and
22 (of 24)
times for
13 (of 24)
times
n=5,000.
The relatively better performance of the specificity parameter is
to be expected, since, with prevalence only .05, nearly all
individuals who are tested are truly negative, and specificity is the
probability of classifying a true negative as "negative."
Maximum likelihood and multinomial estimates of the prevalence can
be compared on the basis of the root mean squared error (RMSE) (Table 7
for n=1,000 and Table 8 for n=5,000).  In each table, the smaller
RMSE is starred.  Comparisons of the RMSE's are based on the 24
combinations of parameter estimates for which they can be different: if
there is no error, the MLE and multinomial estimates must be the same.
The multinomial estimate tends to have smaller RMSE for lower values of
p, u, v, and n, whereas the MLE tends to have the smaller RMSE for
larger values of p, u, v, and n.  The clearest difference in
performance is between sample sizes; the MLE had the smaller RMSE
12 times (of 24) for n=1,000 and 20 times (of 24) for n=5,000.

Tables 7 and 8 also show a pair of columns for cost reference to
Design 1.  One way to make the cost/utility comparisons described in
Chapter 1 is to hold the direct cost of obtaining the sample constant
and then compare the RMSE's of the two designs.  Using this approach,
the RMSE for Design 1 is smaller than that of Design 2 in five (of 24
possible) cases for n=1,000 and in only one (of 24) for n=5,000.
The poor performance of the Design 1 estimate is due primarily to its
TABLE 7
Maximum Likelihood and Multinomial Estimates
of Prevalence (p) for Design 2 (n=1,000)
Based on 100 Monte Carlo Runs

                     Prevalence           RMSE                Design 1 Reference
  p    u     v      MLE     Mult      MLE      Mult             Est      RMSE
 .05  .70   .90    .057    .050     .044     .006 *            .130     .081
            .95    .060    .036     .041     .015 *            .083     .034  +
           1.00    .056    .032     .024     .019 *            .035     .016  +
      .85   .90    .056    .059     .035     .011 *            .138     .088
            .95    .051    .046     .013     .008 *            .090     .041
           1.00    .051    .041     .010 *   .011              .042     .010  =
     1.00   .90    .053    .068     .011 *   .019              .133     .083
            .95    .051    .054     .007 *   .008              .097     .048
           1.00    .050    .050     .006 =   .006              .050     .006  =
 .10  .70   .90    .112    .080     .055     .021 *            .160     .061
            .95    .107    .068     .039     .033 *            .115     .018  +
           1.00    .105    .063     .026 *   .037              .069     .032
      .85   .90    .101    .099     .023     .009 *            .174     .075
            .95    .101    .086     .016 =   .016              .129     .031
           1.00    .101    .082     .013 *   .019              .084     .018
     1.00   .90    .102    .116     .013 *   .019              .189     .090
            .95    .101    .104     .010 =   .010              .145     .046
           1.00    .099    .099     .009 =   .009              .099     .009  =
 .15  .70   .90    .161    .111     .057     .040 *            .190     .041  +
            .95    .157    .099     .041 *   .052              .147     .010  +
           1.00    .154    .095     .028 *   .056              .104     .047
      .85   .90    .150    .139     .023     .015 *            .211     .062
            .95    .151    .128     .018 *   .024              .169     .022
           1.00    .151    .124     .015 *   .028              .127     .026
     1.00   .90    .151    .165     .013 *   .018              .234     .085
            .95    .149    .152     .012     .011 *            .191     .042
           1.00    .148    .148     .011 =   .011              .148     .011  =

NOTES
*  Smaller RMSE
=  Mult RMSE is equal to MLE RMSE
+  Reference RMSE smaller than MLE RMSE
TABLE 8
Maximum Likelihood and Multinomial Estimates
of Prevalence (p) for Design 2 (n=5,000)
Based on 100 Monte Carlo Runs

                     Prevalence           RMSE                Design 1 Reference
  p    u     v      MLE     Mult      MLE      Mult             Est      RMSE
 .05  .70   .90    .053    .050     .019     .003 *            .130     .080
            .95    .053    .036     .012 *   .014              .083     .033
           1.00    .051    .032     .008 *   .018              .035     .015
      .85   .90    .051    .060     .009 *   .010              .138     .088
            .95    .050    .046     .006     .005 *            .090     .040
           1.00    .050    .042     .004 *   .009              .042     .008
     1.00   .90    .051    .068     .005 *   .018              .145     .095
            .95    .051    .055     .004 *   .006              .098     .048
           1.00    .050    .050     .003 =   .003              .050     .003  =
 .10  .70   .90    .103    .081     .021     .020 *            .160     .060
            .95    .102    .068     .013 *   .032              .115     .016
           1.00    .101    .064     .011 *   .037              .070     .030
      .85   .90    .100    .100     .009     .004 *            .175     .075
            .95    .100    .087     .007 *   .013              .130     .030
           1.00    .100    .083     .006 *   .017              .085     .016
     1.00   .90    .101    .117     .006 *   .018              .190     .091
            .95    .101    .104     .005 *   .006              .145     .045
           1.00    .100    .100     .004 =   .004              .100     .004  =
 .15  .70   .90    .151    .112     .021 *   .039              .190     .040
            .95    .152    .099     .016 *   .051              .147     .005  +
           1.00    .151    .095     .013 *   .055              .105     .045
      .85   .90    .150    .141     .010 *   .011              .213     .063
            .95    .151    .129     .008 *   .022              .170     .021
           1.00    .151    .125     .007 *   .026              .128     .023
     1.00   .90    .151    .166     .006 *   .017              .235     .085
            .95    .151    .154     .006 *   .007              .192     .043
           1.00    .150    .150     .005 =   .005              .150     .005  =

NOTES
*  Smaller RMSE
=  Mult RMSE is equal to MLE RMSE
+  Reference RMSE smaller than MLE RMSE
large bias, as can be observed by comparing the Design 1 estimate to
the true prevalence in each table.
Maximum likelihood estimates of sensitivity, u, and specificity,
v, were obtained along with the MLE's for prevalence.  The means of
the estimates and their respective RMSE's for the 27 combinations of
p, u, and v that were considered for n=1,000 and n=5,000 are shown
in Tables 9 and 10.  The RMSE's for n=1,000 (Table 9) are all larger
than the RMSE's for n=5,000 (Table 10).  This is a consequence
primarily of the reduction in variance due to the increased sample
size.  For both sample sizes, the estimates are practically unbiased,
usually agreeing with their true parameter value to three decimal
places.  The estimates of specificity, v, have smaller RMSE's than
the estimates of sensitivity, u.  These maximum likelihood estimates
were obtained by maximizing the log-likelihood described in Chapter 2
for Design 2.
Five of the estimates of specificity in Table 9 and six in Table
10 are greater than 1.  These are probably due to boundary convergence
problems.  The computer programs written by the author did not force
each estimate to remain inside the parameter space (an oversight).
3.2 Model and Analysis -- Design 3
In discussing the implications of logical and statistical
independence, Quade and McClish (1977) describe a situation in which
logical independence is maintained (as in Design 2) but in which
statistical dependence (due to repeated testing of the same
individuals) is present:
Suppose for the sake of a simple example that half the
individuals are "easy" to classify and the other half "hard,"
TABLE 9
Maximum Likelihood Estimates of
Sensitivity (u) & Specificity (v) for Design 2 (n=1,000)
Based on 100 Monte Carlo Runs

                        Sensitivity              Specificity
  p    u     v        MLE       RMSE           MLE       RMSE
 .05  .70   .90     .7163     .18600         .8998     .02026
            .95     .6830     .15772         .9493     .02374
           1.00     .6849     .13931         .9994     .02368
      .85   .90     .8456     .13346         .9009     .01539
            .95     .8456     .08887         .9498     .00896
           1.00     .8458     .07423        1.0002     .00156
     1.00   .90     .9732     .05461         .9012     .01305
            .95     .9895     .02261         .9502     .00814
           1.00    1.0000     .00000        1.0000     .00000
 .10  .70   .90     .6947     .12420         .9042     .02773
            .95     .6899     .10427         .9503     .01537
           1.00     .6925     .09187        1.0010     .00759
      .85   .90     .8476     .07696         .9002     .01611
            .95     .8435     .06307         .9501     .00942
           1.00     .8413     .05465        1.0006     .00250
     1.00   .90     .9872     .02558         .9012     .01305
            .95     .9948     .01121         .9504     .00815
           1.00    1.0000     .00000        1.0000     .00000
 .15  .70   .90     .6938     .10328         .9021     .02274
            .95     .6911     .08215         .9513     .01478
           1.00     .6923     .07091        1.0020     .00903
      .85   .90     .8474     .05150         .9001     .01635
            .95     .8465     .04638         .9503     .00985
           1.00     .8458     .04064        1.0005     .00304
     1.00   .90     .9918     .01613         .9011     .01355
            .95     .9968     .00726         .9507     .00824
           1.00    1.0000     .00000        1.0000     .00000
TABLE 10
Maximum Likelihood Estimates of
Sensitivity (u) & Specificity (v) for Design 2 (n=5,000)
Based on 100 Monte Carlo Runs

                        Sensitivity              Specificity
  p    u     v        MLE       RMSE           MLE       RMSE
 .05  .70   .90     .7021     .09734         .9002     .00730
            .95     .6933     .07799         .9504     .00463
           1.00     .6944     .06215        1.0004     .00191
      .85   .90     .8507     .06039         .8998     .00556
            .95     .8493     .04232         .9501     .00363
           1.00     .8481     .03372        1.0001     .00063
     1.00   .90     .9864     .02563         .9003     .00466
            .95     .9937     .01165         .9503     .00313
           1.00    1.0000     .00000        1.0000     .00000
 .10  .70   .90     .6974     .06444         .9002     .00774
            .95     .6965     .04682         .9503     .00512
           1.00     .6976     .04050        1.0003     .00255
      .85   .90     .8489     .03386         .8997     .00587
            .95     .8481     .02616         .9501     .00383
           1.00     .8486     .02255        1.0001     .00087
     1.00   .90     .9937     .01160         .9004     .00484
            .95     .9968     .00577         .9504     .00310
           1.00    1.0000     .00000        1.0000     .00000
 .15  .70   .90     .7017     .04779         .8998     .00800
            .95     .6973     .04106         .9504     .00582
           1.00     .6984     .03602        1.0003     .00353
      .85   .90     .8502     .02562         .8996     .00622
            .95     .8500     .01993         .9501     .00392
           1.00     .8494     .01784        1.0002     .00116
     1.00   .90     .9959     .00739         .9006     .00526
            .95     .9979     .00376         .9504     .00323
           1.00    1.0000     .00000        1.0000     .00000
with sensitivity and specificity both equal to 1.0 for the
easy cases but only .8 for the hard ones.  Then two tests
applied independently to the same case will never disagree if
it is an easy one, but 32% of the time if it is hard, or
16% overall.  But if the individuals were "moderately
difficult" to classify, with sensitivity and specificity both
equal to .9, then two applications of the test would
disagree 18% of the time.  Thus with the same overall
sensitivity and specificity there is less disagreement if the
individuals are heterogeneous than if they are all alike:
i.e., results from seemingly independent applications of the
same test are positively correlated.  Our model deals with
this situation by explicitly providing for more categories
than would otherwise be needed: for example, we might allow
for I = 4 categories (easy positive, easy negative, hard
positive, and hard negative) with only J = 2 results
(positive and negative).
In presenting their hard/easy model, Quade and McClish call
individuals "easy" to classify if the test makes no errors in
classifying them (i.e., sensitivity and specificity are both unity).
This implies the existence of a test that is perfect for a particular
subpopulation.
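The percentages in the quoted example follow from the fact that two independent classifications, each correct with probability a, disagree with probability 2a(1-a).  A quick check (the helper name is illustrative):

```python
def disagree_prob(a):
    """Probability that two independent classifications of the same
    individual disagree when each is correct with probability a."""
    return 2 * a * (1 - a)

hard = disagree_prob(0.8)                           # the "hard" half of the population
overall = 0.5 * disagree_prob(1.0) + 0.5 * hard     # easy cases never disagree
moderate = disagree_prob(0.9)                       # a homogeneous "moderate" population
```

The mixed population disagrees less often (16%) than the homogeneous one with the same average accuracy (18%), which is the positive correlation the model must accommodate.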
To account for statistical dependence, but not assume the
existence of a perfect test, we present an alternate formulation using
what we will call an "obvious/subtle" model.  Here, if an individual is
"obvious" to classify, then all tests agree in their results, but they
may all be wrong.  In order to account for the new characteristics of
obvious and subtle, a new parameter is necessary:

  t = Pr(Obvious).

If an individual is "obvious," then the sensitivity and specificity of
the consensus is the same as that of an individual classification of a
"subtle" individual.
To deal with this situation, we propose a new Design 3 which can
be described as one classification procedure applied independently to
each tested individual four times, with each classification yielding a
positive ("+") or negative ("-") result.  Since the classifications are
indistinguishable from one another, the outcome of testing can be
viewed as the number of "+" results obtained.  The two
characteristics of interest are positive and negative.  In order to
take into account the stochastic dependence induced by homogeneous
subpopulations, we show each characteristic as subdivided into two
groups representing the difficulty of classifying an individual.  The
structure of the design is shown in the table below.
                            Characteristic
                     Positive              Negative
  # of "positive"                                           observable
  results           obvious   subtle    obvious   subtle      margin
      0               A0        B0        C0        D0          n0
      1               A1        B1        C1        D1          n1
      2               A2        B2        C2        D2          n2
      3               A3        B3        C3        D3          n3
      4               A4        B4        C4        D4          n4
The observable margin is clearly available from the outcomes of
testing.  The cells of the table are not observable, though they have
expected values in terms of the parameters of the model, which are the
parameters previously defined for Design 2 (prevalence, p,
sensitivity, u, and specificity, v) plus the new parameter t.
Since t is defined in terms of the agreement of classifications,
it follows that any individual about whom there is any disagreement of
the results of testing must be "subtle."  Thus, all "obvious" cells
except those coinciding with rows 0 and 4 have zero for their
expected values.
The expected values for the cells are shown below:

  A0 = npt(1-u)                    B0 = np(1-t)(1-u)^4
  A1 = 0                           B1 = 4 np(1-t)(1-u)^3 u
  A2 = 0                           B2 = 6 np(1-t)(1-u)^2 u^2
  A3 = 0                           B3 = 4 np(1-t)(1-u) u^3
  A4 = nptu                        B4 = np(1-t) u^4

  C0 = n(1-p)tv                    D0 = n(1-p)(1-t) v^4
  C1 = 0                           D1 = 4 n(1-p)(1-t) v^3 (1-v)
  C2 = 0                           D2 = 6 n(1-p)(1-t) v^2 (1-v)^2
  C3 = 0                           D3 = 4 n(1-p)(1-t) v (1-v)^3
  C4 = n(1-p)t(1-v)                D4 = n(1-p)(1-t)(1-v)^4 ,

where n is the sum of the observable margin (n = n0 + n1 + n2 + n3 + n4).
The integral factors in the equations above are the binomial
coefficients giving the number of possible orderings of "+" results.
Note also that for cells A0 and A4, the sensitivity factor, u,
appears just once.  This is a consequence of the "obvious" designation
of this column -- that is, the "obvious" individuals in the population
would all agree on the results of the test, no matter how often they
are tested, and add only the information of one test.  Similarly for
the specificity factor, v, in cells C0 and C4.  The sums of the
expected values in the columns are npt, np(1-t), n(1-p)t, and
n(1-p)(1-t) for obvious +, subtle +, obvious -, and subtle -,
respectively.  The sum of all expected values is n.
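The cell expectations and their column sums are easy to tabulate; a stdlib-Python sketch under the model's assumptions (names are illustrative):

```python
from math import comb, isclose

def design3_cells(n, p, u, v, t):
    """Expected cell counts A_r, B_r, C_r, D_r for r = 0..4 "+" results."""
    # Obvious individuals: all four tests agree, so only rows 0 and 4 occur.
    A = [n * p * t * (1 - u), 0, 0, 0, n * p * t * u]              # obvious positive
    C = [n * (1 - p) * t * v, 0, 0, 0, n * (1 - p) * t * (1 - v)]  # obvious negative
    # Subtle individuals: four independent tests, binomial in the "+" count.
    B = [n * p * (1 - t) * comb(4, r) * u**r * (1 - u)**(4 - r)
         for r in range(5)]                                        # subtle positive
    D = [n * (1 - p) * (1 - t) * comb(4, r) * (1 - v)**r * v**(4 - r)
         for r in range(5)]                                        # subtle negative
    return A, B, C, D
```

Summing each column recovers npt, np(1-t), n(1-p)t, and n(1-p)(1-t), and the grand total is n, as stated above.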
This model can also be described in terms of the general model
parameters θ and ψ from Chapter 2.  There are I = 4 categories
and

  θ1 = pt
  θ2 = p(1-t)
  θ3 = (1-p)t
  θ4 = (1-p)(1-t),

thus the parameter of primary interest (p) is actually (θ1 + θ2).
There are four identical tests, each with ma = 2 results:
ta1 = 1 (= "+"), ta2 = 2 (= "-").  The design allows K = 5
outcomes, thus

  D = {"____", "+___", "++__", "+++_", "++++"},

with the order of results immaterial.  The elements of ψ for any
test a are the same as those for Design 2, viz.,

  ψa11 = u
  ψa12 = (1-u)
  ψa21 = (1-v)
  ψa22 = v.
Unlike Design 2, the probability Pk(θ,ψ) that applying the entire
testing procedure to a randomly chosen individual will yield the kth
outcome is no longer the sum of the simple products of parameters
of individual tests.  Though not difficult to obtain, Pk(θ,ψ) can no
longer be computed simply by observing the outcome and then multiplying
the pieces together.
The correlation between two classifications can be derived from
the parameters in the model.  Conditional on the true classification,
the correlation can be seen to be equal to t = Pr(Obvious).  If the
individual is truly positive, the covariance of two classifications is

  Cov(1st test, 2nd test) = [tu + (1-t)u^2] - u^2 = tu(1-u)

and the variance of either test is

  Var(test) = u - u^2 = u(1-u).

Thus the conditional correlation is

  tu(1-u) / u(1-u) = t.

Replacing u's with v's yields the same correlation for a truly
negative individual.  The unconditional correlation is, however, not a
simple function of the parameters of the model.  Let

  π = pu + (1-p)(1-v),

which is the unconditional probability that an individual will be
classified positive.  Then the unconditional correlation is

  r = t + (1-t) p(1-p)(u+v-1)^2 / [π(1-π)].

If p = 0 or p = 1 then r = t.  This is appropriate, since the
unconditional correlation should then equal one or the other
conditional correlation.  If u = v = 1 then r = 1, which is also
what is expected.  In all cases, t <= r <= 1, since the fractional
term in the equation above is always non-negative.
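The closed form for r can be cross-checked numerically against the first-principles computation Corr = (E[X1 X2] - π^2) / (π(1-π)), where obvious individuals contribute perfectly agreeing pairs of results and subtle individuals independent ones.  A sketch under the model's assumptions (function names are illustrative):

```python
def corr_two_tests(p, u, v, t):
    """Unconditional correlation of two classifications, from first principles."""
    pi = p * u + (1 - p) * (1 - v)   # Pr(classified positive)
    # E[X1 X2]: obvious individuals give identical results, subtle ones independent.
    e12 = (p * (t * u + (1 - t) * u**2)
           + (1 - p) * (t * (1 - v) + (1 - t) * (1 - v)**2))
    return (e12 - pi**2) / (pi * (1 - pi))

def corr_closed_form(p, u, v, t):
    """The same correlation written as t plus a non-negative fractional term."""
    pi = p * u + (1 - p) * (1 - v)
    return t + (1 - t) * p * (1 - p) * (u + v - 1)**2 / (pi * (1 - pi))
```

The two agree for any admissible (p, u, v, t), and both stay in the interval [t, 1].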
For comparison purposes, a multinomial estimate can be obtained
for Design 3.  Here, three or four "positive" results for an individual
are used to classify that individual overall as "positive" [i.e.,
(n3 + n4)/n].  The Lilliefors test of normality of estimates of
prevalence (p) for Design 3 is shown in Table 11.  The multinomial
and MLE estimates behave in a similar fashion overall, with nearly the
same number of tests having P > .20 for each.  In general, more of
the tests were accepted for the larger sample sizes than the smaller.
In the n=5,000 section of the table there may be a tendency for the
estimates to behave more normally as the size of the prevalence gets
larger.  The behavior at n=1,000 seems to be consistently bad, at
n=10,000 it seems consistently good.  The number of Monte Carlo runs
completed for each different combination of parameters is shown in this
table.  For each combination of parameters, 100 Monte Carlo runs were
attempted, but some failed due to numerical difficulties in the search
algorithm.  Fewer failures occurred for the larger sample sizes than
for the smaller.  Table 12 shows the Lilliefors tests for the estimates
of sensitivity, specificity, and Pr(Obvious).

The estimates of prevalence (p) and their root mean square errors
(RMSE) for n=1,000, 5,000, and 10,000 are shown in Tables 13, 14, and
15, respectively.  In each table the smaller RMSE is starred.  For
n=1,000, the multinomial estimator has a smaller RMSE than the MLE in
13.5 of the 24 combinations, and for n=10,000, the MLE has the
smaller RMSE in 20 of the 24 cases.  For n=1,000 (Table 13), the
multinomial estimator appears to perform better than the MLE for
smaller u and smaller p.  The different values of v and t do not
have any clear effect on the size of the RMSE for either the
multinomial estimate or the MLE.  For n=5,000 (Table 14) and
n=10,000 (Table 15) the maximum likelihood estimator is preferred in
TABLE 11
Lilliefors Test of Normality for Design 3 Estimates of Prevalence (p)
[Lilliefors p-values for the MLE and multinomial (Mult) estimates of prevalence, over the 24 combinations of p = .05, .10, .15; u = .70, .85; v = .90, .95; t = .60, .80; for n = 1,000, 5,000, and 10,000.  N = number of Monte Carlo runs completed (100 were begun).  The tabular values are garbled in this copy and are not reproduced.]
TABLE 12
Lilliefors Test of Normality for Design 3 Estimates of Sensitivity (u), Specificity (v), and Pr(Obvious) (t)
[Lilliefors p-values for the ML estimates of u, v, and t, over the same 24 parameter combinations, for n = 1,000, 5,000, and 10,000.  The tabular values are garbled in this copy and are not reproduced.]
TABLE 13
Design 3 Estimates of Prevalence (p) and their RMSE's with Reference RMSE for n = 1,000 per Monte Carlo Run
[Columns: p, u, v, t; N = Monte Carlo runs completed (100 were begun); MLE and multinomial (Mult) estimates of prevalence with their RMSE's; and the equal-cost Design 1 reference estimate and RMSE.  NOTES: * smaller RMSE; = RMSE's are equal; + Design 1 RMSE smaller than MLE RMSE.  The tabular values are garbled in this copy and are not reproduced.]
TABLE 14
Design 3 Estimates of Prevalence and their RMSE's with Reference RMSE for n = 5,000 per Monte Carlo Run
[Columns: p, u, v, t; N = Monte Carlo runs completed (100 were begun); MLE and multinomial (Mult) estimates of prevalence with their RMSE's; and the equal-cost Design 1 reference estimate and RMSE.  NOTES: * smaller RMSE; = RMSE's are equal; + Design 1 RMSE smaller than MLE RMSE.  The tabular values are garbled in this copy and are not reproduced.]
TABLE 15
Design 3 Estimates of Prevalence and their RMSE's with Reference RMSE for n = 10,000 per Monte Carlo Run
[Columns: p, u, v, t; N = Monte Carlo runs completed (100 were begun); MLE and multinomial (Mult) estimates of prevalence with their RMSE's; and the equal-cost Design 1 reference estimate and RMSE.  NOTES: * smaller RMSE; + Design 1 RMSE smaller than MLE's RMSE.  The tabular values are garbled in this copy and are not reproduced.]
almost all cases.  As the sample size increases above n=1,000, the MLE becomes the clearly preferred estimator on the basis of the RMSE.
Tables 13, 14, and 15 show an additional pair of columns for cost reference to Design 1.  The Design 1 estimate and RMSE are based on four times the sample size of the Design 3 estimate, since each individual was tested exactly four times.  Having obtained equal direct costs, the RMSE's are now comparable.  For n=1,000 (4,000 for Design 1) (Table 13), the RMSE for Design 1 is smaller than that for Design 3 10 of 24 times; for n=5,000 (Table 14), 2 of 24; and for n=10,000 (Table 15), 3 of 24.  The poor performance of the Design 1 estimate for n=5,000 and n=10,000 is due primarily to its large bias, as can be observed by comparing the Design 1 estimates to the true prevalence in each table.  Note also that the RMSE for Design 1 is at its minimum for p = .15, u = .70, and v = .95.  This is very near an "essentially fortuitous circumstance" (Quade et al., 1980) at which the bias is zero for Design 1 (p = .14, u = .70, v = .95 is one such point).
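The Design 1 reference columns and the zero-bias point can both be checked from the single-test relation: the apparent prevalence is q = pu + (1-p)(1-v), which is also the Design 1 estimate of p, so over N subjects its RMSE is sqrt((q-p)^2 + q(1-q)/N); setting q = p gives p = (1-v)/(2-u-v).  A small sketch (illustrative, not the dissertation's code):

```python
import math

def design1_ref(p, u, v, n_subjects):
    """Single-test (Design 1) apparent prevalence and its RMSE as an
    estimator of p over n_subjects subjects (bias plus binomial variance)."""
    q = p * u + (1 - p) * (1 - v)      # apparent prevalence = Design 1 estimate
    rmse = math.sqrt((q - p) ** 2 + q * (1 - q) / n_subjects)
    return q, rmse

# Table 13's reference columns use 4,000 subjects (four tests' cost at n=1,000):
est, rmse = design1_ref(p=0.05, u=0.70, v=0.90, n_subjects=4000)
print(f"{est:.3f} {rmse:.3f}")  # 0.130 0.080

# Zero-bias point: apparent prevalence equals true prevalence when
# p = (1 - v) / (2 - u - v); for u = .70, v = .95 this is .05/.35 = .142...
print(round((1 - 0.95) / (2 - 0.70 - 0.95), 2))  # 0.14
```

This reproduces the .130 estimate and .080 RMSE of a Table 13 reference entry and the p = .14 zero-bias point noted above.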
In Table 15, one entry in the MLE's RMSE column deserves special attention.  For p=.10, u=.85, v=.90, and t=.80, the MLE's RMSE is .081, which is on the order of ten times as large as any other RMSE in the MLE column.  Further investigation of the Monte Carlo results for this combination of parameters showed that in one of the 100 Monte Carlo runs the parameter estimates were very far from their respective parameter values p, u, and v.  The estimates in this instance are .91, .15, and .11.  These are approximately (1-p), (1-u), and (1-v).  If these estimates are replaced by their complements .09, .85, and .89, then the MLE of p is .100 and the RMSE of p is .009, which is comparable to the other RMSE's of p in its order of magnitude.  Similarly, the MLE of u is .852 and its RMSE is .034, and the MLE of v is .900 and its RMSE is .009.  Both of these RMSE's are of the same order of magnitude as others for their sample size.
The search routine was to maximize the likelihood (for
computational convenience, the equivalent of minimizing the negative of
the log-likelihood was used).
The log-likelihood obtained agreed with
all of the others in the group of Monte Carlo runs for this combination
of parameter values to five decimal places, which is the amount of
precision requested of the search routine for all runs.
The
log-likelihood that was obtained for the final estimates was indeed
larger than that which would have been obtained if the estimates were
equal to the parameter values and using the observed marginals.
The
search routine also took about ten times as long to converge as it did
for the other Monte Carlo runs in this group.
It appears that the complementary estimates obtained here may be an example of the well-known result for simple screening tests (Rogan and Gladen, 1978) that if p, u, and v are solutions, then (1-p), (1-u), and (1-v) are also solutions.  The search algorithm should have been tuned to prevent this occurrence.
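This label-switching ambiguity is easy to exhibit numerically.  For the simple two-component mixture underlying repeated testing (the Pr(Obvious) parameter is omitted here for clarity), relabeling the diseased and nondiseased components as (1-p, 1-v, 1-u) — which for u near v is approximately the complements noted above — leaves every cell probability unchanged.  A minimal sketch:

```python
import math

def pr_j_of_4(j, p, u, v):
    """P(exactly j of 4 independent tests positive) under the two-component
    mixture: diseased (weight p, per-test rate u) and nondiseased
    (weight 1-p, per-test false-positive rate 1-v)."""
    return math.comb(4, j) * (p * u**j * (1 - u)**(4 - j)
                              + (1 - p) * (1 - v)**j * v**(4 - j))

p, u, v = 0.10, 0.85, 0.90
# Relabeling 'diseased' <-> 'nondiseased' yields a second, equally likely
# parameter set with identical observable consequences:
p2, u2, v2 = 1 - p, 1 - v, 1 - u
for j in range(5):
    assert abs(pr_j_of_4(j, p, u, v) - pr_j_of_4(j, p2, u2, v2)) < 1e-12
print("identical cell probabilities for both parameter sets")
```

A maximizer started near the "wrong" solution can therefore converge to it with exactly the same likelihood, which is consistent with the behavior of the aberrant Monte Carlo run described above.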
The ML estimates of sensitivity (u), specificity (v), and Pr(Obvious) (t) and their RMSE's for n=1,000, 5,000, and 10,000 are shown in Tables 16, 17, and 18, respectively.  They were obtained as correlates to the MLE's for prevalence shown in Tables 13, 14, and 15.  The means of the estimates and their respective RMSE's for the 24 combinations of p, u, and v that were considered for each of n=1,000, n=5,000, and n=10,000 are shown in Tables 16, 17, and 18,
TABLE 16
Design 3 Estimates of Sensitivity, Specificity, and Pr(Obvious) and their Root-MSE's for n = 1,000 per Monte Carlo Run
[ML estimates of u, v, and t with their RMSE's, by p, u, v, and t.  The tabular values are garbled in this copy and are not reproduced.]
TABLE 17
Design 3 Estimates of Sensitivity, Specificity, and Pr(Obvious) and their Root-MSE's for n = 5,000 per Monte Carlo Run
[ML estimates of u, v, and t with their RMSE's, by p, u, v, and t.  The tabular values are garbled in this copy and are not reproduced.]
TABLE 18
Design 3 Estimates of Sensitivity, Specificity, and Pr(Obvious) and their Root-MSE's for n = 10,000 per Monte Carlo Run
[ML estimates of u, v, and t with their RMSE's, by p, u, v, and t.  The tabular values are garbled in this copy and are not reproduced.]
respectively.  The RMSE's for n=1,000 (Table 16) are larger than those for n=5,000 (Table 17), which are larger than those for n=10,000 (Table 18) (with the one exception of the RMSE's of u and v for p=.10, u=.85, v=.90, and t=.80, which was discussed above).  This is a consequence primarily of the reduction in variance due to the increased sample size.  The estimate of specificity (v) has a smaller RMSE than the estimate of sensitivity (u) within each combination of parameters.  This is due in part to the prevalence being less than .5, which causes there to be more true negatives in the sample, and thus relatively better estimates of specificity can be obtained.
3.3 A Problem of Estimability

In examining a design such as the two that have been discussed in this chapter, with nearly as many parameters as data with which to estimate them, the question of estimability of the system arises.  The data are a complicated function of the parameters, so an explicit solution is not available.  We can prove that a system is not estimable if we show that at least one parameter is a function of others, but it is difficult to show that a design is estimable.

To attempt to determine if a design was estimable at some points, the method of Jacobians was used to determine if the sample space was singular at selected points.  A singularity implies that a design is inestimable, though the converse is not always true.  In Design 3, all of the starting points and all of the points that were true parameter values were examined by this method and found to be nonsingular.
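A numerical version of the Jacobian check can be sketched for the simpler mixture without the Pr(Obvious) parameter (the model and tolerances here are illustrative, not the dissertation's actual routine): evaluate the map from the parameters to the cell probabilities and test whether its Jacobian has full column rank at the point in question.

```python
import math

def cell_probs(p, u, v):
    """P(j of 4 independent tests positive), j = 0..4, under the simple
    mixture: diseased (weight p, rate u), nondiseased (weight 1-p, rate 1-v)."""
    return [math.comb(4, j) * (p * u**j * (1 - u)**(4 - j)
                               + (1 - p) * (1 - v)**j * v**(4 - j))
            for j in range(5)]

def jacobian_rank(p, u, v, h=1e-6, tol=1e-8):
    """Rank of the numerical (central-difference) Jacobian of the cell
    probabilities with respect to (p, u, v); full rank (3) means no
    singularity is detected at this point."""
    theta = [p, u, v]
    cols = []
    for i in range(3):
        hi, lo = theta[:], theta[:]
        hi[i] += h
        lo[i] -= h
        a, b = cell_probs(*hi), cell_probs(*lo)
        cols.append([(x - y) / (2 * h) for x, y in zip(a, b)])
    # Gaussian elimination on the transpose (row rank equals column rank)
    rows = [c[:] for c in cols]
    rank = 0
    for col in range(5):
        pivot = next((r for r in range(rank, len(rows))
                      if abs(rows[r][col]) > tol), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank:
                f = rows[r][col] / rows[rank][col]
                rows[r] = [x - f * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank

print(jacobian_rank(0.10, 0.85, 0.90))  # 3: nonsingular at this point
```

Note that a nonsingular Jacobian rules out only local problems; the complementary-solution ambiguity discussed earlier is a global one and is invisible to this check, which is one way the converse can fail.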
In Appendix A, a design is discussed that had several desirable attributes and could not be shown to be not estimable, but whose numerical behavior indicated that it indeed was not an estimable design.  The result is important in that it shows that much work must be done on a design to satisfy oneself that it is adequate before proceeding with further analyses.
Chapter 4.  EXAMPLE

The presence of hypertension in an individual can be measured using a sphygmomanometer to measure systolic and diastolic blood pressure in millimeters of mercury.  The technique for determining hypertension is subject to error from a number of sources.  The first source is person-specific: blood pressures vary depending on age and race, time of day, posture (standing, sitting, or lying) and how long in that posture, length of time since physical activity, and psychological state.  A second source of error is related to the person measuring the blood pressure, whose skill level may affect the reading and whose digit preferences may produce bias in recording the reading.

The Lipids Research Clinics (LRC) Prevalence Study measured blood pressure four times in succession under a highly structured protocol during each subject's first follow-up visit (Visit 2, in the LRC designation).  The details of the protocol may be obtained from Williams et al. (1980).  In order to induce a certain homogeneity into the population and comparability into the analyses, the blood pressure measurements were restricted to those for whites (male and female) aged 20 through 54.  There were 7105 people in this group, of whom 7074 had all four blood pressure measurements taken.  Though a simplification of the LRC protocol, for the purpose of these analyses the four measurements are treated as having been performed sequentially under identical conditions.
If a cutpoint is specified, the results of testing can be positive ("+") or negative ("-") for hypertension for each measurement of blood pressure.  The data thus fit the assumptions of Design 3, with the number of readings above a certain threshold (i.e., the number of "+" results) determining the outcome of the testing procedure.

Several thresholds have been proposed as markers for severe hypertension.  Some commonly discussed ones are: systolic blood pressure >=140 mm Hg, diastolic blood pressure >=90 mm Hg, systolic blood pressure >=160 mm Hg, diastolic blood pressure >=95 mm Hg, both systolic blood pressure >=140 mm Hg and diastolic blood pressure >=90 mm Hg, and both systolic blood pressure >=160 mm Hg and diastolic blood pressure >=95 mm Hg.  The categorization of these six methods of separating severe hypertensives from others for the 7074 subjects with complete data is shown in Table 19, and the parameter estimates for the six thresholds are shown in Table 20.
As can be seen from Table 20, the parameter estimates obtained from the LRC data are not all in the range previously considered in the Monte Carlo studies.  However, since they were, with the exception of specificity, less extreme than those considered above, the procedure converged with no numerical difficulties.

The diastolic blood pressure cutpoints in Table 20 select a few more individuals as positive than do the systolic blood pressure cutpoints.  The specificity (v) is quite high, indicating that most individuals were consistently below the cutpoints.  The sensitivity (u) is lower than is expected, perhaps because more of these persons were borderline to the cutpoints.  Shown also in Table 20 are entries for multinomial estimates of prevalence of hypertension obtained by
TABLE 19
Blood Pressure Categorizations for 7074 White Subjects Aged 20-54 from All Lipids Research Clinics

Threshold                        0-"+"   1-"+"   2-"+"   3-"+"   4-"+"
Sys BP >= 140                     6049     265     210     153     397
Dias BP >= 90                     5560     441     281     275     517
Sys BP >= 160                     6896      51      36      21      70
Dias BP >= 95                     6384     249     136      94     211
Sys BP >= 140 & Dias BP >= 90     6354     231     136     128     225
Sys BP >= 160 & Dias BP >= 95     6939      46      30      15      44
TABLE 20
Design 3 Hypertension Parameter Estimates for Lipids Research Clinics Data

Threshold                        prevalence p   prevalence p   sensitivity   specificity   Pr(Obvious)
                                    (MLE)          (Mult)           u             v             t
Sys BP >= 140                        .16            .08            .53           .99           .52
Dias BP >= 90                        .19            .11            .60           .97           .42
Sys BP >= 160                        .03            .01            .47          1.00-          .56
Dias BP >= 95                        .09            .04            .51           .99           .45
Sys BP >= 140 & Dias BP >= 90        .09            .05            .59           .99           .40
Sys BP >= 160 & Dias BP >= 95        .02            .01            .63          1.00-          .50
classifying those with three or four positive results as positive and the remainder as negative, as was done in the comparisons of Design 3 in Section 3.2.  The multinomial estimates of prevalence are considerably smaller than the maximum likelihood estimates.  This is not consistent with the results obtained by the Monte Carlo studies.  The reduced sensitivity may have had some influence here.
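The multinomial entries in Table 20 can be reproduced directly from the counts in Table 19; a quick check for the four single-criterion thresholds:

```python
# Multinomial estimate of prevalence from the Table 19 counts: a subject is
# classified positive when 3 or 4 of the 4 readings exceed the threshold,
# so the estimate is (n3 + n4) / n.
counts = {  # subjects with 0, 1, 2, 3, 4 readings above threshold
    "Sys BP >= 140": [6049, 265, 210, 153, 397],
    "Dias BP >= 90": [5560, 441, 281, 275, 517],
    "Sys BP >= 160": [6896, 51, 36, 21, 70],
    "Dias BP >= 95": [6384, 249, 136, 94, 211],
}
for threshold, n in counts.items():
    assert sum(n) == 7074
    print(f"{threshold}: {(n[3] + n[4]) / sum(n):.2f}")
```

This reproduces the .08, .11, .01, and .04 multinomial entries of Table 20.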
If the sample obtained were homogeneous, then the Pr(Obvious) parameter would be superfluous to the parameterization of the problem.  The estimates of p, u, and v shown in Table 21 were obtained from Design 3 with the value of Pr(Obvious) forced to be zero.  That is, all individuals are subtle to classify.  The estimates of p, u, and v obtained here may be compared to those of Table 20.  Estimates of p are smaller, those of u are larger, and those of v are virtually the same.  These values of sensitivity are now in the range considered in Section 3.2.  It is not known whether this sample is homogeneous or heterogeneous with respect to the ability to classify individuals into blood pressure groups.  Another thing that is not known about these individuals, but which must be taken into consideration when comparing these data to other screening data, is their status with respect to anti-hypertensive medication.  It is quite likely that some of these individuals were taking medicine to lower their blood pressure.
TABLE 21
Design 3 Hypertension Parameter ML Estimates for Lipids Research Clinics Data with Pr(Obvious) Forced to be Zero

Threshold                        prevalence p   sensitivity u   specificity v
Sys BP >= 140                        .11             .80             .99
Dias BP >= 90                        .15             .79             .98
Sys BP >= 160                        .02             .80            1.00-
Dias BP >= 95                        .06             .78             .99
Sys BP >= 140 & Dias BP >= 90        .07             .78             .99
Sys BP >= 160 & Dias BP >= 95        .01             .76            1.00-
Chapter 5.  SUMMARY AND RECOMMENDATIONS
The method of multiple testing of individuals can be used to more
effectively estimate the prevalence of a characteristic in the
population.
The maximum likelihood method presented to obtain a single
estimate from multiple tests provides unbiased, asymptotically normal
estimates of prevalence, sensitivity, and specificity.
If each
individual is tested a sufficient number of times, then it is also
possible to obtain a measure of the association between two tests
applied to the same individual and to use this information to account
for the problems associated with the repeated use of the same
individuals.
The MLE's generally perform better than the analogous
multinomial estimator, though the improved performance requires larger
sample sizes than would be preferred.
This work has opened several questions of interest for further
research.
1)  Is there some way, theoretical or numerical, to determine whether or not a design is estimable?  The usefulness of the technique would be enhanced considerably if an investigator could be certain a priori that the design with which he is working is estimable.
2)  This work has dealt with tests which have the same underlying sensitivity and specificity.  It appears that arbitrarily different values for each test could not be estimated, because the number of parameters to be estimated would increase faster than the number of outcomes of testing.  How much leeway could be allowed: if four tests were done, could one (or two) have different parameters from the others?  Perhaps combining a sensitive (but non-specific) test with several specific (but non-sensitive) tests would produce an estimate with lower bias than having all tests have the same levels of sensitivity and specificity.
3)
We have dealt only with binary results of each test.
The
model is designed to be general enough to handle any set of
nominal results.
How will a design similar to Design 3
behave with tests that have three or four results?
4)
Tests frequently yield ordinal or continuous results.
What
modifications of the model are necessary to be able to take
advantage of the additional information that ordinal or
continuous results provide?
Can ordinal and nominal tests be
combined in one design?
5)  The design described in Appendix A actually extends the current general model by the use of population strata and the combination of parameters across strata.  Since many large sampling problems do involve dealing with samples from different strata, can this extension to the model be made to work?  It would also be of interest to know whether the asymptotic normality of the MLE's requires a large number of subjects per stratum or simply a larger number of subjects overall.
APPENDIX A.
AN INESTIMABLE DESIGN
This design occurred as a natural extension of Design 2 of Chapter 3.  There were seven outcomes to be considered and five parameters to be estimated.  If the first test yields a "positive" result, then perform additional tests until a majority has been reached.  If the first test is "negative," let that stand, except that a certain proportion of these initial "negatives" are also to be re-tested until a majority is reached.  The five parameters of the Design are the four of Design 3 (prevalence (p), sensitivity (u), specificity (v), and Pr(Obvious) (t)) and one additional to account for the initial "negatives" that are re-tested: g = Pr{re-test given initial "-"}, where g can be estimated directly from the margin.  The conceptual table for this design is shown below.
[Conceptual table: seven rows, one per testing outcome (among them "-", "-+-", and "-++"), with columns Positive (Obvious, Subtle) and Negative (Obvious, Subtle); the full set of row labels is garbled in this copy.]  The cell entries are designated by Ai, Bi, Ci, and Di, and the observable margin is represented by the ni.  The theoretical cell probabilities are shown below; to obtain the theoretical cell frequencies, multiply each entry by n:
A1 = 0
A2 = 0
A3 = ptu
A4 = pt(1-u)(1-g)
A5 = 0
A6 = 0
A7 = pt(1-u)g
B1 = p(1-t)...(1-u)     [entry partially garbled in this copy]
B2 = p(1-t)...(1-u)g    [entry partially garbled in this copy]
[The remaining entries (B3-B7 and the C and D columns) are not recoverable from this copy.]
In this design, the estimate of g can be obtained directly from the margins, and thus the multinomial and maximum likelihood estimates of g are the same.  Once the parameter g has been estimated, the "+-+" and "-++" rows, as well as the "+--" and "-+-" rows, cannot be distinguished by the model.  Effectively, this leaves five outcomes and five unknowns, i.e., an indeterminate model.  It can be shown that there is a linear dependency among the row margins.
To increase the number of outcomes with less of an increase in the number of parameters, we may modify the design to incorporate two strata.  In doing so, we allow different prevalences (call them p1 and p2) in the strata, but the same values for the error rates (u and v) and for t and g.  This increases the total number of outcomes to ten in the new design and the number of parameters to six, leaving two degrees of freedom to test the goodness-of-fit of the model (two were lost to the two strata totals).
This design was then examined very closely for any linear dependencies in the parameters, and none were found.  It was then examined using the Jacobian method described in Section 3.3, for all of the starting points for estimation and also for all of the parameter values, and it was not found to be singular at any of them.  The EM method was used to estimate the parameter values from a variety of starting points.  It converged for each starting point, but not, however, to the same end points.  Even when the starting points were very "close" to the parameter values, the EM method would not converge to the parameters.  (If the starting point were the parameter values, then it would converge to them, but that is not a very useful property.)  Dawid and Skene (1979) have pointed out that the EM method can converge to a local, rather than a global, maximum.  The variety of "solutions" found by the EM method implied a more serious problem in the estimation procedure.
In order to determine if the programming of the EM method were at fault, the analysis was re-programmed using the modified Newton-Raphson algorithm from the IMSL (subroutine ZXMIN).  The subroutine from the IMSL refused to converge at all, providing the first concrete evidence that there was something seriously wrong with the model.  (A difference in the use of the EM method and the Newton-Raphson should be noted.  The EM method maximizes the expectation of the parameters.  Therefore, an explicit statement of the likelihood function is not necessary to use the method (an advantage in ease of use for many cases).  The Newton-Raphson method requires an explicit single-valued function to minimize: here, the negative of the log-likelihood.)  The desirable property of refusing to converge when the data would not allow convergence is a feature that led to the use of the ZXMIN subroutine for all subsequent analyses (i.e., Design 3), even though it required the explicit derivation of the likelihood function.
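For concreteness, an EM iteration for the simpler repeated-testing mixture (without Pr(Obvious) or the re-test parameter g) looks as follows.  This is an illustrative sketch, not the dissertation's program, and it exhibits the convenience noted above: no explicit likelihood function is ever written down.

```python
import math

def em_fit(counts, p=0.2, u=0.7, v=0.8, iters=1000):
    """EM for the two-component binomial mixture behind repeated testing:
    counts[j] = number of subjects with j of 4 tests positive; diseased
    subjects test '+' with rate u, nondiseased with rate 1 - v."""
    n = sum(counts)
    for _ in range(iters):
        # E-step: posterior probability that a subject with j positives
        # is truly diseased
        r = []
        for j in range(5):
            a = p * math.comb(4, j) * u**j * (1 - u)**(4 - j)
            b = (1 - p) * math.comb(4, j) * (1 - v)**j * v**(4 - j)
            r.append(a / (a + b))
        # M-step: re-estimate the mixing weight and per-test rates
        wd = sum(c * rj for c, rj in zip(counts, r))   # expected diseased
        wn = n - wd
        p = wd / n
        u = sum(c * rj * j for j, (c, rj) in enumerate(zip(counts, r))) / (4 * wd)
        v = 1 - sum(c * (1 - rj) * j
                    for j, (c, rj) in enumerate(zip(counts, r))) / (4 * wn)
    return p, u, v

# Fit to the expected counts for 10,000 subjects at p=.10, u=.85, v=.90:
tp, tu, tv = 0.10, 0.85, 0.90
counts = [10000 * math.comb(4, j) * (tp * tu**j * (1 - tu)**(4 - j)
                                     + (1 - tp) * (1 - tv)**j * tv**(4 - j))
          for j in range(5)]
p_hat, u_hat, v_hat = em_fit(counts)
print(round(p_hat, 3), round(u_hat, 3), round(v_hat, 3))
```

With well-chosen starting values the iteration settles on one of the two equally likely solutions discussed earlier, which is precisely why the complementary-solution ambiguity matters for any EM-based fit of this model.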
There was no observable deficiency in the design, and the computing algorithm was changed to a general one that was publicly available.  Still, there is a problem with the model that has not been found.  Several people in the Department of Biostatistics assisted in the search for a design error.  This points to the need for caution and for careful scrutiny of any prospective model that is proposed for use with this method.
Conover, W.J. (1980).
Practical Nonparametric Statistics, Second
Edition, John Wiley & Sons, New York.
Dawid, A.P. and Skene, A.M. (1979).
Maximum likelihood estimation of
observer error rates using the EM algorithm.
Applied Statistics,
28:20-28.
•
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977).
Maximum
likelihood from incomplete data via the EM algorithm.
Journal of
the Royal Statistical Society, Series B, 39:1-22
Diamond, E.L. and Lilienfeld, A.M. (1962a).
Effects of errors in
classification and diagnosis in various types of epidemiologic
studies.
American Journal of Public Health, 52:1137-1144.
Diamond, E.L. and Lilienfeld, A.M. (1962b).
Misclassification errors
in 2 x 2 tables with one margin fixed: some further comments.
American Journal of Public Health, 52:2106-2110.
Federer, W.T. (1963).
Procedures and designs useful of screening
material in selection and association, with bibliography.
Biometrics, 19:553-587.
Galen, R.S. (1975).
Multiphasic screening and biochemical profiles,
state of the art, 1975.
Progress in Clinical Pathology, 6:83-110.
Galen, R.S. and Gambino, S.R. (1975).
Beyond Normality: The Predictive
Value of Efficiency of Medical Diagnoses.
John Wiley &Sons, New
York.
Grant, J.A. (1974).
Qualitative evaluation of a screening program.
American Journal of Public Health, 64:66-71.
Hammersley, J.M. and Handscomb, D.C. (1964).
Methuen &Co., London.
Monte Carlo Methods,
•
,
83
Hartz, S.C. (1973).
A statistical model for assessing the need for
medical care in a health screening program.
Clinical Chemistry,
19:113-116.
Hochberg, Y. (1977).
On the use of double sampling schemes in
analyzing categorical data with misclassification errors.
•
Journal
of the American Statistical Association, 72:914-921.
International Mathematical & Statistical Library, Seventh Edition
(1979). IMSL, Inc., Houston, Texas.
Kleijnen, J.P. (1974).
Statistical Technigues in Simulation.
Marcel
Dekker, Inc. New York.
Koch, G.G. (1969).
The effect of non-sampling errors on measures of
association in 2 x 2 contingency tables.
Journal of the American
Statistical Association, 64:852-863.
Lusted, L.B. (1971).
Decision-making studies in patient management.
New England Journal of Medicine, 284:416-424.
Mote, V.L. and Anderson, R.L. (1965).
An investigation of the effect
of misclassification on the properties of Chi-square tests in the
analysis of categorical data.
Nelder, J.A. and Mead, R. (1976).
minimization.
Newell, D.J. (1962).
epidemiology.
Neyman, J. (1947).
A simplex method for function
Computer Journal, 7:308-313.
Errors in the interpretation of errors in
American Journal of Public Health, 52:1925-1928.
Outline of statistical treatment of the problem of
medical diagnosis.
Nissen-Meyer, S. (1964).
diagnosis.
Biometrika, 52:95-109.
Public Health Reports, 62:1449-1456.
Evaluation of screening tests in medical
Biometrics, 20:730-755.
84
Quade, D. (1976). On designing the multiple-read system for a simple model. Nipnote 11, SENIC Project, School of Public Health, Chapel Hill, N.C.

Quade, D., Lachenbruch, P.A., Whaley, F.S., McClish, D.K., and Haley, R.W. (1980). Effects of misclassifications on statistical inferences in epidemiology. American Journal of Epidemiology, 111:503-515.

Quade, D. and McClish, D. (1977). Enhancement of sensitivity and specificity by multiple observation. Nipnote 26, SENIC Project, School of Public Health, Chapel Hill, N.C.
Rogan, W.J. and Gladen, B. (1978). Estimating prevalence with the results of a screening test. American Journal of Epidemiology, 107:71-76.

Schoen, I. and Brooks, S.H. (1970). Judgement based on 95% confidence limits: a statistical dilemma involving multitest screening and proficiency testing of multiple specimens. American Journal of Clinical Pathology, 53:190-193.

Sunderman, F.W., Jr. (1972). Conceptual problems in the interpretation of multitest surveys. Clinically Oriented Documentation of Laboratory Data, E.R. Gabrieli (ed). Academic Press, New York. Pp. 39-68.

Sunderman, F.W., Jr. and Van Soestbergen, A.A. (1971). Laboratory suggestions: probability computations for clinical interpretation of screening tests. American Journal of Clinical Pathology, 55:105-111.
Tenenbein, A. (1970). A double sampling scheme for estimating from binomial data with misclassifications. Journal of the American Statistical Association, 65:1350-1361.

Tenenbein, A. (1971). A double sampling scheme for estimating from binomial data with misclassification: sample size determination. Biometrics, 27:935-944.
Thorner, R.M. and Remein, Q.R. (1961). Principles and procedures in the evaluation of screening for disease. United States Public Health Service, Division of Chronic Diseases, G.P.O., Washington, D.C., Monograph #67.

Werner, M., Brooks, S.H., and Wette, R. (1973). Strategy for cost-effective laboratory testing. Human Pathology, 4:17-30.

Williams, O.D., Mowery, R.L., and Waldman, G.T. (1980). Common methods, different populations: the LRC program prevalence study. Circulation, 62:18-23.
Wilson, J.M.G. (1973). Current trends and problems in health screening. Journal of Clinical Pathology, 26:555-563.

Vecchio, T.J. (1966). Predictive value of a single diagnostic test in unselected populations. New England Journal of Medicine, 274:1171-1173.

Yerushalmy, J. (1947). Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Reports, 62:1432-1449.

Yerushalmy, J. (1953). The reliability of chest roentgenography and its clinical applications. Diseases of the Chest, 24:133-147.

Youden, W.J. (1950). Index for rating diagnostic tests. Cancer, 3:32-35.