Sample size determination for estimation of the accuracy of two conditionally independent diagnostic tests
Marios Georgiadis,
Faculty of Veterinary Medicine,
Aristotle University of Thessaloniki,
Greece
è this work was done by
v Wes Johnson, University of California, Davis
v Ian Gardner, University of California, Davis
v Marios Georgiadis, Aristotle University of
Thessaloniki
Hui-Walter model (Biometrics, 1980)
Population 1
            Test 2 +   Test 2 -   Total
Test 1 +    a          b          a+b
Test 1 -    c          d          c+d
Total       a+c        b+d        n1

Population 2
            Test 2 +   Test 2 -   Total
Test 1 +    e          f          e+f
Test 1 -    g          h          g+h
Total       e+g        f+h        n2
Assumptions
è Validity of the assumptions is critical and should be given careful consideration
è π1 ≠ π2 (the two populations have different prevalences)
è Each test has the same Se and Sp in both populations
è Conditional independence of the two tests given true disease status (Vacek, 1985)
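As a small illustration of what the assumptions imply, the sketch below computes the four cell probabilities of one 2x2 table from a prevalence and the two tests' Se and Sp. Under conditional independence the joint probability of a result pair is the product of the per-test probabilities within infected and within non-infected animals. The function name and the parameter values are hypothetical, chosen only for illustration.

```python
def cell_probabilities(prev, se1, sp1, se2, sp2):
    """Return P(Test 1 result, Test 2 result) for one population.

    Keys are pairs such as ('+', '-'). Assumes conditional independence
    of the two tests given true infection status.
    """
    probs = {}
    for t1 in "+-":
        for t2 in "+-":
            # Probability of this result pair among infected animals
            p_inf = (se1 if t1 == "+" else 1 - se1) * (se2 if t2 == "+" else 1 - se2)
            # Probability of this result pair among non-infected animals
            p_non = ((1 - sp1) if t1 == "+" else sp1) * ((1 - sp2) if t2 == "+" else sp2)
            probs[(t1, t2)] = prev * p_inf + (1 - prev) * p_non
    return probs


# Same Se and Sp in both populations, different prevalences (illustrative values)
pop1 = cell_probabilities(prev=0.10, se1=0.90, sp1=0.95, se2=0.80, sp2=0.90)
pop2 = cell_probabilities(prev=0.40, se1=0.90, sp1=0.95, se2=0.80, sp2=0.90)
```

Only the prevalence differs between the two calls, which is exactly what the second and third assumptions require.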
sample size estimation using HW
è data from 2-tests applied on 2-populations
è goal is to estimate Se1, Se2, Sp1, Sp2, π1, π2
è minimum sample size to achieve a desired
level of precision
è the method provides sample sizes to obtain
CI’s of a specified maximum width for one
or more of the 6 parameters
è alternatively, we can specify CI widths for
the difference in sensitivities (Se1-Se2) and
specificities (Sp1-Sp2)
è spreadsheet 1
HW estimates and CI’s
è HW provided closed-form formulas for
the ML estimates for the two Se’s, the
two Sp’s and the two prevalences (6
parameters)
è using these formulas with our 2-table
data we get ML point estimates for the
six parameters of interest
è these point estimates are the points of
the (6-dimensional) parameter space for
which the likelihood function is
maximized
è spreadsheet 3
HW formulas for
the Fisher
Information Matrix
(FIM)
è once we get the FIM we can invert it to
obtain the estimated variancecovariance matrix
è the diagonal elements of this matrix are
the standard large-sample estimates of
the variances of the respective
parameter estimates
è The square roots of the diagonals are
the usual s.e.’s
è off-diagonal elements are the
corresponding estimated covariances
è excel spreadsheet 2
è once we have the standard errors we
can calculate CI’s
$\hat{Se}_1 \pm z_{\alpha/2} \cdot \mathrm{s.e.}(\hat{Se}_1)$
è we need the assumption of asymptotic
normality of the ML estimates - large
sample sizes
è rule of thumb: ML estimate ± 3*s.e.
should not cover 0 or 1
è if the assumption does not hold we
cannot calculate CI’s in the usual way
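A minimal sketch of the interval and of the rule of thumb, assuming an estimate and its standard error are already available. The function names and the numbers are hypothetical.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, alpha=0.05):
    """Large-sample (1 - alpha) confidence interval: estimate ± z * s.e."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_error, estimate + z * std_error


def normality_plausible(estimate, std_error):
    """Rule of thumb: estimate ± 3 s.e. should not reach 0 or 1."""
    return estimate - 3 * std_error > 0 and estimate + 3 * std_error < 1


low, high = wald_interval(0.90, 0.02)
ok = normality_plausible(0.90, 0.02)          # 0.84 to 0.96 stays inside (0, 1)
too_close = normality_plausible(0.95, 0.02)   # upper end 1.01 exceeds 1
```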
estimation of the differences:
Se1 - Se2 and Sp1 - Sp2
è an objective of the study might be to
compare the sensitivities or the specificities
of the tests
è the point estimate of the differences is the
difference of the point estimates
è the estimated variances of the difference
estimates are:
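$\widehat{\mathrm{var}}(\hat{Se}_1 - \hat{Se}_2) = \widehat{\mathrm{var}}(\hat{Se}_1) + \widehat{\mathrm{var}}(\hat{Se}_2) - 2\,\widehat{\mathrm{cov}}(\hat{Se}_1, \hat{Se}_2)$
$\widehat{\mathrm{var}}(\hat{Sp}_1 - \hat{Sp}_2) = \widehat{\mathrm{var}}(\hat{Sp}_1) + \widehat{\mathrm{var}}(\hat{Sp}_2) - 2\,\widehat{\mathrm{cov}}(\hat{Sp}_1, \hat{Sp}_2)$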
è all the necessary estimated variances
and covariances can be obtained from
the estimated variance-covariance
matrix
è the standard error of the difference is
the square root its estimated variance
è if the asymptotic normality assumption
holds we can create CI’s as before
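A short sketch of the difference calculation, using the standard identity var(X - Y) = var(X) + var(Y) - 2 cov(X, Y). The variances and covariance would come from the estimated variance-covariance matrix; the values below are hypothetical.

```python
from statistics import NormalDist


def difference_interval(est1, est2, var1, var2, cov12, alpha=0.05):
    """Point estimate, standard error and large-sample CI for est1 - est2."""
    diff = est1 - est2
    var_diff = var1 + var2 - 2 * cov12       # variance of a difference
    se = var_diff ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, se, (diff - z * se, diff + z * se)


# Illustrative values for Se1, Se2, their variances and their covariance
diff, se_diff, ci = difference_interval(0.90, 0.80, 0.0004, 0.0009, 0.0001)
```

A positive covariance between the two estimates narrows the interval for their difference, and a negative one widens it.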
calculation of sample size
è if the sampling distribution of an
estimator µ̂ is approximately normal
then the (1-α)*100% CI is
µˆ ± z ∗ s.e.( µˆ )
α /2
where s.e.( µˆ ) = s / N
è the width (w) of this CI is
w = 2 ∗ zα ∗ s.e.( µˆ ) = 2 ∗ zα ∗ s / N
/2
/2
è solving for N, we get:
N = (2 * zα ∗ s / w )
2
/2
è to calculate the sample size, N, we
need an estimate of s
è spreadsheet 1
è if the largest sample size is picked, all
the CI widths will be as specified or
smaller
è estimation of only a subset of
parameters might be of interest
v prevalence estimates are not usually of
interest
v some performance estimates might be
known
è information on these is used in the
spreadsheet but their CI widths are set
arbitrarily large
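The sample size rule can be sketched as follows: compute N for each parameter from its s and desired CI width, then take the largest. Parameters that are not of interest get an infinite width, which mirrors setting their CI widths arbitrarily large. This is not the spreadsheet itself; the function names and the s values are hypothetical.

```python
from math import ceil, inf
from statistics import NormalDist


def sample_size(s, width, alpha=0.05):
    """Smallest N giving a (1 - alpha) CI no wider than `width`, where s.e. = s / sqrt(N)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((2 * z * s / width) ** 2)


def required_sample_size(s_values, widths, alpha=0.05):
    """Largest per-parameter N, so every requested width is met or bettered."""
    return max(sample_size(s, w, alpha) for s, w in zip(s_values, widths))


# Order: Se1, Sp1, Se2, Sp2, prevalence 1, prevalence 2 (illustrative s values)
s_values = [1.2, 0.6, 1.5, 0.8, 0.9, 1.0]
widths = [0.10, 0.10, 0.10, 0.10, inf, inf]   # prevalences not of interest
n_required = required_sample_size(s_values, widths)
```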
è for some combinations of parameter values the
diagonals of C and Coˆv can be negative
è this is because these parameter values result in
a singular information matrix
è we have to make sure that we do not have
negative diagonals or very large pairwise
correlation values (close to or over 1 or -1)
è another indication is that the sample sizes will
become very large
è in these situations, the usual ML method cannot
be used to obtain s.e.’s and therefore our
sample size calculations are not applicable
è it’s a good idea to try some
combinations of parameter guesses to
make sure you are not near a
problematic area of the parameter
space
è the same potential problems and
warnings can be found in spreadsheet 2
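A hedged sketch of the checks described above, applied to any estimated variance-covariance matrix: flag non-positive diagonal entries and pairwise correlations close to 1 or -1. The 0.95 cut-off and the function name are assumptions made for illustration, not values taken from the spreadsheets.

```python
def covariance_warnings(cov, max_abs_corr=0.95):
    """Return a list of warning strings for a variance-covariance matrix."""
    warnings = []
    m = len(cov)
    for i in range(m):
        if cov[i][i] <= 0:
            warnings.append(f"non-positive variance for parameter {i}")
    for i in range(m):
        for j in range(i + 1, m):
            # Correlations are only defined when both variances are positive
            if cov[i][i] > 0 and cov[j][j] > 0:
                corr = cov[i][j] / (cov[i][i] * cov[j][j]) ** 0.5
                if abs(corr) >= max_abs_corr:
                    warnings.append(f"correlation {corr:.2f} between parameters {i} and {j}")
    return warnings


fine = covariance_warnings([[1.0, 0.2], [0.2, 1.0]])
suspicious = covariance_warnings([[1.0, 0.99], [0.99, 1.0]])
broken = covariance_warnings([[-1.0, 0.0], [0.0, 1.0]])
```

Running the check for several nearby sets of parameter guesses gives a quick picture of whether a problematic region of the parameter space is close.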
initial parameter guesses
è guesses of the 6 parameters of interest are
necessary
è since the sample size calculation is strongly
dependent on those they have to be realistic
è expert opinion - be careful:
v sensitivity can vary with severity of infection and
stage of disease process
v sensitivity of a test with experimental samples
might be higher than with real field samples
v specificity can vary according to geographic
distribution of cross-reacting microorganisms
è best to do a pilot study
è calculate sample sizes for a range of
possible parameter values
if you wanted to conduct an
evaluation study
è if you want to use the HW model: first make
sure that the assumptions hold
v tests conditionally independent
v populations have different prevalences
v test performance the same in both populations
è sample size calculations – precision and
cost considerations
v specify up front how much precision we need
è formulate educated guesses for the
parameters of interest (expert opinion
and/or pilot study)
è use spreadsheet 1 to get sample sizes
è check to see if the large-sample
approximation is reasonable by
calculating the initial estimate/guess
±3*s.e. to determine if the interval
obtained includes 0 or 1
è if it does, the sample is likely not large
enough to justify large-sample normality
è during the calculation process we
should monitor the diagonals of matrix
C and the pairwise correlations and be
careful about the “singular information
matrix” problem
è conduct the study
è insert raw data into spreadsheet 3 to
get parameter estimates
è use parameter estimates in spreadsheet
2 to get standard errors
è if large sample theory holds, we can
calculate CI’s for the parameters of
interest
è again, monitor information matrix
diagonals and pairwise correlations
dependent tests
è if the tests are conditionally dependent, we
can still use the HW setup but we will need
different methods of analysis of our results
è since there are no sample-size calculation
methods for such tests, we can still use our
method, knowing that to obtain comparable
precision we will probably need larger
sample sizes
è the calculated sizes can be used as an
absolutely minimum value
HW data example