American Journal of Epidemiology
© The Author 2009. Published by the Johns Hopkins Bloomberg School of Public Health.
All rights reserved. For permissions, please e-mail: [email protected].
Vol. 170, No. 3
DOI: 10.1093/aje/kwp122
Advance Access publication June 8, 2009
Practice of Epidemiology
Designing 2-Phase Prevalence Studies in the Absence of a "Gold Standard" Test
Agus Salim and Alan H. Welsh
Initially submitted December 18, 2007; accepted for publication April 24, 2009.
A population survey for estimating prevalence is challenging when a disease or condition is difficult to diagnose.
If clinical diagnosis is expensive, a 2-phase study, in which less expensive but less accurate tests are administered
to all study subjects in the first phase (screening phase) and a more accurate but expensive or time-consuming test
is administered to only a subset of the subjects in the second phase, is an attractive approach. Published research
has discussed ways of maximizing precision of the prevalence estimate from a 2-phase study with a "gold standard" second-phase test. For many psychiatric disorders, even the best diagnostic tests are not of gold standard
quality. In this paper, the authors propose a quasi-optimal design for 2-phase prevalence studies without a gold
standard test; random-effects latent class analysis facilitates the estimation of prevalence and appropriately
addresses the issue of dependent errors among the diagnostic tests. The authors show that the quasi-optimal
design is efficient compared with the balanced and random designs when there is strong inter-test dependence
caused by additional factors, apart from disease status, and highlight the importance of collecting data on those
subjects testing negative in the first phase.
efficiency; mass screening; patient selection; psychiatry; sampling studies
Abbreviation: LCA, latent class analysis.
Advances in medical technology have enabled many conditions
to be diagnosed with no or negligible error. Diagnostic tests
with this property, referred to as "gold standard," determine
directly the true disease status of an individual. However,
there are still many conditions for which a gold standard test
is not available or is simply too expensive to apply.
The lack of a gold standard test is a common feature in
psychiatry. Because there are no established biomarkers for
psychiatric disorders, these conditions are diagnosed against
consensus-determined sets of symptoms and signs (1).
When the population of interest is large, a screening phase
can be performed. Tests conducted during this phase are
usually easy to administer and relatively inexpensive, although they may not have high specificity. A fraction of
those subjects who "screened" positive are examined further by using more thorough tests (refer, for example, to Kumar et al. (2) and Hayden et al. (3)). Prevalence is computed by assuming that the phase II test is a gold standard and that false negatives in the screening phase are negligible. When the gold standard assumption is reasonable and
a fixed fraction of subjects are to be assessed at phase II,
precision of the prevalence estimate can be maximized by
selecting an optimal number of subjects from each category
of screening outcomes (4–6). However, falsely assuming
gold standard quality will result in a biased prevalence estimate (7, 8) and subvert any claims to optimality. Despite the need for an optimal design framework for settings in which there is no definitive diagnostic method, to our knowledge there has been no previous effort to address this problem.
Latent class analysis (LCA) has been used to estimate
prevalence in the absence of a gold standard test (9–11),
but the conditional independence assumption that underlies
LCA rarely holds. Torrance-Rynard and Walter (8) showed
that when it is violated, the prevalence estimate is biased.
Several extensions to classical LCA have been developed. A latent class model with conditional dependence between 1 or 2 pairs of binary responses was introduced by Harper (12). This approach has been generalized to handle extra dependence among multiple tests by introducing random subject effects (13).

Correspondence to Dr. Agus Salim, Department of Epidemiology and Public Health, National University of Singapore, MD3, 16 Medical Drive, Singapore, Singapore 117597 (e-mail: [email protected]).

Am J Epidemiol 2009;170:369–378
Using random-effects LCA models given in Qu et al. (13),
we study a framework for designing 2-phase prevalence
studies in the absence of a gold standard test, with the objective of maximizing precision of the prevalence estimate.
We present a scenario in which the phase I sample size n is
fixed and a predetermined number of subjects will be examined at phase II.
METHOD

Classical LCA

Suppose there are M binary, non-gold-standard diagnostic tests. The unobserved true disease status for individual i (D_i) is a latent binary variable. Let Y_i = (Y_{i1}, Y_{i2}, ..., Y_{iM}) be the vector of outcomes for individual i. The tests are characterized by their sensitivity and specificity, which, for the jth test, are respectively given by

w_j = Pr[Y_{ij} = 1 | D_i = 1],
x_j = Pr[Y_{ij} = 0 | D_i = 0].

Classical LCA assumes conditional independence, in which outcomes of the different tests within an individual are assumed to be independent given the true disease status, which can be written as

Pr[Y_i | D_i] = prod_{j=1}^{M} Pr[Y_{ij} = y_{ij} | D_i].

The likelihood function based on these M test outcomes is given by

L_i(θ) = Pr[D_i = 1] Pr[Y_i | D_i = 1] + Pr[D_i = 0] Pr[Y_i | D_i = 0]
       = q prod_{j=1}^{M} w_j^{y_{ij}} (1 − w_j)^{1 − y_{ij}} + (1 − q) prod_{j=1}^{M} (1 − x_j)^{y_{ij}} x_j^{1 − y_{ij}},   (1)

where q = Pr[D_i = 1] is the disease prevalence.

The parameters of interest, θ = {q, w_1, w_2, ..., w_M, x_1, x_2, ..., x_M}^T, can be estimated by maximizing the logarithm of equation 1 summed over all individuals.

LCA with random effects

The conditional independence assumption in LCA fails when dependence due to factors other than disease status exists. In psychiatry, psychological test batteries are likely to share similar test items; such tests will tend to give similar outcomes for the same subject, regardless of the subject's disease status. Dependency can also arise when the sensitivity of the individual screening procedure in practice increases with the severity of the underlying condition. Dependence between different tests can be handled by using random-effects LCA, where, in addition to the underlying disease status, the outcome depends on random subject-specific effects (13).

We model the probability of a positive outcome for the jth test on the ith individual as a function of the underlying disease status D_i and a subject-specific random effect T_i through a probit regression model,

Pr[Y_{ij} = 1 | D_i = d, T_i = t] = Φ(a_{jd} + b_{jd} t),   d = 0, 1;   T_i ~ N(0, 1),   (2)

where Φ(.) is the standard normal cumulative distribution function, and the amount of extra dependence caused by interaction between the jth test and the subject-specific characteristic T_i is controlled by the parameters b_{j1} and b_{j0}. Within individual i, test outcomes are conditionally independent given D_i and T_i. To estimate the fixed-effects parameters, θ = {q, a_{10}, a_{11}, b_{10}, b_{11}, ..., a_{M0}, a_{M1}, b_{M0}, b_{M1}}^T, we maximize the marginal log-likelihood (Appendix 1). The sensitivity and specificity parameters are estimated as functions of the maximum likelihood estimates of the fixed-effects parameters. Alternatives to probit regression are available; in particular, if logistic regression is used instead, the model is equivalent to a subclass of the generalized partial-credit models used in psychometrics (14, 15). The main advantage of the probit link is that the sensitivity and specificity estimates can be computed as relatively simple functions of the parameter estimates.

A quasi-optimal hybrid design for 2-phase prevalence studies

LCA can be adapted to handle data from 2-phase prevalence studies (refer to Appendix 2). In 2-phase prevalence studies, we screen a random sample of n subjects using M binary tests, and 100 q_v % of them are assessed by using a different test at phase II. We assume that the phase I sample size (n) and the overall phase II sampling fraction (q_v) have been predetermined. We label the outcome classes defined by a unique combination of screening test outcomes as Z_k, k = 1, 2, ..., 2^M. Let n*_k, k = 1, ..., 2^M, denote the number of observations in each outcome class, so n = sum_{k=1}^{2^M} n*_k. In phase II, we select n_k = p_k n*_k subjects from the kth outcome class. Each of the n_k subjects selected is given the additional binary test W, with n_{ks} denoting the number of subjects in the kth class with phase II outcome s = 0, 1, so that n_k = n_{k0} + n_{k1}, k = 1, ..., 2^M.
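To make equations 1 and 2 concrete, the following Python sketch evaluates the classical likelihood and the marginal likelihood of the random-effects model, integrating over T_i ~ N(0, 1) with a simple trapezoid rule. This is our illustration, not the authors' code (they worked in R); the parameter values and quadrature grid are assumptions for demonstration.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def classical_likelihood(y, q, w, x):
    """Equation 1: q * prod_j w_j^y (1-w_j)^(1-y) + (1-q) * prod_j (1-x_j)^y x_j^(1-y)."""
    p1 = p0 = 1.0
    for yj, wj, xj in zip(y, w, x):
        p1 *= wj if yj else (1.0 - wj)
        p0 *= (1.0 - xj) if yj else xj
    return q * p1 + (1.0 - q) * p0

def marginal_likelihood(y, q, a, b, grid=401, t_max=8.0):
    """Equation 2 marginalized over T_i ~ N(0, 1) via trapezoid quadrature.
    a[d][j], b[d][j]: probit intercept/loading for test j given disease status d."""
    h = 2.0 * t_max / (grid - 1)
    total = 0.0
    for i in range(grid):
        t = -t_max + i * h
        dens = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        lik = 0.0
        # Mix over the two latent disease states, weighting by prevalence q.
        for d, pr_d in ((1, q), (0, 1.0 - q)):
            prod = 1.0
            for j, yj in enumerate(y):
                pj = phi(a[d][j] + b[d][j] * t)
                prod *= pj if yj else (1.0 - pj)
            lik += pr_d * prod
        total += (0.5 if i in (0, grid - 1) else 1.0) * lik * dens * h
    return total
```

With all loadings b set to 0, the marginal likelihood collapses to equation 1 with w_j = Φ(a_{j1}) and x_j = 1 − Φ(a_{j0}), which provides a useful consistency check on any implementation.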
Our objective is to minimize the variance of the prevalence estimate, q̂, by choosing the optimal phase II sampling fractions, p = {p_1, p_2, ..., p_{2^M}}, subject to the overall sampling fraction being q_v = sum_k p_k n*_k / n.

For a given set of phase II sampling fractions, p, the prevalence estimator q̂ is distributed as

q̂ ~ N(q, I^{-1}_{11}(θ; p)),

where I^{-1}_{11}(θ; p) is the first main diagonal element of the inverse of the expected Fisher information matrix

I(θ; p) = −E[∂² log ℓ(θ; p) / ∂θ ∂θ^T].   (3)

The analytical form of the log-likelihood, log ℓ(θ; p), is given in
Appendix 2.
Because we are interested in finding the ‘‘optimal’’ p, the
parameters of the underlying random-effects LCA are assumed to be known. If the parameters need to be estimated,
one can do so from a pilot study conducted prior to the main
study. The log-likelihood in Appendix 2 can be used to
estimate h based on data from a pilot study. The pilot study
typically includes a small number of individuals being assessed by using M phase I tests; a subset of them are also
assessed by using the phase II test.
To locate the optimal sampling fractions, we need to
evaluate the expected information matrix in equation 3 for
various values of p. The matrix can only be approximated
numerically by using finite-difference methods. We initially
used the Nelder-Mead simplex algorithm (16) to locate the
optimal sampling fractions. However, doing so proved quite
challenging because of the difficulty in obtaining highly
accurate approximations of the second-order derivatives.
The problem is compounded because we rely on achieving
stability in estimating a small variance parameter for declaring convergence.
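The numerical machinery involved can be reproduced in miniature. The sketch below (our Python illustration; the authors used the fdHess function from the R package nlme) approximates a Hessian by central finite differences and takes the leading element of the inverse information as a variance. A binomial log-likelihood is used as a stand-in because its answer is known exactly, Var(q̂) = q(1 − q)/n; the sample values are assumptions for demonstration.

```python
import math

def num_hessian(f, theta, eps=1e-5):
    """Approximate the Hessian of scalar function f at point theta
    by central finite differences (step eps in each coordinate)."""
    p = len(theta)
    H = [[0.0] * p for _ in range(p)]
    def f_at(shifts):
        return f([v + s for v, s in zip(theta, shifts)])
    for i in range(p):
        for j in range(p):
            for si, sj, sign in ((eps, eps, 1), (eps, -eps, -1),
                                 (-eps, eps, -1), (-eps, -eps, 1)):
                shifts = [0.0] * p
                shifts[i] += si
                shifts[j] += sj
                H[i][j] += sign * f_at(shifts)
            H[i][j] /= 4.0 * eps * eps
    return H

# Stand-in log-likelihood: k successes out of n Bernoulli(q) trials.
n, k = 10000, 1000
loglik = lambda th: k * math.log(th[0]) + (n - k) * math.log(1.0 - th[0])

H = num_hessian(loglik, [k / n])   # observed information is -H
var_hat = -1.0 / H[0][0]           # first main diagonal element of (-H)^(-1)
```

For this one-parameter problem the approximation recovers q(1 − q)/n closely; with many parameters and a small variance component, the second-order differences become much harder to compute accurately, which is the difficulty described above.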
To overcome this problem, we search for "optimal" phase II sampling fractions among designs that are hybrid between a random (RAN) design, in which the sampling fractions are equal (p_k^RAN = q_v), and a balanced (BAL) design, in which the sampling fraction for the kth class is set to be

p_k^BAL = n q_v / (2^M n*_k).
The balanced design seeks to make the phase II sample sizes
for the different phase I outcome classes as equal as possible. An exception is made when there are no subjects in the
kth phase I stratum, and, as a consequence, we cannot select
anybody from this stratum at phase II. On this rare occasion,
we set the balanced sampling fraction for this stratum to
zero and recalculate the other sampling fractions using simple renormalization (refer to the information below). Similarly, when the number of subjects in the kth stratum is
smaller than the number required under a balanced design,
the sampling fraction will be set to 1 and the other sampling
fractions are recalculated (again, see below).
In our hybrid design, the vector of sampling fractions is taken as a linear combination of the sampling fractions under the random and balanced designs,

p = λ p^BAL + (1 − λ) p^RAN.   (4)
We search for the optimal hybrid parameter, λ, that minimizes I^{-1}_{11}(θ; p), using a line search. To aid the search, the curve of variances at various values of λ is smoothed prior to performing the search. Because there is no guarantee that the hybrid sampling fractions actually achieve the global minimum of the standard error of the prevalence estimate, we refer to this hybrid design as "quasi-optimal."

When λ = 0, the quasi-optimal design samples randomly; when λ = 1, the quasi-optimal design is balanced. When λ > 1, the quasi-optimal design overshoots the balanced design and samples more intensively from rare phase I outcome classes. Similarly, when λ < 0, the quasi-optimal design samples more intensively from common phase I outcome classes compared with the random design.
For some λ, elements of p may be negative; we replace negative sampling fractions by zero and adjust the other sampling fractions using the simple renormalization

p* = p × n q_v / sum_{k: p_k > 0} p_k n*_k.

Similarly, if any sampling fractions are greater than 1, then we replace these sampling fractions by 1 and adjust the other sampling fractions using the following renormalization:

p* = p × (n q_v − sum_{k: p_k ≥ 1} n*_k) / sum_{k: p_k < 1} p_k n*_k.
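The hybrid construction in equation 4 and the two renormalizations can be sketched as follows. This is an illustrative Python translation, not the authors' R code; the function names and example class counts are our assumptions, and the single-pass renormalization mirrors the paper's non-iterative adjustment.

```python
def renormalize(p, counts, n, qv):
    """Clip sampling fractions to [0, 1], then rescale the interior classes so
    that the overall phase II fraction sum_k p_k * n_k / n still equals qv."""
    budget = n * qv - sum(nk for f, nk in zip(p, counts) if f >= 1.0)
    denom = sum(f * nk for f, nk in zip(p, counts) if 0.0 < f < 1.0)
    out = []
    for f, nk in zip(p, counts):
        if f >= 1.0:
            out.append(1.0)        # take the whole class
        elif f <= 0.0:
            out.append(0.0)        # skip empty/negative-fraction classes
        else:
            out.append(f * budget / denom)
    return out

def hybrid_fractions(counts, n, qv, lam):
    """p = lam * p_BAL + (1 - lam) * p_RAN, with p_BAL,k = n*qv / (2^M * n_k)."""
    K = len(counts)                # K = 2^M phase I outcome classes
    p_ran = [qv] * K
    p_bal = [n * qv / (K * nk) if nk > 0 else 0.0 for nk in counts]
    p = [lam * pb + (1.0 - lam) * pr for pb, pr in zip(p_bal, p_ran)]
    return renormalize(p, counts, n, qv)
```

For example, with counts [5, 50, 100, 845] (n = 1,000, q_v = 0.1), the balanced design wants 25 subjects per class; the smallest class is exhausted (fraction capped at 1) and the shortfall is spread over the remaining classes so that exactly 100 subjects are sampled at phase II.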
RESULTS
We compare the relative efficiency of the proposed quasi-optimal design with that of random and balanced designs. The comparisons are aimed at assessing the efficiency that could be achieved under realistic sets of parameters, chosen to represent "typical" psychiatric diagnostic tests.
To compare the different designs, we assume that n = 10,000 subjects are assessed at phase I using 4 different tests and that only a fixed fraction of these subjects receive the phase II test. The choice of n is somewhat arbitrary because the relative efficiency of any 2 designs given a fixed p does not depend on n, as evidenced by the form of the log-likelihood function in Appendix 2. The sensitivity parameters of the screening tests are set to be 0.68, 0.70, 0.73, and 0.76, and the specificity parameters are set to be 0.60, 0.62, 0.65, and 0.68. Four factors are varied in the study:
1. Phase II test parameters: sensitivity = 0.95 and specificity = 0.95 (labeled Good), and sensitivity = 0.80 and specificity = 0.75 (labeled Moderate).
2. Conditional dependence parameters: b_1 and b_0 are set equal across the 5 tests and are assigned the values b_1 = b_0 = 2 (labeled Strong) and b_1 = b_0 = 0.5 (labeled Weak).
3. Disease prevalence: q = 0.01 corresponds to a rare disease such as schizophrenia, and q = 0.10 corresponds to a relatively common disease such as depression.
4. Phase II sampling fraction: ranges from q_v = 0.05 to q_v = 0.50, with an increment of 0.05.
In the current study, the underlying parameters of the LCA model are known; hence, no simulated pilot data are needed. Instead, for each combination of phase II sampling fraction and λ, we can directly use these parameters to calculate the information matrix in equation 3. The numeric approximations of the expected information matrix are carried out by using the fdHess function of the R package nlme; the code is available, upon request, from the authors. The variance estimate is calculated by taking the first main diagonal element of the inverse of this matrix.
Figure 1. Relative efficiency defined as the variance of the prevalence estimate under the quasi-optimal design relative to the variance estimate under random (solid lines) and balanced (dashed lines) designs for the same sampling fractions (q = 0.01). A) Good phase II test with strong inter-test dependence; B) good phase II test with weak inter-test dependence; C) moderate phase II test with strong inter-test dependence; D) moderate phase II test with weak inter-test dependence.
To estimate the relative efficiency of the quasi-optimal design, for each combination of factors 1 and 2 above, we compute the variance of the prevalence estimate for hybrid designs with λ ranging from −1 to 6. To remove the stochastic noise from the numeric approximations, the variance estimates are then smoothed as a function of disease prevalence, phase II sampling fraction, and λ. The smoothing is performed by using the R function gam. The relative efficiency for each combination of disease prevalence and phase II sampling fraction is then computed by taking the ratio of the lowest variance estimate (the quasi-optimal variance) to the variance estimate when λ = 0 (random design) and when λ = 1 (balanced design). The relative efficiency curves are constructed by smoothing the values of the relative efficiency estimates as functions of the phase II sampling fraction.
Reduction of variance for the same overall phase II
sampling fraction
For rare disease, when conditional dependence is weak
(Figure 1B and D), balanced and random designs perform
almost as well as the quasi-optimal design. However, when
conditional dependence is strong, the advantage of the
quasi-optimal design over random designs is obvious. The
balanced design, however, still performs relatively well
when conditional dependence is strong. A very similar pattern holds for common disease (Figure 2).
Although the phase I sample size (n) does not affect the relative efficiency of the quasi-optimal design relative to the random or balanced design if the phase II sampling fraction is fixed, it still affects the relative efficiency if the phase II sample size is fixed. For example, suppose that n = 10,000 and we can afford to sample only 500 individuals (q_v = 0.05). Then, according to Figure 1A, the balanced design is about 90% as efficient as the quasi-optimal design. However, if n = 1,000 and we can afford to sample only 500 individuals (q_v = 0.50), then using the balanced design would produce a prevalence estimate nearly as precise as that obtained with the quasi-optimal design.
Figure 2. Relative efficiency defined as the variance of the prevalence estimate under the quasi-optimal design relative to the variance estimate under random (solid lines) and balanced (dashed lines) designs for the same sampling fractions (q = 0.10). A) Good phase II test with strong inter-test dependence; B) good phase II test with weak inter-test dependence; C) moderate phase II test with strong inter-test dependence; D) moderate phase II test with weak inter-test dependence.

Reduction of the phase II sampling fraction for the same standard error of prevalence estimate

We also look at how the relative efficiency translates into a reduction of the required phase II sampling fraction. For each value of the phase II sampling fraction under the quasi-optimal design (factor 4 above), we use the smoothed variance estimates to interpolate the phase II sampling fraction, with λ = 0 and λ = 1, that will give the same variance of the prevalence estimate as the quasi-optimal design does. The relative efficiency is then calculated as the ratio of the quasi-optimal phase II sampling fraction to that of the balanced and random designs. Figures 3A and 4A show that when conditional dependence is strong and the phase II test is of Good quality, the quasi-optimal design can achieve
more than a 40% reduction in phase II sample size when
compared with the random design. However, significant gain
over the balanced design is observed only when the sampling
fraction is small. When the phase II test is of Moderate
quality (Figures 3C and 4C), the quasi-optimal design
achieves a more meaningful reduction in terms of the phase
II sampling fraction when compared with the balanced design.
However, the most interesting findings occur when conditional dependence is weak. Figure 3B and D reveal that,
generally, the benefit of the quasi-optimal design is much
more pronounced when reduction of phase II sampling fractions, rather than variance reduction, is used to measure the
benefit (Figure 1B and D). A similar pattern is observed
when Figures 2 and 4 are compared for common disease,
although the discrepancy is less pronounced. This discrepancy can be explained by the fact that the variance of the prevalence estimate is not an inverse function of the phase II sampling fraction; rather, it is an inverse function of the phase I sample size. Hence, for the same sampling fraction, a variance under the quasi-optimal design that is half that under the random design does not mean that the variance level achieved by the quasi-optimal design can be matched by simply doubling the sampling fraction under the random design. As a consequence, the scale used to measure the
benefit of the quasi-optimal design matters. In terms of
practical applications, however, the definition of benefit in
terms of phase II sampling fraction is probably more
meaningful.
APPLICATION
We apply our methodology to a study of mild cognitive
impairment in a cohort of subjects aged 60–64 years in
Canberra, Australia (2). Mild cognitive impairment has been
defined as a transitional state between normal aging and
Alzheimer’s disease (17). The application of different diagnostic procedures and the age range of study subjects have
led to diverse and inconsistent estimates of prevalence of
this condition (refer, for example, to De Carli (18)). Prevalence estimation has been further complicated by the fact
that none of the currently used tests can be regarded as a gold
standard.
The study was a 2-phase investigation, with 2,551 individuals assessed by using the following 3 tests at phase I:
1. A Mini-Mental State Examination score of ≤25 (19)
2. A score below the 5th percentile for immediate or delayed recall on the California Verbal Learning Test (20)
3. A score below the 5th percentile on the Symbol Digit
Modalities Test (21).
At phase II, individuals underwent thorough physical,
laboratory, and neuropsychological examinations conducted
by a clinician and were assessed according to diagnostic
criteria for amnestic mild cognitive impairment (17). The
study team decided to conduct the phase II investigations on
only a subset of individuals who had at least one positive
phase I test. Seventeen of 2,551 subjects had one or more missing screening test outcomes and were excluded from the subsequent analysis. Therefore, 2,534 subjects remained, of whom 112 of the 224 who "screened" positive consented to undergo and were administered the phase II test (corresponding to q_v ≈ 0.044).

Figure 3. Relative efficiency defined as the ratio of the required phase II sampling fraction under the quasi-optimal design relative to the random (solid lines) and balanced (dashed lines) designs to achieve the same standard errors of prevalence estimate (q = 0.01). A) Good phase II test with strong inter-test dependence; B) good phase II test with weak inter-test dependence; C) moderate phase II test with strong inter-test dependence; D) moderate phase II test with weak inter-test dependence.
The estimates of prevalence, as well as the sensitivity and specificity parameters of the 4 tests under 1) classical LCA and 2) random-effects LCA, are given in Table 1. Because we have 4 tests, only those models with at most 2^4 − 1 = 15 independent parameters can be fitted. With this restriction, we assume that the variance component for the random effects is the same in both the diseased and nondiseased subpopulations for all tests; that is, b_{11} = b_{21} = b_{31} = b_{41} = b_1 and b_{10} = b_{20} = b_{30} = b_{40} = b_0. The likelihood ratio statistic comparing the classical and random-effects LCA models is 6.70 on 2 degrees of freedom (P = 0.035), so we reject the conditional independence assumption. The random-effects LCA model fits the data quite well, as indicated by comparison of the observed and predicted frequencies for the different phase I outcome classes (Table 2).
Violation of the conditional independence assumption results in underestimation of disease prevalence by 2.6% in
absolute value and more than 30% relative to the prevalence
estimate itself. It also results in underestimation of the sensitivity of the amnestic mild cognitive impairment test and
overestimation of the sensitivity of the phase I tests.
The low sensitivities of the Mini-Mental State Examination and the Symbol Digit Modalities Test indicate that the
current threshold values used to determine positive screening for these tests are too high and that, consequently, higher
sensitivity can be achieved by lowering the threshold (e.g.,
lower than 25 for the Mini-Mental State Examination), although the specificity will decrease to compensate for the
lower threshold. The phase II test (amnestic mild cognitive
impairment) has very high specificity. In fact, the 95% confidence interval for the specificity of amnestic mild cognitive impairment covers 1 (perfect specificity). The value of
estimated prevalence is in accordance with previously published reports in which the prevalence of mild cognitive
impairment for age groups 60 years or older is reported to
be between 5% and 10% (18, 22).
Treating data from this study as our "pilot" data, we investigate how we should have designed the phase II sampling. We search for a quasi-optimal design by using the estimates in Table 1 (the Conditional Dependence column) to generate numeric estimates of the expected information matrix. The hybrid parameter for the quasi-optimal design is found to be 1.8 (Figure 5). Figure 5 also demonstrates that the quasi-optimal design, compared with the random design, greatly reduces the variance of the prevalence estimate. The quasi-optimal design does not perform much better than the balanced design. This finding is not surprising given that our calculations show that, when the phase II test is of Moderate quality and conditional dependence is Strong, the balanced design performs nearly as well as the quasi-optimal design (Figures 1C and 2C).

Figure 4. Relative efficiency defined as the ratio of the required phase II sampling fraction under the quasi-optimal design relative to the random (solid lines) and balanced (dashed lines) designs to achieve the same standard errors of prevalence estimate (q = 0.10). A) Good phase II test with strong inter-test dependence; B) good phase II test with weak inter-test dependence; C) moderate phase II test with strong inter-test dependence; D) moderate phase II test with weak inter-test dependence.

Table 1. Parameter Estimates and Their 95% Confidence Intervals for the Mild Cognitive Impairment Study Data Set

                         Conditional Independence      Conditional Dependence
                         Estimate   95% CI             Estimate   95% CI
  q                      0.051      0.026, 0.077       0.077      0.052, 0.102
  Sensitivity
    MMSE                 0.470      0.289, 0.664       0.084      0.033, 0.467
    CAL                  0.439      0.287, 0.582       0.255      0.024, 0.745
    SDMT                 0.484      0.135, 0.657       0.051      0.011, 0.665
    aMCI                 0.319      0.135, 0.503       0.764      0.272, 0.969
  Specificity
    MMSE                 0.990      0.981, 0.999       0.971      0.926, 0.992
    CAL                  0.967      0.956, 0.976       0.962      0.904, 0.991
    SDMT                 0.972      0.962, 0.982       0.951      0.874, 0.989
    aMCI                 0.689      0.568, 0.823       0.986      0.656, 1.000
  Variance component
    b_1                  0                             0.720      0.545, 1.551
    b_0                  0                             1.154      1.131, 1.225

Abbreviations: aMCI, amnestic mild cognitive impairment; CAL, California Verbal Learning Test; CI, confidence interval; MMSE, Mini-Mental State Examination; SDMT, Symbol Digit Modalities Test.
The variance of the prevalence estimate under the quasi-optimal design is 85% of the variance under the balanced design and is 40% of the variance under the random design (Figure 5). When we compared the quasi-optimal design with the design used in the actual study, we found that the actual design is only about 33% efficient. Thus, had we been able to use the quasi-optimal design in planning the phase II sampling, the standard error of the prevalence estimate under the LCA model would have been reduced from approximately 0.0128 (the width of the 95% confidence interval for q in Table 1 (the Conditional Dependence column) divided by 3.94) to sqrt(0.33) × 0.0128 ≈ 0.0074. Alternatively, if we had used the balanced design, the standard error of the prevalence estimate would have been sqrt(0.33 / 0.85) × 0.0128 ≈ 0.008.
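The arithmetic behind these standard errors can be checked directly. The sketch below is our verification (Python), using the efficiency figures 0.33 and 0.85 from the comparisons just described; it is not part of the authors' analysis code.

```python
import math

se_actual = 0.0128        # SE of the prevalence estimate under the actual design
eff_vs_actual = 0.33      # variance(quasi-optimal) / variance(actual design)
eff_vs_balanced = 0.85    # variance(quasi-optimal) / variance(balanced design)

# SE scales as the square root of the variance ratio.
se_quasi = math.sqrt(eff_vs_actual) * se_actual                   # quasi-optimal
se_balanced = math.sqrt(eff_vs_actual / eff_vs_balanced) * se_actual  # balanced
```

The quasi-optimal value comes out near 0.0074 and the balanced value near 0.008, matching the figures quoted above.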
The recommended phase II sampling fractions are given in Table 3. When we compare the prescribed phase II sampling fractions with those used in the actual study, we learn that the study design can be improved by sampling more subjects who had 2 or more positive tests in phase I and by sampling fewer subjects who had only one positive test in phase I. A number of subjects who had screened negative would also need to be assessed at phase II. The exact number is 0.887 × 0.008 × 2,551 ≈ 19 individuals.

Table 2. Observed Number of Subjects in the Different Phase I Outcome Classes and the Expected Number Predicted by the Random-Effects Latent Class Analysis Model

  MMSE   CAL   SDMT    Observed No.    Expected No.
    1     1     1            12              12
    1     1     0            17              16
    1     0     1            20              21
    0     1     1            15              13
    1     0     0            31              43
    0     1     0            91              91
    0     0     1            78              78
    0     0     0         2,270           2,260

Abbreviations: CAL, California Verbal Learning Test; MMSE, Mini-Mental State Examination; SDMT, Symbol Digit Modalities Test.

Table 3. Population Prevalence of Different Phase I Outcome Classes and the Phase II Sampling Fractions Under the Quasi-optimal Design (Observed Counterpart Used in the Actual Study)

  MMSE   CAL   SDMT    Prevalence    Sampling Fraction
    1     1     1         0.005        1.000 (0.286)
    1     1     0         0.008        1.000 (0.476)
    1     0     1         0.009        1.000 (0.292)
    0     1     1         0.007        0.822 (0.412)
    1     0     0         0.014        0.402 (0.432)
    0     1     0         0.037        0.190 (0.537)
    0     0     1         0.032        0.220 (0.110)
    0     0     0         0.887        0.008 (0.000)

Abbreviations: CAL, California Verbal Learning Test; MMSE, Mini-Mental State Examination; SDMT, Symbol Digit Modalities Test.

Figure 5. Relative efficiency defined as the variance of the prevalence estimate under the quasi-optimal design (λ = 1.8) relative to hybrid designs with different hybrid parameters (λ). Legend: optimal hybrid, balanced, and random designs.

DISCUSSION

In this paper, we have presented a quasi-optimal design for a 2-phase prevalence study without a gold standard test. We demonstrated that the quasi-optimal design is more efficient than random and balanced designs under a variety of realistic parameter settings, especially when strong conditional dependence exists. The quasi-optimal design is very useful for psychiatric studies, where even the best diagnostic test is often of Moderate quality, with sensitivity typically less than 80%, and conditional dependence between tests is strong.
Because the quasi-optimal design is based on random-effects LCA models, researchers generally need at least 3 screening tests to take advantage of the methodology for planning their study. With 3 screening tests and 1 phase II test, the simplest random-effects LCA model with the same conditional dependence parameters for the 4 tests would need to estimate 11 parameters (1 prevalence estimate, 4 sensitivity estimates, 4 specificity estimates, and 2 estimates of conditional dependence parameters). With 4 tests in total, this is possible because we have 2^4 - 1 = 15 degrees of freedom to use from a 2 × 2 × 2 × 2 contingency table. However, if we have 2 subpopulations, such as male and female, with different prevalences, the minimum number of tests required to conduct LCA can be reduced to 3 (9). With 3 tests, each subpopulation has 2^3 - 1 = 7 degrees of freedom, so in total we have 14 degrees of freedom with only 12 parameters to be estimated. When needed, the extra degrees of freedom can be used to model the effect of covariates such as gender on the specificity and sensitivity parameters. For population-based studies of a psychiatric condition such as mild cognitive impairment or dementia, we expect that the requirement of having 2 screening tests would be easily fulfilled.
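The degrees-of-freedom bookkeeping above can be verified with a few lines of arithmetic; this is a minimal sketch, with the parameter counts for the two scenarios taken directly from the text:

```python
# Identifiability check for latent class models: the number of free
# parameters must not exceed the degrees of freedom supplied by the
# contingency table(s) of binary test outcomes.

def degrees_of_freedom(n_tests, n_subpops=1):
    """Each subpopulation contributes a 2^M table with 2^M - 1 free cells."""
    return n_subpops * (2 ** n_tests - 1)

# Scenario 1: 3 screening tests + 1 phase II test, a single population.
# Parameters: 1 prevalence + 4 sensitivities + 4 specificities
# + 2 shared conditional-dependence parameters = 11.
params_one_pop = 1 + 4 + 4 + 2
df_one_pop = degrees_of_freedom(n_tests=4)                # 2^4 - 1 = 15

# Scenario 2: 3 tests in total, 2 subpopulations (e.g., male/female)
# with different prevalences; 12 parameters, as stated in the text.
params_two_pops = 12
df_two_pops = degrees_of_freedom(n_tests=3, n_subpops=2)  # 2*(2^3 - 1) = 14

assert params_one_pop <= df_one_pop    # 11 <= 15: identifiable
assert params_two_pops <= df_two_pops  # 12 <= 14: identifiable
```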
Our findings also demonstrate the importance of conducting a further test at phase II for a small fraction of those who
screened negative at phase I. This procedure is contrary to
popular practice in 2-phase psychiatric studies, where no
further assessment is conducted for this subgroup of participants. Assessing a few of those who screened negative can
have a profound impact on reducing the variance of the prevalence estimate.
Random-effects LCA can be extended to include covariates. Factors such as poor education and motor or sensory
impairment may have differential effects on tests that use
particular approaches to assess cognitive status. Because the
efficiency of our quasi-optimal design relies on the validity
of the random-effects LCA models, we suggest that researchers use the pilot data to examine model fit by comparing the observed and predicted frequencies of subjects
in the different strata defined by phase I test outcomes.
A significant lack of fit would suggest that the current model is inappropriate. Comparing the observed and theoretical odds ratios for each pair of tests will also help improve the model by identifying pairs of tests that are more strongly correlated than the others and for which separate shared random-effects parameters are needed. We refer readers interested in models with covariates to Qu et al. (13) and Reboussin et al. (23). The latter authors also discussed the strengths and weaknesses of random-effects latent class models.
Several extensions are worth further research. Study withdrawals are a ubiquitous problem in psychiatric epidemiology, and taking this problem into account will increase the applicability of the methodology. We considered a scenario
in which the phase I sample size is predetermined. A more
complex scenario would be to determine the optimal phase I
sample size as well as the phase II sampling fractions. In this
scenario, we need to consider the cost of the phase I tests per subject (c1) and of the phase II test per subject (c2). For a gold standard phase II test and with only one phase I test, it has been shown that a 2-phase study will be more efficient than a single-phase study (in which everybody is assessed by using all available tests) if c2/c1 > 5 and the sum of the sensitivity and specificity parameters is at least 1.7 (4). Looking into this issue in the case of a non-gold-standard phase II test would have practical value and is worth researching further.
ACKNOWLEDGMENTS
Author affiliations: Department of Epidemiology and
Public Health, National University of Singapore, Singapore
(Agus Salim); Centre for Mental Health Research, Australian National University, Australian Capital Territory, Australia (Agus Salim); and Centre for Mathematics and Its
Applications, Australian National University, Australian
Capital Territory, Australia (Alan H. Welsh).
The statistical work was funded by Australian National
Health and Medical Research Council (NHMRC) Capacity
Building Grant 418020 and Australian Research Council
(ARC) Grant DP0559135. The mild cognitive impairment
study was funded by NHMRC Program Grant 179805 and
NHMRC Project Grant 157125. The mild cognitive impairment study involved human subjects, and ethical approval
was obtained from the Australian National University Ethics
Committee.
Conflict of interest: none declared.
REFERENCES
1. American Psychiatric Association. Diagnostic and Statistical
Manual of Mental Disorders, Fourth Edition, Text Revision.
Washington, DC: American Psychiatric Association; 2000.
2. Kumar R, Dear KB, Christensen H. Prevalence of mild cognitive impairment in 60- to 64-year-old community-dwelling individuals: the Personality and Total Health Through Life 60+ Study. Dement Geriatr Cogn Disord. 2005;19(2–3):67–74.
3. Hayden KM, Khachaturian AS, Tschanz JT. Characteristics of
a two-stage screen for incident dementia. J Clin Epidemiol.
2003;56(11):1038–1045.
4. Shrout PE, Newman SC. Design of two-phase prevalence
surveys of rare disorders. Biometrics. 1989;45(2):549–555.
5. Newman SC, Shrout PE, Bland RC. The efficiency of two-phase designs in prevalence surveys of mental disorders.
Psychol Med. 1990;20(1):183–193.
6. McNamee R. Efficiency of two-phase designs for prevalence
estimation. Int J Epidemiol. 2003;32(6):1072–1078.
7. Ihorst G, Forster J, Petersen G. The use of imperfect diagnostic
tests had an impact on prevalence estimation. J Clin Epidemiol. 2007;60(9):902–910.
8. Torrance-Rynard VL, Walter SD. Effects of dependent errors
in the assessment of diagnostic test performance. Stat Med.
1997;16(19):2157–2175.
9. Walter SD, Irwig LM. Estimation of test error rates, disease
prevalence, and relative risk from misclassified data: a review.
J Clin Epidemiol. 1988;41(9):923–937.
10. Faraone SV, Tsuang MT. Measuring diagnostic accuracy in the
absence of a ‘‘gold standard.’’ Am J Psychiatry. 1994;151(5):
650–657.
11. Johnson WO, Gastwirth JL, Pearson LM. Screening without
a ‘‘gold standard’’: the Hui-Walter paradigm revisited. Am J
Epidemiol. 2001;153(9):921–924.
12. Harper D. Local dependence latent structure models. Psychometrika. 1972;37(1):53–59.
13. Qu Y, Tan M, Kutner MH. Random effects models in latent
class analysis for evaluating accuracy of diagnostic tests.
Biometrics. 1996;52(3):797–810.
14. von Davier M, Yamamoto K. Partially observed mixtures of
IRT models: an extension of the generalized partial-credit
model. Appl Psychol Meas. 2004;28(6):389–406.
15. Muraki E. A generalized partial credit model: application of an
EM algorithm. Appl Psychol Meas. 1992;16(2):159–176.
16. Nelder JA, Mead R. A simplex method for function minimization. Comput J. 1965;7(4):308–313.
17. Petersen RC, Doody R, Kurz A. Current concepts in mild
cognitive impairment. Arch Neurol. 2001;58(12):1985–1992.
18. DeCarli C. Mild cognitive impairment: prevalence, prognosis,
aetiology, and treatment. Lancet Neurol. 2003;2(1):15–21.
19. Folstein MF, Folstein SE, McHugh PR. Mini-Mental State:
a practical method for grading the cognitive state of patients
for the clinician. J Psychiatr Res. 1975;12(3):189–198.
20. Delis D, Kramer J, Kaplan E. California Verbal Learning Test.
San Antonio, TX: Psychological Corporation; 1987.
21. Smith A. Symbol Digit Modalities Test (SDMT) Manual. Los
Angeles, CA: Western Psychological Service; 1982.
22. Hänninen T, Hallikainen M, Tuomainen S. Prevalence of mild
cognitive impairment: a population-based study in elderly
subjects. Acta Neurol Scand. 2002;106(3):148–154.
23. Reboussin BA, Ip EH, Wolfson M. Locally dependent latent
class models with covariates: an application to under-age
drinking in the USA. J R Stat Soc (A). 2008;171(4):877–897.
APPENDIX 1
Marginal Likelihood of Random-Effects LCA

To estimate the fixed-effects parameters, $\theta$, we maximize the sum of the marginal log-likelihoods of the fixed-effects parameters over all subjects. For subject $i$, the marginal likelihood is given by

$$
L_i(\theta) = q \int_{-\infty}^{\infty} \prod_{j=1}^{M} \Phi(a_{j1} + b_{j1} t)^{y_{ij}} \left[ 1 - \Phi(a_{j1} + b_{j1} t) \right]^{1 - y_{ij}} \, d\Phi(t)
+ (1 - q) \int_{-\infty}^{\infty} \prod_{j=1}^{M} \Phi(a_{j0} + b_{j0} t)^{y_{ij}} \left[ 1 - \Phi(a_{j0} + b_{j0} t) \right]^{1 - y_{ij}} \, d\Phi(t), \tag{A1}
$$

where $q$ is the prevalence, $\Phi$ is the standard normal distribution function, and $y_{ij}$ is the binary outcome of test $j$ for subject $i$. For computation, the likelihood in equation A1 is often approximated by using Gaussian quadrature.

Under model 2, Qu et al. (13) showed that the sensitivity of the $j$th test, $w_j$, is given by the average probability of positive test outcomes in the diseased population,

$$
w_j = \int_{-\infty}^{\infty} \Phi(a_{j1} + b_{j1} t) \, d\Phi(t) = \Phi\!\left( \frac{a_{j1}}{\sqrt{1 + b_{j1}^2}} \right). \tag{A2}
$$

Similarly, the specificity is given by the average probability of negative test outcomes in the nondiseased population,

$$
x_j = 1 - \int_{-\infty}^{\infty} \Phi(a_{j0} + b_{j0} t) \, d\Phi(t) = \Phi\!\left( \frac{-a_{j0}}{\sqrt{1 + b_{j0}^2}} \right). \tag{A3}
$$

The maximum likelihood estimates of $w_j$ and $x_j$ can be computed by substituting the $a_{jd}$ and $b_{jd}$ with their maximum likelihood estimates.
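As a concreteness check, the Gauss-Hermite approximation of these integrals can be sketched in a few lines; the item parameters a1 and b1 below are hypothetical, and the closed form on the right side of equation A2 is used to verify the quadrature:

```python
import numpy as np
from math import erf, sqrt, pi

def norm_cdf(x):
    """Standard normal distribution function, elementwise."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in x])

def expect_under_phi(f, n_nodes=40):
    """Approximate the integral of f(t) dPhi(t) by Gauss-Hermite quadrature.

    With the substitution t = sqrt(2) * x, the integral of f(t) phi(t) dt
    becomes (1 / sqrt(pi)) * sum_i w_i f(sqrt(2) * x_i).
    """
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    return float(np.sum(w * f(sqrt(2.0) * x)) / sqrt(pi))

# Hypothetical item parameters for one test in the diseased class.
a1, b1 = 1.2, 0.8

# Sensitivity by quadrature (left side of A2) ...
w_quad = expect_under_phi(lambda t: norm_cdf(a1 + b1 * t))
# ... and by the closed form (right side of A2).
w_closed = float(norm_cdf(a1 / sqrt(1.0 + b1 ** 2))[0])

assert abs(w_quad - w_closed) < 1e-6
```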
APPENDIX 2
Likelihood Function for 2-Phase Prevalence Studies

The phase I sample is a simple random sample of subjects from the population. Hence, the probability of being selected at phase I is a constant and is equal for all eligible subjects. All subjects in phase I undergo M tests with binary outcomes, so there are 2^M different possible outcome classes. Let n*_k, k = 1, ..., 2^M, denote the number of observations in each outcome class, so n = sum of the n*_k over k = 1, ..., 2^M. We model (n*_1, ..., n*_{2^M}) as having a multinomial(n; (P[Z_1; θ], ..., P[Z_{2^M}; θ])) distribution, where the Z_k's are the outcome classes defined by a unique combination of phase I test outcomes and P[Z_k; θ] is calculated under random-effects LCA (2). In phase II, given the outcome of phase I, we select n_k subjects independently with probability p_k from the kth outcome class, k = 1, ..., 2^M. Thus,

$$
n_k \mid \text{phase I} \sim \text{binomial}(n^*_k, p_k).
$$

Each of the n_k subjects selected at phase II is then given the additional test, W. Let n_{ks} denote the number of subjects with outcome s = 0, 1 on the test W, k = 1, ..., 2^M. Then,

$$
n_{ks} \mid \text{phase I and selection in phase II} \sim \text{binomial}(n_k, \Pr[S \mid Z_k; \theta]),
$$

where Pr[S | Z_k; θ] = Pr[S, Z_k; θ] / Pr[Z_k; θ]. It can be shown that the log-likelihood based on data from this 2-phase sampling is given by

$$
\log \ell(\theta; p) = \sum_{k=1}^{2^M} (n^*_k - n_k) \log \Pr[Z_k; \theta]
+ \sum_{k=1}^{2^M} \sum_{s=0}^{1} n_{ks} \log \Pr[S = s, Z_k; \theta]. \tag{A4}
$$

Note that we can use this log-likelihood to estimate θ = θ̂ from the pilot data if we know the values of the other quantities (n_k, n*_k, n_{ks}, p_k). However, we can also use this log-likelihood to find the ‘‘optimal’’ p that will give us the smallest variance of the prevalence estimate by evaluating the log-likelihood at fixed θ = θ̂ while varying the vector of sampling fractions, p.

Now, if we have conducted the pilot study and estimated θ = θ̂, then after phase I of the main study has been carried out, we know the number of subjects n*_k in each of the phase I outcome classes, but we do not know n_k or n_{ks}. In this case, we replace the unknown n_k by its conditional expectation given phase I, which is n*_k p_k, and n_{ks} by its conditional expectation given phase I, which is n*_k p_k Pr[S = s, Z_k; θ̂] / Pr[Z_k; θ̂]. If we design the main study before phase I has been carried out, we do not even know the number of subjects n*_k in each of the phase I outcome classes, so we also replace the unknown n*_k by its unconditional expectation, n Pr[Z_k; θ̂].
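To make the design step concrete, the following sketch evaluates equation A4 with every count replaced by its expectation, as described above. It uses a deliberately simplified hypothetical model (2 phase I tests, conditional independence given disease status, made-up sensitivities and specificities) in place of the paper's random-effects LCA, and recovers the paper's qualitative point: adding a small sampling fraction of screen-negatives increases the expected information for the prevalence, q.

```python
import numpy as np
from itertools import product

# Hypothetical parameters: prevalence plus sensitivity/specificity of the
# 2 phase I tests (T1, T2) and the phase II test (W).
q_true = 0.10
sens = {"T1": 0.80, "T2": 0.75, "W": 0.90}
spec = {"T1": 0.85, "T2": 0.90, "W": 0.95}

classes = list(product([1, 0], repeat=2))  # phase I outcome classes Z_k

def pr_joint(z, s, q):
    """Pr[Z = z, S = s; q], summing over the latent disease status."""
    p_dis, p_non = q, 1.0 - q
    for name, y in zip(["T1", "T2"], z):
        p_dis *= sens[name] if y == 1 else 1.0 - sens[name]
        p_non *= 1.0 - spec[name] if y == 1 else spec[name]
    p_dis *= sens["W"] if s == 1 else 1.0 - sens["W"]
    p_non *= 1.0 - spec["W"] if s == 1 else spec["W"]
    return p_dis + p_non

def pr_class(z, q):
    return pr_joint(z, 1, q) + pr_joint(z, 0, q)

def expected_loglik(q, p_frac, n=2500):
    """Equation A4 with every count replaced by its expectation under q_true."""
    total = 0.0
    for z, p_k in zip(classes, p_frac):
        n_star = n * pr_class(z, q_true)     # E[n*_k]
        n_k = n_star * p_k                   # E[n_k]
        total += (n_star - n_k) * np.log(pr_class(z, q))
        for s in (0, 1):
            n_ks = n_k * pr_joint(z, s, q_true) / pr_class(z, q_true)
            if n_ks > 0:
                total += n_ks * np.log(pr_joint(z, s, q))
    return total

def info_q(p_frac, h=1e-4):
    """Expected information for q via a numerical second derivative."""
    f = lambda q: expected_loglik(q, p_frac)
    return -(f(q_true + h) - 2 * f(q_true) + f(q_true - h)) / h ** 2

# Classes are ordered (1,1), (1,0), (0,1), (0,0); the last is all-negative.
info_pos_only = info_q([1.0, 1.0, 1.0, 0.0])   # no screen-negatives at phase II
info_with_neg = info_q([1.0, 1.0, 1.0, 0.05])  # 5% of screen-negatives added
assert info_with_neg > info_pos_only           # variance of q-hat shrinks
```

The approximate variance of the prevalence estimate is the reciprocal of this information, so searching over the vector p at fixed θ̂ (as in the paper) amounts to maximizing `info_q` subject to a cost constraint.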
Am J Epidemiol 2009;170:369–378