SUBSAMPLING STRATEGIES IN LARGE STUDIES
OF CHRONIC DISEASES
by
Vicki Grey Davis
Department of Biostatistics
University of North Carolina at Chapel Hill, NC
Institute of Statistics Mimeo Series No. 1884T
October 1990
SUBSAMPLING STRATEGIES IN LARGE STUDIES
OF CHRONIC DISEASES
by
Vicki Grey Davis
A Dissertation submitted to the faculty of
The University of North Carolina at Chapel Hill
in partial fulfillment of the requirements for the
degree of Doctor of Public Health in the
Department of Biostatistics.
Chapel Hill
1990
Approved by:
Advisor
VICKI GREY DAVIS. Subsampling Strategies in Large Studies of Chronic
Diseases (Under the direction of Clarence E. Davis).
ABSTRACT
The case-cohort design (Prentice, 1986) and the synthetic case-control design (Mantel, 1973; Liddell, McDonald & Thomas, 1977; Prentice
& Breslow, 1978) are economical alternatives to the traditional full-cohort
design for the study of chronic diseases. By processing covariate
information on all the events and only a fraction of those subjects who do
not experience the outcome of interest, substantial savings can be realized
in epidemiologic studies. This research evaluates the efficiency of the
subsampling strategies under various conditions using simulated data. The
outcome of interest in all analyses is time-to-response, and the performance
of both hybrid designs is judged relative to that of the traditional cohort
design.
In particular, emphasis is placed on estimating the relative risk
function for a primary variable, such as treatment or exposure, while
adjusting for a covariate which may or may not be correlated with the
primary variable.
When both the exposure and covariate are dichotomous
variables, the hybrid designs perform satisfactorily, with relative
efficiencies near 80 percent. This is affected only slightly by the degree of
association between the exposure and covariate, the strength of their
effects, and the size of the referent group. However, when the exposure and
covariate are bivariate normal, the relative efficiency is inversely related to
the strength of their relationship with the outcome. The case-cohort and
synthetic case-control designs are applied to two studies of cardiovascular
disease in order to illustrate the use of these strategies as alternatives to
the traditional full-cohort design.
ACKNOWLEDGMENTS
I am very grateful to my advisor, Dr. C. E. Davis, for his
guidance, encouragement, and patience throughout the writing of this
dissertation. I thank Drs. Frank Harrell, Larry Kupper, Ken Poole and
David Savitz for their helpful suggestions and participation on my
committee.
I would also like to thank Dr. Jay Lubin of the National Cancer
Institute who was kind enough to provide me with a copy of his program for
analyzing data from a case-cohort study. A special acknowledgment is
given to Mike Padrick of Academic Computing Services who succeeded in
getting Dr. Lubin's program to run on the Convex supercomputer, and who
also taught me a great deal about FORTRAN and UNIX.
I wish to express my appreciation to Research Triangle Institute
for their financial assistance toward my doctorate degree. My supervisor,
Dr. Tyler Hartwell, was especially patient and understanding in allowing
me time away from my job as I worked on this degree.
Finally, I wish to thank my family and friends for their
encouragement and continued interest in this project, especially my
husband, Gordon, whose love and support (and many hours of babysitting
Kathleen) enabled me to achieve this goal.
TABLE OF CONTENTS
LIST OF TABLES
Chapter
I.   INTRODUCTION AND REVIEW OF THE LITERATURE
     1.1. Traditional Designs in Epidemiologic Research
          1.1.1. The Cohort Study
          1.1.2. The Case-Control Study
     1.2. Hybrid Designs as Alternatives to Traditional Studies
          1.2.1. The Synthetic Case-Control Design
          1.2.2. The Case-Cohort Design
     1.3. Outline of the Research Proposal
II.  HYBRID DESIGNS IN A CLINICAL TRIAL SETTING
     2.1. Introduction
     2.2. Description of Simulations
     2.3. Results from Simulations
     2.4. Simulation Results for a Common Disease
     2.5. Discussion and Summary
III. HYBRID DESIGNS IN AN OBSERVATIONAL STUDY WITH
     A DICHOTOMOUS EXPOSURE
     3.1. Introduction
     3.2. Description of Simulations
     3.3. Results from Simulations
     3.4. Discussion and Summary
IV.  HYBRID DESIGNS IN AN OBSERVATIONAL STUDY WITH
     A CONTINUOUS EXPOSURE
     4.1. Introduction
     4.2. Description of Simulations
     4.3. Results from Simulations
     4.4. Discussion and Summary
V.   APPLICATION OF SUBSAMPLING STRATEGIES TO
     RESEARCH DATA
     5.1. Introduction
     5.2. Description of Data from the Lipid Research
          Clinics Follow-up Study
     5.3. Comparison of Results from the Full-Cohort and
          Hybrid Designs
     5.4. Description of Data from the Duke
          University Medical Center Cardiovascular
          Disease Data Bank
     5.5. [illegible] Severity Index Under the Full-Cohort and
          Hybrid Designs
     5.6. Summary of Results from Research Data
VI.  SUMMARY AND SUGGESTIONS FOR FURTHER
     RESEARCH
     6.1. Summary
     6.2. Suggestions for Further Research
REFERENCES
LIST OF TABLES
Table 2.1
Simulation Summary Statistics for Estimating
the Treatment Effect (β1) While Adjusting for a
Covariate Effect (β2) when Both Variables are
Dichotomous and Independent
Table 2.2
Efficiency of the Unadjusted Analysis Relative to
the Adjusted Analysis in the Full-Cohort Design
when the Treatment and Covariate are
Dichotomous and Independent
Table 2.3
Efficiency of the Hybrid Designs Relative to the
Unadjusted Estimates of the Full-Cohort Design
when the Treatment and Covariate are
Dichotomous and Independent
Table 2.4
Efficiency of the Hybrid Designs Relative to the
Adjusted Estimates of the Full-Cohort Design
when the Treatment and Covariate are
Dichotomous and Independent
Table 2.5
Simulation Summary Statistics for Estimating
the Treatment Effect (β1) While Adjusting for a
Covariate Effect (β2) when Both Variables are
Dichotomous and Independent and the
Disease is Common
Table 2.6
Efficiency of the Unadjusted Analysis Relative to
the Adjusted Analysis in the Full-Cohort Design
when the Treatment and Covariate are
Dichotomous and Independent and the Disease is
Common
Table 2.7
Efficiency of the Case-Cohort Design Relative to
the Unadjusted and Adjusted Estimates of the
Full-Cohort Design when the Treatment and
Covariate are Dichotomous and Independent and
the Disease is Common
Table 3.1
Simulation Summary Statistics for Estimating
the Exposure Effect (β1) While Adjusting for a
Covariate Effect (β2) when Both Variables are
Dichotomous and Mildly Correlated
(ρ = 0.20)
Table 3.2
Efficiency of the Unadjusted Analysis Relative to
the Adjusted Analysis in the Full-Cohort Design
when the Exposure and Covariate are
Dichotomous and Mildly Correlated (ρ = 0.20)
Table 3.3
Efficiency of the Hybrid Designs Relative to the
Adjusted Estimates of the Full-Cohort Design
when the Exposure and Covariate are
Dichotomous and Mildly Correlated (ρ = 0.20)
Table 3.4
Simulation Summary Statistics for Estimating
the Exposure Effect (β1) While Adjusting for a
Covariate Effect (β2) when Both Variables are
Dichotomous and Moderately Correlated
(ρ = 0.40)
Table 3.5
Efficiency of the Unadjusted Analysis Relative to
the Adjusted Analysis in the Full-Cohort Design
when the Exposure and Covariate are
Dichotomous and Moderately Correlated
(ρ = 0.40)
Table 3.6
Efficiency of the Hybrid Designs Relative to the
Adjusted Estimates of the Full-Cohort Design
when the Exposure and Covariate are
Dichotomous and Moderately Correlated
(ρ = 0.40)
Table 4.1
Simulation Summary Statistics for Estimating
the Exposure Effect (β1) While Adjusting for a
Covariate Effect (β2) when Both Variables are
Continuous and Independent
Table 4.2
Efficiency of the Unadjusted Analysis Relative to
the Adjusted Analysis in the Full-Cohort Design
when the Exposure and Covariate are Continuous
and Independent
Table 4.3
Efficiency of the Hybrid Designs Relative to the
Adjusted Estimates of the Full-Cohort Design
when the Exposure and Covariate are Continuous
and Independent
Table 4.4
Simulation Summary Statistics for Estimating
the Exposure Effect (β1) While Adjusting for a
Covariate Effect (β2) when Both Variables are
Continuous and Mildly Correlated (ρ = 0.20)
Table 4.5
Efficiency of the Unadjusted Analysis Relative to
the Adjusted Analysis in the Full-Cohort Design
when the Exposure and Covariate are Continuous
and Mildly Correlated (ρ = 0.20)
Table 4.6
Efficiency of the Hybrid Designs Relative to the
Adjusted Estimates of the Full-Cohort Design
when the Exposure and Covariate are Continuous
and Mildly Correlated (ρ = 0.20)
Table 4.7
Simulation Summary Statistics for Estimating
the Exposure Effect (β1) While Adjusting for a
Covariate Effect (β2) when Both Variables are
Continuous and Moderately Correlated
(ρ = 0.40)
Table 4.8
Efficiency of the Unadjusted Analysis Relative to
the Adjusted Analysis in the Full-Cohort Design
when the Exposure and Covariate are Continuous
and Moderately Correlated (ρ = 0.40)
Table 4.9
Efficiency of the Hybrid Designs Relative to the
Adjusted Estimates of the Full-Cohort Design
when the Exposure and Covariate are Continuous
and Moderately Correlated (ρ = 0.40)
Table 5.1
Estimates of the Effect of Exercise Tolerance Test
(ETT) Outcome (β) on Cardiovascular Disease
Mortality in the Lipid Research Clinics
Follow-up Study
Table 5.2
Estimates of the Effect of Coronary Artery
Obstruction (β1, β2, β3) and Coronary Artery
Disease (CAD) Severity Index (β4) on
Cardiovascular Disease Mortality in the Duke
University Medical Center Study
CHAPTER I
INTRODUCTION AND REVIEW OF THE LITERATURE
The goal of many studies in health research is to identify factors
which increase an individual's risk of developing disease. Traditionally, the
cohort study has been the preferred design for establishing such
associations. The clinical trial, a special type of cohort study, is used in
particular for studies of treatment efficacy. When the disease under study
is chronic in nature, such as cardiovascular disease or cancer, outcomes of
interest are infrequent whether they be disease incidence or disease-associated mortality. As a result, many patients have to be followed for a
long period of time before a sufficient number of events occur to provide
meaningful results. In addition to keeping a record of all endpoints, a great
deal of effort will be expended over the course of the study on the acquisition
of raw materials for assembly into individual covariate histories. Only a
small part of these data will be directly related to the primary question,
while the rest will be used for increasing the validity and precision of the
comparison, as well as for performing ancillary studies. Unfortunately,
much of the information collected and processed on the cohort will be
redundant due to the low rate of events in these studies. For example, in
the Lipid Research Clinics Coronary Primary Prevention Trial (1984), 3,806
men were followed for an average of 7.4 years in order to assess the
effectiveness of lowering low-density lipoprotein cholesterol levels in
reducing CHD risk. During this period, there were a total of 342 (9%)
primary endpoints of either definite CHD death or nonfatal myocardial
infarction. Thus, extensive covariate information was collected on 3,464
patients who did not experience the primary outcome. Although it is
necessary to have some of these data for purposes of comparison, there is
also a point of diminishing returns after which the covariables for the
remaining patients offer little additional information.
The case-control design was introduced as a practical alternative
in the study of chronic diseases. Since the number of endpoints is not
restricted by the natural frequency of disease-related outcomes, there is
considerable savings in the time required to complete the study. In
addition, the number of controls for whom covariate histories must be
obtained is determined at the start of the study and can be chosen to
optimize efficiency within given cost constraints. However, the vulnerability
of this design to several types of bias, particularly with regard to the
selection of controls, can outweigh the savings in time and money. Thus,
much of the recent effort in this area of epidemiologic research has been
aimed at developing methodology for hybrid designs which combine the
basic elements (and advantages) of cohort and case-control studies. The two
which have received the most attention are the synthetic case-control design
and the case-cohort design. In particular, these designs minimize the large
amount of redundant covariate information assembled on subjects who fail
to experience the outcome of interest, while still maintaining the
advantages that come with following patients from a well-defined cohort.
Essentially, this is accomplished by processing covariate histories on all of
the cases and only a sample of the noncases.
A brief overview is given here for cohort and case-control
studies, before discussing the newer hybrid designs. In addition, properties
of the clinical trial are reviewed as a special case of the cohort study. The
design of observational studies is described in detail in most epidemiologic
textbooks [MacMahon and Pugh (1970), Friedman (1987), Kleinbaum,
Kupper and Morgenstern (1982)]. A comprehensive discussion of clinical
trials can be found in Pocock (1983) and Friedman, Furberg, and DeMets
(1985).
1.1 Traditional Designs in Epidemiologic Research
1.1.1 The Cohort Study
In general, the term "cohort" refers to a group of people who
share some feature in common. For example, in a study to assess the effects
of exposure to a particular risk factor, the cohort might include employees
from an industry where exposure to this risk is high. Cohorts are
considered "fixed" if new entries are not allowed once follow-up has begun,
thus producing equal lengths of follow-up for all subjects who do not
experience the event of interest. More common, however, is the study of
"dynamic" populations in which the length of follow-up differs among the
noncases as a result of staggered entry into the cohort and/or migration out
of the cohort during the course of the study. Once the cohort is defined, the
subjects are classified according to their exposure to a particular risk factor
and then followed to ascertain disease experience. In many studies, the
outcome is the first occurrence of the disease under study (i.e., incident
cases). Therefore, only subjects who are initially disease-free are included in
the cohort. The term "population-at-risk" is used to refer to this group since
it is composed only of subjects who are eligible to become cases.
In a completely prospective cohort design, the risk factor and the
outcome are observed after the study has begun. In a retrospective design,
both of these events have occurred prior to the onset of the study. However,
it is not uncommon for a study to begin after exposure has occurred, but
prior to the outcome of interest (i.e., an ambispective design). The most
important feature of the cohort design is the fact that information
concerning the risk factor is known prior to occurrence of the primary
endpoint. For the retrospective cohort, also referred to as a historical
cohort, this information must have been recorded on a well-defined
population which was then followed for detection of new cases. This feature
strengthens the argument for causal inference since it is reasonably certain
that the exposure preceded the outcome. In addition, the risk of selection
bias due to differential inclusion of cases who were exposed is minimal.
Among observational studies, the prospective cohort is preferred
for establishing an association between a risk factor and an outcome, as it
most closely follows the logic of a true experiment (Kleinbaum, Kupper, and
Morgenstern, 1982). However, the cohort study does have certain
limitations in this respect. Loss to follow-up can be a potential source of
bias as cohort members move away, die from other causes, or just lose
interest in the study. In addition, the comparison groups may differ in
important ways other than the risk factor under study. In extreme cases,
these differences may completely mask the relationship between the risk
factor and the outcome being studied. Often the investigator is aware of
potential confounders or effect modifiers and can control for these at the
analysis stage. However, sometimes these differences are difficult to
quantify, or they may not be recognized until the study is finished.
Particular problems can arise when the study factor is a therapeutic
intervention, such as a new drug. In such situations, it may be the
investigator himself who imposes imbalance on the comparison groups by
either intentionally or unintentionally allocating the treatments to his
patients in a biased manner. For example, some physicians may favor
using an experimental therapy only as a last resort, perhaps for those
patients who have not benefited from the standard treatment. In the
situation where treatment for the next patient has already been
predetermined and is not acceptable to the investigator, he may choose not
to enroll this subject in the study. Thus, differences in the study groups can
be even more pronounced in situations where intervention is assigned
rather than self-selected as in studies of smoking or occupational exposure.
For these reasons, clinical trials are often used in intervention studies
since they permit greater control over various aspects of the design, thereby
allowing a more accurate assessment of the treatment effects.
In a controlled clinical trial designed to assess the efficacy of an
experimental treatment, patients are assigned at random to receive either
the new therapy or a control treatment. The control treatment may be the
standard medication which is used for the particular condition or it may be
a placebo. In many clinical trials, the patient and the investigator are
blinded as to which treatment has been assigned. This eliminates two
potential sources of bias in evaluating the new therapy. In addition, it
assures that both the experimental and the control groups receive
equivalent medical supervision over the course of follow-up. It is also
important that patients are assigned to the treatment groups without bias.
This is typically achieved with a randomization scheme which assures that
each patient has an equal chance of receiving either of the treatments.
Therefore, extraneous factors which may affect the outcome tend to be
equally distributed between the two groups. As a result, randomization
makes it possible to assign a probability distribution to differences observed
between the treatments under the null hypothesis that they are equal.
Furthermore, the validity of statistical tests of significance does not depend
on the balance of prognostic factors between these groups (Friedman,
Furberg, and DeMets, 1985). Therefore it is possible (and perfectly valid) to
test for differences in the treatment groups without having to control for
other risk factors, both known and unknown. This result is important since
randomization does not guarantee that the two groups will be equivalent
with respect to all extraneous factors. This is especially true when the
number of patients randomized is small. Often stratified randomization is
used to obtain an equal distribution of the most important concomitant
factors. For example, in a multicenter clinical trial, patients are frequently
randomized separately within each center. Thus differences which exist
between the centers with respect to patient population and medical practice
will be equally divided between the treatment groups.
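The stratified randomization described above can be illustrated with a short Python sketch that assigns patients to two arms separately within each center. This is a minimal sketch under stated assumptions: the use of permuted blocks, the block size of four, the example patient list, and the names `permuted_block` and `stratified_assignments` are illustrative choices and are not taken from any of the trials cited here.

```python
import random

def permuted_block(block_size, rng):
    """Return one shuffled block with equal numbers of each arm (hypothetical helper)."""
    if block_size % 2 != 0:
        raise ValueError("block_size must be even for two equally sized arms")
    block = ["treatment", "control"] * (block_size // 2)
    rng.shuffle(block)
    return block

def stratified_assignments(patients, block_size=4, seed=0):
    """Randomize separately within each center using permuted blocks.

    patients: list of (patient_id, center) in order of enrollment.
    Returns a dict mapping patient_id to "treatment" or "control".
    """
    rng = random.Random(seed)
    pending = {}  # center -> assignments left in that center's current block
    arms = {}
    for patient_id, center in patients:
        if not pending.get(center):
            # Start a new block for this center when none is open.
            pending[center] = permuted_block(block_size, rng)
        arms[patient_id] = pending[center].pop()
    return arms

# Hypothetical enrollment: eight patients at each of two centers.
patients = [(i, "A" if i < 8 else "B") for i in range(16)]
arms = stratified_assignments(patients)
```

Because every completed block contains both arms in equal numbers, each center contributes the same number of patients to the two treatment groups whenever its enrollment is a multiple of the block size.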
Prior to analysis of the data, it is common practice to check the
effectiveness of the randomization by comparing the two treatment groups
with respect to the distribution of prognostic factors. Then, if
imbalances do exist with respect to one or more covariables, they can be
adjusted for in the analysis to increase the precision of the estimate of
treatment effect and improve the power of the contrast. Armitage and
Gehan (1974) point out that inclusion of balanced variables which are
highly correlated with the outcome may also improve the precision of the
estimate of treatment effect by reducing the random variation in the
response. However, Morgan and Elashoff (1986) report that in the
comparison of survival times, this gain in efficiency decreases with
increasing probability of being censored. Armitage and Gehan also note
that consideration of concomitant information will permit the detection of
interactions which may exist between the treatment and these factors.
They caution, however, that any prognostic factors which are considered
should be unaffected by the treatment; otherwise their inclusion in the
analysis may remove some of the treatment effect.
If nonlinear models such as those used for survival data are
essential to the analysis, it may be necessary under certain conditions to
include important prognostic variables in the model to avoid bias, even
when these variables are balanced across the treatment groups. Gail,
Wieand, and Piantadosi (1984) reviewed several of the more frequently used
nonlinear models and found that the estimate of treatment effect in the
censored exponential model and the Cox proportional hazards model is
biased toward zero when one fails to adjust for a balanced prognostic
variable. This is a result of the hazards no longer being proportional in the
misspecified model. Under moderate censoring, a comparison between
these two models reveals greater bias with the Cox proportional hazards
model. Chastang, Byar, and Piantadosi (1988) have shown the bias to be
most influenced by the strength of the association of the omitted covariate
with survival, the percentage of censoring, and the distribution of the
covariate. In general, the bias seems to be greatest for stronger covariate
effects, approximately 45 percent censoring, and for a binary covariate with
nearly equal probability of having the value of 0 or 1.
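The direction of this bias can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: it uses the censored exponential model with simple events-per-person-time estimates rather than the Cox partial likelihood, and the sample size, effect sizes, censoring time, seed, and function names are all hypothetical choices rather than values from the papers cited above.

```python
import math
import random

def log_rate_ratio(rows):
    """Exponential-model estimate of the log hazard ratio (treated vs. control).

    rows: iterable of (z, time, event) with z = 0 or 1.
    """
    events = [0, 0]
    person_time = [0.0, 0.0]
    for z, t, d in rows:
        events[z] += d
        person_time[z] += t
    return math.log((events[1] / person_time[1]) / (events[0] / person_time[0]))

def simulate(n=200_000, beta1=math.log(2.0), beta2=1.5, censor_at=1.0, seed=1):
    """Generate (x, z, time, event) records with hazard exp(beta1*z + beta2*x)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        z = i % 2          # treatment indicator, balanced by construction
        x = (i // 2) % 2   # binary covariate, balanced within each arm
        t = rng.expovariate(math.exp(beta1 * z + beta2 * x))
        d = int(t <= censor_at)  # 1 if the event is observed before censoring
        data.append((x, z, min(t, censor_at), d))
    return data

data = simulate()
# Unadjusted: ignore the covariate entirely.
unadjusted = log_rate_ratio((z, t, d) for x, z, t, d in data)
# Adjusted: estimate within each covariate stratum, then average the two.
by_stratum = [log_rate_ratio((z, t, d) for x, z, t, d in data if x == s)
              for s in (0, 1)]
adjusted = sum(by_stratum) / 2
print(f"true {math.log(2.0):.3f}  unadjusted {unadjusted:.3f}  adjusted {adjusted:.3f}")
```

With these settings the unadjusted estimate should fall below the true log hazard ratio even though the covariate is perfectly balanced between the arms, while the stratified estimate should lie close to the true value.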
Although prospective cohort studies in general, and clinical
trials in particular, are the most widely accepted designs for establishing
causal inference, there are several features common to both of these designs
which make them inefficient for studying rare diseases. These studies
typically require the follow-up of hundreds of subjects for several years until
the number of endpoints assures adequate power to determine whether an
association exists. In addition, much time and money will be spent
gathering covariate information on the entire cohort of which only a small
portion will experience the outcome of interest. Lilienfeld and Lilienfeld
(1979) noted that the cohort design was prominent in the 19th century when
the outstanding health problems were infectious diseases with high
incidence and short latency periods. As emphasis has shifted to chronic
diseases in the 20th century, the case-control design has grown in
popularity.
1.1.2 The Case-Control Study
The case-control design compares a group of subjects who have
experienced the outcome of interest (cases) and a group of noncases
(controls) with respect to a suspected risk factor. Thus, the emphasis is on
the cause of a specific disease, rather than the health consequences of a
particular exposure as in cohort studies. Case-control studies are often
referred to as retrospective since the directionality of the investigation is
backward, from outcome to exposure. The case-control study is well-suited
to testing hypotheses concerning rare diseases or diseases with long latent
periods. Since the outcome has occurred prior to the start of the study, the
number of events is not constrained by the natural frequency of the disease.
Therefore, management of the study is more efficient in terms of sample
size, time, and cost. However, it is this reverse directionality which is also
one of the major weaknesses of the design. The quality of the information
regarding exposure may be affected by errors in either subject recall or
historical records. In particular, "selective recall" may result from a
subject's memory of events being differentially influenced by his disease
status. These sources of bias make it problematic to ascertain if the
exposure actually preceded the outcome and therefore hinder the
establishment of a causal inference.
Schneiderman and Levin (1973) point out that while access to
cases is relatively straightforward, the selection process for controls is
crucial to the integrity of the study. Although in reality the subjects are
chosen from two separate populations, in order to draw inferences from the
results, it is assumed that the controls reflect the population from which the
cases developed. It is desirable for the groups to be similar with respect to
other extraneous factors which might be related to the outcome, but are not
of particular interest to the study. Several design strategies aimed at
achieving comparability between the cases and controls have been explored
in the literature. One strategy which has received particular attention is
matching. This involves placing constraints on the choice of controls so that
they possess characteristics in common with the cases regarding potentially
confounding factors. Matching schemes can range from the pairing of one
case to one control to the grouping of one case with R controls (R-to-1
matching). Category matching is also common in which all cases and
controls who share a common set of values for one or more variables (e.g.,
age group, gender and race) are said to be matched on those variables. By
reducing the differences between the two groups with respect to
determinants of the disease other than the exposure being studied, there is
greater power to detect the primary association of interest. Kupper et al.
(1981) point out, however, that this gain in statistical efficiency may be
offset if the matching variables are not true risk factors.
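The matching schemes described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: controls are drawn without replacement from candidates in the same matching category as the case, and the function name `r_to_1_match`, the field names, and the example subjects are hypothetical.

```python
import random
from collections import defaultdict

def r_to_1_match(cases, candidates, key, r=2, seed=0):
    """Select r controls for each case from candidates in the same category.

    cases, candidates: lists of dicts describing subjects.
    key: function returning the matching category (e.g., age group and gender).
    Returns a list of (case, [controls]); no control is used more than once.
    """
    rng = random.Random(seed)
    pool = defaultdict(list)
    for subject in candidates:
        pool[key(subject)].append(subject)
    for members in pool.values():
        rng.shuffle(members)  # random order within each category
    matched = []
    for case in cases:
        available = pool[key(case)]
        if len(available) < r:
            raise ValueError(f"not enough controls for category {key(case)}")
        matched.append((case, [available.pop() for _ in range(r)]))
    return matched

# Hypothetical subjects matched on age group and gender.
cases = [
    {"id": 1, "age_group": "50-59", "gender": "M"},
    {"id": 2, "age_group": "60-69", "gender": "F"},
]
candidates = (
    [{"id": i, "age_group": "50-59", "gender": "M"} for i in (10, 11, 12)]
    + [{"id": i, "age_group": "60-69", "gender": "F"} for i in (13, 14, 15)]
)
category = lambda s: (s["age_group"], s["gender"])
matched = r_to_1_match(cases, candidates, key=category, r=2)
```

Setting `r=1` gives pair matching, and a coarser `key` function corresponds to category matching on fewer variables.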
Another strategy often used in case-control studies is to choose
more than one comparison group, particularly from different sources (e.g.,
community controls and hospital controls). This was recommended by
Ibrahim and Spitzer (1979) as both a check for potential biases and a way to
demonstrate consistency in findings. As it is unlikely that bias would affect
both control groups to the same extent, different estimates of association
would be an indication that at least one is biased.
The case-control design offers the opportunity to study rare
diseases in a relatively efficient and inexpensive manner. In addition, it is a
practical alternative for research on diseases which take a long time to
manifest themselves. However, many of the design features which make
the case-control study feasible for these applications are also limitations in
that they expose the results to several types of bias. Even the most carefully
designed case-control studies are subject to scrutiny, and frequently results
are not fully accepted until a subsequent prospective study confirms the
findings.
It is evident from a review of the cohort and case-control studies
that there are distinct advantages and disadvantages associated with each
of these traditional designs. In many ways, the two designs are
complementary in that the strengths of one are the weaknesses of the other.
The hybrid case-cohort and synthetic case-control designs reflect an effort to
exploit the strengths of these two traditional strategies while minimizing
the overall costs. They show particular promise in the study of chronic
diseases where outcomes of interest are infrequent and latent periods are
long. As might be expected, the distinction between these two designs lies
in the selection process for noncases.
1.2 Hybrid Designs as Alternatives to Traditional Studies
1.2.1 The Synthetic Case-Control Design
The synthetic case-control design is simply a case-control study
carried out within the well-defined boundaries of a cohort which has been
followed prospectively over time. As in an ordinary cohort design, each
subject must be followed to identify those who subsequently experience the
outcome of interest. For each "case" which occurs, one or more noncases or
"controls" are randomly selected from the remaining cohort to serve as the
referent group. The controls may be matched to the cases on certain
concomitant factors such as age or smoking. A special case of this which
will be discussed later is matching on time-to-response.
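The selection of controls matched on time-to-response can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions: controls for each case are drawn at random from subjects still under follow-up at the case's event time, so a subject who later becomes a case may serve as a control for an earlier one. The function name `sample_risk_sets` and the five-subject cohort are hypothetical.

```python
import random

def sample_risk_sets(cohort, controls_per_case=1, seed=0):
    """Draw time-matched controls for every case in a cohort.

    cohort: list of (subject_id, follow_up_time, is_case).
    Returns a list of (case_id, [control_ids]), ordered by event time.
    """
    rng = random.Random(seed)
    cases = sorted((s for s in cohort if s[2]), key=lambda s: s[1])
    risk_sets = []
    for case_id, case_time, _ in cases:
        # Subjects whose follow-up extends at least to the case's event time.
        at_risk = [sid for sid, t, _ in cohort
                   if t >= case_time and sid != case_id]
        k = min(controls_per_case, len(at_risk))
        risk_sets.append((case_id, rng.sample(at_risk, k)))
    return risk_sets

# Hypothetical cohort: subjects 1 and 3 are cases; the others are censored.
cohort = [(1, 2.0, True), (2, 5.0, False), (3, 3.5, True),
          (4, 6.0, False), (5, 1.0, False)]
risk_sets = sample_risk_sets(cohort, controls_per_case=2)
```

Because sampling is repeated independently at each event time, the same subject can be selected as a control for more than one case.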
This hybrid design, also referred to as a case-control within
cohort or a nested case-control design, realizes advantages of both the
traditional designs from which it is derived. The possibility of selection
bias, always a major concern in case-control studies, is greatly reduced since
both the cases and noncases are identified from the same candidate
population at risk. In addition, enrollment in the cohort provides a list from
which the controls can be randomly selected. A true random sample is
frequently not possible when case-control studies are performed in the
community. This arrangement also ensures that a 100 percent sample of
the cases is obtained. This not only provides maximum statistical power,
but it also reduces the bias associated with differential identification of
cases who have been exposed to the risk factor under study.
Since only a sample of the noncases is included in the analysis,
this design is economically more appealing for the study of chronic diseases
than a traditional cohort design. The time and expense associated with
obtaining covariate histories on the entire cohort are significantly reduced.
This design permits a forward approach in the sense that the raw materials
necessary to determine exposure and other covariate information can be
collected and stored prior to the occurrence of the outcome. If the raw
material is consumable, such as blood sera, then enough must be sampled
from each subject initially to allow for the possibility of a subject being
included in the analysis more than once. Information concerning exposure
to the risk factor can also be obtained in a backward fashion once the cases
have been identified. With this approach, the synthetic case-control design
is a natural choice for the purpose of generating new hypotheses using data
from cohorts which have already been defined and followed for the outcome
of interest. Along these same lines, Mantel (1973) notes the usefulness of
this design when a traditional cohort study had already been performed, but
the sample size was too large to permit a comprehensive analysis. Mantel
suggested converting the cohort study into a synthetic case-control design
for the purpose of reducing computer costs. Following the identification of
preliminary models, the complete cohort could then be used to perform the
final analysis.
Based on a similar idea, a two-stage design was suggested by
White (1982) for situations in which exposure and disease status are readily
available, but important covariate information is either difficult or
expensive to obtain. The covariates referred to here are potential
confounders or effect modifiers for which adjustment must be considered at
the analysis stage. White addressed the situation of a single dichotomous
exposure and discrete covariates. In the first stage, the entire cohort is
cross-classified into one of four groups according to the presence or absence
of the exposure and outcome. The second stage involves taking a random
sample from each of the four groups and retrospectively obtaining covariate
information for individuals in these smaller subsamples. White pointed out
that such a design would be cost effective in situations where the first stage
results in highly unbalanced cells (e.g., a rare exposure and a rare disease).
By choosing the sampling fractions such that covariate information is
obtained for all subjects in the small cells and only a portion of those in the
large cells, little is lost in the precision of the stratum-specific odds ratios as
well as the overall adjusted odds ratio. Estimates of the odds ratio and
hypothesis tests for the two-stage design are presented using a weighted
least squares approach. Cain and Breslow (1988) elaborated on White's
design using a logistic regression model that allowed multiple exposure and
confounding variables which could be either continuous or discrete.
Adjusted parameter estimates were derived from the standard logistic
regression model by incorporating information about disease and exposure
status on all subjects from the first stage of the analysis.
Liddell, McDonald, and Thomas (1977) found that the quality of
the data could be improved by using the synthetic case-control design. They
studied a large cohort of workers from the Quebec chrysotile asbestos-producing
industry. The complete cohort contained 10,951 men of whom
215 died of lung cancer, the primary study endpoint. By selecting a random
sample of 5 controls per case from the entire cohort, only 1290 men were
included in the analysis data set. Although the complete cohort had to be
defined and followed to obtain the cases, it was only necessary to process
work and smoking histories on about a tenth of the cohort. Therefore,
additional effort could be spent on improving the quality of the data which
were collected. Furthermore, when a comparison was made with a full-cohort
analysis, not only were the results essentially the same, but the case-control
within cohort analysis was performed at one-twentieth of the cost.
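
The sample-size arithmetic in the Liddell, McDonald, and Thomas example can be checked directly. The short Python sketch below uses only the figures quoted above; the variable names are illustrative, and the calculation simply counts each case together with its five sampled controls.

```python
# Figures quoted in the text for the Quebec chrysotile asbestos cohort.
cohort_size = 10_951
cases = 215
controls_per_case = 5

# Each case enters the analysis data set together with its sampled controls.
analysis_size = cases * (1 + controls_per_case)

# Share of the cohort whose work and smoking histories had to be processed.
fraction_processed = analysis_size / cohort_size

print(analysis_size)                 # 1290
print(round(fraction_processed, 3))  # 0.118, i.e. about a tenth of the cohort
```
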
Several authors [Liddell, McDonald, and Thomas (1977),
Prentice and Breslow (1978), Breslow and Patton (1979), and Whittemore
and McMillan (1982)] have expanded the methodology of the synthetic case-control
design to studies in which "time-to-response" is the primary
endpoint. In the analysis of time-to-response data, this design involves
matching each case (or failure) with one or more randomly selected subjects
at risk at that point in time. These subjects are frequently referred to as
time-matched controls. The consideration of time in the analysis creates
several alternatives relative to the sampling of controls. Of particular
interest are whether the controls are sampled with or without replacement
at each failure time, whether a previous control remains eligible for
selection at future failure times, and whether subjects are excluded from the
pool of possible controls if they are known to subsequently become a case.
Prentice and Breslow (1978) developed methodology for a
sampling scheme in which selection was without replacement at each
distinct failure time, and previously chosen controls remained eligible for
matching to future cases (Design A). This design was discussed in the
context of sampling from a very large population, which they referred to as
a "conceptual prospective cohort." Thus, the matched sets could be assumed
disjoint, giving rise to a conditional likelihood very similar to that proposed
by Cox (1972), with the only difference being the composition of the risk set,
R(t), in the denominator. The original function, as proposed by Cox for use
in prospective cohort studies, defined R(t) as all those in the cohort at risk
at the time of the failure (including the subject who failed). In the function
of Prentice and Breslow, R(t) was redefined to include the failing subject
and only a random sample of all subjects at risk. Initially, there were
reservations about the validity of the likelihood proposed by Prentice and
Breslow since it combined statistical information from each risk set as if
they were independent. In fact, it was probable that an individual who
served as a control in one risk set might develop into a case at some later
time, or even be reselected as a control. However, Oakes (1981) expanded
on this work and showed that if Design A was carried out in a well-defined
(finite) cohort, then the likelihood function would possess a partial
likelihood interpretation, provided the selection of controls at each distinct
failure time was independent and that the cohort constituted a random
sample from a larger target population.
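
As a rough illustration of the Design A sampling scheme described above, the following Python sketch draws time-matched controls without replacement at each failure time while leaving previously chosen controls and future cases eligible. The function name `sample_design_a`, the data layout, and the handling of tied times (a subject is treated as at risk at time t if its follow-up time is at least t) are assumptions made for this example, not details taken from Prentice and Breslow.

```python
import random


def sample_design_a(times, events, k, rng):
    """Return a list of (case_index, control_indices) pairs, one per failure.

    times  : follow-up time for each subject
    events : 1 if the subject failed at that time, 0 if censored
    k      : number of controls requested per case
    rng    : a random.Random instance
    """
    sampled = []
    # Visit the failures in time order.
    order = sorted(range(len(times)), key=lambda i: times[i])
    for i in order:
        if not events[i]:
            continue
        # Everyone still at risk at the failure time, other than the case.
        # Earlier controls and future cases are deliberately kept in the pool.
        pool = [j for j in range(len(times)) if j != i and times[j] >= times[i]]
        # Sampling is without replacement within this failure time only.
        controls = rng.sample(pool, min(k, len(pool)))
        sampled.append((i, controls))
    return sampled


# Hypothetical toy cohort of six subjects.
times = [2.0, 5.0, 3.0, 7.0, 1.0, 6.0]
events = [1, 0, 1, 0, 0, 1]
risk_sets = sample_design_a(times, events, k=2, rng=random.Random(0))
```

Because each failure time is sampled independently, the same subject can appear in several sampled risk sets, which is the feature that prompted the early questions about the likelihood.
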
Robins, Gail, and Lubin (1986) suggested a variation of Design A
in which controls, once matched, were no longer eligible for selection at
future failure times. However, under this design it was not clear how to
handle controls who later developed into cases. Lubin and Gail (1984)
showed that if future cases were excluded from the pool of potential controls
prior to their failing (referred to as "case exclusion"), then substantial bias
could result in estimating relative risks different from unity. However, the
alternative solution would require excluding from analysis any contribution
from a case who had previously been selected as a control. Prentice (1986a)
observed that such an exclusion would be wasteful, especially in studies
where the outcome of interest is a rare occurrence. Lubin and Gail noted
that if previous controls were excluded from consideration as future
controls, but not as future cases, the relative risk estimates from the
proportional hazards model would still be consistent as long as "time"
represented time since the start of follow-up, and not some other
measurement such as chronological age. However, they pointed out that the
usual variance estimator based on the inverse of the observed information
matrix would not be consistent under these circumstances. Prentice
elaborated on this sampling scheme by specifying that a subject failing at
time t who had previously served as a control at some earlier time, say t',
could only be matched to subjects who were also at risk at both times t and
t' (Design B). Since such a sampling design would result in non-nested
events, the likelihood function proposed for Design A would no longer
possess a partial likelihood interpretation. Therefore, Prentice used the
term pseudolikelihood following that of Besag (1977) to describe the
function. He also developed a consistent estimate of the variance of this
pseudolikelihood score and compared the efficiency of Design B relative to
Design A through a series of simulations. Although small improvements
were detected for Design B, it was noted that the increased complications in
sampling and variance estimation might not be worth the additional effort
required, especially in the presence of time-dependent covariates. Prentice
suggested Design A might be improved if control selection were to continue
at each failure time until a pre-specified number of previously unused
controls were chosen. This would increase the number of subjects
contributing to the estimation, and perhaps gain some of the efficiency
noted in Design B without the inconvenience.
As mentioned briefly with respect to case-control studies, the
question frequently arises concerning the optimal number of controls to
match to each case. Based on Ury's work (1975), it has been shown that, for
relative risks near unity, K = 4 controls per case will have [K/(K+1)] × 100 = 80
percent efficiency relative to a matched design which uses all available
controls for testing the significance of a dichotomous covariate. However,
Breslow et al. (1983) demonstrate that there is considerable gain in
efficiency by increasing the number of controls per case when the primary
goal is to estimate large relative risks associated with rare exposures. For
example, if the exposure probability among the controls is estimated at 10
percent, and the relative risk is 5, then the use of 4 controls per case will be
only 60 percent efficient. It would be necessary to use approximately 10 or
12 controls per case to have 80 percent efficiency. As might be expected, the
synthetic case-control design is generally less efficient than a full-cohort
study. This is particularly true when the prevalence of the risk factor is
low, and the association between the risk factor and the outcome is strong
(Whittemore, 1981). Some of this loss in efficiency results from the fact that
the contribution from a given subject is limited to selected failure times. An
alternate approach, the case-cohort design, which provides more complete
information about the total cohort, has been described as more efficient in
some situations.
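
The K/(K+1) rule quoted above is easy to tabulate. The sketch below is only an illustration of that formula for relative risks near unity; it does not reproduce the Breslow et al. (1983) figures for large relative risks and rare exposures, and the function name is hypothetical.

```python
def matched_efficiency(k):
    """Efficiency of K controls per case relative to using all available
    controls, valid for relative risks near unity: K / (K + 1)."""
    return k / (k + 1)


for k in (1, 2, 4, 5, 10):
    print(k, round(100 * matched_efficiency(k), 1))
# 1 50.0
# 2 66.7
# 4 80.0
# 5 83.3
# 10 90.9
```

The diminishing returns beyond four or five controls per case are what make small matching ratios attractive when the relative risk is close to one.
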
1.2.2 The Case-Cohort Design
In the case-cohort design, a subcohort is randomly selected from
the entire cohort to serve as a comparison group for all cases during follow-up.
Covariate histories are required only for members of the subcohort and
cases which occur outside the subcohort. Although the raw materials from
which the covariate histories are constructed must be obtainable for the
entire cohort (since it is not initially known who will become a case), the
savings for this design come from the relatively large number of noncases
for whom this information is neither processed nor analyzed.
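
The selection step of the case-cohort design can be sketched in a few lines of Python. This is a minimal illustration under simple random sampling of the subcohort; the function name `case_cohort_sample` and the toy cohort are hypothetical.

```python
import random


def case_cohort_sample(events, subcohort_size, rng):
    """Pick a random subcohort from the whole cohort, then list the cases
    that fall outside it. Covariate histories are needed only for the
    union of the two sets."""
    n = len(events)
    # The subcohort is drawn from everyone, cases and noncases alike.
    subcohort = set(rng.sample(range(n), subcohort_size))
    cases_outside = {i for i in range(n) if events[i] and i not in subcohort}
    return subcohort, cases_outside


# Hypothetical cohort of 100 subjects, 10 of whom experience the outcome.
events = [0] * 90 + [1] * 10
subcohort, cases_outside = case_cohort_sample(events, 20, random.Random(1))
needs_covariates = subcohort | cases_outside
```

Note that the subcohort can be drawn at baseline, before any outcomes are known, whereas `cases_outside` can only be filled in as follow-up proceeds.
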
As a result of the manner in which the referent group is selected,
the case-cohort design is different from the synthetic case-control design in
several ways. Since the subcohort is randomly chosen from the entire
cohort, rather than just from the noncases, it is possible to investigate the
relationship of the risk factors to more than one outcome of interest, without
having to select a different referent group for each endpoint. Kupper,
McMichael, and Spirtas (1975) warn that consideration of multiple
outcomes can result in situations where a case for one disease may serve as
a control for another disease. However, this should not be a common
occurrence in large cohort studies where the endpoints are infrequent. An
additional consideration is preserving the level of significance, $\alpha$, when the
same subcohort is repeatedly used in analyses of different responses. This
is not a major concern in exploratory analyses directed at generating
hypotheses. Otherwise, standard methods for multiple comparisons can be
used.
A second difference of this sampling scheme is the ability to
identify members of the subcohort at the beginning of follow-up. Prentice,
Self and Mason (1986) note the particular benefits of this in disease
prevention trials. Identification of the subcohort at baseline provides a
group of subjects on whom compliance can be monitored, as well as allowing
assessment of the intervention on a continuing basis throughout the trial.
However, Langholz and Thomas (1990) caution that bias can result if
members of the subcohort are followed more rigorously than the other
subjects. This concern does not arise in synthetic case-control studies, since
it is not known in advance who will comprise the referent group. An
additional consideration for the case-cohort design is the stability of
materials stored for future assembly into covariate histories. Problems can
arise if these raw materials deteriorate over time, or the processing methods
change during the course of the study. Langholz and Thomas note the
importance of assembling the covariate data for the subcohort at the same
time as for the cases which develop outside the subcohort. However,
assembling all the data at the end of the study prevents interim analyses.
Furthermore, in the case of perishables such as blood sera, this would result
in the use of samples which have deteriorated over the entire study period.
They suggest that the samples be analyzed in batches and stratified on the
time of analysis to avoid this problem.
Another difference between these two hybrid designs is that the
cumulative baseline failure rate, which is not obtainable from the synthetic
case-control design, can be estimated directly under the case-cohort design.
It is equal to the usual baseline hazard divided by the sampling fraction
(the proportion of the full cohort selected as the subcohort).
Early work on the case-cohort design involved estimation of the
relative risk (RR) for a binary exposure and response. Kupper, McMichael,
and Spirtas (1975) developed methodology for use with this design which
was based on the assumption that the cohort (or the population at risk) was
the target population for whom inferences were to be drawn. In addition to
presenting methods for estimating the relative risk and an appropriate
confidence interval, sample size requirements for the subcohort were also
considered. Miettinen (1982) also proposed a procedure for estimating the
ratio of the two risks associated with the levels of the dichotomous
covariate. He referred to this design as a "case-base" design.
Use of the logistic regression model to estimate the odds ratio
(OR) for case-cohort studies was addressed by Prentice (1986b). He showed
that based on work by Prentice and Pyke (1979), $\beta = \ln(\mathrm{OR})$ can be estimated
by maximizing the conditional likelihood function, given information on
disease status. Furthermore, a variance estimator for the maximum
likelihood estimate of $\beta$ can be obtained from the inverse of the observed
information matrix. Thus, by simply applying the ordinary logistic model to
data from the subcohort plus cases outside the subcohort, a valid estimate of
the odds ratio and its variance can be calculated.
The analysis of time-to-response data for case-cohort studies was
also discussed by Prentice (1986b). He proposed the use of a likelihood
function similar to that of Cox (1972) which was based on the subcohort
rather than the entire cohort. Risk sets are formed for each occurrence of
disease regardless of whether they occurred within the subcohort. However,
the noncases include only those members of the subcohort at risk at the
time of failure. Therefore, subjects outside the subcohort who subsequently
experience the endpoint, are not included among those "at risk" prior to
their failing. As a result of these subjects contributing to the likelihood only
at selected failure times, the risk sets are said to be nonnested, and
contributions to the score statistic are correlated. Thus, the likelihood
function does not possess a partial likelihood interpretation. As in the case
of Design B of the synthetic case-control design, Prentice referred to this
function as a pseudolikelihood. Self and Prentice (1988) describe the
asymptotic distribution theory which applies to this function.
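
To make the risk-set construction concrete, the sketch below evaluates a log pseudolikelihood of the kind described above for a single fixed covariate. It is an illustration only: the function name, the data layout, and the tie convention (a subcohort member is at risk at time t if its follow-up time is at least t) are assumptions of this example, and no variance estimation is attempted.

```python
import math


def log_pseudolikelihood(beta, times, events, x, subcohort):
    """Log pseudolikelihood for a case-cohort sample with one covariate.

    A term is formed at every failure, inside or outside the subcohort.
    The comparison group at each failure is the set of subcohort members
    still at risk, plus the failing subject itself.
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue
        risk = {j for j in subcohort if times[j] >= t_i}
        # A case outside the subcohort enters only at its own failure time.
        risk.add(i)
        denom = sum(math.exp(beta * x[j]) for j in risk)
        total += beta * x[i] - math.log(denom)
    return total


# Hypothetical data: subjects 0 and 3 are cases outside the subcohort {1, 2}.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
x = [1, 0, 1, 0]
value_at_null = log_pseudolikelihood(0.0, times, events, x, subcohort={1, 2})
```

At beta = 0 each term reduces to minus the log of the risk-set size, so the toy data give risk sets of sizes 3, 2, and 1 and a value of -log(6).
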
In addition, Prentice examined the sample efficiency of the case-cohort
design relative to a full-cohort design and a synthetic case-control
design in estimating the regression coefficient $\beta$ in a model with hazards
proportional to $\exp(\beta X)$. He simulated a full-cohort sample size of 500
patients with exponentially distributed failure times. The failure rate
parameter was chosen so that 50 failures would be expected. The values
X = 0 and X = 1 of the single binary exposure were each assigned to exactly
half of the cohort, as would be expected for treatment in a randomized
clinical trial. One hundred samples were generated at relative risks of 1
and 2 for the binary covariate. Two different subcohorts (N = 55 and N = 275)
were selected in order to yield an expected 100 and 300 subjects,
respectively, for whom covariate data would be required. Similarly, for the
synthetic case-control design, sampling was performed with 1:1 and 1:5
matching ratios.
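
The subcohort sizes quoted above can be related to the expected number of subjects requiring covariate data with a short calculation. The sketch assumes that subcohort membership is independent of failure, so that the expected number of cases falling outside the subcohort is the expected number of failures times one minus the sampling fraction; the function name is illustrative.

```python
def expected_covariate_subjects(n, expected_failures, subcohort_size):
    """Expected number of distinct subjects needing covariate data in a
    case-cohort design: the subcohort plus the cases outside it."""
    sampling_fraction = subcohort_size / n
    cases_outside = expected_failures * (1 - sampling_fraction)
    return subcohort_size + cases_outside


print(round(expected_covariate_subjects(500, 50, 55), 1))   # 99.5
print(round(expected_covariate_subjects(500, 50, 275), 1))  # 297.5
```

These values agree with the approximate targets of 100 and 300 subjects, which in turn match the 50 cases plus 50 or 250 controls implied by 1:1 and 1:5 matching when no control is sampled twice.
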
In general, Prentice noted that the efficiency of the case-cohort
design relative to the full-cohort design increased dramatically with the size
of the subcohort that was selected. The relative efficiency of the synthetic
case-control design also improved with the number of controls matched to
each case. At a relative risk of 1 ($\beta = 0$), there was no evidence of bias in any
of the estimates of $\beta$, although the 1:1 matched synthetic case-control
estimator had a sample mean that differed from zero by more than one
standard error. The sizes of the tests for $\beta = 0$ were all within expected
sampling variation, except for the 1:1 matched synthetic case-control design,
which exceeded the 5 percent significance level. Similar findings are
explained by Langholz and Thomas as a result of the variance of $\hat{\beta}$ not being
estimated well by the inverse information matrix when only one control is
sampled for each case. Overall, the case-cohort design appeared to perform
better than the synthetic case-control design in terms of yielding more
precise estimates of the exposure effect. However, the difference between
the two designs is of more practical importance for the 1:1 matching scheme.
Self and Prentice (1988) obtained nearly identical results for the
asymptotic relative efficiency of the case-cohort design. They considered the
special case of simultaneous entry into the cohort, at which point subjects
were followed to either failure or the end of the study (i.e., no loss to follow-up).
For a relative risk of unity and an overall probability of disease of 0.10,
the asymptotic relative efficiency for a subcohort size of 55 (i.e., K = 1 control
per case) was 0.55. When a subcohort of 275 patients was considered (i.e.,
K = 5 controls per case), the efficiency jumped to 0.92. Self and Prentice
compared these values to relative efficiencies of 0.50 and 0.83 [K/(K+1)] for
the synthetic case-control design. When the overall disease probability was
reduced to 0.05, the difference in the asymptotic relative efficiencies
between the two designs was only half as great. Self and Prentice
speculated that as the disease probability increases, the case-cohort design
will perform better than the synthetic case-control design since the referent
group will increase in the former case, but remain constant in the
latter.
Langholz and Thomas suggest that the reason for these differences is the
failure of Self and Prentice to control for the possibility of repeated
samplings of subjects in the synthetic case-control design. In doing so, the
case-cohort design was given an unfair advantage in terms of the number of
distinct persons for whom covariate data would be collected. Self and
Prentice calculated that with an overall disease rate of p = 0.10 and a cohort
size of n = 500, in order to achieve the equivalent of s = 3 controls per case, the
sampling fraction should be $sp/(1-p) = 0.333$. Thus, a subcohort size of 167
would yield $E(D) = (s+1)pn = 200$ expected distinct subjects for whom covariate
data must be collected. Langholz and Thomas, however, showed that if one
takes into account the possibility of a subject being sampled more than once,
the expected number of distinct persons in a synthetic case-control design is
$E(D) = n[1 - (1-p)^{s+1}] = 172$, not 200. The discrepancy between these two sets
of calculations increases with the overall probability of disease (p) and the
number of sampled controls per risk set (s). Langholz and Thomas
recalculated the asymptotic relative efficiencies for the two designs after
altering the sampling fraction used by Self and Prentice to yield an
equivalent number of expected distinct subjects in both designs. For the
idealized case of simultaneous entry and no loss to follow-up, they found
that the case-cohort design still performed better, although to a much lesser
extent than originally reported. However, in the presence of loss to follow-up
and/or staggered entry, the case-cohort design suffered a substantial loss
in efficiency relative to the synthetic case-control design. In this situation,
the subcohort loses patients to either censoring or failure, resulting in a
small referent group for the late failures. Ciol and Self (1989) are exploring
the possibility of maintaining the subcohort at some constant level by
replacing subjects in the original subcohort who fail or are censored during
the follow-up period.
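
The two expected-subject calculations contrasted earlier in this discussion (Self and Prentice versus Langholz and Thomas) can be reproduced numerically. The Python sketch below simply evaluates the formulas quoted in the text for p = 0.10, n = 500, and s = 3; the function names are illustrative.

```python
def case_cohort_sampling_fraction(p, s):
    """Subcohort sampling fraction matching s controls per case: sp / (1 - p)."""
    return s * p / (1 - p)


def distinct_assuming_no_overlap(n, p, s):
    """Expected distinct subjects if no one is sampled twice: (s + 1) p n."""
    return (s + 1) * p * n


def distinct_allowing_resampling(n, p, s):
    """Langholz and Thomas count for the synthetic case-control design:
    n [1 - (1 - p)^(s + 1)]."""
    return n * (1 - (1 - p) ** (s + 1))


n, p, s = 500, 0.10, 3
print(round(case_cohort_sampling_fraction(p, s), 3))   # 0.333
print(round(n * case_cohort_sampling_fraction(p, s)))  # 167
print(round(distinct_assuming_no_overlap(n, p, s)))    # 200
print(round(distinct_allowing_resampling(n, p, s)))    # 172
```

The gap between 200 and 172 is the advantage in distinct subjects that the case-cohort design received in the original comparison.
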
Wacholder et al. (1989) studied two alternative variance
estimators for $\hat{\beta}$ based on a modified bootstrap ($V_B$) and on a
superpopulation model ($V_{SP}$). They simulated several designs which
allowed for varying probabilities of exposure, competing risks and staggered
(uniform) entry into the cohort. Under these conditions, they studied $V_B$,
$V_{SP}$, and the variance estimator given by Prentice (1986b), $V_P$. They noted
that the variance decreased with an increasing number of expected deaths
and with increasing size of the subcohort. The variance also tended to
increase as the probability of exposure approached 0.5. In general, the size
of the cohort, the pattern of accrual, and the strength of the competing risk
had very little effect on the variance. In addition, the number of replicates,
B, used for the bootstrap method did not appear to be important. Although
Vsp is a valid estimate only under the null hypothesis, it was found to be
equally useful for studying power under the alternative for the case-cohort
design. This result held in the presence of censoring and competing risks,
provided these events acted equally on the two exposure groups.
Wacholder and Boivin (1987) explored ways in which the case-cohort
design could be used to compare relative risk estimates within the
cohort to the risk in an external population. In doing so, they noted that
stratification in the selection of the subcohort improved the performance of
the case-cohort design for external comparisons. They suggested that
similar results might hold for internal comparisons since individuals from
strata with higher baseline risk would contribute to more risk sets when the
proportional hazards model is used. However, Prentice (1986b) warned that
the efficiency of the case-cohort design may suffer if the stratification is too
fine, causing risk sets which are small at some failure times. A particular
problem can occur if subcohort sampling rates are equal within strata, but
the failure rate varies greatly among the strata. This will result in small
comparison groups for large numbers of cases within those strata where the
risk is greatest. Typically, this situation can be anticipated and appropriate
stratum-specific sampling rates can be determined a priori. However, in the
unexpected case, Prentice suggests that the subcohort be augmented at
preselected follow-up times.
1.3 Outline of the Research Proposal
The primary objective of this research will be to evaluate the
efficiency of the case-cohort and synthetic case-control designs in estimating
the relative risk function for a principal variable, $X_1$, while adjusting for a
concomitant variable, $X_2$. The performance of both hybrid designs will be
judged relative to that of the traditional cohort design. The outcome of
interest in all analyses will be time-to-response, also referred to as survival
time. The current research involving these hybrid strategies has
exclusively considered efficiency for the special case of a single binary
covariate. It is of interest here to expand on this work in several ways.
The efficiency of estimating the relative risk from simulated
clinical trial data will be presented in Chapter II. As might be expected
with a rare outcome, nearly all of the data (90%) will be censored. The
exposure (or treatment) variable, $X_1$, will be dichotomous with equal
probabilities of assuming either value. Likewise, the covariate, $X_2$, will also
be dichotomous with a similar distribution and will be completely
independent of treatment assignment, as would be expected in a
randomized trial. As in the simulations of Prentice (1986b), all subjects will
be simultaneously enrolled in the study and then followed for a fixed period
of time during which they will either experience the outcome of interest or
continue to be at risk. Langholz and Thomas (1990) referred to this
situation as an "idealized" intervention trial. The efficiency of these designs
will be compared under the null hypothesis that the adjusted relative risk
for $X_1$ is unity, as well as two alternatives representing moderate and
strong associations (e.g., RR=2 and RR=3, respectively). Similar effects will
be assigned to the covariate $X_2$. Under the case-cohort design, the
treatment effect and its variance will be estimated using the pseudolikelihood
score statistic defined by Prentice (1986b). Estimation for the
synthetic case-control design will be based on the partial likelihood function,
as shown to be appropriate by Oakes (1981). These estimates will be
averaged over all simulations for each design in order to obtain summary
statistics for comparison with the traditional cohort model. As mentioned
previously, bias in the estimate of treatment effect can result from omitting
an important covariate from the proportional hazards model, even when
this risk factor is perfectly balanced between the treatment groups.
Although it has not been investigated, it is assumed that such bias may also
occur under the case-cohort and synthetic case-control designs. Due
to this potential bias, it will be necessary to evaluate the efficiency of the
hybrid designs using the mean squared error (MSE) which is a combined
measure of the bias and precision of an estimate:

$$\mathrm{MSE}(\hat{\beta}_1) = \mathrm{Var}(\hat{\beta}_1) + [E(\hat{\beta}_1) - \beta_1]^2$$

where $\beta_1$ is the regression coefficient for the exposure variable $X_1$. The
relative efficiency will be defined as the ratio of the MSE under the
traditional design to that of the hybrid design.
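
As a sketch of how the MSE-based relative efficiency might be computed from a set of simulated estimates, consider the following Python fragment. It assumes the variance is taken over the simulated estimates with divisor equal to the number of simulations, so that the MSE equals the average squared distance from the true coefficient; that divisor, the function names, and the toy estimates are assumptions of this example.

```python
def mse(estimates, true_beta):
    """Mean squared error as variance plus squared bias."""
    m = sum(estimates) / len(estimates)
    bias = m - true_beta
    variance = sum((b - m) ** 2 for b in estimates) / len(estimates)
    return variance + bias ** 2


def relative_efficiency(full_cohort_estimates, hybrid_estimates, true_beta):
    """Ratio of the full-cohort MSE to the hybrid-design MSE.
    Values below 1 mean the hybrid design is less efficient."""
    return mse(full_cohort_estimates, true_beta) / mse(hybrid_estimates, true_beta)


# Hypothetical estimates of the treatment coefficient from three simulations.
full = [0.65, 0.75, 0.70]
hybrid = [0.60, 0.80, 0.70]
eff = relative_efficiency(full, hybrid, true_beta=0.70)
```

In this toy example the hybrid estimates are twice as far from the true value as the full-cohort estimates, so the relative efficiency is one quarter.
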
Chapters III and IV will consider the performance of the hybrid
designs in an observational setting. Since randomization will not have been
used to obtain equal distributions of the covariate X2 across the exposure
groups, it will be necessary to use an adjusted analysis in order to obtain
valid results. Chapter III will consider the situation in which the exposure
and covariate are binary variables. Chapter IV will explore the relative
efficiency of the hybrid designs for a continuous exposure and covariate. In
order to evaluate the effect of confounding on the performance of the hybrid
designs, simulations will be performed under conditions of both weak and
moderate associations between the exposure and covariate, and compared to
the results from Chapter II in which the covariate was distributed
independently of treatment assignment.
Two examples will be presented in Chapter V as counterparts to
the simulations performed in Chapters II through IV. For these examples,
the data will be analyzed as if the study design had been approached from
both a case-cohort and a synthetic case-control perspective, as well as the
traditional full-cohort design under which the studies were originally
conducted. As in the simulations, emphasis will be placed on estimation of
the relative risk of exposure; however, the examples will consider
adjustment for multiple covariates, discrete and continuous, as opposed to
the single covariate in the simulations. The first example will use data
collected as part of the Lipid Research Clinics (LRC) Follow-up Study
(Jacobs et al., 1990). In particular, a random sample of 2,284 male
participants who were followed for cardiovascular disease (CVD) death
comprises the cohort for this illustration. It is of interest to investigate the
relationship between a positive exercise tolerance test and CVD mortality
while controlling for age, body mass index, systolic blood pressure, smoking
status, and total and high density lipoprotein cholesterol. The second
example will use data from the Duke University Medical Center CVD data
bank (Califf et al., 1989). Of the 5,809 patients who underwent cardiac
catheterization between 1969 and 1984, a subset of 3,192 patients who were
medically treated for coronary artery disease will be analyzed here. These
data will be used to compare the prognostic capability of a coronary artery
disease severity index and coronary artery obstruction while adjusting for
age, left ventricular ejection fraction, and indices of myocardial damage,
vascular disease, and pain.
Finally, a summary of the results and recommendations for
further research will be given in Chapter VI.
CHAPTER II
HYBRID DESIGNS IN A CLINICAL TRIAL SETTING
2.1 Introduction
Traditionally, data which come from a randomized experiment,
such as a clinical trial, can be used to test for treatment differences without
concern for imbalances between the groups. It is usually accepted that they
will be comparable due to the randomization process. However, some gain
in precision may be obtained by adjusting for an important covariate which
can reduce the random variation in the response. In addition, when time-to-response
is of interest, adjustment may, under certain conditions, avoid
potential bias in the estimate of treatment effect due to nonproportional
hazards.
There appear to be three ways to approach the analysis of data
from a clinical trial. Firm believers in the virtues of randomization may
proceed with an analysis of the treatment effect ignoring all covariates.
This is the simplest approach and affords the greatest cost reduction since
only treatment assignment and time-to-response need to be recorded on
each patient. An alternative and more conservative approach would involve
measuring important covariates on all subjects and including these in the
final analysis along with treatment assignment. If the covariates are
strongly associated with survival time and there are a large number of
events of interest, this may offer some gain in precision as well as protection
against bias due to nonproportional hazards. At the very least, this
approach will allow one to show that the treatment groups are indeed
equivalent with respect to these factors, which for some is worth the added
expense and trouble of measuring extraneous variables. The third approach
involves using one of the hybrid designs to measure the important
covariates on all of the cases and a fraction of the controls. This method is
clearly a compromise in cost efficiency between the first two choices,
especially when the covariate information is expensive to process, although
it is more complicated in terms of sampling and estimation. This chapter
will address the role of the case-cohort and synthetic case-control designs in
clinical trials. More specifically, the capability of these hybrid designs to
duplicate the results of a full-cohort analysis by using only a fraction of the
covariate information will be investigated.
2.2 Description of Simulations
In order to investigate the performance of the case-cohort and
synthetic case-control designs in a clinical trial setting, a series of
simulation studies was conducted on data with exponentially-distributed
failure times, censored at unity. The failure times were generated
corresponding to the hazard function:
whereX1and X2 are binary indicators for treatment assignment and a risk
factor, respectively. The coefficients, 131 and 132' are the regression
parameters representing the strength of the association of the covariates
with the outcome. For example, exp(131) estimates the risk of failing in one
treatment group relative to the other treatment group. For these
simulations, the primary interest is how well the treatment effect, 131' is
estimated under the hybrid designs relative to the traditional cohort design.
It is assumed here that information on the risk factor X2 is either difficult
or expensive to process, making it desirable to reduce the number of
patients for whom these data are required.
As in the studies of Prentice (1986b), cohorts of size 500 were
generated with the baseline hazard, λ0, chosen such that 50 failures would
be expected in each cohort. However, the number of samples comprising
each simulation was increased from 100 to 300 to obtain a more stable
sampling distribution of the estimates of treatment effect. The treatments
were assigned using simple randomization, with each patient having an
equal chance of receiving either of the two interventions. For this study, the
prevalence of the risk factor, X2, in the cohort was also 50 percent, and its
distribution was independent of treatment assignment, as would be the case
in a randomized clinical trial. These simulations consider the idealized
situation of a fixed cohort in which all subjects were followed from the
beginning of the study until they failed or the study was over.
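The data-generating setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the dissertation: the function names `baseline_hazard` and `simulate_cohort` are hypothetical, and solving for λ0 by bisection is one convenient way to obtain an expected failure probability of 0.10 (50 of 500) by the censoring time.

```python
import math
import random

def baseline_hazard(beta1, beta2, p_target=0.10, tol=1e-12):
    """Bisection for lambda0 such that P(failure by t = 1) equals p_target,
    averaging over the four equally likely (X1, X2) patterns (assumed setup)."""
    multipliers = [math.exp(beta1 * a + beta2 * b) for a in (0, 1) for b in (0, 1)]

    def failure_prob(lam0):
        return sum(1.0 - math.exp(-lam0 * m) for m in multipliers) / 4.0

    lo, hi = 0.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if failure_prob(mid) < p_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate_cohort(n, beta1, beta2, lam0, rng):
    """Return a list of (x1, x2, time, event) tuples with follow-up censored at t = 1."""
    cohort = []
    for _ in range(n):
        x1 = 1 if rng.random() < 0.5 else 0   # simple randomization of treatment
        x2 = 1 if rng.random() < 0.5 else 0   # independent risk factor, prevalence 0.5
        rate = lam0 * math.exp(beta1 * x1 + beta2 * x2)
        t = rng.expovariate(rate)             # exponential failure time
        cohort.append((x1, x2, min(t, 1.0), t <= 1.0))
    return cohort
```

For instance, `simulate_cohort(500, math.log(2), 0.0, baseline_hazard(math.log(2), 0.0), random.Random(1))` would produce one cohort of 500 subjects with about 50 expected failures.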
The efficiency of the hybrid designs relative to the full-cohort
design was investigated under the null hypothesis that there is no
difference between the treatments (i.e., β1 = 0) and under two alternatives
representing moderate and strong associations (i.e., β1 = ln(2) and β1 = ln(3),
respectively). Values of 0, ln(2), and ln(3) were also assigned to β2, thus
creating nine combinations of effects over which trends could be examined.
Matching ratios of 1:3 and 1:5 were used for the synthetic case-control
design. Thus, for an overall probability of disease of p = 0.10 and allowing for
the probability of a subject being included in the analysis more than once,
the expected number of distinct persons when s = 3 controls were matched to
each case was $n[1-(1-p)^{s+1}] = 500[1-(1-0.10)^{3+1}] = 172$. When s = 5 controls
were selected per case, the expected number of distinct persons was 234.
The subcohort size for the case-cohort design was calculated so that the
number of distinct individuals in the two hybrid designs would be
equivalent (Langholz and Thomas, 1990). Thus for the matching ratio of
1:3, a sampling fraction of $[1-(1-p)^{s}] = [1-(1-0.10)^{3}] = 0.271$ was used,
producing a subcohort of size 136. Similarly, for the 1:5 matching ratio, the
sampling fraction was calculated to be 0.410, yielding a subcohort of size
205. These subcohorts, along with the cases occurring among the remaining
patients, also yielded an expected total of 172 and 234 subjects, respectively,
for whom covariate data were necessary.
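The sample-size arithmetic above can be checked with a short Python sketch. The function names are hypothetical and chosen for illustration only; the formulas are those quoted in the text. The third function shows that a subcohort of fraction 1 - (1 - p)^s, plus the cases expected among the remaining subjects, gives the same expected total as the matched design.

```python
def expected_distinct_matched(n, p, s):
    """Expected distinct subjects with s controls per case: n[1 - (1 - p)^(s + 1)]."""
    return n * (1.0 - (1.0 - p) ** (s + 1))

def subcohort_fraction(p, s):
    """Case-cohort sampling fraction matched to s controls per case: 1 - (1 - p)^s."""
    return 1.0 - (1.0 - p) ** s

def expected_distinct_case_cohort(n, p, fraction):
    """Subcohort members plus the cases expected among subjects outside the subcohort."""
    return n * fraction + n * (1.0 - fraction) * p

if __name__ == "__main__":
    n, p = 500, 0.10
    for s in (3, 5):
        f = subcohort_fraction(p, s)
        print(s, f, n * f,
              expected_distinct_matched(n, p, s),
              expected_distinct_case_cohort(n, p, f))
```

Rounded to whole subjects, both designs give 172 for s = 3 and 234 for s = 5, in agreement with the figures quoted above.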
The Cox proportional hazards model was used to estimate the
treatment effect under the full-cohort design. For each simulation under
this design, two different models were considered: one adjusting for the risk
factor, X2, and a second with only the treatment effect included. Although
the adjusted model is considered the "correct" model since it incorporates
both effects which were used to generate the data, results from the
unadjusted analysis are unbiased due to the randomization of treatment
assignment. Therefore, one comparison of interest is whether the efficiency
of the adjusted analysis relative to the unadjusted analysis is great enough
to warrant the extra cost of measuring and processing covariate information
on all patients. Only the adjusted model is considered under the hybrid
designs, since their attraction lies mainly in the reduction of the number of
subjects for whom the expensive covariate will need to be obtained. In a
clinical trial, the treatment assignment is easily known for all patients;
therefore, little would be saved by using these designs when treatment is the
only covariate included in the model. The pseudolikelihood score statistic
was used to estimate the treatment effect and its variance under the case-cohort
design, using a FORTRAN program written and kindly provided by
Jay Lubin of the National Cancer Institute. Estimation for the synthetic
case-control design was obtained through standard conditional logistic
regression methods.
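To make the unadjusted full-cohort analysis concrete, the following Python sketch maximizes the Cox partial likelihood for a single covariate (treatment only) by Newton-Raphson. It is an illustrative stand-in under stated assumptions, not the software used in this research: the function name `cox_one_covariate` is hypothetical, and the code assumes no tied event times, which holds with probability one for continuous failure times.

```python
import math
import random

def cox_one_covariate(times, events, x, max_iter=50, tol=1e-10):
    """Newton-Raphson maximizer of the Cox partial likelihood for one covariate.

    Assumes no tied event times; censored subjects may share a censoring time.
    Returns (beta_hat, approximate variance of beta_hat).
    """
    # Visit subjects from longest to shortest follow-up, so the running sums
    # over subjects seen so far are sums over the current risk set.
    order = sorted(range(len(times)), key=lambda i: -times[i])
    beta = 0.0
    info = float("nan")
    for _ in range(max_iter):
        s0 = s1 = s2 = 0.0            # risk-set sums of w, w*x, w*x^2
        score = info = 0.0
        for i in order:
            w = math.exp(beta * x[i])
            s0 += w
            s1 += w * x[i]
            s2 += w * x[i] * x[i]
            if events[i]:
                mean = s1 / s0
                score += x[i] - mean             # observed minus expected covariate
                info += s2 / s0 - mean * mean    # risk-set variance of the covariate
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1.0 / info
```

Applied to each simulated cohort in turn, a routine of this kind would yield the replicate estimates of β1 whose mean and standard deviation are summarized in the results.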
2.3 Results from Simulations
Summary statistics from the simulations are presented in Table
2.1. An estimate of the treatment effect, β̂1, was obtained for each of the 300
replicates under each of the study designs. Convergence was obtained in all
cases. The sample mean and the standard deviation of the distribution of
β̂1 were calculated based on the 300 estimates. The mean square error
(MSE) was also computed as a combined measure of the bias and precision
of the estimates for each design:

$$\mathrm{MSE} = (\beta_1 - \bar{\hat{\beta}}_1)^2 + [\mathrm{std.\ dev.}(\hat{\beta}_1)]^2$$

where $\bar{\hat{\beta}}_1$ denotes the sample mean of the 300 estimates.
The efficiency of the unadjusted full-cohort model relative to the full-cohort
model containing the risk factor X2, as well as treatment assignment, is
presented in Table 2.2 for the different values of β1 and β2. Relative
efficiencies were determined from the ratio of mean square errors given in
Table 2.1. Table 2.3 shows efficiencies of the hybrid estimates relative to
the unadjusted estimates of the full-cohort design. Similarly, Table 2.4
gives efficiencies of the hybrid estimates relative to the adjusted estimates
from the full-cohort design.
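The mean square error and relative efficiency calculations described above can be sketched in Python as follows. The function names are hypothetical, and the use of the n - 1 divisor for the sample variance is an assumption, since the text does not say which divisor was used.

```python
def mean_square_error(true_beta, estimates):
    """MSE = squared bias of the replicate mean + sample variance of the estimates."""
    n = len(estimates)
    mean = sum(estimates) / n
    variance = sum((b - mean) ** 2 for b in estimates) / (n - 1)  # n - 1 divisor assumed
    return (true_beta - mean) ** 2 + variance

def mse_from_summary(true_beta, sample_mean, sample_sd):
    """Same quantity computed from a tabulated mean and standard deviation."""
    return (true_beta - sample_mean) ** 2 + sample_sd ** 2

def relative_efficiency(mse_reference, mse_design):
    """Efficiency of a design relative to a reference analysis: ratio of MSEs."""
    return mse_reference / mse_design
```

As a usage example with values from Table 2.1 (β1 = 0, β2 = 0), `mse_from_summary(0.0, 0.004, 0.295)` rounds to the tabulated 0.087 for the unadjusted full-cohort analysis, and `relative_efficiency(0.086, 0.108)` rounds to the 0.80 shown in Table 2.4 for the case-cohort design with the smaller subcohort.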
The most obvious result of the simulations is that adjustment for
the risk factor, X2, in the full-cohort analysis does not appreciably influence
the precision of the estimate of treatment effect, nor does there appear to be
any bias in the unadjusted estimate. In fact, the relative efficiency of the
unadjusted estimate is virtually 100 percent for all values of β1 and β2
which were examined (Table 2.2). The lack of effect of adjusting for X2 is
primarily due to the heavy censoring of the data. These results are similar
to those of Morgan and Elashoff (1986) who, using the Weibull proportional
hazards model, showed that with 90 percent censoring, the efficiency of not
adjusting for a dichotomous covariate is very high, over 95 percent even
when the covariate effect is ln(3). The importance of adjusting for the
covariate decreases even more as the strength of its association with
survival time decreases. Similarly, Chastang, Byar, and Piantadosi (1988)
showed that the bias in estimating the treatment effect due to omitting a
covariate is small when a large portion of the observations are censored,
particularly when the covariate effect is 1 or less (RR ≤ 2.72). Therefore, if
randomization has been used to remove extraneous differences between the
groups of interest, an unadjusted analysis of the treatment effect will yield
nearly identical results to those that would have been obtained with
adjustment, given the data are heavily censored.
Several observations can be made about the performance of the
hybrid designs as well. The mean square error tends to be larger for
stronger treatment effects and also for stronger associations between the
risk factor and the outcome (Table 2.1). In general, there is only a slight
improvement in efficiency for the case-cohort design as the size of the
subcohort is increased. In contrast, the relative efficiency of the synthetic
case-control design is about 10 percent higher on average for the 1:5
matching ratio than for the 1:3 ratio (Tables 2.3, 2.4). Overall, relative to
the full-cohort analysis, the case-cohort design performs somewhat better
than the synthetic case-control design when there is a matching ratio of 1:3.
However, with 5 controls per case, the relative efficiencies for the two designs
are generally quite close, and the differences which are observed here are
due mainly to sampling variation.
The results in Table 2.4 can be compared to the asymptotic
relative efficiencies calculated by Langholz and Thomas (1990) for the
idealized intervention trial with no treatment effect. For an overall disease
probability of p = 0.10, they calculated asymptotic relative efficiencies of 0.78
and 0.87 for the respective subcohort fractions of 0.271 and 0.409 in the
case-cohort design. The asymptotic relative efficiency of 0.87 for the larger
subcohort is slightly greater than the relative efficiency of 0.82 given in
Table 2.4 where the treatment and covariate effects are both 0. For the
smaller subcohort, the relative efficiency of 0.80 is quite close to the
asymptotic relative efficiency of 0.78. When there is no treatment effect, the
asymptotic relative efficiency of the synthetic case-control design is given by
s/(s+1), where s is the number of controls matched to each case. Thus, the
asymptotic relative efficiencies are 0.75 and 0.83 for matching ratios of 1:3
and 1:5, respectively. For the 1:5 matching ratio, the asymptotic relative
efficiency is a little smaller than the relative efficiency of 0.87 observed in
these simulations. For the smaller matching ratio, the asymptotic relative
efficiency is identical to that presented in Table 2.4 for β1 = 0 and β2 = 0
(0.75).
2.4 Simulation Results for a Common Disease
Since it was determined that the heavy censoring associated
with studies of rare diseases was responsible for the high relative efficiency
of the unadjusted full-cohort analysis, a second, smaller series of simulations
(100 replicates) was conducted on data for which only 60 percent of the
observations were censored. With an overall probability of disease of 0.40,
200 cases would be expected in a cohort of size 500. Thus, matching even 1
control to each case (1:1) would result in about two-thirds of the cohort
requiring covariate information, offering little advantage over the full-cohort
design in terms of cost containment. Therefore a subcohort fraction
of 0.225 was chosen for the case-cohort design, which would be comparable to
a matching ratio of 1:0.5. Simulations could not be performed for the
synthetic case-control design using this matching ratio.
Summary statistics and relative efficiencies are presented in
Tables 2.5 through 2.7. There appears to be a slight decrease in the relative
efficiency of the unadjusted full-cohort model, as a result of the precision of
the adjusted estimate increasing with the value of β2. There may also be a
small amount of bias in the unadjusted estimate when β2 = ln(3), although it
is barely perceptible. However, even at β2 = ln(3), the relative efficiency is
still above 90 percent. In contrast, the case-cohort design does not appear to
perform well under these conditions. In particular, the standard deviations
of the case-cohort estimates are about 50 percent greater than for the full-cohort
design, thus making the relative efficiencies of the case-cohort design only
about 40 percent. This poor performance may be due to the fact that many
of the patients in the initial subcohort are lost due to failure, resulting in a
somewhat smaller referent group for later failures.
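As a rough consistency check on these two figures (a back-of-the-envelope sketch that ignores the small squared-bias term in the mean square error), the relative efficiency is approximately the squared ratio of the standard deviations:

$$\mathrm{RE} \approx \left(\frac{\mathrm{SD}_{\text{full-cohort}}}{\mathrm{SD}_{\text{case-cohort}}}\right)^{2} \approx \left(\frac{1}{1.5}\right)^{2} \approx 0.44$$

For example, with β1 = ln(2) and β2 = 0 in Table 2.5, $(0.158/0.246)^2 \approx 0.41$, in line with the value of 0.40 reported in Table 2.7.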
2.5 Discussion and Summary
Motivation for the case-cohort and synthetic case-control designs
has been primarily in those situations where the outcome of interest occurs
infrequently and covariate information is expensive to obtain.
These two factors combined can make some studies infeasible since a rare
outcome implies that a large number of subjects will have to be followed in
order to achieve the power necessary to obtain significant results. The
hybrid designs offer the means by which these studies can be carried out
since they require covariate information on only a fraction of the subjects
who fail to experience the outcome. However, the use of either of these
designs in place of a traditional full-cohort design for a clinical trial is
questionable. Since information on treatment assignment is readily
available for all members of the cohort, these hybrid methods would only be
beneficial if it were desired to adjust for an important risk factor which was
either difficult or expensive to measure. As noted previously, the inclusion
of this risk factor in the analysis might be warranted in order to maintain
proportional hazards in the analysis of time-to-response data, or for
increasing the precision of the estimate of treatment effect. However, a
review of the literature, as well as the simulations presented here, has
shown that neither of these advantages is likely to be obtained in the
presence of heavy censoring (i.e., a rare outcome), which is precisely the
circumstance under which a hybrid design might be utilized.
In other words, in the analysis of time-to-response data from a
randomized clinical trial, the loss in precision and the potential for biased
estimates due to not adjusting for an important risk factor are both minimal
when there is heavy censoring. Even though the hybrid designs appear to
do reasonably well in this situation, there is little point in going to the extra
trouble and expense of measuring important covariate information even on
a fraction of the cohort when more efficient results can be obtained from
ignoring this information in a full-cohort analysis.
Consideration was also given to possible use of the case-cohort
design in studies of common diseases. Although a slight decrease in the
relative efficiency was noted for the unadjusted full-cohort analysis, the
efficiency of the case-cohort design suffered considerably under these
conditions.
Thus it appears that the use of the case-cohort and synthetic
case-control designs in a clinical trial setting would apply mainly to
ancillary questions or monitoring compliance, neither of which is directly
related to testing treatment efficacy (Davis, 1990). In effect, these issues
must be addressed with the same methods that would be applied to an
observational study, since only treatment comparisons would benefit from
the randomization process. The performance of the hybrid studies in
nonrandomized epidemiologic studies will be addressed in Chapters III and
IV.
TABLE 2.1
SIMULATION SUMMARY STATISTICS FOR ESTIMATING THE TREATMENT EFFECT (β1) WHILE ADJUSTING
FOR A COVARIATE EFFECT (β2) WHEN BOTH VARIABLES ARE DICHOTOMOUS AND INDEPENDENT
Overall Probability of Disease p = 0.10

                                                  Estimation Procedure
                              Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                              (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                           (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data      500            500            172            172            234            234

I. Treatment Effect β1 = 0 (RR = 1)

a. Covariate Effect β2 = 0
   Sample mean (β̂1)           0.004          0.004          0.011         -0.004          0.014          0.006
   Sample std. dev. (β̂1)      0.295          0.294          0.328          0.338          0.323          0.315
   Mean square error3/        0.087          0.086          0.108          0.114          0.105          0.099

b. Covariate Effect β2 = ln(2)
   Sample mean (β̂1)           0.001          0.004          0.000         -0.003          0.009         -0.005
   Sample std. dev. (β̂1)      0.295          0.297          0.333          0.347          0.325          0.322
   Mean square error3/        0.087          0.088          0.111          0.120          0.106          0.104

c. Covariate Effect β2 = ln(3)
   Sample mean (β̂1)           0.006          0.008          0.010         -0.025          0.017          0.015
   Sample std. dev. (β̂1)      0.289          0.291          0.334          0.367          0.324          0.331
   Mean square error3/        0.084          0.085          0.112          0.135          0.105          0.110

1/ The unadjusted model includes only one variable, X1, for treatment assignment.
2/ The adjusted model includes the covariate X2 in addition to treatment assignment. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: $\mathrm{MSE} = (\beta_1 - \bar{\hat{\beta}}_1)^2 + [\mathrm{std.\ dev.}(\hat{\beta}_1)]^2$.
TABLE 2.1
(continued)

                                                  Estimation Procedure
                              Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                              (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                           (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data      500            500            172            172            234            234

II. Treatment Effect β1 = ln(2) = 0.693 (RR = 2)

a. Covariate Effect β2 = 0
   Sample mean (β̂1)           0.703          0.703          0.709          0.706          0.712          0.710
   Sample std. dev. (β̂1)      0.301          0.301          0.337          0.326          0.332          0.326
   Mean square error3/        0.091          0.091          0.114          0.106          0.111          0.107

b. Covariate Effect β2 = ln(2)
   Sample mean (β̂1)           0.684          0.690          0.687          0.704          0.695          0.698
   Sample std. dev. (β̂1)      0.310          0.312          0.352          0.362          0.345          0.341
   Mean square error3/        0.096          0.097          0.124          0.131          0.119          0.116

c. Covariate Effect β2 = ln(3)
   Sample mean (β̂1)           0.698          0.710          0.703          0.716          0.714          0.742
   Sample std. dev. (β̂1)      0.323          0.326          0.366          0.372          0.356          0.353
   Mean square error3/        0.104          0.107          0.134          0.139          0.127          0.127

1/ The unadjusted model includes only one variable, X1, for treatment assignment.
2/ The adjusted model includes the covariate X2 in addition to treatment assignment. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: $\mathrm{MSE} = (\beta_1 - \bar{\hat{\beta}}_1)^2 + [\mathrm{std.\ dev.}(\hat{\beta}_1)]^2$.
TABLE 2.1
(continued)

                                                  Estimation Procedure
                              Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                              (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                           (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data      500            500            172            172            234            234

III. Treatment Effect β1 = ln(3) = 1.099 (RR = 3)

a. Covariate Effect β2 = 0
   Sample mean (β̂1)           1.115          1.115          1.122          1.129          1.125          1.119
   Sample std. dev. (β̂1)      0.323          0.322          0.358          0.370          0.354          0.353
   Mean square error3/        0.105          0.104          0.129          0.138          0.126          0.125

b. Covariate Effect β2 = ln(2)
   Sample mean (β̂1)           1.137          1.144          1.146          1.153          1.152          1.138
   Sample std. dev. (β̂1)      0.339          0.342          0.381          0.433          0.375          0.359
   Mean square error3/        0.116          0.119          0.147          0.190          0.143          0.130

c. Covariate Effect β2 = ln(3)
   Sample mean (β̂1)           1.102          1.119          1.111          1.127          1.122          1.130
   Sample std. dev. (β̂1)      0.350          0.354          0.393          0.418          0.387          0.381
   Mean square error3/        0.123          0.126          0.155          0.176          0.150          0.146

1/ The unadjusted model includes only one variable, X1, for treatment assignment.
2/ The adjusted model includes the covariate X2 in addition to treatment assignment. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: $\mathrm{MSE} = (\beta_1 - \bar{\hat{\beta}}_1)^2 + [\mathrm{std.\ dev.}(\hat{\beta}_1)]^2$.
Table 2.2
EFFICIENCY OF THE UNADJUSTED ANALYSIS RELATIVE TO THE
ADJUSTED ANALYSIS IN THE FULL-COHORT DESIGN WHEN THE
TREATMENT AND COVARIATE ARE DICHOTOMOUS AND INDEPENDENT
Overall Probability of Disease p = 0.10

Treatment     Covariate     Relative
Effect (β1)   Effect (β2)   Efficiency

0             0             0.99
0             ln(2)         1.01
0             ln(3)         1.01

ln(2)         0             1.00
ln(2)         ln(2)         1.01
ln(2)         ln(3)         1.03

ln(3)         0             0.99
ln(3)         ln(2)         1.03
ln(3)         ln(3)         1.02
Table 2.3
EFFICIENCY OF THE HYBRID DESIGNS RELATIVE TO THE
UNADJUSTED ESTIMATES OF THE FULL-COHORT DESIGN WHEN THE
TREATMENT AND COVARIATE ARE DICHOTOMOUS AND INDEPENDENT
Overall Probability of Disease p = 0.10

Matching  Subcohort  Treatment     Covariate                    Synthetic
Ratio     Fraction   Effect (β1)   Effect (β2)   Case-Cohort1/  Case-Control2/

1:3       0.271      0             0             0.81           0.76
                     0             ln(2)         0.78           0.73
                     0             ln(3)         0.75           0.62

                     ln(2)         0             0.80           0.86
                     ln(2)         ln(2)         0.77           0.73
                     ln(2)         ln(3)         0.78           0.75

                     ln(3)         0             0.81           0.76
                     ln(3)         ln(2)         0.79           0.61
                     ln(3)         ln(3)         0.79           0.70

1:5       0.409      0             0             0.83           0.88
                     0             ln(2)         0.82           0.84
                     0             ln(3)         0.80           0.76

                     ln(2)         0             0.82           0.85
                     ln(2)         ln(2)         0.81           0.83
                     ln(2)         ln(3)         0.82           0.82

                     ln(3)         0             0.83           0.84
                     ln(3)         ln(2)         0.81           0.89
                     ln(3)         ln(3)         0.82           0.84

1/ MSE(Unadjusted Full-Cohort Estimate)/MSE(Case-Cohort Estimate)
2/ MSE(Unadjusted Full-Cohort Estimate)/MSE(Synthetic Case-Control Estimate)
Table 2.4
EFFICIENCY OF THE HYBRID DESIGNS RELATIVE TO THE
ADJUSTED ESTIMATES OF THE FULL-COHORT DESIGN WHEN THE
TREATMENT AND COVARIATE ARE DICHOTOMOUS AND INDEPENDENT
Overall Probability of Disease p = 0.10

Matching  Subcohort  Treatment     Covariate                    Synthetic
Ratio     Fraction   Effect (β1)   Effect (β2)   Case-Cohort1/  Case-Control2/

1:3       0.271      0             0             0.80           0.75
                     0             ln(2)         0.79           0.73
                     0             ln(3)         0.76           0.63

                     ln(2)         0             0.80           0.86
                     ln(2)         ln(2)         0.78           0.74
                     ln(2)         ln(3)         0.80           0.77

                     ln(3)         0             0.81           0.75
                     ln(3)         ln(2)         0.81           0.63
                     ln(3)         ln(3)         0.81           0.72

1:5       0.409      0             0             0.82           0.87
                     0             ln(2)         0.83           0.85
                     0             ln(3)         0.81           0.77

                     ln(2)         0             0.82           0.85
                     ln(2)         ln(2)         0.82           0.84
                     ln(2)         ln(3)         0.84           0.84

                     ln(3)         0             0.83           0.83
                     ln(3)         ln(2)         0.83           0.92
                     ln(3)         ln(3)         0.84           0.86

1/ MSE(Adjusted Full-Cohort Estimate)/MSE(Case-Cohort Estimate)
2/ MSE(Adjusted Full-Cohort Estimate)/MSE(Synthetic Case-Control Estimate)
TABLE 2.5
SIMULATION SUMMARY STATISTICS FOR ESTIMATING THE TREATMENT
EFFECT (β1) WHILE ADJUSTING FOR A COVARIATE EFFECT (β2)
WHEN BOTH VARIABLES ARE DICHOTOMOUS AND INDEPENDENT
AND THE DISEASE IS COMMON
Overall Probability of Disease p = 0.40

                                      Estimation Procedure
                              Full-Cohort    Full-Cohort    Case-Cohort
                              (unadjusted)1/ (adjusted)2/   (adjusted)2/
Expected Subjects
Requiring Covariate Data      500            500            268

I. Treatment Effect β1 = 0 (RR = 1)

a. Covariate Effect β2 = 0
   Sample mean (β̂1)           0.017          0.017          0.042
   Sample std. dev. (β̂1)      0.164          0.163          0.234
   Mean square error3/        0.027          0.027          0.057

b. Covariate Effect β2 = ln(2)
   Sample mean (β̂1)           0.013          0.011          0.027
   Sample std. dev. (β̂1)      0.156          0.155          0.231
   Mean square error3/        0.025          0.024          0.054

c. Covariate Effect β2 = ln(3)
   Sample mean (β̂1)           0.015          0.011          0.024
   Sample std. dev. (β̂1)      0.155          0.149          0.233
   Mean square error3/        0.024          0.022          0.055

1/ The unadjusted model includes only one variable, X1, for treatment assignment.
2/ The adjusted model includes the covariate X2 in addition to treatment
   assignment. The simulated data were generated corresponding to a hazard
   function based on both X1 and X2.
3/ Mean square error: $\mathrm{MSE} = (\beta_1 - \bar{\hat{\beta}}_1)^2 + [\mathrm{std.\ dev.}(\hat{\beta}_1)]^2$.
TABLE 2.5
(continued)

                                      Estimation Procedure
                              Full-Cohort    Full-Cohort    Case-Cohort
                              (unadjusted)1/ (adjusted)2/   (adjusted)2/
Expected Subjects
Requiring Covariate Data      500            500            268

II. Treatment Effect β1 = ln(2) (RR = 2)

a. Covariate Effect β2 = 0
   Sample mean (β̂1)           0.701          0.701          0.732
   Sample std. dev. (β̂1)      0.158          0.158          0.246
   Mean square error3/        0.025          0.025          0.062

b. Covariate Effect β2 = ln(2)
   Sample mean (β̂1)           0.700          0.716          0.735
   Sample std. dev. (β̂1)      0.152          0.150          0.238
   Mean square error3/        0.023          0.023          0.058

c. Covariate Effect β2 = ln(3)
   Sample mean (β̂1)           0.672          0.713          0.727
   Sample std. dev. (β̂1)      0.147          0.142          0.235
   Mean square error3/        0.022          0.020          0.056

1/ The unadjusted model includes only one variable, X1, for treatment assignment.
2/ The adjusted model includes the covariate X2 in addition to treatment
   assignment. The simulated data were generated corresponding to a hazard
   function based on both X1 and X2.
3/ Mean square error: $\mathrm{MSE} = (\beta_1 - \bar{\hat{\beta}}_1)^2 + [\mathrm{std.\ dev.}(\hat{\beta}_1)]^2$.
TABLE 2.5
(continued)

                                      Estimation Procedure
                              Full-Cohort    Full-Cohort    Case-Cohort
                              (unadjusted)1/ (adjusted)2/   (adjusted)2/
Expected Subjects
Requiring Covariate Data      500            500            268

III. Treatment Effect β1 = ln(3) (RR = 3)

a. Covariate Effect β2 = 0
   Sample mean (β̂1)           1.105          1.105          1.139
   Sample std. dev. (β̂1)      0.162          0.162          0.257
   Mean square error3/        0.026          0.026          0.068

b. Covariate Effect β2 = ln(2)
   Sample mean (β̂1)           1.087          1.114          1.136
   Sample std. dev. (β̂1)      0.163          0.161          0.248
   Mean square error3/        0.027          0.026          0.063

c. Covariate Effect β2 = ln(3)
   Sample mean (β̂1)           1.051          1.117          1.136
   Sample std. dev. (β̂1)      0.158          0.156          0.240
   Mean square error3/        0.027          0.025          0.060

1/ The unadjusted model includes only one variable, X1, for treatment assignment.
2/ The adjusted model includes the covariate X2 in addition to treatment
   assignment. The simulated data were generated corresponding to a hazard
   function based on both X1 and X2.
3/ Mean square error: $\mathrm{MSE} = (\beta_1 - \bar{\hat{\beta}}_1)^2 + [\mathrm{std.\ dev.}(\hat{\beta}_1)]^2$.
Table 2.6
EFFICIENCY OF THE UNADJUSTED ANALYSIS RELATIVE TO THE
ADJUSTED ANALYSIS IN THE FULL-COHORT DESIGN WHEN THE
TREATMENT AND COVARIATE ARE DICHOTOMOUS AND INDEPENDENT
AND THE DISEASE IS COMMON
Overall Probability of Disease p = 0.40

Treatment     Covariate     Relative
Effect (β1)   Effect (β2)   Efficiency

0             0             1.00
0             ln(2)         0.96
0             ln(3)         0.92

ln(2)         0             1.00
ln(2)         ln(2)         1.00
ln(2)         ln(3)         0.93

ln(3)         0             1.00
ln(3)         ln(2)         0.96
ln(3)         ln(3)         0.92
Table 2.7
EFFICIENCY OF THE CASE-COHORT DESIGN RELATIVE TO
THE UNADJUSTED AND ADJUSTED ESTIMATES OF THE FULL-COHORT
DESIGN WHEN THE TREATMENT AND COVARIATE ARE DICHOTOMOUS
AND INDEPENDENT AND THE DISEASE IS COMMON
Overall Probability of Disease p = 0.40

Treatment     Covariate
Effect (β1)   Effect (β2)   Unadjusted1/   Adjusted2/

0             0             0.47           0.47
0             ln(2)         0.46           0.44
0             ln(3)         0.44           0.40

ln(2)         0             0.40           0.40
ln(2)         ln(2)         0.40           0.40
ln(2)         ln(3)         0.39           0.36

ln(3)         0             0.38           0.38
ln(3)         ln(2)         0.43           0.41
ln(3)         ln(3)         0.45           0.42

1/ MSE(Unadjusted Full-Cohort Estimate)/MSE(Case-Cohort Estimate)
2/ MSE(Adjusted Full-Cohort Estimate)/MSE(Case-Cohort Estimate)
CHAPTER III
HYBRID DESIGNS IN AN OBSERVATIONAL STUDY
WITH A DICHOTOMOUS EXPOSURE
3.1 Introduction
In Chapter II, the role of hybrid designs in a randomized clinical
trial setting was discussed. Under conditions of randomization, an
unadjusted comparison of the treatment groups is, by definition, unbiased.
However, when ancillary questions are of interest which require either
combining information over the treatment groups or making comparisons
within a treatment group, then randomization is no longer preserved. Thus
the analysis of these subgroups should be treated like any other
nonrandomized or observational study. When systematic differences are
found between the subgroups with respect to extraneous risk factors for the
outcome, then necessary adjustments must be made in the analysis in order
to avoid a biased estimate of the exposure effect. Subgroup analyses of
clinical trial data, as well as analyses of observational studies in general,
are precisely the type of studies for which the hybrid designs were
developed. When an unadjusted analysis of the groups under comparison is
not a viable option, then the hybrid sampling strategies offer the
methodology by which data on extraneous factors can be obtained and
adjusted for at a substantial cost reduction.
In order to avoid confusion, a brief explanation of the
terminology used in this chapter is warranted. In the clinical trial setting,
the primary comparison of interest was between the treatment groups. The
more general terminology, "exposure" groups, will be used here to include
observational designs, as well as ancillary questions within a clinical trial
setting. The covariate, X2, which was an independent risk factor for the
outcome of interest was not related to the treatment groups in Chapter II.
The randomization methods used in the clinical trial created an equal
distribution of the risk factor within each of the treatment groups.
However, in this chapter, the covariate will not only be related to the
outcome (i.e., a risk factor), but it will also be positively correlated with the
exposure of interest. Hence the term confounder will be used in place of risk
factor. This chapter will evaluate the performance of the case-cohort and
synthetic case-control designs in the presence of a confounder, by simulating
data in which the exposure and risk factor are correlated, as is often found
in an observational study design.
3.2 Description of Simulations
The simulations presented here are similar to those considered
in Chapter II, with the exception that the extraneous risk factor is not
distributed independently of the exposure as in the randomized clinical trial
setting. Instead, weak and moderate levels of association between the
exposure and the confounder are examined. For each series of simulations,
300 cohorts of size 500 were generated such that 50 failures would be
expected in each cohort. Failure times were exponentially distributed and
censored at unity. The exposure and the confounder are both dichotomous
with an expected prevalence of 50 percent in the population. Exposure and
confounder effects were examined over the range 0, ln(2), and ln(3).
Subcohort fractions for the case-cohort design were chosen to yield the
same number of distinct subjects as the matching ratios of 1:3 and 1:5 used
in the synthetic case-control design.
A positive association was simulated between X1 and X2 in
order to evaluate the performance of the hybrid designs in the presence of
confounding. For these data, the population correlation coefficient, ρ, was
estimated in the selected cohorts with the Pearson correlation coefficient:

$$r = \frac{\widehat{\mathrm{Cov}}(X_1, X_2)}{[\widehat{\mathrm{Var}}(X_1)\,\widehat{\mathrm{Var}}(X_2)]^{1/2}}$$

where Ĉov(X1, X2) is the sample covariance between X1 and X2, and
V̂ar(X1) and V̂ar(X2) are the sample variances of X1 and X2, respectively.
Since X1 and X2 are both dichotomous factors, r can be rewritten as:

$$r = (\chi^2 / n)^{1/2}$$

where χ² is the Pearson chi-square statistic for a 2x2 table of X1 versus X2
with 1 degree of freedom and n is the number of subjects in the table
(Kleinbaum, Kupper, and Morgenstern, 1982).
Nunnally (1978) refers to this as the phi (φ) coefficient.
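The equivalence between the Pearson correlation and the phi coefficient for two dichotomous variables can be illustrated with a short Python sketch. The text does not describe how the association between X1 and X2 was induced, so the generator below (copying X1 into X2 with probability (1 + ρ)/2) is an assumption chosen because it yields correlation ρ with 50 percent prevalence for both variables; all function names are hypothetical.

```python
import math
import random

def simulate_correlated_binary(n, rho, rng):
    """Pairs (x1, x2), each Bernoulli(0.5), with population correlation rho (assumed scheme)."""
    p_same = (1.0 + rho) / 2.0          # P(X2 == X1)
    pairs = []
    for _ in range(n):
        x1 = 1 if rng.random() < 0.5 else 0
        x2 = x1 if rng.random() < p_same else 1 - x1
        pairs.append((x1, x2))
    return pairs

def pearson_r(pairs):
    """Sample covariance divided by the root of the product of sample variances."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / (n - 1)
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / (n - 1)
    v2 = sum((b - m2) ** 2 for _, b in pairs) / (n - 1)
    return cov / math.sqrt(v1 * v2)

def phi_coefficient(pairs):
    """Signed square root of (Pearson chi-square / n) for the 2x2 table of x1 by x2."""
    n = len(pairs)
    a = sum(1 for p in pairs if p == (1, 1))
    b = sum(1 for p in pairs if p == (1, 0))
    c = sum(1 for p in pairs if p == (0, 1))
    d = sum(1 for p in pairs if p == (0, 0))
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    sign = 1.0 if a * d >= b * c else -1.0
    return sign * math.sqrt(chi2 / n)
```

For any cohort with all four margins nonzero, `pearson_r` and `phi_coefficient` agree up to floating-point error, and in large samples both are close to the ρ used to generate the data.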
As in Chapter II, two models were examined under the full-cohort
design: the adjusted model which included the confounder X2, as
well as the exposure X1, and an unadjusted model which contained only the
exposure effect. Results from the latter model are presented here for the
purpose of demonstrating the loss in efficiency in the unadjusted or crude
analysis. It should be pointed out that the adjusted estimates represent the
true exposure-disease relationship, whereas the unadjusted analysis yields
biased estimates as a result of not taking into account the factor X2 which is
also known to be associated with the outcome. Since there is a positive
correlation structure between the exposure and the confounder, the true
estimate of the exposure-disease relationship, as reflected by the adjusted
estimate, will always be weaker than the crude estimate since some of the
relationship is explained away by the confounder.
3.3 Results from Simulations
Table 3.1 contains summary statistics for the observational
study simulations in which a mild association (ρ = 0.20) exists between the
exposure, X1, and the covariate, X2. Convergence was obtained for all 300
replications under each of the study designs. In general, the sample
standard deviations of β̂1 and the mean square errors are larger than for
the clinical trial simulations in which the treatment and covariate were
independent. Overall, the standard deviation increases with the value of β1
for the three study designs. There is also some indication that the standard
deviation increases with β2 as well, but this trend is not consistent for all
simulations.
Another observation which can be made from Table 3.1 is the
bias in the estimate of exposure effect β1 when there is a significant
covariate effect for the unadjusted full-cohort analyses. It is interesting to
note, however, that the sample standard deviation of β̂1 remains fairly
constant even as the sample mean of β̂1 becomes increasingly biased. The
estimates of β1 for the adjusted full-cohort analysis also seem slightly larger
than their true value for β1 = ln(2) and β1 = ln(3). However, this difference
does not appear important and is most likely due to sampling variation.
The adjusted estimates from the hybrid designs accurately reflect those
from the full-cohort adjusted analysis, although the standard deviations are
approximately 10 to 15 percent higher.
Under the full-cohort design, efficiencies of the unadjusted
analyses relative to the adjusted analyses are presented in Table 3.2. These
results clearly indicate the need to adjust for a covariate which is associated
with both the exposure and the outcome of interest. When the covariate is
not a true risk factor for the disease under study (i.e., β2 = 0), the relative
efficiency of the unadjusted analysis is 100 percent. However, as the
strength of the relationship between the covariate and response increases,
failure to adjust for X2 reduces the relative efficiency by an average of 18
percent when β2 = ln(2) and 30 percent when β2 = ln(3).
Efficiencies of the hybrid designs relative to the adjusted
estimates of the full-cohort design are presented in Table 3.3 for ρ = 0.20. In
general, the relative efficiencies for both designs increase with the size of
the referent group. However, the gain is minimal for the case-cohort design,
only 3 to 4 percent when the subcohort fraction is 50 percent larger. In
contrast, the synthetic case-control design shows an average increase of 10
percent when an additional 2 controls are matched to each case. When
there is a significant treatment effect, the case-cohort design appears to
have a slight advantage (3 to 4 percent) over the synthetic case-control
design for the smaller referent group. It is more difficult to generalize the
results of the larger study design. For the most part, the two designs are
equivalent with respect to relative efficiencies. However, the synthetic casecontrol design does perform unusually well when both the treatment and
covariate effects are nonsignificant. The synthetic case-control design also
does slightly better when the treatment effect is moderate [i.e.,
131 =-In(2)]
and the covariate effect is not large [i.e., P2-ln(3)]. Overall, however, there
is very little difference in relative efficiencies between the two hybrid
designs when p=~.20, which is similar to the results obtained for the clinical
•
trial simulations (i.e., p=O).
Summary statistics for the simulations in which a moderate association
(ρ=0.40) exists between the exposure, X1, and the covariate, X2, are presented
in Table 3.4. Complete convergence was obtained for all study designs. Again,
standard deviations for β̂1 are larger than when ρ=0.20 or ρ=0. Also
consistent with earlier observations, the standard deviations increase with
both β1 and β2. The bias in the unadjusted estimate of β1 is clearly obvious
for ρ=0.40 when the covariate effect is significant. As noted previously, the
adjusted full-cohort estimate of β1 also appears to be larger when there is a
significant covariate effect. The estimates of β1 under the hybrid designs
reflect those of the full-cohort design.
Relative efficiencies of the unadjusted full-cohort estimates are
presented in Table 3.5. The loss in efficiency resulting from failing to adjust
for a significant covariate effect is even more apparent when ρ=0.40. For a
covariate effect of β2=ln(2), there is approximately a 40 percent reduction in
efficiency relative to the adjusted estimate. For β2=ln(3), this reduction
jumps to 60 percent. There also appears to be a small sacrifice in efficiency
in the adjusted estimate of β1 when the covariate is not associated with the
outcome (i.e., adjustment is made unnecessarily).
Similar to the simulations for p=0.20, there is only a small
improvement in relative efficiency of the case-cohort design when the
subcohort fraction is increased from 0.271 to 0.409 (Table 3.6). The
synthetic case-control design averages slightly more than 10 percent
improvement when the equivalent change is made to the matching ratio.
There is almost no difference in relative efficiency between the case-cohort
and synthetic case-control designs for the matching ratio of 1:3, although
the synthetic case-control design performed unusually poorly for
β1=β2=ln(3). There is perhaps a slight advantage in the synthetic
case-control design for the larger matching ratio of 1:5. There does not
appear to be any evidence that the relative efficiencies of the hybrid designs
suffer as the correlation between the exposure and risk factor, ρ, increases.
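The subcohort fractions of 0.271 and 0.409 can be related to the "Expected Subjects Requiring Covariate Data" reported in Tables 3.1 and 3.4. A minimal Python sketch follows, assuming the subcohort is a simple random sample of the 500-subject cohort drawn without regard to outcome and that 50 cases are expected (overall disease probability 0.10). The function name is hypothetical, and the formula is one reading that reproduces the tabulated 172 and 234 rather than a calculation stated in the text.

```python
def expected_case_cohort_size(n_cohort, n_cases, subcohort_fraction):
    """Expected number of distinct subjects needing covariate data.

    Assumes the subcohort is sampled at random from the whole cohort, so a
    fraction `subcohort_fraction` of the cases already belongs to it and
    only the remaining cases must be added.
    """
    subcohort = n_cohort * subcohort_fraction
    cases_outside = n_cases * (1.0 - subcohort_fraction)
    return subcohort + cases_outside


for fraction in (0.271, 0.409):
    print(fraction, round(expected_case_cohort_size(500, 50, fraction)))
```

Under these assumptions the expected sizes are 171.95 and 234.05, which round to the 172 and 234 subjects shown in the tables.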
3.4 Discussion and Summary
When risk factors for the outcome are also associated with the
exposure, adjustment for these confounders is necessary in order to explore
the true exposure-disease relationship free of bias. This situation is typical
in a nonrandomized observational study, although it also arises in the
analysis of ancillary questions in clinical trials which do not preserve the
original randomization. It is clear from the simulations presented here that
the true exposure-disease relationship is artificially inflated when the
confounder is not properly controlled for in the analysis. Thus the
probability of finding an association between the exposure and the disease
when none actually exists, also referred to as the Type I error rate, α, is
increased.
Under these circumstances, the hybrid designs offer attractive
savings when the confounder is expensive to measure, with only a moderate
loss in efficiency relative to the full-cohort design. In general, the
simulations revealed that the smaller hybrid designs which required
covariate information on one-third of the subjects were approximately 75
percent efficient relative to the full-cohort adjusted analyses. The larger
hybrid designs using covariate information on one-half of the subjects were
about 82 percent efficient. Savings with the hybrid designs in a randomized
clinical trial setting are limited primarily to the reduction in the number of
patients for whom extraneous risk factors must be measured, since
treatment assignment is readily known. However, for nonrandomized
observational studies, the hybrid designs offer additional economy since the
exposure information will be required on only a fraction of the entire cohort
as well.
It is important to note that both the case-cohort design and the
synthetic case-control design perform consistently regardless of the degree
of association between the exposure and the confounder. In fact, the
relative efficiencies of the hybrid designs for the observational studies were
very similar to those for the clinical trial simulations. Although not
explored here, it is generally assumed that these results would also hold for
negative correlations between the exposure and the covariate.
Regarding the size of the referent group, there appeared to be
only a slight improvement in the relative efficiency of the case-cohort design
when the subcohort fraction was increased from 0.271 to 0.409. It is
doubtful that a 3 to 4 percent gain in relative efficiency would be worth
obtaining expensive covariate information on an additional 62 patients.
The same comparison for the synthetic case-control design revealed a
modest gain in relative efficiency of approximately 10 percent. Depending
on the cost of the covariate information, this increase may be worth
sampling an additional 2 controls per case. The asymptotic relative
efficiencies derived by Langholz and Thomas (1989) for the hybrid designs
with one binary covariate differ somewhat from the results described here
for two binary covariates. They noted an 8 to 9 percent increase in the
asymptotic relative efficiencies of the hybrid designs when the number of
subjects requiring covariate information was increased from 172 to 234.
This increase is slightly less than the results described for the synthetic
case-control designs, and substantially larger than what was noted for the
case-cohort design. Some of these differences may be due to sampling
variation; however, inclusion of two covariates in the analysis rather than
one covariate may also have affected the relative efficiencies.
Although the results presented here are not in perfect agreement
with the asymptotic results of Langholz and Thomas, it is important to note
that one of their primary conclusions still applies to these simulations.
They observed that the differences between the two designs in terms of
relative efficiency are very small and that this criterion should play only a
minor role in which subsampling strategy is best suited to a particular
study.
TABLE 3.1
SIMULATION SUMMARY STATISTICS FOR ESTIMATING THE EXPOSURE EFFECT (β1) WHILE ADJUSTING
FOR A COVARIATE EFFECT (β2) WHEN BOTH VARIABLES ARE DICHOTOMOUS AND MILDLY CORRELATED (ρ=0.20)
Overall Probability of Disease p=0.10

I. Exposure Effect β1=0 (RR=1)

                                                  Estimation Procedure
                             Full-Cohort   Full-Cohort   Case-Cohort     Synthetic   Case-Cohort     Synthetic
                                                                      Case-Control                Case-Control
                          (unadjusted)1/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/
Expected Subjects
Requiring Covariate Data             500           500           172           172           234           234

a. Covariate Effect β2=0
   Sample mean (β̂1)                0.005         0.006         0.010         0.006         0.014         0.018
   Sample std. dev. (β̂1)           0.301         0.302         0.348         0.343         0.339         0.312
   Mean square error 3/            0.091         0.091         0.121         0.118         0.115         0.098

b. Covariate Effect β2=ln(2)
   Sample mean (β̂1)                0.141         0.012         0.015         0.010         0.019         0.003
   Sample std. dev. (β̂1)           0.302         0.308         0.353         0.358         0.346         0.346
   Mean square error 3/            0.111         0.095         0.125         0.128         0.120         0.120

c. Covariate Effect β2=ln(3)
   Sample mean (β̂1)                0.206         0.007         0.009        -0.004         0.013        -0.003
   Sample std. dev. (β̂1)           0.304         0.310         0.361         0.356         0.353         0.352
   Mean square error 3/            0.135         0.096         0.130         0.127         0.125         0.124

1/ The unadjusted model includes only one variable, X1, for exposure level.
2/ The adjusted model includes the covariate X2 in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: MSE = (β1 - β̂1)² + [std. dev.(β̂1)]².
TABLE 3.1
(continued)

II. Exposure Effect β1=ln(2)=0.693 (RR=2)

                                                  Estimation Procedure
                             Full-Cohort   Full-Cohort   Case-Cohort     Synthetic   Case-Cohort     Synthetic
                                                                      Case-Control                Case-Control
                          (unadjusted)1/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/
Expected Subjects
Requiring Covariate Data             500           500           172           172           234           234

a. Covariate Effect β2=0
   Sample mean (β̂1)                0.705         0.707         0.711         0.711         0.715         0.715
   Sample std. dev. (β̂1)           0.318         0.319         0.366         0.373         0.355         0.339
   Mean square error 3/            0.101         0.102         0.134         0.139         0.127         0.115

b. Covariate Effect β2=ln(2)
   Sample mean (β̂1)                0.862         0.734         0.735         0.732         0.740         0.731
   Sample std. dev. (β̂1)           0.312         0.314         0.358         0.375         0.351         0.339
   Mean square error 3/            0.126         0.100         0.130         0.142         0.125         0.116

c. Covariate Effect β2=ln(3)
   Sample mean (β̂1)                0.935         0.744         0.744         0.745         0.749         0.745
   Sample std. dev. (β̂1)           0.323         0.329         0.374         0.404         0.366         0.380
   Mean square error 3/            0.163         0.111         0.142         0.166         0.137         0.147

1/ The unadjusted model includes only one variable, X1, for exposure level.
2/ The adjusted model includes the covariate X2 in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: MSE = (β1 - β̂1)² + [std. dev.(β̂1)]².
TABLE 3.1
(continued)

III. Exposure Effect β1=ln(3)=1.099 (RR=3)

                                                  Estimation Procedure
                             Full-Cohort   Full-Cohort   Case-Cohort     Synthetic   Case-Cohort     Synthetic
                                                                      Case-Control                Case-Control
                          (unadjusted)1/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/
Expected Subjects
Requiring Covariate Data             500           500           172           172           234           234

a. Covariate Effect β2=0
   Sample mean (β̂1)                1.127         1.128         1.133         1.146         1.136         1.131
   Sample std. dev. (β̂1)           0.348         0.348         0.396         0.410         0.387         0.382
   Mean square error 3/            0.122         0.122         0.158         0.170         0.151         0.147

b. Covariate Effect β2=ln(2)
   Sample mean (β̂1)                1.289         1.161         1.162         1.164         1.167         1.164
   Sample std. dev. (β̂1)           0.366         0.369         0.416         0.430         0.404         0.403
   Mean square error 3/            0.170         0.140         0.177         0.189         0.168         0.167

c. Covariate Effect β2=ln(3)
   Sample mean (β̂1)                1.350         1.163         1.163         1.179         1.168         1.163
   Sample std. dev. (β̂1)           0.358         0.364         0.409         0.427         0.400         0.401
   Mean square error 3/            0.191         0.137         0.171         0.189         0.165         0.165

1/ The unadjusted model includes only one variable, X1, for exposure level.
2/ The adjusted model includes the covariate X2 in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: MSE = (β1 - β̂1)² + [std. dev.(β̂1)]².
Table 3.2
EFFICIENCY OF THE UNADJUSTED ANALYSIS RELATIVE TO THE
ADJUSTED ANALYSIS IN THE FULL-COHORT DESIGN WHEN THE
EXPOSURE AND COVARIATE ARE DICHOTOMOUS AND
MILDLY CORRELATED (ρ=0.20)
Overall Probability of Disease p=0.10

Exposure       Covariate      Relative
Effect (β1)    Effect (β2)    Efficiency

0              0              1.00
0              ln(2)          0.86
0              ln(3)          0.71

ln(2)          0              1.01
ln(2)          ln(2)          0.79
ln(2)          ln(3)          0.68

ln(3)          0              1.00
ln(3)          ln(2)          0.82
ln(3)          ln(3)          0.72
Table 3.3
EFFICIENCY OF THE HYBRID DESIGNS RELATIVE TO THE ADJUSTED
ESTIMATES OF THE FULL-COHORT DESIGN WHEN THE EXPOSURE
AND COVARIATE ARE DICHOTOMOUS AND MILDLY CORRELATED (ρ=0.20)
Overall Probability of Disease p=0.10

Matching   Subcohort   Exposure       Covariate                       Synthetic
Ratio      Fraction    Effect (β1)    Effect (β2)    Case-Cohort1/    Case-Control2/

1:3        0.271       0              0              0.75             0.77
                       0              ln(2)          0.76             0.74
                       0              ln(3)          0.74             0.76

                       ln(2)          0              0.76             0.73
                       ln(2)          ln(2)          0.77             0.70
                       ln(2)          ln(3)          0.78             0.69

                       ln(3)          0              0.77             0.72
                       ln(3)          ln(2)          0.79             0.74
                       ln(3)          ln(3)          0.80             0.72

1:5        0.409       0              0              0.79             0.93
                       0              ln(2)          0.79             0.79
                       0              ln(3)          0.77             0.77

                       ln(2)          0              0.80             0.89
                       ln(2)          ln(2)          0.80             0.86
                       ln(2)          ln(3)          0.81             0.76

                       ln(3)          0              0.81             0.83
                       ln(3)          ln(2)          0.83             0.84
                       ln(3)          ln(3)          0.83             0.83

1/ MSE(Adjusted Full-Cohort Estimate)/MSE(Case-Cohort Estimate)
2/ MSE(Adjusted Full-Cohort Estimate)/MSE(Synthetic Case-Control Estimate)
TABLE 3.4
SIMULATION SUMMARY STATISTICS FOR ESTIMATING THE EXPOSURE EFFECT (β1)
WHILE ADJUSTING FOR A COVARIATE EFFECT (β2) WHEN BOTH VARIABLES
ARE DICHOTOMOUS AND MODERATELY CORRELATED (ρ=0.40)
Overall Probability of Disease p=0.10

I. Exposure Effect β1=0 (RR=1)

                                                  Estimation Procedure
                             Full-Cohort   Full-Cohort   Case-Cohort     Synthetic   Case-Cohort     Synthetic
                                                                      Case-Control                Case-Control
                          (unadjusted)1/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/
Expected Subjects
Requiring Covariate Data             500           500           172           172           234           234

a. Covariate Effect β2=0
   Sample mean (β̂1)                0.005         0.007         0.012         0.005         0.015         0.016
   Sample std. dev. (β̂1)           0.301         0.312         0.365         0.359         0.357         0.329
   Mean square error 3/            0.091         0.097         0.133         0.129         0.128         0.108

b. Covariate Effect β2=ln(2)
   Sample mean (β̂1)                0.282         0.019         0.022         0.017         0.025         0.014
   Sample std. dev. (β̂1)           0.306         0.317         0.366         0.391         0.360         0.348
   Mean square error 3/            0.173         0.101         0.134         0.153         0.130         0.121

c. Covariate Effect β2=ln(3)
   Sample mean (β̂1)                0.425         0.019         0.022        -0.003         0.024         0.025
   Sample std. dev. (β̂1)           0.311         0.320         0.371         0.383         0.363         0.366
   Mean square error 3/            0.277         0.103         0.138         0.147         0.132         0.135

1/ The unadjusted model includes only one variable, X1, for exposure level.
2/ The adjusted model includes the covariate X2 in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: MSE = (β1 - β̂1)² + [std. dev.(β̂1)]².
TABLE 3.4
(continued)

II. Exposure Effect β1=ln(2)=0.693 (RR=2)

                                                  Estimation Procedure
                             Full-Cohort   Full-Cohort   Case-Cohort     Synthetic   Case-Cohort     Synthetic
                                                                      Case-Control                Case-Control
                          (unadjusted)1/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/
Expected Subjects
Requiring Covariate Data             500           500           172           172           234           234

a. Covariate Effect β2=0
   Sample mean (β̂1)                0.705         0.704         0.709         0.714         0.712         0.717
   Sample std. dev. (β̂1)           0.318         0.331         0.390         0.383         0.377         0.354
   Mean square error 3/            0.101         0.110         0.152         0.147         0.142         0.126

b. Covariate Effect β2=ln(2)
   Sample mean (β̂1)                1.007         0.738         0.741         0.736         0.745         0.746
   Sample std. dev. (β̂1)           0.332         0.339         0.391         0.391         0.382         0.377
   Mean square error 3/            0.209         0.117         0.155         0.155         0.149         0.145

c. Covariate Effect β2=ln(3)
   Sample mean (β̂1)                1.152         0.748         0.749         0.754         0.754         0.751
   Sample std. dev. (β̂1)           0.346         0.352         0.401         0.403         0.394         0.387
   Mean square error 3/            0.330         0.127         0.164         0.166         0.159         0.153

1/ The unadjusted model includes only one variable, X1, for exposure level.
2/ The adjusted model includes the covariate X2 in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: MSE = (β1 - β̂1)² + [std. dev.(β̂1)]².
TABLE 3.4
(continued)

III. Exposure Effect β1=ln(3)=1.099 (RR=3)

                                                  Estimation Procedure
                             Full-Cohort   Full-Cohort   Case-Cohort     Synthetic   Case-Cohort     Synthetic
                                                                      Case-Control                Case-Control
                          (unadjusted)1/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/  (adjusted)2/
Expected Subjects
Requiring Covariate Data             500           500           172           172           234           234

a. Covariate Effect β2=0
   Sample mean (β̂1)                1.127         1.123         1.128         1.137         1.131         1.129
   Sample std. dev. (β̂1)           0.348         0.356         0.413         0.423         0.403         0.392
   Mean square error 3/            0.122         0.127         0.171         0.180         0.163         0.155

b. Covariate Effect β2=ln(2)
   Sample mean (β̂1)                1.427         1.156         1.160         1.187         1.163         1.174
   Sample std. dev. (β̂1)           0.379         0.393         0.441         0.447         0.428         0.418
   Mean square error 3/            0.251         0.150         0.198         0.208         0.187         0.180

c. Covariate Effect β2=ln(3)
   Sample mean (β̂1)                1.569         1.168         1.171         1.190         1.175         1.165
   Sample std. dev. (β̂1)           0.384         0.386         0.437         0.463         0.428         0.433
   Mean square error 3/            0.369         0.154         0.196         0.223         0.189         0.192

1/ The unadjusted model includes only one variable, X1, for exposure level.
2/ The adjusted model includes the covariate X2 in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both X1 and X2.
3/ Mean square error: MSE = (β1 - β̂1)² + [std. dev.(β̂1)]².
Table 3.5
EFFICIENCY OF THE UNADJUSTED ANALYSIS RELATIVE TO THE
ADJUSTED ANALYSIS IN THE FULL-COHORT DESIGN WHEN THE
EXPOSURE AND COVARIATE ARE DICHOTOMOUS AND
MODERATELY CORRELATED (ρ=0.40)
Overall Probability of Disease p=0.10

Exposure       Covariate      Relative
Effect (β1)    Effect (β2)    Efficiency

0              0              1.07
0              ln(2)          0.58
0              ln(3)          0.37

ln(2)          0              1.09
ln(2)          ln(2)          0.56
ln(2)          ln(3)          0.38

ln(3)          0              1.04
ln(3)          ln(2)          0.60
ln(3)          ln(3)          0.42
Table 3.6
EFFICIENCY OF THE HYBRID DESIGNS RELATIVE TO THE ADJUSTED
ESTIMATES OF THE FULL-COHORT DESIGN WHEN THE EXPOSURE AND
COVARIATE ARE DICHOTOMOUS AND MODERATELY CORRELATED (ρ=0.40)
Overall Probability of Disease p=0.10

Matching   Subcohort   Exposure       Covariate                       Synthetic
Ratio      Fraction    Effect (β1)    Effect (β2)    Case-Cohort1/    Case-Control2/

1:3        0.271       0              0              0.73             0.75
                       0              ln(2)          0.75             0.66
                       0              ln(3)          0.75             0.70

                       ln(2)          0              0.72             0.75
                       ln(2)          ln(2)          0.75             0.75
                       ln(2)          ln(3)          0.77             0.77

                       ln(3)          0              0.74             0.71
                       ln(3)          ln(2)          0.76             0.72
                       ln(3)          ln(3)          0.79             0.54

1:5        0.409       0              0              0.76             0.90
                       0              ln(2)          0.78             0.83
                       0              ln(3)          0.78             0.76

                       ln(2)          0              0.77             0.87
                       ln(2)          ln(2)          0.79             0.81
                       ln(2)          ln(3)          0.80             0.83

                       ln(3)          0              0.78             0.82
                       ln(3)          ln(2)          0.80             0.83
                       ln(3)          ln(3)          0.81             0.80

1/ MSE(Adjusted Full-Cohort Estimate)/MSE(Case-Cohort Estimate)
2/ MSE(Adjusted Full-Cohort Estimate)/MSE(Synthetic Case-Control Estimate)
CHAPTER IV
HYBRID DESIGNS IN AN OBSERVATIONAL STUDY
WITH A CONTINUOUS EXPOSURE
4.1 Introduction
The simulations described thus far have exclusively considered
the case where both the principal variable, X1, and the covariate, X2, are of
a dichotomous nature. A natural extension of this investigation is to
examine the performance of the subsampling strategies when both X1 and
X2 are continuous variables. Epidemiologic studies frequently explore the
risks associated with a continuous exposure, such as serum cholesterol,
pack-years of smoking, and radiation dosage in rads. The nature of the
secondary variable, X2, which was described earlier as an important
covariate that is difficult and/or expensive to measure, is also likely to be
continuous in practice. For example, this covariate may be a measurement
obtained from an expensive blood assay, or the quantifying of specific
nutrients from hand-coded food frequency questionnaires.
In Chapter II, simulations were performed under the conditions
of no correlation between the binary treatment variable and covariate, as
would be expected in a clinical trial setting. Chapter III considered the
effect of mild (i.e., ρ=0.20) and moderate (i.e., ρ=0.40) correlations between
X1 and X2. For completeness, all three situations will be explored here
using continuous data. Although continuous treatment variables are
virtually nonexistent in clinical trials, it is possible that the exposure and
covariate of an observational study design are unrelated. Therefore the zero
correlation structure will be considered here in the context of an
epidemiologic setting.
4.2 Description of Simulations
The simulations presented here are very similar to those
described in Sections 2.2 and 3.2, with the exception that both the exposure,
X1, and covariate, X2, are continuous rather than binary variables.
Specifically, X1 and X2 were structured as bivariate standard normal
variables with mean and variance
$$E\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \tag{4.1}$$
and
$$V\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}. \tag{4.2}$$
Observations were generated from three mutually independent normal
distributions such that the random variables Yi, i=1,2,3, were normally
distributed with mean E[Yi]=μi and variance V[Yi]=σi². Data points
generated for Y1 were added to those for Y3 to obtain the distribution of the
exposure, X1. Similarly, the distribution of the covariate X2 was generated
from the addition of observations from the distributions of Y2 and Y3.
Thus, X1 and X2 are bivariate normal random variables with mean and
variance
$$E\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} \mu_1 + \mu_3 \\ \mu_2 + \mu_3 \end{bmatrix} \tag{4.3}$$
and
$$V\begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} \sigma_1^2 + \sigma_3^2 & \sigma_3^2 \\ \sigma_3^2 & \sigma_2^2 + \sigma_3^2 \end{bmatrix}. \tag{4.4}$$
In order for equations 4.1 and 4.3 to be satisfied, each of the distributions of
Yi, i=1,2,3, was generated with an expected value of 0. Substitution of
equation 4.2 into 4.4 implies that the variance of Y3 will determine the
correlation, ρ, between X1 and X2. In addition, the variances of Y1 and Y2
must be equal to 1-ρ. For the simulations in which the exposure and
covariate are independent, the distributions of X1 and X2 are simply those
of Y1 and Y2, since the mean and variance of Y3 must necessarily be zero.
Note that a zero correlation between these two variables implies that they
are independent since they are jointly normally distributed. In the case
where a mild correlation (i.e., ρ=0.20) is desired between X1 and X2,
V[Y3]=0.20 and V[Y1]=V[Y2]=0.80. Similarly, for ρ=0.40, V[Y3]=0.40 and
V[Y1]=V[Y2]=0.60.
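The shared-component construction described above can be sketched in a few lines of Python. This is an illustration only: the function names, the seed, and the sample size are assumptions, and the original simulations were not necessarily programmed this way. Only the standard library is used.

```python
import math
import random


def generate_exposure_covariate(n, rho, rng):
    """Build X1 = Y1 + Y3 and X2 = Y2 + Y3 from independent zero-mean normals.

    V[Y1] = V[Y2] = 1 - rho and V[Y3] = rho, so each X has unit variance and
    Cov(X1, X2) = V[Y3] = rho. Requires 0 <= rho <= 1.
    """
    sd_own = math.sqrt(1.0 - rho)
    sd_shared = math.sqrt(rho)
    x1, x2 = [], []
    for _ in range(n):
        y1 = rng.gauss(0.0, sd_own)
        y2 = rng.gauss(0.0, sd_own)
        y3 = rng.gauss(0.0, sd_shared)
        x1.append(y1 + y3)
        x2.append(y2 + y3)
    return x1, x2


def sample_correlation(a, b):
    """Pearson correlation computed directly from its definition."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(1990)  # arbitrary seed, used only for reproducibility
x1, x2 = generate_exposure_covariate(100_000, 0.40, rng)
print(round(sample_correlation(x1, x2), 2))
```

With a large sample the printed correlation should lie close to the target ρ. Setting rho to 0 makes Y3 degenerate at zero, which corresponds to the independent case.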
As in Chapters II and III, 300 cohorts of size 500 were generated
for each simulation such that 50 failures would be expected in each cohort.
Failure times were exponentially distributed and censored at unity.
Exposure and confounder effects were examined over the range 0, ln(2), and
ln(3) with case-to-control matching ratios of 1:3 and 1:5. As with the binary
data, convergence was obtained for all simulations under each of the study
designs.
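A minimal sketch of how one such cohort might be generated is given below. It assumes a hazard of the form λ0·exp(β1·X1 + β2·X2) and, since the text does not say how the baseline hazard was chosen, it calibrates λ0 by bisection so that the average probability of failing before the censoring time is 0.10. That calibration rule, the function names, and the seed are all assumptions made for illustration.

```python
import math
import random


def calibrate_baseline(linear_predictors, target=0.10):
    """Bisection for lambda0 so the mean P(T <= 1) equals `target` (assumed rule)."""
    def mean_event_prob(lam0):
        return sum(1.0 - math.exp(-lam0 * math.exp(lp))
                   for lp in linear_predictors) / len(linear_predictors)

    low, high = 0.0, 1.0
    while mean_event_prob(high) < target:  # expand until the root is bracketed
        high *= 2.0
    for _ in range(60):
        mid = 0.5 * (low + high)
        if mean_event_prob(mid) < target:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def simulate_failure_times(x1, x2, beta1, beta2, rng, censor_at=1.0):
    """Exponential failure times with hazard lambda0*exp(beta1*x1 + beta2*x2)."""
    lps = [beta1 * a + beta2 * b for a, b in zip(x1, x2)]
    lam0 = calibrate_baseline(lps)
    times, events = [], []
    for lp in lps:
        t = rng.expovariate(lam0 * math.exp(lp))  # argument is the rate
        events.append(t <= censor_at)
        times.append(min(t, censor_at))           # censor at unity
    return times, events


rng = random.Random(42)  # arbitrary seed
# Independent standard normal exposure and covariate (the rho = 0 case).
x1 = [rng.gauss(0.0, 1.0) for _ in range(500)]
x2 = [rng.gauss(0.0, 1.0) for _ in range(500)]
times, events = simulate_failure_times(x1, x2, math.log(2), math.log(3), rng)
print(sum(events), "failures out of", len(events))
```

Repeating the last four lines 300 times with fresh draws would give one simulation cell; the number of failures varies from cohort to cohort around the expected 50.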
4.3 Results from Simulations
Summary statistics for the simulations in which the continuous
exposure and covariate are independent are presented in Table 4.1. In
general, there appears to be a slight upward bias in the adjusted estimate of
the continuous exposure under the hybrid designs when the relative risk is
significant. This bias is somewhat greater for the case-cohort design,
although it decreases slightly for both designs as the size of the referent
group is expanded. The unadjusted estimates of the continuous exposure
under the full-cohort design are biased toward the null when there are
significant exposure and covariate effects. This bias increases with the
strength of the relationship of both the exposure and the covariate to the
outcome. However, the mean square errors do not reflect this
underestimation since there is a corresponding decrease in the size of the
sample standard deviations as well.
Overall, the sample standard deviations of the continuous data
are considerably smaller than was observed for the dichotomous data
presented in Chapter II. Another difference to be noted between the two
types of data is that the standard deviations and mean square errors of the
continuous exposure estimates appear to increase with β1 only under the
hybrid designs. For the binary data, this increase was observed for the
full-cohort design as well. The same trend, although less prominent, appears
to exist for β2 also.
The relative efficiencies of the unadjusted estimates to the
adjusted estimates of the full-cohort design are presented in Table 4.2. As
noted previously, the bias in the unadjusted estimates is for the most part
undetected in the mean square errors as a result of the corresponding
smaller standard deviations. Thus, the results presented in this table are
somewhat misleading. However, for the strongest exposure and covariate
effects [i.e., β1=β2=ln(3)], the bias for the unadjusted estimate is strong
enough toward the null to yield a low relative efficiency of only 59 percent.
The efficiencies of the hybrid designs relative to the adjusted
full-cohort design are presented in Table 4.3 for these simulations. When
neither exposure nor covariate effects are significant (i.e., β1=β2=0), the
synthetic case-control design is 75 percent efficient for the 1:3 matching
ratio. This agrees with the asymptotic theory for the case of a single
continuous exposure variable derived by Breslow and Patton (1979) and Ury
(1975). The relative efficiency of the case-cohort design under the same
conditions is 72 percent. This is slightly smaller than the asymptotic
relative efficiency of 78 percent derived by Langholz and Thomas (1990) for
the case of a single binary exposure. An expression for the asymptotic
relative efficiency for a continuous exposure in the case-cohort design has
not yet been developed.
When the matching ratio is increased to five controls per case,
the relative efficiency of the synthetic case-control design is 78 percent,
which falls short of the asymptotic estimate of 83 percent. Similarly, for the
case-cohort design, the corresponding relative efficiency is 81 percent, which
is also less than the 87 percent reported by Langholz and Thomas for a
single binary exposure.
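The asymptotic benchmarks of 75 and 83 percent quoted for the synthetic case-control design are consistent with the familiar rule for 1:m matching under the null hypothesis, usually attributed to Ury (1975), in which the efficiency relative to using all available controls is m/(m+1). This is offered as a reading of where those two figures come from, not as a derivation taken from the text:

```latex
\mathrm{ARE}(m) = \frac{m}{m+1}, \qquad
\mathrm{ARE}(3) = \frac{3}{4} = 0.75, \qquad
\mathrm{ARE}(5) = \frac{5}{6} \approx 0.83 .
```

The rule also shows why returns diminish quickly: moving from three to five controls per case raises the asymptotic efficiency by only about 8 percentage points.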
Some general trends can be seen in the relative efficiencies of
the hybrid designs. As would be expected, the performance of both the
case-cohort and synthetic case-control designs improves with the larger
referent group. This increase in relative efficiency was not as obvious for
the binary results of the case-cohort design presented in Chapter II. It is
also clear from Table 4.3 that the relative efficiency of these designs
deteriorates significantly as both the exposure and covariate effects become
stronger. In fact, these subsampling strategies are only 30 percent efficient
under the 1:3 matching ratio when the exposure and covariate effects are both
ln(3). The relative efficiency improves to 40 percent when an additional two
controls are matched to each case. These results are in contrast to those seen
with the binary data, in which the relative efficiency varied only slightly
with the strength of the treatment and covariate effects.
Similar to the binary data presented in Chapter III, correlations
between the continuous exposure and covariate are examined here. In
general, when the correlation is mild (i.e., ρ=0.20) the standard deviations
and mean square errors of the exposure in Table 4.4 are larger than when
the exposure and covariate are independent, as in Table 4.1. A similar
observation was also noted for the binary data. The slight upward bias in
the hybrid estimates of the exposure effect persists under the conditions of
mild correlation; however, the bias does not appear to be any greater than
when ρ=0. It is also quite apparent that the continuous covariate becomes a
confounder as a result of its relationship to the exposure. The extent of the
bias due to omitting this confounder in the unadjusted analysis of the
full-cohort design can be seen in Table 4.5. Although there is a slight loss in
efficiency in the adjusted analysis when the covariate effect is
nonsignificant, the relative efficiency of the unadjusted analysis decreases
substantially as the covariate effect becomes stronger. It is interesting to
note that the bias in the unadjusted estimate decreases as the exposure
effect becomes stronger. Chastang, Byar, and Piantadosi (1988) observed
this trend for a binary treatment and covariate which were not correlated,
although it was only apparent for very strong treatment effects (i.e., β>2).
In general, the performance of the hybrid designs is not affected by the
introduction of a mild correlation between the continuous exposure and
covariate (Table 4.6). This agrees with the results for the binary data
presented in Chapter III.
When the exposure and covariate are moderately correlated
(ρ=0.40), the standard deviations and mean square errors are even larger
still (Table 4.7). Again, bias is present in the hybrid estimates of the
exposure effect, although only to the same degree as when ρ=0 and ρ=0.20. The
unadjusted estimates under the full-cohort design are extremely biased
away from the null and the bias increases with the strength of the
covariate. As noted previously, the bias decreases as the exposure effect
strengthens (Table 4.8). Overall, the results in Table 4.9 show that the
relative performance of the hybrid designs remains unchanged regardless of
the strength of the correlation.
4.4 Discussion and Summary
The results presented here for a continuous exposure and
covariate are different in several ways from those for the binary data.
These comparisons give a more complete picture of the effectiveness of the
subsampling strategies under various conditions.
The results of the unadjusted full-cohort analysis for an
independent exposure and covariate will be considered first. It was noted in
Chapter II that bias toward the null could occur when an important, yet
balanced, covariate is omitted from the Cox model due to the assumption of
proportional hazards being violated, although Chastang, Byar, and
Piantadosi (1988) found this bias to be small when the data were heavily
censored. This explanation was supported by the lack of bias found in the
unadjusted full-cohort estimates of the binary treatment effect presented in
Chapter II. However, the results presented here for continuous data show
the exposure effect to be underestimated in the unadjusted full-cohort
analysis. This bias increases with the strength of the exposure and
covariate effects. One possible explanation for this is that the bias reported
by Chastang, Byar, and Piantadosi is more severe when the data are
continuous, thus affecting the estimate despite heavy censoring. The
implications of this result would initially seem to be that the subsampling
strategies might prove useful even when the exposure and covariate are
independent, as long as the data are continuous. Unfortunately, the hybrid
designs perform poorly relative to the full-cohort design under precisely
these conditions. In fact, when both the exposure and covariate effects are
ln(3), the relative efficiency of the unadjusted estimate is 59 percent.
However, the corresponding relative efficiencies of the hybrid designs were
only 30 and 40 percent for the 1:3 and 1:5 matching ratios, respectively.
Thus, the case-cohort and synthetic case-control designs have little to offer
in this situation.
The relatively poor performance of the hybrid designs when the
data are continuous warrants concern. A comparison of the standard
deviations under the full-cohort design for the binary and continuous data
may suggest an explanation. When the data are binary, the standard
deviation of the exposure increases with β1 and β2 under the full-cohort
design as well as the hybrid designs, thus producing fairly stable relative
efficiencies over the different treatment and covariate effects in Chapters II
and III. However, when the data are continuous, the standard deviation of
the exposure does not increase appreciably with β1 and β2 under the
full-cohort design. Therefore, since this increase is still present under the
hybrid designs, their relative efficiencies suffer accordingly. Kalbfleisch and
Prentice (1980) also noted that the efficiency of the Cox model decreased
with β in the two-sample problem. In the analysis of survival in two
treatment groups (e.g., experimental drug versus placebo), if the active drug
is highly effective (i.e., β large), then it is likely that the placebo group
will have considerably more deaths. In a simulation of this example with
300 replications, there may be some occurrences in which nearly all the
deaths are in the placebo group. Under these conditions, the estimates of β
become very unstable, thus producing large variances. Langholz and
Thomas (1990) showed that the empirical variances for the hybrid designs
also increase with β for a dichotomous treatment variable. It is
hypothesized that when the data are continuous, the variance from the Cox
model does not increase with β since there is a better continuum of
estimates. Why this reasoning does not hold for the subsampling strategies
is not obvious. Intuitively, it seems as if the smaller sample sizes of these
designs may play a role in their inability to take full advantage of the
continuous data. One result which lends support to this theory is that the
relative efficiencies of the continuous estimates under the hybrid designs
increased substantially when the size of the referent groups was increased.
The difference in the relative efficiencies for the two matching ratios was
only minor for the binary data.
Another factor which also adds to the poor performance of the
hybrid designs for continuous data is the slight upward bias in the estimate
of significant exposure effects. This bias did not appear to be present for the
binary data. Although the bias exists under both hybrid designs, it is
slightly worse for the case-cohort design. As noted previously, the bias
decreases somewhat when more controls are used, thus suggesting again
that sample size is a potential problem for the subsampling strategies when
the data are continuous.
The simulations presented here considered situations in which
the continuous exposure and covariate were independent, mildly correlated,
and moderately correlated. In general, the relative performance of the case-cohort
and synthetic case-control designs was not affected by the level of
association between the continuous exposure and covariate. The same
result was also observed when these variables were dichotomous in nature.
Finally, in an overall comparison of the relative performance,
the case-cohort design and synthetic case-control design appear to be nearly
equivalent in efficiency for continuous data. This is the same conclusion
that was reached for the binary data. Unfortunately, for the continuous
data, the two subsampling strategies are equally inferior to the full-cohort
design. It may instead be more efficient to use a surrogate for the expensive
covariate, rather than collecting this information on only a fraction of the
original cohort.
TABLE 4.1
SIMULATION SUMMARY STATISTICS FOR ESTIMATING THE EXPOSURE EFFECT ($\beta_1$) WHILE ADJUSTING
FOR A COVARIATE EFFECT ($\beta_2$) WHEN BOTH VARIABLES ARE CONTINUOUS AND INDEPENDENT
Overall Probability of Disease p=0.10

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

I. Exposure Effect $\beta_1$=0 (RR=1)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)        -0.004         -0.005         -0.002         -0.007         -0.002         -0.005
   Sample std. dev. ($\hat\beta_1$)    0.144          0.144          0.171          0.168          0.161          0.163
   Mean square error 3/                0.021          0.021          0.029          0.028          0.026          0.027

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         0.010          0.009          0.011          0.003          0.011          0.005
   Sample std. dev. ($\hat\beta_1$)    0.141          0.141          0.181          0.188          0.166          0.167
   Mean square error 3/                0.020          0.020          0.033          0.035          0.028          0.028

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         0.008          0.007          0.007          0.003          0.008          0.002
   Sample std. dev. ($\hat\beta_1$)    0.134          0.134          0.192          0.209          0.174          0.177
   Mean square error 3/                0.018          0.018          0.037          0.044          0.030          0.031

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
TABLE 4.1
(continued)

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

II. Exposure Effect $\beta_1$=ln(2)=0.693 (RR=2)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)         0.694          0.695          0.731          0.709          0.716          0.708
   Sample std. dev. ($\hat\beta_1$)    0.140          0.141          0.199          0.188          0.175          0.180
   Mean square error 3/                0.020          0.020          0.041          0.036          0.031          0.033

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         0.678          0.709          0.748          0.727          0.733          0.721
   Sample std. dev. ($\hat\beta_1$)    0.140          0.145          0.214          0.209          0.190          0.187
   Mean square error 3/                0.020          0.021          0.049          0.045          0.038          0.036

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         0.628          0.706          0.742          0.738          0.729          0.701
   Sample std. dev. ($\hat\beta_1$)    0.132          0.144          0.216          0.219          0.192          0.190
   Mean square error 3/                0.022          0.021          0.049          0.050          0.038          0.036

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
TABLE 4.1
(continued)

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

III. Exposure Effect $\beta_1$=ln(3)=1.099 (RR=3)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)         1.110          1.114          1.173          1.134          1.146          1.128
   Sample std. dev. ($\hat\beta_1$)    0.148          0.150          0.255          0.237          0.205          0.195
   Mean square error 3/                0.022          0.023          0.071          0.057          0.044          0.039

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         1.047          1.113          1.171          1.142          1.148          1.121
   Sample std. dev. ($\hat\beta_1$)    0.147          0.152          0.252          0.232          0.217          0.220
   Mean square error 3/                0.024          0.023          0.069          0.056          0.050          0.049

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         0.964          1.117          1.174          1.152          1.152          1.134
   Sample std. dev. ($\hat\beta_1$)    0.138          0.147          0.259          0.280          0.221          0.222
   Mean square error 3/                0.037          0.022          0.073          0.081          0.052          0.051

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
Table 4.2
EFFICIENCY OF THE UNADJUSTED ANALYSIS RELATIVE TO THE
ADJUSTED ANALYSIS IN THE FULL-COHORT DESIGN WHEN THE
EXPOSURE AND COVARIATE ARE CONTINUOUS AND INDEPENDENT
Overall Probability of Disease p=0.10

Exposure             Covariate            Relative
Effect ($\beta_1$)   Effect ($\beta_2$)   Efficiency
0                    0                    1.00
0                    ln(2)                1.00
0                    ln(3)                1.00

ln(2)                0                    1.00
ln(2)                ln(2)                1.05
ln(2)                ln(3)                0.95

ln(3)                0                    1.05
ln(3)                ln(2)                0.96
ln(3)                ln(3)                0.59
Table 4.3
EFFICIENCY OF THE HYBRID DESIGNS RELATIVE TO THE
ADJUSTED ESTIMATES OF THE FULL-COHORT DESIGN WHEN THE
EXPOSURE AND COVARIATE ARE CONTINUOUS AND INDEPENDENT
Overall Probability of Disease p=0.10

Matching   Subcohort   Exposure             Covariate                             Synthetic
Ratio      Fraction    Effect ($\beta_1$)   Effect ($\beta_2$)   Case-Cohort 1/   Case-Control 2/
1:3        0.271       0                    0                    0.72             0.75
                       0                    ln(2)                0.61             0.57
                       0                    ln(3)                0.49             0.41

                       ln(2)                0                    0.49             0.56
                       ln(2)                ln(2)                0.43             0.47
                       ln(2)                ln(3)                0.43             0.42

                       ln(3)                0                    0.32             0.40
                       ln(3)                ln(2)                0.33             0.41
                       ln(3)                ln(3)                0.30             0.27

1:5        0.409       0                    0                    0.81             0.78
                       0                    ln(2)                0.71             0.71
                       0                    ln(3)                0.60             0.58

                       ln(2)                0                    0.65             0.61
                       ln(2)                ln(2)                0.55             0.58
                       ln(2)                ln(3)                0.55             0.58

                       ln(3)                0                    0.52             0.59
                       ln(3)                ln(2)                0.46             0.47
                       ln(3)                ln(3)                0.42             0.43

1/ MSE(Adjusted Full-Cohort Estimate)/MSE(Case-Cohort Estimate)
2/ MSE(Adjusted Full-Cohort Estimate)/MSE(Synthetic Case-Control Estimate)
TABLE 4.4
SIMULATION SUMMARY STATISTICS FOR ESTIMATING THE EXPOSURE EFFECT ($\beta_1$) WHILE ADJUSTING
FOR A COVARIATE EFFECT ($\beta_2$) WHEN BOTH VARIABLES ARE CONTINUOUS AND MILDLY CORRELATED ($\rho$=0.20)
Overall Probability of Disease p=0.10

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

I. Exposure Effect $\beta_1$=0 (RR=1)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)        -0.003         -0.003          0.004         -0.008          0.002         -0.006
   Sample std. dev. ($\hat\beta_1$)    0.144          0.147          0.176          0.175          0.165          0.169
   Mean square error 3/                0.021          0.022          0.031          0.031          0.027          0.029

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         0.140          0.006          0.012         -0.002          0.008          0.006
   Sample std. dev. ($\hat\beta_1$)    0.140          0.144          0.182          0.178          0.167          0.176
   Mean square error 3/                0.039          0.021          0.033          0.032          0.028          0.031

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         0.205          0.005          0.010          0.007          0.005         -0.002
   Sample std. dev. ($\hat\beta_1$)    0.139          0.144          0.200          0.201          0.181          0.189
   Mean square error 3/                0.061          0.021          0.040          0.040          0.033          0.036

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
TABLE 4.4
(continued)

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

II. Exposure Effect $\beta_1$=ln(2)=0.693 (RR=2)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)         0.696          0.698          0.744          0.716          0.725          0.714
   Sample std. dev. ($\hat\beta_1$)    0.142          0.144          0.207          0.192          0.179          0.174
   Mean square error 3/                0.020          0.021          0.045          0.037          0.033          0.031

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         0.796          0.696          0.745          0.703          0.724          0.719
   Sample std. dev. ($\hat\beta_1$)    0.136          0.145          0.228          0.196          0.193          0.195
   Mean square error 3/                0.029          0.021          0.055          0.039          0.038          0.039

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         0.811          0.700          0.748          0.714          0.727          0.700
   Sample std. dev. ($\hat\beta_1$)    0.131          0.142          0.240          0.237          0.204          0.198
   Mean square error 3/                0.031          0.020          0.061          0.057          0.043          0.039

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
TABLE 4.4
(continued)

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

III. Exposure Effect $\beta_1$=ln(3)=1.099 (RR=3)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)         1.104          1.108          1.173          1.143          1.147          1.120
   Sample std. dev. ($\hat\beta_1$)    0.146          0.154          0.260          0.254          0.215          0.218
   Mean square error 3/                0.021          0.024          0.073          0.066          0.049          0.048

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         1.168          1.106          1.174          1.143          1.144          1.125
   Sample std. dev. ($\hat\beta_1$)    0.145          0.152          0.269          0.267          0.219          0.220
   Mean square error 3/                0.026          0.023          0.078          0.073          0.050          0.049

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         1.137          1.106          1.169          1.160          1.141          1.110
   Sample std. dev. ($\hat\beta_1$)    0.141          0.153          0.283          0.296          0.229          0.243
   Mean square error 3/                0.021          0.023          0.085          0.091          0.054          0.059

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
Table 4.5
EFFICIENCY OF THE UNADJUSTED ANALYSIS RELATIVE TO THE
ADJUSTED ANALYSIS IN THE FULL-COHORT DESIGN WHEN THE
EXPOSURE AND COVARIATE ARE CONTINUOUS AND
MILDLY CORRELATED ($\rho$=0.20)
Overall Probability of Disease p=0.10

Exposure             Covariate            Relative
Effect ($\beta_1$)   Effect ($\beta_2$)   Efficiency
0                    0                    1.05
0                    ln(2)                0.54
0                    ln(3)                0.34

ln(2)                0                    1.05
ln(2)                ln(2)                0.72
ln(2)                ln(3)                0.65

ln(3)                0                    1.14
ln(3)                ln(2)                0.88
ln(3)                ln(3)                1.10
Table 4.6
EFFICIENCY OF THE HYBRID DESIGNS RELATIVE TO THE ADJUSTED
ESTIMATES OF THE FULL-COHORT DESIGN WHEN THE EXPOSURE
AND COVARIATE ARE CONTINUOUS AND MILDLY CORRELATED ($\rho$=0.20)
Overall Probability of Disease p=0.10

Matching   Subcohort   Exposure             Covariate                             Synthetic
Ratio      Fraction    Effect ($\beta_1$)   Effect ($\beta_2$)   Case-Cohort 1/   Case-Control 2/
1:3        0.271       0                    0                    0.71             0.71
                       0                    ln(2)                0.64             0.66
                       0                    ln(3)                0.53             0.53

                       ln(2)                0                    0.47             0.57
                       ln(2)                ln(2)                0.38             0.54
                       ln(2)                ln(3)                0.33             0.35

                       ln(3)                0                    0.33             0.36
                       ln(3)                ln(2)                0.29             0.32
                       ln(3)                ln(3)                0.27             0.25

1:5        0.409       0                    0                    0.81             0.76
                       0                    ln(2)                0.75             0.68
                       0                    ln(3)                0.64             0.58

                       ln(2)                0                    0.64             0.68
                       ln(2)                ln(2)                0.55             0.54
                       ln(2)                ln(3)                0.47             0.51

                       ln(3)                0                    0.49             0.50
                       ln(3)                ln(2)                0.46             0.47
                       ln(3)                ln(3)                0.43             0.40

1/ MSE(Adjusted Full-Cohort Estimate)/MSE(Case-Cohort Estimate)
2/ MSE(Adjusted Full-Cohort Estimate)/MSE(Synthetic Case-Control Estimate)
TABLE 4.7
SIMULATION SUMMARY STATISTICS FOR ESTIMATING THE EXPOSURE EFFECT ($\beta_1$)
WHILE ADJUSTING FOR A COVARIATE EFFECT ($\beta_2$) WHEN BOTH VARIABLES
ARE CONTINUOUS AND MODERATELY CORRELATED ($\rho$=0.40)
Overall Probability of Disease p=0.10

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

I. Exposure Effect $\beta_1$=0 (RR=1)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)        -0.002         -0.001          0.008         -0.006          0.005         -0.005
   Sample std. dev. ($\hat\beta_1$)    0.143          0.157          0.188          0.188          0.175          0.181
   Mean square error 3/                0.020          0.025          0.035          0.035          0.031          0.033

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         0.277          0.008          0.018          0.003          0.011          0.009
   Sample std. dev. ($\hat\beta_1$)    0.139          0.154          0.194          0.210          0.180          0.185
   Mean square error 3/                0.096          0.024          0.038          0.044          0.033          0.034

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         0.408          0.002          0.013         -0.004          0.004         -0.008
   Sample std. dev. ($\hat\beta_1$)    0.136          0.151          0.211          0.224          0.190          0.194
   Mean square error 3/                0.185          0.023          0.045          0.050          0.036          0.038

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
TABLE 4.7
(continued)

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

II. Exposure Effect $\beta_1$=ln(2)=0.693 (RR=2)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)         0.697          0.700          0.748          0.722          0.729          0.714
   Sample std. dev. ($\hat\beta_1$)    0.143          0.153          0.221          0.219          0.194          0.178
   Mean square error 3/                0.020          0.023          0.052          0.049          0.039          0.032

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         0.931          0.697          0.751          0.723          0.727          0.704
   Sample std. dev. ($\hat\beta_1$)    0.140          0.158          0.247          0.235          0.211          0.202
   Mean square error 3/                0.076          0.025          0.064          0.056          0.046          0.041

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         1.009          0.699          0.751          0.728          0.727          0.713
   Sample std. dev. ($\hat\beta_1$)    0.132          0.151          0.255          0.257          0.218          0.226
   Mean square error 3/                0.117          0.023          0.068          0.067          0.049          0.051

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
TABLE 4.7
(continued)

Estimation Procedure                  Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                                      (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                                   (adjusted)2/                  (adjusted)2/
Expected Subjects
Requiring Covariate Data              500            500            172            172            234            234

III. Exposure Effect $\beta_1$=ln(3)=1.099 (RR=3)

a. Covariate Effect $\beta_2$=0
   Sample mean ($\hat\beta_1$)         1.101          1.105          1.176          1.133          1.149          1.124
   Sample std. dev. ($\hat\beta_1$)    0.145          0.160          0.268          0.246          0.223          0.222
   Mean square error 3/                0.021          0.026          0.078          0.062          0.052          0.050

b. Covariate Effect $\beta_2$=ln(2)
   Sample mean ($\hat\beta_1$)         1.301          1.102          1.174          1.137          1.143          1.115
   Sample std. dev. ($\hat\beta_1$)    0.150          0.163          0.283          0.287          0.232          0.232
   Mean square error 3/                0.063          0.027          0.086          0.084          0.056          0.054

c. Covariate Effect $\beta_2$=ln(3)
   Sample mean ($\hat\beta_1$)         1.334          1.105          1.174          1.152          1.141          1.098
   Sample std. dev. ($\hat\beta_1$)    0.149          0.166          0.300          0.344          0.239          0.239
   Mean square error 3/                0.078          0.028          0.096          0.121          0.059          0.057

1/ The unadjusted model includes only one variable, $X_1$, for exposure level.
2/ The adjusted model includes the covariate $X_2$ in addition to exposure level. The simulated data were generated
   corresponding to a hazard function based on both $X_1$ and $X_2$.
3/ Mean square error: MSE = $(\beta_1 - \hat\beta_1)^2 + [\mathrm{std.\ dev.}(\hat\beta_1)]^2$.
Table 4.8
EFFICIENCY OF THE UNADJUSTED ANALYSIS RELATIVE TO THE
ADJUSTED ANALYSIS IN THE FULL-COHORT DESIGN WHEN THE
EXPOSURE AND COVARIATE ARE CONTINUOUS AND
MODERATELY CORRELATED ($\rho$=0.40)
Overall Probability of Disease p=0.10

Exposure             Covariate            Relative
Effect ($\beta_1$)   Effect ($\beta_2$)   Efficiency
0                    0                    1.25
0                    ln(2)                0.25
0                    ln(3)                0.12

ln(2)                0                    1.15
ln(2)                ln(2)                0.33
ln(2)                ln(3)                0.20

ln(3)                0                    1.24
ln(3)                ln(2)                0.43
ln(3)                ln(3)                0.36
CHAPTER V
APPLICATION OF SUBSAMPLING STRATEGIES TO RESEARCH DATA
5.1 Introduction
In the preceding chapters, the performance of the case-cohort
and synthetic case-control designs has been explored using simulated data
with exponentially-distributed failure times and known parameters. These
simulations were useful in evaluating the relative efficiency of the hybrid
designs under various conditions. In this chapter, the subsampling
strategies will be applied to research data from two studies of
cardiovascular disease (CVD). The first example examines data collected as
part of the Lipid Research Clinics (LRC) Follow-up Study (Jacobs et al.,
1990). The second example utilizes a CVD data bank collected over a 15-year
period at Duke University Medical Center (Califf et al., 1989). Both
studies were originally conducted using the traditional full-cohort design.
However, for the purpose of this illustration, the case-cohort and synthetic
case-control designs will be applied retrospectively to the research data.
In contrast to the simulations described in Chapters II-IV, the
models examined in each example will include multiple covariates, both
discrete and continuous. In addition, as a result of staggered entry into
both studies, the length of follow-up varies considerably among the subjects.
Another important difference between the simulated data and these
examples is the occurrence of tied failure times. Ties were nonexistent
among the simulated failure times since they were generated from the
continuous exponential distribution. However, in both studies, failure times
were recorded to the nearest day which allowed for the possibility of more
than one person experiencing the event at the same time. This was
particularly true for the sicker patients participating in the Duke study who
were undergoing treatment for coronary artery disease.
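The way in which day-level recording introduces ties can be illustrated with a small simulation. This is only a sketch under assumed conditions: the mean failure time of 100 days, the sample size of 300, and the seed are hypothetical choices made for illustration and are not taken from either study.

```python
import random
from collections import Counter


def count_tied_times(times):
    """Number of observations that share a failure time with another one."""
    counts = Counter(times)
    return sum(c for c in counts.values() if c > 1)


rng = random.Random(1990)  # fixed seed so the sketch is reproducible

# Hypothetical continuous exponential failure times (mean of 100 days).
continuous_times = [rng.expovariate(1 / 100.0) for _ in range(300)]

# The same times as they would be recorded to the nearest day.
recorded_times = [round(t) for t in continuous_times]

print("Tied observations, continuous times:", count_tied_times(continuous_times))
print("Tied observations, recorded to the day:", count_tied_times(recorded_times))
```

Any method applied to the recorded times therefore needs a rule for handling ties in the partial likelihood, which was unnecessary for the simulated data.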
5.2 Description of Data from the Lipid Research Clinics Follow-up Study
The data used in this illustration include observations on a
random sample of 2,284 male participants between the ages of 30 and 82
who were followed for the outcome of CVD death. Details of the Follow-up
Study design and data collected can be found elsewhere (Jacobs et al., 1990).
In addition to the outcome CVD death, age (years), body mass index
(kg/m$^2$), systolic blood pressure (mmHg), smoking status (yes, no), total
cholesterol (mg/dl), high density lipoprotein (HDL) cholesterol (mg/dl), and
exercise tolerance test (ETT) outcome (positive, negative) were also recorded
for each patient. The primary interest in this example is the relationship
between a positive ETT and CVD mortality, controlling for the remaining
covariates mentioned above.
Of the 2,284 patients followed for the outcome, only 94 (4%) died
from CVD. As explained in Chapter I, much of the covariate information
collected on this cohort is redundant due to the low CVD mortality rate.
The subsampling strategies are potentially useful in this situation since
they minimize the large amount of covariate information assembled on
subjects who fail to experience the outcome of interest. In particular, it
would be useful in this study if blood samples for the cholesterol
measurements were only evaluated for a subsample of the full cohort. Total
cholesterol has been shown to have a positive association with CVD
mortality, whereas HDL cholesterol is known to be inversely related to this
outcome. Thus it is important to consider these measurements when
evaluating the effects of a positive ETT in this context. Although the
procedure for determining total and HDL cholesterol values is not an
expensive one, there would still be a modest savings involved if the number
of assays could be reduced by 75 percent or more.
5.3 Comparison of Results from the Full-Cohort and Hybrid Designs
The purpose of this example is to demonstrate the application of
subsampling strategies to the LRC Follow-up data and to evaluate their
ability to estimate the adjusted relative risk of CVD mortality associated
with a positive ETT. Since all data from the Follow-up study were collected
on the entire cohort, it is possible to perform a full-cohort analysis of the
relationship between ETT and CVD mortality adjusting for the other
covariates (including total and HDL cholesterol). An unadjusted full-cohort
model which includes all the covariates except the two cholesterol
measurements is also considered here for illustrative purposes. This is the
model which would have to be used if a traditional full-cohort design were
utilized yet budget constraints did not allow for the collection of cholesterol
measurements. Finally, the adjusted relative risk of a positive ETT is also
estimated under the case-cohort and synthetic case-control designs. Note
that in addition to a reduction in the number of necessary cholesterol
values, additional savings may be realized from the other covariates as well.
In this example, it may be possible to have each patient complete the ETT
at baseline, but only evaluate the test for the selected subsample.
As with the simulations, case-to-control matching ratios of 1:3
and 1:5 are utilized here. The subcohort sampling fractions were calculated
as described in Chapter II, with an overall probability of death of p=0.04.
Thus for the matching ratio of s=3 controls to each case, a sampling fraction
of $1-(1-p)^s = 1-(1-0.04)^3 = 0.115$ was used, producing a subcohort of size 263.
Similarly, the sampling fraction for the 1:5 matching ratio was calculated to
be 0.185, yielding a subcohort of size 423. These subcohorts, along with the
cases occurring among the remaining patients, yielded an expected total of
348 and 503 subjects, respectively, for whom covariate data were required.
Allowing for the probability of a subject being included in the synthetic
case-control analysis more than once, the expected number of distinct
persons when s=3 controls were matched to each case is
$n[1-(1-p)^{s+1}] = 2284[1-(1-0.04)^{3+1}] = 344$, which is slightly less than the 350 actually
selected in this example. For the larger synthetic case-control design (i.e.,
1:5 matching ratio), 496 distinct subjects were expected to be chosen which
is also slightly less than the 509 subjects in the example. These differences
are merely due to sampling variation and do not represent any bias in the
selection process. It should be pointed out that the overall probability of
death of 4 percent is much less than the simulated disease rate of 10
percent. It is also noteworthy that the probability of exposure (i.e., a
positive ETT) is 5 percent in this example versus 50 percent in the
simulated data.
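The sample size calculations in this section can be reproduced with a few lines of code. The sketch below assumes the two expressions quoted above, namely the subcohort sampling fraction 1-(1-p)^s and the expected number of distinct synthetic case-control subjects n[1-(1-p)^(s+1)]. The function names are illustrative, and rounding the sampling fraction to three decimals before computing the subcohort size is an assumption made here to match the reported sizes.

```python
def subcohort_fraction(p, s):
    """Case-cohort sampling fraction 1 - (1 - p)^s for s controls per case."""
    return 1 - (1 - p) ** s


def expected_distinct_subjects(n, p, s):
    """Expected distinct subjects in the synthetic case-control design,
    n * [1 - (1 - p)^(s + 1)], as quoted in the text."""
    return n * (1 - (1 - p) ** (s + 1))


n, p = 2284, 0.04  # LRC cohort size and overall probability of CVD death

for s in (3, 5):
    fraction = round(subcohort_fraction(p, s), 3)
    # Assumption: the subcohort size is computed from the rounded fraction.
    subcohort_size = round(n * fraction)
    distinct = round(expected_distinct_subjects(n, p, s))
    print(f"1:{s}  fraction={fraction:.3f}  subcohort={subcohort_size}  "
          f"expected distinct (synthetic)={distinct}")
```

The same functions with p=0.10 give the sampling fraction of 0.271 used for the 1:3 matching ratio in the simulations.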
Results from the application of the various designs to the LRC
data are presented in Table 5.1. Under the full-cohort design, the adjusted
relative risk of a positive ETT is 2.2 with a 95% confidence interval of [1.3,
3.7]. Thus the risk of dying from CVD is over twice as great for patients
with a positive ETT. If total and HDL cholesterol are not taken into
consideration, the unadjusted relative risk is 2.7, thus implying that these
variables are indeed confounders. The standard error of the ETT estimate,
s.e.($\hat\beta$)=0.265, is unaffected by the adjustment.
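The reported confidence interval can be recovered from the relative risk and its standard error. The sketch below assumes that the interval is a Wald-type interval constructed on the log relative risk scale, which is the usual approach for the Cox model but is not stated explicitly in the text. The function name is illustrative, and the inputs are the rounded values quoted above.

```python
import math


def wald_interval(relative_risk, std_error, z=1.96):
    """Confidence interval for a relative risk, built on the log scale.

    std_error is the standard error of the log relative risk (beta-hat).
    """
    log_rr = math.log(relative_risk)
    lower = math.exp(log_rr - z * std_error)
    upper = math.exp(log_rr + z * std_error)
    return lower, upper


# Adjusted full-cohort estimate for a positive ETT: RR=2.2, s.e.=0.265.
lower, upper = wald_interval(2.2, 0.265)
print(f"95% confidence interval: [{lower:.1f}, {upper:.1f}]")
```

Under this assumption the computed limits agree with the interval of [1.3, 3.7] given for the full-cohort design.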
The adjusted estimate of the relative risk under the smaller
case-cohort design is significantly greater than the full-cohort estimate (3.0
versus 2.2). Even more noteworthy is the 80 percent increase in the
standard error from 0.265 to 0.483 when this subsampling strategy is
applied to the data. On the positive side, the case-cohort estimate does fall
within the 95% confidence limits under the full-cohort design. Furthermore,
the significant association between ETT and CVD mortality would still be
identified using only 15 percent of the original data. The results from the
1:3 synthetic case-control design are not as promising. The estimate of the
relative risk (2.7) does fall between those from the full-cohort and case-cohort
designs. However, the standard error of 0.574 represents a 112
percent increase over the full-cohort standard error, thus resulting in a
nonsignificant ETT effect despite a relative risk of 2.7. Whittemore (1981)
noted that the synthetic case-control design is generally less efficient when
the prevalence of the risk factor is low, as is the case for these data.
Breslow et al. (1983) found that increasing the number of controls in this
situation has a desirable effect on the size of the standard errors. These
theoretical results are supported by empirical observations in Table 5.1.
The discrepancy in the size of the standard errors under the hybrid designs
disappears when 5 controls are matched to each case, yet these values
(0.416 and 0.420) still represent over a 50 percent increase from the full-cohort
design. The estimate of relative risk under the synthetic case-control
design is just under the full-cohort estimate (2.1 versus 2.2). The case-cohort
estimate is still moderately larger at 2.7. However, with standard
errors of this size, only the case-cohort relative risk is significantly different
from unity.
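The significance statements for the hybrid designs can be checked with a simple Wald statistic, z = ln(RR)/s.e., compared against 1.96. The following sketch uses the rounded values quoted in this section. The text does not state which of the two 1:5 standard errors belongs to which design, so assigning 0.416 to the case-cohort design and 0.420 to the synthetic case-control design is an assumption; the conclusions are the same if the two are exchanged. Percentages computed from these rounded inputs may differ slightly from those quoted.

```python
import math

FULL_COHORT_SE = 0.265  # adjusted full-cohort standard error from the text

# (design, relative risk, standard error); the 1:5 pairing is an assumption.
designs = [
    ("Case-cohort 1:3", 3.0, 0.483),
    ("Synthetic case-control 1:3", 2.7, 0.574),
    ("Case-cohort 1:5", 2.7, 0.416),
    ("Synthetic case-control 1:5", 2.1, 0.420),
]


def wald_z(relative_risk, std_error):
    """Wald statistic for testing RR = 1 on the log scale."""
    return math.log(relative_risk) / std_error


def percent_increase(std_error, reference=FULL_COHORT_SE):
    """Percent increase in standard error over the full-cohort design."""
    return 100 * (std_error - reference) / reference


for name, rr, se in designs:
    z = wald_z(rr, se)
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"{name:28s} z={z:4.2f} ({verdict}), "
          f"s.e. increase={percent_increase(se):.0f}%")
```

With these inputs, both case-cohort estimates exceed the critical value while both synthetic case-control estimates do not.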
Results from the traditional full-cohort adjusted analysis
indicate that ETT outcome is an independent predictor of CVD mortality.
The same conclusion would have been reached without adjusting for HDL
and total cholesterol, although the study design may have been subjected to
criticism given the confounding nature of these variables. The hybrid
designs permitted adjustment for HDL and total cholesterol with covariate
information being required for only 15 or 22 percent of the cohort,
depending on the matching ratio. It cannot be said that the estimate of
relative risk from any of the designs is too large or small since the true
value of this parameter is unknown. Ironically, for this example, if the
choice of subsampling strategy were to be based strictly on which design
yielded the relative risk estimate closest to the full-cohort estimate, then
the synthetic case-control design with 5 controls per case would be selected.
However, due to the relatively large standard error, it would not be
concluded that ETT outcome is a predictor of CVD mortality under this
design. The case-cohort subsampling strategy would have obtained the
same result as the full-cohort design, regardless of the matching ratio,
although its estimates are considerably larger than those of the full
cohort.
5.4 Description of Data from the Duke University Medical Center
Cardiovascular Disease Data Bank
The Duke CVD data bank consists of observations from 5,809
patients who underwent cardiac catheterization at Duke University Medical
Center between November 1969 and December 1984. Details of this study
are described elsewhere (Califf et al., 1989). A subset of this population,
consisting of 3,192 patients who were medically (as opposed to
surgically) treated for coronary artery disease, comprises the cohort used for
this example. Although the data were collected over a 15-year period,
survival times were truncated after 2.5 years of follow-up in order to obtain
a 10 percent mortality rate for the purposes of this illustration.
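Truncating follow-up in this way amounts to administrative censoring at a fixed cutoff. A minimal sketch of the operation is given below, assuming that follow-up times are measured in years and that events are coded 1 for death and 0 for censoring. The function name and the four example records are hypothetical and are not taken from the Duke data bank.

```python
def truncate_follow_up(times, events, cutoff):
    """Administratively censor follow-up at `cutoff`.

    Subjects followed beyond the cutoff are censored at the cutoff;
    events occurring at or before the cutoff are retained.
    """
    new_times, new_events = [], []
    for time, event in zip(times, events):
        if time > cutoff:
            new_times.append(cutoff)
            new_events.append(0)  # event, if any, falls outside the window
        else:
            new_times.append(time)
            new_events.append(event)
    return new_times, new_events


# Hypothetical follow-up times (years) and event indicators (1 = death).
times = [0.5, 3.0, 2.5, 10.0]
events = [1, 1, 1, 0]

truncated_times, truncated_events = truncate_follow_up(times, events, 2.5)
mortality = sum(truncated_events) / len(truncated_events)
print(truncated_times, truncated_events, f"mortality={mortality:.2f}")
```

Shortening the window reduces the proportion of subjects who experience the event, which is how a target mortality rate such as 10 percent can be obtained.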
This data bank is unique in that it includes a composite variable
referred to as coronary artery disease (CAD) severity index which, although
expensive to obtain, is thought to provide a comprehensive assessment of a
patient's condition and therefore be particularly beneficial in predicting his
outcome. In addition to CVD mortality and CAD severity index, the
presence of coronary artery obstruction is also known for each patient. This
information has previously been shown to be a very strong predictor of CVD
mortality. The objective of this example is to determine through the use of
the full-cohort design whether CAD severity index is more prognostic than
coronary artery obstruction, and whether these results can be duplicated
using the subsampling strategies. CAD severity index is a discrete, interval
measurement which ranges from 23 to 100 in this data set. Four levels of
obstruction ($\geq$75% stenosis in the artery) are considered here: 1-, 2-, 3-
vessel disease and left main disease. Three indicator variables are used to
model this information, rather than assuming linearity among the levels.
The unadjusted full-cohort model includes the three indicator variables for
coronary artery obstruction, as well as several other extraneous factors
which are normally controlled for in this data set: age, left ventricular
ejection fraction, and indices of myocardial damage, vascular disease and
pain. The adjusted full-cohort model includes CAD severity index in
addition to these variables. The nature of this example is slightly different
from the previous example and simulations in that it is of interest here to
estimate the effect of CAD severity index, and not simply adjust for this
expensive covariate.
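The coding of the four obstruction levels by three indicator variables can be sketched as follows. The choice of 1-vessel disease as the reference level is an assumption made for illustration, since the text does not state which level served as the referent, and the level labels and function name are likewise hypothetical.

```python
# Four levels of coronary artery obstruction considered in the example.
LEVELS = ("1-vessel", "2-vessel", "3-vessel", "left main")


def obstruction_indicators(level, reference="1-vessel"):
    """Return three 0/1 indicators for a four-level obstruction variable.

    The reference level (assumed here to be 1-vessel disease) is coded
    as all zeros, so no linear ordering of the levels is imposed.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown obstruction level: {level!r}")
    return [1 if level == other else 0
            for other in LEVELS if other != reference]


for level in LEVELS:
    print(f"{level:10s} -> {obstruction_indicators(level)}")
```

Each indicator then receives its own coefficient in the model, so the estimated risk for each level is free to differ from the reference by any amount.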
5.5 Assessment of Prognostic Importance of CAD Severity Index Under
the Full-Cohort and Hybrid Designs
Since the survival times in this example were truncated to yield
the same overall probability of death as in the simulated data (p=0.10), the
case-cohort sampling fractions calculated in Section 2.2 apply here as well.
However, the sizes of the full-cohort and subsamples are considerably larger
for this example than for the simulated data. With three controls matched
to each case, the case-cohort sampling fraction was calculated to be 0.271,
which along with the cases occurring outside the subcohort, yields 1096
patients for whom covariate information is necessary. The sampling
fraction is 0.410 for the larger case-cohort design, which translates into
1494 subjects requiring covariate data. Using the equations derived by
Langholz and Thomas (1989), the expected numbers of distinct subjects in
the synthetic case-control designs with 1:3 and 1:5 matching ratios are 1098
and 1496, respectively. These sample sizes are very close to the 1090 and
1498 patients that were actually selected in this example. Thus the
subsampling strategies use approximately 35 and 50 percent of the
information available in the full-cohort when 3 and 5 controls, respectively,
are matched to each case.
Estimates of the effect of coronary artery obstruction ($\beta_1$, $\beta_2$,
and $\beta_3$) from the unadjusted full-cohort analysis are presented in the first
column of Table 5.2. It appears from these estimates that the risk of CVD
death increases significantly with the number of occluded vessels, and
especially with left main disease. The likelihood ratio test for the inclusion
of these three indicator variables in the model is clearly significant ($\chi^2$=101,
p<0.001). However, when CAD severity index is added to the model,
coronary artery obstruction is no longer prognostic of CVD mortality. This
implies that all relevant information regarding obstruction is incorporated
into the severity index, making it the better predictor. The estimate of relative
risk associated with the index, although significant, is close to unity
(RR=1.05). However, it should be noted that this is the risk associated with
a one point increase in a 100-point index. It may be more useful to interpret
a 20 point difference in the severity index as being associated with a 150
percent increase in risk (i.e., RR=2.5).
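Under the log-linear form of the Cox model, the relative risk for a k-point difference in a covariate is the per-point relative risk raised to the power k. The sketch below applies this to the severity index. It uses the rounded value RR=1.05 quoted above, and the suggestion that the quoted 20-point relative risk of 2.5 reflects an unrounded coefficient is an assumption rather than something stated in the text. The function names are illustrative.

```python
import math


def rr_for_difference(rr_per_point, points):
    """Relative risk for a difference of `points` units: exp(beta * points)."""
    return rr_per_point ** points


def rr_per_point_from(rr_total, points):
    """Per-point relative risk implied by a relative risk over `points` units."""
    return math.exp(math.log(rr_total) / points)


# 20-point relative risk implied by the rounded per-point value of 1.05.
rr_20_points = rr_for_difference(1.05, 20)

# Per-point relative risk implied by the quoted 20-point value of 2.5.
implied_rr_per_point = rr_per_point_from(2.5, 20)

print(f"RR for a 20-point difference from RR=1.05: {rr_20_points:.2f}")
print(f"Per-point RR implied by RR=2.5 over 20 points: {implied_rr_per_point:.3f}")
```

The per-point value implied by RR=2.5 rounds to 1.05, so the two figures quoted in the text are consistent to the reported precision.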
The results from the adjusted model under the hybrid designs
are also given in Table 5.2. It is noteworthy that coronary artery
obstruction is not a significant predictor of CVD mortality under either of
the hybrid designs, regardless of the number of controls used. Thus the
importance of CAD severity index is also realized by these subsampling
strategies which use less than 50 percent of the data. In general, the
estimates of relative risk associated with the indicator variables for
obstruction are somewhat larger under the hybrid designs than the full-cohort
design, particularly for the 1:3 matching ratio. Likewise, the
estimates of standard errors for these variables are also larger than when
all the data are utilized. The standard errors decrease with an increase in
the size of the referent group. There does not appear to be any consistent
trend in the performance of one hybrid design over the other.
Regarding the estimates of the effect of CAD severity index, the
smaller hybrid designs yield slightly lower estimates than the full-cohort
design. However, when the referent group is increased, the estimates are
nearly identical to those of the full-cohort design. As expected, the size of
the standard error is slightly greater under the hybrid designs, with the
larger synthetic case-control design yielding the standard error closest to
the full cohort. Interestingly, there was no decrease in the standard error
associated with augmenting the size of the case-cohort design. All four
subsampling strategies were able to replicate the findings of the full-cohort
design, despite the relative risk being so close to unity.
5.6 Summary of Results from Research Data
The two examples presented here provide an interesting contrast
with respect to the performance of the hybrid designs. In the first example,
the standard errors of the ETT estimate under the hybrid designs ranged
from 50 percent to 112 percent larger than the full-cohort standard error for
the 1:3 and 1:5 matching ratio, respectively. In addition, the estimates of
relative risk for a positive ETT varied considerably over the different
designs. Although the 1:5 synthetic case-control design produced the
estimate of relative risk closest to the full-cohort estimate, its large
standard error produced a wide 95% confidence interval which included
unity. The case-cohort design detected significance in the relative risk
estimate using both the smaller and larger referent groups; however, the
estimates themselves were also considerably greater than that of the full
cohort. In contrast, the applications of each of the subsampling strategies to
the Duke CVD data bank resulted in the same conclusions reached under
the traditional full-cohort design. The estimates of relative risk were very
similar under the different designs and, more importantly, the standard
error ranged from only 20 to 45 percent larger than the full-cohort standard
error.
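The Duke figures can be checked against the s.e.(β4) values reported in Table 5.2. The sketch below (Python; the variable and function names are illustrative, not part of the original analysis) expresses each hybrid-design standard error as a percent increase over the full-cohort value.

```python
full_cohort_se = 0.009  # s.e.(β4), full-cohort adjusted model (Table 5.2)

# s.e.(β4) reported for the four hybrid designs in Table 5.2
hybrid_se = [0.013, 0.013, 0.013, 0.011]

def percent_larger(se, reference):
    """Percent by which `se` exceeds `reference`."""
    return 100.0 * (se / reference - 1.0)

increases = [percent_larger(se, full_cohort_se) for se in hybrid_se]
for se, inc in zip(hybrid_se, increases):
    print(f"s.e. = {se:.3f}: {inc:.0f}% larger than the full-cohort s.e.")
```

The inputs are the rounded table entries, so the percentages are only approximate.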
In addition to the possibility that the samples selected in the
hybrid designs may not have been representative of the cohort, there are
several differences in the data of the two examples which may be
responsible for these observations. The LRC Follow-up Study has about
two-thirds the number of patients as the Duke data bank. With the overall
probability of death being lower in the LRC study, the relative sizes of the
subsamples are considerably smaller as well. Finally, the primary variable
of interest in the LRC data was a dichotomous variable with a 5 percent
probability of exposure (one-tenth of the probability of exposure in the
simulations). The principal variable in the Duke Study was a discrete
interval variable ranging from 23 to 100. Although it is not possible, based
on two examples, to determine which one or combination of these factors
may be responsible for the differences seen here, it is clear that the
performance of the hybrid designs can vary considerably depending on the
nature of the data being explored. There is considerable room for future
research in this area.
TABLE 5.1
ESTIMATES OF THE EFFECT OF EXERCISE TOLERANCE TEST (ETT) OUTCOME (β)
ON CARDIOVASCULAR DISEASE MORTALITY IN THE LIPID RESEARCH CLINICS FOLLOW-UP STUDY

                                                    Estimation Procedure
                            Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                            (unadjusted)1/ (adjusted)2/   (adjusted)2/   Case-Control   (adjusted)2/   Case-Control
                                                                         (adjusted)2/                  (adjusted)2/
Subjects Requiring
Covariate Data              2284           2284           348            350            503            509
ETT Estimate β              0.999          0.795          1.083          0.978          1.007          0.742
s.e. (β)                    0.265          0.265          0.483          0.574          0.416          0.420
Relative Risk               2.7            2.2            3.0            2.7            2.7            2.1
95% Confidence Interval3/   (1.6, 4.6)     (1.3, 3.7)     (1.1, 7.6)     (0.9, 8.2)     (1.2, 6.2)     (0.9, 4.8)

1/ The unadjusted model includes age, body mass index, systolic blood pressure, and smoking status, as well as ETT.
2/ The adjusted model includes total and HDL cholesterol levels in addition to the covariates listed in footnote 1.
3/ 95% Confidence Interval: exp[β ± 1.96(s.e.(β))].
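The relative risk and confidence interval rows of Table 5.1 follow from the estimate and its standard error through the formula in footnote 3. A short Python sketch of that calculation is given below; the function name and the row labels are illustrative only, and the two pairs of hybrid-design columns are distinguished here by their number of subjects.

```python
import math

def rr_with_ci(beta, se, z=1.96):
    """Return (relative risk, lower limit, upper limit) as exp[beta ± z * se]."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

# (label, beta, s.e.) copied from Table 5.1, in column order
table_5_1 = [
    ("Full-Cohort (unadjusted)",        0.999, 0.265),
    ("Full-Cohort (adjusted)",          0.795, 0.265),
    ("Case-Cohort, n=348",              1.083, 0.483),
    ("Synthetic Case-Control, n=350",   0.978, 0.574),
    ("Case-Cohort, n=503",              1.007, 0.416),
    ("Synthetic Case-Control, n=509",   0.742, 0.420),
]

results = {label: rr_with_ci(beta, se) for label, beta, se in table_5_1}
for label, (rr, lo, hi) in results.items():
    flag = "includes unity" if lo < 1.0 < hi else "excludes unity"
    print(f"{label:32s} RR={rr:.1f}  CI=({lo:.1f}, {hi:.1f})  {flag}")
```

An interval whose lower limit falls below 1 corresponds to a relative risk that is not significant at the 5 percent level.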
TABLE 5.2
ESTIMATES OF THE EFFECT OF CORONARY ARTERY OBSTRUCTION (β1, β2, β3)1/ AND
CORONARY ARTERY DISEASE (CAD) SEVERITY INDEX (β4) ON CARDIOVASCULAR DISEASE MORTALITY
IN THE DUKE UNIVERSITY MEDICAL CENTER STUDY

                                                    Estimation Procedure
                            Full-Cohort    Full-Cohort    Case-Cohort    Synthetic      Case-Cohort    Synthetic
                            (unadjusted)2/ (adjusted)3/   (adjusted)3/   Case-Control   (adjusted)3/   Case-Control
                                                                         (adjusted)3/                  (adjusted)3/
Subjects Requiring
Covariate Data              3192           3192           1096           1090           1494           1498

I.   2-Vessel Disease
  β1                        0.808          0.286          0.508          0.466          0.388          0.444
  s.e. (β1)                 0.239          0.257          0.296          0.307          0.278          0.286
  Relative Risk             2.2            1.3            1.7            1.6            1.5            1.6
  95% Confidence Interval4/ (1.4, 3.6)     (0.8, 2.2)     (0.9, 3.0)     (0.9, 2.9)     (0.9, 2.5)     (0.9, 2.7)

II.  3-Vessel Disease
  β2                        1.311          -0.180         -0.013         0.369          -0.283         -0.114
  s.e. (β2)                 0.221          0.350          0.464          0.462          0.423          0.418
  Relative Risk             3.7            0.8            1.0            1.4            0.8            1.1
  95% Confidence Interval4/ (2.4, 5.7)     (0.4, 1.7)     (0.4, 2.5)     (0.6, 3.6)     (0.3, 1.7)     (0.4, 2.0)

III. Left Main Disease5/
  β3                        2.393          0.092          0.574          1.008          -0.033         0.323
  s.e. (β3)                 0.254          0.498          0.743          0.731          0.673          0.644
  Relative Risk             10.9           1.1            1.8            2.7            1.0            1.4
  95% Confidence Interval4/ (6.7, 1589.9)  (0.4, 2.9)     (0.4, 7.6)     (0.7, 11.5)    (0.3, 3.6)     (0.4, 4.9)

IV.  CAD Severity Index
  β4                        -              0.046          0.038          0.034          0.049          0.047
  s.e. (β4)                 -              0.009          0.013          0.013          0.013          0.011
  Relative Risk             -              1.05           1.04           1.03           1.05           1.05
  95% Confidence Interval4/ -              (1.03, 1.07)   (1.01, 1.07)   (1.01, 1.06)   (1.02, 1.08)   (1.03, 1.07)

1/ Three dummy variables are used here to define coronary artery obstruction:
      If Left Main Disease, then X3=1;
      Else if 3-Vessel Disease, then X2=1;
      Else if 2-Vessel Disease, then X1=1;
   Single-Vessel Disease is implied by all three dummy variables being equal to zero.
2/ The unadjusted model includes age, left ventricular ejection fraction, myocardial damage index, pain index, and vascular
   disease index, as well as the three indicator variables for degree of coronary artery obstruction.
3/ The adjusted model includes coronary artery disease (CAD) severity index in addition to the covariates listed in footnote 2.
4/ 95% Confidence Interval: exp[β ± 1.96(s.e.(β))].
5/ Left Main Disease is defined as 75% or greater stenosis of the left main coronary artery.
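The hierarchical coding rule in footnote 1 can be written out as a small Python function. This is only a sketch: the function name and its two arguments (`left_main`, a flag for left main disease, and `n_vessels`, the number of diseased vessels) are hypothetical and do not correspond to variable names in the Duke data bank.

```python
def obstruction_dummies(left_main, n_vessels):
    """Return (X1, X2, X3) following the if / else-if rule of footnote 1.

    Left main disease takes precedence over the vessel count, so at most
    one of the three indicators is ever set to 1.
    """
    if left_main:
        return (0, 0, 1)   # X3 = 1: left main disease
    if n_vessels == 3:
        return (0, 1, 0)   # X2 = 1: 3-vessel disease
    if n_vessels == 2:
        return (1, 0, 0)   # X1 = 1: 2-vessel disease
    return (0, 0, 0)       # reference group: single-vessel disease

for left_main, n_vessels in [(False, 1), (False, 2), (False, 3), (True, 3)]:
    print(left_main, n_vessels, obstruction_dummies(left_main, n_vessels))
```

Under this coding each coefficient β1, β2, β3 compares its category with single-vessel disease.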
CHAPTER VI
SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH
6.1 Summary
The primary objective of this research was to evaluate the
efficiency of two subsampling strategies, the case-cohort and synthetic
case-control designs, in estimating the relative risk of exposure when the
outcome is time-to-response. Previous investigation in this area has
exclusively considered the special case of a model containing a single binary
exposure. This dissertation expands on prior research by considering the
addition of an expensive covariate to the model which may also be a
confounder of the true exposure-disease relationship. Attention is given to
the effects of a continuous exposure and covariate as well.
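For orientation, the two subsampling strategies can be sketched in a few lines of Python. The sketch assumes the usual formulations: a case-cohort sample is a random subcohort drawn from the whole cohort plus every case, and a synthetic case-control sample draws a fixed number of controls for each case from the subjects still at risk at that case's failure time. The function names, the toy cohort, and the sampling sizes are all hypothetical, and ties and other practical details are ignored; this is not the simulation code used in this research.

```python
import random

def case_cohort_sample(subjects, subcohort_size, rng):
    """Ids needing covariate data: a random subcohort plus every case."""
    ids = [sid for sid, _, _ in subjects]
    subcohort = set(rng.sample(ids, subcohort_size))
    cases = {sid for sid, _, event in subjects if event}
    return subcohort | cases

def synthetic_case_control_sample(subjects, m, rng):
    """Map each case id to up to m controls drawn from its risk set."""
    sampled = {}
    for case_id, t_case, event in subjects:
        if not event:
            continue
        # Risk set: everyone else still under observation at the failure time.
        at_risk = [sid for sid, t, _ in subjects
                   if t >= t_case and sid != case_id]
        sampled[case_id] = rng.sample(at_risk, min(m, len(at_risk)))
    return sampled

# Toy cohort (hypothetical): exponential failure times, follow-up ends at 5.0.
# Each subject is a tuple (id, observed time, event indicator).
rng = random.Random(1990)
cohort = []
for sid in range(200):
    t = rng.expovariate(0.05)
    cohort.append((sid, min(t, 5.0), t <= 5.0))

cc_ids = case_cohort_sample(cohort, subcohort_size=60, rng=rng)
scc_sets = synthetic_case_control_sample(cohort, m=3, rng=rng)
scc_ids = set(scc_sets) | {c for ctrls in scc_sets.values() for c in ctrls}

print("subjects requiring covariate data, case-cohort:", len(cc_ids))
print("subjects requiring covariate data, synthetic case-control:", len(scc_ids))
```

In both designs covariate data are processed for all cases but only for a fraction of the non-cases, which is the source of the savings relative to the full-cohort design.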
A survey of the existing literature for the hybrid designs is given
in Chapter I, with a general overview of the characteristics of the
traditional cohort and case-control designs from which they were derived.
The clinical trial is also discussed as a special case of the cohort study.
In Chapter II, the simplest model in which both the exposure
and covariate are dichotomous and independent is considered. This is
described as a simulated clinical trial with the binary exposure variable
representing the randomized treatment assignment. Although both hybrid
designs performed relatively well (i.e., approximately 80 percent efficient) in
this situation, it was determined that there was no gain from adjusting for
the covariate in the presence of heavy censoring. Thus, an unadjusted
full-cohort model was determined to be the least expensive and most
straightforward way to analyze these data.
The effects of a positive correlation structure between the binary
exposure and covariate on the efficiency of the hybrid designs are examined
in Chapter III. It was of interest to determine if these subsampling
strategies were capable of detecting and adjusting for a confounder of the
exposure-disease relationship as might be found in an observational study.
In general, it was noted that the hybrid designs were able to effectively
control for the confounder and yield unbiased estimates of the relative risk
function. Both strategies performed consistently regardless of the degree of
association between the exposure and confounder. In fact, the results were
very similar to those of Chapter II in which there was no correlation
between these variables. In addition, the relative efficiency of these designs
was not affected by the strength of the exposure and covariate effects. A
slight improvement was noted when the size of the referent group was
increased.
The performance of the hybrid designs when both the exposure
and covariate are continuous is examined in Chapter IV. The results are
comparable to those using binary data when the exposure and covariate
effects are nonsignificant; however, the relative efficiency of both hybrid
designs deteriorates substantially as these effects become stronger. The
lower relative efficiencies can be related to both bias and poorer precision in
the estimates of relative risk. A significant improvement was noted for both
hybrid designs when the size of the reference group was increased, although
the relative efficiency was still unacceptably low (i.e., 50 percent) when
there were significant effects.
Finally, in Chapter V, the subsampling strategies were applied
to data from two cardiovascular studies: the Lipid Research Clinics (LRC)
Follow-up Study and the Duke University Cardiovascular Disease (CVD)
Data Bank. These results were contrasted with those from the traditional
full-cohort design under which these studies were originally conducted. The
two studies differed from each other and from the simulated data in the size
of the cohort and subsamples and the nature of the primary variable of
interest. While the hybrid designs performed rather poorly on the smaller
LRC data set where both the exposure and the outcome were rare, these
strategies were a convincing alternative in the analysis of the larger Duke
CVD data bank. These contrasting results suggest that the hybrid designs
may be overly sensitive to certain characteristics of the data being analyzed.
6.2 Suggestions for Further Research
Throughout this dissertation, it has become apparent that the
case-cohort and synthetic case-control designs are comparable in the
situations proposed here. Thus the choice between these two designs
depends more on the actual goals of the study and the nature of the data
being collected than on efficiency considerations. It seems that the more
important question is not which subsampling strategy to use, but whether
to use one at all. Both designs performed poorly when the exposure and
covariate were continuous. This area requires additional research, both
theoretical and empirical, in order to determine the reasons for low
precision of the hybrid designs with this type of data. Since both designs
were around 80 percent efficient when the exposure and covariate were
dichotomous, a natural extension of this investigation would be to apply
these strategies to simulated data in which the exposure is binary and the
covariate is continuous and vice versa. The relative efficiencies from these
combinations may offer additional insight to the results in Chapter IV.
Computation of the variance of the parameter estimates under
the case-cohort design is quite complicated. In addition, the extensive
software necessary to perform these analyses is not widely available. It
would be of interest to explore the possibility of using a parametric
distribution, such as exponential or Weibull, to model the hazard function
instead of the semi-parametric proportional hazards model. The use of such
models may facilitate the computation of the variance estimates, as well as
simplify the programming needed to obtain these statistics.
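As a rough illustration of why a parametric model might simplify the variance computation, consider the simplest possible case: an exponential model with a single binary exposure fitted to full-cohort data. The maximum likelihood estimate of each group's hazard is the number of events divided by the person-time at risk, and the large-sample variance of the log hazard ratio has the closed form 1/d1 + 1/d0, where d1 and d0 are the numbers of events in the exposed and unexposed groups. The Python sketch below uses made-up counts and a hypothetical function name; it does not address the subsampling itself, which is the open question raised above.

```python
import math

def exponential_log_hazard_ratio(d1, t1, d0, t0):
    """Log hazard ratio and its standard error under an exponential model.

    d1, d0: numbers of events in the exposed and unexposed groups
    t1, t0: total person-time at risk in each group
    """
    log_hr = math.log((d1 / t1) / (d0 / t0))
    se = math.sqrt(1.0 / d1 + 1.0 / d0)
    return log_hr, se

# Hypothetical counts, for illustration only
log_hr, se = exponential_log_hazard_ratio(d1=30, t1=1000.0, d0=20, t0=1500.0)
lower = math.exp(log_hr - 1.96 * se)
upper = math.exp(log_hr + 1.96 * se)

print(f"log hazard ratio = {log_hr:.3f}, s.e. = {se:.3f}")
print(f"relative risk = {math.exp(log_hr):.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

No iterative fitting or risk-set bookkeeping is needed here, in contrast to the partial likelihood used by the proportional hazards model.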
Another area in need of further attention is how departures from
the proportional hazards assumption affect the performance of the
subsampling strategies, and to what extent these strategies are
more sensitive than the traditional full-cohort design to such departures.
REFERENCES
Armitage, P. and Gehan, E.A. (1974), "Statistical Methods for the
Identification and Use of Prognostic Factors," International Journal
of Cancer, 13, 16-36.
Besag, J.E. (1977), "Efficiency of Pseudolikelihood Estimation for Simple
Gaussian Fields," Biometrika, 64, 616-618.
..
•
Breslow, N.E., Lubin, J.H., Marek, P. and Langholz, B. (1983),
"Multiplicative Models and Cohort Analysis," _Journal of the American
Statistical Association, 78, 1-12.
Breslow, Norman and Patton, Janice (1979), "Case-Control Analysis of
Cohort Studies," in Energy and Health, eds. N.E. Breslow and A.S.
Whittemore, Philadelphia: SIAM, 226-242.
Cain, Kevin C. and Breslow, Norman E. (1988), "Logistic Regression
Analysis and Efncient Design for Two-stage Studies," American
Journal of Epidemiologv. 128, 1198-1206.
Califf, Robert M., Harrell, Frank E., Jr., Lee, Kerry L., Rankin, J. Scott,
Hlatky, Mark A., Mark, Daniel B., Jones, Robert H., Muhlbaier,
Lawrence H., Oldham, H. Newland, Jr., and Pryor, David B. (1989),
"The Evolution of Medical and Surgical Therapy for Coronary Artery
Disease: A 15-Year Perspective," Journal of the American Medical
Association, 261, 2077-2086.
Chastang, Claude, Byar, David, and Piantadosi, Steven (1988), "A Quantitative Study of the Bias in Estimating the Treatment Effect Caused by
Omitting a ,Balanced Covariate in Survival Models," Statistics in
Medicine, 7, 1243-1255.
Ciol, Marcia A., and Self, Steven (1989), "An Extension of the Case-Cohort
Design," Presented at the 149th Annual Meeting of the American
Statistical Association, Washington, D.C., August 6, 1989.
Cox, D.R. (1972), "Regression Models and Life Tables (with discussion),"
Journal of the Royal Statistical Society, B, 34, 187-220.
Davis, C.E. (1990), "Efficient Means of Studying Ancillary Questions in
Clinical Trials," Statistics in Medicine, 9, 97-100.
104
'"
Friedman, G.D. (1987), Primer of Epidemiology, 3rd edition, New York:
McGraw-Hill, Inc., Chapters 7-10.
1/
Friedman, Lawrence M., Furberg, Curt D. and DeMets, David L. (1985),
Fundamentals of Clinical Trials, 2nd Edition, Littleton,
Massachusetts: PSG Publishing Company, Inc., Chapters 4-6,8,14.
Gail, M.H., Wieand, S., and Piantadosi, S. (1984), "Biased Estimates of
Treatment Effect in Randomized Experiments with Nonlinear
Regressions and Omitted Covariates," Biometrika, 71,431-444.
Ibrahim, Michel A. and Spitzer, Walter O. (1979), "The Case-Control Study:
The Problem and the Prospect," Journal of Chronic Diseases, 32, 139144.
Jacobs,. David R., Jr., Mebane, Irma L., Bangdiwala, Shrikant 1., Criqui,
Michael H., and Tyroler, Herman A. for the Lipid Research Clinics
Program (1990), "High Density Lipoprotein Cholesterol as a Predictor
of Cardiovascualr Disease Mortality in Men and Women: The Followup Study of the Lipid Research Clinics Prevalence Study," American
Journal of Epidemiology, 131,32-47.
•
•
Kalbfleisch, John D., and Prentice, Ross L. (1980), The Statistical Analysis
of Failure Time Data, New York: John Wiley & Sons, Inc., 103-113.
IOeinbaum, D.G., Kupper L.L. and Morgenstern, H~ (1982), Epidemiologic
Research: Principles and Quantitative Methods, New York: Van
Nostrand Reinhold Co., Chapters 4-5.
Kupper, L.L., Karon, J.M., Kleinbaum, D.G., Morgenstern, H. and Lewis,
D.K. (1981), "Matching in Epidemiologic Studies: Validity and
Efficiency Considerations," Biometrics, 37,271-291.
Kupper, L.L., McMichael, A.J. and Spirtas, R. (1975), "A Hybrid
Epidemiologic Study Design Useful in Estimating Relative Risk,"
Journal of the American Statistical Association, 70, 524-528.
Langholz, Bryan and Thomas, Duncan C. (1990), "Nested Case-Control and
Case-Cohort Methods of Sampling from a Cohort: A Critical
Comparison," American Journal of Epidemiology, 131, 169-176.
Liddell, F.D.K., McDonald, J.C. and Thomas, D.C. (1977), "Methods of
Cohort Analysis: Appraisal by Application to Asbestos Mining,"
Journal of the Royal Statistical Society, A, 140,469-491.
105
Lilienfeld, Abraham. M. and Lilienfeld, David E. (1979), "A Century of CaseControl Studies: Progress?," Journal of Chronic Diseases, 32, 5-13.
Lipid Research Clinics Program (1984), "The Lipid Research Clinics
Coronary Primary Prevention Trial Results: I. Reduction in Incidence
of Coronary Heart Disease," Journal of.the American Medical
Association, 251, 351-364.
.
Lubin, Jay H. and Gail, Mitchell H. (1984), "Biased Selection of Controls for
Case-Control Analyses of Cohort Studies," Biometrics. 40, 63-75.
MacMahon, B. and Pugh, T.F. (1970), Epidemiology:_Principles and
Methods, Boston: Little, Brown, Chapters 11-13.
Mantel, N. (1973), "Synthetic Retrospective Studies and Related Topics,"
Biometrics, 29, 479-486.
Miettinen, OUi (1982), "Design Options in Epidemiologic Research,"
Scandinavian Journal of Work, Environment, and Health, 8, Suppl. 1,
7-14.
Morgan, Timothy M. and Elasoff, Robert M. (1986), "Effect of Censoring on
Adjusting for Covariates in Comparison of Survival Times,"
Communications in Statistics, A15, 1837-1854.
•
..
Nunnally, J. C. (1978), Psychometric Theory, 2nd edition, New York:
McGraw-Hill, Inc., 121-133.
Oakes, D. (1981), "Survival Times: Aspects of Partial Likelihood,"
International Statistical Review, 49, 235-264.
Pocock, Stuart J. (1983), Clinical Trials: A Practical Approach,
York: John Wiley & Sons, Chapters 4-6,14.
.New
I
Prentice, Ross L. (1986a), "On the Design of Synthetic Case-Control
Studies," Biometrics, 42, 301-310.
Prentice, R.L. (1986b), "A Case-Cohort Design for Epidemiologic Cohort
Studies and Disease Prevention Trials," Biometrika, 73, 1-11.
Prentice, R.L. and Breslow, N.E. (1978), "Retrospective Studies and Failure
Time Models," Biometrika, 65, 153-158.
Prentice, R.L. and Pyke, R.L. (1979), "Logistic Disease Incidence Models and
Case-control Studies," Biometrika, 66, 403-411.
106
..
•
,
Prentice, Ross L., Self, Steven G. and Mason, Mark W. (1986), "Design
Options for Sampling Within a Cohort," in: Modern Statistical
Methods in Chronic Disease Epidemiology, eds. Suresh H. Moolgavkar
and Ross L. Prentice, New York: John Wiley and Sons, 50-62.
Robins, James M., Gail, Mitchell H. and Lubin, Jay H. (1986), "More on
'Biased Selection of Controls for Case-Control Analyses of Cohort
Studies'," Biometrics, 42, 293-299.
Schneiderman, Marvin A. and Levin, David L. (1973), "Parallels,
Convergences, and Departures in Case-Control Studies and Clinical
Trials," Cancer Research, 33, 1498-1503.
Self, Steven G. and Prentice, Ross L. (1988), "Asymptotic Distribution
Theory and Efficiency Results For Case-Cohort Studies," The Annals of
Statistics, 16,64-81.
Ury, H.K (1975), ''Efficiency of Case-Control Studies with Multiple Controls
Per Case: Continuous or Dichotomous Data," Biometrics, 31, 643-649.
1
Wachold-er, Sholom and Boivin, Jean-Francois (1987), "External
Comparisons With the Case-Cohort Design," ADierican Journal of
Epidemiology, 126, 1198-1209.
•
Wacholder, Sholom, Gail, Mitchell H., Pee, David and Brookmeyer, Ron
(1989), "Alternative Variance and Efficiency Calculations for the CaseCohort Design," Biometrika, 76, 117-123.
White, J. Emily (1982), "A Two-Stage Design For the Study of the
Relationship Between a Rare Exposure and a Rare Disease," American
Journal of Epidemiology, 115, 119-128.
Whittemore, A.S. (1981), "The Efficiency of Synthetic Retrospective
Studies," Biometrical Journal, 23, 73-78.
•
Whittemore, A.S. and McMillan, A. (1982), "Analyzing Occupational Cohort
Data: Application to U.S. Uranium Miner·s," in: Environmental
Epidemiology: Risk Assessment, eds. B.L. Prentice and A.S.
Whittemore, Philadelphia: SIAM, 65-81.
,
107