Multiple Imputation of Baseline Data in the Cardiovascular Health Study

American Journal of Epidemiology
Copyright © 2003 by the Johns Hopkins Bloomberg School of Public Health
All rights reserved
Vol. 157, No. 1
Printed in U.S.A.
DOI: 10.1093/aje/kwf156
Multiple Imputation of Baseline Data in the Cardiovascular Health Study
Alice M. Arnold and Richard A. Kronmal
From the Department of Biostatistics, University of Washington, Seattle, WA.
Received for publication November 16, 2001; accepted for publication July 22, 2002.
Most epidemiologic studies will encounter missing covariate data. Software packages typically used for analyzing
data delete any cases with a missing covariate to perform a complete case analysis. The deletion of cases
complicates variable selection when different variables are missing on different cases, reduces power, and creates
the potential for bias in the resulting estimates. Recently, software has become available for producing multiple
imputations of missing data that account for the between-imputation variability. The implementation of the software
to impute missing baseline data in the setting of the Cardiovascular Health Study, a large, observational study, is
described. Results of exploratory analyses using the imputed data were largely consistent with results using only
complete cases, even in a situation where one third of the cases were excluded from the complete case analysis.
There were few differences in the exploratory results across three imputations, and the combined results from the
multiple imputations were very similar to results from a single imputation. An increase in power was evident and
variable selection simplified when using the imputed data sets.
biometry; epidemiologic methods; imputation; missing data; regression analysis
Abbreviation: NHANES, National Health and Nutrition Examination Survey.
Most epidemiologic studies will encounter missing data.
The problems that missing data present in the analysis and
interpretation of results have been widely studied, as have
methods for imputing missing data (1–9). Recently, software
has become available to perform multiple imputations of
missing data (10, 11). We describe our experience in the
Cardiovascular Health Study of imputing missing data on
approximately 150 variables collected at baseline. Then, with
multiple copies of our filled-in baseline data available, we
explored the questions of whether or not results from complete
case analyses, which use observed data and delete any cases
with missing data, differ from those using the imputed data
sets and of how much results from a single imputation differ
from those of the combined results from multiple imputation.
BACKGROUND
The Cardiovascular Health Study
The Cardiovascular Health Study is a population-based
study designed to identify risk factors for cardiovascular
disease in individuals aged 65 or more years. In 1989, 5,201
participants were enrolled, and a supplemental cohort of 687
African Americans was added in 1992–1993. Invited participants were a random sample of Health Care Financing
Administration eligibility lists and persons living in their
households. Participants provided informed consent, and
study methods were approved by the institutional review
committees at each participating center. Details of the design
and recruitment have been published (12, 13).
At the baseline visit, participants were given an extensive
clinical examination that included medical and personal histories, assessment of physical functioning and activity, cognitive
testing, phlebotomy, electrocardiogram, and carotid ultrasound. The original cohort also had echocardiograms and
spirometry tests. Despite efforts to obtain complete data, nearly
all examination components were missing data on one or more
participants. The reasons range from participant refusal or
inability to answer certain questions or perform some of the
examination components to technical difficulties resulting in
unreadable images on ultrasound or echocardiogram (14).
Correspondence to Dr. Alice Arnold, Collaborative Health Studies Coordinating Center, Building 29, Suite 310, 6200 NE 74th Street, Seattle,
WA 98115 (e-mail: [email protected]).
Am J Epidemiol 2003;157:74–84
TABLE 1. Amount of missing data in 156 variables used in imputation

               Original cohort                                   African-American cohort
% of missing   No. of cases missing   Variables missing          No. of cases missing   Variables missing
data           out of 5,201           out of 156                 out of 687             out of 140
                                      No.       %                                       No.       %
None           0                      14        9.0              0                      19        13.6
≤2             1–104                  99        63.5             1–14                   62        44.3
2.01–5         105–260                25        16.0             15–34                  35        25.0
5.01–10        261–520                9         5.8              35–69                  22        15.7
10.01–16       521–832                1         0.6              70–110                 2         1.4
35             1,791–1,839            8         5.1
Missing data and multiple imputation
Missing covariate data in epidemiologic studies present
several problems to the analyst including difficulties in variable selection, reduced power, and the potential for bias in
the resulting estimates (1–7). For these reasons, we sought to
impute missing data and to study the impact of the imputation on previously published findings from complete case
analyses. We wanted the imputed data sets to be available to
other analysts using Cardiovascular Health Study data,
requiring that the imputation be done once centrally and not
repeatedly in the context of each particular analysis.
Greenland and Finkle (6) reviewed several methods of
handling missing covariates in regression analysis including
stratification on missing-data status, conditional-mean imputation, and multiple imputation, and they concluded that the
more complex methods of multiple imputation were preferable but challenging to implement because of a lack of software. Barnard and Meng (15) conclude that Rubin’s method
of multiple imputation is “without serious competition” in
incomplete-data problems when analysis files will be distributed to researchers other than those who created and maintained the database, as is the case in the Cardiovascular
Health Study. Several programs are available for multiple
imputation (11). We used S-PLUS software (MathSoft, Inc.,
Seattle, Washington) created by Dr. Joseph L. Schafer (10).
Multiple imputation has been used and reported on in the
US National Health and Nutrition Examination Survey
(NHANES) (16, 17). Ezzati-Rice et al. (16) performed a
simulation study to demonstrate that the confidence intervals
of regression estimates from multiple imputation have the
correct coverage. Schafer et al. (17) discussed the imputation
process in a subset of NHANES data and showed that the
distributions of the imputed variables were consistent with
those from the observed data. The current manuscript adds to
the literature by describing the process and difficulties of
imputing over 100 variables and by comparing results from
complete case analyses with those from both singly and
multiply imputed data sets in the realistic setting of a large
epidemiologic study.
MULTIPLE IMPUTATION METHODS
A full description of multiple imputation is beyond the
scope of this report, but we provide a brief overview,
including some key considerations for the analyst utilizing
the software. The method for imputation and subsequent
analysis of the filled-in data involves three steps: 1) imputing
data under an appropriate model and repeating the imputation to obtain m copies of the filled-in data set; 2) analyzing
each data set separately to obtain desired parameter estimates and standard errors; and 3) combining results of the m
analyses by computing the mean of the m parameter estimates and a variance estimate that includes both a within-imputation and an across-imputation component.
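The combining step can be sketched in a few lines of Python. The function name `pool_rubin` and its return format are illustrative assumptions, not part of the S-PLUS or SPSS software used in the study, and the fraction of missing information is computed with Rubin's degrees-of-freedom-adjusted formula, which we assume is what the software reports. The inputs are the three age coefficients and standard errors listed in table 7; with these rounded values the sketch reproduces the combined estimate of 0.38 (standard error 0.18) shown in table 6.

```python
import math

def pool_rubin(estimates, std_errors):
    """Combine m point estimates and standard errors using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # pooled point estimate
    w = sum(se ** 2 for se in std_errors) / m               # within-imputation variance W
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance B
    t = w + (1 + 1 / m) * b                                 # total variance
    if b == 0:
        gamma = 0.0                                         # no between-imputation variability
    else:
        r = (1 + 1 / m) * b / w                             # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2                     # Rubin's degrees of freedom
        gamma = (r + 2 / (df + 3)) / (r + 1)                # fraction of missing information
    return {"estimate": q_bar, "se": math.sqrt(t), "W": w, "B": b, "gamma": gamma}

# Age coefficients and standard errors from the three imputations (table 7).
pooled = pool_rubin([0.24, 0.45, 0.45], [0.12, 0.12, 0.12])
print(round(pooled["estimate"], 2), round(pooled["se"], 2), round(pooled["gamma"], 2))
# prints: 0.38 0.18 0.67
```

The fraction of missing information obtained from the rounded inputs (about 0.67) is close to the 68 percent reported for age in the left ventricular mass model.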
In the first step, a group of correlated variables containing
some missing values were imputed together in an iterative
process that allowed the missing values for each variable to
be predicted from all of the other variables in the correlated
group. The model we used specified a log-linear distribution
for categorical variables and a multivariate normal regression model for continuous data. The parameters included the
cell probabilities for each distinct cell defined by the categorical variables and, within each cell, the mean and variance of the continuous variables. The variance-covariance
matrix of the continuous variables was assumed constant
across cells, whereas the mean values were cell dependent.
Because the model parameters are estimated from the
observed and filled-in data, the parameters themselves can
be considered to have a probability distribution: a Bayesian
prior distribution specified before estimating the missing
data and a posterior distribution determined afterward. In the
absence of any information regarding the mean and variance
of the parameters, a noninformative prior is recommended
and that is what we used (8). Sampling from the posterior
distribution allows for adjustment of the variability of the
parameter estimates for the uncertainty introduced by the
missing value replacement. An assumption of this model-based method of imputation is that the missing values are
missing at random; that is, their values may depend on the
values of other observed data but not on data that have not
been measured.
Rubin (1) and Schafer (8) have shown that 3–5 imputations are usually all that is needed and, based on the minimal
missingness observed in most of our variables (table 1), we
chose to create three copies of the baseline data set. In step 2,
sample analyses were replicated three times, using variables
from each of the imputed data sets in turn. At this stage, any
standard statistical software package may be used, provided
the parameter estimates of interest (e.g., regression coefficients) and their standard errors can be saved. We used SPSS
for Windows, version 8, software (SPSS, Inc., Chicago, Illinois). In step 3, the parameter estimates and standard errors
from each of the three separate analyses were combined to
give the mean of the point estimates and a standard error that
accounts for the average variability observed within (W) and
between (B) the separate analyses. When information is defined statistically as the average negative second derivative
of the log posterior density of the parameters, $W/B$ estimates $(1 - \gamma)/\gamma$, where $\gamma$ is the fraction of information missing
due to nonresponse, and $(1 + \gamma/m)^{-1}$ estimates the relative
efficiency of an estimate based on m imputations compared
with one based on an infinite number of imputations (2, 8).
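As a quick numerical check of the relative efficiency expression, the following Python sketch evaluates it for m = 3 at the fractions of missing information reported later in this article (6, 27, and 68 percent). The function name `relative_efficiency` is an assumption made for illustration.

```python
def relative_efficiency(gamma, m):
    """Efficiency of m imputations relative to an infinite number: (1 + gamma/m)^-1."""
    if m < 1 or not 0.0 <= gamma <= 1.0:
        raise ValueError("m must be at least 1 and gamma must lie in [0, 1]")
    return 1.0 / (1.0 + gamma / m)

# Fractions of missing information reported in the results, with m = 3 imputations.
for gamma in (0.06, 0.27, 0.68):
    print(gamma, round(100 * relative_efficiency(gamma, 3)))
# prints:
# 0.06 98
# 0.27 92
# 0.68 82
```

These values agree with the relative efficiencies of 98, 92, and 82 percent quoted in the comparative analyses.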
VARIABLE SELECTION AND DATA PREPARATION
Table 1 shows the amount of missing data on the variables
considered in the imputation for each cohort. Although
nearly three fourths of the variables were missing data for 2
percent or fewer cases in the original cohort, these were not
always the same cases, so that in multivariable analyses the
combined effect of the missing data is a loss of a greater
percentage of cases. The data missing the most cases were
from the M-mode echocardiography, where images were not
readable for approximately one third of the original cohort
participants.
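To see how quickly small per-variable losses accumulate, the sketch below computes the expected share of complete cases when missingness is independent across variables. Independence is an assumption made only for illustration (missingness in the study was probably correlated within examination components), and the function name and the 20-covariate scenario are hypothetical.

```python
import math

def complete_case_fraction(missing_rates):
    """Expected fraction of cases with no missing values, assuming independent missingness."""
    return math.prod(1.0 - p for p in missing_rates)

# Twenty hypothetical covariates, each missing on 2 percent of cases.
print(round(complete_case_fraction([0.02] * 20), 3))  # prints: 0.668
```

Under this assumption, a model with 20 such covariates would retain only about two thirds of the cases, even though no single variable is missing on more than 2 percent of them.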
In order to prepare the data for imputation, we needed to
consider the distributional assumptions of the method and
the selection of variables to impute together. We decided to
impute the two Cardiovascular Health Study cohorts separately because they were enrolled in different years, had
some differences in the data collected, and differed substantially in racial mix, with African Americans comprising only
4.7 percent of the original cohort. More difficult to determine was which of the variables of interest to impute
together. If variables that are related are not imputed
together, and then subsequently used in analyses together,
the relations among them will be dampened by the fact that
the imputed subset of each variable will not be related to the
other variables. To our advantage is the fact that the imputed
subset of most variables would be small. For practical
reasons of computation time and memory requirements, the
programs have a default maximal size of 30 variables to
impute together, and we chose to stay within that variable
limit. The richness and slight redundancy of the Cardiovascular Health Study data set allowed us to impute groups of
similar and highly correlated variables together (table 2). For
example, heart rate was measured at four different times
during the baseline examination, and three different height
measurements were taken. A wealth of covariate data was
available, which became especially important when an entire
group of measurements was missing. For example, 49 people
from the original cohort were missing all echocardiogram
data, but other deterministic variables such as sex, body size,
disease status, and electrocardiogram data were available.
Covariates were selected by reviewing published papers of
associations, by examining bivariate and partial correlation
coefficients to find the variables most highly related to those
selected for imputation, and by examining regression coefficients in models of many potential variables for inclusion.
Ten separate imputations were run on the original cohort and
nine on the new cohort. Once the imputations were
completed, we explored correlations of variables in different
blocks to determine empirically if correlations were dampened between variables not imputed together.
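The dampening of relations between variables that are not imputed together can be demonstrated with a small simulation. This is a simplified sketch under stated assumptions: two standard normal variables with a true correlation of 0.6, 30 percent of one variable missing completely at random, and imputations drawn from the known true distributions rather than from an estimated model as in the actual software. All names and settings are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n, rho, frac_missing = 5000, 0.6, 0.3
resid_sd = (1 - rho ** 2) ** 0.5

x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * xi + resid_sd * rng.gauss(0, 1) for xi in x]
is_missing = [rng.random() < frac_missing for _ in range(n)]

# Imputed separately: missing y values are drawn without reference to x.
y_separate = [rng.gauss(0, 1) if miss else yi for yi, miss in zip(y, is_missing)]
# Imputed together: missing y values are drawn conditional on x.
y_together = [rho * xi + resid_sd * rng.gauss(0, 1) if miss else yi
              for xi, yi, miss in zip(x, y, is_missing)]

r_full = pearson(x, y)
r_separate = pearson(x, y_separate)
r_together = pearson(x, y_together)
print(round(r_full, 2), round(r_separate, 2), round(r_together, 2))
```

With these settings the correlation after separate imputation is expected to shrink toward roughly 0.6 × 0.7 = 0.42, whereas imputing conditional on the related variable leaves it close to 0.6. The shrinkage is proportional to the imputed fraction, which is why a small imputed subset limits the damage.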
All data were scrutinized for outliers or errors, both
univariately and bivariately when another highly correlated
variable was available. For example, blood pressure was
measured twice with the participant seated, once standing,
and once supine, allowing these values to be compared
against one another. Once it was decided which variables to
impute together, outliers in the multivariate space were identified by large residuals in regression analyses. Any gross
outliers indicative of errors were set to missing and subsequently imputed, because they could exert undue influence
on the parameter estimates and inflate the variability of the
multiple imputations (18).
Continuous variables that were not normally distributed
were transformed. Careful consideration was given to the
choice of variables to be considered categorical in the
program, since the choice of categorical variables determined the number of cells or stratification of the data within
which estimation of the mean value of continuous covariates
would occur. Unordered categorical variables such as clinic
site needed to be modeled categorically; others could be
modeled continuously and then rounded after imputation.
Studies have shown that the programs are quite robust to
modeling ordered categorical variables or indicator variables
as continuous (8). Biologic rather than statistical considerations often influenced the selection of the categorical variables for stratification, at times choosing variables which
contained no missing data but which could influence the
mean of other variables in the imputation, for example, sex,
race, or the presence of cardiovascular disease.
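Rounding after imputation can be sketched as follows for an ordered categorical variable that was modeled as continuous. The helper name, the integer coding, and the clipping of out-of-range values to the nearest valid category are assumptions made for illustration; the rounding rule actually applied in the study is not specified here.

```python
import math

def round_to_category(value, low, high):
    """Round a continuous imputed value to the nearest valid integer category."""
    nearest = math.floor(value + 0.5)       # round half up, unlike built-in round()
    return int(min(high, max(low, nearest)))

# Hypothetical imputed values for an item coded 1 through 5.
imputed = [0.4, 2.6, 5.3, -0.2]
print([round_to_category(v, 1, 5) for v in imputed])  # prints: [1, 3, 5, 1]
```

The same helper applies to an indicator variable by setting the limits to 0 and 1.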
IMPUTATION RESULTS
Univariate distributions of the imputed variables were
consistent with those of the observed data. Bivariate correlations among 16 selected variables representing each of the
different imputation blocks were compared pre- and postimputation. Of the 87 pairs of variables in different blocks,
complete case correlations were less than 0.1 for 58 pairs (67
percent). Twenty-one pairs had correlations between 0.1 and
0.2, and five had correlations between 0.2 and 0.3. Three
pairs had correlations between 0.3 and 0.5, and their pre- and
postimputation correlations differed by 0, 0.014, and 0.003.
Of the 29 pairs with correlations greater than 0.1, the
maximum difference between the complete case and
imputed correlations was 0.025 for two pairs. All other
differences were 0.015 or less.
RESULTS IN COMPARATIVE ANALYSES
We replicated results from several previously published
reports, using both singly and multiply imputed data. We
present results from three analyses: 1) a stroke prediction
model based on 7.5 years of follow-up (19), 2) a linear
regression of left ventricular mass based on a model by
Gardin et al. (20), and 3) a survival analysis of mortality in
TABLE 2. Sets of variables imputed together

Data: Body size: standing and seated height, weight, waist, hip, heel-knee length, bioresistance, bioreactance, weight at age 50 years
Related covariates: Strata: race,* gender. Other: age

Data: Blood pressure and heart rate: supine, seated, and standing systolic and diastolic blood pressure, brachial blood pressure, left and right ankle blood pressure, three resting pulse measures, supine and standing pulse
Related covariates: Strata: race, gender, antihypertensive medications, diabetes. Other: history of high blood pressure, coronary heart disease, claudication, age, height, weight, alcohol, clinic, smoking status

Data: Blood laboratory: uric acid, platelet count, white blood cells, potassium, hemoglobin, hematocrit, glucose, insulin, albumin, cholesterol, high density lipoprotein cholesterol, triglycerides, fibrinogen, factors VII and VIII, creatinine, C-reactive protein
Related covariates: Imputed men and women separately. Strata: cardiovascular disease, race, diabetes. Other: age, history of high blood pressure, current smoking, forced expiratory volume in 1 second, use of insulin, oral hypoglycemics, estrogen, or lipid-lowering medications

Data: Lung function: history of asthma, emphysema, bronchitis, pneumonia, chest operation, hay fever, wheeziness, dyspnea, frequent cough or phlegm. Spirometry: forced expiratory volume in 1 second, forced vital capacity, QC grade†
Related covariates: Strata: gender, race, former smoking, current smoking. Other: age, height, waist circumference, congestive heart failure, systolic blood pressure, diabetes, major electrocardiograph abnormalities, related medications, pack-years

Data: Psychosocial: income, occupation, education, live alone. Scores: Mini-Mental, Center for Epidemiologic Studies Depression, digit symbol, social network, social support
Related covariates: Strata: marital status, gender, race. Other: age, clinic, activities of daily living, instrumental activities of daily living, history of high blood pressure, coronary heart disease, stroke, congestive heart failure, antidepressant use, vision or hearing problem, self-reported health

Data: Alcohol: beer, wine, liquor consumption, heavy drinking in past, change in drinking pattern
Related covariates: Strata: gender, drink beer, wine, liquor. Other: race, clinic, smoking (status, pack-years, live with smoker), cardiovascular disease, diabetes, age, systolic blood pressure, education, exercise, high density lipoprotein cholesterol, self-reported health, weight, on special diet

Data: Physical function and exercise: activities of daily living, instrumental activities of daily living, 15-foot‡ walk time, grip strength, exercise intensity, kcal of exercise, usual walking pace, number of flights of stairs/week, blocks walked/week, hours/day seated or lying, activity level relative to younger age. Difficulty: walking 0.5 mile§ and around home, getting out of bed or chair, doing stairs, lifting, reaching, gripping
Related covariates: Strata: gender, any exercise. Other: weight, cardiovascular disease, clinic, race, age, self-reported health

Data: Electrocardiogram: ventricular conduct defect, atrioventricular block, cardiac injury score, atrial fibrillation, left ventricular hypertrophy, left ventricular mass, major Q or QS abnormality, isolated major ST-T, minor Q/QS with ST-T, ABT axis, QRS axis. Intervals: PR, QRS, QT
Related covariates: Strata: gender, coronary heart disease. Other: age, congestive heart failure, weight, heart rate, systolic blood pressure, echo left ventricular mass, use of beta blockers, digitalis, antiarrhythmics, warfarin

Data: Echocardiogram*: left ventricular ejection fraction, wall motion, early and late Doppler flow velocities, left atrial size, aortic root size, left ventricular size in diastole and systole, ventral-septral thickness in diastole and systole, left ventricular posterior wall thickness in diastole and systole
Related covariates: Imputed men and women separately. Stratified on left ventricular ejection fraction, left ventricular systolic wall motion abnormalities. Other: congestive heart failure, coronary heart disease, history of high blood pressure, race, diabetes, age, height, weight, systolic blood pressure, heart rate, carotid vessel maximum, warfarin use; electrocardiogram: atrial fibrillation, cardiac injury score, left ventricular mass, major abnormalities, QRS interval

Data: Ultrasound: internal and common carotid wall thickness, vessel maximum, stenosis of internal carotid
Related covariates: Stratified on reader. Other: age, gender, race, clinic, height, weight, current smoking, pack-years; history of stroke, transient ischemic attack, claudication, congestive heart failure, diabetes, endarterectomy, high blood pressure; bruits, diastolic and systolic blood pressure, high density lipoprotein and low density lipoprotein cholesterol, lipid-lowering or antihypertensive medications; electrocardiogram: left ventricular mass, major abnormalities

* Original cohort only.
† QC grade, quality control grade of A, B, C, D, or F, indicating the degree of confidence in the results.
‡ Metric equivalent: 4.57 m.
§ Metric equivalent: 0.8 km.
the smaller, African-American cohort, using results from a
study of 5-year mortality in the original cohort (21). The
complete case results reported here differ slightly from those
published because of continual updates in our database and,
in the model for left ventricular mass, because of a decision
to incorporate only the nonechocardiographic predictors into
the current model.
The stroke prediction model used variables measured at
baseline to predict future stroke among participants with no
history of stroke at baseline. Variables in the model are
shown in table 3. The variable that accounted for eliminating
the most participants from the complete case analysis was
left ventricular mass, which was missing on 34 percent of the
participants at risk. Covariate values were compared for
participants included in and excluded from the complete case
analysis, using one set of imputed values to estimate the
missing data for those excluded (table 3). Significant differences were found for all variables except regular aspirin use.
Those excluded from the complete case analysis were older,
more likely to be male and, in general, sicker than those
TABLE 3. Comparison of variables in stroke risk model by completeness of data

Variable                                      Subjects included in      Subjects excluded from    p value
                                              complete case analysis    complete case analysis
                                              (n = 3,088)               (n = 1,914)
Age, years (mean, SD*)                        72.2 (5.26)               73.62 (5.96)              <0.001
Male sex (%)                                  39.2                      47.4                      <0.001
Aspirin use (%)                               25.7                      27.0                      0.32
Diabetes (%)                                                                                      <0.001
  Normal                                      52.9                      46.2
  Impaired glucose tolerance                  26.7                      28.6
  Diabetic                                    20.4                      25.2
Systolic blood pressure, mmHg (mean, SD)      141.6 (20.4)              143.1 (20.3)              0.01
Timed walk, seconds (%)                                                                           <0.001
  2–6                                         81.5                      76.8
  7                                           9.3                       10.6
  ≥8                                          9.1                       12.6
Frequent falls (%)                            2.3                       3.7                       0.006
Creatinine, mg/dl (mean, SD)                  1.04 (0.30)               1.07 (0.33)               <0.001
Abnormal left ventricular wall motion (%)     7.6                       13.2                      <0.001
Carotid stenosis (%)                                                                              <0.001
  None                                        23.8                      18.8
  <50                                         71.8                      76.1
  50–74                                       3.4                       3.9
  ≥75                                         1.0                       1.2
Left ventricular mass, ≥194 g in women
  and ≥267 g in men (%)                       6.6                       8.7                       0.006
Atrial fibrillation (%)                       5.7                       4.5                       0.07
Incident stroke (%)                           8.2                       9.6                       0.08

* SD, standard deviation.
included, illustrating the potential bias in using complete
case prevalences to estimate population prevalences.
Table 4 displays results of the stroke prediction model for
the complete case analysis (model 1, n = 3,088) and for two
analyses using imputed data (n = 5,002), representing a
single imputation (model 2) and the combined results from
three imputations (model 3). The addition of the extra participants in the imputed data models resulted in tighter confidence bounds and smaller p values, most noticeably for
categories with few participants represented by the group
aged 85 years and older, those with frequent falls, and those
with abnormal left ventricular wall motion. The hazard ratios
were similar between the observed and imputed models,
with the exception of the highest category of carotid stenosis,
which was not precisely estimated in any model because of
the small number of participants with this degree of stenosis
(1 percent of the cohort). The complete case point estimate
fell within the 95 percent confidence interval of the estimate
from the imputed data set. There were no differences in
results between the single imputation and the combined
results from three imputations. The fraction of missing information was less than 6 percent for all variables except left
ventricular mass, for which it was 27 percent. Therefore, the
relative efficiency of the estimates based on three imputations compared with those based on an infinite number of
imputations was 98 percent or more for all but left ventricular mass, for which it was 92 percent, suggesting that three
imputations were adequate.
Tables 5, 6, and 7 display the results for modeling left
ventricular mass by linear regression. Table 5 contains the
means and frequencies of variables for those included in
versus excluded from the complete case analysis. Men were
less likely than women to be included in the analysis. Participants excluded from the complete case analysis were older
and had slightly higher mean systolic blood pressure than
those included. The women excluded were heavier, had lower
mean levels of high density lipoprotein, and were more likely
to have a history of hypertension or a minor electrocardiogram abnormality. The men who were excluded were more
likely to have had a previous myocardial infarction.
Table 6 shows results from a complete case analysis
against those from the combined multiple imputation. Given
the number of cases omitted from the complete case analysis
and the apparent differences in health status between the two
groups, the results are surprisingly similar. Only one variable, total cholesterol, goes from highly statistically significant to not significant. Although the point estimate for age is
higher in the imputation model, it is less significant because
TABLE 4. Multiple imputation of risk factors for stroke: hazard ratios and confidence intervals from Cox models

                                        Model 1: 3,088 subjects;          Model 2: 5,002 subjects at risk;  Model 3: 5,002 subjects at risk;
                                        complete case                     single imputation                 multiple imputations
Variable                                Hazard  95% CI*      p value      Hazard  95% CI       p value      Hazard  95% CI       p value
                                        ratio                             ratio                             ratio
Age, years
  65–69                                 1.00                              1.00                              1.00
  70–74                                 1.38    0.98, 1.94   0.07         1.44    1.09, 1.89   0.009        1.43    1.09, 1.88   0.01
  75–79                                 1.77    1.23, 2.54   0.002        1.94    1.47, 2.56   <0.001       1.94    1.47, 2.56   <0.001
  80–84                                 2.21    1.43, 3.42   <0.001       2.47    1.79, 3.40   <0.001       2.48    1.80, 3.42   <0.001
  ≥85                                   1.92    0.98, 3.77   0.06         2.00    1.24, 3.23   0.005        2.00    1.24, 3.22   0.005
Male sex                                0.89    0.65, 1.21   0.47         1.05    0.83, 1.33   0.68         1.04    0.82, 1.32   0.73
Aspirin use                             1.42    1.09, 1.85   0.009        1.37    1.12, 1.67   0.002        1.37    1.12, 1.67   0.002
Diabetes
  Normal                                1.00                              1.00                              1.00
  Impaired glucose tolerance            1.27    0.93, 1.73   0.14         1.28    1.02, 1.62   0.04         1.28    1.01, 1.62   0.04
  Diabetic                              1.88    1.39, 2.54   <0.001       1.68    1.33, 2.11   <0.001       1.67    1.33, 2.11   <0.001
Systolic blood pressure, mmHg
  <127                                  1.00                              1.00                              1.00
  128–140                               0.94    0.60, 1.46   0.78         1.05    0.75, 1.47   0.77         1.09    0.78, 1.53   0.62
  141–152                               1.56    1.03, 2.36   0.04         1.52    1.10, 2.10   0.01         1.54    1.11, 2.12   0.009
  153–166                               1.96    1.28, 3.01   0.002        1.92    1.38, 2.66   <0.001       1.93    1.39, 2.68   <0.001
  ≥167                                  2.41    1.57, 3.71   <0.001       2.38    1.71, 3.31   <0.001       2.38    1.71, 3.31   <0.001
Timed walk, seconds
  2–6                                   1.00                              1.00                              1.00
  7                                     1.26    0.85, 1.87   0.26         1.29    0.96, 1.73   0.09         1.31    0.97, 1.76   0.07
  ≥8                                    1.83    1.30, 2.58   <0.001       1.79    1.39, 2.32   <0.001       1.77    1.37, 2.30   <0.001
Frequent falls                          1.51    0.79, 2.88   0.22         1.77    1.17, 2.69   0.007        1.78    1.17, 2.70   0.007
Creatinine, mg/dl
  0.4–0.9                               1.00                              1.00                              1.00
  1.0                                   1.02    0.67, 1.54   0.94         0.99    0.72, 1.35   0.93         0.99    0.72, 1.35   0.94
  1.1–1.2                               1.18    0.84, 1.67   0.34         1.12    0.85, 1.48   0.41         1.14    0.86, 1.49   0.36
  1.3–1.4                               0.97    0.60, 1.56   0.89         1.09    0.78, 1.54   0.61         1.10    0.78, 1.55   0.58
  1.5–7.7                               1.51    0.94, 2.43   0.09         1.63    1.15, 2.30   0.006        1.65    1.17, 2.34   0.005
Abnormal left ventricular wall motion   1.42    0.96, 2.12   0.08         1.39    1.06, 1.83   0.02         1.41    1.06, 1.87   0.02
Carotid stenosis, %
  None                                  1.00                              1.00                              1.00
  <50                                   1.30    0.96, 1.75   0.09         1.51    1.14, 2.02   0.005        1.51    1.13, 2.01   0.005
  50–74                                 2.22    1.37, 3.59   0.001        2.41    1.54, 3.79   <0.001       2.32    1.45, 3.70   <0.001
  ≥75                                   1.40    0.55, 3.55   0.47         0.48    0.12, 1.97   0.31         0.47    0.11, 1.94   0.30
Left ventricular mass, ≥194 g in
  women and ≥267 g in men               1.40    0.92, 2.13   0.11         1.29    0.94, 1.75   0.11         1.24    0.86, 1.80   0.25
Atrial fibrillation                     2.44    1.69, 3.52   <0.001       2.06    1.52, 2.79   <0.001       2.05    1.51, 2.78   <0.001

* CI, confidence interval.
of variability across imputations (table 7). The coefficient
for current smoking also varies substantially across imputations, ranging from 1.9 to 3.9, which is less than the value of
4.2 from the complete case analysis. There were no significant differences in cholesterol or smoking between those
included in or excluded from the complete case analysis,
even though these were the two variables with the greatest
difference in regression coefficients between the complete
case analysis and the imputation analysis. The fraction of
missing information ranges from 7 percent to 68 percent,
TABLE 5. Variables in left ventricular mass model for cases included versus excluded from complete case analysis

                                                        Women                                   Men
Variable                                                Excluded       Included       p value   Excluded       Included       p value
                                                        (n = 1,050)    (n = 1,912)              (n = 974)      (n = 1,265)
Age, years (mean, SD*)                                  73.4 (5.8)     71.9 (5.1)     <0.001    74.0 (6.0)     72.9 (5.5)     <0.001
Current smoker (no., %)                                 138 (13.1)     237 (12.4)     0.56      109 (11.2)     117 (9.2)      0.14
Weight, pounds† (mean, SD)                              149.1 (32.9)   146.1 (28.1)   0.009     174.9 (28.6)   173.5 (26.0)   0.22
History of myocardial infarction (no., %)               67 (6.4)       116 (6.1)      0.75      171 (17.6)     158 (12.5)     0.001
History of congestive heart failure (no., %)            50 (4.8)       63 (3.3)       0.06      55 (5.6)       58 (4.6)       0.28
Diastolic blood pressure, mmHg (mean, SD)               69.06 (11.7)   69.12 (10.7)   0.89      72.0 (11.5)    71.2 (11.3)    0.10
Systolic blood pressure, mmHg (mean, SD)                137.4 (22.1)   135.4 (21.7)   0.015     136.4 (22.0)   134.5 (20.4)   0.04
History of hypertension (no., %)                        520 (49.5)     849 (44.4)     0.008     429 (44.0)     514 (40.6)     0.11
Total cholesterol, mg/dl (mean, SD)                     224.4 (39.4)   224.9 (38.1)   0.72      200.3 (35.6)   202.0 (36.3)   0.27
High density lipoprotein cholesterol, mg/dl (mean, SD)  57.8 (16.0)    59.4 (16.0)    0.009     47.2 (13.2)    47.3 (12.3)    0.86
Major electrocardiogram abnormality (no., %)            251 (23.9)     413 (21.6)     0.15      336 (34.5)     424 (33.5)     0.65
Minor electrocardiogram abnormality (no., %)            591 (56.3)     960 (50.2)     0.002     557 (57.2)     723 (57.2)     1.00
Left ventricular mass, g (mean, SD)
  Imputation 1                                          138.1 (42.0)   135.3 (41.1)   0.07      180.2 (62.4)   176.2 (53.8)   0.10
  Imputation 2                                          139.3 (40.8)                  0.011     180.3 (57.8)                  0.08
  Imputation 3                                          136.8 (40.8)                  0.32      179.2 (60.7)                  0.21

* SD, standard deviation.
† One pound = 0.45 kg.
with the largest fraction associated with the coefficient on
age. The relative efficiency of the estimate for age from the
combined imputations was 82 percent.
Finally, we present an example using fewer cases and a
frequently used modeling strategy of backward selection.
Using predictors of mortality in the original cohort (21), we
explored their significance as predictors of death in the smaller
African-American cohort. The variables considered are
presented in table 8, and those remaining in the backward
selection model for the complete case and imputed data analyses are
shown in table 9. The same set of variables was selected for
each of the three imputed data sets, and these did not coincide
with those selected using the complete cases. When the variables
chosen by the selection procedure on the imputed data set
were entered into a Cox model for the observed data, point
estimates were similar, but not all variables attained statistical
significance. There was little difference between the results for
a single imputation and those for the combined multiple
imputation (table 10), and the fraction of missing information was
less than 10 percent for all covariates.

TABLE 6. Comparison of complete case and multiple imputation model results for left ventricular mass

                                              Complete case (n = 3,177)                      Combined imputation (n = 5,201)
Covariate                                     Coefficient (SE*)  95% CI*           p value   Coefficient (SE)   95% CI            p value
Age, years                                    0.30 (0.15)        0.01, 0.60        0.044     0.38 (0.18)        –0.07, 0.83       0.08
Male sex                                      19.9 (1.8)         16.4, 23.4        <0.0001   21.4 (1.5)         18.3, 24.4        <0.0001
Current smoker                                4.2 (2.3)          –0.34, 8.8        0.07      2.7 (2.3)          –2.0, 7.4         0.25
Weight, pounds†                               0.56 (0.03)        0.50, 0.62        <0.0001   0.57 (0.03)        0.51, 0.63        <0.0001
History of myocardial infarction              9.8 (2.8)          4.4, 15.2         0.0002    9.5 (2.3)          4.8, 14.1         0.0001
History of congestive heart failure           29.0 (4.0)         21.2, 36.7        <0.0001   29.7 (4.0)         21.0, 38.4        <0.0001
Diastolic blood pressure, mmHg                –0.32 (0.08)       –0.47, –0.16      <0.0001   –0.27 (0.07)       –0.40, 0.13       0.0001
Systolic blood pressure, mmHg                 0.32 (0.04)        0.24, 0.41        <0.0001   0.27 (0.04)        0.19, 0.35        <0.0001
History of hypertension                       4.8 (1.6)          1.7, 7.9          0.002     6.4 (1.4)          3.6, 9.3          <0.0001
Total cholesterol, mg/dl                      –0.055 (0.02)      –0.094, –0.017    0.005     –0.031 (0.02)      –0.070, 0.008     0.11
High density lipoprotein cholesterol, mg/dl   –0.14 (0.05)       –0.25, –0.04      0.006     –0.12 (0.04)       –0.20, 0.03       0.007
Major electrocardiogram abnormality           17.3 (1.8)         13.7, 20.8        <0.0001   18.0 (2.0)         13.4, 22.6        <0.0001
Minor electrocardiogram abnormality           8.1 (1.5)          4.4, 15.2         <0.0001   7.8 (1.3)          5.2, 10.5         <0.0001
* SE, standard error; CI, confidence interval.
† One pound = 0.45 kg.

TABLE 7. Comparison of results from three imputations of left ventricular mass

                                              Imputation 1                 Imputation 2                 Imputation 3
Covariate                                     Coefficient (SE*)  p value   Coefficient (SE)   p value   Coefficient (SE)   p value
Age, years                                    0.24 (0.12)        0.041     0.45 (0.12)        <0.0001   0.45 (0.12)        <0.0001
Male sex                                      21.6 (1.5)         <0.0001   21.7 (1.4)         <0.0001   20.8 (1.4)         <0.0001
Current smoker                                1.9 (1.9)          0.32      2.2 (1.9)          0.23      3.9 (1.9)          0.04
Weight, pounds†                               0.56 (0.02)        <0.0001   0.56 (0.02)        <0.0001   0.58 (0.02)        <0.0001
History of myocardial infarction              8.7 (2.1)          <0.0001   9.3 (2.1)          <0.0001   10.4 (2.1)         <0.0001
History of congestive heart failure           32.2 (3.1)         <0.0001   28.0 (3.0)         <0.0001   28.0 (3.0)         <0.0001
Diastolic blood pressure, mmHg                –0.29 (0.06)       <0.0001   –0.264 (0.07)      <0.0001   –0.245 (0.06)      <0.0001
Systolic blood pressure, mmHg                 0.29 (0.04)        <0.0001   0.266 (0.04)       <0.0001   0.253 (0.03)       <0.0001
History of hypertension                       7.0 (1.3)          <0.0001   6.3 (1.3)          <0.0001   5.9 (1.3)          <0.0001
Total cholesterol, mg/dl                      –0.038 (0.02)      0.019     –0.021 (0.02)      0.18      –0.035 (0.02)      0.03
High density lipoprotein cholesterol, mg/dl   –0.11 (0.04)       0.011     –0.12 (0.04)       0.005     –0.13 (0.04)       0.002
Major electrocardiogram abnormality           19.3 (1.4)         <0.0001   17.9 (1.4)         <0.0001   16.8 (1.4)         <0.0001
Minor electrocardiogram abnormality           8.2 (1.2)          <0.0001   7.3 (1.2)          <0.0001   8.0 (1.2)          <0.0001
* SE, standard error.
† One pound = 0.45 kg.
DISCUSSION
Using available statistical software, we imputed missing
baseline data on over 150 variables in the Cardiovascular
Health Study. The process involved detailed explorations of
the data in order to select from among dozens of correlated
variables the ones to impute together, to identify gross
outliers, to transform continuous variables that were not
normally distributed in order to satisfy the model assumptions of the imputation method, and to decide which variables to treat categorically, defining the cells within which
the continuous variables would have a common mean value.
We created three filled-in copies of the baseline data and
used these to replicate previously published results based on
analyses of observed data. We also compared covariate
values for those included in versus excluded from the
complete case analyses and found significant differences on
most variables. Despite these differences, the results of the
complete case analyses and the analyses using imputed data
were similar. Results from a single imputed data set differed
little from the combined results of three imputations.
Bivariate correlations of imputed variables in different
blocks were similar to the complete case correlations.
The consistency of our results in comparative analyses is not
unexpected, given that data were missing on 5 percent or fewer
cases for approximately 85 percent of the variables imputed,
implying that the imputed subset for most variables is relatively small. We explored many more models than are
presented and found no greater differences than those reported,
either between the observed and imputed data analysis results
or across results from multiple imputations. We included left
ventricular mass in two of the models reported because it was
missing on 35 percent of the original cohort, therefore
presenting what we believe may be a worst case scenario in the
Cardiovascular Health Study. As a predictor in the stroke
model, the hazard ratio for elevated left ventricular mass was
similar across all models. When left ventricular mass was the
outcome variable, there were some differences between the
complete case and imputation results. Age and cholesterol
were no longer significant in the imputed model, and the coefficient for smoking was reduced by 35 percent compared with
the complete case model. The relative efficiency of 82 percent
for the estimated coefficient of age quantifies the variability
across imputations and indicates the desirability of more than
three imputations with this amount of missing data.
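As a rough check on how the combined estimate for age arises, the sketch below applies Rubin's combining rules (1) to the three single-imputation age coefficients and standard errors in table 7. The function name pool_rubin is hypothetical, the inputs are the rounded published values, and the expressions for the degrees of freedom and the fraction of missing information are the standard ones given by Rubin. It is an assumption, not confirmed by the text, that the imputation software reports the same quantities.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Combine m single-imputation estimates using Rubin's rules.

    Assumes at least two imputations and a nonzero between-imputation variance.
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)                        # pooled point estimate
    within = statistics.fmean([se ** 2 for se in std_errors])  # mean within-imputation variance
    between = statistics.variance(estimates)                   # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between                     # total variance
    r = (1 + 1 / m) * between / within                         # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2                            # degrees of freedom for the t reference
    gamma = (r + 2 / (df + 3)) / (r + 1)                       # fraction of missing information
    return {"estimate": q_bar, "se": math.sqrt(total), "df": df, "gamma": gamma}


# Age coefficients and standard errors for imputations 1-3 (table 7, rounded).
age = pool_rubin([0.24, 0.45, 0.45], [0.12, 0.12, 0.12])
print({name: round(value, 3) for name, value in age.items()})
```

With these rounded inputs, the pooled coefficient and standard error agree with the combined values for age in table 6, and inserting the resulting fraction of missing information into (1 + γ/3)^(-1) gives a relative efficiency close to the 82 percent reported.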
TABLE 8. Comparison of data by completeness status: death in the African-American cohort

                                              Cases included in     Cases excluded from
                                              complete case         complete case
Variable                                      analysis (n = 496)    analysis (n = 191)    p value
Age, years (mean, SD*)                        72.3 (5.34)           74.7 (6.5)            <0.001
Male sex (%)                                  38.7                  33.5                  0.21
Income, >$50,000 (%)                          4.6                   3.1                   0.38
Weight, pounds† (mean, SD)                    174.4 (34.0)          166.9 (38.9)          0.02
Exercise, kcal/week (mean, SD)                1,047 (1,331)         877 (1,326)           0.13
Smoking, pack-years (mean, SD)                14.7 (25.4)           12.7 (20.6)           0.29
Brachial blood pressure, mmHg (mean, SD)      149.6 (23.2)          153.8 (25.6)          0.04
Tibial blood pressure, mmHg (mean, SD)        155.0 (32.7)          147.1 (35.1)          0.006
Diuretic use (%)                              38.9                  36.6                  0.59
Fasting glucose, mg/dl (mean, SD)             116.5 (46.0)          124.5 (56.3)          0.08
Albumin, mg/dl (mean, SD)                     3.96 (0.28)           3.95 (0.25)           0.66
Creatinine, mg/dl (mean, SD)                  1.08 (0.35)           1.11 (0.33)           0.34
Congestive heart failure (%)                  5.4                   7.3                   0.35
Major electrocardiogram abnormality (%)       38.9                  42.9                  0.34
Carotid stenosis of >50% (%)                  2.0                   2.6                   0.63
Instrumental ADL* difficulties (%)                                                        0.02
  0, 1                                        94.4                  88.5
  2                                           3.6                   5.8
  3–5                                         2.0                   5.8
Digit symbol score (mean, SD)                 29.3 (12.9)           22.4 (15.8)           <0.001
Self-assessed health (%)                                                                  <0.001
  Excellent                                   9.5                   1.6
  Very good                                   15.9                  19.9
  Good                                        35.7                  30.9
  Fair                                        31.7                  34.0
  Poor                                        7.3                   13.6
Death by June 30, 1997 (%)                    12.1                  14.7                  0.37
* SD, standard deviation; ADL, activities of daily living.
† One pound = 0.45 kg.

The strengths of our imputation approach are its
comprehensiveness and generality. The intent is for all
Cardiovascular Health Study analysts to be able to utilize the imputed
data without having to run separate imputations for each
analysis. There are also limitations associated with our
approach. We did not include any follow-up variables in the
imputations, leaving the potential for missing associations
with variables such as stroke and death (8, 9). Our exploratory
analyses of these variables and the fact that most covariates
were missing values for fewer than 5 percent of the
participants suggest that it is unlikely that important
associations will be missed. We chose not to include outcomes in
the baseline data imputation, because event data collection is
ongoing and because data would have been imputed on the
basis of only the earliest outcomes. Similarly, associations
among correlated variables not in the same block have the
potential of being dampened because they were not imputed
together. Our postimputation exploration of correlations
across blocks is reassuring in this regard. To further examine
the issue, we investigated the five variables in the regression
model for left ventricular mass in table 6 that were not
included in the imputation of left ventricular mass. Those
variables were total and high density lipoprotein cholesterol,
smoking, diastolic blood pressure, and minor electrocardiogram
abnormality, which, although correlated with left
ventricular mass, did not add significantly to its prediction
given the other variables in the block of echocardiography
data, which accounted for 91 percent of the variability in left
ventricular mass. To quantify the effect of omitting these
covariates from the imputation, we reimputed left ventricular
mass adding these five variables and reran the multiple
imputation model in table 6. The coefficient for smoking
increased from 2.7 to 4.0, closer to the complete case value,
but remained nonsignificant. The significance of the correlated
variables high density lipoprotein and total cholesterol
alternated, with high density lipoprotein cholesterol becoming
less significant (β = –0.10 (standard error, 0.06); p = 0.17)
and total cholesterol becoming more significant (β = –0.059
(standard error, 0.02); p = 0.001). In view of these results, we
suggest that, to tease out associations involving highly
correlated covariates in a regression analysis of a variable that has
a substantial percentage of missing data, a model-specific
imputation be done. For the majority of analyses in the
Cardiovascular Health Study, our explorations suggest that
the centrally created imputed data sets would preserve
associations while increasing power.
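The dampening of associations described above can be illustrated with a small simulation. The Python sketch below is not the method used in this study: it uses synthetic data, a single stochastic regression imputation rather than proper multiple imputation, and hypothetical variable names. It assumes an outcome y that is missing completely at random for 35 percent of cases and compares an imputation model that ignores a covariate x with one that includes it.

```python
import random
import statistics


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(2003)  # fixed seed so the sketch is reproducible
n = 5000
true_slope = 0.5

# Synthetic covariate and outcome; the outcome is missing for about 35% of cases.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * a + rng.gauss(0, 1) for a in x]
missing = [rng.random() < 0.35 for _ in range(n)]

obs_x = [a for a, miss in zip(x, missing) if not miss]
obs_y = [b for b, miss in zip(y, missing) if not miss]

# Imputation model A: draws from the marginal distribution of y, ignoring x.
mean_y = statistics.fmean(obs_y)
sd_y = statistics.stdev(obs_y)
y_a = [rng.gauss(mean_y, sd_y) if miss else b for b, miss in zip(y, missing)]

# Imputation model B: regression of y on x plus a random residual.
b1 = slope(obs_x, obs_y)
b0 = mean_y - b1 * statistics.fmean(obs_x)
resid_sd = statistics.stdev([b - (b0 + b1 * a) for a, b in zip(obs_x, obs_y)])
y_b = [b0 + b1 * a + rng.gauss(0, resid_sd) if miss else b
       for a, b, miss in zip(x, y, missing)]

slope_a = slope(x, y_a)  # attenuated toward zero
slope_b = slope(x, y_b)  # close to the true slope
print(round(slope_a, 2), round(slope_b, 2))
```

In this simplified setting, the slope estimated after imputing without x is pulled toward zero roughly in proportion to the fraction imputed, whereas including x in the imputation model approximately recovers the true slope.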
TABLE 9. Comparison of results from backward elimination procedure: death in the African-American cohort

                                              Complete case (n = 496; 60 deaths)   Multiple imputation (n = 687; 88 deaths)
Variable                                      Hazard ratio  95% CI*      p value   Hazard ratio  95% CI       p value
Age, years                                    1.08          1.04, 1.13   0.0003    1.06          1.02, 1.10   0.006
Male sex                                      1.99          1.13, 3.50   0.018     1.95          1.23, 3.07   0.004
Smoking, pack-years
  Never smoked                                1.00                                 1.00
  1–25                                        0.92          0.45, 1.88   0.83      1.21          0.68, 2.13   0.518
  26–60                                       2.64          1.30, 5.37   0.007     2.14          1.19, 3.88   0.012
  >60                                         2.84          1.09, 7.42   0.03      2.86          1.32, 6.20   0.008
Exercise, ln(kcal/week)                       0.89          0.81, 0.99   0.032
Brachial blood pressure, 10 mmHg              1.13          1.00, 1.28   0.05
Tibial blood pressure, 10 mmHg                0.90          0.84, 0.97   0.005     0.93          0.87, 0.99   0.016
Instrumental ADL* difficulties (maximum = 5)
  0, 1                                                                             1.00
  2                                                                                2.33          1.06, 5.11   0.035
  3–5                                                                              2.25          0.93, 5.44   0.073
Digit symbol score                            0.97          0.95, 0.99   0.016     0.98          0.96, 1.00   0.071
Glucose, ln(mg/dl)                                                                 1.94          1.08, 3.49   0.026
Major electrocardiogram abnormality                                                1.89          1.20, 2.98   0.006
* CI, confidence interval; ADL, activities of daily living.

TABLE 10. Comparison of results for variables found significant in imputed data set: death in the African-American cohort

                                              Complete case (n = 530; 65 deaths)   Single imputation (n = 687; 88 deaths)   Multiple imputation (n = 687; 88 deaths)
Variable                                      Hazard ratio  95% CI*      p value   Hazard ratio  95% CI       p value       Hazard ratio  95% CI       p value
Age, years                                    1.08          1.03, 1.13   0.001     1.06          1.02, 1.10   0.004         1.06          1.02, 1.10   0.006
Male sex                                      1.56          0.91, 2.66   0.11      1.96          1.25, 3.09   0.004         1.95          1.23, 3.07   0.004
Smoking, pack-years
  Never smoked                                1.00                                 1.00                                     1.00
  1–25                                        0.96          0.48, 1.92   0.91      1.22          0.69, 2.17   0.490         1.21          0.68, 2.13   0.518
  26–60                                       2.96          1.49, 5.91   0.002     2.16          1.20, 3.89   0.010         2.14          1.19, 3.88   0.012
  >60                                         2.97          1.14, 7.77   0.026     2.80          1.29, 6.06   0.009         2.86          1.32, 6.20   0.008
Tibial blood pressure, 10 mmHg                0.95          0.90, 1.01   0.13      0.93          0.88, 0.99   0.020         0.93          0.87, 0.99   0.016
Instrumental ADL* difficulties (maximum = 5)
  0, 1                                        1.00                                 1.00                                     1.00
  2                                           2.03          0.84, 4.87   0.11      2.34          1.07, 5.13   0.033         2.33          1.06, 5.11   0.035
  3–5                                         2.09          0.48, 8.98   0.32      2.21          0.91, 5.34   0.079         2.25          0.93, 5.44   0.073
Digit symbol score                            0.97          0.95, 0.99   0.010     0.98          0.97, 1.00   0.079         0.98          0.96, 1.00   0.071
Glucose, ln(mg/dl)                            2.02          0.97, 4.20   0.059     2.05          1.15, 3.65   0.016         1.94          1.08, 3.49   0.026
Major electrocardiogram abnormality           1.78          1.06, 2.98   0.028     1.95          1.27, 3.02   0.003         1.89          1.20, 2.98   0.006
* CI, confidence interval; ADL, activities of daily living.

There are advantages to using the imputed data in terms of
power and variable selection. The sample size in the
Cardiovascular Health Study is large enough that inadequate power is
rarely a concern. In smaller studies or in subset analyses, the
increase in power from using imputed data may be substantial.
In model selection, identifying an optimal set of covariates
from among the many correlated variables collected in the
Cardiovascular Health Study and other large, observational
studies is always a challenge. Missing data complicate that
process and may influence the choice of variables to consider,
with a preference for excluding from consideration those
missing many cases. With filled-in data available on all cases,
variables need not be eliminated because of missing data, and
models resulting from different variable groups would always
include the same cases, providing consistency in numbers
reported within a paper or between papers on the same study.
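The gain in precision from retaining all cases can be gauged from the confidence intervals in table 9. The Python sketch below is a rough illustration only. It backs out an approximate standard error for the log hazard ratio of male sex from each published 95 percent confidence interval and compares the ratio of the two with the ratio expected if precision scaled with the square root of the number of deaths. The function name log_hr_se is hypothetical, the normal multiplier 1.96 is an approximation (the multiple imputation interval rests on a t reference distribution), and the square-root scaling is a rule of thumb rather than a result from this study.

```python
import math


def log_hr_se(lower, upper, z=1.96):
    """Approximate standard error of a log hazard ratio from its 95% CI limits."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Male sex, table 9 (published, rounded confidence limits).
se_cc = log_hr_se(1.13, 3.50)  # complete case analysis, 60 deaths
se_mi = log_hr_se(1.23, 3.07)  # multiple imputation analysis, 88 deaths

observed_ratio = se_mi / se_cc
expected_ratio = math.sqrt(60 / 88)  # rough 1/sqrt(deaths) scaling

print(round(se_cc, 3), round(se_mi, 3))
print(round(observed_ratio, 2), round(expected_ratio, 2))
```

For this covariate the two ratios are similar, which is consistent with most of the narrowing of the interval being attributable to the additional deaths available once all cases are included.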
We would like to encourage investigators in epidemiologic
studies to avail themselves of the programs available (10) for
state-of-the-art imputation of missing data. Although the data
preparation and variable selection are time consuming, much
of that work is done in the context of data analysis. The
programs themselves run very quickly, and the method has
been shown to be superior to other methods of imputation. As
Rubin has stated, the multiple imputation method provides
statistically valid inferences in the challenging setting where
ultimate users of the data are not the database constructors,
where a variety of analyses and models will be used, and
where there is no one reason for the missing data (5). In the
setting of a large, observational study where it would be
impractical to impute all data in one large model, we have
demonstrated an approach to creating multiple imputed data
sets. In observational studies with minimal missing data and
with no reason to suspect that data are not missing at random,
our exploration provides some reassurance that findings
published prior to implementation of missing data replacement would not have changed much had an optimal method
for missing data imputation been used.
ACKNOWLEDGMENTS
The research reported in this article was supported by
contracts N01-HC-85079 through N01-HC-85086, N01-HC-35129, and N01-HC-15103 from the National Heart, Lung,
and Blood Institute.
The following institutions and principal investigators participated in this study: Wake Forest University School of Medicine,
Dr. Gregory L. Burke; Wake Forest University—Electrocardiogram Reading Center, Dr. Pentti Rautaharju; University of
California, Davis, Dr. John Robbins; The Johns Hopkins
University, Dr. Linda P. Fried; The Johns Hopkins University—
MRI Reading Center, Dr. Nick Bryan and Dr. Norm J. Beauchamp; University of Pittsburgh, Dr. Lewis H. Kuller; University of California, Irvine—Echocardiography Reading Center
(baseline), Dr. Julius M. Gardin; Georgetown Medical
Center—Echocardiography Reading Center (follow-up), Dr.
John Gottdiener; New England Medical Center, Boston—
Ultrasound Reading Center, Dr. Daniel H. O’Leary; University
of Vermont—Central Blood Analysis Laboratory, Dr. Russell
P. Tracy; University of Arizona, Tucson—Pulmonary Reading
Center, Dr. Paul Enright; Retinal Reading Center, University of
Wisconsin, Dr. Ron Klein; University of Washington—Coordinating Center, Dr. Richard A. Kronmal; National Heart, Lung,
and Blood Institute Project Office, Dr. Diane Bild.
REFERENCES
1. Rubin DB. Multiple imputation for nonresponse in surveys.
New York, NY: Wiley, 1987.
2. Little RJA, Rubin DB. Statistical analysis with missing data.
New York, NY: Wiley, 1989.
3. Rubin DB, Schenker N. Multiple imputation in health-care
databases: an overview and some applications. Stat Med 1991;
10:585–98.
4. Little RJA. Regression with missing x’s: a review. J Am Stat
Assoc 1992;87:1227–37.
5. Rubin DB. Multiple imputation after 18+ years. J Am Stat
Assoc 1996;91:473–89.
6. Greenland S, Finkle WD. A critical look at methods for handling missing covariates in epidemiologic regression analysis.
Am J Epidemiol 1995;142:1255–64.
7. Vach W, Blettner M. Biased estimation of the odds ratio in
case-control studies due to the use of ad hoc methods of correcting for missing values for confounding variables. Am J Epidemiol 1991;134:895–907.
8. Schafer JL. Analysis of incomplete multivariate data. New
York, NY: Chapman & Hall, 1997.
9. Schafer JL. Multiple imputation: a primer. Stat Methods Med
Res 1999;8:3–15.
10. Schafer JL. Software for multiple imputation. University Park,
PA: The Pennsylvania State University Department of Statistics, 1999. (http://www.stat.psu.edu/~jls/misoftwa.html).
11. van Buuren S, Oudshoorn K, eds. Multiple imputation online.
Leiden, Netherlands: TNO Prevention and Health, Department
of Statistics, 2001. (http://www.multiple-imputation.com).
12. Fried LP, Borhani NO, Enright PL, et al. The Cardiovascular
Health Study: design and rationale. Ann Epidemiol 1991;1:
263–76.
13. Tell GS, Fried LP, Hermanson B, et al. Recruitment of adults
65 years and older as participants in the Cardiovascular Health
Study. Ann Epidemiol 1993;3:358–66.
14. Manolio T, Hermanson B, Hill J, et al. Respondent burden in
studies of the elderly: experience from the Cardiovascular
Health Study (CHS). Inclusion of elderly individuals in clinical
trials. In: Proceedings of an American College of Cardiology
Workshop, 1993:135–47. Seattle, WA: The Cardiovascular
Health Study, 2001. (http://chs3.chs.biostat.washington.edu/
chs/abstract/maan93.htm).
15. Barnard J, Meng XL. Applications of multiple imputation in
medical studies: from AIDS to NHANES. Stat Methods Med
Res 1999;8:17–36.
16. Ezzati-Rice TM, Johnson W, Khare M, et al. A simulation
study to evaluate the performance of model-based multiple
imputations in NCHS Health Examination Surveys. In: Proceedings of the Bureau of the Census 11th Annual Research
Conference. Washington, DC: US Department of Commerce,
1995:257–66.
17. Schafer JL, Khare M, Ezzati-Rice TM. Multiple imputation of
missing data in NHANES III. In: Proceedings of the Bureau of
the Census Ninth Annual Research Conference. Washington,
DC: US Department of Commerce, 1993:459–87.
18. Schafer JL, Olsen MK. Multiple imputation for multivariate
missing-data problems: a data analyst’s perspective. Multivariate
Behav Res 1998;33:545–71.
19. Manolio TA, Kronmal RA, Burke GL, et al. Short term predictors of incident stroke in older adults. The Cardiovascular
Health Study. Stroke 1996;27:1479–86.
20. Gardin JM, Arnold A, Gottdiener JS, et al. Left ventricular
mass in the elderly. The Cardiovascular Health Study.
Hypertension 1997;29:1095–103.
21. Fried LP, Kronmal RA, Newman AB. Risk factors for 5-year
mortality in older adults. The Cardiovascular Health Study.
JAMA 1998;279:585–92.