American Journal of Epidemiology
Copyright © 1996 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 143, No. 5
Printed in U.S.A.
Measurement of Exposure to Nutrients: An Approach to the Selection of
Informative Foods
Steven D. Mark,1 Donald G. Thomas,1 and Adriano Decarli2
Frequently, epidemiologic questionnaires are designed to measure several individual level exposures,
including exposure to one or more nutrients. Although most nutrients are contained in a large number of foods,
constraints on questionnaire length permit the inclusion of only a subset of these. In this paper, the authors
review the two common methods of food selection, and they propose two new methods. When the intent is
to estimate the effect of the nutrient on disease risk using a logistic regression model, the authors show that
their Max_r method is optimal. With the use of case-control data, they examine the assumption of nondifferential measurement error that is essential to the validity of all analyses that rely on shortened questionnaires. They conclude by combining the statistical considerations developed for judging adequacy of a
selection method with their empirical results and suggest new goals for dietary questionnaires and a new
approach to questionnaire construction consistent with those goals. Am J Epidemiol 1996;143:514-21.
diet; epidemiologic methods; measurement error; misclassification; nutrition; questionnaires
Received for publication November 22, 1994, and in final form June 25, 1995.
Abbreviations: ARE, asymptotic relative efficiency; CFQ, complete food questionnaire; CS, criterion statistic; PC, personal computer; %TP, percent of total population intake.
1 Biostatistics Branch, National Cancer Institute, Bethesda, MD.
2 Istituto di Statistica Medica e Biometria, University of Milan, and Istituto Nazionale Tumori, Milan, Italy.
Correspondence to Dr. Steven D. Mark, Biostatistics Branch, National Cancer Institute, Executive Plaza North, Room 403, 6130 Executive Blvd., MSC 7368, Bethesda, MD 20892-7368.
Many epidemiologic studies rely on assessments of an individual's exposure to a particular nutrient or nutrients, either to estimate the effect of variation of intake of that nutrient on disease risk, or to control for the variation of that nutrient while estimating the effect of some other exposure. Typically, these assessments are made through self-administered questionnaires that enquire about the frequency and portions of foods consumed. Combined with databases that contain information on how much nutrient is contained in each unit of food (nutrient density), an individual's exposure is calculated as the sum of the food-specific nutrient times the quantity of that food consumed. Unfortunately, most nutrients are contained in such a large number of foods that limitations of questionnaire size (1-12) prevent including them all. The problem arises as to which foods to include, how to judge the adequacy of the selected foods, and how to assess the potential gains of a more inclusive list. Typically, the selection of foods is made by utilizing information from a complete food questionnaire (CFQ) (1-12) which exists for some group of individuals thought to be representative of the population in which the study is to be conducted. This CFQ may come from an extensive questionnaire applied once to a specific cohort of interest that is to be followed over time (1-3, 5), or from a population-based survey such as the second National Health and Nutrition Examination Survey (NHANES II) (10, 11).
In the first section of the paper, we review two commonly used methods of food selection and provide two new methods. In section 2, we discuss statistical considerations that determine the optimality of any selection procedure and show the desirable features that our Max_r method attains. In section 3, we illustrate the differences between the approaches by applying each method to a 146-item CFQ questionnaire administered to 1,623 randomly selected Italians, 1,016 of whom served as controls in a case-control study of gastric cancer (6-8). In particular, we contrast the performance when the task is to choose, for each of seven nutrients, a subset of 10 foods. In section 4, we discuss the assumption of nondifferential misclassification that is fundamental to all the selection schemes. We use the case-control data to test its veracity, and we examine the effect on estimation of odds ratios when nutrient sums based on 10 foods replace those based on the entire 146. In the Discussion, we combine the statistical considerations with the empirical result. We raise potential alternatives to the standard procedure of measuring exposure by using simple unweighted sums of the selected foods, suggest new goals for dietary questionnaires, and sketch out an approach to questionnaire construction consistent with those goals.
FOUR METHODS OF SELECTION
For concreteness, we focus on measuring dietary nitrates, a nutrient hypothesized to have an etiologic role in gastric cancer. From the CFQ, we know that nitrate is contained in 119 of the 146 foods. We wish to select 10 foods to measure so as to minimize the effect on the results of our study of having the nutrient sums, $W_i$, based on the 10 measured foods for individual i, rather than the true nutrient sums, $Z_i$, based on the 119 foods. Letting $F_{ij}$ denote the amount of nitrate consumed by individual i in food j, Block et al. (9, 10) have proposed that the foods be chosen to maximize the percent of total population intake,
$$\%TP = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} Z_i} \times 100,$$
where N represents the total number of individuals in the CFQ. Although not explicitly formulated by those investigators in this manner, one justification for this approach is that it chooses the foods that, on the average, make $W_i$ "as close" to $Z_i$ as possible, where distance is measured as $(Z_i - W_i)$. Treating our CFQ as if it were the population, we maximize the sample %TP by computing the average nitrate of each of the 119 foods and choosing the foods with the 10 largest averages. We call this method MOM1 because it is based on the first moments of the $Z_i$. The adequacy of the 10 foods selected is assessed by computing %TP (9-12), the criterion statistic (CS) for MOM1.
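As an illustration only, the MOM1 rule and its criterion statistic can be sketched in a few lines of Python. The function names, the toy intake matrix, and the tie-breaking rule (lowest food index first) are assumptions made for this sketch and are not part of the original analysis. Rows are individuals and columns are foods, so each entry plays the role of the amount of nutrient individual i obtains from food j.

```python
def select_mom1(intake, k):
    """Return the indices of the k foods with the largest mean nutrient intake.

    intake[i][j] is the nutrient that person i obtains from food j
    (hypothetical layout chosen for this sketch).
    """
    n_people = len(intake)
    n_foods = len(intake[0])
    means = [sum(row[j] for row in intake) / n_people for j in range(n_foods)]
    # Sort by descending mean; ties are broken by the lower food index.
    ranked = sorted(range(n_foods), key=lambda j: (-means[j], j))
    return sorted(ranked[:k])


def percent_total_population_intake(intake, chosen):
    """%TP: nutrient captured by the chosen foods as a percent of the total."""
    total = sum(sum(row) for row in intake)
    captured = sum(row[j] for row in intake for j in chosen)
    return 100.0 * captured / total


# Toy data: 3 people, 4 foods (made-up numbers).
toy_intake = [
    [5.0, 1.0, 0.0, 2.0],
    [5.0, 3.0, 1.0, 0.0],
    [5.0, 0.0, 2.0, 4.0],
]
chosen_foods = select_mom1(toy_intake, 2)
tp = percent_total_population_intake(toy_intake, chosen_foods)
print(chosen_foods, tp)
```

In the toy data the first food is eaten in the same amount by everyone, yet it is selected first because only its mean matters to this rule.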
If we knew the absolute range of nitrate intake that
was compatible with health, and our goal were to
survey a population to determine the distribution of
people within this range, MOM1 might be the method
of choice. However, as many authors have emphasized
(1, 4, 5, 13), when the intent is to test for and estimate
the effect of a nutrient on disease rates, classifying
individuals as accurately as possible on some absolute
scale is not as important as preserving the relation of
individuals to each other. Willett (1) cites a hypothetical universe in which everyone eats exactly one carrot
a day: in this case, enquiring about carrots would not
be helpful in distinguishing high from low consumers
of beta-carotene, even if carrots were the largest
source of beta-carotene in the diet. Willett, adapting a
proposal by Heady (14), recommends stepwise regression as the means to select the foods. This algorithm,
which we call Stepwise, has been frequently adopted
(2, 4, 5, 6) as the primary means of compiling abbreviated food questionnaires. To select the 10 best foods
for nitrates with Stepwise, the total nitrate intake $Z_i$ is defined to be the dependent variable, and the 119 foods that contain nitrate are the independent variables. The first 10 foods to enter the regression using a forward selection stepwise regression algorithm constitute the choice. Stepwise regression algorithms are an approximate means of maximizing the model $R^2$, which is called the percent between-person variation explained, and is calculated as
$$R^2 = \frac{\sum_{i=1}^{N} (\hat{Z}_i - \bar{Z})^2}{\sum_{i=1}^{N} (Z_i - \bar{Z})^2} \times 100,$$
where the overbar indicates a sample average and $\hat{Z}_i$ is the predicted value of the nitrate for individual i based on the 10 foods chosen and their estimated regression coefficients. $R^2$ is used as a measure of how well the selected foods capture the interindividual variability. Typically foods are chosen to reach an $R^2$ of at least 80 percent (1). Because, for each individual, the nutrient score $W_i$ is used and not the $\hat{Z}_i$, a more accurate measure of between-person variation explained might be
$$R^2_W = \frac{\sum_{i=1}^{N} (W_i - \bar{W})^2}{\sum_{i=1}^{N} (Z_i - \bar{Z})^2} \times 100.$$
$R^2$ (the CS for Stepwise) will always be at least as large as $R^2_W$, and is usually larger: using Stepwise to select foods for nitrates, $R^2$ = 0.959, whereas $R^2_W$ = 0.782.
Our proposal, like Stepwise, intends to choose the 10 foods that best preserve the relation between individuals when the surrogate $W_i$ is substituted for the true $Z_i$. If individuals are categorized in terms of their distance from the mean, and distance is measured in terms of standard deviation, person i's true distance is $\Delta_i = (Z_i - E[Z_i])/\sigma_Z$, and his/her classified distance is $\delta_i = (W_i - E[W_i])/\sigma_W$. Here, $\sigma_Z$ is the standard deviation of Z, $\sigma_W$ is the standard deviation of W, and $E[\cdot]$ is the expectation operator. The goal is to choose the 10 foods that minimize the mean squared error, $E[(\Delta_i - \delta_i)^2]$. We utilize the CFQ to select the subset of 10 foods that minimizes the sample analog of this quantity. Because $E[(\Delta_i - \delta_i)^2]$ equals 2(1 - correlation of $Z_i$ and $W_i$), minimizing the standardized differences is the same as maximizing the sample correlation coefficient of $W_i$ with $Z_i$ (r), and we call this method Max_r. For Max_r, r is the CS and measures the adequacy of the foods chosen. Just as stepwise regression only approximately maximizes $R^2$, Max_r only approximately maximizes r. To absolutely determine the best 10 of 119 foods requires examining the correlation coefficient in over $10^{14}$ subsets. We have developed an efficient direct search algorithm for the personal computer (PC) that enumerates the space of all possible subsets and calculates the r for approximately $5 \times 10^9$ distinct subsets per hour on a 90-MHz Pentium PC. The program permits the implementation of various selection strategies. For this
paper, where $n_a$ is the number of foods which contain nutrient a ($n_{nitrate}$ = 119), the absolute best five of the $n_a$ are chosen, then with those five fixed, the next absolute best five are chosen from the remaining foods. For nitrates, this procedure results in r = 0.977, and required 40 minutes. The default search mechanism implemented by the Max_r procedure uses a less time-intensive approach, ranks all 119 foods in less than 3 seconds, and ends up with the same optimal 10. The success of the default mechanism in arriving at the identical 10 as the more time-intensive method was duplicated in six of the seven nutrients we examined.
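The size of the exhaustive search, and one possible time-saving strategy, can be illustrated with a short Python sketch. The greedy forward search below is an assumption chosen for illustration: it is not the authors' program, whose default mechanism is not described in detail here, and it carries no guarantee of finding the best subset. The function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def greedy_max_r(intake, k):
    """Add, one at a time, the food that most increases corr(W, Z).

    intake[i][j] is the nutrient person i obtains from food j; Z is the row
    total over all foods and W the unweighted sum over the chosen foods.
    """
    z = [sum(row) for row in intake]
    w = [0.0] * len(intake)
    chosen = []
    for _ in range(k):
        best_j, best_r = None, -2.0
        for j in range(len(intake[0])):
            if j in chosen:
                continue
            candidate = [wi + row[j] for wi, row in zip(w, intake)]
            r = pearson(candidate, z)
            if r > best_r:
                best_j, best_r = j, r
        chosen.append(best_j)
        w = [wi + row[best_j] for wi, row in zip(w, intake)]
    return chosen, pearson(w, z)


# Number of distinct 10-food subsets of 119 foods.
n_subsets = math.comb(119, 10)
print(n_subsets)

# Toy data: food 0 is eaten identically by everyone, so it cannot rank people.
toy_intake = [
    [10.0, 1.0, 0.0],
    [10.0, 5.0, 1.0],
    [10.0, 9.0, 0.0],
    [10.0, 13.0, 1.0],
]
chosen, r = greedy_max_r(toy_intake, 2)
print(chosen, r)
```

In the toy data the food with the largest average intake is never selected, because it contributes nothing to the ordering of individuals.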
The simplest approach to capturing between-person variability would be to choose the foods with the 10 largest variances. This we call MOM2, because it is based on the first two moments of the distribution of the $Z_i$ ($R^2_W$ is the CS for MOM2). It can be shown that if the foods are uncorrelated, Stepwise, Max_r, and MOM2 will pick identical subsets. Although of course correlations between foods always exist, for many nutrients, it is true that the variances are much larger than the covariances and dominate the selection process. Indeed, for nitrates, the Stepwise, Max_r, and MOM2 procedures all pick identical sets of the best five and best 10 foods. The MOM1 method will give identical choices to the other three if, in addition to the $F_{ij}$ being uncorrelated, the foods with larger average intakes have larger variances (for instance, if the coefficient of variation were constant). For nitrates, the MOM1 method chooses two of the five, and seven of the 10, foods chosen by the other methods.
QUANTIFYING THE CONSEQUENCES OF
SELECTION
If Stepwise, Max_r, and MOM2 are similar in intent, is one preferable, or is perhaps some other method superior to all, and some other criterion statistic more meaningful than %TP, $R^2$, r, or $R^2_W$? When the goal is to make judgments regarding the relation between changes in $Z_i$ and changes in a disease probability ($\Pr[D_i]$), and if we assume that the two are linked by the logistic regression model
$$\operatorname{logit} \Pr[D_i] = \beta_0 + \beta_1 Z_i, \qquad (1)$$
then the important question is how to choose foods so that replacing $Z_i$ with $W_i$ in equation 1 minimizes the distortion in our conclusions. In particular, we would like to preserve the type 1 error level at the null hypothesis of no nutrient effect ($\beta_1 = 0$), maximize the power of tests to reject the null hypothesis when $\beta_1 \neq 0$, and minimize the bias of estimates of $\beta_1$.
Provided that the misclassification incurred by replacing $Z_i$ with $W_i$ is non-differential (see section on non-differential misclassification), all methods preserve the type 1 error level at the null hypothesis. Results from measurement error theory (15) show that the asymptotic relative efficiency (ARE) of tests of $\beta_1 = 0$ when $W_i$ is substituted for $Z_i$ in equation 1 equals $r^2$. Because ARE is a measure of the loss of power, to preserve power we wish to maximize r. In nutritional epidemiology, the power of a study is largely a function of the number of cases. If choosing five foods for nitrates gives an r of 0.945, and choosing 10 gives r = 0.977, then using five questions versus 119 is equivalent to having only 89 percent ($0.89 = 0.945^2$) of the cases, and using 10 questions rather than 119 is equivalent to having 95 percent of the cases.
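The arithmetic behind these equivalences is simple enough to check directly. The sketch below only restates ARE as the square of r; the function names and the figure of 500 cases are hypothetical and chosen for illustration.

```python
def asymptotic_relative_efficiency(r):
    """ARE of the test of no nutrient effect when W replaces Z: r squared."""
    return r * r


def effective_cases(n_cases, r):
    """Cases with the full questionnaire that would carry the same information."""
    return n_cases * asymptotic_relative_efficiency(r)


# r values quoted in the text for 5 and 10 nitrate foods.
for k, r in [(5, 0.945), (10, 0.977)]:
    are = asymptotic_relative_efficiency(r)
    print(f"{k} foods: r = {r}, ARE = {are:.2f}, "
          f"500 hypothetical cases act like {effective_cases(500, r):.1f}")
```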
To assess the effect on estimation when $\beta_1 \neq 0$ requires consideration of the regression of $Z_i$ on $W_i$. A model proposed by Rosner et al. (16) is
$$Z_i = \alpha_0 + \alpha_1 W_i + \varepsilon_i, \qquad (2)$$
where $\varepsilon_i$ is the error term. The model of equation 2 would be true if, for instance, the $(Z_i, W_i)$ were jointly normal. When equation 2 is true, Rosner's result relating the estimate of $\beta_1$ using $W_i$ ($\hat{\beta}_1(W_i)$) to the estimate of $\beta_1$ using $Z_i$ ($\hat{\beta}_1(Z_i)$) can be expressed as
$$\hat{\beta}_1(W_i) \approx \hat{\beta}_1(Z_i) \, r \, \frac{\sigma_Z}{\sigma_W}. \qquad (3)$$
Rosner et al. (16) and Carroll et al. (17) have shown that this approximation is robust to deviations in the error distribution. In particular, Rosner et al. have found it to be accurate when the $Z_i$ and $W_i$ are quintiles rather than continuous variables. Because the ratio of the standard deviations is always positive, equation 3 indicates that, provided r is positive, the direction of the true trend found using $Z_i$ will always be preserved when using $W_i$. If, as is commonly the case, rather than entering $Z_i$ in equation 1, the relevant metric is thought to be some function of the order statistics, such as the ranks or the quintiles, then $\sigma_Z = \sigma_W$, so that the coefficient is always attenuating and is equal to r. The least attenuation is provided by maximizing the r. If the CFQ used is indeed a random sample of the population of interest (or thought to represent such a sample), then the methods of Rosner et al. can be used to obtain consistent estimates of $\hat{\beta}_1(Z_i)$ from $\hat{\beta}_1(W_i)$.
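A minimal numerical sketch of the attenuation relation follows, assuming equation 3 has the usual regression-calibration form in which the coefficient obtained with W is approximately the coefficient obtained with Z multiplied by r and by the ratio of the standard deviation of Z to that of W. The function names and the numbers are hypothetical, and the inverse function is only the algebraic rearrangement of that relation, not a full implementation of the correction methods of Rosner et al.

```python
def attenuated_coefficient(beta_z, r, sd_z, sd_w):
    """Approximate coefficient obtained when W is used in place of Z."""
    return beta_z * r * sd_z / sd_w


def corrected_coefficient(beta_w, r, sd_z, sd_w):
    """Algebraic inverse of the relation above (assumes r is nonzero)."""
    return beta_w * sd_w / (r * sd_z)


# With ranks or quintiles the two standard deviations coincide,
# so the attenuation factor is r itself.
beta_z = 0.5  # hypothetical log odds ratio per unit of Z
beta_w = attenuated_coefficient(beta_z, r=0.8, sd_z=1.0, sd_w=1.0)
print(beta_w)
print(corrected_coefficient(beta_w, r=0.8, sd_z=1.0, sd_w=1.0))
```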
AN APPLICATION OF THE FOUR METHODS TO
SEVEN NUTRIENTS
We have applied the four methods to select 10 foods for each of seven different nutrients. We have undertaken this problem rather than the problem of assembling a more comprehensive list of 80 to 120 foods because 1) the small size of the list makes comparison tractable, and 2) this corresponds to the challenge presented when one is conducting a study in which the nutrient is only one of several risk factors of interest (4, 5, 11, 12). Due to the fact that the adequacy of short food lists to assess the intake of specific nutrients has been found to vary inversely as a function of the number of foods that contain that nutrient (5), we have chosen nutrients contained in a wide range of foods (78 to 164). We have also chosen nutrients which are of particular interest in gastric cancer and/or in other malignancies. Table 1 lists the number of foods chosen by each selection method that were not chosen by Max_r (number different), the ratio of the ARE's (method $r^2$/Max_r $r^2$), and the value of the four different criterion statistics r, $R^2$, $R^2_W$, and %TP.

TABLE 1. Comparison of the choice of 10 foods for seven nutrients using four methods*

Nutrient and statistic        Max_r     Stepwise  MOM2      MOM1
Alphatocopherol (n = 105)
  No. different                         0         3         2
  ARE ratio                             1.00      0.949     0.964
  r                           0.892     0.892     0.869     0.876
  R^2_W                       60.2      60.2      59.3      59.1
  R^2                         81.7      81.7      77.4      79.3
  %TP                         53.6      53.6      60.8      63.9
Beta-carotene (n = 78)
  No. different                         0         0         1
  ARE ratio                             1.00      1.00      0.992
  r                           0.985     0.985     0.985     0.981
  R^2_W                       89.2      89.2      89.2      84.3
  R^2                         97.1      97.1      97.3      96.4
  %TP                         79.3      79.3      79.3      81.1
Calories (n = 164)
  No. different                         2         3         4
  ARE ratio                             0.937     0.878     0.759
  r                           0.808     0.782     0.757     0.704
  R^2_W                       30.8      29.4      38.1      41.9
  R^2                         67.2      66.97     59.3      61.4
  %TP                         32.5      29.5      43.4      50.6
Nitrate (n = 119)
  No. different                         0         0         3
  ARE ratio                             1.00      1.00      0.988
  r                           0.977     0.977     0.977     0.971
  R^2_W                       78.2      78.2      78.2      74.7
  R^2                         96.8      95.9      95.8      94.8
  %TP                         66.2      66.2      66.2      69.5
Protein (n = 151)
  No. different                         3         5         5
  ARE ratio                             0.724     0.872     0.663
  r                           0.771     0.656     0.720     0.628
  R^2_W                       22.6      18.4      33.0      27.2
  R^2                         56.3      63.2      41.6      53.7
  %TP                         29.0      34.6      43.0      50.9
Thiamine (n = 130)
  No. different                         2         3         2
  ARE ratio                             0.970     0.892     0.810
  r                           0.849     0.836     0.802     0.764
  R^2_W                       34.3      31.5      34.2      35.5
  R^2                         74.6      76.9      65.4      66.0
  %TP                         35.7      35.1      47.5      50.8
Vitamin C (n = 87)
  No. different                         3         0         5
  ARE ratio                             0.973     1.00      0.884
  r                           0.95      0.937     0.95      0.893
  R^2_W                       64.7      63.5      64.7      53.9
  R^2                         94.0      91.92     94.0      91.8
  %TP                         61.4      57.6      61.4      67.4

* n = number of foods that contain nutrient; no. different = number of foods chosen by method not chosen by Max_r; ARE ratio = ($r^2$ for method)/($r^2$ for Max_r); r = Pearson correlation coefficient of $W_i$ with $Z_i$; $R^2_W$ = criterion statistic for MOM2; $R^2$ = criterion statistic for Stepwise; %TP = criterion statistic for MOM1.
Although not a perfect correspondence, the nutrients with lower n's generally have higher Pearson r's by the Max_r method. For Max_r, the lowest correlation (r = 0.77), and thus the greatest loss in efficiency (ARE = 59 percent), occurred for protein. For three of the nutrients (alphatocopherol, beta-carotene, and nitrate), the Stepwise procedure chose the same 10 foods as Max_r. Relative to Max_r, the lowest efficiency using Stepwise occurred for protein. Overall, the ARE of Stepwise for protein was 43 percent ($0.77^2 \times 0.72$). Despite disagreement on as many as three of the 10 foods selected, Stepwise had ARE's relative to Max_r greater than 90 percent for the other six nutrients. The ratio of apparent percent variation explained ($R^2$) to actual percent variation explained ($R^2_W$) ranged from 1.08 for beta-carotene to 3.4 for protein. Larger differences tended to occur for the nutrients with the larger n.
Similar results with regard to nutrient selections and efficiency were found using the MOM2 method. Not surprisingly, MOM1 generally performed worst in terms of ARE.
So far, we have assumed that the quantity $Z_i$ entered in equation 1 is the nutrient sum. If the rank or quintile were the measure of interest, the preferred course would be to choose $W_i$ to maximize the Spearman correlation coefficient, or the correlation of the quintile classification of $W_i$ and $Z_i$. Table 2 contains the Pearson r, the Spearman r, and the correlation between the quintiles for the subsets selected by the Max_r method. The differences between the Pearson and Spearman correlation coefficients are slight, so that considerations based on the Pearson r, or changes in the Pearson r, accurately reflect the Spearman r. The correlation among the quintiles, though lower, is similar in magnitude.
TABLE 2. Pearson, Spearman, and quintile correlations for $Z_i$ and the Max_r chosen $W_i$

Nutrient           Pearson r   Spearman r   Quintile r
Alphatocopherol    0.892       0.891        0.856
Beta-carotene      0.982       0.973        0.942
Calories           0.807       0.761        0.723
Protein            0.702       0.680        0.650
Nitrate            0.977       0.970        0.939
Thiamine           0.849       0.820        0.786
Vitamin C          0.966       0.957        0.925

NON-DIFFERENTIAL MISCLASSIFICATION
AND $\hat{\beta}_1(W)$
Fundamental to all of these methods is the concept that an investigator who knew the nutrient sum $Z_i$ totaled over all possible foods would not want to know the sum $W_i$ over a given subset. The proposition that interest in $W_i$ stems entirely from its role as a substitute for $Z_i$ is formally expressed in equation 4,
$$\Pr[D_i \mid Z_i, W_i] = \Pr[D_i \mid Z_i]. \qquad (4)$$
This assumption, which asserts the conditional independence of $D_i$ and $W_i$ given $Z_i$, is called non-differential misclassification and is crucial to all four of the selection methods. Without non-differential misclassification, even when $\beta_1 = 0$ in equation 1, the estimate $\hat{\beta}_1$ may be biased when $Z_i$ is replaced with $W_i$. Because, in the context of this paper, $W_i$ is a sum based on a subset of the nutrients contained in $Z_i$, it may seem as though equation 4 is always true. Nonetheless, as the following hypothetical example illustrates, differential misclassification can occur.
Suppose that a history of gastric cancer in a first-degree relative is a risk factor for the development of gastric cancer, but that the amount of beta-carotene one eats is not. Due to individuals' concern for their health and the supposition that green leafy vegetables prevent cancer, persons with a positive family history eat a wide variety of such foods, including kale and collards. Further suppose that persons without a positive family history seldom consume kale or collards, but eat more carrots and broccoli, so that overall they ingest the same amount of beta-carotene. In a complete food questionnaire that enquired about all possible vegetables, one would find that the nutrient sum for beta-carotene, $Z_i$, did not predict cancer. However, if $W_i$ were computed from a short list that excluded kale and collards, higher values of beta-carotene would be associated with lower probabilities of cancer. In this case, equation 4 would be false and beta-carotene would appear to be protective.
To simplify the presentation, we have written equations 1 and 4 as if nutrient exposure were the only measured covariate. In fact, testing and estimation are typically performed with multivariate models that include other covariates, $X_i$. Provided these $X_i$ are themselves measured without error, the essence of the arguments in the section on the consequences of selection still pertains when the regressions in equations 1 and 2 include these $X_i$ covariates. In particular, equation 3 remains true provided r is replaced by the partial correlation (controlling for $X_i$) of $Z_i$ and $W_i$, and the standard deviations are replaced with the standard deviations conditional on $X_i$. The requirement for non-differential misclassification becomes
$$\Pr[D \mid Z_i, X_i, W_i] = \Pr[D \mid Z_i, X_i]. \qquad (5)$$
In our hypothetical tale of kale, collards, carrots, broccoli, and beta-carotene, if $X_i$ contained information on family history of cancer, then equation 5 would be true and misclassification would be non-differential.
For each of our nutrients, we test the truth of equation 5 by performing a score test of $\beta_3 = 0$ in the logistic model
$$\operatorname{logit} \Pr[D \mid Z_i, X_i, W_i] = \beta_0 + \beta_1 Z_i + \beta_2' X_i + \beta_3 W_i, \qquad (6)$$
where $W_i$ is the sum of the 10 foods chosen by Max_r, and $\beta_2$ and $X_i$ are vectors of dimension 8. The covariates chosen for $X_i$ are seven non-nutritional covariates thought to be important by the original investigators (7, 8), i.e., age, sex, Quetelet index, family history of gastric cancer, study area, place of residence, and migration from south, plus the total calories, $Z_{cal}$. Using $Z_{cal}$ is in some sense "illegitimate": a study dependent on a short form would not have access to $Z_{cal}$. However, because adjustment for caloric intake is often considered essential by many nutritional epidemiologists (1), we include the $Z_{cal}$ covariate to demonstrate that our results are not due to its omission. Dropping $Z_{cal}$ from the model entirely, using the Max_r estimate $W_{cal}$ rather than $Z_{cal}$, or adjusting for calories by the residual method (1), makes no substantial difference to the tests and estimates of the parameters in table 3.
Column 1 of table 3 contains the score test of $\beta_3 = 0$ in equation 6. The null hypothesis is rejected for beta-carotene, nitrate, and thiamine, indicating that $W_i$ does have information not contained in $Z_i$. Column 2 ($\hat{\beta}_1(Z_i)$) contains the estimates and 95 percent confidence intervals for the odds ratio of each nutrient when the $X_i$ covariates are entered and the total nutrient sum $Z_i$ is used in equation 7,
$$\operatorname{logit} \Pr[D \mid Z_i, X_i] = \beta_0 + \beta_1 Z_i + \beta_2' X_i. \qquad (7)$$
For each nutrient, the unit of measure was chosen to approximate the size of a quintile among the controls. Column 3 ($\hat{\beta}_1(W_i)$) contains the same information when $W_i$ is used rather than $Z_i$. Columns 4 and 5 contain risk estimates when the information from both $W_i$ and $Z_i$ are in the model. To contrast the nature of the different information contained in a unit increase of $W_i$ compared with a unit increase of $Z_i$, the regressors have been reparameterized so that equation 6 becomes
$$\operatorname{logit} \Pr[D \mid L_i, X_i, W_i] = \beta_0^* + \beta_1^* W_i + \beta_2^{*\prime} X_i + \beta_3^* L_i. \qquad (8)$$
Here, $L_i$ is the sum of the nutrients in the foods not chosen by Max_r ($L_i = Z_i - W_i$). Beta-carotene, nitrate, and thiamine, all of whose score tests rejected, have estimates for $\beta_1^*$ and $\beta_3^*$ which are on opposite sides of the null estimate of 1, and have confidence intervals that do not overlap.
TABLE 3. Testing for and estimating the degree of non-differential misclassification (in the presence of kcal and covariates)

Nutrient (unit)           Score test*   $\hat\beta_1(Z_i)$†     $\hat\beta_1(W_i)$‡     $\hat\beta_1^*(W_i)$§   $\hat\beta_3^*(L_i)$‖
                                        Estimate (95% CI)       Estimate (95% CI)       Estimate (95% CI)       Estimate (95% CI)
Alphatocopherol (1 mg)    0.39          0.900 (0.857-0.946)     0.892 (0.842-0.944)     0.890 (0.840-0.942)     0.931 (0.840-0.942)
Beta-carotene (500 μg)    <0.0001       0.979 (0.954-1.00)      0.991 (0.966-1.02)      1.00 (0.977-1.03)       0.603 (0.510-0.714)
Nitrate (1 mg)            0.02          0.992 (0.953-1.032)     0.999 (0.957-1.044)     1.020 (0.973-1.069)     0.754 (0.594-0.958)
Protein (10 g)            0.12          1.201 (1.107-1.304)     1.088 (0.984-1.203)     1.141 (1.029-1.267)     1.267 (1.139-1.408)
Thiamine (100 μg)         0.004         0.953 (0.910-0.997)     1.010 (0.951-1.073)     1.010 (0.951-1.074)     0.882 (0.822-0.946)
Vitamin C (20 mg)         0.573         0.995 (0.993-0.998)     0.906 (0.858-0.956)     1.048 (0.877-1.252)     0.993 (0.985-1.001)

* Score test of $\beta_3 = 0$ in equation 6.
† Estimate (per unit nutrient) and 95% confidence interval (CI) for the odds ratio of $\beta_1$ in equation 7.
‡ Estimate (per unit nutrient) and 95% CI for the odds ratio of $\beta_1$ in equation 7 when $W_i$ is substituted for $Z_i$.
§ Estimate (per unit nutrient) and 95% CI for the odds ratio of $\beta_1^*$ in equation 8.
‖ Estimate (per unit nutrient) and 95% CI for the odds ratio of $\beta_3^*$ in equation 8.

DISCUSSION
We have started from the premise that the effect of an increase in a unit of nutrient $Z_i$ on disease is described by the model $\operatorname{logit} \Pr[D \mid Z_i, X_i] = \beta_0 + \beta_1 Z_i + \beta_2' X_i$. When constraints imposed by questionnaire length prohibit knowledge of $Z_i$, but allow knowledge of a related quantity $W_i$, considerations of power and bias suggest choosing $W_i$ so that the Pearson correlation between $Z_i$ and $W_i$ is maximized. We have shown that the square of the correlation is a direct measure of the loss of power to test the null hypothesis, and that larger correlations are directly proportional to less attenuation in risk estimates. In nutritional epidemiology, the $W_i$ used as substitutes for $Z_i$ have traditionally been simple sums over the included foods. Typically, these foods have been chosen by methods that maximize (MOM1) or approximately maximize (Stepwise) some statistic from a complete food questionnaire thought to be representative of the study population. Using the identical simple sum restriction, we have developed a user-friendly, interactive program (available on request from the first author) that runs on IBM-compatible PCs, and approximately maximizes the Pearson coefficient, r, of $W_i$ and $Z_i$. In addition to allowing numerous automated strategies for choosing a best subset, the program also permits the user the flexibility of specifying his/her own subset of foods. This is useful if there are several nutrients of interest and one wishes to see how well merged portions of separate lists perform.
In this paper, we have examined the performance of
Max_r when the task was to choose k = 10 foods. In
a real application, the number of foods to choose
should at least partially be determined by examining
the relation of changes in ARE ($r^2$) to changes in k.
For instance, increasing from 5 to 10 foods results in
an increased ARE of 13 percent for beta-carotene and
7 percent for nitrate, but 43 percent for protein and 38
percent for total calories. Thus, a k between 5 and 10
might be adequate for the two micronutrients, but
would likely be deemed inadequate for the two macronutrients. Of course, the calculated r (or any of the
criterion statistics) based on the CFQ is likely to be an
overestimate of the true r, and can only be used as a
guide to power considerations. Bias corrections could
be made using bootstrap procedures, although given
the inexact nature of power calculations, we doubt that
such corrections would be worth pursuing.
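One way to examine how ARE changes with k is sketched below. It is an assumption of this sketch that the quoted increases are relative changes in ARE; under that reading, the nitrate correlations reported earlier in the paper (r = 0.945 for five foods, r = 0.977 for 10) reproduce the 7 percent figure after rounding. The function names are hypothetical.

```python
def relative_are_gain(r_small, r_large):
    """Percent increase in ARE (r squared) relative to the smaller list."""
    return 100.0 * (r_large ** 2 / r_small ** 2 - 1.0)


def absolute_are_gain(r_small, r_large):
    """Difference in ARE, in percentage points."""
    return 100.0 * (r_large ** 2 - r_small ** 2)


# Nitrate: five foods versus 10 foods.
nitrate_relative = relative_are_gain(0.945, 0.977)
nitrate_absolute = absolute_are_gain(0.945, 0.977)
print(round(nitrate_relative, 1), round(nitrate_absolute, 1))
```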
One noteworthy feature of our empirical results was
the good performance of MOM2: the lowest efficiency
of the MOM2 method was 0.87. This suggests that
despite concerns to the contrary (1, 4), accounting for
the covariances between the foods is not of paramount
importance. Should this observation be confirmed in
other studies, it has practical implications. Whereas
Max_r and Stepwise require preexisting estimates of
the variances and covariances from a CFQ, MOM2
only requires knowing the order of the variances. In a
situation where one has no CFQ, a nutritionist or
epidemiologist familiar with the study population
might do a credible job guessing the 10 foods with the
most variability.
All four of the methods operate under the restriction that the $W_i$ are an unweighted sum over the chosen foods. In terms of maximizing the correlation with $Z_i$, the optimum weights would be the estimated coefficients of the usual least-squares regression of $Z_i$ on the selected foods. If the goal is efficient testing of the null hypothesis, or if an investigator is ultimately going to use a function of the order statistics, our results suggest that the potential gains of weighting are small. Using the $\hat{Z}_i$ from Stepwise rather than the $W_i$ from Max_r results in an average increase in ARE of only 3 percent (range 1-7 percent). This increase does not take into account the upward variance adjustment that must be made if estimated weights are used. If, however, the desire is to obtain consistent estimates of $\beta_1$ in equation 7 using an absolute nutrient scale, then one is forced to rely on measurement error corrections. These are, in effect, a form of weighting. For example, Rosner's correction procedure is equivalent to replacing $Z_i$ in equation 7 with $\tilde{Z}_i = \hat{\alpha}_0 + \hat{\alpha}_1 W_i$, where $\hat{\alpha}_0$, $\hat{\alpha}_1$ are the least-squares estimates of $\alpha_0$, $\alpha_1$ of equation 2. Using $\hat{Z}_i$ rather than $\tilde{Z}_i$ results from regressing $Z_i$ on the 10 individual chosen $F_{ij}$ rather than on their sum, $W_i$. Because the inflation in the variance of $\hat{\beta}_1$ due to the estimation of weighting is proportional to the residual variance of the regression of $Z_i$ on the surrogates, using the $\hat{Z}_i$ rather than the $\tilde{Z}_i$ has some potential efficiency advantages. The principal disadvantage is that the procedure for estimating the variance of the risk estimate is more complicated. If the regression coefficients, $\hat{\alpha}_j$, for each of the 10 foods are equal to the common $\alpha_1$ in equation 2, then the residual variance of both methods is the same and there is no gain in efficiency. Because, as we increase k, all the coefficients must tend to 1, and the covariances between the $F_{ij}$ are not of paramount importance, it is not surprising that in practice we often found little variation among the coefficients.
Any measurement error procedure or weighting technique requires that the sample of persons in the CFQ be indeed a random sample of the population under study, and that the CFQ be large enough to give approximately unbiased estimates of the regression parameters. Because large random validation subsamples are a feature of few if any studies, weighting has not been used in the past.
To guarantee even the rudimentary property of unbiased estimates of $\beta_1$ when $\beta_1 = 0$, misclassification
that arises from using the surrogate $W_i$ must be nondifferential. In the last section, we gave a hypothetical
example of how such misclassification might arise
from confounding by an unmeasured non-nutritional
covariate. In table 3, we tested for non-differential
misclassification while controlling total calories and
the seven non-nutritional covariates thought to be most
important. For three of the six nutrients tested, we
found differential misclassification. A unit increase in
beta-carotene, nitrate, or thiamine among the chosen
foods had no effect on gastric cancer, whereas a unit of
increase among the excluded foods was protective.
The results could be due to confounding by a nutritional covariate. For example, the protective effect of
$L_i$ for beta-carotene may arise from the carotenoid
lutein (or from another non-carotenoid nutrient) that
the $L_i$ foods have in common and the $W_i$ foods lack.
Similarly misleading conclusions could arise even if a
complete and accurate food history were obtained, but
the analysis relied on transforming foods into nutrients. Thus, these empirical findings should not only
raise concerns about substituting $W_i$ for $Z_i$, but should
also make one cautious about predicting the potential
impact of supplementation of a particular nutrient,
based on effects of the nutrient estimated in food
consumption studies.
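The link between unequal effects of the chosen and excluded foods and differential misclassification can be made explicit with a short derivation. It assumes that the total exposure decomposes as $Z_i = W_i + L_i$ and that disease risk follows a logistic model with separate coefficients for the two components. The symbols $D_i$, $\beta_W$, and $\beta_L$ are introduced here for illustration only, the other covariates are omitted for brevity, and equation 5 may be parametrized differently.

```latex
\begin{align*}
\operatorname{logit} P(D_i = 1 \mid W_i, L_i)
  &= \beta_0 + \beta_W W_i + \beta_L L_i \\
  &= \beta_0 + \beta_L (W_i + L_i) + (\beta_W - \beta_L) W_i \\
  &= \beta_0 + \beta_L Z_i + (\beta_W - \beta_L) W_i .
\end{align*}
```

Under these assumptions, $W_i$ carries no information about disease beyond $Z_i$ only when $\beta_W = \beta_L$. A pattern such as the one reported for beta-carotene, nitrate, and thiamine, with no effect for the chosen foods and a protective effect for the excluded foods, corresponds to $\beta_W \neq \beta_L$.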
Although we have examined the effect of extreme
shortening of a questionnaire, in some sense all nutritional questionnaires are "short": a questionnaire long
enough to capture the entire exposure to a number of
nutrients, as well as the exposure to other important
life-style confounders such as exercise, is impractical.
According to the measurement error approach, to obtain unbiased estimates from an extensive but still
incomplete questionnaire, we need to be able to assess
differential misclassification (equation 5), and to estimate the regression of $Z_i$ on $W_i$ (equation 2). One
useful approach to achieving this while keeping questionnaire length within acceptable bounds may be to
construct "partial questionnaires" (18) or questionnaires in which information is what we call "missing
by design." For instance, if disease status were measured on everyone, but exposure were measured in a
manner such that groups of individuals (chosen with
known probability) were each asked a subset of the
food questions, Rubin's likelihood-based method of
multiple imputation (19), or the semi-parametric methods proposed by Robins et al. (20), would allow one to
assess and adjust for differential misclassification of a
common short list, and/or to construct consistent estimates of the risks that would have resulted if each
participant had answered every question.
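One way such a design might be laid out is sketched below. This is a hypothetical allocation scheme, not the partial questionnaire design of reference 18: the function name assign_question_blocks, the round-robin split of foods into blocks, and the example food names are all assumptions. Everyone answers a common short list, and each person also answers a random subset of the remaining blocks, so that each block is asked with a known probability.

```python
import random

def assign_question_blocks(n_people, core_foods, extra_foods,
                           n_blocks, blocks_per_person, seed=0):
    """Give every person the core foods plus a random subset of extra-food blocks.

    Returns the per-person question lists and the known probability that any
    given block (and hence any extra food) is asked of a given person.
    """
    rng = random.Random(seed)
    # Split the extra foods into n_blocks blocks, round-robin.
    blocks = [extra_foods[b::n_blocks] for b in range(n_blocks)]
    assignments = []
    for _ in range(n_people):
        # Sample blocks uniformly without replacement.
        chosen = rng.sample(range(n_blocks), blocks_per_person)
        asked = list(core_foods)
        for b in sorted(chosen):
            asked.extend(blocks[b])
        assignments.append(asked)
    inclusion_prob = blocks_per_person / n_blocks
    return assignments, inclusion_prob

# Example with made-up food names: 2 core foods, 12 extra foods in 4 blocks.
core = ["bread", "milk"]
extra = [f"food{i}" for i in range(12)]
assignments, prob = assign_question_blocks(200, core, extra,
                                           n_blocks=4, blocks_per_person=2)
print(len(assignments[0]), prob)
```

Because the blocks are sampled uniformly without replacement, the probability that a given block is asked is blocks_per_person / n_blocks. A known selection probability of this kind is what the imputation and semi-parametric methods cited above take as input.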
We conclude with the following recommendations.
If measuring a specific nutrient is but one aspect of a
large study, we would suggest applying the Max_r
algorithm to a preexisting CFQ of relevance to the
study population. If nutritional exposures are the central concern of the study, we recommend a missing-by-design questionnaire that obtains overlapping, but
nonidentical, exposure measurements from different
persons. In contrast to any procedure that is in some
sense optimal for selecting identical, incomplete, and
biased measurements of the true exposure on all persons, such questionnaires permit unbiased risk estimates on an absolute scale. This facilitates comparison
of the results of studies in different populations, and
allows data sets to retain their usefulness as new
micronutrients, which were not the specific focus of
interest during the design stage of the questionnaire,
become of interest. How to efficiently structure such
questionnaires in general, and for nutritional purposes
in particular, is currently an active area of our
research.
ACKNOWLEDGMENTS
Dr. Adriano Decarli's work was partially supported by
Italian National Research Council grant no. 94.0119.pf39.
The authors thank Dr. Ray Carroll for his helpful comments on the manuscript.
REFERENCES
1. Willett W, ed. Nutritional epidemiology. New York: Oxford
University Press, 1990.
2. Willett WC, Sampson L, Stampfer MJ, et al. Reproducibility
and validity of a semiquantitative food frequency questionnaire. Am J Epidemiol 1985;122:51-65.
3. Willett WC, Sampson L, Browne ML, et al. The use of a
self-administered questionnaire to assess diet four years in the
past. Am J Epidemiol 1988;127:188-99.
4. Byers T, Marshall J, Fiedler R, et al. Assessing nutrient intake
with an abbreviated dietary interview. Am J Epidemiol 1985;
122:41-50.
5. Stryker WS, Salvini S, Stampfer MJ, et al. Contributions of
specific foods to absolute intake and between-person variation
of nutrient consumption. J Am Diet Assoc 1991;91:172-8.
6. Decarli A, Ferraroni M, Palli D. A reduced questionnaire to
investigate the Mediterranean diet in epidemiologic studies.
Epidemiology 1994;5:251-6.
7. Buiatti E, Palli D, Decarli A, et al. A case-control study of
gastric cancer and diet in Italy. Int J Cancer 1989;44:611-6.
8. Buiatti E, Palli D, Decarli A, et al. A case-control study of
gastric cancer and diet in Italy. II. Association with nutrients.
Int J Cancer 1990;45:896-901.
9. Block G, Dresser CM, Hartman AM, et al. Nutrient sources in
the American diet: quantitative data from the NHANES II
survey. I. Vitamins and minerals. Am J Epidemiol 1985;122:13-26.
10. Block G, Hartman AM, Dresser CM, et al. A data-based
approach to diet questionnaire design and testing. Am J Epidemiol 1986;124:453-69.
11. Cummings SR, Block G, McHenry K, et al. Evaluation of two
food frequency methods of measuring dietary calcium intake.
Am J Epidemiol 1987;126:796-802.
12. Howe GR, Harrison L, Jain M. A short diet history for
assessing dietary exposure to n-nitrosamines in epidemiologic
studies. Am J Epidemiol 1986;124:595-602.
13. Hebert JR, Miller DR. Methodologic considerations for investigating the diet-cancer link. Am J Clin Nutr 1988;47:
1068-77.
14. Heady JA. Diets of bank clerks: development of a method of
classifying the diets of individuals for use in epidemiological
studies. J Royal Stat Soc A 1961;124:336-61.
15. Lagakos SW. Effects of mismodelling and mismeasuring explanatory variables on tests of their association with a response variable. Stat Med 1988;7:257-74.
16. Rosner B, Willett WC, Spiegelman D. Correction of logistic
regression relative risk estimates and confidence intervals for
systematic within-person measurement error. Stat Med 1989;
8:1051-69.
17. Carroll RJ, Ruppert D, Stefanski LA, eds. Nonlinear measurement error models. New York: Chapman and Hall, 1995.
18. Wacholder S, Carroll RJ, Pee D, et al. The partial questionnaire design for case-control studies. Stat Med 1994;13:
623-34.
19. Rubin DB, ed. Multiple imputation for nonresponse in surveys. New York: John Wiley and Sons, 1987.
20. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression
coefficients when some regressors are not always observed. J
Am Stat Assoc 1994;89:846-66.