American Journal of Epidemiology
Copyright © 1998 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 148, No. 5
Printed in U.S.A.
Use of Census-based Aggregate Variables to Proxy for Socioeconomic
Group: Evidence from National Samples
Arline T. Geronimus1 and John Bound 2
Increasingly, investigators append census-based socioeconomic characteristics of residential areas to
individual records to address the problem of inadequate socioeconomic information on health data sets. Little
empirical attention has been given to the validity of this approach. The authors estimate health outcome
equations using samples from nationally representative data sets linked to census data. They investigate
whether statistical power is sensitive to the timing of census data collection or to the level of aggregation of
the census data; whether different census items are conceptually distinct; and whether the use of multiple
aggregate measures in health outcome equations improves prediction compared with a single aggregate
measure. The authors find little difference in estimates when using 1970 compared with 1980 US Bureau of the
Census data or zip code compared with tract level variables. However, aggregate variables are highly
multicollinear. Associations of health outcomes with aggregate measures are substantially weaker than with
microlevel measures. The authors conclude that aggregate measures cannot be interpreted as if they were
microlevel variables nor should a specific aggregate measure be interpreted to represent the effects of what
it is labeled. Am J Epidemiol 1998; 148:475-86.
aggregation; census tract; geocoding; health surveys; social class; socioeconomic factors; zip code
Social inequalities in health are difficult to study
(1-3). A 1994 conference of the National Institutes of
Health documented the severe inadequacies of socioeconomic data on health data sets and led to a thoughtful set of recommendations to improve the situation (4,
5). One of the recommendations was to geocode individual records and to link them to socioeconomic
characteristics of residential areas drawn from census
data. It was suggested that this would be "one powerful and economical way of augmenting existing data
bases" (4, p. 305). This approach is already evident in
the study of cancer (e.g., references 6-8), infant mortality (e.g., references 9-11), and, to a lesser extent,
other health outcomes (e.g., references 12-14).
A number of limitations to the validity of this approach have been suggested, however. Investigators
have found this procedure to result in imprecise estimates and to require large sample sizes in order to
detect significant differences (15, 16). Drawing on a
statistical framework to illuminate biases, the one
analysis based on nationally representative samples
(17) raised questions about the proper interpretation of
coefficient estimates derived through this approach.
Yet, in some cases, it may be the only option available
to study social differentials in health.
Investigators using this approach face conceptual
and related methodological decisions. For example,
the census is taken once per decade, with time-lags
between data collection and public availability. Potentially, the most proximate census data available were
collected more than 10 years prior to the primary data
set being analyzed. Whether it is justifiable to append
census data that are at least one decade old to individual records to proxy current socioeconomic characteristics is an empirical question.
Another important methodological question is, what
difference does the level of aggregation of the census
data make to the relations observed? Nationally, the
typical zip code contains roughly 25,000 inhabitants,
while the typical census tract contains 5,000. The
census block group is even smaller, generally containing about 1,000 inhabitants. It is plausible that investigators should prefer the smallest and most homogeneous census-defined region, the census block group.
However, census block data are rarely available.
Received for publication September 29, 1997, and accepted for
publication March 20, 1998.
Abbreviations: NMIHS, National Maternal and Infant Health Survey; PSID, Panel Study of Income Dynamics; SEI, socioeconomic
index.
1 Department of Health Behavior and Health Education and the
Population Studies Center, University of Michigan, Ann Arbor, MI.
2 Department of Economics and the Population Studies Center,
University of Michigan, Ann Arbor, MI.
Reprint requests to Dr. Arline T. Geronimus, Department of
Health Behavior and Health Education, School of Public Health,
University of Michigan, 1420 Washington Heights, Ann Arbor, MI
48109-2029.
Moreover, block group studies will systematically exclude rural residents. Often, the available geographic identifiers permit linkages to census data only at the zip
code level. And only zip-coded data have the potential
for complete coverage. Yet, because zip codes cover a
large number of inhabitants, use of data aggregated at
this level has been called "an option of last resort"
(18). However, do zip code-level data perform substantially worse in health outcome equations than data
collected at the census-tract level?
In addition, the census contains many data items that
are theoretically related to socioeconomic group. To
what extent do they represent distinct entities? Will
choosing one rather than another affect regression
results? Are they sufficiently distinct that regression
coefficients can lead directly to clear policy advice to
reduce health disparities? Do they capture sufficiently
well-defined or separate components of the ways in
which social position affects health that using multiple
measures will improve estimation compared with a
single measure?
Aggregate socioeconomic variables are sometimes
used by researchers to estimate "contextual" or
"neighborhood" effects. Although some of our findings are relevant to such applications, we are specifically concerned with the many cases where investigators substitute aggregate variables for microlevel data
they would have used had they been available. We address: 1) the statistical power of aggregate data relative
to microlevel data and relative to other aggregate data,
e.g., between different census years or measured at
different levels of aggregation; and 2) the general
interpretation of results derived in this manner.
MATERIALS AND METHODS
The creation of a data set linking census information
to microlevel data from the 1985 wave of the Panel
Study of Income Dynamics (the PSID-Geocode file),
together with a special release of the 1988 National
Maternal and Infant Health Survey (NMIHS) that includes geographic identifiers, provided a unique opportunity to address the above research questions. We
performed similar analyses using PSID and NMIHS
data, and we found our results to be robust across the
two data sets. For reasons of space, we report here
only the PSID results. The NMIHS results are available from the authors.
The PSID is an ongoing longitudinal study of the
determinants of family income (19, 20). Data from a
representative sample of persons have been collected
annually since 1968. In 1985, over 60 percent of the
original set of sample households remained in the
study. We restricted our PSID samples to the men and
women from the original 1968 PSID families who are
between the ages of 18 and 64 years and who identified themselves as black or white. We apply sample
weights to account for the initial oversampling of
some groups, differential attrition, and the expansion
over time of the proportion of younger families in the
sample. Validation studies suggest that analyses of the
PSID yield nationally representative results for blacks
and whites when the sample weights are applied
(21, 22).
We analyze two PSID samples: one restricted to
observations with valid zip codes for both 1970 and
1980; the second restricted to observations with valid
zip codes and census tract identifiers for 1980. Ninety-nine percent of the 1985 PSID respondents had valid
1980 zip codes, while 68 percent were matchable to
valid 1970 census information using zip codes. We
were able to match 72 percent of the PSID sample to
1980 census tract information.
Health-related variables collected by the PSID are
limited. We focus on adult self-reported health status
using responses to a question that asked respondents to
rate their health on a 5-point scale from excellent to
poor. Such measures are highly correlated with clinical measures (23-26) and predict subsequent death,
health care utilization, and labor market behavior
(27)—often better than clinical measures.
In table 1, we list the socioeconomic variables studied. We studied microlevel socioeconomic measures
commonly used by social epidemiologists including
income, education, and occupation measured continuously and then categorically. We also studied family
income to needs but do not report these results, which
were virtually identical to those for family income.
The poverty variable also takes account of differences
in family income to needs.
We based the microlevel occupational measures on
respondent's own occupation. We used current occupation, if available, and previous occupation for those
currently unemployed. Even so, almost 20 percent of
each sample had missing occupational data. Means,
standard deviations, and correlations for the microlevel occupation measures are based on the subsets of
each sample for which we had occupational data. In
regression analyses, we used the full study samples,
but included a dummy variable to indicate when occupational data were missing.
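The missing-data indicator described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `add_missing_indicator`, the fill value of 0, and the sample SEI/10 scores are all assumptions introduced here.

```python
def add_missing_indicator(values, fill=0.0):
    """Return (filled_values, missing_flags) for a covariate with gaps.

    `None` marks a missing observation. Missing entries are replaced by
    `fill` (an arbitrary constant, assumed here to be 0) and flagged with 1,
    so every record can stay in the regression sample.
    """
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


# Hypothetical SEI/10 scores; None stands for a respondent without occupation data.
sei = [3.9, None, 6.2, 1.8, None]
sei_filled, sei_missing = add_missing_indicator(sei)
```

Both `sei_filled` and `sei_missing` would then enter the regression together, so the coefficient on the occupation measure is identified only from respondents with observed occupations.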
For ease of presentation, we divided the continuous
occupation variable—the socioeconomic index
(SEI)—by 10. The SEI (also known as the Duncan
index) is the most widely used ranking of occupations
in the social sciences. It is estimated by regressing
occupational prestige scores on age-standardized occupational levels of earnings and education for a limited set of occupations and then applying weights for
earnings and education levels to all other occupations
to arrive at predicted prestige scores (28). Thus, it is
not entirely distinct from the income and education
constructs. SEI scores correspond to the 1970 census
occupation codes in the data. We use the SEI for the
total population, not the male SEI.

TABLE 1. Variable descriptions

Variable                  Description
Microlevel
  Income*                 Natural logarithm of family income
  Education               Educational attainment in years of schooling
  SEI                     Socioeconomic index (SEI) score corresponding to respondent's current or most recent occupation
  Poor                    Family income to needs below poverty threshold
  High school graduate    High school graduate
  Professional            Current or last occupation classified as professional or managerial
Aggregate†
  Income*                 Log of median family income in residential area
  Education               Mean educational attainment in years of residents aged >25 years
  Poor                    Fraction of the non-elderly residents with incomes below the poverty level
  High school graduate    Fraction of the population aged >25 years with a high school diploma
  Professional            Fraction of adult residents employed in professional, managerial, farming, and protective service occupations
  Unemployed              Fraction of population aged >16 years unemployed
  On AFDC‡                Fraction of households receiving public assistance

* All incomes are in 1997 dollars.
† All aggregate variables refer to the characteristics as of 1970 or 1980 of the respondent's zip code or census tract of residence.
‡ AFDC, Aid to Families with Dependent Children.

The aggregate variables were drawn from US Bureau of the Census Summary Tape Files (STF), which
contain detailed tabulations of the nation's population
and housing characteristics. We studied a range of
possible aggregate measures that have been used as
socioeconomic proxies. All income variables were
transformed into natural logarithms, and all dollar
amounts were first inflated to 1997 dollars using the
Consumer Price Index.
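The two transformations just described (inflating to 1997 dollars, then taking natural logarithms) can be sketched as follows. The function name and the index levels in the usage line are placeholders introduced for illustration; they are not values from the article or from official Consumer Price Index tables.

```python
import math


def log_real_income(nominal_income, cpi_source_year, cpi_target_year):
    """Natural log of income after rescaling to target-year dollars.

    Both CPI arguments must come from the same index series; the caller
    supplies them (no real CPI values are hard-coded here).
    """
    if nominal_income <= 0:
        raise ValueError("income must be positive to take a logarithm")
    real_income = nominal_income * cpi_target_year / cpi_source_year
    return math.log(real_income)


# Placeholder index levels (NOT actual CPI data): prices double between the two years.
example = log_real_income(20_000, cpi_source_year=50.0, cpi_target_year=100.0)
```

Inflating before taking logs only shifts every value from a given census year by a constant, so it matters for comparing means across years but not for within-year correlations or regression slopes.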
We estimate complete first-order correlation matrices for the microlevel socioeconomic variables and the
full array of aggregate variables at the zip code level
and at the census tract level from the 1970 and 1980
censuses. The correlations between microlevel and
aggregate measures of socioeconomic position offer
an indication of the reliability of specific aggregate
variables as proxies for the microlevel variables. The
correlations among aggregate variables indicate the
extent to which the aggregate variables are multicollinear. These correlations are suggestive of an investigator's ability to estimate coefficients in regressions
using multiple measures and also address the question of the conceptual distinctiveness of aggregate
measures.
We estimate selected health outcome equations using various versions of the socioeconomic variables:
we include only microlevel variables but then substitute aggregate for microlevel variables; 1970 for 1980
census variables; and zip code level for tract level
aggregate variables. We include only one socioeconomic covariate in some models and multiple measures in others. We evaluate the stability of coefficients on the socioeconomic variables across models
and the goodness of fit of various models. Because
outcome measures are discrete (ordered polychotomous), we use methods appropriate to limited dependent variables. We assume that the ordered categorical
responses reflect an underlying latent continuous variable that is distributed normally—with larger values
representing better health—and estimate a series of
ordered probits.
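The latent-variable setup behind an ordered probit can be sketched in a few lines. This is an illustrative sketch only, not the estimation code used in the study: the function names, the cutpoint values, and the coding of the 5-point scale as 0 (poor) through 4 (excellent) are assumptions made here.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal latent error


def category_probabilities(linear_index, cutpoints):
    """P(y = k) for k = 0..len(cutpoints), given the linear index x'b.

    `cutpoints` must be strictly increasing. Category k is observed when the
    latent variable x'b + e falls between cutpoint k-1 and cutpoint k.
    """
    cdf = [_STD_NORMAL.cdf(c - linear_index) for c in cutpoints]
    bounds = [0.0] + cdf + [1.0]
    return [bounds[k + 1] - bounds[k] for k in range(len(bounds) - 1)]


def log_likelihood(outcomes, linear_indices, cutpoints):
    """Sum of log P(y_i) over respondents; maximized over b and the cutpoints."""
    return sum(
        math.log(category_probabilities(xb, cutpoints)[y])
        for y, xb in zip(outcomes, linear_indices)
    )


# Hypothetical cutpoints separating five health categories (0 = poor ... 4 = excellent).
cuts = [-1.0, 0.0, 1.0, 2.0]
baseline = category_probabilities(0.0, cuts)
shifted = category_probabilities(1.0, cuts)  # a higher index moves mass toward better health
```

Because the latent error has unit variance, a coefficient in this model is the shift in latent health, in standard deviation units, per one-unit change in the covariate.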
The chi-square statistic for each model offers information on the strength of the association between the
specific socioeconomic variable and the health outcome and also provides an indication of the reliability
with which an investigator could expect to estimate
coefficients on the specific variable. Moreover, to the
extent that one views any socioeconomic variable as a
component of a more global construct of interest, a
better fitting model can be interpreted to have
picked up a larger component of the overall theoretical
construct.
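The article does not state how the chi-square statistics were computed. One common choice, assumed here, is a likelihood-ratio test of the model with the socioeconomic variable against the model without it. The sketch below uses made-up log-likelihood values chosen only to produce a statistic of the size reported in table 5; the function names are likewise illustrative.

```python
import math


def likelihood_ratio_chi2(loglik_restricted, loglik_full):
    """Likelihood-ratio statistic: twice the gain in maximized log-likelihood."""
    return 2.0 * (loglik_full - loglik_restricted)


def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom.

    Uses the identity P(chi2_1 > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods (NOT from the article) for a model with controls only
# and a model that adds one socioeconomic variable.
stat = likelihood_ratio_chi2(loglik_restricted=-5200.0, loglik_full=-5060.75)
p_value = chi2_pvalue_1df(stat)
```

With one added variable the statistic has 1 degree of freedom, so values of the magnitude reported in table 5 correspond to very small p-values; the comparison of interest is therefore the relative size of the statistics across measures.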
In actual applications, the investigator is interested
in the magnitude of the coefficient on the socioeconomic variable. By comparing coefficients between a
microlevel variable and its aggregate counterpart (e.g.,
family income and median income), we gain a sense of
how well one can substitute for the other in regression
equations.
RESULTS
In table 2, we report means and standard deviations
of the socioeconomic variables. For ease of interpretation, income variables can be converted to 1997
dollar values by exponentiating. For example, a mean
of 11 corresponds to a family income of roughly
$60,000 in 1997 dollars (e^11 ≈ $60,000). Means for
aggregate and microlevel variables are generally similar. The aggregate variables show substantially less
variation than the microlevel variables, implying that
aggregate data provide substantially less statistical
power than microlevel data. Note also that variation in
aggregate variables at the tract level is only somewhat
greater than at the zip code level. This suggests that
using tract data will increase statistical power compared with using zip code level data, but that this
increase will be small.
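The conversion from log income back to dollars mentioned above is a single exponentiation. A minimal sketch (the function name is an illustrative choice):

```python
import math


def log_income_to_dollars(log_income):
    """Invert the natural-log transform applied to the income variables."""
    return math.exp(log_income)


# A mean log income of 11 corresponds to roughly $60,000 in 1997 dollars.
dollars_at_11 = log_income_to_dollars(11)
```

Because the variables are in logs, a difference of 0.1 between two means corresponds to a difference of roughly 10 percent in dollar terms, whatever the income level.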
Correlations
Table 3 presents simple first-order correlations
between the microlevel socioeconomic variables in
the PSID, the 1980 census variables, and the 1970 census variables. In table 4, correlations between PSID
microlevel variables and aggregate variables from
TABLE 2. Summary statistics: various socioeconomic measures by sample*

                                  1980/1970 sample      Zip code/census tract sample
                                  (n = 4,393)           (n = 4,762)
Socioeconomic status measure      Mean      SD          Mean      SD
Microlevel
  Income                          10.96    0.84         10.91    0.84
  Education                       13.09    2.50         13.09    2.50
  SEI/10†                          3.92    2.02          3.92    2.00
  Poor                             0.18    0.38          0.18    0.38
  High school graduate             0.84    0.36          0.85    0.36
  Professional                     0.36    0.49          0.35    0.49
Aggregate, 1980 zip codes
  Income                          10.64    0.29         10.63    0.28
  Education                       12.65    0.90         12.67    0.89
  Fraction
    Poor                           0.10    0.08          0.10    0.08
    High school graduate           0.70    0.14          0.71    0.13
    Professional                   0.30    0.10          0.30    0.10
    Unemployed                     0.07    0.04          0.06    0.04
    On AFDC†                       0.07    0.06          0.07    0.06
Aggregate, 1970 zip codes
  Income                           9.94    0.25
  Education                       11.99    0.81
  Fraction
    Poor                           0.09    0.07
    High school graduate           0.57    0.15
    Professional                   0.26    0.10
    Unemployed                     0.04    0.02
    On AFDC                        0.04    0.04
Aggregate, 1980 census tracts
  Income                                                10.62    0.34
  Education                                             12.68    1.01
  Fraction
    Poor                                                 0.10    0.09
    High school graduate                                 0.71    0.15
    Professional                                         0.30    0.12
    Unemployed                                           0.06    0.04
    On AFDC                                              0.07    0.07

SD, standard deviation.
* Data source: Panel Study of Income Dynamics (PSID).
† SEI/10, socioeconomic index divided by 10; AFDC, Aid to Families with Dependent Children.
TABLE 3. Correlations between socioeconomic measures: 1980/1970 zip codes sample*

                                 1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20
Microlevel
   1. Income                  1.00
   2. Education               0.32  1.00
   3. SEI†                    0.36  0.61  1.00
   4. Poor                   -0.70 -0.27 -0.25  1.00
   5. High school graduate    0.27  0.64  0.28 -0.26  1.00
   6. Professional            0.28  0.50  0.81 -0.18  0.21  1.00
1980 zip codes
   7. Median income           0.40  0.30  0.26 -0.32  0.23  0.21  1.00
   8. Mean education          0.30  0.38  0.32 -0.25  0.24  0.25  0.72  1.00
  Fraction
   9. Poor                   -0.34 -0.21 -0.18  0.31 -0.20 -0.14 -0.85 -0.47  1.00
  10. High school graduate    0.33  0.36  0.28 -0.29  0.26  0.23  0.79  0.91 -0.66  1.00
  11. Professional            0.30  0.37  0.34 -0.24  0.22  0.26  0.68  0.93 -0.42  0.80  1.00
  12. Unemployed             -0.29 -0.26 -0.24  0.26 -0.19 -0.19 -0.64 -0.60  0.65 -0.62 -0.57  1.00
  13. AFDC†                  -0.32 -0.25 -0.22  0.30 -0.21 -0.18 -0.78 -0.57  0.88 -0.70 -0.51  0.75  1.00
1970 zip codes
  14. Median income           0.36  0.29  0.25 -0.29  0.20  0.20  0.89  0.68 -0.70  0.73  0.64 -0.49 -0.60  1.00
  15. Mean education          0.29  0.37  0.30 -0.23  0.23  0.23  0.68  0.94 -0.44  0.87  0.86 -0.55 -0.53  0.72  1.00
  Fraction
  16. Poor                   -0.31 -0.18 -0.15  0.29 -0.17 -0.12 -0.75 -0.40  0.83 -0.59 -0.33  0.46  0.68 -0.81 -0.44  1.00
  17. High school graduate    0.31  0.35  0.27 -0.26  0.24  0.22  0.73  0.87 -0.58  0.93  0.75 -0.56 -0.61  0.76  0.94 -0.59  1.00
  18. Professional            0.31  0.34  0.31 -0.24  0.21  0.24  0.69  0.88 -0.47  0.79  0.90 -0.58 -0.56  0.67  0.90 -0.38  0.80  1.00
  19. Unemployed             -0.21 -0.15 -0.15  0.17 -0.12 -0.11 -0.48 -0.31  0.52 -0.33 -0.35  0.60  0.56 -0.45 -0.32  0.47 -0.33 -0.38  1.00
  20. AFDC                   -0.27 -0.18 -0.15  0.26 -0.16 -0.12 -0.69 -0.42  0.79 -0.55 -0.36  0.56  0.83 -0.65 -0.44  0.78 -0.53 -0.43  0.62  1.00

* Data source: Panel Study of Income Dynamics (PSID).
† SEI, socioeconomic index; AFDC, Aid to Families with Dependent Children.
TABLE 4. Correlations between socioeconomic measures: 1980 zip codes/census tracts sample*

                                 1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20
Microlevel
   1. Income                  1.00
   2. Education               0.32  1.00
   3. SEI†                    0.35  0.62  1.00
   4. Poor                   -0.70 -0.25 -0.25  1.00
   5. High school graduate    0.25  0.64  0.27 -0.24  1.00
   6. Professional            0.27  0.49  0.81 -0.18  0.20  1.00
1980 zip codes
   7. Median income           0.40  0.30  0.26 -0.31  0.23  0.21  1.00
   8. Mean education          0.30  0.38  0.31 -0.24  0.24  0.23  0.72  1.00
  Fraction
   9. Poor                   -0.34 -0.19 -0.17  0.30 -0.20 -0.14 -0.84 -0.46  1.00
  10. High school graduate    0.33  0.35  0.27 -0.27  0.26  0.22  0.78  0.91 -0.65  1.00
  11. Professional            0.30  0.37  0.33 -0.22  0.21  0.25  0.66  0.91 -0.39  0.76  1.00
  12. Unemployed             -0.28 -0.23 -0.22  0.25 -0.16 -0.16 -0.60 -0.57  0.62 -0.58 -0.55  1.00
  13. AFDC†                  -0.30 -0.24 -0.22  0.28 -0.21 -0.18 -0.75 -0.57  0.86 -0.70 -0.50  0.72  1.00
1980 census tracts
  14. Median income           0.45  0.31  0.29 -0.35  0.25  0.22  0.81  0.57 -0.68  0.63  0.53 -0.47 -0.60  1.00
  15. Mean education          0.34  0.42  0.34 -0.27  0.27  0.24  0.60  0.82 -0.39  0.75  0.75 -0.47 -0.48  0.71  1.00
  Fraction
  16. Poor                   -0.38 -0.20 -0.18  0.36 -0.22 -0.14 -0.65 -0.35  0.77 -0.51 -0.30  0.47  0.67 -0.81 -0.45  1.00
  17. High school graduate    0.36  0.37  0.29 -0.31  0.30  0.21  0.66  0.75 -0.55  0.83  0.63 -0.47 -0.58  0.76  0.91 -0.63  1.00
  18. Professional            0.35  0.39  0.36 -0.26  0.24  0.26  0.54  0.71 -0.33  0.61  0.78 -0.44 -0.40  0.68  0.89 -0.41  0.76  1.00
  19. Unemployed             -0.31 -0.24 -0.23  0.28 -0.20 -0.17 -0.49 -0.46  0.50 -0.47 -0.45  0.81  0.59 -0.58 -0.54  0.60 -0.55 -0.53  1.00
  20. AFDC                   -0.34 -0.26 -0.23  0.33 -0.24 -0.18 -0.61 -0.47  0.70 -0.57 -0.41  0.58  0.81 -0.72 -0.57  0.82 -0.68 -0.51  0.69  1.00

* Data source: Panel Study of Income Dynamics (PSID).
† SEI, socioeconomic index; AFDC, Aid to Families with Dependent Children.
the 1980 census measured at the zip code and tract
levels are shown.
Not surprisingly, among the microlevel measures,
categorical and continuous versions of the same variable are highly correlated (-0.70 for income; 0.64 for
education; and 0.81 for occupation). However, with
the exception of the correlation between SEI and years
of schooling, correlations among microlevel socioeconomic variables are not very high, which suggests that
they measure distinct aspects of the construct of socioeconomic position. The relatively high correlation
between SEI and years of schooling may be an artifact
of the composite nature of the SEI.
Generally, when an aggregate version of a specific
variable is compared with the microlevel version of
the same variable, the correlation is small to moderate.
These correlations range from 0.24 for the correlation
between being in a professional occupation and fraction of the adult population in professional occupations, using 1970 census data at the zip code level, to
0.45 for median income compared with family income, using 1980 data at the tract level. For specific
aggregate variables, there is some indication that 1980
variables are more highly correlated with the microlevel variable than 1970 variables, and the tract level
variables are more highly correlated with the individual level variables than are the zip code level variables, but the differences in all cases are small.
Correlations among aggregate proxies tend to be
larger. In a given year or level of aggregation, it is
unusual to find a correlation below 0.50. Most fall
between 0.65 and 0.94. Correlations between aggregate variables in 1970 and 1980 tend to be very high,
for example, 0.89 for median income and 0.94 for
mean education. The only correlation lower than 0.83
is for fraction unemployed (0.60). Correlations between the same variable measured at the zip code level
compared with the census tract level in a given year
are also generally moderate to high.
Regression models
In table 5, we estimate the effects of socioeconomic
group on self-reported health status, first using microlevel measures and then various aggregate measures.
In columns 1-2, we compare aggregate variables measured in 1980 with those measured in 1970, while in
columns 3-4 we compare aggregate proxies measured
at the zip code level with those measured for census
tracts. In all cases, we control for race, age, and sex of
the respondent. Coefficients on the explanatory variables can be interpreted as the effect of a one-unit
change in the explanatory variable on overall self-reported health measured in standard deviation units.
Columns 1 and 3 list the coefficient estimates using
different measures of socioeconomic group. Columns
2 and 4 list the chi-square statistic for the test of each
model against the model with no socioeconomic status
indicators, which provides information on which models fit the data better than others, and, hence, on how
well different socioeconomic proxies predict the
health outcome.
Most models based on microlevel socioeconomic
measures have substantially higher chi-square statistics than those using aggregate socioeconomic variables. In models that can be compared directly (e.g.,
models with microlevel income compared with those
with median income, etc.), the goodness of fit statistic
is always substantially higher for the model using the
microlevel variable. In addition, the aggregate version
of a given variable always picks up a substantially
larger coefficient than the corresponding microlevel
variable—two to three times larger in many cases, four
to five times larger for the aggregate compared with
the microlevel professional variable. More generally,
socioeconomic variables measured at the aggregate
level have very different estimated effects on health
from those measured at the microlevel.
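The size of the gap between aggregate and microlevel coefficients can be checked directly from the single-variable estimates in table 5. The sketch below is only a convenience calculation on those published coefficients (1980/1970 sample, 1980 zip code aggregates); the variable names are illustrative.

```python
# Single-variable coefficients from table 5, 1980/1970 sample (n = 4,393).
micro = {
    "Income": 0.35,
    "Education": 0.15,
    "Poor": -0.61,
    "High school graduate": 0.67,
    "Professional": 0.41,
}
aggregate_1980_zip = {
    "Income": 0.66,
    "Education": 0.25,
    "Poor": -1.45,
    "High school graduate": 1.56,
    "Professional": 2.20,
}

# Ratio of the aggregate coefficient to its microlevel counterpart.
ratios = {name: aggregate_1980_zip[name] / micro[name] for name in micro}

for name, ratio in ratios.items():
    print(f"{name}: aggregate/microlevel = {ratio:.2f}")
```

The ratios are unit-free only in a loose sense: the microlevel "Poor", "High school graduate", and "Professional" measures are 0/1 indicators, whereas their aggregate counterparts are fractions of an area's population.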
Regressions that include aggregate rather than microlevel socioeconomic variables
show little difference in either coefficient estimates or
goodness of fit between the zip code or tract levels of
aggregation or between 1980 and 1970 census data.
However, in both 1970 and 1980 and at both the zip
code and census tract levels of aggregation, models
including aggregate income or education variables and
the aggregate variable based on occupational type
consistently fit substantially better than models using
the remaining aggregate variables.
Estimates reported in tables 6 and 7 offer information on the question of whether prediction of health
outcomes is improved when multiple aggregate measures are included in models, relative to a single measure. This also addresses the question of the conceptual comparability of the procedure of including
multiple aggregate measures in a model with that of
including multiple microlevel measures. In all cases,
microlevel models fit the data better than models that
use aggregate proxies (compare panel A with panel B
or C in each table). In microlevel models, goodness of
fit improves when multiple variables are included relative to inclusion of a single variable. The inclusion of
a second microlevel variable does not dramatically
alter the coefficient on the already included income or
education variable, although the coefficient on the SEI
does change more dramatically when education is
included in the model, presumably because it is a
composite. Increases in predictive power associated
with including multiple aggregate measures relative to
the inclusion of only one are more modest. The inclusion of a second aggregate variable often has a large
impact on the coefficient on the already included aggregate variable. There are virtually no differences
between using 1970 data and 1980 data, while those
between aggregating at the zip code versus census
tract levels of aggregation are small.

TABLE 5. The effect of socioeconomic group on self-rated health by sample: comparisons across various socioeconomic measures*,†

                                  1980/1970 sample        Zip code/census tract sample
                                  (n = 4,393)             (n = 4,762)
Socioeconomic status measure      Coefficient     χ²      Coefficient     χ²
Microlevel
  Income                              0.35      278.5         0.33      268.7
  Education                           0.15      477.8         0.14      456.8
  SEI/10‡,§                           0.12      250.8         0.12      268.5
  Poor                               -0.61      182.8        -0.55      165.0
  High school graduate                0.67      217.0         0.62      198.9
  Professional‡                       0.41      198.5         0.39      224.1
Aggregate, 1980 zip codes
  Income                              0.66      110.2         0.65      114.6
  Education                           0.25      171.4         0.24      168.3
  Fraction
    Poor                             -1.45       40.2        -1.38       37.0
    High school graduate              1.56      152.2         1.44      135.1
    Professional                      2.20      170.3         2.25      178.1
    Unemployed                       -3.86       68.9        -3.70       66.7
    On AFDC§                         -1.91       39.7        -1.91       42.0
Aggregate, 1970 zip codes
  Income                              0.74      117.8
  Education                           0.27      168.0
  Fraction
    Poor                             -1.89       58.2
    High school graduate              1.37      153.4
    Professional                      1.92      133.4
    Unemployed                       -4.32       26.5
    On AFDC                          -2.21       30.4
Aggregate, 1980 census tracts
  Income                                                      0.60      137.7
  Education                                                   0.22      188.8
  Fraction
    Poor                                                     -1.36       52.2
    High school graduate                                      1.30      143.1
    Professional                                              2.03      206.6
    Unemployed                                               -3.12       64.6
    On AFDC                                                  -2.04       64.3

* Data source: Panel Study of Income Dynamics (PSID).
† Specifications include controls for age, race, and sex.
‡ Specifications also include dummy for missing occupation.
§ SEI/10, socioeconomic index divided by 10; AFDC, Aid to Families with Dependent Children.

DISCUSSION
Neither of the following appears to affect regression
results appreciably, in terms of the goodness of fit of
models or the magnitude of coefficient estimates: using 1970 compared with 1980 census data or using zip
code versus census tract level data. These results may
seem counterintuitive. Yet, regarding census year, the
tabulations indicate that economic characteristics of
geographic units in 1970 are excellent proxies for the
economic characteristics of the same unit in 1980.
Correlations between aggregate variables in 1970 and
1980 were all above 0.8 except for fraction unemployed. Unemployment may vary over time within
locales due to regional and other macroeconomic effects on employment levels. Generally, the relative
wealth or poverty of specific locales appears to remain
remarkably stable in the United States, at least over a
10-year period.

TABLE 6. The effects of socioeconomic group on self-rated overall health using 1970 and 1980 zip code sample*,†

                                  Regression coefficient (standard error) by model§
SES‡ variable                     1            2            3            4            5            6            7
A. Microlevel variables
  Income                          0.35 (0.02)                            0.23 (0.02)  0.27 (0.02)               0.22 (0.02)
  Education                                    0.15 (0.01)               0.13 (0.01)               0.13 (0.01)  0.12 (0.01)
  SEI‡                                                      0.12 (0.01)               0.09 (0.01)  0.03 (0.01)  0.01 (0.01)
  χ² statistics on SES variables  278.5        477.8        250.8        583.2        399.8        508.9        597.0
B. Aggregate variables, 1980 zip codes
  Income                          0.66 (0.05)                            0.16 (0.07)  0.21 (0.07)               0.15 (0.07)
  Education                                    0.25 (0.02)               0.21 (0.02)               0.13 (0.04)  0.11 (0.04)
  Professional                                              2.20 (0.13)               1.83 (0.18)  1.10 (0.35)  1.05 (0.35)
  χ² statistics on SES variables  110.2        171.4        170.3        174.8        176.3        177.6        180.5
C. Aggregate variables, 1970 zip codes
  Income                          0.74 (0.05)                            0.24 (0.08)  0.40 (0.07)               0.24 (0.08)
  Education                                    0.27 (0.02)               0.22 (0.02)               0.27 (0.04)  0.22 (0.04)
  Professional                                              1.92 (0.13)               1.30 (0.17)  0.03 (0.30)  0.01 (0.30)
  χ² statistics on SES variables  117.8        168.0        133.4        174.0        153.9        168.0        174.0

* Data source: Panel Study of Income Dynamics (PSID).
† All specifications include controls for age, race, and sex; SEI specifications also include a dummy variable for missing occupation.
‡ SES, socioeconomic status; SEI, socioeconomic index.
§ Models 1-3 each include only a single SES variable; models 4-6 each include two SES variables; model 7 includes all three SES variables.

Our finding that use of census tract level data does
not greatly improve estimation over using zip code
level data appears to be due to the fact that socioeconomic variation within census tracts is almost as great
as that within zip code areas. For example, comparing
the variation in income measured at the micro and
aggregate level in the PSID suggests that 11 percent of
variation in individual income is between zip codes.
Thus, there is 89 percent as much variation in income
within zip codes as in the general population. In census tracts, our estimates imply that there is 84 percent
as much variation within tracts as in the overall
population.
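The article does not spell out how these shares were calculated. One reading that reproduces the quoted figures, assumed here, treats the variance of the area-level income measure as the between-area component of individual income variance and takes the squared ratio of the standard deviations reported in table 2 (zip code/census tract sample). The function and constant names are illustrative.

```python
def between_area_share(sd_area_level, sd_individual):
    """Share of individual-level variance that lies between areas.

    Assumes the variance of the area-level measure approximates the
    between-area component of individual variance.
    """
    return (sd_area_level / sd_individual) ** 2


# Standard deviations of log income from table 2, zip code/census tract sample.
SD_INDIVIDUAL = 0.84  # microlevel family income
SD_ZIP = 0.28         # median income, 1980 zip codes
SD_TRACT = 0.34       # median income, 1980 census tracts

within_zip = 1 - between_area_share(SD_ZIP, SD_INDIVIDUAL)
within_tract = 1 - between_area_share(SD_TRACT, SD_INDIVIDUAL)
```

Under this reading, moving from zip codes to tracts raises the between-area share only from about one ninth to about one sixth of individual variance, which is consistent with the modest gains in fit seen in the regressions.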
Our data did not permit analysis at the block group
level. However, given the little difference it made to
move from the zip code to census tract level of aggregation, we would not assume, a priori, that moving to
block group data would alter results qualitatively. In
Australia, Hyndman et al. (29) found that data collected at the level of "collector's districts" did yield
substantially more reliable estimates than those collected at the larger "postcode" level of aggregation.
484
Geronimus and Bound
TABLE 7. The effects of socioeconomic group on self-rated overall health using 1980 zip code and
census tract sample*,t
SES*
variable
1
2
Regression coefficient (standard error) by model§
3
4
5
6
7
A. Microlevel variables
Income
0.33
(0.02)
Education
0.22
(0.02)
0.14
(0.01)
SEI*
X2 statistics on SES variables
0.12
(0.01)
0.12
(0.01)
268.7
456.8
0.25
(0.02)
268.5
558.8
0.20
(0.02)
0.12
(0.01)
0.11
(0.01)
0.09
(0.01)
0.03
(0.01)
0.01
(0.01)
406.6
505.9
587.1
B. Aggregate variables, 1980 zip codes
Income
0.65
(0.05)
Education
0.20
(0.07)
0.24
(0.01)
Professional
χ² statistics on SES variables
0.20
(0.02)
2.25
(0.14)
114.6
168.3
0.23
(0.06)
178.1
173.7
0.19
(0.07)
0.09
(0.03)
0.06
(0.04)
1.85
(0.18)
1.48
(0.31)
1.46
(0.31)
186.6
183.1
188.1
C. Aggregate variables, 1980 census tracts
Income
0.60
(0.04)
Education
0.23
(0.06)
0.22
(0.01)
Professional
χ² statistics on SES variables
188.8
206.6
199.3
0.19
(0.06)
0.08
(0.03)
0.05
(0.03)
1.65
(0.15)
1.44
(0.25)
1.35
(0.25)
217.5
211.6
219.0
0.17
(0.02)
2.03
(0.11)
137.7
0.22
(0.05)
* Data source: Panel Study of Income Dynamics (PSID).
† All specifications include controls for age, race, and sex; SEI specifications also include a dummy variable
for missing occupation.
‡ SES, socioeconomic status; SEI, socioeconomic index.
§ Models 1-3 each include only a single SES variable; models 4-6 each include two SES variables; model 7
includes all three SES variables.
However, the generalizability of results from Australia
to the United States is an open question. How social
stratification is reflected in geography and government
statistical units may vary between the two countries.
Krieger (30) compared census tract to block group
level results in her analysis based on health maintenance organization data in California. In half of her
calculations, estimates based on block groups were
more reliable than those based on census tracts, but in
some of her calculations the reverse was true, and in
no case were differences in confidence intervals very
great (see her table 2). Block group data may perform
better than census tract or zip code level data in a less
select sample, but this remains an empirical question.
Our findings suggest that there is little advantage in
the inclusion of multiple aggregate measures compared with a single aggregate measure in health
outcome equations. There is little to be gained in
explanatory power by including multiple aggregate
measures, and their multicollinearity exacerbates
problems in the interpretation of coefficients in such a
model. While not an explicit objective of this study,
our findings also raise questions about the merit of
including a socioeconomic index of occupation when
microlevel data on income or education are available.
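One common way to quantify the multicollinearity described above is the variance inflation factor (VIF), which for each regressor equals 1 / (1 - R²) from a regression of that regressor on the others. The sketch below is an illustration on simulated data, not a diagnostic the authors report: the function name, the three hypothetical "aggregate" and "micro" measures, and the strength of their shared component are all assumptions.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X, regressing it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # include an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


rng = np.random.default_rng(0)
n = 5000
shared = rng.normal(size=(n, 1))
# Assumption: aggregate measures share most of their variation,
# microlevel measures share much less.
aggregate = shared + 0.3 * rng.normal(size=(n, 3))
micro = 0.5 * shared + rng.normal(size=(n, 3))

print("aggregate VIFs:", variance_inflation_factors(aggregate).round(1))
print("micro VIFs:    ", variance_inflation_factors(micro).round(1))
```

Large VIFs signal that the individual coefficients are imprecisely estimated and hard to interpret separately, which is the practical concern with entering several aggregate measures at once.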
Our findings indicate that conceptual differences
among aggregate variables are more blurred than those
between their microlevel counterparts. One implication is that choosing an aggregate measure on theoretical grounds may be ascribing greater construct validity to specific measures than is merited. More
generally, the findings suggest a qualified recommendation on the question of which single aggregate measure to include. Across the PSID and NMIHS samples,
models including median income consistently had better predictive power than models including some of the other
aggregate measures. In neither data set did the
fraction unemployed or the fraction on
Aid to Families with Dependent Children fit the data
as well as other aggregate variables. In the PSID
samples, but not the NMIHS sample, the aggregate
education variables and the occupational position variable fit the data as well as or better than median income,
while in the NMIHS, but not the PSID, the aggregate
poverty variable had roughly the same goodness of fit
associated with it as median income (not shown).
These findings lead us to believe that median
income—the most commonly used aggregate variable
in the literature to date—may be a sensible single
aggregate measure to use. When data permit, investigators may wish to conduct analyses to test the sensitivity of their results to different aggregate measures.
Although our findings should give investigators
some assurance about the use of imperfect data, they
also suggest that caution should be exercised in the
interpretation of results based on census-based aggregate measures. Perhaps one reason it makes little difference whether an investigator uses aggregate data
measured 10 or 20 years ago, or at the zip code or
census tract level, is that aggregate measures are
simply poor proxies for microlevel characteristics. Indeed, the differences in coefficient estimates depending on whether microlevel versus aggregate socioeconomic measures were used show that the aggregate
measures are not akin to their microlevel counterparts.
In general, they picked up larger coefficients and were
more highly multicollinear than respective microlevel
measures.
Estimating larger coefficients with aggregate compared with microlevel measures may appear in conflict
with the common assumption that variables measured
with error will tend to underestimate relations. However, applicable to the current context, Geronimus et
al. (17) outlined a statistical framework that identifies
two sources of bias. First, there is an errors-in-variables
bias that arises because the aggregate variable is only
imperfectly correlated with the microlevel variable it
represents. This bias is different from the standard
errors-in-variables bias which is proportional to the
reliability of a measure. Instead, the errors-in-variables
bias arises because socioeconomic variation within
geographic areas is correlated with microlevel covariates, such as race, that are also included in the estimating equations (17, 31). The second source of bias
is an aggregation bias, which arises from the fact that
the aggregate variable may itself be correlated with the
residual in the microlevel equation. While the first
problem is likely to exert a downward bias on the
coefficient, the magnitude of that bias will typically be
smaller than in the more standard case. Meanwhile, the
aggregation bias suggests that the aggregate variable is
a proxy for a broader construct than the microlevel
variable (32) and this may lead it to pick up a larger
coefficient, as it has in the two national samples we
analyzed. (See reference 17 for explication of these
points.)
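The second source of bias can be illustrated with a small simulation. The sketch below is not the framework of reference 17 and omits covariates such as race, so it shows only the aggregation effect: by assumption, the outcome depends on a person's own socioeconomic value (hypothetical coefficient `beta`) and on an area-level factor that moves one-for-one with the area mean (hypothetical coefficient `gamma`). Under these assumptions the microlevel slope tends toward beta + gamma × (between-area variance share), while the slope on the area mean tends toward beta + gamma. All names and parameter values are illustrative.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def simulate_slopes(n_areas=2000, per_area=20, beta=0.3, gamma=0.3,
                    between_var=0.11, seed=0):
    """Return (microlevel slope, aggregate slope) for one simulated sample."""
    rng = np.random.default_rng(seed)
    area = np.repeat(np.arange(n_areas), per_area)
    area_mean = rng.normal(0.0, np.sqrt(between_var), n_areas)[area]
    micro_x = area_mean + rng.normal(0.0, np.sqrt(1.0 - between_var), area.size)
    # Outcome: own-value effect (beta) plus an assumed area-level effect (gamma)
    # that is correlated with, here identical to, the area mean.
    y = beta * micro_x + gamma * area_mean + rng.normal(0.0, 0.5, area.size)
    return ols_slope(micro_x, y), ols_slope(area_mean, y)


micro_slope, aggregate_slope = simulate_slopes()
print(f"microlevel slope: {micro_slope:.2f}")
print(f"aggregate slope:  {aggregate_slope:.2f}")
```

Setting `gamma=0` removes the area-level factor, and in this simplified setting the two slopes then both estimate `beta`; the larger aggregate coefficient appears only when the area measure also stands in for something beyond the individual characteristic.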
Our empirical findings and this statistical framework together suggest aggregate measures tap a more
global construct than do microlevel measures and
should not be interpreted as equivalent to microlevel
constructs. It may also be inappropriate to think of
them as reflecting phenomena specific to their labels.
This last concern also influences the interpretation of
coefficients in applications where aggregate variables
are used to measure "contextual" effects. That is,
while a significant coefficient on an aggregate variable
may suggest there is some characteristic of the respondent's neighborhood that affects the health outcome
under study, whether or not it is the specific entity
measured by the variable is a more difficult question.
In conclusion, investigators limited to using census-based aggregate measures of socioeconomic group
need not be overly concerned about how recent the
data are (at least within a 20-year period) or whether
they are measured at the zip code or census tract level.
However, there are clear limits to the knowledge to be
gained by this approach. Geocoding data sets may be
more economical than implementing the other recommendations made at the 1994 National Institutes of
Health conference. For example, the participants also
recommended routine collection of a detailed and
diverse set of individual socioeconomic characteristics on government surveys; funding the development of improved health measures on national
surveys—including the PSID—that already have detailed socioeconomic data; and augmenting the individual socioeconomic information collected on vital
statistics data (4, 5). Implementation of at least some of
these more costly recommendations rather than overreliance on geocoding survey or vital statistics data
may be worth the extra effort and resources if important advances in understanding social inequalities in
health are to be made.
ACKNOWLEDGMENTS
Supported by the National Institute of Child Health and
Human Development (contract no. 263-MD-626341) and
the Centers for Disease Control and Prevention (grant no.
U83/CCU51249-02). John Bound is a Fellow of the National Bureau of Economic Research.
The authors thank Drs. Christine Bachrach, Nancy Moss,
and James Weed for their efforts to help them gain access to
the special release of the National Maternal and Infant
Health Survey, Dr. Sherman James for helpful comments on
a previous draft of the paper, Dr. Lisa Neidert for help with
data preparation, Marianne Hillemeier and Pat Burns for
research assistance, and Mary-Claire Toomey and Judy
Mullin for technical assistance with the manuscript.
REFERENCES
1. Williams DR. Socioeconomic differentials in health: a review
and redirection. Soc Psychol Q 1990;53(2):81-99.
2. Angell M. Privilege and health—what is the connection?
N Engl J Med 1993;329:126-7.
3. Feinstein JS. The relationship between socioeconomic status
and health: a review of the literature. Milbank Q 1993;71:
279-322.
4. Moss N, Krieger N. Measuring social inequalities in health.
Public Health Rep 1995;110:302-5.
5. Syme SL, Moss N, Krieger N, rapporteurs. Recommendations
of the conference "Measuring Social Inequalities in Health."
Int J Health Serv 1996;26:521-7.
6. Devesa SS, Diamond EL. Socioeconomic and racial differences in lung cancer incidence. Am J Epidemiol 1983;118:
818-31.
7. McWhorter WP, Schatzkin AG, Horm JW, et al. Contribution
of socioeconomic status to black/white differences in cancer
incidence. Cancer 1989;63:982-7.
8. Mandelblatt J, Andrews H, Kerner J, et al. Determinants of
late stage diagnosis of breast and cervical cancer: the impact
of age, race, social class, and hospital type. Am J Public
Health 1991;81:646-9.
9. Wise PH, Kotelchuck M, Wilson ML, et al. Racial and socioeconomic disparities in childhood mortality in Boston. N Engl
J Med 1985;313:360-6.
10. Gould JB, Davey B, LeRoy S. Socioeconomic differentials in
neonatal mortality: racial comparison of California singletons.
Pediatrics 1989;83:181-6.
11. Collins JW, David RJ. Differences in neonatal mortality by
race, income, and prenatal care. Ethnicity Dis 1992;2:18-26.
12. Kraus JF, Fife D, Ramstein K, et al. The relationship of family
income to the incidence, external causes, and outcomes of
serious brain injury, San Diego County, California. Am J
Public Health 1986;76:1345-7.
13. Marder D, Targonski P, Orris P, et al. Effect of racial and
socioeconomic factors on asthma mortality in Chicago. Chest
1992;101:426S-429S.
14. Byrne C, Nedelman J, Luke RG. Race, socioeconomic status,
and the development of end-stage renal disease. Am J Kidney
Dis 1994;23:16-22.
15. Cherkin DC, Grothaus L, Wagner EH. Is magnitude of copayment effect related to income? Using census data for health
services research. Soc Sci Med 1992;34:33-41.
16. Greenwald HP, Polissar NL, Borgatta EF, et al. Detecting
survival effects of socioeconomic status: problems in the use
of aggregate measures. J Clin Epidemiol 1994;47:903-9.
17. Geronimus AT, Bound J, Neidert LJ. On the validity of using
census geocode characteristics to proxy individual socioeconomic characteristics. J Am Statist Assoc 1996;91:529-37.
18. Krieger N, Williams DR, Moss NE. Measuring social class in
US public health research: concepts, methodologies, guidelines. Annu Rev Public Health 1997;18:341-78.
19. Hill MS. The Panel Study of Income Dynamics: a user's
guide. Newbury Park, CA: Sage Publications, 1992.
20. Institute for Social Research. A Panel Study of Income
Dynamics: procedures and tape codes, 1985 interviewing year
(documentation), vol. I, wave XVIII, a supplement. Ann Arbor, MI: Institute for Social Research, University of Michigan,
1988.
21. Duncan G, Hill D. Assessing the quality of household panel
survey data: the case of the PSID. J Business Econ Stat
1989;7:441-51.
22. Becketti S, Gould W, Lillard L, et al. The Panel Study of
Income Dynamics after fourteen years: an evaluation. J Labor
Econ 1988;6:472-92.
23. Maddox G, Douglas E. Self-assessment of health: a longitudinal study of elderly subjects. J Health Soc Behav 1973;14:
87-93.
24. LaRue A, Bank L, Jarvik L, et al. Health in old age: how
physicians' ratings and self-ratings compare. J Gerontology
1979;34:687-91.
25. Ferraro KF. Self-ratings of health among the old and old-old.
J Health Soc Behav 1980;21:377-83.
26. Mossey JM, Shapiro E. Self-rated health: a predictor of mortality among the elderly. Am J Public Health 1982;72:800-8.
27. Manning WG, Newhouse JP, Ware JE Jr. The status of health
in demand estimation, or beyond excellent, good, fair and
poor. In: Fuchs VR, ed. Economic aspects of health. Chicago:
University of Chicago Press, 1982:143-84.
28. Duncan O. A socioeconomic index for all occupations. In:
Reiss AJ Jr, ed. Occupations and social status. New York:
Free Press, 1961:109-38.
29. Hyndman JCT, Holman CDJ, Hockey RL, et al. Misclassification of social disadvantage based on geographical areas:
comparison of postcode and collector's districts analyses. Int
J Epidemiol 1995;24:165-76.
30. Krieger N. Overcoming the absence of socioeconomic data in
medical records: validation and application of a census-based
methodology. Am J Public Health 1992;82:703-10.
31. Dickens WT, Ross BA. Consistent estimation using data from
more than one sample. Technical working paper no. 33. Cambridge, MA: National Bureau of Economic Research, 1984.
32. Hammond JL. Two sources of error in ecological correlations.
Am Sociol Rev 1973;38:764-77.