Letters to the Editor
sequence of aggregation that the range of values is
decreased (refer to the standard deviations in table 2 (1)). At
a simple level, it is easy to see that the aggregate-level
income attached to the lowest-income person in a sample
will be above the individual income (there is no one on the
same or lower income level to aggregate with); for the richest person, the aggregate income will be lower than the individual income for the same reason. The individual and
aggregate coefficients are simply not comparable, as they
have different meanings (e.g., the increased risk of mortality for someone earning $10,000 less than someone else
as opposed to the increased risk of mortality for someone
living in an area in which the median income is $10,000 less
than in another area). As the range of area-based incomes is
highly compressed compared with individual incomes, it is
almost certain that the coefficients for the former will be
considerably larger than those for the latter. In fact, taking
the standard deviations into account, the standardized
regression coefficients are, if anything, larger for individual
than for area-based measures. For example, the standardized
coefficient for microlevel education is 0.375; for aggregate
1980 education based on zip code, it is 0.225.
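As a reminder of how the two kinds of coefficient are related, the usual conversion can be written out. The symbols below are our own shorthand, not notation taken from table 2 of the paper: b is the unstandardized coefficient, s_x the standard deviation of the predictor, and s_y the standard deviation of the outcome.

```latex
b^{*} = b \,\frac{s_x}{s_y}
```

Because aggregation shrinks s_x, an area-based measure can have the larger unstandardized coefficient b and still have the smaller standardized coefficient b*.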
We think that the work of Geronimus and Bound (1)
should provide some reassurance to researchers on the use
of area-based measures when no individual data on socioeconomic position exist. Certainly, when other sources of
data are not available, as in some routine and administrative
data systems, it can be valuable to link area-based measures
to the data. However, it is also clear that individual and
aggregate indices do not measure the same thing. Area-based measures are not a reliable substitute for individual
data; equally, individual-based measures do not provide
complete socioeconomic data. If possible, information
should be collected about individual socioeconomic circumstances across the life course (10), together with area-based
measures. Only by measuring both micro- and macrolevel
data and exploring their interactions will researchers better
understand the respective roles of individual and area-based
socioeconomic influences on health.
REFERENCES
1. Geronimus AT, Bound J. Use of census-based aggregate variables to proxy for socioeconomic group: evidence from
national samples. Am J Epidemiol 1998;148:475-86.
2. Dolk H, Mertens B, Kleinschmidt I, et al. A standardisation
approach to the control of socioeconomic confounding in
small area studies of environment and health. J Epidemiol
Community Health 1995;49(suppl 2):S9-14.
3. Carr-Hill R, Rice N. Is enumeration district level an improvement on ward level analysis in studies of deprivation and
health? J Epidemiol Community Health 1995;49(suppl 2):
S28-9.
4. Hyndman JC, Holman CD, Hockey RL, et al. Misclassification
of social disadvantage based on geographical areas: comparison of postcode and collector's district analyses. Int J
Epidemiol 1995;24:165-76.
5. Krieger N. Overcoming the absence of socioeconomic data in
medical records: validation and application of a census-based
methodology. Am J Public Health 1992;82:703-10.
6. Carstairs V, Morris R. Deprivation and health in Scotland.
Aberdeen, Scotland: Aberdeen University Press, 1991.
7. Davey Smith G, Hart C, Watt G, et al. Individual social class,
area-based deprivation, cardiovascular disease risk factors, and
mortality: the Renfrew and Paisley Study. J Epidemiol
Community Health 1998;52:399-405.
8. Kunst AE, Mackenbach JP. The size of mortality differences
associated with educational level in nine industrialized countries. Am J Public Health 1994;84:932-7.
9. Davey Smith G, Hart C, Hole D, et al. Education and occupational social class: which is the more important indicator of
mortality risk? J Epidemiol Community Health 1998;52:
153-60.
10. Davey Smith G, Hart C, Blane D, et al. Lifetime socioeconomic position and mortality: prospective observational study.
BMJ 1997;314:547-52.
George Davey Smith
Yoav Ben-Shlomo
Department of Social Medicine
University of Bristol
Bristol, United Kingdom BS8 2PR
Carole Hart
Department of Public Health
University of Glasgow
Glasgow, Scotland G12 8RZ
THE AUTHORS REPLY
We welcome the comments of Davey Smith et al. (1) on
our paper (2) reporting on our empirical assessment of the
increasingly common practice of using census-based
aggregate socioeconomic variables to measure individual
socioeconomic characteristics when microlevel data are
unavailable. We are pleased to see the additional support,
conceptual elaboration, and empirical examples that Davey
Smith et al. provide for our principal conclusions that area-based measures are not a reliable substitute for individual
data; that, in many cases, when aggregate socioeconomic
data are used, the size of the area does not appear to be key;
and that the knowledge to be gained from using this
approach is limited. Major advances in the understanding
of social inequalities in health are likely to require substantially improved data collection, not overreliance on
geocoding to substitute for uncollected microdata as well
as to measure important contextual effects.
Davey Smith et al. (1) do raise a question about our interpretation of the fact that a unit change in an aggregate variable will typically have a larger effect on health outcomes
than a unit change in a comparable individual-level variable.
Here, let us clarify the context in which we were working and
in which we believe this interpretation is reasonable (refer to
Geronimus et al. (3) for more details). We are imagining a
situation in which the only data that researchers have available to them are aggregate-level variables. Researchers use
these aggregate-level variables as proxies for individual
characteristics. How fair is it to interpret such estimates as
comparable to those that would have been obtained if
microlevel information were available? In the fields of sociology and economics, a substantial literature discusses the
circumstances under which it is appropriate to use aggregate-level variables to proxy for individual-level ones (4-7).
However, different circumstances apply in epidemiology,
motivating us to explore this question in the specific context
of health outcomes.
Suppose that some health outcome, y (one could think of
y as the mortality rate), depends linearly on individual-level
socioeconomic position, x, and other factors that are independent of socioeconomic position, $\varepsilon$. Thus, we can write
the following: $y = \alpha + x\beta + \varepsilon$, where $\alpha$ and $\beta$ are parameters
and, by assumption, $\varepsilon$ is independent of x. Now suppose that
we have information on y and x from a random sample of
persons arrayed across distinct geographic areas. Let i index
the person within an area and j the geographic areas. Thus,
we can write y as $y_{ij}$, x as $x_{ij}$, and $\varepsilon$ as $\varepsilon_{ij}$. Each of these variables can be decomposed into components that vary within
and between geographic areas; for example,
defining $\bar{x}_j$ as the mean of x within location j and $v_{ij}$ as the
individual-specific deviation around this mean, $x_{ij} = \bar{x}_j + v_{ij}$.
Now, one could imagine estimating $\beta$ by running a
regression analysis of $y_{ij}$ on $x_{ij}$. However, if one had data
only on the location means of x, then one could imagine
estimating $\beta$ by running a regression analysis of $y_{ij}$ on $\bar{x}_j$. If
not only $x_{ij}$ but also $\bar{x}_j$ is independent of $\varepsilon$, then either procedure will consistently estimate $\beta$. Since we would expect
substantially less variance in $\bar{x}_j$ than in $x_{ij}$, the estimates
will imply that a standard deviation change in the aggregate
variable will have less of an impact on the outcome than a
standard deviation change in the individual-level variable.
In this context, the unstandardized coefficients would be
comparable (i.e., they would consistently estimate the same
parameter), while standardized coefficients would not be
comparable.
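A small simulation can make this point concrete. The sketch below is an illustration only, not the analysis reported in the paper: the number of areas, the parameter values, and the function names (`simulate`, `slope`, `standardized_slope`) are all assumptions chosen for the example. It generates data in which the error term is independent of x, then fits the individual-level and the area-mean regressions.

```python
import random
from statistics import fmean, pstdev


def simulate(n_areas=200, n_per_area=50, alpha=1.0, beta=0.5, seed=0):
    """Simulate y = alpha + beta * x + eps with eps independent of x.

    Returns individual x, the area mean of x repeated for each person,
    and y. All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    x, xbar, y = [], [], []
    for _ in range(n_areas):
        area_level = rng.gauss(0.0, 1.0)  # between-area component of x
        xs = [area_level + rng.gauss(0.0, 2.0) for _ in range(n_per_area)]
        mean_x = fmean(xs)  # the aggregate (area-mean) variable
        for xi in xs:
            eps = rng.gauss(0.0, 1.0)  # independent of x and of area
            x.append(xi)
            xbar.append(mean_x)
            y.append(alpha + beta * xi + eps)
    return x, xbar, y


def slope(x, y):
    """Unstandardized least-squares slope of y on a single regressor x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def standardized_slope(x, y):
    """Slope rescaled to standard deviation units of x and y."""
    return slope(x, y) * pstdev(x) / pstdev(y)


if __name__ == "__main__":
    x, xbar, y = simulate()
    print("unstandardized:", slope(x, y), slope(xbar, y))
    print("standardized:  ", standardized_slope(x, y), standardized_slope(xbar, y))
```

Under these assumptions, both unstandardized slopes should land near the chosen value of beta, while the standardized slope for the area mean should be clearly smaller because the area means vary less than the individual values.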
The situation we have outlined is one in which it would
be perfectly legitimate to use an aggregate variable as a
proxy for a microlevel one. However, the assumption that $\varepsilon$
is independent of x is unlikely to hold in practice. In particular, suppose that x represents only a component of socioeconomic position and that, among other things, $\varepsilon$ includes
other unmeasured components of socioeconomic position.
In this case, we would expect $\mathrm{cov}(\varepsilon, x) > 0$, and our estimates
of $\beta$ will tend to exaggerate the causal effect of x alone on
the outcome variable.
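The direction of this exaggeration can be written out with the standard large-sample expression for a simple regression slope. This is a textbook result added here for illustration, not a derivation taken from the paper; the subscripts "micro" and "agg" are our own labels for the regressions on the individual-level variable and on its area mean.

```latex
\operatorname{plim}\,\hat{\beta}_{\text{micro}}
  = \beta + \frac{\operatorname{cov}(x_{ij}, \varepsilon_{ij})}{\operatorname{var}(x_{ij})},
\qquad
\operatorname{plim}\,\hat{\beta}_{\text{agg}}
  = \beta + \frac{\operatorname{cov}(\bar{x}_j, \varepsilon_{ij})}{\operatorname{var}(\bar{x}_j)}
```

With a positive covariance in the numerator, each estimate exceeds the true coefficient. The aggregate estimate is the more inflated of the two when the covariance per unit of variance is larger between areas than within them.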
Under what circumstances will this tend to be more true
when we use $\bar{x}_j$ rather than $x_{ij}$? The use of $\bar{x}_j$ will produce
the larger magnitude coefficient if, when both are included
in the same regression, the coefficients on the two variables
are of the same sign (6, 8). In this context, where x is an indicator of socioeconomic position, a natural interpretation is
that the aggregate variable represents a broader construct
than the microlevel variable and is likely to exaggerate the
effect of the microlevel counterpart on outcomes of interest
(2, 3, 7).
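The same-sign condition can be checked in a simulated example. The sketch below is hypothetical and is not the authors' data or code: it assumes an unmeasured socioeconomic component z that shares an area-level component with x, and the names `simulate_confounded`, `slope`, and `joint_slopes`, along with every parameter value, are invented for illustration.

```python
import random
from statistics import fmean


def simulate_confounded(n_areas=200, n_per_area=50, beta=0.5, gamma=0.5, seed=1):
    """Outcome depends on measured x and on an unmeasured component z.

    x and z share an area-level component, so the error term
    (gamma * z + noise) is positively correlated with x, mainly
    between areas. All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    x, xbar, y = [], [], []
    for _ in range(n_areas):
        area_level = rng.gauss(0.0, 1.0)  # shared by x and z within an area
        xs = [area_level + rng.gauss(0.0, 2.0) for _ in range(n_per_area)]
        mean_x = fmean(xs)
        for xi in xs:
            z = area_level + rng.gauss(0.0, 1.0)  # unmeasured component
            noise = rng.gauss(0.0, 1.0)
            x.append(xi)
            xbar.append(mean_x)
            y.append(beta * xi + gamma * z + noise)
    return x, xbar, y


def slope(x, y):
    """Least-squares slope of y on a single regressor x (with intercept)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def joint_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 together (with intercept)."""
    m1, m2, my = fmean(x1), fmean(x2), fmean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # determinant of the 2x2 normal equations
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


if __name__ == "__main__":
    x, xbar, y = simulate_confounded()
    b_x, b_xbar = joint_slopes(x, xbar, y)
    print("x alone:", slope(x, y), " area mean alone:", slope(xbar, y))
    print("joint regression: x =", b_x, " area mean =", b_xbar)
```

In this setup the two coefficients in the joint regression are both positive, and the area mean used alone yields a larger coefficient than the individual variable used alone. When the area mean is the exact sample mean of x within each area, the coefficient on the area mean alone equals the sum of the two joint coefficients.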
Let us emphasize that we are talking within the context
in which the aggregate-level variable ($\bar{x}_j$) represents the
location-specific mean (or something like it) of the individual-level variable ($x_{ij}$) and in which the investigator
wishes to use the aggregate variable as a proxy for the
microlevel one. An example is when the investigator
wants to estimate the effect of individual income on a
health outcome but uses the areal measure of median
income in a defined geographic unit in which the person
resides as a proxy for individual income. It is within this
context, in which, among other things, the aggregate- and individual-level variables, $\bar{x}_j$ and $x_{ij}$, are measured in
the same units, that it makes sense to compare unstandardized regression coefficients.
The context just described is precisely the context of our
paper: In our health outcome equations, we compared coefficients for income, education, and occupation, first measured at the microlevel and then at the aggregate level. We
were simulating what many US researchers have in fact
been doing, that is, using median income in a census area to
substitute for uncollected data on individual income. We did
this exercise to arrive at an empirical sense of the validity of
this increasingly used approach to remediating shortcomings in health data sets.
One source of the confusion may be that Davey Smith et
al. (1) come from a research tradition in the United
Kingdom, where socioeconomic variables measured at the
individual level are often collected. The availability of
microlevel economic characteristics on health data sets
offers researchers the option of including them in health
outcome equations with the clear intention that they have
meanings different from economic characteristics measured
areally. For example, individual occupation can be used to
measure a person's social class, while the Carstairs deprivation score can be included as a measure of the socioeconomic character of the area in which that person
resides. Each of these variables is conceptualized explicitly
as making an independent contribution to the health outcome. In such a case, it is important to find a way to compare the coefficients on these conceptually and metrically
distinct variables to assess the relative magnitudes of each
variable's independent contribution to a person's health.
Using standardized coefficients represents one way to do
this, while the relative index of inequality (RII) represents
another.
Unfortunately, it is the exceptional US health data set
that includes detailed individual-level socioeconomic information and the geocodes necessary to make linkages to
areal information on the characteristics of respondents' residential areas. This lack of individual and contextual information inhibits our ability to better understand health
inequalities. Instead, social epidemiologists too often find
themselves in the rather confusing position of having to
rely on census-based variables to do "double duty": serving
sometimes as proxies for an individual characteristic,
sometimes as measures of contextual effects. Our findings
that neither size of area nor length of time since census data
collection (at least within a 20-year period) is critical (2)
should be reassuring to those investigators who are forced
to continue to use this approach in the absence of microlevel socioeconomic data. Davey Smith et al. (1) offer additional examples of the question of size that should also
offer a measure of comfort to the many investigators who
have access to zip-code-level data only. Even so, we
believe that our broader findings imply that results arrived
at by substituting aggregate socioeconomic variables for
uncollected microlevel data require cautious interpretation,
with attention to the demonstrated limits (2, 3). Our ultimate goal should be to get both sorts of socioeconomic
measures (individual and contextual) together in the same
data sets. While it currently may be expedient or economically prudent to use census-based proxies on double duty,
greater advances in our understanding of health inequalities
will likely be derived from analyses that conceptually and
empirically separate individual and contextual socioeconomic indicators.
REFERENCES
1. Davey Smith G, Hart C, Ben-Shlomo Y. Re: "Use of census-based aggregate variables to proxy for socioeconomic group:
evidence from national samples." (Letter). Am J Epidemiol
1999;150:996-7.
2. Geronimus AT, Bound J. Use of census-based aggregate variables to proxy for socioeconomic group: evidence from
national samples. Am J Epidemiol 1998;148:475-86.
3. Geronimus AT, Bound J, Neidert LJ. On the validity of using
census geocode characteristics to proxy individual socioeconomic characteristics. J Am Stat Assoc 1996;91:529-37.
4. Theil H. Linear aggregation of economic relations.
Amsterdam, the Netherlands: North-Holland Publishing Co,
1954.
5. Hannan MT. Aggregation and disaggregation in the social sciences. Lexington, MA: Lexington Books, 1971.
6. Firebaugh G. A rule for inferring individual-level relationships
from aggregate data. Am Sociol Rev 1978;43:557-72.
7. Hammond JL. Two sources of error in ecological correlations.
Am Sociol Rev 1973;38:764-77.
8. Mundlak Y. On the pooling of time series and cross section
data. Econometrica 1978;46:69-85.
Am J Epidemiol Vol. 150, No. 9, 1999
Arline T. Geronimus
Department of Health Behavior and
Health Education
University of Michigan School of
Public Health
Ann Arbor, MI 48109-2029
John Bound
Department of Economics
University of Michigan
Ann Arbor, MI 48109-2029