American Journal of Epidemiology, Vol. 148, No. 5
Copyright © 1998 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved. Printed in U.S.A.

Use of Census-based Aggregate Variables to Proxy for Socioeconomic Group: Evidence from National Samples

Arline T. Geronimus¹ and John Bound²

Increasingly, investigators append census-based socioeconomic characteristics of residential areas to individual records to address the problem of inadequate socioeconomic information on health data sets. Little empirical attention has been given to the validity of this approach. The authors estimate health outcome equations using samples from nationally representative data sets linked to census data. They investigate whether statistical power is sensitive to the timing of census data collection or to the level of aggregation of the census data; whether different census items are conceptually distinct; and whether the use of multiple aggregate measures in health outcome equations improves prediction compared with a single aggregate measure. The authors find little difference in estimates when using 1970 compared with 1980 US Bureau of the Census data or zip code compared with tract level variables. However, aggregate variables are highly multicollinear. Associations of health outcomes with aggregate measures are substantially weaker than with microlevel measures. The authors conclude that aggregate measures cannot be interpreted as if they were microlevel variables, nor should a specific aggregate measure be interpreted to represent the effects of what it is labeled.

Am J Epidemiol 1998;148:475-86.

aggregation; census tract; geocoding; health surveys; social class; socioeconomic factors; zip code

Social inequalities in health are difficult to study (1-3).
A 1994 conference of the National Institutes of Health documented the severe inadequacies of socioeconomic data on health data sets and led to a thoughtful set of recommendations to improve the situation (4, 5). One of the recommendations was to geocode individual records and to link them to socioeconomic characteristics of residential areas drawn from census data. It was suggested that this would be "one powerful and economical way of augmenting existing data bases" (4, p. 305). This approach is already evident in the study of cancer (e.g., references 6-8), infant mortality (e.g., references 9-11), and, to a lesser extent, other health outcomes (e.g., references 12-14).

A number of limitations to the validity of this approach have been suggested, however. Investigators have found this procedure to result in imprecise estimates and to require large sample sizes in order to detect significant differences (15, 16). Drawing on a statistical framework to illuminate biases, the one analysis based on nationally representative samples (17) raised questions about the proper interpretation of coefficient estimates derived through this approach. Yet, in some cases, it may be the only option available to study social differentials in health.

Investigators using this approach face conceptual and related methodological decisions. For example, the census is taken once per decade, with time-lags between data collection and public availability. Potentially, the most proximate census data available were collected more than 10 years prior to the primary data set being analyzed. Whether it is justifiable to append census data that are at least one decade old to individual records to proxy current socioeconomic characteristics is an empirical question. Another important methodological question is, what difference does the level of aggregation of the census data make to the relations observed?
Nationally, the typical zip code contains roughly 25,000 inhabitants, while the typical census tract contains 5,000. The census block group is even smaller, generally containing about 1,000 inhabitants. It is plausible that investigators should prefer the smallest and most homogeneous census-defined region, the census block group. However, census block data are rarely available. Moreover, block group studies will systematically exclude rural residents.

Received for publication September 29, 1997, and accepted for publication March 20, 1998.
Abbreviations: NMIHS, National Maternal and Infant Health Survey; PSID, Panel Study of Income Dynamics; SEI, socioeconomic index.
¹ Department of Health Behavior and Health Education and the Population Studies Center, University of Michigan, Ann Arbor, MI.
² Department of Economics and the Population Studies Center, University of Michigan, Ann Arbor, MI.
Reprint requests to Dr. Arline T. Geronimus, Department of Health Behavior and Health Education, School of Public Health, University of Michigan, 1420 Washington Heights, Ann Arbor, MI 48109-2029.

Often, available geographic identifiers permit linkages to census data only at the zip code level. And only zip-coded data have the potential for complete coverage. Yet, because zip codes cover a large number of inhabitants, use of data aggregated at this level has been called "an option of last resort" (18). However, do zip code-level data perform substantially worse in health outcome equations than data collected at the census-tract level? In addition, the census contains many data items that are theoretically related to socioeconomic group. To what extent do they represent distinct entities? Will choosing one rather than another affect regression results? Are they sufficiently distinct that regression coefficients can lead directly to clear policy advice to reduce health disparities?
Do they capture sufficiently well-defined or separate components of the ways in which social position affects health that using multiple measures will improve estimation compared with a single measure?

Aggregate socioeconomic variables are sometimes used by researchers to estimate "contextual" or "neighborhood" effects. Although some of our findings are relevant to such applications, we are specifically concerned with the many cases where investigators substitute aggregate variables for microlevel data they would have used had they been available. We address: 1) the statistical power of aggregate data relative to microlevel data and relative to other aggregate data, e.g., between different census years or measured at different levels of aggregation; and 2) the general interpretation of results derived in this manner.

MATERIALS AND METHODS

The creation of a data set linking census information to microlevel data from the 1985 wave of the Panel Study of Income Dynamics (the PSID-Geocode file), together with a special release of the 1988 National Maternal and Infant Health Survey (NMIHS) that includes geographic identifiers, provided a unique opportunity to address the above research questions. We performed similar analyses using PSID and NMIHS data, and we found our results to be robust across the two data sets. For reasons of space, we report here only the PSID results. The NMIHS results are available from the authors.

The PSID is an ongoing longitudinal study of the determinants of family income (19, 20). Data from a representative sample of persons have been collected annually since 1968. In 1985, over 60 percent of the original set of sample households remained in the study. We restricted our PSID samples to the men and women from the original 1968 PSID families who are between the ages of 18 and 64 years and who identified themselves as black or white.
We apply sample weights to account for the initial oversampling of some groups, differential attrition, and the expansion over time of the proportion of younger families in the sample. Validation studies suggest that analyses of the PSID yield nationally representative results for blacks and whites when the sample weights are applied (21, 22).

We analyze two PSID samples: one restricted to observations with valid zip codes for both 1970 and 1980; the second restricted to observations with valid zip codes and census tract identifiers for 1980. Ninety-nine percent of the 1985 PSID respondents had valid 1980 zip codes, while 68 percent were matchable to valid 1970 census information using zip codes. We were able to match 72 percent of the PSID sample to 1980 census tract information.

Health-related variables collected by the PSID are limited. We focus on adult self-reported health status using responses to a question that asked respondents to rate their health on a 5-point scale from excellent to poor. Such measures are highly correlated with clinical measures (23-26) and predict subsequent death, health care utilization, and labor market behavior (27), often better than clinical measures.

In table 1, we list the socioeconomic variables studied. We studied microlevel socioeconomic measures commonly used by social epidemiologists, including income, education, and occupation, measured continuously and then categorically. We also studied family income to needs but do not report these results, which were virtually identical to those for family income. The poverty variable also takes account of differences in family income to needs. We based the microlevel occupational measures on respondent's own occupation. We used current occupation, if available, and previous occupation for those currently unemployed. Even so, almost 20 percent of each sample had missing occupational data.
Means, standard deviations, and correlations for the microlevel occupation measures are based on the subsets of each sample for which we had occupational data. In regression analyses, we used the full study samples, but included a dummy variable to indicate when occupational data were missing. For ease of presentation, we divided the continuous occupation variable, the socioeconomic index (SEI), by 10. The SEI (also known as the Duncan index) is the most widely used ranking of occupations in the social sciences. It is estimated by regressing occupational prestige scores on age-standardized occupational levels of earnings and education for a limited set of occupations and then applying weights for earnings and education levels to all other occupations to arrive at predicted prestige scores (28). Thus, it is not entirely distinct from the income and education constructs. SEI scores correspond to the 1970 census occupation codes in the data. We use the SEI for the total population, not the male SEI.

TABLE 1. Variable descriptions

Variable                  Description
Microlevel
  Income*                 Natural logarithm of family income
  Education               Educational attainment in years of schooling
  SEI                     Socioeconomic index (SEI) score corresponding to respondent's current or most recent occupation
  Poor                    Family income to needs below poverty threshold
  High school graduate    High school graduate
  Professional            Current or last occupation classified as professional or managerial
Aggregate†
  Income*                 Log of median family income in residential area
  Education               Mean educational attainment in years of residents aged >25 years
  Poor                    Fraction of the non-elderly residents with incomes below the poverty level
  High school graduate    Fraction of the population aged >25 years with a high school diploma
  Professional            Fraction of adult residents employed in professional, managerial, farming, and protective service occupations
  Unemployed              Fraction of population aged >16 years unemployed
  On AFDC‡                Fraction of households receiving public assistance

* All incomes are in 1997 dollars.
† All aggregate variables refer to the characteristics as of 1970 or 1980 of the respondent's zip code or census tract of residence.
‡ AFDC, Aid to Families with Dependent Children.

The aggregate variables were drawn from US Bureau of the Census Summary Tape Files (STF), which contain detailed tabulations of the nation's population and housing characteristics. We studied a range of possible aggregate measures that have been used as socioeconomic proxies. All income variables were transformed into natural logarithms, and all dollar amounts were first inflated to 1997 dollars using the Consumer Price Index.

We estimate complete first-order correlation matrices for the microlevel socioeconomic variables and the full array of aggregate variables at the zip code level and at the census tract level from the 1970 and 1980 censuses. The correlations between microlevel and aggregate measures of socioeconomic position offer an indication of the reliability of specific aggregate variables as proxies for the microlevel variables. The correlations among aggregate variables indicate the extent to which the aggregate variables are multicollinear. These correlations are suggestive of an investigator's ability to estimate coefficients in regressions using multiple measures and also address the question of the conceptual distinctiveness of aggregate measures.

We estimate selected health outcome equations using various versions of the socioeconomic variables: we include only microlevel variables but then substitute aggregate for microlevel variables; 1970 for 1980 census variables; and zip code level for tract level aggregate variables. We include only one socioeconomic covariate in some models and multiple measures in others.
We evaluate the stability of coefficients on the socioeconomic variables across models and the goodness of fit of various models.

Because outcome measures are discrete (ordered polychotomous), we use methods appropriate to limited dependent variables. We assume that the ordered categorical responses reflect an underlying latent continuous variable that is distributed normally (with larger values representing better health) and estimate a series of ordered probits. The chi-square statistic for each model offers information on the strength of the association between the specific socioeconomic variable and the health outcome and also provides an indication of the reliability with which an investigator could expect to estimate coefficients on the specific variable. Moreover, to the extent that one views any socioeconomic variable as a component of a more global construct of interest, a better fitting model can be interpreted to have picked up a larger component of the overall theoretical construct.

In actual applications, the investigator is interested in the magnitude of the coefficient on the socioeconomic variable. By comparing coefficients between a microlevel variable and its aggregate counterpart (e.g., family income and median income), we gain a sense of how well one can substitute for the other in regression equations.

RESULTS

In table 2, we report means and standard deviations of the socioeconomic variables. For ease of interpretation, income variables can be converted to 1997 dollar values by exponentiating. For example, a mean of 11 corresponds to a family income of roughly $60,000 in 1997 dollars (e^11 ≈ $60,000). Means for aggregate and microlevel variables are generally similar. The aggregate variables show substantially less variation than the microlevel variables, implying that aggregate data provide substantially less statistical power than microlevel data.
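The ordered probit setup described under Materials and Methods can be illustrated with a minimal sketch of how a latent index is mapped to response probabilities. The coefficient 0.35 is the microlevel income coefficient reported in table 5; the cutpoints, the reference income, and all function names are hypothetical choices made only for illustration and are not taken from the paper.

```python
from statistics import NormalDist

def ordered_probit_probs(latent_index, cutpoints):
    """Return P(category k), lowest to highest category, for an ordered probit.

    The latent variable is y* = latent_index + e with e ~ N(0, 1), and a
    response falls in category k when y* lies between adjacent cutpoints.
    """
    std_normal = NormalDist()
    # Cumulative probabilities P(y* <= cut), padded with 0 and 1 at the ends.
    cdf = [0.0] + [std_normal.cdf(c - latent_index) for c in cutpoints] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cutpoints) + 1)]

# Hypothetical cutpoints separating poor < fair < good < very good < excellent.
CUTPOINTS = [-1.5, -0.5, 0.5, 1.5]
# Coefficient on log family income from table 5 (microlevel, 1980/1970 sample).
BETA_INCOME = 0.35

def probs_for_income(log_income, reference_log_income=11.0):
    """Category probabilities relative to a hypothetical reference income."""
    index = BETA_INCOME * (log_income - reference_log_income)
    return ordered_probit_probs(index, CUTPOINTS)

baseline = probs_for_income(11.0)  # log income of 11, roughly $60,000
higher = probs_for_income(12.0)    # one log point higher

print([round(p, 3) for p in baseline])
print([round(p, 3) for p in higher])
```

Under these assumptions, a higher log income shifts probability mass from the lowest toward the highest health category, which is the sense in which a positive coefficient indicates better self-rated health.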
Note also that variation in aggregate variables at the tract level is only somewhat greater than at the zip code level. This suggests that using tract data will increase statistical power compared with using zip code level data, but that this increase will be small.

TABLE 2. Summary statistics: various socioeconomic measures by sample*
Samples: 1980/1970 (n = 4,393); zip code/census tract (n = 4,762). Columns for each sample: mean, standard deviation.
Socioeconomic status measures: Microlevel: Income, Education, SEI/10†, Poor, High school graduate, Professional. Aggregate, 1980 zip codes: Income, Education; Fraction: Poor, High school graduate, Professional, Unemployed, On AFDC†. Aggregate, 1970 zip codes: Income, Education; Fraction: Poor, High school graduate, Professional, Unemployed, On AFDC. Aggregate, 1980 census tracts: Income, Education; Fraction: Poor, High school graduate, Professional, Unemployed, On AFDC.
10.96 13.09 3.92 0.18 0.84 0.36 0.84 2.50 2.02 0.38 0.36 0.49 10.91 13.09 3.92 0.18 0.85 0.35 0.84 2.50 2.00 0.38 0.36 0.49 10.64 12.65 0.29 0.90 10.63 12.67 0.28 0.89 0.10 0.70 0.30 0.07 0.07 0.08 0.14 0.10 0.04 0.06 0.10 0.71 0.30 0.06 0.07 0.08 0.13 0.10 0.04 0.06 9.94 11.99 0.25 0.81 0.09 0.57 0.26 0.04 0.07 0.15 0.10 0.02 0.04 10.62 12.68 0.34 1.01 0.10 0.71 0.30 0.06 0.07 0.09 0.15 0.12 0.04 0.07 0.04
* Data source: Panel Study of Income Dynamics (PSID).
† SEI/10, socioeconomic index divided by 10; AFDC, Aid to Families with Dependent Children.

TABLE 3. Correlations between socioeconomic measures: 1980/1970 zip codes sample*
Microlevel: 1. Income 2. Education 3. SEI† 4. Poor 5. High school graduate 6. Professional. 1980 zip codes: 7. Median income 8. Mean education; Fraction: 9. Poor 10. High school graduate 11. Professional 12. Unemployed 13. AFDC†. 1970 zip codes: 14. Median income 15. Mean education; Fraction: 16. Poor 17. High school graduate 18. Professional 19. Unemployed 20. AFDC.
1 2 3 4 5 6 1.00 0.32 0.36 -0.70 0.27 0.28 1.00 0.61 -0.27 0.64 0.50 1.00 -0.25 0.28 0.81 1.00 -0.26 -0.18 1.00 0.21 1.00 0.40 0.30 0.30 0.38 0.26 0.32 -0.32 -0.25 0.23 0.24 -0.34 0.33 0.30 -0.29 -0.32 -0.21 0.36 0.37 -0.26 -0.25 -0.18 0.28 0.34 -0.24 -0.22 0.31 -0.29 -0.24 0.26 0.30 0.36 0.29 0.29 0.37 0.25 0.30 -0.31 0.31 0.31 -0.21 -0.27 -0.18 0.35 0.34 -0.15 -0.18 -0.15 0.27 0.31 -0.15 -0.15 9 10 11 12 13 -0.47 0.91 0.93 -0.60 -0.57 1.00 -0.66 -0.42 0.65 0.88 1.00 0.80 -0.62 -0.70 1.00 -0.57 -0.51 1.00 0.75 1.00 0.89 0.68 0.68 0.94 -0.70 -0.44 0.73 0.87 0.64 0.86 -0.49 -0.55 -0.75 0.73 0.69 -0.48 -0.69 -0.40 0.87 0.88 -0.31 -0.42 0.83 -0.58 -0.47 0.52 0.79 -0.59 0.93 0.79 -0.33 -0.55 -0.33 0.75 0.90 -0.35 -0.36 0.46 -0.56 -0.58 0.60 0.56 7 8 0.21 0.25 1.00 0.72 1.00 -0.20 0.26 0.22 -0.19 -0.21 -0.14 0.23 0.26 -0.19 -0.18 -0.85 0.79 0.68 -0.64 -0.78 -0.29 -0.23 0.20 0.23 0.20 0.23 0.29 -0.26 -0.24 0.17 0.26 -0.17 0.24 0.21 -0.12 -0.16 -0.12 0.22 0.24 -0.11 -0.12 14 15 -0.60 -0.53 1.00 0.72 1.00 0.68 -0.61 -0.56 0.56 0.83 -0.81 0.76 0.67 -0.45 -0.65 -0.44 0.94 0.90 -0.32 -0.44 16 17 18 19 20 1.00 -0.59 -0.38 0.47 0.78 1.00 0.80 -0.33 -0.53 1.00 -0.38 -0.43 1.00 0.62 1.00
* Data source: Panel Study of Income Dynamics (PSID).
† SEI, socioeconomic index; AFDC, Aid to Families with Dependent Children.

TABLE 4. Correlations between socioeconomic measures: 1980 zip codes/census tracts sample*
Microlevel: 1. Income 2. Education 3. SEI† 4. Poor 5. High school graduate 6. Professional. 1980 zip codes: 7. Median income 8. Mean education; Fraction: 9. Poor 10. High school graduate 11. Professional 12. Unemployed 13. AFDC†. 1980 census tracts: 14. Median income 15. Mean education; Fraction: 16. Poor 17. High school graduate 18. Professional 19. Unemployed 20. AFDC.
1.00 0.32 0.35 -0.70 0.25 0.27 1.00 0.62 -0.25 0.64 0.49 0.40 0.30 0.30 0.38 -0.34 0.33 0.30 -0.28 -0.30 -0.19 0.35 0.37 -0.23 -0.24 0.45 0.34 0.31 0.42 -0.38 0.36 0.35 -0.31 -0.34 -0.20 0.37 0.39 -0.24 -0.26 1.00 -0.25 0.27 0.81 1.00 -0.24 -0.18 1.00 0.20 1.00 0.26 -0.31 0.31 -0.24 0.23 0.24 0.21 0.23 -0.17 0.30 -0.20 -0.14 0.27 -0.27 0.26 0.22 0.33 -0.22 0.21 0.25 -0.22 0.25 -0.16 -0.16 -0.22 0.28 -0.21 -0.18 0.29 0.34 -0.35 -0.27 0.25 0.27 0.22 0.24 -0.18 0.36 -0.22 -0.14 0.29 -0.31 0.30 0.21 0.36 -0.26 0.24 0.26 -0.23 0.28 -0.20 -0.17 -0.23 0.33 -0.24 -0.18 1.00 0.72 11 12 13 1.00 0.76 -0.58 -0.70 1.00 -0.55 -0.50 1.00 0.72 1.00 0.63 0.75 0.53 0.75 -0.47 -0.47 -0.30 0.63 0.78 -0.45 -0.41 0.47 -0.47 -0.44 0.81 0.58 14 15 -0.60 -0.48 1.00 0.71 1.00 0.67 -0.58 -0.40 0.59 0.81 -0.81 0.76 0.68 -0.58 -0.72 -0.45 0.91 0.89 -0.54 -0.57 16 17 18 19 20 1.00 -0.63 -0.41 0.60 0.82 1.00 0.76 -0.55 -0.68 1.00 -0.53 -0.51 1.00 0.69 1.00 1.00 -0.84 -0.46 1.00 0.78 0.91 -0.65 0.66 0.91 -0.39 -0.60 -0.57 0.62 -0.75 -0.57 0.86 0.81 0.60 10 0.57 0.82 -0.68 -0.39 -0.65 -0.35 0.77 -0.51 0.66 0.75 -0.55 0.83 0.54 0.71 -0.33 0.61 -0.49 -0.46 0.50 -0.47 -0.61 -0.47 0.70 -0.57
* Data source: Panel Study of Income Dynamics (PSID).
† SEI, socioeconomic index; AFDC, Aid to Families with Dependent Children.

Correlations

Table 3 presents simple first-order correlations between the microlevel socioeconomic variables in the PSID, the 1980 census variables, and the 1970 census variables. In table 4, correlations between PSID microlevel variables and aggregate variables from the 1980 census measured at the zip code and tract levels are shown. Not surprisingly, among the microlevel measures, categorical and continuous versions of the same variable are highly correlated (-0.70 for income; 0.64 for education; and 0.81 for occupation). However, with the exception of the correlation between SEI and years of schooling, correlations among microlevel socioeconomic variables are not very high, which suggests that they measure distinct aspects of the construct of socioeconomic position.
The relatively high correlation between SEI and years of schooling may be an artifact of the composite nature of the SEI.

Generally, when an aggregate version of a specific variable is compared with the microlevel version of the same variable, the correlation is small to moderate. These correlations range from 0.24 for the correlation between being in a professional occupation and fraction of the adult population in professional occupations, using 1970 census data at the zip code level, to 0.45 for median income compared with family income, using 1980 data at the tract level. For specific aggregate variables, there is some indication that 1980 variables are more highly correlated with the microlevel variable than 1970 variables, and the tract level variables are more highly correlated with the individual level variables than are the zip code level variables, but the differences in all cases are small.

Correlations among aggregate proxies tend to be larger. In a given year or level of aggregation, it is unusual to find a correlation below 0.50. Most fall between 0.65 and 0.94. Correlations between aggregate variables in 1970 and 1980 tend to be very high, for example, 0.89 for median income and 0.94 for mean education. The only correlation lower than 0.83 is for fraction unemployed (0.60). Correlations between the same variable measured at the zip code level compared with the census tract level in a given year are also generally moderate to high.

Regression models

In table 5, we estimate the effects of socioeconomic group on self-reported health status, first using microlevel measures and then various aggregate measures. In columns 1-2, we compare aggregate variables measured in 1980 with those measured in 1970, while in columns 3-4 we compare aggregate proxies measured at the zip code level with those measured for census tracts. In all cases, we control for race, age, and sex of the respondent.
Coefficients on the explanatory variables can be interpreted as the effect of a one-unit change in the explanatory variable on overall self-reported health measured in standard deviation units. Columns 1 and 3 list the coefficient estimates using different measures of socioeconomic group. Columns 2 and 4 list the chi-square statistic for the test of each model against the model with no socioeconomic status indicators, which provides information on which models fit the data better than others, and, hence, on how well different socioeconomic proxies predict the health outcome.

Most models based on microlevel socioeconomic measures have substantially higher chi-square statistics than those using aggregate socioeconomic variables. In models that can be compared directly (e.g., models with microlevel income compared with those with median income, etc.), the goodness of fit statistic is always substantially higher for the model using the microlevel variable. In addition, the aggregate version of a given variable always picks up a substantially larger coefficient than the corresponding microlevel variable: two to three times larger in many cases, four to five times larger for the aggregate compared with the microlevel professional variable. More generally, socioeconomic variables measured at the aggregate level have very different estimated effects on health from those measured at the microlevel.

Among the regressions that include aggregate rather than microlevel socioeconomic variables, there is little difference in either coefficient estimates or goodness of fit between the zip code and tract levels of aggregation or between 1980 and 1970 census data.
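The chi-square statistics in table 5 compare each model against one with no socioeconomic indicators. The text does not say whether these are likelihood ratio or Wald statistics; the sketch below assumes a likelihood ratio test, and the log-likelihood values are hypothetical numbers chosen only so that the statistic equals the 110.2 reported for 1980 zip code median income. The function names are ours, and the closed-form p-values cover only 1 or 2 degrees of freedom.

```python
import math

def lr_statistic(loglik_restricted, loglik_full):
    """Likelihood ratio statistic: twice the gain in maximized log-likelihood."""
    return 2.0 * (loglik_full - loglik_restricted)

def chi2_sf(stat, df):
    """Upper-tail chi-square probability, closed form for df = 1 or df = 2 only."""
    if stat < 0:
        raise ValueError("chi-square statistic must be non-negative")
    if df == 1:
        # A chi-square(1) variable is a squared standard normal.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # A chi-square(2) variable is exponential with mean 2.
        return math.exp(-stat / 2.0)
    raise NotImplementedError("only df = 1 or 2 are handled in this sketch")

# Hypothetical log-likelihoods (not from the paper): a model with controls only,
# and the same model plus one aggregate socioeconomic variable.
LOGLIK_CONTROLS_ONLY = -6000.0
LOGLIK_WITH_MEDIAN_INCOME = -5944.9

stat = lr_statistic(LOGLIK_CONTROLS_ONLY, LOGLIK_WITH_MEDIAN_INCOME)
p_value = chi2_sf(stat, df=1)
print(round(stat, 1), p_value)
```

Because every single-variable model in table 5 adds the same number of parameters (apart from the missing-occupation dummy in the occupation models), larger statistics can be read directly as better fit, which is how the comparisons in the text proceed.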
However, in both 1970 and 1980 and at both the zip code and census tract levels of aggregation, models including aggregate income or education variables and the aggregate variable based on occupational type consistently fit substantially better than models using the remaining aggregate variables.

TABLE 5. The effect of socioeconomic group on self-rated health by sample: comparisons across various socioeconomic measures*,†

                                  1980/1970 sample         Zip code/census tract sample
                                  (n = 4,393)              (n = 4,762)
Socioeconomic status measure      Coefficient    χ²        Coefficient    χ²
Microlevel
  Income                             0.35       278.5         0.33       268.7
  Education                          0.15       477.8         0.14       456.8
  SEI/10‡,§                          0.12       250.8         0.12       268.5
  Poor                              -0.61       182.8        -0.55       165.0
  High school graduate               0.67       217.0         0.62       198.9
  Professional‡                      0.41       198.5         0.39       224.1
Aggregate, 1980 zip codes
  Income                             0.66       110.2         0.65       114.6
  Education                          0.25       171.4         0.24       168.3
  Fraction
    Poor                            -1.45        40.2        -1.38        37.0
    High school graduate             1.56       152.2         1.44       135.1
    Professional                     2.20       170.3         2.25       178.1
    Unemployed                      -3.86        68.9        -3.70        66.7
    On AFDC§                        -1.91        39.7        -1.91        42.0
Aggregate, 1970 zip codes
  Income                             0.74       117.8
  Education                          0.27       168.0
  Fraction
    Poor                            -1.89        58.2
    High school graduate             1.37       153.4
    Professional                     1.92       133.4
    Unemployed                      -4.32        26.5
    On AFDC                         -2.21        30.4
Aggregate, 1980 census tracts
  Income                                                      0.60       137.7
  Education                                                   0.22       188.8
  Fraction
    Poor                                                     -1.36        52.2
    High school graduate                                      1.30       143.1
    Professional                                              2.03       206.6
    Unemployed                                               -3.12        64.6
    On AFDC                                                  -2.04        64.3

* Data source: Panel Study of Income Dynamics (PSID).
† Specifications include controls for age, race, and sex.
‡ Specifications also include dummy for missing occupation.
§ SEI/10, socioeconomic index divided by 10; AFDC, Aid to Families with Dependent Children.

Estimates reported in tables 6 and 7 offer information on the question of whether prediction of health outcomes is improved when multiple aggregate measures are included in models, relative to a single measure. This also addresses the question of the conceptual comparability of the procedure of including multiple aggregate measures in a model with that of including multiple microlevel measures. In all cases, microlevel models fit the data better than models that use aggregate proxies (compare panel A with panel B or C in each table). In microlevel models, goodness of fit improves when multiple variables are included relative to inclusion of a single variable. The inclusion of a second microlevel variable does not dramatically alter the coefficient on the already included income or education variable, although the coefficient on the SEI does change more dramatically when education is included in the model, presumably because it is a composite. Increases in predictive power associated with including multiple aggregate measures relative to the inclusion of only one are more modest. The inclusion of a second aggregate variable often has a large impact on the coefficient on the already included aggregate variable. There are virtually no differences between using 1970 data and 1980 data, while those between aggregating at the zip code versus census tract levels are small.

DISCUSSION

Neither of the following appears to affect regression results appreciably, in terms of the goodness of fit of models or the magnitude of coefficient estimates: using 1970 compared with 1980 census data or using zip code versus census tract level data.
These results may seem counterintuitive. Yet, regarding census year, the tabulations indicate that economic characteristics of geographic units in 1970 are excellent proxies for the economic characteristics of the same unit in 1980. Correlations between aggregate variables in 1970 and 1980 were all above 0.8 except for fraction unemployed. Unemployment may vary over time within locales due to regional and other macroeconomic effects on employment levels. Generally, the relative wealth or poverty of specific locales appears to remain remarkably stable in the United States, at least over a 10-year period.

TABLE 6. The effects of socioeconomic group on self-rated overall health using 1970 and 1980 zip code sample*,†
SES‡ variable; regression coefficient (standard error) by model§: 1 2 3 4 5 0.23 (0.02) 0.27 (0.02) 6 7
A. Microlevel variables Income 0.35 (0.02) Education 0.12 (0.01) SEI‡ χ² statistics on SES variables 0.13 (0.01) 0.15 (0.01) 278.5 477.8 250.8 583.2 0.22 (0.02) 0.13 (0.01) 0.12 (0.01) 0.09 (0.01) 0.03 (0.01) 0.01 (0.01) 399.8 508.9 597.0
B. Aggregate variables, 1980 zip codes Income 0.66 (0.05) Education 0.16 (0.07) χ² statistics on SES variables 0.21 (0.02) 0.25 (0.02) 2.20 (0.13) Professional 110.2 171.4 0.21 (0.07) 170.3 174.8 0.15 (0.07) 0.13 (0.04) 0.11 (0.04) 1.83 (0.18) 1.10 (0.35) 1.05 (0.35) 176.3 177.6 180.5
C. Aggregate variables, 1970 zip codes Income 0.74 (0.05) Education 0.24 (0.08) 0.27 (0.02) Professional χ² statistics on SES variables 0.22 (0.02) 1.92 (0.13) 117.8 168.0 0.40 (0.07) 133.4 174.0 0.24 (0.08) 0.27 (0.04) 0.22 (0.04) 1.30 (0.17) 0.03 (0.30) 0.01 (0.30) 153.9 168.0 174.0
* Data source: Panel Study of Income Dynamics (PSID).
† All specifications include controls for age, race, and sex; SEI specifications also include a dummy variable for missing occupation.
‡ SES, socioeconomic status; SEI, socioeconomic index.
§ Models 1-3 each include only a single SES variable; models 4-6 each include two SES variables; model 7 includes all three SES variables.

TABLE 7. The effects of socioeconomic group on self-rated overall health using 1980 zip code and census tract sample*,†
SES‡ variable; regression coefficient (standard error) by model§: 1 2 3 4 5 6 7
A. Microlevel variables Income 0.33 (0.02) Education 0.22 (0.02) 0.14 (0.01) SEI‡ χ² statistics on SES variables 0.12 (0.01) 0.12 (0.01) 268.7 456.8 0.25 (0.02) 268.5 558.8 0.20 (0.02) 0.12 (0.01) 0.11 (0.01) 0.09 (0.01) 0.03 (0.01) 0.01 (0.01) 406.6 505.9 587.1
B. Aggregate variables, 1980 zip codes Income 0.65 (0.05) Education 0.20 (0.07) 0.24 (0.01) Professional χ² statistics on SES variables 0.20 (0.02) 2.25 (0.14) 114.6 168.3 0.23 (0.06) 178.1 173.7 0.19 (0.07) 0.09 (0.03) 0.06 (0.04) 1.85 (0.18) 1.48 (0.31) 1.46 (0.31) 186.6 183.1 188.1
C. Aggregate variables, 1980 census tracts Income 0.60 (0.04) Education 0.23 (0.06) 0.22 (0.01) Professional χ² statistics on SES variables 188.8 206.6 199.3 0.19 (0.06) 0.08 (0.03) 0.05 (0.03) 1.65 (0.15) 1.44 (0.25) 1.35 (0.25) 217.5 211.6 219.0 0.17 (0.02) 2.03 (0.11) 137.7 0.22 (0.05)
* Data source: Panel Study of Income Dynamics (PSID).
† All specifications include controls for age, race, and sex; SEI specifications also include a dummy variable for missing occupation.
‡ SES, socioeconomic status; SEI, socioeconomic index.
§ Models 1-3 each include only a single SES variable; models 4-6 each include two SES variables; model 7 includes all three SES variables.

Our finding that use of census tract level data does not greatly improve estimation over using zip code level data appears to be due to the fact that socioeconomic variation within census tracts is almost as great as that within zip code areas. For example, comparing the variation in income measured at the micro and aggregate level in the PSID suggests that 11 percent of variation in individual income is between zip codes. Thus, there is 89 percent as much variation in income within zip codes as in the general population. In census tracts, our estimates imply that there is 84 percent as much variation within tracts as in the overall population.

Our data did not permit analysis at the block group level. However, given the little difference it made to move from the zip code to census tract level of aggregation, we would not assume, a priori, that moving to block group data would alter results qualitatively. In Australia, Hyndman et al. (29) found that data collected at the level of "collector's districts" did yield substantially more reliable estimates than those collected at the larger "postcode" level of aggregation. However, the generalizability of results from Australia to the United States is an open question. How social stratification is reflected in geography and government statistical units may vary between the two countries. Krieger (30) compared census tract to block group level results in her analysis based on health maintenance organization data in California. In half of her calculations, estimates based on block groups were more reliable than those based on census tracts, but in some of her calculations the reverse was true, and in no case were differences in confidence intervals very great (see her table 2). Block group data may perform better than census tract or zip code level data in a less select sample, but this remains an empirical question.

Our findings suggest that there is little advantage in the inclusion of multiple aggregate measures compared with a single aggregate measure in health outcome equations.
There is little to be gained in explanatory power by including multiple aggregate measures, and their multicollinearity exacerbates problems in the interpretation of coefficients in such a model. While not an explicit objective of this study, our findings also raise questions about the merit of including a socioeconomic index of occupation when microlevel data on income or education are available.

Our findings indicate that conceptual differences among aggregate variables are more blurred than those between their microlevel counterparts. One implication is that choosing an aggregate measure on theoretical grounds may be ascribing greater construct validity to specific measures than is merited.

More generally, the findings suggest a qualified recommendation on the question of which single aggregate measure to include. Across the PSID and NMIHS samples, models including median income consistently had better predictive power than models including some of the other aggregate measures. In neither data set did fraction unemployed or fraction on Aid to Families with Dependent Children fit the data as well as other aggregate variables. In the PSID samples, but not the NMIHS sample, the aggregate education variables and the occupational position variable fit the data as well as or better than median income, while in the NMIHS, but not the PSID, the aggregate poverty variable had roughly the same goodness of fit associated with it as median income (not shown). These findings lead us to believe that median income, the most commonly used aggregate variable in the literature to date, may be a sensible single aggregate measure to use. When data permit, investigators may wish to conduct analyses to test the sensitivity of their results to different aggregate measures.
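A sensitivity check of the kind suggested above might look roughly like the following sketch. It uses simulated data only: the variable names (`median_income`, `frac_college`, `frac_professional`), the noise levels, and the effect size are illustrative assumptions rather than quantities taken from the PSID or NMIHS analyses. It also uses ordinary least squares and R² as a simple stand-in for the models and χ² statistics reported in the tables. Each hypothetical aggregate measure is a noisy reflection of one latent area characteristic, which makes the measures highly correlated with one another.

```python
import numpy as np

def r_squared(y, X):
    """In-sample R^2 from an OLS fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
n_areas, per_area = 500, 40                      # assumed sample layout
area = np.repeat(np.arange(n_areas), per_area)   # area index of each respondent

# Assumption: one latent area characteristic drives every aggregate measure,
# so the hypothetical measures below are strongly collinear.
latent = rng.normal(size=n_areas)
aggregates = {
    "median_income": latent + 0.3 * rng.normal(size=n_areas),
    "frac_college": latent + 0.5 * rng.normal(size=n_areas),
    "frac_professional": latent + 0.5 * rng.normal(size=n_areas),
}

# Simulated health outcome; the 0.3 effect is illustrative, not an estimate.
health = 0.3 * latent[area] + rng.normal(size=area.size)

# Fit one model per aggregate measure, then one model with all three.
single_fit = {name: r_squared(health, values[area])
              for name, values in aggregates.items()}
joint_fit = r_squared(
    health, np.column_stack([values[area] for values in aggregates.values()])
)
pairwise_corr = np.corrcoef(list(aggregates.values()))

for name, fit in single_fit.items():
    print(f"{name:18s} R^2 = {fit:.4f}")
print(f"{'all three':18s} R^2 = {joint_fit:.4f}")
print("correlation, first two measures:", round(pairwise_corr[0, 1], 2))
```

Under these assumptions the joint model can never fit worse in sample than the best single-measure model, but the gain is small because the measures carry largely the same information.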
Although our findings should give investigators some assurance about the use of imperfect data, they also suggest that caution should be exercised in the interpretation of results based on census-based aggregate measures. Perhaps one reason it makes little difference whether an investigator uses aggregate data measured 10 or 20 years ago, or at the zip code or census tract level, is that aggregate measures are simply poor proxies for microlevel characteristics. Indeed, the differences in coefficient estimates depending on whether microlevel versus aggregate socioeconomic measures were used show that the aggregate measures are not akin to their microlevel counterparts. In general, they picked up larger coefficients and were more highly multicollinear than the respective microlevel measures.

Estimating larger coefficients with aggregate compared with microlevel measures may appear to conflict with the common assumption that variables measured with error will tend to underestimate relations. However, applicable to the current context, Geronimus et al. (17) outlined a statistical framework that identifies two sources of bias. First, there is an errors-in-variables bias that arises because the aggregate variable is only imperfectly correlated with the microlevel variable it represents. This bias is different from the standard errors-in-variables bias, which is proportional to the reliability of a measure. Instead, the errors-in-variables bias arises because socioeconomic variation within geographic areas is correlated with microlevel covariates, such as race, that are also included in the estimating equations (17, 31). The second source of bias is an aggregation bias, which arises from the fact that the aggregate variable may itself be correlated with the residual in the microlevel equation.
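The aggregation bias just described can be illustrated with a small simulation. This is a sketch under stated assumptions, not the framework of reference 17: the 11 percent between-area share of income variance is the zip code figure quoted earlier, while the individual-level effect `beta`, the area-level effect `delta`, and the sample layout are hypothetical. The sketch covers only the aggregation bias; it includes no microlevel covariates, so the errors-in-variables bias does not arise in it.

```python
import numpy as np

rng = np.random.default_rng(42)

n_areas, per_area = 2000, 20               # assumed sample layout
between_var, within_var = 0.11, 0.89       # between/within shares of income variance
beta, delta = 0.3, 0.3                     # hypothetical effects, not estimates

area = np.repeat(np.arange(n_areas), per_area)

# Area mean income (the "census" value) plus individual deviation from it.
area_mean = rng.normal(0.0, np.sqrt(between_var), n_areas)
income = area_mean[area] + rng.normal(0.0, np.sqrt(within_var), area.size)

# The area mean also enters the outcome directly, so it is correlated with
# the residual of a model that contains only individual income.
health = beta * income + delta * area_mean[area] + rng.normal(0.0, 1.0, area.size)

def slope(y, x):
    """Simple regression slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

between_share = np.var(area_mean[area]) / np.var(income)
micro_slope = slope(health, income)                 # microlevel measure
aggregate_slope = slope(health, area_mean[area])    # aggregate proxy

print(f"between-area share of income variance: {between_share:.3f}")
print(f"slope on individual income:            {micro_slope:.3f}")
print(f"slope on area mean income:             {aggregate_slope:.3f}")
```

In this setup the slope on individual income tends toward `beta + 0.11 * delta` and the slope on the area mean toward `beta + delta`, so the aggregate proxy picks up the larger coefficient whenever `delta` is positive.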
While the first problem is likely to exert a downward bias on the coefficient, the magnitude of that bias will typically be smaller than in the more standard case. Meanwhile, the aggregation bias suggests that the aggregate variable is a proxy for a broader construct than the microlevel variable (32) and this may lead it to pick up a larger coefficient, as it has in the two national samples we analyzed. (See reference 17 for explication of these points.)

Our empirical findings and this statistical framework together suggest aggregate measures tap a more global construct than do microlevel measures and should not be interpreted as equivalent to microlevel constructs. It may also be inappropriate to think of them as reflecting phenomena specific to their labels. This last concern also influences the interpretation of coefficients in applications where aggregate variables are used to measure "contextual" effects. That is, while a significant coefficient on an aggregate variable may suggest there is some characteristic of the respondent's neighborhood that affects the health outcome under study, whether or not it is the specific entity measured by the variable is a more difficult question.

In conclusion, investigators limited to using census-based aggregate measures of socioeconomic group need not be overly concerned about how recent the data are (at least within a 20-year period) or whether they are measured at the zip code or census tract level. However, there are clear limits to the knowledge to be gained by this approach. Geocoding data sets may be more economical than implementing the other recommendations made at the 1994 National Institutes of Health conference.
For example, the participants also recommended routine collection of a detailed and diverse set of individual socioeconomic characteristics on government surveys; funding the development of improved health measures on national surveys (including the PSID) that already have detailed socioeconomic data; and augmenting the individual socioeconomic information collected on vital statistics data (4, 5). Implementation of at least some of these more costly recommendations rather than overreliance on geocoding survey or vital statistics data may be worth the extra effort and resources if important advances in understanding social inequalities in health are to be made.

ACKNOWLEDGMENTS

Supported by the National Institute of Child Health and Human Development (contract no. 263-MD-626341) and the Centers for Disease Control and Prevention (grant no. U83/CCU51249-02). John Bound is a Fellow of the National Bureau of Economic Research. The authors thank Drs. Christine Bachrach, Nancy Moss, and James Weed for their efforts to help them gain access to the special release of the National Maternal and Infant Health Survey, Dr. Sherman James for helpful comments on a previous draft of the paper, Dr. Lisa Neidert for help with data preparation, Marianne Hillemeier and Pat Burns for research assistance, and Mary-Claire Toomey and Judy Mullin for technical assistance with the manuscript.

REFERENCES

1. Williams DR. Socioeconomic differentials in health: a review and redirection. Soc Psychol Q 1990;53(2):81-99.
2. Angell M. Privilege and health—what is the connection? N Engl J Med 1993;329:126-7.
3. Feinstein JS. The relationship between socioeconomic status and health: a review of the literature. Milbank Q 1993;71:279-322.
4. Moss N, Krieger N. Measuring social inequalities in health. Public Health Rep 1995;110:302-5.
5. Syme SL, Moss N, Krieger N, rapporteurs. Recommendations of the conference "Measuring Social Inequalities in Health." Int J Health Serv 1996;26:521-7.
6. Devesa SS, Diamond EL. Socioeconomic and racial differences in lung cancer incidence. Am J Epidemiol 1983;118:818-31.
7. McWhorter WP, Schatzkin AG, Horm JW, et al. Contribution of socioeconomic status to black/white differences in cancer incidence. Cancer 1989;63:982-7.
8. Mandelblatt J, Andrews H, Kerner J, et al. Determinants of late stage diagnosis of breast and cervical cancer: the impact of age, race, social class, and hospital type. Am J Public Health 1991;81:646-9.
9. Wise PH, Kotelchuck M, Wilson ML, et al. Racial and socioeconomic disparities in childhood mortality in Boston. N Engl J Med 1985;313:360-6.
10. Gould JB, Davey B, LeRoy S. Socioeconomic differentials in neonatal mortality: racial comparison of California singletons. Pediatrics 1989;83:181-6.
11. Collins JW, David RJ. Differences in neonatal mortality by race, income, and prenatal care. Ethnicity Dis 1992;2:18-26.
12. Kraus JF, Fife D, Ramstein K, et al. The relationship of family income to the incidence, external causes, and outcomes of serious brain injury, San Diego County, California. Am J Public Health 1986;76:1345-7.
13. Marder D, Targonski P, Orris P, et al. Effect of racial and socioeconomic factors on asthma mortality in Chicago. Chest 1992;101:426S-429S.
14. Byrne C, Nedelman J, Luke RG. Race, socioeconomic status, and the development of end-stage renal disease. Am J Kidney Dis 1994;23:16-22.
15. Cherkin DC, Grothaus L, Wagner EH. Is magnitude of copayment effect related to income? Using census data for health services research. Soc Sci Med 1992;34:33-41.
16. Greenwald HP, Polissar NL, Borgatta EF, et al. Detecting survival effects of socioeconomic status: problems in the use of aggregate measures. J Clin Epidemiol 1994;47:903-9.
17. Geronimus AT, Bound J, Neidert LJ. On the validity of using census geocode characteristics to proxy individual socioeconomic characteristics. J Am Statist Assoc 1996;91:529-37.
18. Krieger N, Williams DR, Moss NE. Measuring social class in US public health research: concepts, methodologies, guidelines. Annu Rev Public Health 1997;18:341-78.
19. Hill MS. The Panel Study of Income Dynamics: a user's guide. Newbury Park, CA: Sage Publications, 1992.
20. Institute for Social Research. A Panel Study of Income Dynamics: procedures and tape codes, 1985 interviewing year (documentation), vol. I, wave XVIII, a supplement. Ann Arbor, MI: Institute for Social Research, University of Michigan, 1988.
21. Duncan G, Hill D. Assessing the quality of household panel survey data: the case of the PSID. J Business Econ Stat 1989;7:441-51.
22. Becketti S, Gould W, Lillard L, et al. The Panel Study of Income Dynamics after fourteen years: an evaluation. J Labor Econ 1988;6:472-92.
23. Maddox G, Douglas E. Self-assessment of health: a longitudinal study of elderly subjects. J Health Soc Behav 1973;14:87-93.
24. LaRue A, Bank L, Jarvik L, et al. Health in old age: how physicians' ratings and self-ratings compare. J Gerontology 1979;34:687-91.
25. Ferraro KF. Self-ratings of health among the old and old-old. J Health Soc Behav 1980;21:377-83.
26. Mossey JM, Shapiro E. Self-rated health: a predictor of mortality among the elderly. Am J Public Health 1982;72:800-8.
27. Manning WG, Newhouse JP, Ware JE Jr. The status of health in demand estimation, or beyond excellent, good, fair and poor. In: Fuchs VR, ed. Economic aspects of health. Chicago: University of Chicago Press, 1982:143-84.
28. Duncan O. A socioeconomic index for all occupations. In: Reiss AJ Jr, ed. Occupations and social status. New York: Free Press, 1961:109-38.
29. Hyndman JCT, Holman CDJ, Hockey RL, et al. Misclassification of social disadvantage based on geographical areas: comparison of postcode and collector's districts analyses. Int J Epidemiol 1995;24:165-76.
30. Krieger N. Overcoming the absence of socioeconomic data in medical records: validation and application of a census-based methodology. Am J Public Health 1992;82:703-10.
31. Dickens WT, Ross BA. Consistent estimation using data from more than one sample. Technical working paper no. 33. Cambridge, MA: National Bureau of Economic Research, 1984.
32. Hammond JL. Two sources of error in ecological correlations. Am Sociol Rev 1973;38:764-77.