American Journal of Epidemiology
Copyright © 1999 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 149, No. 11
Printed in U.S.A.

Comparison of Methods for Classifying Hispanic Ethnicity in a Population-based Cancer Registry

Susan L. Stewart,1 Karen C. Swallen,2 Sally L. Glaser,1 Pamela L. Horn-Ross,1 and Dee W. West1

The accuracy of ethnic classification can substantially affect ethnic-specific cancer statistics. In the Greater Bay Area Cancer Registry, which is part of the Surveillance, Epidemiology, and End Results (SEER) Program and of the statewide California Cancer Registry, Hispanic ethnicity is determined by medical record review and by matching to surname lists. This study compared these classification methods with self-report. Ethnic self-identification was obtained by surveying 1,154 area residents aged 20-89 years who were diagnosed with cancer in 1990 and were reported to the registry as being Hispanic or White non-Hispanic. Predictive value positive, sensitivity, and relative bias were used to assess the accuracy of Hispanic classification by medical record and surname. Among those persons classified as Hispanic by either or both of these sources, only two-thirds agreed (predictive value positive = 66%), and many self-identified Hispanics were classified incorrectly (sensitivity = 68%). Classification based on either medical record or surname alone had a lower sensitivity (59% and 61%, respectively) but a higher predictive value positive (77% and 70%, respectively). Ethnic classification by medical record alone resulted in an underestimate of Hispanic cancer cases and incidence rates. Bias was reduced when medical records and surnames were used together to classify cancer cases as Hispanic. Am J Epidemiol 1999;149:1063-71.
bias (epidemiology); classification; ethnic groups; Hispanic Americans; incidence; neoplasms; population studies; SEER program

There is considerable interest today in assessing racial and ethnic differences in patterns of disease to help understand disease causation and control. Although there is no widely accepted definition of ethnicity or race, both have been associated with various genetic, socioeconomic, cultural, and nutritional factors (1-2). Clearly, the interpretation of such associations depends in part on the methods used to classify subjects by race or ethnicity.

In the United States, an ethnic group of particular interest is the Hispanic population. Hispanics are the nation's fastest growing minority and will be the largest by the year 2000 (3). A number of epidemiologic studies have found that Hispanics in various geographic areas, as compared with White non-Hispanics, have lower incidence rates of cancer at several anatomic sites, including the oral cavity (4), esophagus (5), stomach (4, 5), colon (5-7), rectum (4-7), pancreas (5), lung and bronchus (4-8), breast (7, 9), cervix (4), testes (5), prostate (5, 10), bladder (4-7), and kidney (5, 6) as well as lower rates of melanoma (5), mesothelioma (5), chronic lymphocytic leukemia (5), and non-Hodgkin's lymphoma (5). In addition, lower cancer mortality has been found for Hispanics (11) nationally and lower overall cancer incidence among Hispanics in Florida (12) and Illinois (13). In contrast, several studies have found Hispanics to be at increased risk for cancer at several sites, including the cervix (6, 7, 12-14), liver (5, 7, 12), gallbladder (5, 12), stomach (7), nasal cavity (5), penis (5), thyroid (5), and heart and soft tissue (12) as well as for acute lymphocytic leukemia (5) and Kaposi's sarcoma (5).

For many research purposes, Hispanic ethnicity is assessed by self-identification. The US Bureau of the Census currently uses this method.
However, because data on self-identification are not always available, the Census Bureau has used other methods to classify people as Hispanic, including Spanish birth or parentage, Mexican race, Spanish language, Spanish heritage, Spanish origin, and Spanish surname (15). The Passel-Word Spanish surname list, created from the 1980 decennial US Census (16), is the one currently used by the Census Bureau (17).

In health surveillance, such as cancer registration, assignment of ethnicity is often based on medical record report. These classifications may involve subjective appraisals by hospital personnel, and accuracy varies considerably (18). Therefore, when ethnic-specific cancer incidence and survival rates are computed by using surveillance data, the numerator (number of cancer cases) and the denominator (population count) are obtained from sources that typically use different methods of ethnic classification. Discrepancies between these two classification methods may influence the accuracy of disease rate calculations, making comparisons between ethnic groups especially difficult. In particular, if the number of cancer cases reported to the registry as Hispanic is lower than the number of cases of self-identified Hispanics, the risk for cancers in this population will be underestimated. If, on the other hand, the registry systematically overcounts Hispanic cases, the risk will be overestimated.

Received for publication December 29, 1997, and accepted for publication October 7, 1998.
Abbreviations: GUESS, Generally Useful Ethnic Search System; PV+, predictive value positive; PV-, predictive value negative; SEER, Surveillance, Epidemiology, and End Results.
1 Northern California Cancer Center, Union City, CA.
2 Department of Sociology, University of Wisconsin, Madison, WI.
The purpose of this study was to examine the extent of misclassification of Hispanic ethnicity in patient data collected by the San Francisco-Oakland population-based cancer registry compared with ethnic self-identification based on telephone interview. Quantification of misclassification would enable us to estimate the accuracy of different methods available to the registry for classifying persons as Hispanic and to adjust incidence rates for misclassification. A more accurate evaluation of cancer incidence would permit better planning and evaluation of cancer control programs in this rapidly growing segment of the San Francisco Bay Area and US population.

MATERIALS AND METHODS

The aims of this study were to 1) determine the extent of misclassification associated with methods available to the registry for classifying Hispanics and 2) estimate misclassification-adjusted standardized incidence rates for comparison with unadjusted rates. The properties of the proposed adjustment method (19) and statistical models of misclassification as a function of self-reported socioeconomic, cultural, and demographic factors (20) are described elsewhere.

The study included persons who were identified by the Greater Bay Area Cancer Registry, which is part of the Surveillance, Epidemiology, and End Results (SEER) Program and of the statewide California Cancer Registry. Eligible subjects were persons aged 20-89 years when diagnosed with incident invasive or in situ cancer of the colon, lung, female breast, cervix, or prostate during 1990; residing in one of the five registry counties in the San Francisco Bay Area (Alameda, Contra Costa, Marin, San Mateo, and San Francisco); and reported to the registry as being of White race. All eligible subjects were initially classified as either Hispanic or non-Hispanic.
Persons were placed in the Hispanic group for study selection if 1) they were reported to the registry as being of Spanish or Hispanic origin on the basis of their medical records, and/or 2) their surnames appeared on the Census Bureau's 1980 Spanish surname list, and/or 3) their surnames were determined to be Hispanic as a result of using the Generally Useful Ethnic Search System (GUESS) program, developed by the New Mexico Tumor Registry (21). The non-Hispanic group consisted of White persons not determined to be Hispanic by any of these three classification methods. For each cancer site, all those classified as Hispanic were chosen for interviews, and an equal number of White non-Hispanics was selected by random number assignment. The data were sampled by cancer site to enable adjustment of site-specific incidence rates in case of ethnic misclassification due to site-related factors, such as socioeconomic status, not measured directly by the registry. The initial sample consisted of all 780 Hispanics (756 White and 24 non-White) and 781 of 6,452 White non-Hispanics. After subsequent registry updates, in which eligibility for the study was verified, the sample included 743 of 750 White Hispanics and 776 of 6,382 White non-Hispanics.

A brief telephone interview was conducted with subjects or their next of kin. A bilingual interviewer administered the entire interview in Spanish or English, according to the respondent's preference. Questions included ethnic self-identification (as used by the 1980 and 1990 US Censuses), place of birth, immigration history, familial ethnic origin and identification, language preference for speaking and reading, and socioeconomic indicators. The questionnaire was translated into Spanish by using standard methodology, with back translation and resolution of discrepancies.
Overall, 72 percent of the interviews were with the patient, 11 percent with the spouse, 9 percent with a child, 2 percent with a sibling, and 6 percent with other next of kin. Next-of-kin interviews were conducted to avoid possible biases due to excluding patients with short survival times and under the assumption that close relatives would be aware of the patient's ethnic self-identification. Correctness of classification, as measured by predictive value, did not differ significantly between self- and next-of-kin respondents.

The measures of accuracy used to assess the different classification methods were as follows: predictive value positive (PV+), the percentage of persons classified as Hispanic who self-identified as Hispanic; predictive value negative (PV-), the percentage of persons classified as non-Hispanic who self-identified as non-Hispanic; sensitivity, the percentage of self-identified Hispanics who were classified as Hispanic; specificity, the percentage of self-identified non-Hispanics who were classified as non-Hispanic; and relative bias, the amount by which the percentage classified as Hispanic differed from the percentage self-identifying as Hispanic, as a percentage of the latter ((sensitivity/PV+) - 1). To estimate these measures, values from the interviewed sample were weighted in proportion to the inverse of their sampling fraction (the number of eligible subjects divided by the number of subjects interviewed) by cancer site and classification as Hispanic or non-Hispanic.
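The five measures defined above can be illustrated with a minimal Python sketch. It assumes each interviewed subject carries a weight equal to the number of eligible subjects divided by the number interviewed in that subject's sampling stratum. The names `Subject` and `accuracy_measures` and all counts in the example are hypothetical and chosen only for illustration; they are not study data, and this is not the SAS code used in the study.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Subject:
    classified_hispanic: bool  # result of the classification method being evaluated
    self_hispanic: bool        # self-identification from the interview
    weight: float              # eligible / interviewed in the sampling stratum


def accuracy_measures(subjects: Iterable[Subject]) -> Dict[str, float]:
    """Weighted PV+, PV-, sensitivity, specificity, and relative bias (as fractions)."""
    a = b = c = d = 0.0  # weighted cells of the 2 x 2 table
    for s in subjects:
        if s.classified_hispanic and s.self_hispanic:
            a += s.weight   # classified Hispanic, self-identified Hispanic
        elif s.classified_hispanic:
            b += s.weight   # classified Hispanic, self-identified non-Hispanic
        elif s.self_hispanic:
            c += s.weight   # classified non-Hispanic, self-identified Hispanic
        else:
            d += s.weight   # classified non-Hispanic, self-identified non-Hispanic
    pv_pos = a / (a + b)
    sensitivity = a / (a + c)
    return {
        "pv_pos": pv_pos,
        "pv_neg": d / (c + d),
        "sensitivity": sensitivity,
        "specificity": d / (b + d),
        # Same as (a + b) / (a + c) - 1: classified count relative to self-identified count.
        "relative_bias": sensitivity / pv_pos - 1,
    }


# Hypothetical sample: every classified Hispanic interviewed (weight 1),
# one in ten classified non-Hispanics interviewed (weight 10).
sample = (
    [Subject(True, True, 1.0)] * 60
    + [Subject(True, False, 1.0)] * 40
    + [Subject(False, True, 10.0)] * 3
    + [Subject(False, False, 10.0)] * 97
)
measures = accuracy_measures(sample)
for name, value in measures.items():
    print(f"{name}: {100 * value:.1f}%")
```

Because relative bias equals sensitivity/PV+ minus 1, a method whose sensitivity and PV+ are nearly equal classifies about as many persons as Hispanic as self-identify as Hispanic, even if individual classifications are often wrong.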
The following five classification methods were compared with ethnic self-identification: 1) report to the registry as Hispanic on the basis of medical record review, 2) surname included on the 1980 Spanish surname list, 3) report to the registry and/or surname included on the Census Bureau list, 4) surname judged to be Spanish by the GUESS program, and 5) classification as Hispanic by any of the other four methods (composite). The latter method was used to assign persons to the Hispanic group for sampling purposes. Cases currently are reported to the SEER registry as Hispanic by using method 3, hereafter referred to as registry-surname.

Statistical analyses were performed by using SAS software (SAS/STAT version 6; SAS Institute, Inc., Cary, North Carolina) (22). Cochran-Mantel-Haenszel statistics were generated to test for an association between response rate and Hispanic classification, controlling for cancer site. The log-odds of response (i.e., participation in the study) as a function of site was modeled separately for the Hispanic and non-Hispanic groups by using logistic regression. The five measures (PV+, PV-, sensitivity, specificity, and relative bias) were computed for subgroups of subjects categorized by cancer site, sex, and age; sensitivity was also computed by national origin. For the composite method of classification, the value of a measure in a given subgroup was compared with the mean of the subgroup measures for the given categorization by using the SAS procedure PROC CATMOD. For instance, the sensitivity for males and the sensitivity for females were each compared with the average of the sensitivities for males and females.

To adjust incidence rates for misclassification, estimates of the proportion of self-identified Hispanics in age-sex-site groups were produced by applying the estimates based on the interviewed sample to the entire group of eligible patients, following the method of Tenenbein (23).
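The idea of applying interview-based estimates to all eligible patients can be sketched in Python as follows. This is a simplified illustration of a double-sampling estimate and not necessarily the exact estimator of reference 23 or the model-based estimates used in the study: it ignores age, sex, and site strata. The function name is hypothetical; the inputs are the eligible counts given in the text (750 and 6,382) and the rounded overall PV+ and PV- of the composite method (55 and 97 percent).

```python
def estimated_hispanic_proportion(
    n_classified_hispanic: int,
    n_classified_non_hispanic: int,
    pv_positive: float,
    pv_negative: float,
) -> float:
    """Estimate the proportion of self-identified Hispanics among all eligible patients.

    pv_positive: share of classified Hispanics who self-identify as Hispanic.
    pv_negative: share of classified non-Hispanics who self-identify as non-Hispanic.
    Both are estimated from the interviewed subsample.
    """
    total = n_classified_hispanic + n_classified_non_hispanic
    expected_hispanic = (
        n_classified_hispanic * pv_positive
        + n_classified_non_hispanic * (1.0 - pv_negative)
    )
    return expected_hispanic / total


# Illustrative inputs only: rounded overall figures, not stratum-specific estimates.
proportion = estimated_hispanic_proportion(750, 6382, pv_positive=0.55, pv_negative=0.97)
print(f"Estimated self-identified Hispanic share: {100 * proportion:.1f}%")
```

With these rounded inputs the estimate is roughly 8.5 percent, and close to one-third of it comes from the large non-Hispanic group, which shows why a small misclassification rate in that group matters.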
To create efficient estimates, logistic regression models of Hispanic identity were developed and tested against saturated models of age, sex, and cancer site. The age distributions of males and females were very different; therefore, the age groups for females were defined as less than 40 years, 40-64 years, and 65 years or older, and the age groups for males were defined as less than 65 years and 65 years or older. Separate models were created for those categorized as Hispanic for sampling purposes and for those sampled as non-Hispanic. Then, for each cancer site, the total proportion of patients self-identifying as Hispanic in each age-sex group was estimated as the sum of the corresponding estimates in the Hispanic and non-Hispanic categories weighted by the proportion of eligible patients in each category. Age-adjusted incidence rates for Hispanics were estimated by applying the estimated proportion of Hispanics in each age-sex-site group to the total number of White cancer cases (including those classified by the registry as Hispanic), dividing by the 1990 Bay Area Hispanic population in the given age-sex group, and standardizing to the 1970 US population.

RESULTS

Telephone interviews were completed for 560 eligible persons classified as Hispanic and for 594 persons classified as White non-Hispanic, a total of 76 percent of the patients in the sample selected for interview. Response rates for Hispanics and non-Hispanics, controlling for site, were not significantly different. Participation was significantly higher for breast cancer patients (85 percent for non-Hispanics and 87 percent for Hispanics) and lower for both non-Hispanics with lung cancer (63 percent) and Hispanics with cervical cancer (64 percent). The characteristics of the interviewed sample are described in table 1.
Because of the great difference between the age distribution of the cervical cancer subjects and that of subjects with cancer at other sites, inferences about misclassification among younger people were based primarily on data from the cervical cancer group, in which self-identified Hispanics were of predominantly Mexican and Central American origin.

TABLE 1. Characteristics of the interviewed sample, Greater Bay Area Cancer Registry, San Francisco Bay Area, California

                             Sampled as Hispanic      Sampled as non-Hispanic
Cancer site      Age group   Male        Female       Male        Female
                 (years)     No. (%)     No. (%)      No. (%)     No. (%)
All five sites   20-39         6 (4)     100 (26)       1 (1)     110 (27)
                 40-64        44 (26)    167 (43)      54 (29)    162 (40)
                 ≥65         121 (71)    122 (31)     130 (70)    137 (33)
                 Total       171 (100)   389 (100)    185 (100)   409 (100)
Colon            20-39         3 (8)       0 (0)        1 (2)       0 (0)
                 40-64         8 (21)     10 (23)      17 (35)      5 (15)
                 ≥65          28 (72)     34 (77)      30 (63)     28 (85)
                 Total        39 (100)    44 (100)     48 (100)    33 (100)
Lung             20-39         3 (6)       1 (3)        0 (0)       0 (0)
                 40-64        22 (42)      9 (26)      25 (50)     14 (47)
                 ≥65          28 (53)     25 (71)      25 (50)     16 (53)
                 Total        53 (100)    35 (100)     50 (100)    30 (100)
Breast           20-39                    11 (6)                    8 (4)
                 40-64                   112 (62)                  96 (52)
                 ≥65                      58 (32)                  81 (44)
                 Total                   181 (100)                185 (100)
Cervix           20-39                    88 (68)                 102 (63)
                 40-64                    36 (28)                  47 (29)
                 ≥65                       5 (4)                   12 (7)
                 Total                   129 (100)                161 (100)
Prostate         20-39         0 (0)                    0 (0)
                 40-64        14 (18)                  12 (14)
                 ≥65          65 (82)                  75 (86)
                 Total        79 (100)                 87 (100)

Accuracy of classification methods

Estimates of the predictive value negative (PV-) and the predictive value positive (PV+), the specificity and sensitivity, and the relative bias for the five methods of classification described above, overall and by cancer site, are shown in table 2. The PV- and the specificity of all methods were very high (95-98 percent); that is, persons classified as non-Hispanic were extremely likely to agree with that identification, and most self-identified non-Hispanics were classified correctly. However, the situation was rather different regarding classification of Hispanics. Although the registry and surname methods each had a fairly high PV+ value (77 and 70 percent, respectively), the sensitivity was only about 60 percent. That is, persons classified as Hispanic by either of the two methods were likely to be Hispanic, but a great many self-identified Hispanics were not classified as such by the registry or by surname. Use of the GUESS surname program lowered the PV+ value to 56 percent without increasing the sensitivity. The composite method had a low PV+ value (55 percent), since all incorrect classifications based on the GUESS program were included, but the sensitivity improved to 70 percent. The registry-surname method fared rather well, with the sensitivity (68 percent) approaching that of the composite method and the PV+ value (66 percent) approaching that of the surname list. The near equality of sensitivity and PV+ gave this method the lowest relative bias. Compared with self-identification, the percentage of Hispanics was underestimated by report to the registry and by the surname list and was overestimated by the composite method.

Comparisons of predictive value for the composite method showed no significant differences in PV- values by cancer site but significantly low PV+ values for breast cancer subjects (47 percent) and significantly high PV+ values for cervical cancer subjects (64 percent). Although specificity values were high for all sites and methods, sensitivity values ranged from a low of 43 percent (registry method, lung cancer) to a high of 88 percent (composite method, cervical cancer).
For every site, the composite method was the most sensitive, and in four of the five sites the registry method was the least sensitive. For the composite method, sensitivity was significantly higher and specificity was lower for subjects with cervical cancer, and specificity was higher for those with colon, lung, or prostate cancer. With respect to relative bias, the percentage of Hispanics was underestimated by the registry method and overestimated by the composite method at every site. Both the GUESS and registry-surname methods seemed to have the least bias overall, less than 20 percent at four of the five sites.

TABLE 2. Accuracy of methods of classifying Hispanic ethnicity, by cancer site, Greater Bay Area Cancer Registry, San Francisco Bay Area, California

Cancer site      Classification method   PV-*   PV+†   Specificity‡   Sensitivity§   Relative bias¶
All five sites   Registry                 96     77        98             59             -23
                 Surname                  96     70        98             61             -13
                 Registry-surname         97     66        97             68               3
                 GUESS#                   96     56        96             61               9
                 Composite                97     55        95             70              26
Colon            Registry                 96     78        99             56             -27
                 Surname                  97     74        98             58             -21
                 Registry-surname         97     70        98             65              -6
                 GUESS                    97     60        96             65               8
                 Composite                98     58        96             70              23
Lung             Registry                 96     76        99             43             -44
                 Surname                  96     63        98             45             -28
                 Registry-surname         96     59        97             49             -17
                 GUESS                    96     47        96             45              -4
                 Composite                96     48        96             52               8
Breast           Registry                 96     66        98             53             -20
                 Surname                  96     65        98             47             -27
                 Registry-surname         97     61        97             60              -1
                 GUESS                    96     46        95             49               7
                 Composite                97     47        95             60              28
Cervix           Registry                 94     83        96             75              -9
                 Surname                  95     73        93             78               8
                 Registry-surname         96     70        92             86              21
                 GUESS                    95     66        91             77              16
                 Composite                97     64        89             88              36
Prostate         Registry                 98     87        99             66             -23
                 Surname                  99     75        98             80               7
                 Registry-surname         99     70        98             80              14
                 GUESS                    98     62        97             75              21
                 Composite                99     61        97             82              34

* PV-, predictive value negative; percentage of persons classified as non-Hispanic who self-identified as non-Hispanic.
† PV+, predictive value positive; percentage of persons classified as Hispanic who self-identified as Hispanic.
‡ Percentage of self-identified non-Hispanics who were classified as non-Hispanic.
§ Percentage of self-identified Hispanics who were classified as Hispanic.
¶ Amount by which the percentage of persons who were classified as Hispanic differed from the percentage of persons who self-identified as Hispanic, as a percentage of the latter; relative bias = (sensitivity/PV+) - 1.
# GUESS, Generally Useful Ethnic Search System.

The accuracy of the classification methods by sex and age is shown in table 3. Overall, the patterns were similar for males and females. Differences in predictive value for the composite method by sex were not statistically significant in the Hispanic or the non-Hispanic group. In addition, the sensitivities did not differ, but the specificity was significantly higher for males. Bias values tended to be more positive for females than for males, with more overestimation by the composite method but less underestimation by the registry and surname methods.

As mentioned previously, the group aged 20-39 years was composed almost entirely of women with cervical cancer. The PV+ value was highest for this age group: comparisons for the composite method indicated that the PV+ value was significantly higher and the specificity was significantly lower for persons less than age 40 years and that the reverse was true for those aged 65 years or older.
Sensitivity and PV- values did not differ significantly by age. In all three age groups, the percentage of Hispanics was underestimated by the registry and surname methods and overestimated by the composite method.

TABLE 3. Accuracy of methods of classifying Hispanic ethnicity, by sex and age, Greater Bay Area Cancer Registry, San Francisco Bay Area, California

Patient subgroup    Classification method   PV-*   PV+†   Specificity‡   Sensitivity§   Relative bias¶
Males               Registry                 96     80        99             49             -38
                    Surname                  97     74        98             58             -22
                    Registry-surname         97     68        98             59             -12
                    GUESS#                   97     60        97             58              -4
                    Composite                97     57        96             62               8
Females             Registry                 96     76        98             64             -15
                    Surname                  96     68        97             62              -8
                    Registry-surname         97     65        96             73              11
                    GUESS                    96     54        95             63              17
                    Composite                97     54        94             75              37
Aged 20-39 years    Registry                 92     88        97             70             -20
                    Surname                  93     80        95             75              -6
                    Registry-surname         95     78        94             81               4
                    GUESS                    92     71        92             70              -1
                    Composite                95     70        90             81              16
Aged 40-64 years    Registry                 97     69        98             60             -13
                    Surname                  97     64        97             59              -6
                    Registry-surname         98     59        96             70              18
                    GUESS                    97     54        96             64              19
                    Composite                98     52        95             73              41
Aged ≥65 years      Registry                 97     77        99             52             -33
                    Surname                  97     69        98             54             -21
                    Registry-surname         97     65        98             59              -9
                    GUESS                    97     51        96             55               7
                    Composite                97     51        96             61              21

* PV-, predictive value negative; percentage of persons classified as non-Hispanic who self-identified as non-Hispanic.
† PV+, predictive value positive; percentage of persons classified as Hispanic who self-identified as Hispanic.
‡ Percentage of self-identified non-Hispanics who were classified as non-Hispanic.
§ Percentage of self-identified Hispanics who were classified as Hispanic.
¶ Amount by which the percentage of persons who were classified as Hispanic differed from the percentage of persons who self-identified as Hispanic, as a percentage of the latter; relative bias = (sensitivity/PV+) - 1.
# GUESS, Generally Useful Ethnic Search System.

Subjects who self-identified as Hispanic were asked to specify their country of origin. The sensitivity of each classification, by place of Hispanic origin, is given in table 4. For each sensitivity, the denominator was the weighted number of self-identified Hispanics with the given place of origin. As usual, the composite method was the most sensitive. It correctly classified all Central Americans as Hispanic, and there were significant differences in sensitivity among the other places of origin: higher for persons of Mexican origin and lower for those who did not specify an origin in Latin America or Spain.

TABLE 4. Sensitivity of methods of classifying Hispanic ethnicity, by national origin,* Greater Bay Area Cancer Registry, San Francisco Bay Area, California

                                % of self-identified   Classification method
National origin                 Hispanics              Registry   Surname   Registry-surname   GUESS†   Composite
Mexico                                  35                79         83            97             88         99
Central American country                16                90         85            93             95        100
Other Latin American country            14                51         50            65             40         65
Spain                                   19                28         37            39             38         42
Other                                   15                19         20            23             21         25

* All values are expressed as percentages.
† GUESS, Generally Useful Ethnic Search System.

Adjustment of incidence rates

Estimates of misclassification with respect to Hispanic ethnicity make it possible to estimate the proportion of self-identified Hispanics in each segment of the population and make appropriate adjustments to cancer incidence rates. Estimates were created for those sampled as Hispanic and for those sampled as non-Hispanic by using logistic regression models of ethnic identity as a function of age, sex, and cancer site. None of the explanatory variables was significant for the non-Hispanic group (likelihood ratio p = 0.75), indicating that the proportion of the non-Hispanic population group who were really Hispanic was estimated appropriately by the sample proportion. In the Hispanic group, the final model (likelihood ratio p = 0.38, model p = 0.002) estimated separate proportions for males aged 65 years or older, males less than age 65 years, females less than age 40 years, females aged 40-64 years, and females aged 65 years or older.

Age-adjusted cancer incidence rates based on the registry alone, the registry-surname classifications, and the adjustment for misclassification are shown in table 5. Results suggest that the true cancer rates for Hispanics are higher than those based on ethnicity as classified by report to the registry alone, primarily because Hispanics are being misclassified as non-Hispanic. Although the proportion of the non-Hispanic sample that was misclassified was small (about 3 percent), the non-Hispanic group comprised almost 90 percent of the White cancer cases, resulting in rather large standard errors for estimated Hispanic rates. When the registry classification was augmented by the 1980 Spanish surname list, the rates obtained were generally closer to the estimates based on self-identification. The exception was cancer of the cervix, which seemed better estimated by report to the registry alone.
[TABLE 5. Age-adjusted cancer incidence rates for Hispanics based on the registry alone, the registry-surname classification, and the adjustment for misclassification, with 95% CI; table body not recoverable.]

DISCUSSION

This study found that persons who were classified as non-Hispanic by both surname and medical record report to the cancer registry were very likely to identify themselves as such, and most self-identified non-Hispanics were classified correctly. However, among persons who were classified as Hispanic by medical record and/or surname, only two-thirds were likely to agree, and almost one-third of self-identified Hispanics were not classified correctly. Classification based on either medical record or surname alone had a lower sensitivity but a higher PV+ value, so that less error occurred when classification was based on the union of the two methods. Bay Area cancer incidence rates generally were underestimated for Hispanics if ethnic classification was based on medical record report alone.

These results can be compared with those of other studies of misclassification of Hispanic ethnicity, keeping in mind that predictive values depend on the prevalence of Hispanics in the population. Hazuda et al. (24) illustrated the importance of surname in determining self-identification as Mexican American by comparing surname with a "gold standard," which was defined as having three or four Mexican or Mexican American grandparents. For surname, they reported a sensitivity of 95.1 percent, specificity of 74.9 percent, PV+ of 80.0 percent, and PV- of 93.5 percent. Although a direct comparison of these results with ours is complicated by the difference in comparison criteria, the results do underscore our conclusion that surname alone is not an adequate predictor of Hispanic ethnicity. Winkleby and Rockhill (25) compared surname with self-reported ethnicity, finding sensitivities of 62-96 percent and PV+ values of 35-100 percent. In a comparison between Spanish surname and self-identified ethnicity carried out in a San Francisco Bay Area health maintenance organization (26), Spanish surname was 88 percent sensitive in classifying Hispanic men and 70 percent sensitive in classifying Hispanic women. Using the surname method, we found lower sensitivities for both sexes but essentially matched the high PV- (98 percent) and specificity (95 percent) values found in this study. We eliminated one of the major sources of misclassification in the Kaiser study (26) by excluding Filipinos. Howard et al. (27) compared the GUESS identification method and the 1980 Spanish surname list with self-identification. Compared with our results, they found higher sensitivities (75-89 percent) and approximately equal specificities (90-95 percent).

All of these studies found that females are more likely than males to be misclassified (24-27). Although we did not find any decrease in sensitivity for females, this finding appears to be due to the differing distributions of national origin for the men and women who identified themselves as Hispanic. Only 38 percent of the men, compared with 59 percent of the women, were of Mexican or Central American origin, for whom sensitivity of the classification methods is very high.
In analyses that simultaneously controlled for a number of sociodemographic factors, we found that among women who had Spanish surnames, self-identification as Hispanic was associated with ability to speak Spanish, having a Spanish maiden name or mother's maiden name, younger age, and having no health insurance. For Spanish-surnamed men, Hispanic self-identification was associated with ability to speak Spanish and frequent use of Spanish. Men who had government health insurance or were recent immigrants (from non-Hispanic countries) were less likely to self-identify as Hispanic (20).

When the results presented here are evaluated, the following points should be considered. First, the measure of accuracy deemed most important depends on the reason for counting the Hispanic population. For example, for efficient selection of a research sample of Hispanics, it may be useful to choose a classification method with a high predictive value, such as the registry method; however, the disadvantage is that the sample may not represent Hispanics who are misclassified. For community outreach and education purposes, a highly sensitive method may be preferred. For incidence rate calculations, a method with low bias must be found, possibly by combining methods that when taken alone underestimate the number of Hispanics. Second, the sample studied here consisted of San Francisco Bay Area residents who were diagnosed with specific types of cancer in 1990, and the results may not be applicable to other places and times. In particular, regional migration patterns and the methods used to report ethnicity to a registry are likely to affect the accuracy of classification methods. Finally, various studies have demonstrated that ethnic identification is not constant over time.
Approximately 5-10 percent of persons who originally report their ethnicity as Hispanic will claim non-Hispanic ethnicity when reinterviewed, and an offsetting proportion of original non-Hispanics will claim Hispanic ethnicity (16, 24, 28). In addition, since a telephone survey was conducted to obtain acceptable response rates, self-reported ethnicity may in some cases differ from that reported to the Census Bureau because of the mode of administration. When these points are considered, the above results suggest the following for the San Francisco Bay Area:
1. Hispanic cancer rates based on report to the registry alone may be biased downward because of misclassification of self-reported Hispanics as non-Hispanic. This downward bias will create an underestimate of cancer incidence in Hispanics, which may explain in part the lower incidence rates for Hispanics found in various studies of registry-based cancer incidence (4-6, 8, 10, 12-14).
2. The 1980 Spanish surname list tends to undercount Hispanics. Broadening this list by using the GUESS program does not seem to be a useful way to identify Hispanics in the San Francisco Bay Area, although this method is superior to the registry alone in terms of bias. The GUESS method was developed in New Mexico and has been shown to be a more sensitive (although less specific) predictor of Hispanic self-identification in that state (27). The Hispanic population in New Mexico was composed mainly of long-term US residents of Mexican ancestry, whereas the population surveyed in northern California was of more diverse Hispanic descent. However, among those of Mexican or Central American origin, the GUESS method is highly sensitive.
3. Augmenting registry data with the Spanish surname list seems to be a feasible way to increase sensitivity and reduce bias in incidence rate calculations.
ACKNOWLEDGMENTS
This research was supported by contract NO1-CN-05224 from the Surveillance, Epidemiology, and End Results (SEER) Program of the National Cancer Institute. The authors thank Dr. Eliseo Perez-Stable, University of California San Francisco, for his help with the project.
REFERENCES
1. Crews DE, Bindon JR. Ethnicity as a taxonomic tool in biomedical and biosocial research. Ethn Dis 1991;1:42-9.
2. Osborne NG, Feit MD. The use of race in medical research. JAMA 1992;267:275-9.
3. National Coalition of Hispanic Health and Human Services Organizations. Delivering preventive health care to Hispanics: a manual for providers. Washington, DC: US Government Printing Office, 1988.
4. Trapido EJ, Chen F, Davis K, et al. Cancer in south Florida Hispanic women. A 9-year assessment. Arch Intern Med 1994;154:1083-8.
5. Trapido EJ, Chen F, Davis K, et al. Cancer among Hispanic males in south Florida. Nine years of incidence data. Arch Intern Med 1994;154:177-85.
6. Wolfgang PE, Semeiks PA, Burnett WS. Cancer incidence in New York City Hispanics, 1982 to 1985. Ethn Dis 1991;1:263-72.
7. Rosenwaike I. Cancer mortality among Mexican immigrants in the United States. Public Health Rep 1988;103:195-201.
8. Polednak AP. Lung cancer rates in the Hispanic population of Connecticut, 1980-1988. Public Health Rep 1993;108:471-6.
9. Bondy ML, Spitz MR, Halabi S, et al. Low incidence of familial breast cancer among Hispanic women. Cancer Causes Control 1992;3:377-82.
10. Gilliland FD, Becker TM, Key CR, et al. Contrasting trends of prostate cancer incidence and mortality in New Mexico's Hispanics, non-Hispanic whites, American Indians, and blacks. Cancer 1994;73:2192-9.
11. Sorlie PD, Backlund E, Johnson NJ, et al. Mortality by Hispanic status in the United States. JAMA 1993;270:2564-8.
12. Trapido EJ, McCoy CB, Stein NS, et al. The epidemiology of cancer among Hispanic women. The experience in Florida. Cancer 1990;66:2435-41.
13. Mallin K, Anderson K. Cancer mortality in Illinois Mexican and Puerto Rican immigrants. Int J Cancer 1988;41:670-6.
14. Polednak AP. Estimating cervical cancer incidence in the Hispanic population of Connecticut by use of surnames. Cancer 1993;71:3560-4.
15. Giachello AL, Gell R, Aday LA, et al. Uses of the 1980 census for Hispanic health services research. Am J Public Health 1983;73:266-74.
16. Passel JS, Word DL. Constructing the list of Spanish surnames for the 1980 census: an application of Bayes' theorem. Presented at the Annual Meeting of the Population Association of America, Denver, CO, April 1980.
17. Perkins RC. Evaluating the Passel-Word Spanish surname list: 1990 post enumeration survey results. Presented at the Joint Statistical Meetings, San Francisco, CA, August 1993.
18. Blustein J. The reliability of racial classifications in hospital discharge abstract data. Am J Public Health 1994;84:1018-21.
19. Stewart SL, Swallen KC, Glaser SL, et al. Adjustment of cancer incidence rates for ethnic misclassification. Biometrics 1998;54:774-81.
20. Swallen KC, West DW, Stewart SL, et al. Predictors of misclassification of Hispanic ethnicity in a population-based cancer registry. Ann Epidemiol 1997;7:200-6.
21. Buechley RW. Generally Useful Ethnic Search Program, GUESS. Presented at the Annual Meeting of the American Names Society, New York, NY, December 1976.
22. SAS Institute, Inc. SAS/STAT user's guide, version 6, 4th ed. Cary, NC: SAS Institute Inc, 1989.
23. Tenenbein A. A double sampling scheme for estimating from binomial data with misclassifications. J Am Stat Assoc 1970;65:1350-61.
24. Hazuda HP, Comeaux PJ, Stern MP, et al. A comparison of three indicators for identifying Mexican Americans in epidemiologic research. Methodological findings from the San Antonio Heart Study. Am J Epidemiol 1986;123:96-112.
25. Winkleby MA, Rockhill B. Comparability of self-reported Hispanic ethnicity and Spanish surname coding. Hispanic J Behav Sci 1992;14:487-95.
26. Perez-Stable EJ, Hiatt RA, Sabogal F, et al. Use of Spanish surnames to identify Latinos: comparison to self-identification. J Natl Cancer Inst Monogr 1995;18:11-15.
27. Howard CA, Samet JM, Buechley RW, et al. Survey research in New Mexico Hispanics: some methodological issues. Am J Epidemiol 1983;117:27-34.
28. Johnson RA. Measurement of Hispanic ethnicity in the US census: an evaluation based on latent-class analysis. J Am Stat Assoc 1990;85:58-65.