American Journal of Epidemiology
Copyright © 1999 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 149, No. 11
Printed in U.S.A.
Comparison of Methods for Classifying Hispanic Ethnicity in a Population-based Cancer Registry
Susan L. Stewart,1 Karen C. Swallen,2 Sally L. Glaser,1 Pamela L. Horn-Ross,1 and Dee W. West1
The accuracy of ethnic classification can substantially affect ethnic-specific cancer statistics. In the Greater
Bay Area Cancer Registry, which is part of the Surveillance, Epidemiology, and End Results (SEER) Program
and of the statewide California Cancer Registry, Hispanic ethnicity is determined by medical record review and
by matching to surname lists. This study compared these classification methods with self-report. Ethnic self-identification was obtained by surveying 1,154 area residents aged 20-89 years who were diagnosed with
cancer in 1990 and were reported to the registry as being Hispanic or White non-Hispanic. Predictive value
positive, sensitivity, and relative bias were used to assess the accuracy of Hispanic classification by medical
record and surname. Among those persons classified as Hispanic by either or both of these sources, only two-thirds agreed (predictive value positive = 66%), and many self-identified Hispanics were classified incorrectly
(sensitivity = 68%). Classification based on either medical record or surname alone had a lower sensitivity (59%
and 61%, respectively) but a higher predictive value positive (77% and 70%, respectively). Ethnic classification
by medical record alone resulted in an underestimate of Hispanic cancer cases and incidence rates. Bias was
reduced when medical records and surnames were used together to classify cancer cases as Hispanic. Am J
Epidemiol 1999;149:1063-71.
bias (epidemiology); classification; ethnic groups; Hispanic Americans; incidence; neoplasms; population
studies; SEER program
There is considerable interest today in assessing
racial and ethnic differences in patterns of disease to
help understand disease causation and control.
Although there is no widely accepted definition of ethnicity or race, both have been associated with various
genetic, socioeconomic, cultural, and nutritional factors (1-2). Clearly, the interpretation of such associations depends in part on the methods used to classify
subjects by race or ethnicity.
In the United States, an ethnic group of particular
interest is the Hispanic population. Hispanics are the
nation's fastest growing minority and will be the
largest by the year 2000 (3). A number of epidemiologic studies have found that Hispanics in various geographic areas, as compared with White non-Hispanics,
have lower incidence rates of cancer at several
anatomic sites, including the oral cavity (4), esophagus
(5), stomach (4, 5), colon (5-7), rectum (4-7), pancreas (5), lung and bronchus (4-8), breast (7, 9), cervix
(4), testes (5), prostate (5, 10), bladder (4-7), and
kidney (5, 6) as well as lower rates of melanoma (5),
mesothelioma (5), chronic lymphocytic leukemia (5),
and non-Hodgkin's lymphoma (5). In addition, lower
cancer mortality has been found for Hispanics (11)
nationally and lower overall cancer incidence among
Hispanics in Florida (12) and Illinois (13). In contrast,
several studies have found Hispanics to be at increased
risk for cancer at several sites, including the cervix (6,
7, 12-14), liver (5, 7, 12), gallbladder (5, 12), stomach
(7), nasal cavity (5), penis (5), thyroid (5), and heart
and soft tissue (12) as well as for acute lymphocytic
leukemia (5) and Kaposi's sarcoma (5).
For many research purposes, Hispanic ethnicity is
assessed by self-identification. The US Bureau of the
Census currently uses this method. However, because
data on self-identification are not always available, the
Census Bureau has used other methods to classify people as Hispanic, including Spanish birth or parentage,
Mexican race, Spanish language, Spanish heritage,
Spanish origin, and Spanish surname (15). The Passel-Word Spanish surname list, created from the 1980
decennial US Census (16), is the one currently used by
the Census Bureau (17).
In health surveillance, such as cancer registration,
assignment of ethnicity is often based on medical record
report. These classifications may involve subjective
appraisals by hospital personnel, and accuracy varies
Received for publication December 29, 1997, and accepted for
publication October 7, 1998.
Abbreviations: GUESS, Generally Useful Ethnic Search System;
PV+, predictive value positive; PV-, predictive value negative;
SEER, Surveillance, Epidemiology, and End Results.
1 Northern California Cancer Center, Union City, CA.
2 Department of Sociology, University of Wisconsin, Madison, WI.
considerably (18). Therefore, when ethnic-specific cancer incidence and survival rates are computed by using
surveillance data, the numerator (number of cancer
cases) and the denominator (population count) are
obtained from sources that typically use different methods of ethnic classification. Discrepancies between
these two classification methods may influence the
accuracy of disease rate calculations, making comparisons between ethnic groups especially difficult.
In particular, if the number of cancer cases reported
to the registry as Hispanic is lower than the number of
cases of self-identified Hispanics, the risk for cancers
in this population will be underestimated. If, on the
other hand, the registry systematically overcounts
Hispanic cases, the risk will be overestimated.
The purpose of this study was to examine the extent of
misclassification of Hispanic ethnicity in patient data
collected by the San Francisco-Oakland population-based cancer registry compared with ethnic self-identification based on telephone interview. Quantification of misclassification would enable us to estimate
the accuracy of different methods available to the registry for classifying persons as Hispanic and to adjust
incidence rates for misclassification. A more accurate
evaluation of cancer incidence would permit better planning and evaluation of cancer control programs in this
rapidly growing segment of the San Francisco Bay Area
and US population.
MATERIALS AND METHODS
The aims of this study were to 1) determine the
extent of misclassification associated with methods
available to the registry for classifying Hispanics and
2) estimate misclassification-adjusted standardized
incidence rates for comparison with unadjusted rates.
The properties of the proposed adjustment method (19)
and statistical models of misclassification as a function
of self-reported socioeconomic, cultural, and demographic factors (20) are described elsewhere.
The study included persons who were identified by
the Greater Bay Area Cancer Registry, which is part of
the Surveillance, Epidemiology, and End Results
(SEER) Program and of the statewide California
Cancer Registry. Eligible subjects were persons aged
20-89 years when diagnosed with incident invasive or
in situ cancer of the colon, lung, female breast, cervix,
or prostate during 1990; residing in one of the five registry counties in the San Francisco Bay Area (Alameda,
Contra Costa, Marin, San Mateo, and San Francisco);
and reported to the registry as being of White race.
All eligible subjects were initially classified as
either Hispanic or non-Hispanic. Persons were placed
in the Hispanic group for study selection if 1) they
were reported to the registry as being of Spanish or
Hispanic origin on the basis of their medical records,
and/or 2) their surnames appeared on the Census
Bureau's 1980 Spanish surname list, and/or 3) their
surnames were determined to be Hispanic as a result of
using the Generally Useful Ethnic Search System
(GUESS) program, developed by the New Mexico
Tumor Registry (21). The non-Hispanic group consisted
of White persons not determined to be Hispanic by any
of these three classification methods. For each cancer
site, all those classified as Hispanic were chosen for
interviews, and an equal number of White non-Hispanics was selected by random number assignment. The data were sampled by cancer site to enable
adjustment of site-specific incidence rates in case of
ethnic misclassification due to site-related factors,
such as socioeconomic status, not measured directly
by the registry. The initial sample consisted of all 780
Hispanics (756 White and 24 non-White) and 781 of
6,452 White non-Hispanics. After subsequent registry
updates, in which eligibility for the study was verified,
the sample included 743 of 750 White Hispanics and
776 of 6,382 White non-Hispanics.
A brief telephone interview was conducted with subjects or their next of kin. A bilingual interviewer administered the entire interview in Spanish or English,
according to the respondent's preference. Questions
included ethnic self-identification (as used by the 1980
and 1990 US Censuses), place of birth, immigration
history, familial ethnic origin and identification, language preference for speaking and reading, and socioeconomic indicators. The questionnaire was translated
into Spanish by using standard methodology, with back
translation and resolution of discrepancies. Overall, 72
percent of the interviews were with the patient, 11 percent with the spouse, 9 percent with a child, 2 percent
with a sibling, and 6 percent with other next of kin.
Next-of-kin interviews were conducted to avoid possible biases due to excluding patients with short survival
times and under the assumption that close relatives
would be aware of the patient's ethnic self-identification. Correctness of classification, as measured by predictive value, did not differ significantly
between self- and next-of-kin respondents.
The measures of accuracy used to assess the different classification methods were as follows: predictive
value positive (PV+), the percentage of persons classified as Hispanic who self-identified as Hispanic; predictive value negative (PV-), the percentage of persons classified as non-Hispanic who self-identified as
non-Hispanic; sensitivity, the percentage of self-identified Hispanics who were classified as Hispanic;
specificity, the percentage of self-identified non-Hispanics who were classified as non-Hispanic; and
relative bias, the amount by which the percentage classified as Hispanic differed from the percentage self-identifying as Hispanic, as a percentage of the latter
((sensitivity/PV+) - 1). To estimate these measures,
values from the interviewed sample were weighted in
proportion to the inverse of their sampling fraction (the
number of eligible subjects divided by the number of
subjects interviewed) by cancer site and classification
as Hispanic or non-Hispanic.
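As an illustration of the definitions above, the following Python sketch computes the five measures from a weighted 2 x 2 table of classification against self-identification. The function name, the cell labels, and the counts in the usage note are hypothetical and are not taken from the study data; each cell is assumed to have been weighted already by the inverse of its sampling fraction.

```python
from dataclasses import dataclass


@dataclass
class Accuracy:
    pv_pos: float
    pv_neg: float
    sensitivity: float
    specificity: float
    relative_bias: float


def accuracy_measures(tp: float, fp: float, fn: float, tn: float) -> Accuracy:
    """Return the five accuracy measures as percentages.

    tp: classified Hispanic, self-identified Hispanic
    fp: classified Hispanic, self-identified non-Hispanic
    fn: classified non-Hispanic, self-identified Hispanic
    tn: classified non-Hispanic, self-identified non-Hispanic
    All four arguments are weighted counts (hypothetical inputs).
    """
    pv_pos = tp / (tp + fp)
    pv_neg = tn / (tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # (number classified - number self-identified) / number self-identified
    # reduces algebraically to sensitivity / PV+ - 1.
    relative_bias = sensitivity / pv_pos - 1
    return Accuracy(*(100 * x for x in (pv_pos, pv_neg, sensitivity,
                                        specificity, relative_bias)))
```

For example, accuracy_measures(60, 20, 40, 880) describes a method that classifies 80 persons as Hispanic when 100 self-identify as Hispanic, so the relative bias is negative (an undercount).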
The following five classification methods were
compared with ethnic self-identification: 1) report to
the registry as Hispanic on the basis of medical record
review, 2) surname included on the 1980 Spanish surname list, 3) report to the registry and/or surname
included on the Census Bureau list, 4) surname judged
to be Spanish by the GUESS program, and 5) classification as Hispanic by any of the other four methods
(composite). The latter method was used to assign persons to the Hispanic group for sampling purposes.
Cases currently are reported to the SEER registry as
Hispanic by using method 3, hereafter referred to as
registry-surname.
Statistical analyses were performed by using SAS
software (SAS/STAT version 6; SAS Institute, Inc.,
Cary, North Carolina) (22). Cochran-Mantel-Haenszel
statistics were generated to test for an association
between response rate and Hispanic classification,
controlling for cancer site. The log-odds of response
(i.e., participation in the study) as a function of site
was modeled separately for the Hispanic and non-Hispanic groups by using logistic regression. The five
measures (PV+, PV-, sensitivity, specificity, and relative bias) were computed for subgroups of subjects
categorized by cancer site, sex, and age; sensitivity
was also computed by national origin. For the composite method of classification, the value of a measure
in a given subgroup was compared with the mean of
the subgroup measures for the given categorization by
using the SAS procedure PROC CATMOD. For
instance, the sensitivity for males and the sensitivity
for females were each compared with the average of
the sensitivities for males and females.
To adjust incidence rates for misclassification, estimates of the proportion of self-identified Hispanics in
age-sex-site groups were produced by applying the
estimates based on the interviewed sample to the entire
group of eligible patients, following the method of
Tenenbein (23). To create efficient estimates, logistic
regression models of Hispanic identity were developed
and tested against saturated models of age, sex, and
cancer site. The age distributions of males and females
were very different; therefore, the age groups for
females were defined as less than 40 years, 40-64
years, and 65 years or older, and the age groups for
males were defined as less than 65 years and 65 years
or older. Separate models were created for those categorized as Hispanic for sampling purposes and for
those sampled as non-Hispanic. Then, for each cancer
site, the total proportion of patients self-identifying as
Hispanic in each age-sex group was estimated as the
sum of the corresponding estimates in the Hispanic and
non-Hispanic categories weighted by the proportion of
eligible patients in each category. Age-adjusted incidence rates for Hispanics were estimated by applying
the estimated proportion of Hispanics in each age-sex-site group to the total number of White cancer cases
(including those classified by the registry as Hispanic),
dividing by the 1990 Bay Area Hispanic population in
the given age-sex group, and standardizing to the 1970
US population.
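The adjustment steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' SAS implementation: the function names, the dictionary keys, the two strata, and all numbers are hypothetical, the strata stand in for the age-sex groups, and the standard weights stand in for the 1970 US standard population.

```python
from typing import Iterable, Mapping


def estimated_hispanic_proportion(p_hisp_group: float, p_non_group: float,
                                  n_hisp_group: float, n_non_group: float) -> float:
    """Proportion self-identifying as Hispanic among all eligible cases.

    Weighted sum of the estimated proportions in the two sampling
    categories, weighted by each category's share of eligible cases.
    """
    total = n_hisp_group + n_non_group
    return (p_hisp_group * n_hisp_group + p_non_group * n_non_group) / total


def age_adjusted_rate(strata: Iterable[Mapping[str, float]],
                      per: float = 100_000) -> float:
    """Directly standardized rate of self-identified Hispanic cases."""
    strata = list(strata)
    weight_total = sum(s["std_weight"] for s in strata)
    rate = 0.0
    for s in strata:
        total_cases = s["n_hisp_group"] + s["n_non_group"]
        proportion = estimated_hispanic_proportion(
            s["p_hisp_group"], s["p_non_group"],
            s["n_hisp_group"], s["n_non_group"])
        hispanic_cases = proportion * total_cases      # adjusted numerator
        stratum_rate = hispanic_cases / s["population"]
        rate += stratum_rate * s["std_weight"] / weight_total
    return per * rate


# Hypothetical strata (made-up numbers, for illustration only).
example_strata = [
    {"p_hisp_group": 0.50, "p_non_group": 0.02, "n_hisp_group": 20,
     "n_non_group": 180, "population": 10_000, "std_weight": 0.6},
    {"p_hisp_group": 0.60, "p_non_group": 0.03, "n_hisp_group": 50,
     "n_non_group": 450, "population": 5_000, "std_weight": 0.4},
]
print(round(age_adjusted_rate(example_strata), 1))
```

Even a small proportion of self-identified Hispanics in the large non-Hispanic category contributes noticeably to the adjusted numerator, because that category contains most of the eligible cases.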
RESULTS
Telephone interviews were completed for 560 eligible persons classified as Hispanic and for 594 persons
classified as White non-Hispanic, a total of 76 percent
of the patients in the sample selected for interview.
Response rates for Hispanics and non-Hispanics, controlling for site, were not significantly different.
Participation was significantly higher for breast cancer
patients (85 percent for non-Hispanics and 87 percent
for Hispanics) and lower for both non-Hispanics with
lung cancer (63 percent) and Hispanics with cervical
cancer (64 percent).
The characteristics of the interviewed sample are
described in table 1. Because of the great difference
between the age distribution of the cervical cancer subjects and that of subjects with cancer at other sites,
inferences about misclassification among younger people were based primarily on data from the cervical cancer group, in which self-identified Hispanics were of
predominately Mexican and Central American origin.
Accuracy of classification methods
Estimates of the predictive value negative (PV-) and
the predictive value positive (PV+), the specificity and
sensitivity, and the relative bias for the five methods of
classification described above, overall and by cancer
site, are shown in table 2. The PV- and the specificity
of all methods were very high (95-98 percent); that is,
persons classified as non-Hispanic were extremely
likely to agree with that identification, and most self-identified non-Hispanics were classified correctly.
However, the situation was rather different regarding
classification of Hispanics. Although the registry and
surname methods each had a fairly high PV+ value (77
and 70 percent, respectively), the sensitivity was only
about 60 percent. That is, persons classified as Hispanic
by either of the two methods were likely to be Hispanic,
TABLE 1. Characteristics of the interviewed sample, Greater Bay Area Cancer Registry, San Francisco Bay Area, California

                               Sampled as Hispanic    Sampled as non-Hispanic
                 Age group       Male       Female        Male       Female
Cancer site      (years)        No.     %   No.     %   No.     %   No.     %
All five sites   20-39            6     4   100    26     1     1   110    27
                 40-64           44    26   167    43    54    29   162    40
                 ≥65            121    71   122    31   130    70   137    33
                 Total          171   100   389   100   185   100   409   100
Colon            20-39            3     8     0     0     1     2     0     0
                 40-64            8    21    10    23    17    35     5    15
                 ≥65             28    72    34    77    30    63    28    85
                 Total           39   100    44   100    48   100    33   100
Lung             20-39            3     6     1     3     0     0     0     0
                 40-64           22    42     9    26    25    50    14    47
                 ≥65             28    53    25    71    25    50    16    53
                 Total           53   100    35   100    50   100    30   100
Breast           20-39                       11     6                 8     4
                 40-64                      112    62                96    52
                 ≥65                         58    32                81    44
                 Total                      181   100               185   100
Cervix           20-39                       88    68               102    63
                 40-64                       36    28                47    29
                 ≥65                          5     4                12     7
                 Total                      129   100               161   100
Prostate         20-39            0     0                 0     0
                 40-64           14    18                12    14
                 ≥65             65    82                75    86
                 Total           79   100                87   100
but a great many self-identified Hispanics were not
classified as such by the registry or by surname. The
inaccuracy of the GUESS surname program lowered the
predictive value to 56 percent without increasing the
sensitivity. The composite method had a low PV+ value
(55 percent), since all incorrect classifications based on
the GUESS program were included, but the sensitivity
improved to 70 percent. The registry-surname method
fared rather well, with the sensitivity (68 percent)
approaching that of the composite method and the PV+
value (66 percent) approaching that of the surname list.
The near equality of sensitivity and PV+ gave this
method the lowest relative bias. Compared with self-identification, the percentage of Hispanics was underestimated by report to the registry and by the surname
list and was overestimated by the composite method.
Comparisons of predictive value for the composite
method showed no significant differences in PV- values by cancer site but significantly low PV+ values for
breast cancer subjects (47 percent) and significantly
high PV+ values for cervical cancer subjects (64 percent). Although specificity values were high for all
sites and methods, sensitivity values ranged from a
low of 43 percent (registry method, lung cancer) to a
high of 88 percent (composite method, cervical cancer). For every site, the composite method was the
most sensitive, and in four of the five sites the registry
method was the least sensitive. For the composite
method, sensitivity was significantly higher and specificity was lower for subjects with cervical cancer, and
specificity was higher for those with colon, lung, or
prostate cancer. With respect to relative bias, the percentage of Hispanics was underestimated by the registry method and overestimated by the composite
method at every site. Both the GUESS and registry-surname methods seemed to have the least bias overall, less than 20 percent in absolute value at four of the five sites.
The accuracy of the classification methods by sex
and age is shown in table 3. Overall, the patterns were
similar for males and females. Differences in predictive value for the composite method by sex were not
statistically significant in the Hispanic or the non-Hispanic group. In addition, the sensitivities did not
differ, but the specificity was significantly higher for
males. Bias values tended to be more positive for
females than for males, with more overestimation by
the composite method but less underestimation by the
registry and surname methods.
As mentioned previously, the group aged 20-39
years was composed almost entirely of women with
TABLE 2. Accuracy of methods of classifying Hispanic ethnicity, by cancer site, Greater Bay Area Cancer Registry, San Francisco Bay Area, California

Cancer site      Classification method    PV-*   PV+†  Specificity‡  Sensitivity§  Relative bias¶
All five sites   Registry                   96     77            98            59             -23
                 Surname                    96     70            98            61             -13
                 Registry-surname           97     66            97            68               3
                 GUESS#                     96     56            96            61               9
                 Composite                  97     55            95            70              26
Colon            Registry                   96     78            99            56             -27
                 Surname                    97     74            98            58             -21
                 Registry-surname           97     70            98            65              -6
                 GUESS                      97     60            96            65               8
                 Composite                  98     58            96            71              23
Lung             Registry                   96     76            99            43             -44
                 Surname                    96     63            98            45             -28
                 Registry-surname           96     59            97            49             -17
                 GUESS                      96     47            96            45              -4
                 Composite                  96     48            96            52               8
Breast           Registry                   96     66            98            53             -20
                 Surname                    96     65            98            47             -27
                 Registry-surname           97     61            97            60              -1
                 GUESS                      96     46            95            49               7
                 Composite                  97     47            95            60              28
Cervix           Registry                   94     83            96            75              -9
                 Surname                    95     73            93            78               8
                 Registry-surname           96     70            92            86              21
                 GUESS                      95     66            91            77              16
                 Composite                  97     64            89            88              36
Prostate         Registry                   98     87            99            66             -23
                 Surname                    99     75            98            80               7
                 Registry-surname           99     70            98            80              14
                 GUESS                      98     62            97            75              21
                 Composite                  99     61            97            82              34

* PV-, predictive value negative; percentage of persons classified as non-Hispanic who self-identified as non-Hispanic.
† PV+, predictive value positive; percentage of persons classified as Hispanic who self-identified as Hispanic.
‡ Percentage of self-identified non-Hispanics who were classified as non-Hispanic.
§ Percentage of self-identified Hispanics who were classified as Hispanic.
¶ Amount by which the percentage of persons who were classified as Hispanic differed from the percentage of persons who self-identified as Hispanic, as a percentage of the latter; relative bias = (sensitivity/PV+) - 1.
# GUESS, Generally Useful Ethnic Search System.
cervical cancer. The PV+ value was highest for this
age group: comparisons for the composite method
indicated that the PV+ value was significantly higher
and the specificity was significantly lower for persons
less than age 40 years and that the reverse was true for
those aged 65 years or older. Sensitivity and PV- values did not differ significantly by age. In all three age
groups, the percentage of Hispanics was underestimated
by the registry and surname methods and overestimated
by the composite method.
Subjects who self-identified as Hispanic were asked
to specify their country of origin. The sensitivity of
each classification, by place of Hispanic origin, is
given in table 4. For each sensitivity, the denominator
was the weighted number of self-identified Hispanics
with the given place of origin. As usual, the composite
method was the most sensitive. It correctly classified
all Central Americans as Hispanic, and there were significant differences in sensitivity among the other
places of origin—higher for persons of Mexican origin
and lower for those who did not specify an origin in
Latin America or Spain.
Adjustment of incidence rates
Estimates of misclassification with respect to
Hispanic ethnicity make it possible to estimate the proportion of self-identified Hispanics in each segment of
the population and make appropriate adjustments to
cancer incidence rates. Estimates were created for
TABLE 3. Accuracy of methods of classifying Hispanic ethnicity, by sex and age, Greater Bay Area Cancer Registry, San Francisco Bay Area, California

Patient subgroup   Classification method    PV-*   PV+†  Specificity‡  Sensitivity§  Relative bias¶
Males              Registry                   96     80            99            49             -38
                   Surname                    97     74            98            58             -22
                   Registry-surname           97     68            98            59             -12
                   GUESS#                     97     60            97            58              -4
                   Composite                  97     57            96            62               8
Females            Registry                   96     76            98            64             -15
                   Surname                    96     68            97            62              -8
                   Registry-surname           97     65            96            73              11
                   GUESS                      96     54            95            63              17
                   Composite                  97     54            94            75              37
Aged 20-39 years   Registry                   92     88            97            70             -20
                   Surname                    93     80            95            75              -6
                   Registry-surname           95     78            94            81               4
                   GUESS                      92     71            92            70              -1
                   Composite                  95     70            90            81              16
Aged 40-64 years   Registry                   97     69            98            60             -13
                   Surname                    97     64            97            59              -6
                   Registry-surname           98     59            96            70              18
                   GUESS                      97     54            96            64              19
                   Composite                  98     52            95            73              41
Aged ≥65 years     Registry                   97     77            99            52             -33
                   Surname                    97     69            98            54             -21
                   Registry-surname           97     65            98            59              -9
                   GUESS                      97     51            96            55               7
                   Composite                  97     51            96            61              21

* PV-, predictive value negative; percentage of persons classified as non-Hispanic who self-identified as non-Hispanic.
† PV+, predictive value positive; percentage of persons classified as Hispanic who self-identified as Hispanic.
‡ Percentage of self-identified non-Hispanics who were classified as non-Hispanic.
§ Percentage of self-identified Hispanics who were classified as Hispanic.
¶ Amount by which the percentage of persons who were classified as Hispanic differed from the percentage of persons who self-identified as Hispanic, as a percentage of the latter; relative bias = (sensitivity/PV+) - 1.
# GUESS, Generally Useful Ethnic Search System.
those sampled as Hispanic and for those sampled as
non-Hispanic by using logistic regression models of
ethnic identity as a function of age, sex, and cancer
site.
None of the explanatory variables was significant
for the non-Hispanic group (likelihood ratio p = 0.75),
indicating that the proportion of the non-Hispanic population group who were really Hispanic was estimated
TABLE 4. Sensitivity of methods of classifying Hispanic ethnicity, by national origin,* Greater Bay Area Cancer Registry, San Francisco Bay Area, California

                                                   National origin
                                          Central    Other Latin
Classification                            American   America
method                          Mexico    country    country       Spain     Other
Registry                        79        90         51            28        19
Surname                         83        85         50            37        20
Registry-surname                97        95         65            39        23
GUESS†                          88        93         40            38        21
Composite                       99        100        65            42        25
% of self-identified Hispanics  35        16         14            19        15

* All values are expressed as percentages.
† GUESS, Generally Useful Ethnic Search System.
appropriately by the sample proportion. In the
Hispanic group, the final model (likelihood ratio p =
0.38, model p = 0.002) estimated separate proportions
for males aged 65 years or older, males less than age
65 years, females less than age 40 years, females aged
40-64 years, and females aged 65 years or older.
Age-adjusted cancer incidence rates based on the registry alone, the registry-surname classifications, and the
adjustment for misclassification are shown in table 5.
Results suggest that the true cancer rates for Hispanics
are higher than those based on ethnicity as classified by
report to the registry alone, primarily because Hispanics
are being misclassified as non-Hispanic. Although the
proportion of the non-Hispanic sample that was misclassified was small (about 3 percent), the non-Hispanic
group comprised almost 90 percent of the White cancer
cases, resulting in rather large standard errors for estimated Hispanic rates. When the registry classification
was augmented by the 1980 Spanish surname list, the
rates obtained were generally closer to the estimates
based on self-identification. The exception was cancer
of the cervix, which seemed better estimated by report
to the registry alone.
[TABLE 5. Age-adjusted cancer incidence rates with 95% CI; the body of this rotated table is not recoverable.]
DISCUSSION
This study found that persons who were classified as
non-Hispanic by both surname and medical record
report to the cancer registry were very likely to identify
themselves as such, and most self-identified non-Hispanics were classified correctly. However, among
persons who were classified as Hispanic by medical
record and/or surname, only two-thirds were likely to
agree, and almost one-third of self-identified
Hispanics were not classified correctly. Classification
based on either medical record or surname alone had a
lower sensitivity but a higher PV+ value, so that less
error occurred when classification was based on the
union of the two methods. Bay Area cancer incidence
rates generally were underestimated for Hispanics if
ethnic classification was based on medical record
report alone.
These results can be compared with those of other
studies of misclassification of Hispanic ethnicity,
keeping in mind that predictive values depend on the
prevalence of Hispanics in the population. Hazuda et
al. (24) illustrated the importance of surname in determining self-identification as Mexican American by
comparing surname with a "gold standard," which was
defined as having three or four Mexican or Mexican
American grandparents. For surname, they reported a
sensitivity of 95.1 percent, specificity of 74.9 percent,
PV+ of 80.0 percent, and PV- of 93.5 percent.
Although a direct comparison of these results with
ours is complicated by the difference in comparison
criteria, the results do underscore our conclusion that
surname alone is not an adequate predictor of Hispanic
ethnicity.
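The dependence of predictive values on prevalence, noted in the comparison above, follows from Bayes' rule and can be illustrated with a short Python sketch. The function name is hypothetical, and the sensitivity, specificity, and prevalence values in the loop are illustrative assumptions rather than estimates from any of the studies cited.

```python
def predictive_value_positive(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """PV+ from sensitivity, specificity, and prevalence (all as fractions).

    Bayes' rule: P(self-identified Hispanic | classified Hispanic).
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Illustrative (assumed) accuracy values; only the prevalence changes.
    for prevalence in (0.05, 0.10, 0.50):
        pv = predictive_value_positive(0.68, 0.97, prevalence)
        print(f"prevalence {prevalence:.2f}: PV+ = {pv:.2f}")
```

With sensitivity and specificity held fixed, PV+ rises as prevalence rises, which is why PV+ values from populations with different proportions of Hispanics are not directly comparable.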
Winkleby and Rockhill (25) compared surname with
self-reported ethnicity, finding sensitivities of 62-96
percent and PV+ values of 35-100 percent. In a comparison between Spanish surname and self-identified
ethnicity carried out in a San Francisco Bay Area
health maintenance organization (26), Spanish surname was 88 percent sensitive in classifying Hispanic
men and 70 percent sensitive in classifying Hispanic
women. Using the surname method, we found lower
sensitivities for both sexes but essentially matched the
high PV- (98 percent) and specificity (95 percent) values found in this study. We eliminated one of the major
sources of misclassification in the Kaiser study (26) by
excluding Filipinos. Howard et al. (27) compared the
GUESS identification method and the 1980 Spanish
surname list with self-identification. Compared with
our results, they found higher sensitivities (75-89 percent) and approximately equal specificities (90-95
percent).
All of these studies found that females are more
likely than males to be misclassified (24-27). Although
we did not find any decrease in sensitivity for females,
this finding appears to be due to the differing distributions of national origin for the men and women who
identified themselves as Hispanic. Only 38 percent of
the men, compared with 59 percent of the women, were
of Mexican or Central American origin, for whom sensitivity of the classification methods is very high. In
analyses that simultaneously controlled for a number of
sociodemographic factors, we found that among
women who had Spanish surnames, self-identification
as Hispanic was associated with ability to speak
Spanish, having a Spanish maiden name or mother's
maiden name, younger age, and having no health insurance. For Spanish-surnamed men, Hispanic self-identification was associated with ability to speak
Spanish and frequent use of Spanish. Men who had
government health insurance or were recent immigrants (from non-Hispanic countries) were less likely to
self-identify as Hispanic (20).
When the results presented here are evaluated, the
following points should be considered. First, the measure of accuracy deemed most important depends on
the reason for counting the Hispanic population. For
example, for efficient selection of a research sample
of Hispanics, it may be useful to choose a classification method with a high predictive value, such as the
registry method; however, the disadvantage is that the
sample may not represent Hispanics who are misclassified. For community outreach and education purposes, a highly sensitive method may be preferred.
For incidence rate calculations, a method with low
bias must be found, possibly by combining methods
that when taken alone underestimate the number of
Hispanics.
Second, the sample studied here consisted of San
Francisco Bay Area residents who were diagnosed
with specific types of cancer in 1990, and the results
may not be applicable to other places and times. In particular, regional migration patterns and the methods
used to report ethnicity to a registry are likely to affect
the accuracy of classification methods.
Finally, various studies have demonstrated that ethnic identification is not constant over time.
Approximately 5-10 percent of persons who originally
report their ethnicity as Hispanic will claim non-Hispanic ethnicity when reinterviewed, and an offsetting proportion of original non-Hispanics will claim
Hispanic ethnicity (16, 24, 28). In addition, since a
telephone survey was conducted to obtain acceptable
response rates, self-reported ethnicity may in some
cases differ from that reported to the Census Bureau
because of the mode of administration.
When these points are considered, the above results
suggest the following for the San Francisco Bay Area:
1. Hispanic cancer rates based on report to the registry alone may be biased downward because of
misclassification of self-reported Hispanics as
non-Hispanic. This downward bias will create an
underestimate of cancer incidence in Hispanics,
which may explain in part the lower incidence
rates for Hispanics found in various studies of
registry-based cancer incidence (4-6, 8, 10,
12-14).
2. The 1980 Spanish surname list tends to undercount Hispanics. Broadening this list by using
the GUESS program does not seem to be a useful way to identify Hispanics in the San
Francisco Bay Area, although this method is
superior to the registry alone in terms of bias.
The GUESS method was developed in New
Mexico and has been shown to be a more sensitive (although less specific) predictor of
Hispanic self-identification in that state (27). The
Hispanic population in New Mexico was composed mainly of long-term US residents of
Mexican ancestry, whereas the population surveyed in northern California was of more diverse
Hispanic descent. However, among those of
Mexican or Central American origin, the GUESS
method is highly sensitive.
3. Augmenting registry data with the Spanish surname list seems to be a feasible way to increase
sensitivity and reduce bias in incidence rate
calculations.
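The effect of augmenting one source with another can be sketched as follows. This is a minimal illustration that assumes a case is counted as Hispanic when either the registry report or the surname list flags it; the records, the field names, and the `evaluate` helper are hypothetical and do not reproduce the registry's procedure or this study's data.

```python
from typing import Callable, Dict, Iterable, NamedTuple


class Case(NamedTuple):
    registry: bool     # reported to the registry as Hispanic
    surname: bool      # surname appears on the Spanish surname list
    self_report: bool  # self-identifies as Hispanic


def evaluate(cases: Iterable[Case], rule: Callable[[Case], bool]) -> Dict[str, float]:
    """Sensitivity, predictive value positive, and relative bias of a rule."""
    cases = list(cases)
    classified = sum(rule(c) for c in cases)
    true_pos = sum(rule(c) and c.self_report for c in cases)
    self_identified = sum(c.self_report for c in cases)
    return {
        "sensitivity": true_pos / self_identified,
        "pvp": true_pos / classified,
        # Assumed definition of relative bias: excess of classified over
        # self-identified cases, as a fraction of self-identified cases.
        "relative_bias": (classified - self_identified) / self_identified,
    }


# Hypothetical records, for illustration only.
cases = (
    [Case(True, True, True)] * 5
    + [Case(True, False, True)] * 2
    + [Case(False, True, True)] * 2
    + [Case(False, False, True)] * 1
    + [Case(True, False, False)] * 1
    + [Case(False, True, False)] * 1
    + [Case(False, False, False)] * 7
)

registry_only = evaluate(cases, lambda c: c.registry)
either_source = evaluate(cases, lambda c: c.registry or c.surname)
print("registry alone:", registry_only)
print("registry or surname:", either_source)
```

In this made-up example the combined rule gains sensitivity at some cost in predictive value positive, and the undercount from the registry alone becomes a smaller overcount. Whether the same trade-off holds in practice depends on the population and on how ethnicity is reported to the registry.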
ACKNOWLEDGMENTS
This research was supported by contract NO1-CN-05224
from the Surveillance, Epidemiology, and End Results (SEER)
Program of the National Cancer Institute.
The authors thank Dr. Eliseo Perez-Stable, University of
California San Francisco, for his help with the project.
REFERENCES
1. Crews DE, Bindon JR. Ethnicity as a taxonomic tool in biomedical and biosocial research. Ethn Dis 1991;1:42-9.
2. Osborne NG, Feit MD. The use of race in medical research.
JAMA 1992;267:275-9.
3. National Coalition of Hispanic Health and Human Services
Organizations. Delivering preventive health care to Hispanics:
a manual for providers. Washington, DC: US Government
Printing Office, 1988.
4. Trapido EJ, Chen F, Davis K, et al. Cancer in south Florida
Hispanic women. A 9-year assessment. Arch Intern Med 1994;
154:1083-8.
5. Trapido EJ, Chen F, Davis K, et al. Cancer among Hispanic
males in south Florida. Nine years of incidence data. Arch
Intern Med 1994;154:177-85.
6. Wolfgang PE, Semeiks PA, Burnett WS. Cancer incidence in
New York City Hispanics, 1982 to 1985. Ethn Dis 1991;1:
263-72.
7. Rosenwaike I. Cancer mortality among Mexican immigrants
in the United States. Public Health Rep 1988;103:195-201.
8. Polednak AP. Lung cancer rates in the Hispanic population of
Connecticut, 1980-1988. Public Health Rep 1993;108:
471-6.
9. Bondy ML, Spitz MR, Halabi S, et al. Low incidence of familial breast cancer among Hispanic women. Cancer Causes
Control 1992;3:377-82.
10. Gilliland FD, Becker TM, Key CR, et al. Contrasting trends of
prostate cancer incidence and mortality in New Mexico's
Hispanics, non-Hispanic whites, American Indians, and
blacks. Cancer 1994;73:2192-9.
11. Sorlie PD, Backlund E, Johnson NJ, et al. Mortality by
Hispanic status in the United States. JAMA 1993;270:2564-8.
12. Trapido EJ, McCoy CB, Stein NS, et al. The epidemiology of
cancer among Hispanic women. The experience in Florida.
Cancer 1990;66:2435-41.
13. Mallin K, Anderson K. Cancer mortality in Illinois Mexican
and Puerto Rican immigrants. Int J Cancer 1988;41:670-6.
14. Polednak AP. Estimating cervical cancer incidence in the
Hispanic population of Connecticut by use of surnames.
Cancer 1993;71:3560-4.
15. Giachello AL, Gell R, Aday LA, et al. Uses of the 1980 census
for Hispanic health services research. Am J Public Health
1983;73:266-74.
16. Passel JS, Word DL. Constructing the list of Spanish surnames
for the 1980 census: an application of Bayes' theorem.
Presented at the Annual Meeting of the Population Association
of America, Denver, CO, April 1980.
17. Perkins RC. Evaluating the Passel-Word Spanish surname list:
1990 post enumeration survey results. Presented at the Joint
Statistical Meetings, San Francisco, CA, August 1993.
18. Blustein J. The reliability of racial classifications in hospital
discharge abstract data. Am J Public Health 1994;84:1018-21.
19. Stewart SL, Swallen KC, Glaser SL, et al. Adjustment of cancer incidence rates for ethnic misclassification. Biometrics
1998;54:774-81.
20. Swallen KC, West DW, Stewart SL, et al. Predictors of misclassification of Hispanic ethnicity in a population-based cancer registry. Ann Epidemiol 1997;7:200-6.
21. Buechley RW. Generally Useful Ethnic Search Program,
GUESS. Presented at the Annual Meeting of the American
Names Society, New York, NY, December 1976.
22. SAS Institute, Inc. SAS/STAT user's guide, version 6, 4th ed.
Cary, NC: SAS Institute Inc, 1989.
23. Tenenbein A. A double sampling scheme for estimating from
binomial data with misclassifications. J Am Stat Assoc 1970;
65:1350-61.
24. Hazuda HP, Comeaux PJ, Stern MP, et al. A comparison of
three indicators for identifying Mexican Americans in epidemiologic research. Methodological findings from the San
Antonio Heart Study. Am J Epidemiol 1986;123:96-112.
25. Winkleby MA, Rockhill B. Comparability of self-reported
Hispanic ethnicity and Spanish surname coding. Hispanic J
Behav Sci 1992;14:487-95.
26. Perez-Stable EJ, Hiatt RA, Sabogal F, et al. Use of
Spanish surnames to identify Latinos: comparison to self-identification. J Natl Cancer Inst Monogr 1995;18:11-15.
27. Howard CA, Samet JM, Buechley RW, et al. Survey research
in New Mexico Hispanics: some methodological issues. Am J
Epidemiol 1983;117:27-34.
28. Johnson RA. Measurement of Hispanic ethnicity in the US
census: an evaluation based on latent-class analysis. J Am Stat
Assoc 1990;85:58-65.