American Journal of Epidemiology
Copyright © 1997 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 146, No. 6
Printed in U.S.A
On Measures of Agreement Calculated from Contingency Tables with
Categories Defined by the Empirical Quantiles of the Marginal Distributions
Craig B. Borkowf and Mitchell H. Gail
Epidemiologists sometimes collect bivariate continuous data on a number of subjects, compute the
empirical (sample) quantiles of the marginal data, and then use these values to partition the original data into
two-way contingency tables. Tables created in this manner have row and column categories defined by the
random empirical marginal quantiles rather than by preset cutpoints, so these tables have fixed marginal
totals. Hence, instead of the conventional multinomial distribution, these tables have the empirical bivariate
quantile-partitioned (EBQP) distribution. In this paper, the authors demonstrate how to use empirical methods
appropriate for EBQP tables to make inferences and construct confidence intervals for three commonly used
measures of agreement: kappa, weighted kappa, and another class of measures derived from conditional
proportions in the extreme rows of the table. They also show that if one incorrectly applies conventional
methods appropriate for multinomial tables to statistics calculated from EBQP tables, one can obtain
substantially misleading results. In addition, the authors present alternative parametric methods for estimating
these measures of agreement and illustrate corresponding methods of inference and confidence interval
construction. Finally, they show that these empirical (EBQP) methods can have low efficiency compared with
parametric methods for some of these measures of agreement. Am J Epidemiol 1997; 146:520-6.
epidemiologic methods; nutrition surveys; questionnaires; statistics
Epidemiologists sometimes collect bivariate continuous data on a number of subjects and then partition
these data into two-way contingency tables with row
and column categories defined by the empirical quantiles of the marginal data. For example, Pietinen et al.
(1,2) conducted an extensive study of Finnish men
aged 55-69 years to test the reproducibility and validity of several methods of measuring the intake of
food items and nutrients. Among the nutrients considered was vitamin E, represented by the total amount of
active tocopherols. In the validation part of the study,
the vitamin E intake of 157 men was measured by two
methods. First, the subjects kept prospective "food
records" to record the foods they consumed on 12
two-day periods during a 6-month interval. Second,
the subjects completed retrospective "food use questionnaires" to estimate how much of certain foods they
had consumed during the previous year.
The food record and food use questionnaire measurements have means (standard deviations) of 10.1
(3.8) and 11.6 (6.0) mg, respectively. Both marginal
measurements are skewed toward the right, and the
food use questionnaire measurements tend to be larger
than the corresponding food record measurements for
each individual.
We computed the empirical marginal quintiles and
used these values to partition the original data into a
5 × 5 table (table 1). (We first added tiny random
errors to each component of the original bivariate data
to break any ties.) Because 157/5 = 31.4, the marginal
totals of the rows and columns are fixed at 31 or 32
observations, even though the interior counts are still
random. For example, four individuals fell in both the
first quintile of the food record measurements and the
third quintile of the food use questionnaire measurements.
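The partitioning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names (`quantile_categories`, `ebqp_table`), the rank-to-category convention, and the synthetic data are all assumptions made for the example, and ties are broken by a random secondary sort key rather than by adding tiny random errors.

```python
import random


def quantile_categories(values, d, rng):
    """Return a category 0..d-1 for each value, based on its rank."""
    t = len(values)
    # Sort indices by value; a random secondary key breaks ties at random.
    order = sorted(range(t), key=lambda k: (values[k], rng.random()))
    categories = [0] * t
    for rank, k in enumerate(order):
        # Assumed convention: 0-based rank r falls in category floor(r * d / t).
        categories[k] = rank * d // t
    return categories


def ebqp_table(x, y, d=5, seed=0):
    """Cross-classify paired data by the empirical quantiles of each margin."""
    rng = random.Random(seed)
    rows = quantile_categories(x, d, rng)
    cols = quantile_categories(y, d, rng)
    table = [[0] * d for _ in range(d)]
    for r, c in zip(rows, cols):
        table[r][c] += 1
    return table


# Synthetic stand-in for the vitamin E data (the original data are not reproduced here).
demo_rng = random.Random(1)
x_demo = [demo_rng.gauss(0.0, 1.0) for _ in range(157)]
y_demo = [0.7 * a + demo_rng.gauss(0.0, 1.0) for a in x_demo]
demo = ebqp_table(x_demo, y_demo)
print([sum(row) for row in demo])        # row totals
print([sum(col) for col in zip(*demo)])  # column totals
```

Whatever the data, the row and column totals are determined by the sample size and the number of categories alone, which is the sense in which the margins of such a table are fixed. Which categories receive 32 rather than 31 observations depends on the quantile convention, so the ordering of the totals may differ from that in table 1.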
The empirical quintiles of the food record measurements are 6.8, 8.6, 9.9, and 12.9 mg, while the empirical quintiles of the food use questionnaire measurements are 6.5, 9.1, 11.2, and 15.9 mg. Several ties
occurred at these empirical quantiles, but the manner
in which the ties were broken had only a small effect
on the resulting table. Note that for imprecise measurement instruments, such as food records and food use questionnaires, the empirical quantiles themselves are usually of less interest than the empirical bivariate quantile-partitioned (EBQP) tables that they jointly define. For better calibrated and more reliable measurement instruments, however, the empirical quantiles may be of considerable interest and should be reported.

Received for publication September 3, 1996, and in final form June 26, 1997.
Abbreviations: BVN, bivariate normal (data or distribution); EBQP, empirical bivariate quantile-partitioned (data or distribution); MULT, multinomial (data or distribution); PBQP, parametric bivariate quantile-partitioned (data or distribution).
From the National Cancer Institute, Division of Cancer Epidemiology and Genetics, Biostatistics Branch, Bethesda, MD.
Reprint requests to Dr. Craig B. Borkowf, National Heart, Lung, and Blood Institute, Division of Epidemiology and Clinical Applications, Office of Biostatistics Research, Two Rockledge Centre, Room 8100D, 6701 Rockledge Drive, MSC 7938, Bethesda, MD 20892-7938.

TABLE 1. Empirical bivariate counts for the vitamin E data partitioned by empirical marginal quintiles*

Food record      Food use questionnaire quintiles
quintiles        1 (low)      2      3      4   5 (high)   Total
1 (low)               13     13      4      1          0      31
2                      9      9      7      4          2      31
3                      7      5      7      7          6      32
4                      2      2     12      9          6      31
5 (high)               0      2      2     10         18      32
Total                 31     31     32     31         32     157

* From original data in a study by Pietinen et al. (1).
Because the data are partitioned by the random
empirical marginal quintiles rather than by preset cutpoints, the resulting table of observed counts does not
have the conventional multinomial (MULT) distribution. Instead, this table has the EBQP distribution. The
asymptotic theory for this distribution was developed
by Borkowf et al. (3). We present evidence that empirical estimates of measures of agreement calculated
from EBQP tables (table 1) have variances that can
differ substantially from the variances appropriate for
estimates calculated from corresponding MULT tables. Similarly, we illustrate the construction of confidence intervals using empirical (EBQP) methods and
show that these intervals differ from those appropriate
for MULT methods.
A scatter plot of the $\log_e$-transformed data (figure 1) suggests that these data are consistent with the assumption of bivariate normality and, hence, that the original data are bivariate lognormal. On this new scale, the food record and food use questionnaire measurements have means (standard deviations) of 2.24 (0.36) and 2.33 (0.48) $\log_e$-mg, respectively. Using the log-transformed data, we constructed normal probability plots of the marginal data and of linear combinations of these data (not shown). These graphic tests also suggest that it is quite plausible to assume that the log-transformed data are bivariate normal (BVN).

FIGURE 1. Plot of natural logarithm of vitamin E measurements for the food record and the food use questionnaire ($\log_e$-mg). Horizontal axis: food record measurements.

In light of these observations, we also present parametric methods under the assumption of bivariate normality for estimating the expected counts in EBQP tables. In turn, we can use these tables to calculate parametric estimates of measures of agreement of interest. We say that tables produced by parametric methods have the parametric bivariate quantile-partitioned (PBQP) distribution. The general asymptotic theory for this distribution appears in Borkowf (4) and Borkowf and Gail (National Cancer Institute, unpublished manuscript), along with a discussion of the special case in which the original data are BVN.
The BVN distribution is a location-scale family in which the correlation parameter, $\rho$, solely determines the distribution of the estimated counts in PBQP tables. To see this fact, let $X$ and $Y$ be two normal variates with means $\mu_x$ and $\mu_y$, variances $\sigma_x^2$ and $\sigma_y^2$, and correlation $\rho$. Then the $q_x$th and $q_y$th population quantiles of these variates are $\theta_x = \mu_x + \sigma_x \Phi^{-1}(q_x)$ and $\theta_y = \mu_y + \sigma_y \Phi^{-1}(q_y)$, where $\Phi$ denotes the standard normal distribution function. Next, define the standardized normal variates $U = (X - \mu_x)/\sigma_x$ and $V = (Y - \mu_y)/\sigma_y$ with means 0, variances 1, and correlation $\rho$. Then $P\{X \le \theta_x, Y \le \theta_y\} = P\{U \le \Phi^{-1}(q_x), V \le \Phi^{-1}(q_y)\}$, and since the right-hand expression depends only on $\rho$, the proportion of observations below any pair of quantiles must also depend only on $\rho$. In turn, the distribution of the estimated counts in PBQP tables depends only on the sample correlation, $\hat{\rho}$, and not on the unknown means and variances of the marginal distributions. We use the notation BVN($\rho$) to denote a BVN distribution with means 0, variances 1, and correlation $\rho$.
In the vitamin E example, the sample correlation of
the log-transformed data is $\hat{\rho} = 0.6810$, and the
resulting estimated PBQP counts appear in table 2.
Note that as a result of the assumption of bivariate
normality, the estimated counts in this table are symmetric about both the diagonal and antidiagonal. We
present evidence that parametric estimates of measures
of agreement calculated from PBQP tables (table 2)
have substantially smaller variances than do empirical estimates calculated from EBQP tables (table 1).

TABLE 2. Parametric estimates of the expected counts for the vitamin E data partitioned by marginal quintiles*,†

Food record      Food use questionnaire quintiles
quintiles        1 (low)      2      3      4   5 (high)   Total
1 (low)             17.3    8.1    4.0    1.6        0.3    31.4
2                    8.1    9.4    7.5    4.7        1.6    31.4
3                    4.0    7.5    8.4    7.5        4.0    31.4
4                    1.6    4.7    7.5    9.4        8.1    31.4
5 (high)             0.3    1.6    4.0    8.1       17.3    31.4
Total               31.4   31.4   31.4   31.4       31.4     157

* The authors assume the original data are bivariate lognormal. The sample correlation of the log-transformed data is $\hat{\rho} = 0.6810$.
† Numbers in the table may not add up to the marginal totals due to rounding errors.

We reviewed articles in the American Journal of Epidemiology over the past decade to determine which measures of agreement were commonly used and which methods of inference were employed. We found that researchers tended to use empirical bivariate tertiles, quartiles, and quintiles to partition their data. They chose to measure the agreement in these square tables with several statistics, including kappa (5), weighted kappa (6), and row proportions (see the next section and reference 7). The latter statistic measures agreement (or disagreement) in the extreme quantiles of the table and is usually used with quintiles. Statistics such as "percent exact agreement" and "percent agreement within one quantile" are essentially equivalent to weighted kappa with appropriate weights. Only in rare cases, and then only for kappa, did researchers report confidence intervals, and these intervals were constructed by using the incorrect assumption that the tables had the MULT distribution.

In this paper, we first describe the empirical (EBQP) and parametric (PBQP) modeling approaches in more detail. Next, we consider several measures of agreement of interest to epidemiologists (kappa, weighted kappa, and row proportions) and numerically compute their asymptotic means and variances for EBQP, PBQP, and MULT tables constructed from BVN data partitioned by quintiles. We also show how to use empirical, parametric, and MULT methods to analyze the vitamin E data, and we discuss the implications of incorrectly using MULT methods to analyze statistics calculated from EBQP tables. Finally, we compare the advantages and disadvantages of using empirical (EBQP), parametric (PBQP), and MULT methods to analyze bivariate continuous data.

METHODS FOR BIVARIATE QUANTILE-PARTITIONED DATA

Empirical and parametric modeling approaches
Suppose we have bivariate continuous data $(X_k, Y_k)$ $(k = 1, 2, \ldots, t)$, where each of these $t$ pairs of observations is drawn independently from some fixed distribution, $F$. For the empirical approach, we use the original data to calculate the $(d - 1)$ empirical quantiles of each marginal distribution that correspond to the cumulative marginal proportions $1/d, 2/d, \ldots, (d - 1)/d$. We then use these random empirical quantiles to partition the original data into a $d \times d$ EBQP table. Let $\{p_{ij}\}$ $(i, j = 1, 2, \ldots, d)$ be the proportion of observations that fall in the $i$th row and $j$th column quantile-partitioned categories. For example, for the vitamin E data in table 1, $p_{23} = 7/157$. Empirical methods for obtaining asymptotic and estimated variances of $\{p_{ij}\}$ are discussed in Borkowf et al. (3). While these variances may be written in closed form, their formulas are very complex, and therefore we do not repeat them here. To compute these estimated variances efficiently, we need not only the $\{p_{ij}\}$ but also the complete original data. As the sample size increases ($t \to \infty$), the sample EBQP proportions, $\{p_{ij}\}$, tend to the asymptotic proportions, $\{\pi_{ij}\}$.
Alternatively, we may consider a parametric modeling approach in which we assume that the original data come from the BVN distribution. In this case, we use the original data to compute the sample correlation, $\hat{\rho}$, which is the maximum likelihood estimate of the true correlation, $\rho$. Since the BVN distribution is a location-scale family, we do not need to estimate the unknown means or variances of the marginal distributions. We use $\hat{\rho}$ to obtain parametric estimates $\{\hat{\pi}_{ij}\}$ of $\{\pi_{ij}\}$. For example, for the vitamin E data, $\hat{\rho} = 0.6810$, which we use to compute $\hat{\pi}_{23} = 7.5/157$. Parametric methods for obtaining estimates $\{\hat{\pi}_{ij}\}$ and their asymptotic and estimated variances are discussed in Borkowf (4). These variances can often be written in closed form using complex formulas derived by the multivariate delta method (8).
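To make the parametric step concrete, the following Python sketch computes quantile-partitioned cell proportions for a standard BVN distribution with a given correlation. It is an illustration under stated assumptions rather than the method of reference 4: the function names (`bvn_cdf`, `pbqp_proportions`) are hypothetical, and the bivariate normal distribution function is evaluated by Simpson's rule applied to the conditional-normal representation $P\{U \le a, V \le b\} = \int_{-\infty}^{a} \phi(u)\,\Phi\{(b - \rho u)/(1 - \rho^2)^{1/2}\}\,du$, which requires $|\rho| < 1$.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bvn_cdf(a, b, rho, n=2000, lower=-10.0):
    """P(U <= a, V <= b) for a standard BVN pair with correlation rho.

    Numerical sketch: Simpson's rule with n (even) intervals on [lower, a].
    """
    s = sqrt(1.0 - rho * rho)
    h = (a - lower) / n

    def integrand(u):
        return STD_NORMAL.pdf(u) * STD_NORMAL.cdf((b - rho * u) / s)

    total = integrand(lower) + integrand(a)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    return total * h / 3.0


def pbqp_proportions(rho, d=5):
    """d x d cell probabilities when both margins are cut at k/d quantiles."""
    cuts = [STD_NORMAL.inv_cdf(k / d) for k in range(1, d)]
    # cum[i][j] = P(U <= i-th cutpoint, V <= j-th cutpoint); index d means +infinity.
    cum = [[0.0] * (d + 1) for _ in range(d + 1)]
    for i in range(1, d + 1):
        for j in range(1, d + 1):
            if i == d:
                cum[i][j] = j / d
            elif j == d:
                cum[i][j] = i / d
            else:
                cum[i][j] = bvn_cdf(cuts[i - 1], cuts[j - 1], rho)
    # Rectangle probabilities by inclusion-exclusion on the cumulative grid.
    return [
        [cum[i][j] - cum[i - 1][j] - cum[i][j - 1] + cum[i - 1][j - 1]
         for j in range(1, d + 1)]
        for i in range(1, d + 1)
    ]


t = 157
for row in pbqp_proportions(0.6810):
    print(["%.1f" % (t * p) for p in row])  # expected counts, for comparison with table 2
```

Only the correlation enters the calculation, in line with the location-scale argument given earlier; the means and variances of the log-transformed measurements are never needed.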
In the following sections, we denote summation over a subscript by a plus sign. For example, $p_{i+} = \sum_{j=1}^{d} p_{ij}$. Note that $\pi_{i+} = \hat{\pi}_{i+} = \pi_{+j} = \hat{\pi}_{+j} = 1/d$, while $p_{i+}$ and $p_{+j}$ are constant given $t$, apart from rounding.
Three measures of agreement
We now define three measures of agreement of
interest to epidemiologists.
Kappa, $\kappa$. The kappa statistic, $\kappa$ (5), is a measure of the diagonal agreement between two categorical variables, which are the quantile-partitioned categories in this case. Landis and Koch (9, 10) discussed the use of kappa with ordinal categories and provided some useful benchmarks for its interpretation. Let $\Pi_o = \sum_{i=1}^{d} \pi_{ii}$ and $\Pi_e = \sum_{i=1}^{d} \pi_{i+}\pi_{+i}$. Then $\Pi_o$ denotes the limiting proportion of observed diagonal counts, while $\Pi_e$ denotes the limiting proportion of diagonal counts expected under the assumption that the original bivariate measurements are independent. The kappa measure is defined by

$$\kappa = \frac{\Pi_o - \Pi_e}{1 - \Pi_e}. \qquad (1)$$

The value of $\kappa$ ranges from $-\infty$ to 1, and its magnitude and interpretation depend on the somewhat arbitrary choice of table dimensions, $d$ (11). Note that independence implies $\kappa = 0$, while $\kappa = 1$ corresponds to perfect diagonal agreement.
Weighted kappa, $\kappa_w$. The weighted kappa statistic, $\kappa_w$ (6), is a measure of the overall agreement between two categorical variables. With appropriate weights, it corresponds to the intraclass correlation with the quantile-partitioned categories as scores (12). Let $w_{ij} = 1 - (i - j)^2/(d - 1)^2$, $\Pi_{ow} = \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\pi_{ij}$, and $\Pi_{ew} = \sum_{i=1}^{d}\sum_{j=1}^{d} w_{ij}\pi_{i+}\pi_{+j}$. Then $\Pi_{ow}$ denotes a limiting weighted proportion, while $\Pi_{ew}$ denotes the corresponding limiting weighted proportion expected under the assumption that the original bivariate measurements are independent. The weighted kappa measure (with these particular weights) is defined by

$$\kappa_w = \frac{\Pi_{ow} - \Pi_{ew}}{1 - \Pi_{ew}}. \qquad (2)$$

The value of $\kappa_w$ ranges from $-1$ to 1, independence implies $\kappa_w = 0$, and $\kappa_w = 1$ corresponds to perfect diagonal agreement. A nice feature of $\kappa_w$ is that its magnitude depends only slightly on the choice of table dimensions for $d \ge 5$. We have shown in unpublished work that as the table dimensions increase ($d \to \infty$), $\kappa_w$ converges to the asymptotic value of Spearman's rank correlation, $\rho_s$. In particular, for the BVN distribution, $\rho_s = (6/\pi)\arcsin(\rho/2)$ (13).
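The relation between the BVN correlation and Spearman's rank correlation is easy to evaluate directly. The short Python sketch below does so for the correlations considered later in the paper; the function name `spearman_from_pearson` is chosen here for illustration only.

```python
from math import asin, pi


def spearman_from_pearson(rho):
    """Asymptotic Spearman rank correlation for BVN data with correlation rho."""
    return (6.0 / pi) * asin(rho / 2.0)


for rho in (0.0, 0.25, 0.5, 0.75, 0.9):
    print(rho, round(spearman_from_pearson(rho), 3))
```

For $\rho = 0.75$ the function returns a value of about 0.73, slightly below $\rho$ itself.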
Row proportions, $\alpha$. Row proportions, $\alpha$, are conditional proportions that measure agreement (or disagreement) in the extreme quantiles of a table (e.g., reference 7). We often use these measures when we want to compare a "test method" of measurement (columns) to a "gold standard" of measurement (rows). For example, we may be interested in the true row proportion

$$\alpha_{1|1} = \frac{\pi_{11}}{\pi_{1+}}, \qquad (3)$$

which gives the proportion of observations that fall in the first column of the table given that they already fall in the first row. The value of $\alpha_{1|1}$ ranges from 0 to 1 and depends strongly on the table dimensions, $d$. Note that independence implies $\alpha_{1|1} = \pi_{+1} = 1/d$, while $\alpha_{1|1} = 1$ corresponds to perfect agreement in the first row and column. Similarly, we may define other row proportions of interest. For example, $\alpha_{1,2|1} = (\pi_{11} + \pi_{12})/\pi_{1+}$, $\alpha_{5|1} = \pi_{15}/\pi_{1+}$, and $\alpha_{5|5} = \pi_{55}/\pi_{5+}$ all measure agreement or disagreement in the extreme row quintiles of a 5 × 5 table. We focus on $\alpha_{1|1}$ for simplicity.
We obtain sample estimates of the above three measures of agreement ($\kappa$, $\kappa_w$, and $\alpha$) for EBQP and PBQP tables by replacing the $\{\pi_{ij}\}$ in the above formulas with $\{p_{ij}\}$ and $\{\hat{\pi}_{ij}\}$, respectively. We also use the multivariate delta method (8) to compute both asymptotic and estimated variances of these measures of agreement for EBQP and PBQP tables, as described in Borkowf (4). In particular, we note that the variance formulas for kappa and weighted kappa differ from the conventional formulas appropriate for MULT methods (14).
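As a worked example of these plug-in estimates, the Python sketch below applies formulas (1) to (3) to the counts in table 1. The function name `agreement_measures` and the variable names are illustrative assumptions; only point estimates are computed, since the EBQP variance formulas are not given in this paper.

```python
# Counts from table 1: rows are food record quintiles,
# columns are food use questionnaire quintiles.
TABLE_1 = [
    [13, 13, 4, 1, 0],
    [9, 9, 7, 4, 2],
    [7, 5, 7, 7, 6],
    [2, 2, 12, 9, 6],
    [0, 2, 2, 10, 18],
]


def agreement_measures(counts):
    """Return (kappa, weighted kappa, alpha_{1|1}) for a square table of counts."""
    d = len(counts)
    t = sum(map(sum, counts))
    p = [[n / t for n in row] for row in counts]
    row_tot = [sum(r) for r in p]
    col_tot = [sum(c) for c in zip(*p)]

    # Kappa, formula (1): observed versus chance-expected diagonal proportion.
    obs = sum(p[i][i] for i in range(d))
    exp = sum(row_tot[i] * col_tot[i] for i in range(d))
    kappa = (obs - exp) / (1.0 - exp)

    # Weighted kappa, formula (2), with weights w_ij = 1 - (i - j)^2 / (d - 1)^2.
    w = [[1.0 - (i - j) ** 2 / (d - 1) ** 2 for j in range(d)] for i in range(d)]
    obs_w = sum(w[i][j] * p[i][j] for i in range(d) for j in range(d))
    exp_w = sum(w[i][j] * row_tot[i] * col_tot[j] for i in range(d) for j in range(d))
    kappa_w = (obs_w - exp_w) / (1.0 - exp_w)

    # Row proportion, formula (3): share of the first row that falls in column 1.
    alpha_11 = p[0][0] / row_tot[0]
    return kappa, kappa_w, alpha_11


kappa, kappa_w, alpha_11 = agreement_measures(TABLE_1)
print(round(kappa, 3), round(kappa_w, 3), round(alpha_11, 3))  # 0.196 0.631 0.419
```

These values agree with the EBQP point estimates reported in table 4.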
THEORETICAL AND APPLIED RESULTS
General results for the bivariate normal
distribution
We now present the asymptotic means, variances, and variance ratios of the three measures of agreement for 5 × 5 tables where the underlying data come from the BVN distribution for selected correlations ($\rho$ = 0, 0.25, 0.5, 0.75, and 0.9). Readers who prefer a more concrete discussion should skip to the vitamin E example in the next section.
Table 3 ($\kappa$) shows five asymptotic parameters for kappa, $\kappa$, where the underlying distribution is BVN($\rho$). The asymptotic variances of $t^{1/2}\hat{\kappa}$ for EBQP and MULT tables tend to be quite close, with the greatest differences found for higher correlations. The ratios of the asymptotic variances show that the relative differences are, at most, about 4 percent for $0 \le \rho \le 0.9$. Thus, even if we incorrectly use the MULT variances for $\hat{\kappa}$ calculated from EBQP tables, the subsequent results will not be terribly misleading for BVN data. Borkowf et al. (3), however, present examples of certain unusual bivariate distributions that give very different asymptotic variances of $t^{1/2}\hat{\kappa}$ for EBQP and MULT tables. By contrast, the parametric (PBQP) method produces asymptotic variances of $t^{1/2}\hat{\kappa}$ that are only 20-33 percent as large as those of the empirical (EBQP) method for $0 \le \rho \le 0.9$. This result indicates that empirical methods are quite inefficient for estimating kappa compared with parametric methods.
Next, table 3 ($\kappa_w$) shows that the asymptotic means of weighted kappa, $\kappa_w$, are slightly less than the correlation of the BVN($\rho$) distribution. This result follows from the previously mentioned relations among $\rho$, $\rho_s$, and $\kappa_w$. For example, $\rho = 0.75$ corresponds to $\rho_s = 0.73$, which is well approximated by $\kappa_w = 0.70$. Note that the asymptotic variances of $t^{1/2}\hat{\kappa}_w$ for EBQP and MULT tables differ by no more than about 5 percent for $0 \le \rho \le 0.9$. Thus, even if we incorrectly use the MULT variances for $\hat{\kappa}_w$ calculated from EBQP tables, the subsequent results will not be very misleading for BVN data. By contrast, the parametric method produces asymptotic variances of $t^{1/2}\hat{\kappa}_w$ that decrease monotonically from 79 to 53 percent as large as their empirical counterparts as $\rho$ increases from 0 to 0.9. This result indicates that empirical methods are reasonably efficient for estimating weighted kappa compared with parametric methods, especially for low correlations.

TABLE 3. Asymptotic means, variances, and variance ratios of three measures of agreement for BVN($\rho$)* data in 5 × 5 tables†

Measure of agreement     Correlation ($\rho$)
and parameter‡            0.0    0.25     0.5    0.75     0.9
$\kappa$
  Mean                   0.00    0.06    0.15    0.29    0.47
  EBQP*                  0.25    0.29    0.33    0.37    0.40
  MULT*                  0.25    0.29    0.34    0.38    0.38
  PBQP*                  0.05    0.08    0.10    0.12    0.13
  Ratio 1‡               1.00    1.01    1.04    1.04    0.97
  Ratio 2‡               0.20    0.26    0.30    0.32    0.33
$\kappa_w$
  Mean                   0.00    0.22    0.45    0.70    0.85
  EBQP                   1.00    0.92    0.68    0.31    0.08
  MULT                   1.00    0.91    0.66    0.29    0.08
  PBQP                   0.79    0.71    0.50    0.20    0.04
  Ratio 1                1.00    0.99    0.97    0.95    1.00
  Ratio 2                0.79    0.77    0.73    0.65    0.53
$\alpha_{1|1}$
  Mean                   0.20    0.31    0.44    0.60    0.75
  EBQP                   0.64    0.78    0.82    0.73    0.53
  MULT                   0.80    1.06    1.23    1.20    0.94
  PBQP                   0.15    0.19    0.18    0.12    0.06
  Ratio 1                1.25    1.37    1.50    1.65    1.78
  Ratio 2                0.24    0.25    0.22    0.17    0.11

* Distributions: BVN($\rho$), standard bivariate normal distribution with means 0, variances 1, and correlation $\rho$; EBQP, empirical bivariate quantile-partitioned; MULT, multinomial; PBQP, parametric bivariate quantile-partitioned.
† The results in this table hold regardless of the means and variances of the BVN distribution.
‡ The parameters denote the means of the measures of agreement; the asymptotic variances of $t^{1/2}$ times the parameter estimates for EBQP, MULT, and PBQP tables; Ratio 1, the ratios of the MULT to the EBQP asymptotic variances; and Ratio 2, the ratios of the PBQP to the EBQP asymptotic variances.
Furthermore, table 3 ($\alpha_{1|1}$) shows five asymptotic parameters for the row proportion $\alpha_{1|1}$, where the underlying distribution is BVN($\rho$). The asymptotic variances of $t^{1/2}\hat{\alpha}_{1|1}$ for EBQP tables are consistently less than those for MULT tables, and their variance ratios increase monotonically from 1.25 to 1.78 as $\rho$ increases from 0 to 0.9. These differences in the asymptotic variances are mainly due to the fact that the row total $p_{1+}$ is constant given $t$ in EBQP tables but random in MULT tables. Thus, confidence intervals for $\alpha_{1|1}$ that are constructed by incorrectly using MULT methods will tend to be too wide. By contrast, the parametric method produces asymptotic variances of $t^{1/2}\hat{\alpha}_{1|1}$ that are only 11-25 percent as large as those of the empirical method for $0 \le \rho \le 0.9$. This result indicates that empirical methods are quite inefficient for estimating $\alpha_{1|1}$ compared with parametric methods. Similar patterns hold for other types of row proportions (not shown).
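The MULT values for the row proportion can be checked by hand. Under multinomial sampling, a conditional proportion behaves asymptotically like a binomial proportion based on the (random) row total, so the variance of $t^{1/2}$ times the estimate is $\alpha_{1|1}(1 - \alpha_{1|1})/\pi_{1+}$ with $\pi_{1+} = 1/d$. This is the standard delta-method result and is stated here as an assumption, not as a formula quoted from the paper; the function names in the Python sketch below are illustrative.

```python
from math import sqrt


def mult_asymptotic_variance(alpha, d):
    """Variance of sqrt(t) * alpha-hat under MULT sampling, with pi_{1+} = 1/d."""
    return alpha * (1.0 - alpha) * d


def mult_standard_error(n11, n1_plus):
    """Binomial-type standard error of n11 / n1_plus, as used for MULT tables."""
    a = n11 / n1_plus
    return sqrt(a * (1.0 - a) / n1_plus)


# Asymptotic means of alpha_{1|1} from table 3 (rounded to two decimals there).
for alpha in (0.20, 0.31, 0.44, 0.60, 0.75):
    print(alpha, round(mult_asymptotic_variance(alpha, 5), 2))

# Vitamin E data: 13 of the 31 observations in row 1 fall in column 1.
print(round(mult_standard_error(13, 31), 4))
```

Because the tabulated means are rounded, the computed variances can differ from the MULT row of table 3 in the last digit.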
More details about the asymptotic variance ratios of
these measures of agreement in tables of various dimensions (d = 2, 3, 4, 5, and 6) and for several
distributions appear in Borkowf (4).
Analysis of the vitamin E data
We used three approaches to analyze the vitamin E data (table 4). For the empirical analysis, we computed the point estimates of the measures of agreement from the EBQP counts in table 1 and their estimated variances from the complete data using the methods of Borkowf et al. (3). For the MULT analysis, we incorrectly treated the EBQP counts in table 1 as though they were MULT and computed point estimates and estimated variances accordingly. For the parametric analysis, we computed the point estimates from the estimated PBQP counts in table 2 and their estimated variances from the sample correlation $\hat{\rho} = 0.6810$ using the methods of Borkowf (4). We also constructed nominal 95 percent confidence intervals of these measures of agreement of the form (estimate ± 1.96 × standard error).

TABLE 4. Results for three measures of agreement and three methods of analysis using the vitamin E data in 5 × 5 tables

Measure of agreement     Point       Standard    95% confidence
and method of analysis   estimate    error       interval
$\kappa$
  EBQP*                  0.196       0.0563      0.085-0.306
  MULT*,†                0.196       0.0478      0.102-0.290
  PBQP*                  0.242       0.0266      0.190-0.294
$\kappa_w$
  EBQP                   0.631       0.0523      0.526-0.735
  MULT                   0.631       0.0452      0.542-0.719
  PBQP                   0.627       0.0424      0.544-0.710
$\alpha_{1|1}$
  EBQP                   0.419       0.0763      0.270-0.569
  MULT                   0.419       0.0886      0.246-0.593
  PBQP                   0.551       0.0305      0.491-0.611

* Distributions: EBQP, empirical bivariate quantile-partitioned; MULT, multinomial; PBQP, parametric bivariate quantile-partitioned.
† For the MULT analysis, the authors incorrectly treat the EBQP counts in table 1 as though they were MULT counts and then compute point estimates and estimated variances accordingly.
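The interval construction is simple enough to reproduce from the point estimates and standard errors in table 4. The Python sketch below is illustrative only: the helper name `wald_interval` and the dictionary layout are assumptions, and because the inputs are the rounded published values, the results may differ slightly from the published intervals.

```python
def wald_interval(estimate, standard_error, z=1.96):
    """Nominal 95 percent interval: estimate +/- z * standard error."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


# (point estimate, standard error) pairs copied from table 4.
TABLE_4 = {
    ("kappa", "EBQP"): (0.196, 0.0563),
    ("kappa", "MULT"): (0.196, 0.0478),
    ("kappa", "PBQP"): (0.242, 0.0266),
    ("weighted kappa", "EBQP"): (0.631, 0.0523),
    ("weighted kappa", "MULT"): (0.631, 0.0452),
    ("weighted kappa", "PBQP"): (0.627, 0.0424),
    ("row proportion", "EBQP"): (0.419, 0.0763),
    ("row proportion", "MULT"): (0.419, 0.0886),
    ("row proportion", "PBQP"): (0.551, 0.0305),
}

for (measure, method), (est, se) in TABLE_4.items():
    lo, hi = wald_interval(est, se)
    print("%-15s %-5s %.3f-%.3f" % (measure, method, lo, hi))
```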
The results in table 4 show that the empirical (EBQP) and parametric (PBQP) methods produce similar point estimates for kappa ($\hat{\kappa} = 0.196$ vs. 0.242) and weighted kappa ($\hat{\kappa}_w = 0.631$ vs. 0.627), but very different estimates for the selected row proportion ($\hat{\alpha}_{1|1} = 0.419$ vs. 0.551). These discrepancies reflect the fact that weighted kappa depends on all of the cells of the table, whereas kappa depends on only the diagonal cells and the row proportion $\alpha_{1|1}$ depends on only the (1, 1) cell. Measures that depend on fewer EBQP cells tend to be more variable and thus more likely to differ from those calculated from the estimated PBQP cells, even if the assumption that the data are BVN is correct. Furthermore, just as we observe in the theoretical studies in the previous section, the parametric method gives smaller estimated standard errors than does the empirical method for the vitamin E data.
If we incorrectly analyze the EBQP counts in table 1 with MULT methods, we obtain estimated standard errors that are about 15 and 14 percent smaller for kappa and weighted kappa, but about 16 percent larger for the row proportion $\alpha_{1|1}$ than the correctly estimated EBQP standard errors. Thus, confidence intervals based on the incorrect MULT standard errors are too narrow for kappa and weighted kappa, but too wide for the row proportion $\alpha_{1|1}$.
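The percentages quoted above follow directly from the standard errors in table 4. A minimal Python check, with the helper name `percent_change` chosen here for illustration, is:

```python
def percent_change(se_mult, se_ebqp):
    """Percent by which the MULT standard error differs from the EBQP one."""
    return 100.0 * (se_mult - se_ebqp) / se_ebqp


# (MULT standard error, EBQP standard error) pairs copied from table 4.
PAIRS = {
    "kappa": (0.0478, 0.0563),
    "weighted kappa": (0.0452, 0.0523),
    "row proportion": (0.0886, 0.0763),
}

for measure, (se_mult, se_ebqp) in PAIRS.items():
    print(measure, round(percent_change(se_mult, se_ebqp)))
```

A negative value means that the MULT standard error is smaller than the EBQP standard error, so the corresponding MULT interval is narrower.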
DISCUSSION
In this paper, we have illustrated how to use recently
developed empirical methods of inference to analyze
measures of agreement calculated from EBQP tables.
We need both the EBQP tables and the original bivariate continuous data to estimate the variances of measures calculated from these tables. Simulations and
numerical studies show that confidence intervals for
several measures of agreement calculated from EBQP
tables have near nominal coverage for the BVN distribution for samples of moderate size ($t \ge 30$), while
other underlying distributions may require larger sample sizes (3, 4).
Preliminary simulations show that breaking ties at
random produces confidence intervals of nominal size
when few ties occur at each empirical quantile and the
table dimensions, d, are much smaller than the sample
size, t. For discrete data with many ties occurring at
the empirical quantiles, however, the joint EBQP distribution may not be well-defined, in which case empirical methods are suspect.
We have also shown how to use parametric methods
of inference to analyze measures of agreement calculated from PBQP tables. We need the original bivariate data to estimate the underlying parameters of the
assumed distribution efficiently. Simulations and numerical studies show that estimated PBQP variances
for several measures of agreement converge rapidly to
their asymptotic values, and hence confidence intervals achieve near nominal coverage for BVN($\rho$) data
for moderate sample sizes. Furthermore, empirical
methods have low efficiency compared with parametric methods for several measures of agreement, especially in tables with low dimensions (4). These results
are reminiscent of the findings of Donner and Eliasziw
(15) that the reliability coefficient calculated from
dichotomously partitioned data in MULT tables has
low efficiency compared with the intraclass correlation calculated from the underlying continuous BVN
data.
We suggest using parametric methods if the (possibly transformed) data appear to be consistent with a
particular parametric model, such as the BVN distribution. Parametric methods give much more precise
estimates of some measures of agreement, such as
kappa and row proportions, than empirical methods.
Empirical methods are useful, however, when we have
reason to doubt the validity of the parametric assumptions. Furthermore, we recommend using empirical
methods for some measures of agreement, such as
weighted kappa, for which these methods are reasonably efficient, at least for BVN data (table 3).
We have also demonstrated that conventional
MULT methods should not be used to analyze statistics computed from EBQP tables. In particular, for the
BVN distribution, numerical calculations for 5 × 5
tables indicate that MULT methods produce confidence intervals for kappa and weighted kappa that, on
average, can be either too narrow or too wide and
confidence intervals for row proportions that are much
too wide (table 3). Similar results occur in the vitamin
E example (table 4). Furthermore, MULT methods can
produce very misleading confidence intervals, even
for kappa and weighted kappa, for certain unusual
bivariate distributions (3).
In some circumstances, it may be advantageous to
use preset cutpoints to create MULT tables instead of
the random empirical quantiles to create EBQP tables.
For example, we may prefer to create MULT tables
when the two measurements that we wish to compare
are measured on the same scale and there exist well-defined thresholds or cutpoints of interest. Using preset cutpoints defines tables with categories that have
meaning in reference to the original scale of measurement and also facilitates comparisons among similarly
constructed tables from different studies. In addition,
statistical packages, such as StatXact (16), provide
estimates and variances of measures of agreement
(including kappa and weighted kappa) calculated from
MULT tables.
By contrast, when we use empirical quantiles instead of preset cutpoints, the resulting EBQP tables
lack information about the absolute values of the bivariate measurements. In particular, we cannot determine from EBQP tables alone whether one measurement is typically higher than the other. We can remedy
this deficiency, however, by reporting the values of the
empirical quantiles along with these tables. Furthermore, we need special software to compute the variances of statistics calculated from EBQP tables (see
below).
Conversely, it may be advantageous to construct
EBQP tables when the underlying measurements are
hard to calibrate or standardize. Since we use the
bivariate ranks to create these tables, they are invariant
to monotonic increasing transformations of the marginal data. Thus, we may create EBQP tables, for
example, in studies where two different laboratories
produce measurements with different intercepts or
scalings or where we obtain the two quantitative measurements by disparate techniques, such as food
records and food use questionnaires. Finally, EBQP
tables have (nearly) balanced marginal totals, which
avoids problems due to entire rows or columns with
sparse cells.
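The invariance property can be illustrated with a small Python sketch. Everything in it is an assumption made for the example: the data are synthetic, the function names (`rank_categories`, `quantile_table`) are hypothetical, and the rank-to-category rule is one of several reasonable conventions. The table built from the raw measurements is compared with the table built after a logarithmic transformation of one margin and a change of intercept and scale in the other.

```python
import random
from math import exp, log


def rank_categories(values, d):
    """Category 0..d-1 for each value, determined only by its rank."""
    t = len(values)
    order = sorted(range(t), key=lambda k: values[k])
    categories = [0] * t
    for rank, k in enumerate(order):
        categories[k] = rank * d // t  # assumed convention
    return categories


def quantile_table(x, y, d):
    """d x d table of counts with categories defined by marginal ranks."""
    rows = rank_categories(x, d)
    cols = rank_categories(y, d)
    table = [[0] * d for _ in range(d)]
    for r, c in zip(rows, cols):
        table[r][c] += 1
    return table


rng = random.Random(3)
z = [rng.gauss(0.0, 1.0) for _ in range(157)]
x = [exp(2.2 + 0.4 * a) for a in z]                        # lognormal-like margin
y = [exp(2.3 + 0.5 * (0.7 * a + rng.gauss(0.0, 0.7))) for a in z]

original = quantile_table(x, y, 5)
transformed = quantile_table([log(v) for v in x], [10.0 + 3.0 * v for v in y], 5)
print(original == transformed)
```

Because only the ranks within each margin enter the construction, any strictly increasing transformation of either measurement leaves the table unchanged.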
In addition to empirical and parametric methods, we
should consider other measures of agreement and analyses that examine the original bivariate continuous
data. For example, when the data are BVN, the sample
correlation $\hat{\rho}$ is an optimal location and scale invariant
measure of overall agreement. More generally, we can
use the bivariate ranks of the original data to estimate
nonparametric statistics, such as Spearman's rank correlation, $\rho_s$. The original data also lend themselves to
a variety of graphic and regression analyses, which at
times may reveal more about the association in the
underlying distribution than some statistics calculated
from EBQP and PBQP tables, such as kappa or row
proportions.
The first author (C. B. B.) can provide a computer
program (EpiQuant 1.0) for constructing EBQP and
PBQP tables, the latter under the assumption of bivariate normality, and for estimating the means and variances of kappa, weighted kappa, and six selected row
proportions. This program is written in the GAUSS 3.0
programming language (17) and comes with a brief
technical note to explain its use (4).
ACKNOWLEDGMENTS
The authors thank Anne Hartman for bringing the problem of EBQP tables in epidemiologic studies to their attention. They are grateful to her and her collaborators for
providing the original vitamin E data.
REFERENCES
1. Pietinen P, Hartman AM, Haapa E, et al. Reproducibility and validity of dietary assessment instruments. I. A self-administered food use questionnaire with a portion size picture booklet. Am J Epidemiol 1988;128:655-66.
2. Pietinen P, Hartman AM, Haapa E, et al. Reproducibility and validity of dietary assessment instruments. II. A qualitative food frequency questionnaire. Am J Epidemiol 1988;128:667-76.
3. Borkowf CB, Gail MH, Carroll RJ, et al. Analyzing bivariate continuous data grouped into categories defined by empirical quantiles of the marginal distributions. Biometrics (In press).
4. Borkowf CB. The empirical and parametric bivariate quantile-partitioned distributions. Doctoral dissertation. Field of statistics. Cornell University, Ithaca, New York, 1997.
5. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37-46.
6. Spitzer RL, Cohen J, Fleiss JL, et al. Quantification of agreement in psychiatric diagnosis. A new approach. Arch Gen Psychiatry 1967;17:83-7.
7. Willett WC, Sampson L, Stampfer MJ, et al. Reproducibility and validity of a semiquantitative food frequency questionnaire. Am J Epidemiol 1985;122:51-65.
8. Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis. Cambridge, MA: MIT Press, 1975:486-502.
9. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-74.
10. Landis JR, Koch GG. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 1977;33:363-74.
11. Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. Am J Epidemiol 1987;126:161-9.
12. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of variability. Educ Psychol Meas 1973;33:613-19.
13. Moran PAP. Rank correlation and product-moment correlation. Biometrika 1948;35:203-6.
14. Fleiss JL, Cohen J, Everitt BS. Large sample standard errors of kappa and weighted kappa. Psychol Bull 1969;72:323-7.
15. Donner A, Eliasziw M. Statistical implications of the choice between a dichotomous or continuous trait in studies of interobserver agreement. Biometrics 1994;50:550-5.
16. CYTEL Software Corporation. StatXact 3 for Windows. Cambridge, MA: CYTEL Software Corporation, 1995.
17. Aptech Systems, Inc. The GAUSS system. Version 3.0. Maple Valley, WA: Aptech Systems, Inc., 1992.