Methods for Two Categorical Variables – RR, OR and Screening Tests
Example: In May of 2000, eight people who worked at the same microwave popcorn production plant
reported to the Missouri Department of Health with fixed obstructive lung disease. These workers had
become ill between 1993 and 2000 while employed at the plant. Because of the cases, researchers
began conducting medical examinations and environmental surveys of workers employed at the plant in
November 2000 to assess their occupational exposure to certain compounds.
Part of this study involved measuring the forced vital capacity (FVC) of the current employees (this is the
volume of air that can be maximally, forcefully exhaled). The study consisted of 116 participants, and
the FVC screening indicated that 21 employees had an airway obstruction. In addition, the popcorn
plant was broken into several areas (the flavor-mixing room, packaging room, etc.). Air and dust
samples in each area were measured to determine the exposure to diacetyl, a marker of organic chemical
exposure. Then, the average exposure for each study participant was determined by taking
into account how long they spent at different jobs within the plant and the average exposure in that job
area. Finally, they were classified as having either “low” or “high” exposure. The data can be found in
the file PopcornPlant.jmp on the course website.
Source: The data and example are from “Investigating Statistical Concepts, Applications, and Methods” by Allan
Rossman and Beth Chance, Preliminary Edition. 2005. Brooks/Cole Thomson Learning.
The contingency table and mosaic plot for the data are given below.
Questions:
1. Using the contingency table, find the following marginal probability:
P(High Exposure) =
2. Using the contingency table, find the following marginal probability:
P(Low Exposure) =
3. Using the contingency table, find the joint probability that someone is in the High Exposure
group and has an Airway Obstruction:
P(High Exposure and Airway Obstruction) =
4. Using the contingency table, find the following joint probability:
P(Low Exposure and Airway obstruction) =
5. Using the contingency table, find the following conditional probability. Given that an individual
is in the High Exposure category, what is the probability they have an airway obstruction?
P(Airway Obstruction | High Exposure) =
6. Using the contingency table, find the following conditional probability:
P(Airway Obstruction | Low Exposure) =
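The marginal, joint, and conditional probabilities asked for above can be checked with a short Python sketch. The four cell counts (52, 43, 6, 15) come from the contingency table for this example; the variable and function names are illustrative assumptions and are not part of the JMP workflow.

```python
# Cell counts from the popcorn plant contingency table
# (keys: airway status, exposure level).
counts = {
    ("not_obstructed", "low"): 52,
    ("not_obstructed", "high"): 43,
    ("obstructed", "low"): 6,
    ("obstructed", "high"): 15,
}
total = sum(counts.values())  # grand total of participants


def exposure_total(level):
    """Column total: everyone in the given exposure group."""
    return sum(n for (_, exp), n in counts.items() if exp == level)


def marginal_exposure(level):
    """Marginal probability: column total divided by the grand total."""
    return exposure_total(level) / total


def joint(airway, level):
    """Joint probability: a single cell count divided by the grand total."""
    return counts[(airway, level)] / total


def conditional_obstructed(level):
    """Conditional probability: cell count divided by its column total."""
    return counts[("obstructed", level)] / exposure_total(level)


for level in ("high", "low"):
    print(f"P({level} exposure) = {marginal_exposure(level):.4f}")
    print(f"P({level} exposure and obstruction) = {joint('obstructed', level):.4f}")
    print(f"P(obstruction | {level} exposure) = {conditional_obstructed(level):.4f}")
```

Note the different denominators: marginal and joint probabilities divide by the grand total, while a conditional probability divides by the total of the group being conditioned on.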
Other summaries that are often used when investigating the relationship between categorical variables
are the risk difference, relative risk and odds ratio.
Risk Difference and Relative Risk
Let’s again consider the data from the microwave popcorn plant example.
                         Low Exposure   High Exposure   Total
Airway Not Obstructed         52              43           95
Airway Obstructed              6              15           21
Total                         58              58          116
We have seen P(Airway Obstruction | High Exposure) is ____________________ than
P(Airway Obstruction | Low Exposure). Since these conditional probabilities differ, it appears there may
be an __________________ between level of exposure and having an airway obstruction. One way to
compare the two groups (High and Low Exposure) is to look at the risk difference in these two
probabilities.
Risk Difference: This is simply the difference in two ________________ probabilities.
Questions:
7. Compute the risk difference for P(Airway Obstruction | High Exposure) and
P(Airway Obstruction | Low Exposure).
8. Does this seem like a large difference to you?
9. Suppose the two conditional probabilities of interest had been 0.95 and 0.79 instead. Does this
seem like a large difference to you?
Note that for these data, P(Airway Obstruction | High Exposure) was more than ___________ as large as
P(Airway Obstruction | Low Exposure). Since this seems like an important feature to describe, we will
compare the two groups using relative risk rather than the risk difference.
Relative Risk: This is the measure of how much a particular risk factor ________________ the
risk of a specified outcome.
For the popcorn example, we can calculate the relative risk as follows:
RR = Relative Risk = P(Airway Obstruction | High Exposure) / P(Airway Obstruction | Low Exposure)
   = (Proportion with Airway Obstruction in High Exposure Group) / (Proportion with Airway Obstruction in Low Exposure Group)
   =
Interpretation of this value:
Comments:
• A relative risk of _____ is the reference value for making comparisons. That is, a relative risk of
_____ says there is _____ difference in the two probabilities.
• The risk difference and the relative risk are easily displayed in the following mosaic plot.
• Alternatively, we could have calculated the relative risk as follows:
RR = P(Airway Obstruction | Low Exposure) / P(Airway Obstruction | High Exposure) =
Interpretation: Airway obstruction is _____ times as likely for the _________ exposure group
as for the _________ exposure group.
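As a rough check on the hand calculations, the risk difference and both versions of the relative risk can be computed in a few lines of Python. This is only a sketch using the counts from the popcorn plant table; the names are illustrative assumptions.

```python
# Counts from the popcorn plant table.
obstructed = {"high": 15, "low": 6}     # employees with an airway obstruction
group_total = {"high": 58, "low": 58}   # employees in each exposure group

# Conditional probability (risk) of obstruction within each exposure group.
risk = {group: obstructed[group] / group_total[group] for group in obstructed}

# Risk difference: subtract the two conditional probabilities.
risk_difference = risk["high"] - risk["low"]

# Relative risk: divide one conditional probability by the other.
rr_high_vs_low = risk["high"] / risk["low"]
rr_low_vs_high = risk["low"] / risk["high"]

print(f"Risk (high) = {risk['high']:.4f}, Risk (low) = {risk['low']:.4f}")
print(f"Risk difference = {risk_difference:.4f}")
print(f"RR (high vs. low) = {rr_high_vs_low:.4f}")
print(f"RR (low vs. high) = {rr_low_vs_high:.4f}")
```

The two relative risks are reciprocals of each other, so which group goes in the numerator changes the number but not the underlying comparison.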
Odds Ratio
The relative risk is frequently used when investigating the relationship between two categorical
variables. Although this quantity is relatively easy to calculate and interpret, statisticians often use
another quantity known as an odds ratio in this situation.
Odds: With counts given for two _________ categories (High and Low Exposure), the odds of
“yes” versus “no” is computed as the number of “yes” events versus the number of “no”
events for each group.
Again, let’s consider the microwave popcorn example.
                         Low Exposure   High Exposure   Total
Airway Not Obstructed         52              43           95
Airway Obstructed              6              15           21
Total                         58              58          116
The odds of Airway Obstruction for High = (Number with airway obstruction in High Group) / (Number with NO airway obstruction in High Group) =
The odds of Airway Obstruction for Low = (Number with airway obstruction in Low Group) / (Number with NO airway obstruction in Low Group) =
Odds Ratio: This is simply the _________ of the odds for the two groups:
OR = Odds Ratio = (Odds of airway obstruction for High Exposure) / (Odds of airway obstruction for Low Exposure)
Interpretation of this value:
We could have also calculated the odds ratio in the following manner:
OR = (Odds of airway obstruction for Low Exposure) / (Odds of airway obstruction for High Exposure)
Interpretation of this value:
Comments:
• An odds ratio of _____ implies there is no observable difference between the two odds.
• The odds can be visualized using the mosaic plot.
• We can make the following conclusions from this study:
  o The findings indicate that employees with high exposure were _____ times more likely
    to have an airway obstruction. Or, the odds of an airway obstruction are _____ times
    greater for the group with high exposure.
  o This is an observational study, so while we have evidence the high exposure group has a
    greater risk of airway obstruction, we cannot say for sure that diacetyl caused it.
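The odds and the odds ratio for the popcorn plant data can be verified with the following Python sketch. It uses the counts from the table above; the names are illustrative assumptions. The last calculation uses the cross-product shortcut, which is algebraically the same ratio.

```python
# Counts from the popcorn plant table.
with_obstruction = {"high": 15, "low": 6}
without_obstruction = {"high": 43, "low": 52}

# Odds within each group: number of "yes" events versus number of "no" events.
odds = {
    group: with_obstruction[group] / without_obstruction[group]
    for group in with_obstruction
}

# Odds ratio: ratio of the two odds, in either direction.
or_high_vs_low = odds["high"] / odds["low"]
or_low_vs_high = odds["low"] / odds["high"]

# Equivalent cross-product form for a 2x2 table.
cross_product = (15 * 52) / (43 * 6)

print(f"Odds (high) = {odds['high']:.4f}, Odds (low) = {odds['low']:.4f}")
print(f"OR (high vs. low) = {or_high_vs_low:.4f}")
print(f"OR (low vs. high) = {or_low_vs_high:.4f}")
print(f"Cross-product OR  = {cross_product:.4f}")
```

Unlike a probability, odds divide by the count of "no" events rather than by the group total, so the odds ratio and the relative risk are generally different numbers.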
Relative Risks and Odds Ratios in JMP
We can get these values from JMP using the following directions.
• Once the mosaic plot, contingency table, and Tests output have been created by choosing Analyze
> Fit Y by X, click on the little red arrow next to Contingency Analysis of Airway Obstruction? By
Exposure. Then choose Relative Risk.
• The options to choose can be determined from the question, or you can ask for all combinations
to be output. However, only one of the ratios is most appropriate for each scenario.
• The following output will be given at the bottom of the JMP output window.
• If you check the Calculate all combinations box in the dialog box given above, you’ll get the
following output.
• If you select Odds Ratio from the same drop-down menu, you should get the following output.
Note: JMP always alphabetizes the category names in the contingency table and then divides the
columns left to right.
OR = (Odds of NO obstruction for Low group) / (Odds of NO obstruction for the High group) =
Example: A study was conducted in 1991 by the University of Wisconsin and the Wisconsin Department
of Transportation in which linked police reports and discharge records were used to assess, among other
things, the risk of head injury for motorcyclists in motor-vehicle crashes. The data shown below can be
used to examine the relationship between helmet use and whether brain injury was sustained in the
accident.
             Brain Injury   No Brain Injury   Total
Helmet            17              977           994
No Helmet         97             1918          2015
Total            114             2895          3009
Questions:
10. Using JMP, find and interpret the relative risk of brain injury.
11. Using JMP, find and interpret the odds ratio.
12. Looking at the relative risk found in Question 10 and the odds ratio found in Question 11, is
there a relationship/association between helmet use and brain injury? Explain.
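The JMP results for this example can be cross-checked by hand. The sketch below is not JMP output; it is a small Python calculation using the counts from the table, with hypothetical helper names, comparing the No Helmet group to the Helmet group.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Risk in group A divided by risk in group B."""
    return (events_a / total_a) / (events_b / total_b)


def odds_ratio(events_a, non_events_a, events_b, non_events_b):
    """Odds in group A divided by odds in group B."""
    return (events_a / non_events_a) / (events_b / non_events_b)


# Counts from the motorcycle table: group A = No Helmet, group B = Helmet.
rr_no_helmet_vs_helmet = relative_risk(97, 2015, 17, 994)
or_no_helmet_vs_helmet = odds_ratio(97, 1918, 17, 977)

print(f"RR of brain injury (no helmet vs. helmet) = {rr_no_helmet_vs_helmet:.3f}")
print(f"OR of brain injury (no helmet vs. helmet) = {or_no_helmet_vs_helmet:.3f}")
```

Because brain injury is uncommon in both groups here, the odds are close to the risks, so the odds ratio and the relative risk come out close to each other.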
Example: A case-control study was conducted to determine whether there was an increased risk of
cervical cancer for women who had their first child before age 25. A sample of 49 women with cervical
cancer was taken and 42 had their first child before the age of 25. From a sample of 317 “similar”
women without cervical cancer it was found that 203 of them had their first child before the age of 25.
Research Question – Do these data suggest that having a child before the age of 25 increases
the risk of cervical cancer?
                      Age ≤ 25   Age > 25   Total
Cancer (Case)            42          7        49
No Cancer (Control)     203        114       317
Total                   245        121       366
Questions:
13. Why can’t we meaningfully calculate P(cancer | risk factor status)?
14. Even though it is not appropriate to do so, calculate P(cancer | risk factor status) for both risk
factor groups.
15. Find the odds of cancer for both risk factor groups.
16. Find the odds ratio for having cancer associated with the risk factor being present.
17. Now, find P(risk factor status | cancer) for each group of women.
18. Find the odds of the risk factor for both the cases and the controls.
19. Find the odds ratio for having the risk factor associated with being a case.
20. What do you notice? Why do you suppose the odds ratio is much more commonly used than
the relative risk?
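To compare the two odds ratios from Questions 16 and 19 side by side, one option is the Python sketch below. It uses the counts from the case-control table; the names are illustrative assumptions, with "exposed" meaning the first child was born at or before the age cutoff.

```python
# Counts from the case-control table.
cases = {"exposed": 42, "unexposed": 7}         # women with cervical cancer
controls = {"exposed": 203, "unexposed": 114}   # women without cervical cancer

# Direction 1: odds of cancer within each risk factor group.
odds_cancer_exposed = cases["exposed"] / controls["exposed"]
odds_cancer_unexposed = cases["unexposed"] / controls["unexposed"]
or_cancer_given_risk_factor = odds_cancer_exposed / odds_cancer_unexposed

# Direction 2: odds of the risk factor within cases and within controls.
odds_exposure_cases = cases["exposed"] / cases["unexposed"]
odds_exposure_controls = controls["exposed"] / controls["unexposed"]
or_risk_factor_given_cancer = odds_exposure_cases / odds_exposure_controls

print(f"OR (cancer, by risk factor)      = {or_cancer_given_risk_factor:.4f}")
print(f"OR (risk factor, by case status) = {or_risk_factor_given_cancer:.4f}")
```

Both calculations reduce to the same cross-product, (42 × 114) / (7 × 203), which is why the direction of conditioning does not change the odds ratio even though the number of cases and controls was fixed by the study design.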
Screening Tests
In this section we’ll discuss a very special case in which we want to analyze the relationship between
two categorical variables: screening tests. Our goal in these types of problems is to determine the
validity of such tests.
False Positive: A positive test result _________________________ identifying a condition that
does _____ in fact exist.
False Negative: A test result implying a condition ______________ exist when in fact it _____.
Consider the following contingency table:
                  Test Result Positive   Test Result Negative   Total
Disease Present            a                      c              a+c
Disease Absent             b                      d              b+d
Total                     a+b                    c+d           a+b+c+d
Questions:
21. Given that an individual has the disease, what is the probability of a positive test result? This is
known as the _______________ of a test: P(Test+ | Disease+).
22. Given that an individual does not have the disease, what is the probability of a negative test
result? This is known as the _______________ of a test: P(Test- | Disease-).
23. Given that a test result is positive, what is the probability that the individual being tested has
the disease? This is known as the ____________________ predictive value of a test:
P(Disease+ | Test+) = [P(Test+ | Disease+) × P(Disease+)] / [P(Test+ | Disease+) × P(Disease+) + P(Test+ | Disease-) × P(Disease-)]
24. Given that a test result is negative, what is the probability that the individual being tested does
not have the disease? This is known as the ___________________ predictive value of a test:
P(Disease- | Test-) = [P(Test- | Disease-) × P(Disease-)] / [P(Test- | Disease-) × P(Disease-) + P(Test- | Disease+) × P(Disease+)]
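The four measures defined in Questions 21 through 24 can be sketched as one Python function using the cell labels a, b, c, d from the table above. The function name and the example counts are hypothetical, chosen only to illustrate the formulas; the predictive values are computed with the Bayes' rule expressions, using the sample disease rate (a+c)/(a+b+c+d) for P(Disease+).

```python
def screening_measures(a, b, c, d):
    """Screening summaries for a 2x2 table laid out as in the handout.

    a: disease present, test positive    c: disease present, test negative
    b: disease absent,  test positive    d: disease absent,  test negative
    """
    sensitivity = a / (a + c)                 # P(Test+ | Disease+)
    specificity = d / (b + d)                 # P(Test- | Disease-)
    prevalence = (a + c) / (a + b + c + d)    # sample estimate of P(Disease+)

    # Bayes' rule for the positive predictive value, P(Disease+ | Test+).
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    )
    # Bayes' rule for the negative predictive value, P(Disease- | Test-).
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "prevalence": prevalence,
        "ppv": ppv,
        "npv": npv,
    }


# Hypothetical counts, not taken from any study in this handout.
measures = screening_measures(a=90, b=20, c=10, d=880)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

When the prevalence is estimated from the same table, the Bayes' rule expressions simplify to a/(a+b) for the positive predictive value and d/(c+d) for the negative predictive value.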
Example: Suppose a test for chlamydial genital tract infection is given to 1300 female patients (age 20 –
24) in a particular clinic, and the results are as follows:
                  Test Result Positive   Test Result Negative   Total
Disease Present           309                     16             325
Disease Absent             49                    926             975
Total                     358                    942            1300
Questions:
25. What is the sensitivity of the test?
26. What is the specificity of the test?
27. Estimate P(Disease+) from the data. Using this estimate, what is the positive predictive value of
the test?
28. Estimate P(Disease-) from the data. Using this estimate, what is the negative predictive value of
the test?
Example: Now, suppose the same test is given in the maternity unit (for women age 20 – 24) where the
prevalence of the disease is much lower. These results follow:
                  Test Result Positive   Test Result Negative   Total
Disease Present           122                      6             128
Disease Absent            154                   2918            3072
Total                     276                   2924            3200
Questions:
29. What is the sensitivity of the test?
30. What is the specificity of the test?
31. Estimate P(Disease+) from the data. Using this estimate, what is the positive predictive value of
the test?
32. Estimate P(Disease-) from the data. Using this estimate, what is the negative predictive value of
the test?
33. Does the prevalence of the disease have any effect on test performance? Explain.
34. Do you think that estimating P(Disease+) from either set of data yields a “good” estimate for this
disease rate in the general population? Why or why not?
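One way to line up the clinic and maternity unit results is the Python sketch below. It uses the counts from the two tables and estimates each predictive value directly from the sample; the dictionary keys (tp, fn, fp, tn) are illustrative shorthand for the four cells and are not notation used elsewhere in this handout.

```python
# Cell counts from the two example tables.
# tp: disease present, test positive    fn: disease present, test negative
# fp: disease absent,  test positive    tn: disease absent,  test negative
settings = {
    "clinic": {"tp": 309, "fn": 16, "fp": 49, "tn": 926},
    "maternity": {"tp": 122, "fn": 6, "fp": 154, "tn": 2918},
}


def summarize(table):
    """Sample-based screening summaries for one 2x2 table."""
    n = sum(table.values())
    return {
        "prevalence": (table["tp"] + table["fn"]) / n,
        "sensitivity": table["tp"] / (table["tp"] + table["fn"]),
        "specificity": table["tn"] / (table["tn"] + table["fp"]),
        "ppv": table["tp"] / (table["tp"] + table["fp"]),
        "npv": table["tn"] / (table["tn"] + table["fn"]),
    }


summaries = {name: summarize(table) for name, table in settings.items()}
for name, summary in summaries.items():
    print(name)
    for measure, value in summary.items():
        print(f"  {measure}: {value:.4f}")
```

Comparing the two printed blocks shows which measures stay roughly the same across settings and which ones move with the disease rate in the group being tested.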
Comment: Positive (and negative) predictive values require some outside knowledge of the disease rate
for the general population. For example, the Centers for Disease Control and Prevention published 2009
surveillance data for some common sexually transmitted diseases. The national rate of reported chlamydia
in 2009 for females age 20 – 24 was 3273.9 cases per 100,000 population members.
Source: http://www.cdc.gov/std/stats09/surv2009-Complete.pdf
Question:
35. Now, using this “known” disease rate and the data from the clinic example (not the maternity
data), find the positive predictive value for the test.
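A minimal Python sketch of this calculation follows. It assumes the sensitivity and specificity are estimated from the clinic table (309/325 and 926/975) and that P(Disease+) is replaced by the reported national rate of 3273.9 per 100,000; the function name is hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(Disease+ | Test+) from Bayes' rule with an externally supplied prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Test characteristics estimated from the clinic example.
sensitivity = 309 / 325
specificity = 926 / 975

# "Known" disease rate: 3273.9 reported cases per 100,000 population.
national_prevalence = 3273.9 / 100_000

ppv_national = positive_predictive_value(sensitivity, specificity, national_prevalence)
print(f"PPV using the national rate = {ppv_national:.4f}")

# For comparison: the same formula with the clinic's own sample disease rate.
ppv_clinic = positive_predictive_value(sensitivity, specificity, 325 / 1300)
print(f"PPV using the clinic rate   = {ppv_clinic:.4f}")
```

Only the prevalence changes between the two calls, so any difference in the printed values comes entirely from the assumed disease rate.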