A Probability Lesson

FOR ANY HIGH SCHOOL MATH COURSE
TEACHING CONTEMPORARY MATHEMATICS
JULIE GRAVES
JANUARY 2016
A group of 300 persons was asked to sort
themselves just as you did. Here are their
results.
            Liberal   Moderate   Conservative   Total
Under 35       50        10           10           70
35 to 50       20        70           40          130
Over 50        10        50           40          100
Total          80       130           90          300
Now think about selecting a person at random from
this group of 300 individuals. “At random” implies that
each individual has a 1 in 300 chance of being selected.
P(select liberal) = 80/300 = 4/15
P(select conservative) = 90/300 = 3/10
P(select under 35) = 70/300 = 7/30
P(select over 50) = 100/300 = 1/3
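The single-category probabilities above are row or column totals divided by 300. A minimal sketch of that calculation, assuming the survey counts are stored in a nested dictionary (the names `counts`, `p_age`, and `p_view` are illustrative, not from the lesson):

```python
from fractions import Fraction

# Survey counts: rows are age groups, columns are political views.
counts = {
    "Under 35": {"Liberal": 50, "Moderate": 10, "Conservative": 10},
    "35 to 50": {"Liberal": 20, "Moderate": 70, "Conservative": 40},
    "Over 50":  {"Liberal": 10, "Moderate": 50, "Conservative": 40},
}

# Grand total of everyone surveyed.
total = sum(sum(row.values()) for row in counts.values())

def p_age(age):
    """P(select a person in the given age group): row total / grand total."""
    return Fraction(sum(counts[age].values()), total)

def p_view(view):
    """P(select a person with the given view): column total / grand total."""
    return Fraction(sum(row[view] for row in counts.values()), total)

print(p_view("Liberal"))       # 4/15
print(p_view("Conservative"))  # 3/10
print(p_age("Under 35"))       # 7/30
print(p_age("Over 50"))        # 1/3
```

Using `Fraction` keeps the answers in lowest terms, which matches the way the probabilities are written on the slides.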
P(select under 35 and liberal) = 50/300 = 1/6
P(select over 50 and conservative) = 40/300 = 2/15
Conditional probability
P(select under 35 given selected liberal) = 50/80 = 5/8
P(select under 35 given selected conservative) = 10/90 = 1/9
P(select over 50 given selected liberal) = 10/80 = 1/8
P(select 35 to 50 given selected moderate) = 70/130 = 7/13
P(select liberal given selected under 35) = 50/70 = 5/7
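A joint probability divides one cell by the grand total, while a conditional probability divides the same cell by a row or column total. The sketch below illustrates that difference; the helper names (`count`, `p_joint`, `p_age_given_view`, `p_view_given_age`) are hypothetical and only the counts come from the survey table.

```python
from fractions import Fraction

# Survey counts: rows are age groups, columns are political views.
counts = {
    "Under 35": {"Liberal": 50, "Moderate": 10, "Conservative": 10},
    "35 to 50": {"Liberal": 20, "Moderate": 70, "Conservative": 40},
    "Over 50":  {"Liberal": 10, "Moderate": 50, "Conservative": 40},
}

def count(age=None, view=None):
    """Number of people matching the age group and/or view (None means any)."""
    return sum(
        n
        for a, row in counts.items()
        for v, n in row.items()
        if (age is None or a == age) and (view is None or v == view)
    )

def p_joint(age, view):
    """P(age AND view): one cell over the grand total."""
    return Fraction(count(age, view), count())

def p_age_given_view(age, view):
    """P(age given view): one cell over its column total."""
    return Fraction(count(age, view), count(view=view))

def p_view_given_age(view, age):
    """P(view given age): one cell over its row total."""
    return Fraction(count(age, view), count(age=age))

print(p_joint("Under 35", "Liberal"))           # 1/6
print(p_age_given_view("Under 35", "Liberal"))  # 5/8
print(p_view_given_age("Liberal", "Under 35"))  # 5/7
print(p_age_given_view("35 to 50", "Moderate")) # 7/13
```

The same cell (50 people) appears in the first three results; only the denominator changes, from 300 to 80 to 70.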
To recap, we have determined the following:
P(select under 35) = 7/30
P(select liberal) = 4/15
P(select under 35 given selected liberal) = 5/8
P(select liberal given selected under 35) = 5/7
P(select under 35 AND liberal) = 1/6
Notice that P(select under 35 AND liberal) ≠ P(select under 35) × P(select liberal)
1/6 ≠ 7/30 ∙ 4/15
However,
P(select under 35 AND liberal) = P(select under 35) × P(select liberal given selected under 35)
1/6 = 7/30 ∙ 5/7
P(select under 35 AND liberal) = P(select liberal) × P(select under 35 given selected liberal)
1/6 = 4/15 ∙ 5/8
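These three comparisons can be checked with exact arithmetic. The sketch below uses only the counts from the survey table; the variable names are illustrative. The failed first comparison is what it means for "under 35" and "liberal" not to be independent events in this group.

```python
from fractions import Fraction

# Probabilities taken directly from the survey counts.
p_under35 = Fraction(70, 300)
p_liberal = Fraction(80, 300)
p_both = Fraction(50, 300)
p_under35_given_liberal = Fraction(50, 80)
p_liberal_given_under35 = Fraction(50, 70)

# Multiplying the two unconditional probabilities does NOT give the joint one.
print(p_under35 * p_liberal)                            # 14/225
print(p_both == p_under35 * p_liberal)                  # False

# Multiplying by the matching conditional probability does.
print(p_both == p_under35 * p_liberal_given_under35)    # True
print(p_both == p_liberal * p_under35_given_liberal)    # True
```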
Sensitivity and Specificity
These two terms both describe the effectiveness of a test
used to detect a disease, a trait, or the presence of a
marker in the blood.
Sensitivity is the rate at which a test identifies a disease when
the disease is present. Sensitivity is a conditional probability.
sensitivity = P(positive test result given that the disease is
present).
Sensitivity is sometimes called the “true positive rate”.
Specificity is the rate at which a test gives a correct negative
result when the disease is not present.
specificity = P(negative test result given that the disease is not
present).
Specificity is sometimes called the “true negative rate”.
We would like a test to have as high a specificity as
possible, and as high a sensitivity as possible.
Unfortunately, in most cases there is a trade-off
between specificity and sensitivity: increasing one rate
usually leads to a decrease in the other.
Concept of sensitivity and specificity
Lessons Learned
A test with sensitivity 0.92 will give a positive test result for
92% of patients who have the disease. This implies that 8% of
people who have the disease get a negative (and thus
incorrect) test result.
A test with specificity of 0.96 has a 96% chance of giving a
negative test result to a patient who does not have the
disease. The probability of a person who does not have the
disease getting a positive (and thus incorrect) test result is 4%.
Suppose also that the disease we are testing for is fairly
common. For example, it may be the case that 30% of the
population we can test has the disease.
Note that to an individual, the prevalence of the disease may
not be particularly important. Individuals are not usually
interested in how prevalent any particular disease is. Instead,
they typically want to know if they have the disease.
We will fill in the cells of a two-way table to show what we expect
to happen when the test is administered to 5000 individuals.
1500 = 0.30 × 5000
1500 individuals have the disease and 3500 do not have the disease.

                           Test positive   Test negative   Total
Disease actually present                                   1500
Disease not present                                        3500
Total                                                      5000
                           Test positive   Test negative   Total
Disease actually present   1380                            1500
Disease not present                                        3500
Total                                                      5000

sensitivity = 0.92 = P(positive test result given that the disease is present)
1380/1500 = 0.92
                           Test positive   Test negative   Total
Disease actually present   1380                            1500
Disease not present                        3360            3500
Total                                                      5000

specificity = 0.96 = P(negative test result given that the disease is not present)
3360/3500 = 0.96
                           Test positive   Test negative   Total
Disease actually present   1380            120             1500
Disease not present        140             3360            3500
Total                      1520            3480            5000
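The table can be filled in mechanically from three inputs: the prevalence, the sensitivity, and the specificity. The following is a minimal sketch of that bookkeeping, assuming expected counts rather than a real sample; `expected_table` is a hypothetical helper name, not something defined in the lesson.

```python
def expected_table(n, prevalence, sensitivity, specificity):
    """Expected two-way table counts when n people are tested."""
    diseased = prevalence * n
    healthy = n - diseased
    true_pos = sensitivity * diseased    # disease present, test positive
    false_neg = diseased - true_pos      # disease present, test negative
    true_neg = specificity * healthy     # disease not present, test negative
    false_pos = healthy - true_neg       # disease not present, test positive
    # Each row is (test positive, test negative, row total).
    return {
        "disease present": (true_pos, false_neg, diseased),
        "disease not present": (false_pos, true_neg, healthy),
        "total": (true_pos + false_pos, false_neg + true_neg, n),
    }

table = expected_table(5000, 0.30, 0.92, 0.96)
for label, cells in table.items():
    # Round to whole people for display, since the inputs are decimal rates.
    print(label, [round(c) for c in cells])
```

With these inputs the column totals come out to 1520 positive tests and 3480 negative tests, which together account for all 5000 individuals.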
Imagine selecting an individual at random from among the 5000 tested.
P(select positive test AND disease present)=
P(select positive test AND disease not present) =
P(select positive test given select disease present)=
P(select disease present given select positive test)=
P(select disease not present given select positive test)=
P(select positive test) = 1520/5000 ≈ 0.30
P(select positive test given select disease not present) = 140/3500 = 0.04
P(select positive test given select disease present) = 1380/1500 = 0.92
P(select disease present given select positive test) = 1380/1520 ≈ 0.91
P(select disease not present given select positive test) = 140/1520 ≈ 0.09
Among the 1520 individuals that test positive, 140 do not have the disease. This means
the test gave these individuals a false positive result. The false positive rate for this test is
140/1520 ≈ 0.09.
Among the 3480 individuals that test negative, 120 do actually have the disease. These
people received a false negative test result. The false negative rate for this test is
120/3480 ≈ 0.03.
We can carry out a comparable analysis for a test that has the
same sensitivity (0.92) and the same specificity (0.96), but now
we will test for a disease that is fairly rare. Suppose only 4% of the
population has the disease we are testing for.
200 = 0.04 × 5000
184 = 0.92 × 200
4608 = 0.96 × 4800

                           Test positive   Test negative   Total
Disease actually present   184                             200
Disease not present                        4608            4800
Total                                                      5000
                           Test positive   Test negative   Total
Disease actually present   184             16              200
Disease not present        192             4608            4800
Total                      376             4624            5000
4624 patients tested negative, and among these 16 actually had the
disease, so the false negative rate is 16/4624 ≈ 0.0035.
376 patients tested positive, and of these 192 did not have the disease.
This shows a false positive rate of 192/376 ≈ 0.51.
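To see the two cases side by side, the rates can be recomputed from the expected counts for each prevalence. This is a sketch under the lesson's definitions, where the false positive rate is taken among those who test positive and the false negative rate among those who test negative; the function name `error_rates` and its default arguments are assumptions made for illustration.

```python
def error_rates(prevalence, sensitivity=0.92, specificity=0.96, n=5000):
    """Return (false positive rate, false negative rate) as the lesson defines them."""
    diseased = prevalence * n
    healthy = n - diseased
    false_neg = (1 - sensitivity) * diseased   # diseased but test negative
    false_pos = (1 - specificity) * healthy    # healthy but test positive
    test_pos = sensitivity * diseased + false_pos
    test_neg = specificity * healthy + false_neg
    return false_pos / test_pos, false_neg / test_neg

for prevalence in (0.30, 0.04):
    fp, fn = error_rates(prevalence)
    print(f"prevalence {prevalence:.2f}: "
          f"false positive {fp:.4f}, false negative {fn:.4f}")
# prevalence 0.30: false positive 0.0921, false negative 0.0345
# prevalence 0.04: false positive 0.5106, false negative 0.0035
```

The test itself is identical in both runs; only the `prevalence` argument differs.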
What happened to cause such dramatic changes in the false
positive and false negative rates?
The only difference between the tables was the prevalence
of the disease, i.e. the probability that an individual in the
population actually has the disease.
We want to understand how the prevalence of the disease
influences the false positive and false negative rates for this
test.
Let p represent the probability that a randomly selected
individual has the disease or trait we are testing for. This
number is our measure of the prevalence of the disease.
p = P(select an individual who has the disease)
If we use N to represent the population size, we can
complete a two-way table. For now, we will continue to
use the values 0.96 for specificity and 0.92 for sensitivity.
We will fill in the table cells to show what we expect to
happen when the test is administered to N individuals.
                           Test positive    Test negative     Total
Disease actually present   0.92pN                             pN
Disease not present                         0.96(1-p)N        (1-p)N
Total                                                         N
We can fill in the other cells.
                           Test positive    Test negative     Total
Disease actually present   0.92pN           0.08pN            pN
Disease not present        0.04(1-p)N       0.96(1-p)N        (1-p)N
Total                                                         N
Algebra will help us find the column totals.
                           Test positive    Test negative     Total
Disease actually present   0.92pN           0.08pN            pN
Disease not present        0.04(1-p)N       0.96(1-p)N        (1-p)N
Total                      (0.88p+0.04)N    (-0.88p+0.96)N    N
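Each column total is the sum of the two cells above it, with the terms in p collected. A worked version of that algebra, using only the cell entries from the table:

```latex
\begin{align*}
\text{Test positive total} &= 0.92pN + 0.04(1-p)N \\
  &= (0.92p + 0.04 - 0.04p)N \\
  &= (0.88p + 0.04)N \\[4pt]
\text{Test negative total} &= 0.08pN + 0.96(1-p)N \\
  &= (0.08p + 0.96 - 0.96p)N \\
  &= (-0.88p + 0.96)N
\end{align*}
```

As a check, the two totals add up to N, since (0.88p + 0.04) + (-0.88p + 0.96) = 1.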
To find the false positive rate, we need the probability that a
patient who tests positive does not have the disease.
That is, P(select disease not present given selected test positive).
The false positive rate is
$\frac{0.04(1-p)N}{(0.88p+0.04)N} = \frac{-0.04p+0.04}{0.88p+0.04}$
To find the false negative rate, we determine this probability:
P(select disease present given selected test negative)
The false negative rate is
$\frac{0.08pN}{(-0.88p+0.96)N} = \frac{0.08p}{-0.88p+0.96}$
So the false positive rate and the false negative rate are each a
function of the prevalence (p) of the disease in the population.
The false positive rate is $\frac{-0.04p+0.04}{0.88p+0.04}$.
The false negative rate is $\frac{0.08p}{-0.88p+0.96}$.
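One way to gain confidence in these two formulas is to substitute the prevalences used earlier (30% and 4%) and compare against the fractions read off the tables of 5000 individuals. The sketch below does this with exact fractions; the function names are illustrative, and the coefficients are the ones derived above for sensitivity 0.92 and specificity 0.96.

```python
from fractions import Fraction

def false_positive_rate(p):
    """(-0.04p + 0.04) / (0.88p + 0.04), using exact fractions."""
    return (Fraction(-4, 100) * p + Fraction(4, 100)) / (Fraction(88, 100) * p + Fraction(4, 100))

def false_negative_rate(p):
    """0.08p / (-0.88p + 0.96), using exact fractions."""
    return (Fraction(8, 100) * p) / (Fraction(-88, 100) * p + Fraction(96, 100))

common = Fraction(30, 100)  # prevalence of 30%
rare = Fraction(4, 100)     # prevalence of 4%

print(false_positive_rate(common))  # 7/76, the same as 140/1520
print(false_negative_rate(common))  # 1/29, the same as 120/3480
print(false_positive_rate(rare))    # 24/47, the same as 192/376
print(false_negative_rate(rare))    # 1/289, the same as 16/4624
```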
Questions
A test for malaria has a 95% true positive rate and a 98% true negative rate.
If 0.08% of residents of the US have malaria, what is the probability that an individual
who tests negative actually has malaria?
If 45% of the population in Ghana has malaria, what is the false positive rate?
How high must the sensitivity of the test be to ensure that the false negative rate is
below 25%?
What prevalence of disease results in a false positive rate of under 10%? For what
disease prevalence is the false negative rate higher than 50%?
We can study the function $y = \frac{-0.04x+0.04}{0.88x+0.04}$ to see how the false positive
rate (y) varies as the prevalence of the disease (x) changes.
The function $y = \frac{0.08x}{-0.88x+0.96}$ represents the relationship between the false
negative rate and the disease prevalence.
False positive and false negative rates as
functions of disease prevalence
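The shape of the two curves can be explored numerically by tabulating both functions over a range of prevalences. This is a rough sketch for sensitivity 0.92 and specificity 0.96; the sample prevalences in `prevalences` are arbitrary choices made here, not values from the lesson.

```python
def false_positive_rate(x):
    """y = (-0.04x + 0.04) / (0.88x + 0.04) for prevalence x."""
    return (-0.04 * x + 0.04) / (0.88 * x + 0.04)

def false_negative_rate(x):
    """y = 0.08x / (-0.88x + 0.96) for prevalence x."""
    return (0.08 * x) / (-0.88 * x + 0.96)

# Arbitrary sample prevalences between 0 and 1.
prevalences = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90]
rows = [(x, false_positive_rate(x), false_negative_rate(x)) for x in prevalences]

print("prevalence  false positive  false negative")
for x, fp, fn in rows:
    print(f"{x:10.2f}  {fp:14.3f}  {fn:14.3f}")
```

Reading down the columns shows the false positive rate falling and the false negative rate rising as the disease becomes more common.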
We can generalize even further…
p = prevalence of disease in population
F = specificity of test
E = sensitivity of test
                           Test positive        Test negative        Total
Disease actually present   EpN                  (1-E)pN              pN
Disease not present        (1-F)(1-p)N          F(1-p)N              (1-p)N
Total                      ((1-F)(1-p)+Ep)N     ((1-E)p+F(1-p))N     N
The false positive rate is $\frac{(1-F)(1-p)}{(1-F)(1-p)+Ep}$.
The false negative rate is $\frac{(1-E)p}{F(1-p)+(1-E)p}$.
We can think of any one of p, F, or E as the independent
variable, with the other two as parameters.
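As one illustration of this idea, the sketch below writes the two general formulas as functions of p, F, and E, then treats the sensitivity E as the independent variable while holding p and F fixed. The fixed values p = 0.45 and F = 0.98 are borrowed from the malaria questions purely as an example, and the list of E values is an arbitrary choice.

```python
def false_positive_rate(p, F, E):
    """P(disease not present given test positive), as the lesson defines it."""
    return (1 - F) * (1 - p) / ((1 - F) * (1 - p) + E * p)

def false_negative_rate(p, F, E):
    """P(disease present given test negative), as the lesson defines it."""
    return (1 - E) * p / (F * (1 - p) + (1 - E) * p)

# Sensitivity E is the independent variable; p and F are held as parameters.
p, F = 0.45, 0.98
for E in (0.50, 0.60, 0.70, 0.80, 0.90):
    rate = false_negative_rate(p, F, E)
    print(f"E = {E:.2f}: false negative rate = {rate:.3f}")
```

Setting F = 0.96 and E = 0.92 recovers the earlier formulas in p alone, so the same two functions cover every case in the lesson.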
Where might a problem like this fit in the
math courses you teach?
Questions?
[email protected]