Handling Uncertainty
Uncertain knowledge
• Typical example: Diagnosis.
Name    Toothache  …  Cavity
Smith   true       …  true
Mike    true       …  true
Mary    false      …  true
Quincy  true       …  false
…       …          …  …
• Can we certainly derive the diagnostic rule:
if Toothache=true then Cavity=true ?
• The problem is that this rule isn’t always right.
– Not all patients with toothache have cavities; some of them have gum
disease, an abscess, etc.
• We could try turning the rule into a causal rule:
– if Cavity=true then Toothache=true
• But this rule isn’t necessarily right either; not all cavities cause pain.
Belief and Probability
• The connection between toothaches and cavities is not a logical
consequence in either direction.
• However, we can assign a degree of belief to such rules. Our
main tool for this is probability theory.
• E.g. We might not know for sure what afflicts a particular
patient, but we believe that there is, say, an 80% chance – that is,
probability 0.8 – that the patient has a cavity if he has a toothache.
– We usually get this belief from statistical data.
Syntax
• Basic element: random variable – corresponds to an “attribute”
of data.
e.g., Cavity (do I have a cavity?) is one of <cavity, ¬cavity>
Weather is one of <sunny,rainy,cloudy,snow>
• Both Cavity and Weather are discrete random variables
– Domain values must be
• exhaustive and
• mutually exclusive
• Elementary propositions are constructed by the assignment of a
value to a random variable:
e.g., Cavity = ¬cavity,
Weather = sunny
Prior probability and distribution
• Prior or unconditional probability associated with a proposition is
the degree of belief accorded to it in the absence of any other
information.
e.g., P(Cavity = cavity) = 0.1
(or abbrev. P(cavity) = 0.1)
P(Weather = sunny) = 0.7 (or abbrev. P(sunny) = 0.7)
• Probability distribution gives values for all possible assignments:
P(Weather = sunny) = 0.7
P(Weather = rain) = 0.2
P(Weather = cloudy) = 0.08
P(Weather = snow) = 0.02
Conditional probability
• E.g., P(cavity | toothache) = 0.8
i.e., probability of cavity given that toothache is all I know
• It can be interpreted as the probability that the rule
if Toothache=true then Cavity=true
holds.
• Definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b)
if P(b) > 0
• Product rule gives an alternative formulation:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
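A quick numeric sanity check of the definition and the product rule (a minimal Python sketch; the counts are hypothetical, chosen only so that P(cavity | toothache) comes out to 0.8 as in the example that follows):

# Hypothetical counts over patient records (not the table above).
total = 100                      # number of records
toothache = 20                   # records with Toothache = true
cavity_and_toothache = 16        # records with Cavity = true AND Toothache = true

p_b = toothache / total                    # P(toothache)
p_a_and_b = cavity_and_toothache / total   # P(cavity AND toothache)

p_a_given_b = p_a_and_b / p_b              # P(cavity | toothache) = 0.8
print(p_a_given_b)

# Product rule: P(a AND b) = P(a | b) * P(b) = P(b | a) * P(a)
p_a = 0.25                                 # hypothetical P(cavity)
p_b_given_a = p_a_and_b / p_a              # P(toothache | cavity) = 0.64
assert abs(p_a_given_b * p_b - p_b_given_a * p_a) < 1e-12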
Bayes' Rule
• Product rule
P(a∧b) = P(a | b) P(b) = P(b | a) P(a)
Bayes' rule:
P(a | b) = P(b | a) P(a) / P(b)
• Useful for assessing diagnostic probability from causal probability as:
P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect)
• Bayes’s rule is useful in practice because there are many cases
where we do have good probability estimates for these three
numbers and need to compute the fourth.
Applying Bayes’ rule
For example,
• A doctor knows that meningitis causes the patient to have a
stiff neck 50% of the time.
• The doctor also knows some unconditional facts:
the prior probability that a patient has meningitis is 1/50,000, and
the prior probability that any patient has a stiff neck is 1/20.
• So, what do we have in terms of probabilities?
Bayes’ rule (cont’d)
P(StiffNeck=true | Meningitis=true) = 0.5
P(Meningitis=true) = 1/50000
P(StiffNeck=true) = 1/20
P(Meningitis=true | StiffNeck=true)
= P(StiffNeck=true | Meningitis=true) P(Meningitis=true) / P(StiffNeck=true)
= (0.5) * (1/50000) / (1/20)
= 0.0002
That is, we expect only 1 in 5000 patients with a stiff neck to have meningitis.
This is still a very small chance; the reason is the very small prior probability.
Also, observe that
P(Meningitis=false | StiffNeck=true)
= P(StiffNeck=true | Meningitis=false) P(Meningitis=false) / P(StiffNeck=true)
1/ P(StiffNeck=true) is the same for both conditional probabilities.
It is called the normalization constant (denoted as α).
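The meningitis numbers above can be checked with a few lines of Python (a minimal sketch of the arithmetic, not part of the original slides):

# Bayes' rule applied to the meningitis example.
p_stiff_given_men = 0.5        # P(StiffNeck=true | Meningitis=true)
p_men = 1 / 50000              # P(Meningitis=true)
p_stiff = 1 / 20               # P(StiffNeck=true)

# P(m | s) = P(s | m) * P(m) / P(s)
p_men_given_stiff = p_stiff_given_men * p_men / p_stiff
print(p_men_given_stiff)       # 0.0002, i.e. 1 in 5000

# Equivalently, via the normalization constant alpha = 1 / P(StiffNeck=true):
alpha = 1 / p_stiff
print(alpha * p_stiff_given_men * p_men)   # again 0.0002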
Bayes’ rule -- more vars
P(cause | effect1, effect2)
  = P(cause, effect1, effect2) / P(effect1, effect2)
  = α P(cause, effect1, effect2)
  = α P(effect1, effect2, cause)
  = α P(effect1 | effect2, cause) P(effect2, cause)
  = α P(effect1 | effect2, cause) P(effect2 | cause) P(cause)
• Although the effect1 might not be independent of effect2, it might be that given
the cause they are independent.
• E.g.
effect1 is ‘abilityInReading’
effect2 is ‘lengthOfArms’
• There is indeed a dependence of abilityInReading on lengthOfArms: people
with longer arms read better than those with short arms…
• However, given the cause ‘Age’ the abilityInReading is independent of
lengthOfArms.
Naive Bayes
• Two assumptions: Attributes (effects) are
– equally important
– conditionally independent (given the class value)
P(cause| effect1, effect2 )
= αP(effect1 | effect2 , cause) P(effect2 | cause) P(cause)
= αP(effect1 | cause) P(effect2 | cause) P(cause)
P(cause| effect1 ,...,effectn ) = αP(effect1 | cause)...P(effectn | cause) P(cause)
• This means that knowledge about the value of a particular attribute doesn’t
tell us anything about the value of another attribute (if the class is known)
• Although the formula is based on assumptions that are almost never correct,
this scheme works well in practice!
Weather Data
Here we don’t really have
effects, but rather
“evidence.”
Naïve Bayes for classification
• Classification learning:
what’s the probability of the class given an instance?
– Instance (Evidence E) E1=e1, E2=e2, …, En=en
– Class C = {c, …}
• Naïve Bayes assumption:
evidence can be split into independent parts (i.e. attributes of
instance!)
P(c | E)=P(c | e1,e2,…, en)
= P(e1|c) P(e2|c)… P(en|c) P(c) / P(e1,e2,…, en)
The weather data example
P(play=yes | E) =
P(Outlook=Sunny | play=yes) *
P(Temp=Cool | play=yes) *
P(Humidity=High | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E) =
(2/9) *
(3/9) *
(3/9) *
(3/9) *
(9/14) / P(E) = 0.0053 / P(E)
Don’t worry about the 1/P(E); it’s
α, the normalization constant.
The weather data example
P(play=no | E) =
P(Outlook=Sunny | play=no) *
P(Temp=Cool | play=no) *
P(Humidity=High | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E) =
(3/5) *
(1/5) *
(4/5) *
(3/5) *
(5/14) / P(E) = 0.0206 / P(E)
Normalization constant
P(play=yes | E) + P(play=no | E) = 1
i.e.  0.0053 / P(E) + 0.0206 / P(E) = 1
i.e.  P(E) = 0.0053 + 0.0206
So,
P(play=yes | E) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(play=no | E)  = 0.0206 / (0.0053 + 0.0206) = 79.5%
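The same arithmetic as a short Python sketch (it only reproduces the numbers above):

# Unnormalized Naive Bayes scores for the weather-data example,
# then normalization (alpha = 1 / P(E)).
score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~ 0.0053
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~ 0.0206

alpha = 1 / (score_yes + score_no)
print(alpha * score_yes)   # ~ 0.205  -> P(play=yes | E)
print(alpha * score_no)    # ~ 0.795  -> P(play=no  | E)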
The “zero-frequency problem”
• What if an attribute value doesn’t occur with every class value
  (e.g. “Outlook=overcast” for class “Play=no”)?
  – The probability P(Outlook=overcast | play=no) will be zero!
  – Then P(play=no | E) will also be zero, no matter how likely the
    other values are!
• Solution:
  – Add 1 to the count for every attribute value–class combination
    (Laplace estimator);
  – Add k (the number of possible attribute values) to the
    denominator (see the example below).
• Example: the earlier computation
  P(play=yes | E) =
    P(Outlook=Sunny | play=yes) * P(Temp=Cool | play=yes) *
    P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
  = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E)
  becomes instead:
  = ((2+1)/(9+3)) * ((3+1)/(9+3)) * ((3+1)/(9+2)) * ((3+1)/(9+2)) * (10/16) / P(E)
  = 0.0069 / P(E)
  (The 3 in the first denominator is the number of possible values for ‘Outlook’;
   the 2 in the last denominator is the number of possible values for ‘Windy’.)
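A minimal sketch of the Laplace estimator (the helper name laplace is ours, purely for illustration):

def laplace(count_value_and_class, count_class, k):
    """Estimate P(attribute=value | class): add 1 to the count and
    k (the number of possible attribute values) to the denominator."""
    return (count_value_and_class + 1) / (count_class + k)

# Outlook=overcast never occurs with play=no (0 of the 5 'no' days;
# Outlook has k=3 possible values), so the estimate is 1/8 instead of 0:
print(laplace(0, 5, 3))   # 0.125
# Outlook=Sunny with play=yes: (2+1)/(9+3)
print(laplace(2, 9, 3))   # 0.25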
Missing values
• Training phase : instance will not be included in the frequency
count for “attribute value-class” combination
• Classification phase : attribute will be omitted from calculation
• Example (the instance’s Outlook value is missing; the Laplace counts from the previous slide are used):
P(play=yes | E) =
P(Temp=Cool | play=yes) *
P(Humidity=High | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E) =
(4/12)*(4/11)*(4/11)*(10/16) /
P(E) = 0.0275 / P(E)
P(play=no | E) =
P(Temp=Cool | play=no) *
P(Humidity=High | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E) =
(2/8)*(5/7)*(4/7)*(6/16) / P(E) =
0.0383 / P(E)
After normalization: P(play=yes | E) = 42%,
P(play=no | E) = 58%
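A minimal sketch of how the missing Outlook value is handled at classification time (None marks the missing attribute; the Laplace counts above are reused):

def nb_score(cond_probs, prior):
    """Unnormalized Naive Bayes score; factors for missing attribute
    values (None) are simply left out of the product."""
    score = prior
    for p in cond_probs:
        if p is not None:
            score *= p
    return score

score_yes = nb_score([None, 4/12, 4/11, 4/11], 10/16)   # ~ 0.0275
score_no  = nb_score([None, 2/8, 5/7, 4/7], 6/16)       # ~ 0.0383
alpha = 1 / (score_yes + score_no)
print(alpha * score_yes, alpha * score_no)               # ~ 0.42, 0.58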
Dealing with numeric attributes
• Usual assumption: attributes have a normal or Gaussian
probability distribution (given the class).
• Probability density function for the normal distribution is:
  f(x | class) = (1 / (σ√(2π))) · e^( -(x-µ)² / (2σ²) )
• We approximate µ by the sample mean:
  x̄ = (1/n) · Σ_{i=1..n} x_i
• We approximate σ² by the sample variance:
  σ² = (1/(n-1)) · Σ_{i=1..n} (x_i - x̄)²
Weather Data
outlook   temperature  humidity  windy  play
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes
rainy     65           70        TRUE   no
overcast  64           65        TRUE   yes
sunny     72           95        FALSE  no
sunny     69           70        FALSE  yes
rainy     75           80        FALSE  yes
sunny     75           70        TRUE   yes
overcast  72           90        TRUE   yes
overcast  81           75        FALSE  yes
rainy     71           91        TRUE   no

We need to compute: f(Temperature=66 | yes)

f(Temperature=66 | yes) = e^( -((66-m)^2 / (2*var)) ) / sqrt(2*3.14*var)

m = (83+70+68+64+69+75+75+72+81) / 9 = 73
var = ( (83-73)^2 + (70-73)^2 + (68-73)^2 + (64-73)^2 + (69-73)^2 +
        (75-73)^2 + (75-73)^2 + (72-73)^2 + (81-73)^2 ) / (9-1) = 38

f(Temperature=66 | yes) = e^( -((66-73)^2 / (2*38)) ) / sqrt(2*3.14*38) = .034
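The mean, variance, and density above can be reproduced with a short Python sketch (a sanity check, using the n-1 sample variance as in the formula):

import math

temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]   # temperatures of the 'yes' days

n = len(temps_yes)
mean = sum(temps_yes) / n                                   # 73
var = sum((x - mean) ** 2 for x in temps_yes) / (n - 1)     # 38

def gaussian(x, mean, var):
    """Normal density: exp(-(x-mean)^2 / (2*var)) / sqrt(2*pi*var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(gaussian(66, mean, var))   # ~ 0.034 = f(Temperature=66 | yes)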
Weather Data
(Same data table as above.)

We compute similarly:
f(Humidity=90 | yes) = e^( -((90-m)^2 / (2*var)) ) / sqrt(2*3.14*var)

m = (86+96+80+65+70+80+70+90+75) / 9 = 79
var = ( (86-79)^2 + (96-79)^2 + (80-79)^2 + (65-79)^2 + (70-79)^2 +
        (80-79)^2 + (70-79)^2 + (90-79)^2 + (75-79)^2 ) / (9-1) = 104

f(Humidity=90 | yes) = e^( -((90-79)^2 / (2*104)) ) / sqrt(2*3.14*104) = .022
Classifying a new day
• A new day E: (Outlook=Sunny, Temp=66, Humidity=90, Windy=True)
P(play=yes | E) =
P(Outlook=Sunny | play=yes) *
P(Temp=66 | play=yes) *
P(Humidity=90 | play=yes) *
P(Windy=True | play=yes) *
P(play=yes) / P(E) =
= (2/9) * (0.034) * (0.022) * (3/9)
*(9/14) / P(E) = 0.000036 / P(E)
P(play=no | E) =
P(Outlook=Sunny | play=no) *
P(Temp=66 | play=no) *
P(Humidity=90 | play=no) *
P(Windy=True | play=no) *
P(play=no) / P(E) =
= (3/5) * (0.0291) * (0.038) * (3/5)
*(5/14) / P(E) = 0.000136 / P(E)
After normalization: P(play=yes | E) = 20.9%,
P(play=no | E) = 79.1%
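For completeness, a sketch combining the nominal probabilities with the two densities (it simply re-multiplies the factors listed above):

score_yes = (2/9) * 0.034 * 0.022 * (3/9) * (9/14)    # ~ 0.000036
score_no  = (3/5) * 0.0291 * 0.038 * (3/5) * (5/14)   # ~ 0.000136

alpha = 1 / (score_yes + score_no)
print(alpha * score_yes)   # ~ 0.209 -> P(play=yes | E)
print(alpha * score_no)    # ~ 0.791 -> P(play=no  | E)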
Naive Bayes – Tax Data
Refund and Marital Status are categorical attributes, Taxable Income is
continuous, and Evade is the class.

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Classify: (_, No, Married, 95K, ?)
(Apply also the Laplace normalization.)
Naive Bayes – Tax Data (cont’d)
(Same data table as above.)
Classify: (_, No, Married, 95K, ?)
(Apply also the Laplace normalization.)

P(Yes | E) = ?
P(Yes) = (3+1)/(10+2) = 0.33
P(Refund=No | Yes) = (3+1)/(3+2) = 0.8
P(Status=Married | Yes) = (0+1)/(3+3) = 0.17

f(income | Yes) = (1 / (σ√(2π))) · e^( -(x-µ)² / (2σ²) )
Approximate µ with: (95+85+90)/3 = 90
Approximate σ² with: ( (95-90)^2 + (85-90)^2 + (90-90)^2 ) / (3-1) = 25
f(income=95 | Yes) = e^( -((95-90)^2 / (2*25)) ) / sqrt(2*3.14*25) = .048

P(Yes | E) = α * .8 * .17 * .048 * .33 = α * .002154
Naive Bayes – Tax Data (cont’d)
(Same data table as above.)
Classify: (_, No, Married, 95K, ?)
(Apply also the Laplace normalization.)

P(No | E) = ?
P(No) = (7+1)/(10+2) = .67
P(Refund=No | No) = (4+1)/(7+2) = .556
P(Status=Married | No) = (4+1)/(7+3) = .5

f(income | No) = (1 / (σ√(2π))) · e^( -(x-µ)² / (2σ²) )
Approximate µ with: (125+100+70+120+60+220+75)/7 = 110
Approximate σ² with: ( (125-110)^2 + (100-110)^2 + (70-110)^2 + (120-110)^2 +
  (60-110)^2 + (220-110)^2 + (75-110)^2 ) / (7-1) = 2975
f(income=95 | No) = e^( -((95-110)^2 / (2*2975)) ) / sqrt(2*3.14*2975) = .00704

P(No | E) = α * .556 * .5 * .00704 * .67 = α * .001311
Naive Bayes – Tax Data (cont’d)
(Same data table as above.)
Classify: (_, No, Married, 95K, ?)
(Apply also the Laplace normalization.)

P(Yes | E) = α * .002154
P(No | E) = α * .001311
α = 1 / (.002154 + .001311) = 288.60
P(Yes | E) = 288.60 * .002154 = 0.62
P(No | E) = 288.60 * .001311 = 0.38
We predict “Yes.”
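The whole tax-data classification can be reproduced in a few lines (a sketch of the arithmetic above, with the Laplace-smoothed counts and the two Gaussian densities):

import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Evade = Yes: 3 training rows, incomes 95, 85, 90 -> mean 90, variance 25
p_yes = (3 + 1) / (10 + 2)
score_yes = ((3 + 1) / (3 + 2)) * ((0 + 1) / (3 + 3)) * gaussian(95, 90, 25) * p_yes

# Evade = No: 7 training rows, incomes 125, 100, 70, 120, 60, 220, 75
#             -> mean 110, variance 2975
p_no = (7 + 1) / (10 + 2)
score_no = ((4 + 1) / (7 + 2)) * ((4 + 1) / (7 + 3)) * gaussian(95, 110, 2975) * p_no

alpha = 1 / (score_yes + score_no)
print(alpha * score_yes, alpha * score_no)   # ~ 0.62 and 0.38 -> predict "Yes"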
Text Categorization
• Text categorization is the task of assigning a given document to
one of a fixed set of categories, on the basis of the text it contains.
• Naïve Bayes models are often used for this task.
• In these models,
the query variable is the document category, and
the effect variables are the presence or absence of each word in
the language.
• How can such a model be constructed, given as training data a
set of documents that have been assigned to categories?
Text Categorization
• The model consists of the prior probability P(Category) and
the conditional probabilities P(Wordi | Category) for every word
in the language
– For each category c, P(Category = c) is estimated as the
fraction of all the “training” documents that are of that
category.
– Similarly, P(Wordi = true | Category = c) is estimated as the
fraction of documents of category c that contain wordi.
– Also, P(Wordi = true | Category = ¬c) is estimated as the
fraction of documents not of category c that contain wordi.
Text Categorization (cont’d)
• Now we can use naïve Bayes for classifying a new document:
• Assume Word1, …, Wordn are the words occurring in the new
document.
P(Category = c | Word1 = true, …, Wordn = true) =
α*P(Category = c)∏ni=1 P(Wordi = true | Category = c)
P(Category = ¬c | Word1 = true, …, Wordn = true) =
α*P(Category = ¬c)∏ni=1 P(Wordi = true | Category = ¬c)
• α is the normalization constant.
• Observe that, similarly to the “missing values” case,
the new document doesn’t contain every word for which we
computed the probabilities.
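A minimal sketch of such a classifier (the categories, words, and probability estimates below are hypothetical, purely for illustration; as noted above, only the words that actually occur in the new document contribute factors):

from math import prod

# Hypothetical training estimates (fractions of training documents).
p_category = {"spam": 0.3, "not_spam": 0.7}
p_word_given_category = {
    "spam":     {"money": 0.6, "free": 0.5, "meeting": 0.1},
    "not_spam": {"money": 0.1, "free": 0.1, "meeting": 0.4},
}

doc_words = ["money", "free"]   # words occurring in the new document

# P(Category=c | Word1=true, ..., Wordn=true)
#   = alpha * P(Category=c) * prod_i P(Wordi=true | Category=c)
scores = {c: p_category[c] * prod(p_word_given_category[c][w] for w in doc_words)
          for c in p_category}
alpha = 1 / sum(scores.values())
print({c: alpha * s for c, s in scores.items()})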