Probability: Review

• The state of the world is described using random
variables
• Probabilities are defined over events
– Sets of world states characterized by propositions
about random variables
– E.g., D1, D2: rolls of two dice
• P(D1 > 2)
• P(D1 + D2 = 11)
– W is the state of the weather
• P(W = “rainy” ∨ W = “sunny”)
Kolmogorov’s axioms of probability
• For any propositions (events) A and B:
– 0 ≤ P(A) ≤ 1
– P(True) = 1 and P(False) = 0
– P(A ∨ B) = P(A) + P(B) – P(A ∧ B)
Joint probability distributions
• A joint distribution is an assignment of
probabilities to every possible atomic event
P(Cavity, Toothache)
Cavity = false ∧ Toothache = false: 0.8
Cavity = false ∧ Toothache = true: 0.1
Cavity = true ∧ Toothache = false: 0.05
Cavity = true ∧ Toothache = true: 0.05
Marginal probability distributions
P(Cavity, Toothache)
Cavity = false ∧ Toothache = false: 0.8
Cavity = false ∧ Toothache = true: 0.1
Cavity = true ∧ Toothache = false: 0.05
Cavity = true ∧ Toothache = true: 0.05

P(Cavity)
Cavity = false: ?
Cavity = true: ?

P(Toothache)
Toothache = false: ?
Toothache = true: ?
Marginal probability distributions
• Given the joint distribution P(X, Y), how do we
find the marginal distribution P(X)?
P(X = x) = P((X = x ∧ Y = y1) ∨ … ∨ (X = x ∧ Y = yn))
         = P((x, y1) ∨ … ∨ (x, yn)) = ∑i=1..n P(x, yi)
• General rule: to find P(X = x), sum the
probabilities of all atomic events where X = x.
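As a concrete sketch of this sum rule (not from the slides, just an illustration in Python using the cavity/toothache table above):

```python
# Joint distribution P(Cavity, Toothache) from the table above,
# keyed by (cavity, toothache) truth values.
joint = {
    (False, False): 0.8,
    (False, True):  0.1,
    (True,  False): 0.05,
    (True,  True):  0.05,
}

def marginal(joint, axis):
    """Sum out all variables except the one at position `axis`."""
    dist = {}
    for assignment, p in joint.items():
        value = assignment[axis]
        dist[value] = dist.get(value, 0.0) + p
    return dist

print(marginal(joint, axis=0))  # P(Cavity):    ~{False: 0.9,  True: 0.1}
print(marginal(joint, axis=1))  # P(Toothache): ~{False: 0.85, True: 0.15}
```

These sums are exactly the values that fill in the "?" entries in the marginal tables above.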
Conditional probability
• For any two events A and B,
P(A | B) = P(A ∧ B) / P(B) = P(A, B) / P(B)
Conditional distributions
• A conditional distribution is a distribution over the values
of one variable given fixed values of other variables
P(Cavity, Toothache)
Cavity = false ∧ Toothache = false: 0.8
Cavity = false ∧ Toothache = true: 0.1
Cavity = true ∧ Toothache = false: 0.05
Cavity = true ∧ Toothache = true: 0.05

P(Cavity | Toothache = true)
Cavity = false: 0.667
Cavity = true: 0.333

P(Cavity | Toothache = false)
Cavity = false: 0.941
Cavity = true: 0.059

P(Toothache | Cavity = true)
Toothache = false: 0.5
Toothache = true: 0.5

P(Toothache | Cavity = false)
Toothache = false: 0.889
Toothache = true: 0.111
Normalization trick
• To get the whole conditional distribution P(X | y) at once,
select all entries in the joint distribution matching Y = y
and renormalize them to sum to one
P(Cavity, Toothache)
Cavity = false ∧ Toothache = false: 0.8
Cavity = false ∧ Toothache = true: 0.1
Cavity = true ∧ Toothache = false: 0.05
Cavity = true ∧ Toothache = true: 0.05

Select the entries with Cavity = false:
Toothache = false: 0.8
Toothache = true: 0.1

Renormalize to obtain P(Toothache | Cavity = false):
Toothache = false: 0.889
Toothache = true: 0.111
• Why does it work?
P(x, y) / ∑x′ P(x′, y) = P(x, y) / P(y) = P(x | y),
since ∑x′ P(x′, y) = P(y) by marginalization
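Here is a minimal Python sketch of the select-and-renormalize step, again using the cavity/toothache joint table (the function name is illustrative, not from the slides):

```python
# Joint table P(Cavity, Toothache), keyed by (cavity, toothache).
joint = {(False, False): 0.8, (False, True): 0.1,
         (True,  False): 0.05, (True,  True): 0.05}

def condition_on_cavity(joint, cavity_value):
    """P(Toothache | Cavity = cavity_value) via the normalization trick."""
    # Select the entries consistent with the evidence ...
    selected = {tooth: p for (cav, tooth), p in joint.items() if cav == cavity_value}
    # ... and renormalize; the denominator equals P(Cavity = cavity_value) by marginalization.
    total = sum(selected.values())
    return {tooth: p / total for tooth, p in selected.items()}

print(condition_on_cavity(joint, False))  # ~{False: 0.889, True: 0.111}, as in the table above
```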
Product rule
• Definition of conditional probability: P(A | B) = P(A, B) / P(B)
• Sometimes we have the conditional probability and want
to obtain the joint:
P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• The chain rule:
P(A1, …, An) = P(A1) P(A2 | A1) P(A3 | A1, A2) … P(An | A1, …, An−1)
             = ∏i=1..n P(Ai | A1, …, Ai−1)
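A quick numeric check of the product rule, using values taken from the cavity/toothache tables above (a sketch, not part of the slides):

```python
# Entries of the cavity/toothache example.
p_c_t = 0.05   # P(Cavity = true, Toothache = true)
p_c   = 0.10   # P(Cavity = true), from the marginal
p_t   = 0.15   # P(Toothache = true), from the marginal

p_t_given_c = p_c_t / p_c   # P(Toothache = true | Cavity = true) = 0.5
p_c_given_t = p_c_t / p_t   # P(Cavity = true | Toothache = true) ≈ 0.333

# Product rule: both factorizations recover the same joint probability.
assert abs(p_t_given_c * p_c - p_c_t) < 1e-12
assert abs(p_c_given_t * p_t - p_c_t) < 1e-12
```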
Bayes Rule
[Portrait: Rev. Thomas Bayes (1702–1761)]
• The product rule gives us two ways to factor
a joint distribution:
P(A, B) = P(A | B) P(B) = P(B | A) P(A)
• Therefore,
P(A | B) = P(B | A) P(A) / P(B)
• Why is this useful?
– Can get diagnostic probability, e.g., P(cavity | toothache),
from causal probability, e.g., P(toothache | cavity):
P(Cause | Evidence) = P(Evidence | Cause) P(Cause) / P(Evidence)
– Can update our beliefs based on evidence
– Important tool for probabilistic inference
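As a sketch of the diagnostic-from-causal pattern, here is the cavity/toothache example in Python, with P(toothache | cavity) = 0.5, P(cavity) = 0.1, and P(toothache) = 0.15 taken from the tables above:

```python
# Causal direction: how likely a cavity is to produce a toothache.
p_toothache_given_cavity = 0.5
p_cavity = 0.1
p_toothache = 0.15   # the normalizing constant P(Evidence)

# Diagnostic direction via Bayes rule.
p_cavity_given_toothache = p_toothache_given_cavity * p_cavity / p_toothache
print(p_cavity_given_toothache)   # ≈ 0.333, matching P(Cavity = true | Toothache = true) above
```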
Independence
• Two events A and B are independent if and only if
P(A, B) = P(A) P(B)
– In other words, P(A | B) = P(A) and P(B | A) = P(B)
– This is an important simplifying assumption for
modeling, e.g., Toothache and Weather can be
assumed to be independent
• Are two mutually exclusive events independent?
– No, but for mutually exclusive events we have
P(A ∨ B) = P(A) + P(B)
• Conditional independence: A and B are conditionally
independent given C iff P(A, B | C) = P(A | C) P(B | C)
Conditional independence: Example
• Toothache: boolean variable indicating whether the patient has a toothache
• Cavity: boolean variable indicating whether the patient has a cavity
• Catch: whether the dentist’s probe catches in the cavity
• If the patient has a cavity, the probability that the probe catches in it
doesn’t depend on whether he/she has a toothache:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
• Therefore, Catch is conditionally independent of Toothache given Cavity
• Likewise, Toothache is conditionally independent of Catch given Cavity:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
• Equivalent statement:
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
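A small sketch of this equivalence with made-up conditional probability tables (none of these numbers come from the slides): if the joint over Toothache, Catch, Cavity is built so that Toothache and Catch are conditionally independent given Cavity, then P(Toothache, Catch | Cavity) factors as claimed.

```python
from itertools import product

# Hypothetical conditional probability tables.
p_cavity = {True: 0.1, False: 0.9}
p_tooth_given_cavity = {True: 0.6, False: 0.1}   # P(Toothache = true | Cavity)
p_catch_given_cavity = {True: 0.9, False: 0.2}   # P(Catch = true | Cavity)

def bern(p_true, value):
    """Probability of a boolean value given P(value = true)."""
    return p_true if value else 1 - p_true

# Build the joint distribution under the conditional independence assumption.
joint3 = {}
for cav, tooth, catch in product([True, False], repeat=3):
    joint3[(tooth, catch, cav)] = (p_cavity[cav]
                                   * bern(p_tooth_given_cavity[cav], tooth)
                                   * bern(p_catch_given_cavity[cav], catch))

# Check P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity).
for cav, tooth, catch in product([True, False], repeat=3):
    lhs = joint3[(tooth, catch, cav)] / p_cavity[cav]
    rhs = bern(p_tooth_given_cavity[cav], tooth) * bern(p_catch_given_cavity[cav], catch)
    assert abs(lhs - rhs) < 1e-12
```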
Conditional independence: Example
• How many numbers do we need to represent the joint probability table
P(Toothache, Cavity, Catch)?
2³ – 1 = 7 independent entries
• Write out the joint distribution using the chain rule:
P(Toothache, Catch, Cavity)
= P(Cavity) P(Catch | Cavity) P(Toothache | Catch, Cavity)
= P(Cavity) P(Catch | Cavity) P(Toothache | Cavity)
• How many numbers do we need to represent these distributions?
1 + 2 + 2 = 5 independent numbers
• In most cases, the use of conditional independence reduces the size of
the representation of the joint distribution from exponential in n to linear in n
Naïve Bayes model
• Suppose we have many different types of observations (symptoms, features)
that we want to use to diagnose the underlying cause
• It is usually impractical to directly estimate or store the joint
distribution P(Cause, Effect1, …, Effectn)
• To simplify things, we can assume that the different effects are
conditionally independent given the underlying cause
• Then we can estimate the joint distribution as
P(Cause, Effect1, …, Effectn) = P(Cause) ∏i P(Effecti | Cause)
• This is usually not accurate, but very useful in practice
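A minimal sketch of evaluating this factorization for boolean effects; the cause prior and per-effect probabilities below are hypothetical placeholders, not values from the slides.

```python
import math

def naive_bayes_joint(p_cause, p_effects_given_cause, observed_effects):
    """P(cause, effect_1, ..., effect_n) = P(cause) * prod_i P(effect_i | cause).

    p_effects_given_cause[i] is P(effect_i = true | cause); observed_effects are booleans.
    """
    log_p = math.log(p_cause)   # logs avoid underflow when there are many effects
    for p_true, observed in zip(p_effects_given_cause, observed_effects):
        log_p += math.log(p_true if observed else 1 - p_true)
    return math.exp(log_p)

# Hypothetical numbers: P(cause) = 0.1, two effects with P(effect | cause) = 0.6 and 0.9.
print(naive_bayes_joint(0.1, [0.6, 0.9], [True, True]))   # 0.1 * 0.6 * 0.9 ≈ 0.054
```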
Example: Naïve Bayes Spam Filter
• Bayesian decision theory: to minimize the probability of error,
we should classify a message as spam if
P(spam | message) > P(¬spam | message)
– Maximum a posteriori (MAP) decision
• Apply Bayes rule to the posterior:
P(spam | message) = P(message | spam) P(spam) / P(message)
and
P(¬spam | message) = P(message | ¬spam) P(¬spam) / P(message)
• Notice that P(message) is just a constant normalizing factor
and doesn’t affect the decision
• Therefore, to classify the message, all we need is to find
P(message | spam) P(spam) and P(message | ¬spam) P(¬spam)
Example: Naïve Bayes Spam Filter
• We need to find P(message | spam) P(spam) and
P(message | ¬spam) P(¬spam)
• The message is a sequence of words (w1, …, wn)
• Bag of words representation
– The order of the words in the message is not important
– Each word is conditionally independent of the others given
the message class (spam or not spam)
Bag of words illustration
[Figure: US Presidential Speeches Tag Cloud, http://chir.ag/projects/preztags/]
Example: Naïve Bayes Spam Filter
• With the bag of words representation, the likelihood of a message
factorizes over its words:
P(message | spam) = P(w1, …, wn | spam) = ∏i=1..n P(wi | spam)
• Our filter will classify the message as spam if
P(spam) ∏i=1..n P(wi | spam) > P(¬spam) ∏i=1..n P(wi | ¬spam)
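A minimal sketch of this decision rule, assuming the prior and per-word likelihoods are already available; the dictionaries below are hypothetical placeholders, and log probabilities are compared to avoid numerical underflow on long messages.

```python
import math

def classify_as_spam(words, prior_spam, p_word_spam, p_word_ham):
    """Return True if P(spam) * prod P(w | spam) exceeds P(not spam) * prod P(w | not spam)."""
    spam_score = math.log(prior_spam)
    ham_score = math.log(1 - prior_spam)
    for w in words:
        spam_score += math.log(p_word_spam[w])
        ham_score += math.log(p_word_ham[w])
    return spam_score > ham_score

# Hypothetical parameters for a tiny vocabulary.
p_word_spam = {"free": 0.05, "meeting": 0.001}
p_word_ham  = {"free": 0.005, "meeting": 0.02}
print(classify_as_spam(["free", "free"], 0.33, p_word_spam, p_word_ham))   # True
```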
Example: Naïve Bayes Spam Filter
P(spam | w1, …, wn) ∝ P(spam) ∏i=1..n P(wi | spam)
(posterior ∝ prior × likelihood)
Parameter estimation
• In order to classify a message, we need to know the prior
P(spam) and the likelihoods P(word | spam) and
P(word | ¬spam)
– These are the parameters of the probabilistic model
– How do we obtain the values of these parameters?
[Figure: example parameters, e.g., prior P(spam) = 0.33, P(¬spam) = 0.67,
and word likelihood tables P(word | spam), P(word | ¬spam)]
Parameter estimation
• How do we obtain the prior P(spam) and the likelihoods
P(word | spam) and P(word | ¬spam)?
– Empirically: use training data
P(word | spam) = (# of word occurrences in spam messages) /
(total # of words in spam messages)
– This is the maximum likelihood (ML) estimate, or the estimate that
maximizes the likelihood of the training data:
∏d=1..D ∏i=1..nd P(wd,i | classd,i)
d: index of training document, i: index of a word
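A hedged sketch of computing these ML estimates from a tiny hypothetical training set; a real filter would also smooth the counts (e.g., Laplace smoothing) so that unseen words do not get zero probability.

```python
from collections import Counter

# Hypothetical labeled training data: (list of words, is_spam).
training = [
    (["free", "money", "free"], True),
    (["meeting", "tomorrow"], False),
    (["free", "meeting"], False),
]

spam_counts, ham_counts = Counter(), Counter()
n_spam_docs = 0
for words, is_spam in training:
    (spam_counts if is_spam else ham_counts).update(words)
    n_spam_docs += is_spam

# Maximum likelihood estimates.
prior_spam = n_spam_docs / len(training)   # fraction of training messages that are spam
total_spam_words = sum(spam_counts.values())
p_word_spam = {w: c / total_spam_words for w, c in spam_counts.items()}
# i.e., (# of word occurrences in spam messages) / (total # of words in spam messages)

print(prior_spam)            # 1/3
print(p_word_spam["free"])   # 2/3
```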