Review of Probability
Axioms of Probability Theory
Pr(A) denotes the probability that proposition A is true. (A is also called an event, or a random variable.)
1. 0 ≤ Pr(A) ≤ 1
2. Pr(True) = 1, Pr(False) = 0
3. Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B)
A Closer Look at Axiom 3
Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B)
[Venn diagram: events A and B inside the universe True; their overlap is A ∧ B]
Using the Axioms to prove new properties
Pr(A ∨ ¬A) = Pr(A) + Pr(¬A) − Pr(A ∧ ¬A)
Pr(True) = Pr(A) + Pr(¬A) − Pr(False)
1 = Pr(A) + Pr(¬A) − 0
Pr(¬A) = 1 − Pr(A)
We have proved the complement rule.
Probability of Events
• Sample space and events
– Sample space S (e.g., all people in an area)
– Events E1 ⊆ S (e.g., all people having a cough), E2 ⊆ S (e.g., all people having a cold)
• Prior (marginal) probabilities of events
– P(E) = |E| / |S| (frequency interpretation)
– P(E) = 0.1 (subjective probability)
– 0 <= P(E) <= 1 for all events
– Two special events, ∅ and S: P(∅) = 0 and P(S) = 1.0
• Boolean operators between events (to form compound events)
– Conjunctive (intersection): E1 ^ E2 (E1 ∩ E2)
– Disjunctive (union): E1 v E2 (E1 ∪ E2)
– Negation (complement): ~E (Eᶜ = S − E)
• Probabilities of compound events
– P(~E) = 1 − P(E), because P(~E) + P(E) = 1
– P(E1 v E2) = P(E1) + P(E2) − P(E1 ^ E2)
– But how to compute the joint probability P(E1 ^ E2)?
[Venn diagrams: E and ~E partition S; E1 and E2 overlap in E1 ^ E2]
• Conditional probability (of E1, given E2)
– How likely E1 occurs in the subspace of E2
P(E1 | E2) = |E1 ∩ E2| / |E2| = (|E1 ∩ E2| / |S|) / (|E2| / |S|) = P(E1 ∩ E2) / P(E2)
P(E1 ∩ E2) = P(E1 | E2) P(E2)
Venn diagrams and decision trees are very useful in proofs and reasoning.
The main thing to remember for Bayes:
P(E1 | E2) = |E1 ∩ E2| / |E2| = (|E1 ∩ E2| / |S|) / (|E2| / |S|) = P(E1 ∩ E2) / P(E2)
P(E1 ∩ E2) = P(E1 | E2) P(E2)
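To make the counting definition concrete, here is a minimal Python sketch (the sample space and event sets are invented for illustration) that computes P(E1 | E2) = |E1 ∩ E2| / |E2| directly from sets:

    # Hypothetical sample space and events, for illustration only
    S  = set(range(100))                 # 100 people
    E1 = set(range(0, 30))               # e.g., people with a cough
    E2 = set(range(20, 60))              # e.g., people with a cold

    p_e1_and_e2 = len(E1 & E2) / len(S)  # P(E1 ∩ E2)
    p_e2        = len(E2) / len(S)       # P(E2)

    p_e1_given_e2 = p_e1_and_e2 / p_e2   # P(E1 | E2) = P(E1 ∩ E2) / P(E2)
    assert abs(p_e1_given_e2 - len(E1 & E2) / len(E2)) < 1e-12
    print(p_e1_given_e2)                 # 0.25: 10 of the 40 people in E2 also have a cough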
Independence, Mutual Exclusion and Exhaustive Sets of Events
• Independence assumption
– Two events E1 and E2 are said to be independent of each other if
P(E1 | E2) = P(E1) (knowing E2 does not change the likelihood of E1)
– It can simplify the computation (see the numeric sketch after this slide):
P(E1 ∩ E2) = P(E1 | E2) P(E2) = P(E1) P(E2)
P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2) = P(E1) + P(E2) − P(E1) P(E2) = 1 − (1 − P(E1)) (1 − P(E2))
• Mutually exclusive (ME) and exhaustive (EXH) sets of events
– ME: Ei ∩ Ej = ∅ (P(Ei ∩ Ej) = 0) for all i, j = 1, …, n, i ≠ j
– EXH: E1 ∪ … ∪ En = S (P(E1 ∪ … ∪ En) = 1)
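A minimal numeric sketch of the independence shortcut, with invented probabilities P(E1) = 0.3 and P(E2) = 0.5:

    p_e1, p_e2 = 0.3, 0.5                  # hypothetical marginals, independent events

    p_and = p_e1 * p_e2                    # P(E1 ∩ E2) = P(E1) P(E2) under independence
    p_or  = p_e1 + p_e2 - p_and            # P(E1 ∪ E2) by axiom 3

    # Equivalent complement form: 1 - (1 - P(E1)) (1 - P(E2))
    assert abs(p_or - (1 - (1 - p_e1) * (1 - p_e2))) < 1e-12
    print(p_and, p_or)                     # 0.15, 0.65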
Mutually exclusive set of events:
Ei ∩ Ej = ∅ (P(Ei ∩ Ej) = 0) for all i, j = 1, …, n, i ≠ j
Exhaustive set of events:
E1 ∪ … ∪ En = S (P(E1 ∪ … ∪ En) = 1)
Mutually exclusive and exhaustive set of events: both conditions hold, i.e., the events do not overlap AND they cover all of S.
Random Variables
Discrete Random Variables
• X denotes a random variable.
• X can take on a finite number of values in the set {x1, x2, …, xn}.
• P(X = xi), or P(xi), is the probability that the random variable X takes on the value xi.
• P(·) is called the probability mass function.
• E.g., P(Room) = ⟨0.7, 0.2, 0.08, 0.02⟩
These are the four possible values of X; the sum of these values must be 1.0.
Discrete Random Variables: Visualization
• Finite set of possible outcomes: X ∈ {x1, x2, x3, …, xn}
P(xi) ≥ 0,  Σ_{i=1..n} P(xi) = 1
X binary: P(x) + P(¬x) = 1
[Bar chart: an example probability mass function over four outcomes X1–X4]
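A minimal sketch of a discrete PMF in Python; the four value names are invented (the Room example above gives only the probabilities):

    # Hypothetical PMF for the random variable Room, using the numbers from the example above
    pmf_room = {"kitchen": 0.7, "bedroom": 0.2, "bathroom": 0.08, "hall": 0.02}

    assert all(p >= 0 for p in pmf_room.values())      # P(xi) >= 0
    assert abs(sum(pmf_room.values()) - 1.0) < 1e-12   # probabilities sum to 1

    print(pmf_room["bedroom"])                          # P(Room = bedroom) = 0.2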
Continuous Random Variable
• Probability distribution (density function) over continuous values
X ∈ [0, 10]
P(x) ≥ 0
∫_0^10 P(x) dx = 1
P(5 ≤ x ≤ 7) = ∫_5^7 P(x) dx
[Plot: a density P(x) on [0, 10], with the area between x = 5 and x = 7 shaded]
Continuous Random Variables
• X takes on values in the continuum.
• p(X=x), or p(x), is a probability density function (PDF).
Pr(x ∈ [a, b]) = ∫_a^b p(x) dx
• E.g.
[Plot: an example density p(x) over x]
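A minimal sketch of the interval probability Pr(x ∈ [a, b]) = ∫_a^b p(x) dx, approximated numerically for an assumed uniform density on [0, 10]:

    # Uniform density on [0, 10] (an assumed example; any valid PDF would do)
    def p(x):
        return 0.1 if 0.0 <= x <= 10.0 else 0.0

    def prob_interval(a, b, n=10_000):
        """Approximate Pr(x in [a, b]) with a simple Riemann sum."""
        dx = (b - a) / n
        return sum(p(a + (i + 0.5) * dx) for i in range(n)) * dx

    print(prob_interval(5, 7))    # ~0.2 for the uniform density
    print(prob_interval(0, 10))   # ~1.0: the density integrates to 1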
Probability Distribution
• Probability distribution P(X | x)
– X is a random variable
  • Discrete
  • Continuous
– x is the background state of information
Joint and Conditional Probabilities
• Joint probabilities
P(x, y) = P(X = x ∧ Y = y)
– Probability that both X = x and Y = y
• Conditional probabilities
P(x | y) = P(X = x | Y = y)
– Probability that X = x given we know that Y = y
Joint and Conditional Probability
• P(X = x and Y = y) = P(x, y)
• If X and Y are independent, then P(x, y) = P(x) P(y)
• P(x | y) is the probability of x given y
P(x | y) = P(x, y) / P(y)
P(x, y) = P(x | y) P(y)
• If X and Y are independent, then P(x | y) = P(x)
Law of Total Probability
Discrete case:
Σ_x P(x) = 1
P(x) = Σ_y P(x, y)
P(x) = Σ_y P(x | y) P(y)

Continuous case:
∫ p(x) dx = 1
p(x) = ∫ p(x, y) dy
p(x) = ∫ p(x | y) p(y) dy
Rules of Probability: Marginalization
• Product rule
P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)
• Marginalization
P(Y) = Σ_{i=1..n} P(Y, xi)
X binary: P(Y) = P(Y, x) + P(Y, ¬x)
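A minimal sketch of marginalization and the law of total probability over a small joint table (the joint probabilities below are invented for illustration):

    # Hypothetical joint distribution P(X, Y) for binary X and Y
    joint = {("x", "y"): 0.30, ("x", "~y"): 0.10,
             ("~x", "y"): 0.20, ("~x", "~y"): 0.40}

    # Marginalization: P(Y = y) = sum over x of P(x, y)
    p_y = sum(p for (x, y), p in joint.items() if y == "y")     # 0.5

    # Conditional: P(X = x | Y = y) = P(x, y) / P(y)
    p_x_given_y = joint[("x", "y")] / p_y                       # 0.6

    # Law of total probability: P(x) = P(x | y) P(y) + P(x | ~y) P(~y)
    p_not_y = 1.0 - p_y
    p_x_given_not_y = joint[("x", "~y")] / p_not_y              # 0.2
    p_x = p_x_given_y * p_y + p_x_given_not_y * p_not_y         # 0.4
    assert abs(p_x - (joint[("x", "y")] + joint[("x", "~y")])) < 1e-12
    print(p_y, p_x_given_y, p_x)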
Questions and Problems
1. Axioms of probability theory.
2. Use of Veitch diagrams to understand the basics of probability theory.
3. Conditional probability.
4. Independence, mutual exclusion, and exhaustive sets of events.
5. Derivation of Bayes' theorem.
6. What are random variables?
7. What are discrete random variables?
8. What are continuous random variables?
9. Joint and conditional probabilities.
10. Law of total probability.
11. Marginalization rules.
12. What is the Gaussian (normal) distribution?
13. What is the mean value?
14. What is variance?
15. What are Gaussian networks?
Questions and Problems
1. Give an example of applying Bayesian reasoning to a real-life problem.
Gaussian, Mean and Variance
N(m, s)
Gaussian (normal) distributions:
P(x) = (1 / (√(2π) s)) exp(−(x − m)² / (2s²))
[Plots of N(m, s): curves with different mean, and curves with different variance]
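A minimal sketch of the Gaussian density as a Python function; m and s name the mean and standard deviation, matching N(m, s) above:

    import math

    def gaussian_pdf(x, m=0.0, s=1.0):
        """Density of N(m, s): (1 / (sqrt(2*pi) * s)) * exp(-(x - m)**2 / (2 * s**2))."""
        return math.exp(-((x - m) ** 2) / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)

    print(gaussian_pdf(0.0))                  # ~0.3989, peak of the standard normal
    print(gaussian_pdf(1.0, m=1.0, s=2.0))    # ~0.1995, peak of N(1, 2)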
Gaussian networks
Each variable is a linear function of its parents, with Gaussian noise.
Joint probability density functions:
[Figure: a two-node network X → Y and the corresponding joint density over X and Y]
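A minimal sketch of a two-node Gaussian network, assuming X ~ N(0, 1) and Y = a·X + b plus Gaussian noise (the coefficients a, b and the noise level are invented for illustration):

    import random

    def sample_xy(a=2.0, b=1.0, noise_sd=0.5):
        """Sample one (x, y) pair from the linear-Gaussian model X -> Y."""
        x = random.gauss(0.0, 1.0)                     # X ~ N(0, 1)
        y = a * x + b + random.gauss(0.0, noise_sd)    # Y is a linear function of X plus Gaussian noise
        return x, y

    samples = [sample_xy() for _ in range(5)]
    print(samples)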
Reverend Thomas Bayes (1702-1761)
Clergyman and mathematician who first used probability inductively. His research established a mathematical basis for probabilistic inference.
Bayes Rule
P(H, E) = P(H | E) P(E) = P(E | H) P(H)
⇒ P(H | E) = P(E | H) P(H) / P(E)
Example: E = smoke, H = cancer.
All people = 1000; 100 people smoke (so 1000 − 100 = 900 do not smoke); 40 people have cancer (so 1000 − 40 = 960 do not have cancer); 10 people smoke and have cancer.
[Venn diagram: the 100 smokers and the 40 cancer patients overlap in the 10 people who smoke and have cancer]
10/40 = probability that you smoke if you have cancer = P(smoke | cancer) = 25%
10/100 = probability that you have cancer if you smoke = P(cancer | smoke) = 10%
Checking this with Bayes rule, P(H | E) = P(E | H) P(H) / P(E):
P(Cancer | Smoke) = P(Smoke | Cancer) · P(Cancer) / P(Smoke)
P(Smoke) = 100/1000
P(Cancer) = 40/1000
P(Smoke | Cancer) = 10/40 = 25%
P(Cancer | Smoke) = (10/40) · (40/1000) / (100/1000) = (10/1000) / (100/1000) = 10/100 = 10%
Similarly, for E = not smoke, H = cancer:
P(Cancer | Not Smoke) = P(Not Smoke | Cancer) · P(Cancer) / P(Not Smoke)
P(Not Smoke | Cancer) = 30/40, P(Cancer) = 40/1000, P(Not Smoke) = 900/1000
P(Cancer | Not Smoke) = (30/40) · (40/1000) / (900/1000) = (30/1000) / (900/1000) = 30/900 = 1/30 ≈ 3.3%
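A minimal sketch that recomputes these numbers from the raw counts, both by direct counting and via Bayes rule:

    # Counts from the example above
    total, smokers, cancer, both = 1000, 100, 40, 10

    p_smoke  = smokers / total                 # P(Smoke)  = 0.10
    p_cancer = cancer / total                  # P(Cancer) = 0.04
    p_smoke_given_cancer = both / cancer       # P(Smoke | Cancer) = 0.25

    # Bayes rule: P(Cancer | Smoke) = P(Smoke | Cancer) P(Cancer) / P(Smoke)
    p_cancer_given_smoke = p_smoke_given_cancer * p_cancer / p_smoke
    assert abs(p_cancer_given_smoke - both / smokers) < 1e-12   # same as direct counting: 10/100
    print(p_cancer_given_smoke)                # 0.10

    # Non-smokers: P(Cancer | ~Smoke) = P(~Smoke | Cancer) P(Cancer) / P(~Smoke)
    p_cancer_given_nonsmoke = ((cancer - both) / cancer) * p_cancer / ((total - smokers) / total)
    print(p_cancer_given_nonsmoke)             # 30/900 ≈ 0.033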
Bayes’ Theorem with relative likelihood
• In the setting of diagnostic/evidential reasoning we have hypotheses H_i with priors P(H_i), and evidence/manifestations E_1, …, E_j, …, E_m with conditional probabilities P(E_j | H_i).
– Know the prior probability P(H_i) of each hypothesis and the conditional probability P(E_j | H_i)
– Want to compute the posterior probability P(H_i | E_j)
• Bayes' theorem (formula 1): P(H_i | E_j) = P(H_i) P(E_j | H_i) / P(E_j)
• If the purpose is to find which of the n hypotheses H_1, …, H_n is more plausible given E_j, then we can ignore the denominator and rank them using the relative likelihood
rel(H_i | E_j) = P(E_j | H_i) P(H_i)
Relative likelihood
• P(E_j) can be computed from P(E_j | H_i) and P(H_i), if we assume all hypotheses H_1, …, H_n are mutually exclusive (ME) and exhaustive (EXH):
P(E_j) = P(E_j ∧ (H_1 ∨ … ∨ H_n))        (by EXH)
       = Σ_{i=1..n} P(E_j ∧ H_i)         (by ME)
       = Σ_{i=1..n} P(E_j | H_i) P(H_i)
• Then we have another version of Bayes' theorem:
P(H_i | E_j) = P(E_j | H_i) P(H_i) / Σ_{k=1..n} P(E_j | H_k) P(H_k) = rel(H_i | E_j) / Σ_{k=1..n} rel(H_k | E_j)
where Σ_{k=1..n} P(E_j | H_k) P(H_k), the sum of the relative likelihoods of all n hypotheses, is a normalization factor.
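A minimal sketch of ranking hypotheses by relative likelihood and then normalizing to posteriors (the priors and conditional probabilities below are invented for illustration):

    # Hypothetical priors P(H_i) and conditionals P(E | H_i) for three ME & EXH hypotheses
    prior     = {"H1": 0.6, "H2": 0.3, "H3": 0.1}
    p_e_given = {"H1": 0.2, "H2": 0.7, "H3": 0.9}

    # Relative likelihood: rel(H_i | E) = P(E | H_i) P(H_i)
    rel = {h: p_e_given[h] * prior[h] for h in prior}

    # Normalize: P(H_i | E) = rel(H_i | E) / sum_k rel(H_k | E)
    z = sum(rel.values())                      # this sum is also P(E)
    posterior = {h: r / z for h, r in rel.items()}

    print(max(rel, key=rel.get))               # most plausible hypothesis given E
    print(posterior)                           # posteriors sum to 1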
Naïve Bayesian Approach
• Knowledge base:
– E_1, …, E_m: evidence/manifestations
– H_1, …, H_n: hypotheses/disorders
– E_j and H_i are binary, and the hypotheses form a ME & EXH set
– Conditional probabilities: P(E_j | H_i), i = 1, …, n, j = 1, …, m
• Case input: E_1, …, E_l
• Find the hypothesis H_i with the highest posterior probability P(H_i | E_1, …, E_l)
• By Bayes' theorem:
P(H_i | E_1, …, E_l) = P(E_1, …, E_l | H_i) P(H_i) / P(E_1, …, E_l)
• Assume all pieces of evidence are conditionally independent, given any hypothesis:
P(E_1, …, E_l | H_i) = Π_{j=1..l} P(E_j | H_i)
Absolute posterior probability
• The relative likelihood:
rel(H_i | E_1, …, E_l) = P(E_1, …, E_l | H_i) P(H_i) = P(H_i) Π_{j=1..l} P(E_j | H_i)
• The absolute posterior probability (substituting the relative likelihood):
P(H_i | E_1, …, E_l) = rel(H_i | E_1, …, E_l) / Σ_{k=1..n} rel(H_k | E_1, …, E_l)
                     = P(H_i) Π_{j=1..l} P(E_j | H_i) / Σ_{k=1..n} [P(H_k) Π_{j=1..l} P(E_j | H_k)]
• Evidence accumulation (when new evidence is discovered):
rel(H_i | E_1, …, E_l, E_{l+1}) = P(E_{l+1} | H_i) · rel(H_i | E_1, …, E_l)
rel(H_i | E_1, …, E_l, ~E_{l+1}) = (1 − P(E_{l+1} | H_i)) · rel(H_i | E_1, …, E_l)
Do not worry, many examples will follow
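As a first small example, here is a minimal sketch of the naïve Bayes update with evidence accumulation; the priors and conditional probabilities are hypothetical, and observed=False stands for the negated evidence ~E:

    # Hypothetical knowledge base: priors and P(E_j | H_i) for each evidence item
    prior = {"H1": 0.5, "H2": 0.5}
    p_e_given_h = {"E1": {"H1": 0.9, "H2": 0.2},
                   "E2": {"H1": 0.4, "H2": 0.8}}

    def accumulate(rel, e_name, observed=True):
        """rel(H_i | ..., E_new) = P(E_new | H_i) * rel; use 1 - P(E_new | H_i) for ~E_new."""
        return {h: r * (p_e_given_h[e_name][h] if observed else 1 - p_e_given_h[e_name][h])
                for h, r in rel.items()}

    rel = dict(prior)                               # start with rel(H_i) = P(H_i)
    rel = accumulate(rel, "E1", observed=True)      # E1 observed
    rel = accumulate(rel, "E2", observed=False)     # E2 absent (~E2)

    z = sum(rel.values())
    posterior = {h: r / z for h, r in rel.items()}
    print(posterior)                                # normalized posterior after both updates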
Bayesian Networks and Markov Models
• Bayesian AI
• Bayesian Filters
• Bayesian networks
• Decision networks
• Reasoning about changes over time
– Dynamic Bayesian Networks
– Markov models