Bayesian Models and Logistic Regression

Probability theory as a basis for the construction of classifiers:
• Multivariate probabilistic models
• Independence assumptions
• Naive Bayesian classifier
• Forest-augmented networks (FANs)
• Approximation: logistic regression

Notation

• Random (= statistical = stochastic) variable: upper-case letter, e.g. V or X, or upper-case string, e.g. RAIN
• Binary variables take one of two values: X = true (abbreviated x) and X = false (abbreviated ¬x)
• Conjunctions: (X = x) ∧ (Y = y) is written as X = x, Y = y
• Templates: X, Y means X = x, Y = y for any values x, y, i.e. the particular choice of the values x and y does not matter
• Σ_X P(X) = P(x) + P(¬x), where X is binary
Joint probability distribution

Joint (= multivariate) distribution: P(X1, X2, ..., Xn)

Example of a joint probability distribution P(X1, X2, X3):

P(x1, x2, x3) = 0.1
P(¬x1, x2, x3) = 0.05
P(x1, ¬x2, x3) = 0.10
P(x1, x2, ¬x3) = 0.0
P(¬x1, ¬x2, x3) = 0.3
P(x1, ¬x2, ¬x3) = 0.2
P(¬x1, x2, ¬x3) = 0.1
P(¬x1, ¬x2, ¬x3) = 0.15

Note that Σ_{X1,X2,X3} P(X1, X2, X3) = 1.

Marginalisation:
P(x3) = Σ_{X1,X2} P(X1, X2, x3) = 0.1 + 0.05 + 0.10 + 0.3 = 0.55
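A minimal Python sketch (not part of the original slides) of this example: the joint distribution as a lookup table, with the normalisation check and the marginalisation of X3.

# The example joint distribution; keys are (X1, X2, X3) truth values.
joint = {
    (True,  True,  True):  0.10,
    (False, True,  True):  0.05,
    (True,  False, True):  0.10,
    (True,  True,  False): 0.00,
    (False, False, True):  0.30,
    (True,  False, False): 0.20,
    (False, True,  False): 0.10,
    (False, False, False): 0.15,
}

assert abs(sum(joint.values()) - 1.0) < 1e-9   # probabilities sum to 1

# P(x3) = sum over X1, X2 of P(X1, X2, x3)
p_x3 = sum(p for (x1, x2, x3), p in joint.items() if x3)
print(p_x3)   # 0.55 (up to float rounding)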
Chain rule

Definition of conditional probability distribution:

P(X1 | X2, ..., Xn) = P(X1, X2, ..., Xn) / P(X2, ..., Xn)

⇒ P(X1, X2, ..., Xn) = P(X1 | X2, ..., Xn) P(X2, ..., Xn)

Furthermore,

P(X2, ..., Xn) = P(X2 | X3, ..., Xn) P(X3, ..., Xn)
...
P(Xn−1, Xn) = P(Xn−1 | Xn) P(Xn)
P(Xn) = P(Xn)

The chain rule yields the factorisation:

P(⋀_{i=1}^{n} Xi) = ∏_{i=1}^{n} P(Xi | ⋀_{k=i+1}^{n} Xk)
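The factorisation can be checked numerically on the example distribution; a sketch, with a hypothetical helper p that sums out every variable passed as None:

# Check P(x1, x2, x3) = P(x1 | x2, x3) P(x2 | x3) P(x3)
joint = {
    (True, True, True): 0.10,   (False, True, True): 0.05,
    (True, False, True): 0.10,  (True, True, False): 0.00,
    (False, False, True): 0.30, (True, False, False): 0.20,
    (False, True, False): 0.10, (False, False, False): 0.15,
}

def p(x1=None, x2=None, x3=None):
    """Joint/marginal probability; a variable set to None is summed out."""
    return sum(pr for (a, b, c), pr in joint.items()
               if x1 in (None, a) and x2 in (None, b) and x3 in (None, c))

lhs = p(True, True, True)
rhs = (p(True, True, True) / p(x2=True, x3=True)   # P(x1 | x2, x3)
       * p(x2=True, x3=True) / p(x3=True)          # P(x2 | x3)
       * p(x3=True))                                # P(x3)
assert abs(lhs - rhs) < 1e-12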
Chain rule - digraph

Does the chain rule help? Each factorisation corresponds to a digraph over X1, X2, X3:

[Figure: digraph (1), with arcs X2 → X1, X3 → X1 and X3 → X2; digraph (2), with arcs X1 → X2, X3 → X2 and X3 → X1]

Factorisation (1):
P(X1, X2, X3) = P(X1 | X2, X3) · P(X2 | X3) · P(X3)
i.e. we need:
P(x1 | x2, x3)
P(x1 | ¬x2, x3)
P(x1 | x2, ¬x3)
P(x1 | ¬x2, ¬x3)
P(x2 | x3)
P(x2 | ¬x3)
P(x3)

Note that P(¬x1 | x2, x3) = 1 − P(x1 | x2, x3), etc.
⇒ 7 probabilities required (as for P(X1, X2, X3) itself)

Other factorisation (2):
P(X1, X2, X3) = P(X2 | X1, X3) · P(X1 | X3) · P(X3)

⇒ different factorisations possible
Use stochastic independence

P(X1, X2, X3) = P(X2 | X1, X3) · P(X3 | X1) · P(X1)

Assume that X2 and X3 are conditionally independent given X1:

P(X2 | X1, X3) = P(X2 | X1)

and

P(X3 | X1, X2) = P(X3 | X1)

Notation: X2 ⫫ X3 | X1 and X3 ⫫ X2 | X1

[Figure: the digraph with arcs X1 → X2 and X1 → X3]

Only 5 = 2 + 2 + 1 probabilities are needed (instead of 7).

Definition Bayesian Network (BN)

A Bayesian network B is a pair B = (G, P), where:
• G = (V(G), A(G)) is an acyclic directed graph, with
  – V(G) = {X1, X2, ..., Xn}, a set of vertices (nodes), and
  – A(G) ⊆ V(G) × V(G), a set of arcs
• P : ℘(V(G)) → [0, 1] is a joint probability distribution, such that
  P(V(G)) = ∏_{i=1}^{n} P(Xi | πG(Xi))
  where πG(Xi) denotes the set of immediate ancestors (parents) of vertex Xi in G
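A sketch of this factorisation for the two-arc network above; the structure is from the slides, but the numeric parameter values here are hypothetical:

from itertools import product

# Five parameters suffice for the network X1 -> X2, X1 -> X3.
p_x1 = 0.2                          # P(x1)
p_x2 = {True: 0.7, False: 0.1}      # P(x2 | X1)  (hypothetical values)
p_x3 = {True: 0.4, False: 0.5}      # P(x3 | X1)  (hypothetical values)

def joint_bn(x1, x2, x3):
    """P(X1, X2, X3) = P(X2 | X1) P(X3 | X1) P(X1)."""
    f1 = p_x1 if x1 else 1 - p_x1
    f2 = p_x2[x1] if x2 else 1 - p_x2[x1]
    f3 = p_x3[x1] if x3 else 1 - p_x3[x1]
    return f1 * f2 * f3

# The five numbers above determine all eight joint probabilities:
assert abs(sum(joint_bn(*v) for v in product((True, False), repeat=3)) - 1) < 1e-9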
Example Bayesian network

Vertices: Flu (FL) (yes/no), Pneumonia (PN) (yes/no), Myalgia (MY) (yes/no), Fever (FE) (yes/no), TEMP (≤ 37.5 / > 37.5)

[Figure: digraph with arcs FL → MY, FL → FE, PN → FE and FE → TEMP]

Conditional probability tables:
P(FL = y) = 0.1
P(PN = y) = 0.05
P(MY = y | FL = y) = 0.96
P(MY = y | FL = n) = 0.20
P(FE = y | FL = y, PN = y) = 0.95
P(FE = y | FL = n, PN = y) = 0.80
P(FE = y | FL = y, PN = n) = 0.88
P(FE = y | FL = n, PN = n) = 0.001
P(TEMP ≤ 37.5 | FE = y) = 0.1
P(TEMP ≤ 37.5 | FE = n) = 0.99

Reasoning: evidence propagation

• Nothing known: the network represents the joint distribution P(FL, PN, MY, FE, TEMP)
• Temperature > 37.5 °C: enter the evidence TEMP > 37.5 and propagate it to obtain the updated marginals
• Likely symptoms of the flu? Enter FL = y as evidence and read off the marginals of the symptom vertices

[Figure: bar charts of the marginal distributions of FLU, PNEUMONIA, MYALGIA, FEVER and TEMP for each of these scenarios]
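A small inference-by-enumeration sketch (not from the slides) that performs the second propagation, computing P(FL = y | TEMP > 37.5) from the tables above:

from itertools import product

P_FL, P_PN = 0.1, 0.05
P_MY = {True: 0.96, False: 0.20}                   # P(MY = y | FL)
P_FE = {(True, True): 0.95, (False, True): 0.80,   # P(FE = y | FL, PN)
        (True, False): 0.88, (False, False): 0.001}
P_LOW = {True: 0.1, False: 0.99}                   # P(TEMP <= 37.5 | FE)

def joint(fl, pn, my, fe, high):
    """Joint probability from the BN factorisation."""
    pr = (P_FL if fl else 1 - P_FL) * (P_PN if pn else 1 - P_PN)
    pr *= P_MY[fl] if my else 1 - P_MY[fl]
    pr *= P_FE[fl, pn] if fe else 1 - P_FE[fl, pn]
    pr *= (1 - P_LOW[fe]) if high else P_LOW[fe]
    return pr

num = sum(joint(True, pn, my, fe, True)
          for pn, my, fe in product((True, False), repeat=3))
den = sum(joint(fl, pn, my, fe, True)
          for fl, pn, my, fe in product((True, False), repeat=4))
print(num / den)   # posterior probability of flu after observing TEMP > 37.5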
Bayesian network structure learning

Special form Bayesian networks

Bayesian network B = (G, P), with:
• digraph G = (V(G), A(G)), and
• probability distribution
  P(V) = ∏_{X∈V(G)} P(X | π(X))

Problems:
• for many BNs too many probabilities have to be assessed
• complex BNs do not necessarily yield better classifiers
• complex BNs may yield better estimates of the probability distribution

Solution: use simple probabilistic models for classification:
• naive (independent) form BN
• Tree-Augmented Bayesian Network (TAN)
• Forest-Augmented Bayesian Network (FAN)

Spectrum:
naive Bayesian network → tree-augmented Bayesian network (TAN) → general Bayesian networks
(restricted structure learning → unrestricted structure learning)
Naive (independent) form BN

[Figure: class vertex C with arcs to evidence vertices E1, E2, ..., Em]

• C is a class variable
• The evidence variables Ei in the evidence E ⊆ {E1, ..., Em} are conditionally independent given the class variable C

This yields, using Bayes' rule:

P(C | E) = P(E | C) P(C) / P(E)

with, as Ei ⫫ Ej | C for i ≠ j:

P(E | C) = ∏_{E∈E} P(E | C)    (by conditional independence)
P(E) = Σ_C P(E | C) P(C)    (by marginalisation & conditioning)

Classifier: cmax = arg max_C P(C | E)

Example of naive Bayes

[Figure: Disease (pneu/flu) with arcs to Myalgia (y/n), Fever (y/n) and TEMP (≤ 37.5 / > 37.5)]

P(flu) = 0.67
P(pneu) = 0.33
P(myalgia | flu) = 0.96
P(myalgia | pneu) = 0.28
P(fever | flu) = 0.88
P(fever | pneu) = 0.82
P(TEMP ≤ 37.5 | flu) = 0.20
P(TEMP ≤ 37.5 | pneu) = 0.26

Evidence: E = {TEMP > 37.5}; computation of the probability of flu using Bayes' rule:

P(flu | TEMP > 37.5) = P(TEMP > 37.5 | flu) P(flu) / P(TEMP > 37.5)

P(TEMP > 37.5) = P(TEMP > 37.5 | flu) P(flu) + P(TEMP > 37.5 | pneu) P(pneu)
               = 0.8 · 0.67 + 0.74 · 0.33 ≈ 0.78

⇒ P(flu | TEMP > 37.5) = 0.8 · 0.67 / 0.78 ≈ 0.687
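The same computation in a few lines of Python, as a sanity check of the worked example:

p_flu, p_pneu = 0.67, 0.33
p_high = {"flu": 1 - 0.20, "pneu": 1 - 0.26}    # P(TEMP > 37.5 | class)

evidence = p_high["flu"] * p_flu + p_high["pneu"] * p_pneu
posterior = p_high["flu"] * p_flu / evidence
print(round(evidence, 2), round(posterior, 3))  # 0.78 0.687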
Computing probabilities from data

Compute the weighted average of:
• the estimate P̂_D(V | π(V)) of the conditional probability distribution for variable V, based on the dataset D
• a Dirichlet prior Θ, which reflects prior knowledge

These are combined as follows:

P_D(V | π(V)) = n/(n + n0) · P̂_D(V | π(V)) + n0/(n + n0) · Θ

where
• n is the size of the dataset D
• n0 is the estimated size of the (virtual) 'dataset' on which the prior knowledge is based (the equivalent sample size)
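A sketch of this weighted average; the counts, prior Θ and equivalent sample size n0 below are hypothetical:

def smoothed(counts, theta, n0):
    """counts: empirical counts of V's values for one parent configuration;
    theta: Dirichlet prior over the same values; n0: equivalent sample size."""
    n = sum(counts.values())
    return {v: n / (n + n0) * (counts[v] / n) + n0 / (n + n0) * theta[v]
            for v in counts}

print(smoothed({"y": 8, "n": 2}, {"y": 0.5, "n": 0.5}, n0=2))
# ≈ {'y': 0.75, 'n': 0.25}: pulled from the empirical 0.8/0.2 towards the prior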
More complex Bayesian networks

• We want to add arcs to a naive Bayesian network to improve its performance
• Which arc should be added? Compute the mutual information between evidence variables E, E′ conditioned on the class variable C:

  I_P(E, E′ | C) = Σ_{E,E′,C} P(E, E′, C) · log [ P(E, E′ | C) / (P(E | C) P(E′ | C)) ]

• Result: possibly a TAN
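A sketch of I_P(E, E′ | C), assuming the joint distribution P(E, E′, C) is available as a dictionary; the identity P(E, E′ | C) / (P(E | C) P(E′ | C)) = P(E, E′, C) P(C) / (P(E, C) P(E′, C)) avoids computing conditionals explicitly:

from math import log

def cond_mutual_info(p_joint):
    """p_joint: {(e, e2, c): probability} over (E, E', C)."""
    def marg(keep):
        """Marginalise onto the key positions listed in `keep`."""
        out = {}
        for key, pr in p_joint.items():
            k = tuple(key[i] for i in keep)
            out[k] = out.get(k, 0.0) + pr
        return out
    p_ec, p_e2c, p_c = marg((0, 2)), marg((1, 2)), marg((2,))
    total = 0.0
    for (e, e2, c), pr in p_joint.items():
        if pr > 0.0:
            total += pr * log(pr * p_c[(c,)] / (p_ec[(e, c)] * p_e2c[(e2, c)]))
    return total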
FAN algorithm

Choose k ≥ 0. Given evidence variables Ei, a class variable C, and a dataset D:

1. Compute the mutual information I_P(Ei, Ej | C) for all pairs (Ei, Ej), i ≠ j, and label each edge of a complete undirected graph over the Ei with the cost −I_P(Ei, Ej | C)
2. Construct a minimum-cost spanning forest containing exactly k edges
3. Change each tree in the forest into a directed tree
4. Add an arc from the class vertex C to every evidence vertex Ei in the forest
5. Learn the conditional probability distributions from D using Dirichlet distributions

Performance evaluation

• Success rate σ based on:
  cmax = arg max_c {P(c | xi)}
  for xi ∈ D, i = 1, ..., n = |D|
• Total entropy (penalty):
  E = −Σ_{i=1}^{n} ln P(c | xi)
  – if P(c | xi) = 1, then ln P(c | xi) = 0
  – if P(c | xi) ↓ 0, then ln P(c | xi) → −∞
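A sketch of steps 1 and 2 (the function name and input format are assumptions): Kruskal's algorithm with a union-find, stopped after k edges; since the edge costs are the negated mutual informations, the lowest-cost edge is the most informative one.

def spanning_forest(n_vars, mi, k):
    """mi: {(i, j): I_P(Ei, Ej | C)}; returns at most k forest edges."""
    parent = list(range(n_vars))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    forest = []
    # cost -I_P is minimised by taking the highest mutual information first
    for (i, j), _ in sorted(mi.items(), key=lambda e: -e[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                        # the edge keeps the forest acyclic
            parent[ri] = rj
            forest.append((i, j))
            if len(forest) == k:
                break
    return forest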
Example COMIK BN

• Dataset with 1002 patient cases with liver and biliary tract disease, collected by the Danish COMIK group
• Has been used as a vehicle for learning various probabilistic models

[Figure: network learnt from the COMIK dataset. Evidence vertices include FEVER, CLOTTING-FACTORS, GALL-BLADDER, ASCITES, INTERMITTENT-JAUNDICE, UPPER-ABDOMINAL-PAIN, LIVER-SURFACE, GI-CANCER, ALKALINE-PHOSPHATASE, LDH, ASAT, BILIRUBIN, AGE, CONGESTIVE-HEART-FAILURE, BILIARY-COLICS-GALLSTONES, ALCOHOL, LEUKAEMIA-LYMPHOMA, HISTORY-GE-2-WEEKS, SPIDERS, WEIGHT-LOSS and JAUNDICE-DUE-TO-CIRRHOSIS; the class vertex COMIK takes the values ACUTE-NON-OBSTRUCTIVE, CHRONIC-NON-OBSTRUCTIVE, BENIGN-OBSTRUCTIVE, MALIGNANT-OBSTRUCTIVE and OTHER]

Results for COMIK Dataset

Based on the COMIK dataset:

[Figure: correct conclusion (%) as a function of the number of added arcs (0 to 20); vertical axis from 50 to 80]

[Figure: total entropy as a function of the number of added arcs (0 to 20); vertical axis from 900 to 1200]
Naive Bayes: odds-likelihood form

For class variable C and evidence E:

P(C | E) = (∏_{E∈E} P(E | C)) P(C) / P(E)

if E ⫫ E′ | C for all E, E′ ∈ E. For C = true:

P(c | E) = (∏_{E∈E} P(E | c)) P(c) / P(E)

For C = false:

P(¬c | E) = (∏_{E∈E} P(E | ¬c)) P(¬c) / P(E)

⇒

P(c | E) / P(¬c | E) = (∏_{E∈E} P(E | c)) P(c) / ((∏_{E∈E} P(E | ¬c)) P(¬c))
                     = ∏_{i=1}^{m} λi · O(c)
                     = O(c | E)

Here O(c | E) is the conditional odds, and λi = P(Ei | c) / P(Ei | ¬c) is a likelihood ratio.

Comparison Bayesian networks: NHL

[Figure: comparison of Bayesian networks for non-Hodgkin lymphoma (NHL) of the stomach, with vertices including AGE, GENERAL-HEALTH-STATUS, BM-DEPRESSION, CT&RT-SCHEDULE, PERFORATION, HEMORRHAGE, THERAPY-ADJUSTMENT, EARLY-RESULT, CLINICAL-STAGE, CLINICAL-PRESENTATION, BULKY-DISEASE, HISTOLOGICAL-CLASSIFICATION, SURGERY, POST-CT&RT-SURVIVAL, POST-SURGICAL-SURVIVAL, IMMEDIATE-SURVIVAL and 5-YEAR-RESULT]
Odds, likelihoods and logarithms

Odds:

O(c | E) = P(c | E) / P(¬c | E) = P(c | E) / (1 − P(c | E))

Back to probabilities:

P(c | E) = O(c | E) / (1 + O(c | E))

Logarithmic odds-likelihood form:

ln O(c | E) = ln (∏_{i=1}^{m} λi · O(c))
            = Σ_{i=1}^{m} ln λi + ln O(c)
            = Σ_{i=0}^{m} ωi

with ω0 = ln O(c) and ωi = ln λi, i = 1, ..., m

Odds and probabilities

[Figure: odds O = P/(1 − P) as a function of the probability P ∈ [0, 1], with O running from 0 to 10]

Note that:
• O(c | E) = 1 if P(c | E) = 0.5
• O(c | E) → ∞ if P(c | E) ↑ 1
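A sketch (not from the slides) tying this to the earlier naive-Bayes example: the prior log-odds ω0 plus the log-likelihood ratio of the single piece of evidence TEMP > 37.5 reproduce the posterior 0.687.

from math import exp, log

odds = lambda p: p / (1 - p)
prob = lambda o: o / (1 + o)

w0 = log(odds(0.67))       # omega_0 = ln O(c), prior log-odds of flu
w1 = log(0.80 / 0.74)      # omega_1 = ln lambda_1, with not-c = pneu
print(prob(exp(w0 + w1)))  # ~0.687, matching Bayes' rule directly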
Log-odds and weights

Log-odds:

ln O(c | E) = ln (∏_{i=1}^{m} λi · O(c)) = Σ_{i=1}^{m} ln λi + ln O(c) = Σ_{i=0}^{m} ωi

Back to probabilities:

P(c | E) = O(c | E) / (1 + O(c | E)) = exp(Σ_{i=0}^{m} ωi) / (1 + exp(Σ_{i=0}^{m} ωi))

Adjust the ωi with weights βi, based on existing interactions between variables:

ln O(c | E) = Σ_{i=0}^{m} βi ωi = β^T ω

Logistic regression

[Figure: two classes of points in the (x, y) plane, separated by a decision hyperplane]

Hyperplane: {ω | β^T ω = 0}, where
• c = β0 ω0 is the intercept (recall that ω0 = ln O(c), which is independent of any evidence E)
• ωi, i = 1, ..., m, correspond to the probabilities we want to find
Maximum likelihood estimate

Database D, |D| = N, with independent instances xi ∈ D; this gives the likelihood function l:

l(β) = ∏_{i=1}^{N} Pβ(Ci | x′i)

where Ci is the class value for instance i, and x′i is xi without Ci.

Log-likelihood function L:

L(β) = ln l(β)
     = Σ_{i=1}^{N} ln Pβ(Ci | x′i)
     = Σ_{i=1}^{N} ( yi ln Pβ(ci | x′i) + (1 − yi) ln(1 − Pβ(ci | x′i)) )
     = Σ_{i=1}^{N} ( yi β^T ω − ln(1 + e^{β^T ω}) )

where ci is (always) the value Ci = true; yi = 1 if ci is the class value for xi, and yi = 0 otherwise.

Maximisation of L(β):

∂L(β)/∂β = Σ_{i=1}^{N} ( yi ω − (e^{β^T ω} / (1 + e^{β^T ω})) ω )
         = Σ_{i=1}^{N} ( yi − Pβ(ci | x′i) ) ω = 0

This can be solved using equation-solving methods, such as the Newton-Raphson method:

βr+1 = βr − ( ∂²L(β) / ∂β^T ∂β )⁻¹ · ∂L(β) / ∂β

for r = 0, 1, ..., with β0 = 0; the result is an approximation for β.
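A sketch of this Newton-Raphson fit on synthetic data (the data, names and sizes are illustrative, not from the WEKA run below):

import numpy as np

def fit_logistic(W, y, iterations=25):
    """W: N x (m+1) matrix with a leading all-ones column; y: 0/1 labels."""
    beta = np.zeros(W.shape[1])                     # beta_0 = 0, as on the slides
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-W @ beta))         # P_beta(c | x'_i) per instance
        grad = W.T @ (y - p)                        # dL/dbeta
        hess = -(W * (p * (1 - p))[:, None]).T @ W  # d2L/(dbeta^T dbeta)
        beta = beta - np.linalg.solve(hess, grad)   # Newton-Raphson update
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + rng.logistic(size=200) > 0).astype(float)
W = np.hstack([np.ones((200, 1)), X])
print(fit_logistic(W, y))                           # approximation for beta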
WEKA output

Scheme:       weka.classifiers.Logistic
Relation:     weather.symbolic
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Logistic Regression (2 classes)

Coefficients...
Variable      Coeff.
        1    34.9227
        2   -48.1161
        3     7.8472
        4    17.3933
        5   -33.0445
        6    22.2601
        7   -82.415
        8   -54.6671
Intercept    66.1064