DD2447 STAT. METH. IN CS, FALL 2014
Royal Institute of Technology
Lecture 3 - Bayesian

BETA DISTRIBUTION
• PDF
  Beta(x|a, b) = (1 / B(a, b)) x^(a-1) (1 - x)^(b-1)
• where
  B(a, b) = Γ(a) Γ(b) / Γ(a + b)
• and
  Γ(a) = (a - 1) Γ(a - 1)
• In particular, for integer n
  Γ(n) = (n - 1)!
(Chap. 3)
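As a quick numerical check of the definitions above, a minimal sketch (added for illustration, standard library only):

import math

def beta_fn(a, b):
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_pdf(x, a, b):
    # Beta(x | a, b) = x^(a-1) * (1 - x)^(b-1) / B(a, b), for 0 < x < 1
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

print(beta_pdf(0.3, 2, 5))  # should match scipy.stats.beta.pdf(0.3, 2, 5)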
BIASED DICE - FIRST COIN
★ Assumption: we don’t know the outcome probabilities
★ We need a prior over the outcome probabilities
★ First likelihood: N1 heads and N0 tails, with p(head) = θ
★ sequence (iid Bernoulli):
  p(D|θ) = θ^N1 (1 - θ)^N0
★ counts (Binomial):
  p(D|θ) = C(N1 + N0, N1) θ^N1 (1 - θ)^N0
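A minimal sketch (added for illustration) of the two likelihoods, for a hypothetical dataset with N1 = 7 heads and N0 = 3 tails:

from math import comb

def bernoulli_seq_likelihood(theta, n1, n0):
    # p(D|theta) for one particular sequence: theta^N1 * (1-theta)^N0
    return theta ** n1 * (1 - theta) ** n0

def binomial_likelihood(theta, n1, n0):
    # p(D|theta) for the counts: C(N1+N0, N1) * theta^N1 * (1-theta)^N0
    return comb(n1 + n0, n1) * theta ** n1 * (1 - theta) ** n0

print(bernoulli_seq_likelihood(0.6, 7, 3))
print(binomial_likelihood(0.6, 7, 3))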
BACK TO BAYES - PRIOR FOR BINOMIAL
★ Prior: Beta distribution, up to a constant
  p(θ|γ1, γ0) ∝ θ^(γ1 - 1) (1 - θ)^(γ0 - 1)
★ Posterior p(θ|D):
  p(θ|D) ∝ p(D|θ) p(θ|γ1, γ0)
         = θ^N1 (1 - θ)^N0 · θ^(γ1 - 1) (1 - θ)^(γ0 - 1)
         = θ^(N1 + γ1 - 1) (1 - θ)^(N0 + γ0 - 1)
  so p(θ|D) = Beta(θ|N1 + γ1, N0 + γ0)
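A minimal sketch of the conjugate update above (the counts and hyperparameters are illustrative):

def beta_binomial_posterior(n1, n0, g1=1.0, g0=1.0):
    # posterior is Beta(N1 + gamma1, N0 + gamma0)
    a_post, b_post = n1 + g1, n0 + g0
    post_mean = a_post / (a_post + b_post)   # E[theta | D]
    return a_post, b_post, post_mean

print(beta_binomial_posterior(7, 3))  # Beta(8, 4), posterior mean 8/12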
LAPLACE’S RULE OF SUCCESSION
★ Beta is a conjugate prior for the Binomial
★ A prior that gives a posterior of the same form is called conjugate
★ Uniform prior, i.e., γ1 = γ0 = 1, gives
  p(x = 1|D) = (N1 + 1) / (N + 2)
★ Black Swan “paradox”
BACK TO THE DICE – DIRICHLET-MULTINOMIAL
★ Data D = {x1, . . . , xN}, with xi ∈ {1, . . . , K}
★ θ ∈ SK, the K-dim. probability simplex, i.e., Σ_{k∈[K]} θk = 1
★ Prior, with hyperparameters α = (α1, . . . , αK):
  Dir(θ|α) = (1 / B(α)) Π_{k∈[K]} θk^(αk - 1)
★ Likelihood:
  p(D|θ) = Π_{k∈[K]} θk^Nk
★ Posterior:
  p(θ|D) ∝ p(D|θ) p(θ) ∝ Π_{k∈[K]} θk^Nk · Π_{k∈[K]} θk^(αk - 1) = Π_{k∈[K]} θk^(Nk + αk - 1)
  so p(θ|D) = Dir(θ|N1 + α1, . . . , NK + αK)
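A minimal sketch of the Dirichlet-multinomial update, with made-up counts and the α = (2, 2, 2) prior used on the next slide:

import numpy as np

def dirichlet_posterior(counts, alpha):
    # posterior is Dir(N1 + alpha_1, ..., NK + alpha_K)
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    post = counts + alpha
    post_mean = post / post.sum()            # E[theta_k | D]
    return post, post_mean

post, mean = dirichlet_posterior(counts=[5, 2, 0], alpha=[2, 2, 2])
print(post)   # [7. 4. 2.]
print(mean)   # [7/13, 4/13, 2/13]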
POSTERIOR FOR DIRICHLET-MULTINOMIAL
[Figure: Dirichlet distribution over the probability simplex, α = (2, 2, 2)]

ANOTHER COLOR CODING
[Figure: Dirichlet distribution over the probability simplex, α = (20, 2, 2)]
SAMPLES
α = (0.1, 0.1, 0.1, 0.1, 0.1)
[Figure: “Samples from Dir (alpha=0.1)” — five draws, each shown as a bar plot over categories 1–5]

SAMPLES
α = (1, 1, 1, 1, 1)
[Figure: “Samples from Dir (alpha=1)” — five draws, each shown as a bar plot over categories 1–5]
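A sketch of how such samples can be drawn with NumPy (the seed and the number of draws are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
for a in (0.1, 1.0):
    # 5 draws, each a point on the 5-dim. probability simplex
    samples = rng.dirichlet([a] * 5, size=5)
    print(f"alpha = {a}:")
    print(np.round(samples, 2))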
POSTERIOR PREDICTIVE – CATEGORICAL-DIRICHLET
★ Probability that the next observation is j, posterior to D
★ (Sec. 3.4.4, Posterior predictive) The posterior predictive distribution for a single multinoulli trial is given by the following expression:

  p(X = j|D) = ∫ p(X = j|θ) p(θ|D) dθ                                                        (3.49)
             = ∫ p(X = j|θj) [ ∫ p(θ−j, θj|D) dθ−j ] dθj                                      (3.50)
             = ∫ θj p(θj|D) dθj = E[θj|D] = (αj + Nj) / Σ_k (αk + Nk) = (αj + Nj) / (α0 + N)  (3.51)

  where θ−j are all the components of θ except θj. See also Exercise 3.13.

★ The above expression avoids the zero-count problem, just as we saw in Section 3.3.4.1. In fact, this form of Bayesian smoothing is even more important in the multinomial case than the binary case, since the likelihood of data sparsity increases once we start partitioning the data into many categories.
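A minimal sketch of (3.51): the posterior predictive is just a vector of smoothed relative frequencies (the counts and prior below are illustrative):

import numpy as np

def posterior_predictive(counts, alpha):
    # p(X = j | D) = (alpha_j + N_j) / (alpha_0 + N)
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (alpha + counts) / (alpha.sum() + counts.sum())

print(posterior_predictive(counts=[5, 2, 0], alpha=[1, 1, 1]))  # no zero probabilities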
BAG OF WORDS

3.4.4.1 Worked example: language models using bag of words

One application of Bayesian smoothing using the Dirichlet-multinomial model is to language modeling, which means predicting which words might occur next in a sequence. Here we will take a very simple-minded approach, and assume that the i’th word, Xi ∈ {1, . . . , K}, is sampled independently from all the other words using a Cat(θ) distribution. This is called the bag of words model. Given a past sequence of words, how can we predict which one is likely to come next?

For example, suppose we observe the following sequence (part of a children’s nursery rhyme):

  Mary had a little lamb, little lamb, little lamb,
  Mary had a little lamb, its fleece as white as snow

Furthermore, suppose our vocabulary consists of the following words:

  Token  1     2     3       4    5       6      7      8     9     10
  Word   mary  lamb  little  big  fleece  white  black  snow  rain  unk

Here unk stands for unknown, and represents all other words that do not appear elsewhere on the list. To encode each line of the nursery rhyme, we first strip off punctuation, and remove any stop words such as “a”, “as”, “the”, etc. We can also perform stemming, which means reducing words to their base form, such as stripping off the final s in plural words, or the ing from verbs (e.g., running becomes run). In this example, no words need stemming. Finally, we replace each word by its index into the vocabulary to get:

  1 10 3 2 3 2 3 2
  1 10 3 2 10 5 10 6 8

We now ignore the word order, and count how often each word occurred, resulting in a histogram of word counts:

  Token  1     2     3       4    5       6      7      8     9     10
  Word   mary  lamb  little  big  fleece  white  black  snow  rain  unk
  Count  2     4     4       0    1       1      0      1     0     4

Denote the above counts by Nj. If we use a Dir(α) prior for θ, the posterior predictive is just

  p(X̃ = j|D) = E[θj|D] = (αj + Nj) / Σ_j′ (αj′ + Nj′) = (1 + Nj) / (10 + 17)     (3.52)

If we set αj = 1, we get

  p(X̃ = j|D) = (3/27, 5/27, 5/27, 1/27, 2/27, 2/27, 1/27, 2/27, 1/27, 5/27)      (3.53)

The modes of the predictive distribution are X = 2 (“lamb”) and X = 10 (“unk”). Note that the words “big”, “black” and “rain” are predicted to occur with non-zero probability in the future, even though they have never been seen before. Later on we will see more sophisticated language models.
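A sketch that reproduces (3.52)–(3.53) from the counts above (the vocabulary and counts are taken from the example; everything else is illustrative):

import numpy as np

vocab = ["mary", "lamb", "little", "big", "fleece",
         "white", "black", "snow", "rain", "unk"]
counts = np.array([2, 4, 4, 0, 1, 1, 0, 1, 0, 4])   # N_j from the histogram above
alpha = np.ones(len(vocab))                          # alpha_j = 1

pred = (alpha + counts) / (alpha.sum() + counts.sum())   # (1 + N_j) / (10 + 17)
print(pred * 27)   # [3. 5. 5. 1. 2. 2. 1. 2. 1. 5.]  ->  (3/27, 5/27, ..., 5/27)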
NAIVE BAYES CLASSIFIER – NBC

(Sec. 3.5, Naive Bayes classifiers) In this section, we discuss how to classify vectors of discrete-valued features, x ∈ {1, . . . , K}^D, where K is the number of values for each feature, and D is the number of features. We will use a generative approach. This requires us to specify the class conditional distribution, p(x|y = c). The simplest approach is to assume the features are conditionally independent given the class label. This allows us to write the class conditional density as a product of one dimensional densities:

  p(x|y = c, θ) = Π_{j=1}^{D} p(xj|y = c, θjc)     (3.54)

The resulting model is called a naive Bayes classifier (NBC).
The model is called “naive” since we do not expect the features to be independent, even conditional on the class label. However, even if the naive Bayes assumption is not true, it often results in classifiers that work well (Domingos and Pazzani 1997). One reason for this is that the model is quite simple (it only has O(CD) parameters, for C classes and D features), and hence it is relatively immune to overfitting.
The form of the class-conditional density depends on the type of each feature. We give some possibilities below:
• In the case of real-valued features, we can use the Gaussian distribution: p(x|y = c, θ) = Π_{j=1}^{D} N(xj|μjc, σjc^2), where μjc is the mean of feature j in objects of class c, and σjc^2 is its variance.
• In the case of binary features, xj ∈ {0, 1}, we can use the Bernoulli distribution: p(x|y = c, θ) = Π_{j=1}^{D} Ber(xj|μjc), where μjc is the probability that feature j occurs in class c. This is sometimes called the multivariate Bernoulli naive Bayes model. We will see an application of this below.

★ Datapoint x ∈ [K]^D, classes [C]
★ Graphical model: class Y with children X1, . . . , XD
★ Distribution over class: π
★ Given the class, the Xi:s are independent
★ Class conditional (independent) distribution:
  p(x|y = c, θ) = Π_{d=1}^{D} p(xd|y = c, θdc)
• Categorical, so θdc gives the probabilities of each outcome in [K] (a code sketch of this case follows below)
  but it can also be
• Bernoulli, so θdc is the probability of heads
• Or x real valued and Gaussian distributed, so θdc gives a mean and a variance
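A minimal sketch (with assumed array shapes, not from the slides) of the factored class conditional for categorical features:

import numpy as np

def nbc_class_conditional_loglik(x, c, theta):
    # theta has shape (D, C, K); theta[d, c, k] = p(x_d = k | y = c)
    # log p(x | y = c, theta) = sum_d log theta[d, c, x_d]
    return sum(np.log(theta[d, c, x[d]]) for d in range(len(x)))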
NAIVE BAYES CLASSIFIER – NBC
★ Data: D = {(x1, y1), . . . , (xN, yN)}
★ Counts: N, Nc, Ndc, Ndck
★ Likelihood:
  p(xn, yn|π, θ) = p(yn|π) Π_d p(xnd|θ_{d,yn}) = π_{yn} Π_d θ_{d, yn, xnd}
  p(D|π, θ) = Π_c πc^Nc · Π_c Π_d Π_k p(x = k|θdc)^Ndck
★ Log-likelihood:
  log p(D|π, θ) = Σ_c Nc log πc + Σ_c Σ_d Σ_k Ndck log p(x = k|θdc)
★ Optimized by
  π̂c = Nc/N  and  θ̂dck = Ndck/Ndc
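A sketch of the maximum-likelihood fit above; the array layout (X of shape N×D with values in 0..K-1, labels y in 0..C-1) is an assumption made for illustration:

import numpy as np

def fit_nbc_mle(X, y, C, K):
    N, D = X.shape
    pi_hat = np.bincount(y, minlength=C) / N                 # N_c / N
    theta_hat = np.zeros((D, C, K))
    for c in range(C):
        Xc = X[y == c]                                       # objects of class c (N_dc of them per feature)
        for d in range(D):
            Ndck = np.bincount(Xc[:, d], minlength=K)        # N_dck
            theta_hat[d, c] = Ndck / max(len(Xc), 1)         # N_dck / N_dc
    return pi_hat, theta_hat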
TRAINING AN NBC
[Figure: two panels, p(xj = 1|y = 1) and p(xj = 1|y = 2), plotting the fitted class-conditional probabilities (values in [0, 1]) against the feature index j, roughly 0–700]

BAYESIAN NAIVE BAYES CLASSIFIER
★ Prior (Dirichlet on all, perhaps add one):
  p(π, θ) = p(π) Π_c Π_d p(θdc) = Dir(π|α) Π_c Π_d Dir(θdc|β)
★ (Recall) likelihood:
  p(D|π, θ) = Π_c πc^Nc · Π_c Π_d Π_k p(x = k|θdc)^Ndck
★ Posterior:
  p(π, θ|D) = Dir(π|N + α) Π_c Π_d Dir(θdc|Ndc + β)

CLASS CONDITIONALS
★ In particular, the posterior of each class conditional is p(θdc|D) = Dir(θdc|Ndc + β), where Ndc = (Ndc1, . . . , NdcK)
PREDICTION – BASED ON DATA D
★ What is the class, for unclassified x?
★ Bayesian: integrate out the parameters
  p(y = c|x, D) ∝ p(y = c, x|D)
  p(y = c, x|D) = ∫_{π,θ} p(y = c, x, π, θ|D) d(π, θ)
                = ∫_{π,θ} p(y = c, x|π, θ) p(π, θ|D) d(π, θ)
                = ∫_{π,θ} p(y = c|π) p(π|D) p(x|y = c, θ) p(θ|D) d(π, θ)
                = [∫ Cat(y = c|π) p(π|D) dπ] · Π_d [∫ Cat(xd|y = c, θdc) p(θdc|D) dθdc]
★ These are posterior means, so
  ∫ Cat(y = c|π) p(π|D) dπ = (Nc + αc) / (N + α0),   α0 := Σ_i αi
  ∫ Cat(xd|y = c, θdc) p(θdc|D) dθdc = (Ndcxd + βxd) / (Nc + β0),   β0 := Σ_i βi
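A sketch of this prediction rule using the posterior-mean plug-ins above; the array shapes and names are assumptions for illustration, and log probabilities are used to avoid underflow (see the next slide):

import numpy as np

def predict_bayes_nbc(x, class_counts, feat_counts, alpha, beta):
    # class_counts: shape (C,) with N_c; feat_counts: shape (D, C, K) with N_dck
    # alpha: shape (C,); beta: shape (K,)
    C = len(class_counts)
    N = class_counts.sum()
    pi_bar = (class_counts + alpha) / (N + alpha.sum())      # (N_c + alpha_c) / (N + alpha_0)
    scores = np.log(pi_bar)
    for c in range(C):
        for d, xd in enumerate(x):
            Ndc = feat_counts[d, c].sum()
            theta_bar = (feat_counts[d, c, xd] + beta[xd]) / (Ndc + beta.sum())
            scores[c] += np.log(theta_bar)
    # unnormalized log p(y = c, x | D); normalize with log-sum-exp if probabilities are needed
    return scores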
LOG-SUM-EXP TRICK
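The trick computes log Σ_i exp(a_i) without overflow or underflow by factoring out the maximum; a minimal sketch (added for illustration):

import numpy as np

def log_sum_exp(a):
    # log sum_i exp(a_i) = m + log sum_i exp(a_i - m), with m = max_i a_i
    a = np.asarray(a, dtype=float)
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

log_probs = np.array([-1000.0, -1001.0, -999.5])
print(log_sum_exp(log_probs))   # finite, unlike np.log(np.exp(log_probs).sum())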
[Figure: profile-HMM example. (a) Multiple alignment of the sequences for bat, rat, cat, gnat, goat over columns x1, x2, x3. (b) Profile-HMM architecture: Begin and End states with match (M), insert (I), and delete (D) states at model positions 0–4. (c) Observed emission/transition counts per model position: match emissions and insert emissions over the alphabet {A, C, G, T}, and state transitions M-M, M-D, M-I, I-M, I-D, I-I, D-M, D-D, D-I.]
The end