LESS NAÏVE CONDITIONAL CLASSIFIERS

October 5: Lecture 10
Independence:
the key assumption for
naïve Bayes classifier
• In general we have a learning problem where we observe
some data and want to estimate the posterior probability
of some event. In other words, the data alter our prior
expectations. Which data are relevant?
• Say we have three distinct random variables X, Y, Z
• We say that X is conditionally independent of Y given Z if
P(X | Y, Z) = P(X | Z)
• In other words, once I give you Z, Y is totally irrelevant for
estimating the probability of X. This can be generalized to sets
of variables.
Independence:
the key assumption for
naïve Bayes classifier
• The naïve Bayes classifier reads the data and works under
the assumption that any given attribute is independent of
any other attribute given the target value V.
What is the probability
that two things happen
simultaneously given V?
Ask the question more
gradually…
• This assumption is crucial for efficiency, but it is often not
true in practice.
Barack …….
vs. Barack…..
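The rule just described can be sketched in a few lines of Python: pick the target value v maximizing P(v) · ∏i P(ai | v), multiplying the per-attribute factors one at a time ("gradually"). The function name and all probability tables below are made-up illustrations, not data from the lecture.

```python
def naive_bayes(priors, likelihoods, attrs):
    """Naive Bayes rule: choose v maximizing P(v) * prod_i P(a_i | v),
    valid under the assumption that attributes are conditionally
    independent given the target value v.
    priors: {v: P(v)}; likelihoods: {v: {attr: P(attr | v)}}."""
    best_v, best_p = None, -1.0
    for v, pv in priors.items():
        p = pv
        for a in attrs:
            p *= likelihoods[v][a]  # P(a | v) from a learned table
        if p > best_p:
            best_v, best_p = v, p
    return best_v

# Hypothetical tables for a yes/no target.
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {"yes": {"sunny": 0.2, "hot": 0.3},
               "no":  {"sunny": 0.6, "hot": 0.4}}
print(naive_bayes(priors, likelihoods, ["sunny", "hot"]))  # no
```

Here 0.6·0.2·0.3 = 0.036 for "yes" loses to 0.4·0.6·0.4 = 0.096 for "no"; the product of independent per-attribute factors is exactly where the independence assumption enters.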
Bayesian Belief Networks
encoding causality
• A picture is worth a thousand words…
• Ovals are variables; arrows indicate causality
• The graph is acyclic
• A variable Y is a descendant of X if there is a directed path from X to Y
• A variable X is conditionally independent of its non-descendants given the values of its parents
Bayesian Belief Networks
encoding causality
A variable X is conditionally independent of its non-descendants given the values of its parents
In general a BBN represents
the joint probability
distribution of a set of
variables, along with a set of
conditional independence assumptions.
• For example, Campfire is independent of Lightning and
Thunder once Storm and BusTourGroup are known.
Bayesian Belief Networks
encoding causality
A variable X is conditionally independent of its non-descendants given the values of its parents
We want to compute the joint
probability. What is missing?
A conditional probability table
(CPT) must be known for every
oval, giving the distribution of
that variable for each combination
of its parents' values.
Bayesian Belief Networks
encoding causality
What is the joint
probability?
How is this different
from the case of
independence?
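The joint factorizes as P(x1, …, xn) = ∏i P(xi | Parents(xi)): unlike the fully independent case, each factor conditions on the variable's parents. A sketch for the Storm / BusTourGroup / Campfire / Lightning / Thunder network from the slides; every numeric CPT entry below is a made-up placeholder, not a value from the lecture.

```python
from itertools import product

def joint(storm, bus, campfire, lightning, thunder):
    """P(s, b, c, l, t) = P(s) P(b) P(c | s, b) P(l | s) P(t | l),
    i.e. prod_i P(x_i | Parents(x_i)). All CPT numbers are hypothetical."""
    p_s = 0.2 if storm else 0.8                      # P(Storm)
    p_b = 0.1 if bus else 0.9                        # P(BusTourGroup)
    cpt_c = {(True, True): 0.9, (True, False): 0.1,  # P(Campfire | S, B)
             (False, True): 0.8, (False, False): 0.05}
    pc = cpt_c[(storm, bus)]
    p_c = pc if campfire else 1 - pc
    pl = 0.7 if storm else 0.05                      # P(Lightning | Storm)
    p_l = pl if lightning else 1 - pl
    pt = 0.95 if lightning else 0.01                 # P(Thunder | Lightning)
    p_t = pt if thunder else 1 - pt
    return p_s * p_b * p_c * p_l * p_t

# Sanity check: the factorization defines a proper distribution.
total = sum(joint(*bits) for bits in product([True, False], repeat=5))
print(round(total, 10))  # 1.0
```

With full independence every factor would be an unconditional P(xi); the only change here is that each factor looks up its parents' values in a table.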
Bayesian Belief Networks
encoding causality
• Bayesian Belief Networks can be used to infer the probability
distribution of a subset of the variables, given specific observed values of
some other variables.
• In general, inferring the exact distribution is NP-hard. Even approximating a
distribution is a hard problem. In practice, though, sampling methods seem
to work satisfactorily.
• In the learning problem we want to learn the tables that go with the ovals.
The objective function we try to maximize is P(D|h), i.e., find the hypothesis
(tables) that maximizes the probability of observing the data. One algorithm
for this is gradient ascent, following the gradient of ln P(D|h).
• With more advanced algorithms we can also learn the acyclic graph itself.
This helps discover causalities, for example with data involving genes and
diseases.
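One standard way to write this gradient (the notation below is my own sketch, not taken from the lecture): let w_ijk denote the CPT entry P(Y_i = y_ij | Parents(Y_i) = u_ik). Then

```latex
% One CPT entry: w_{ijk} = P(Y_i = y_{ij} \mid \mathrm{Parents}(Y_i) = u_{ik})
\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}}
  \;=\; \sum_{d \in D} \frac{P_h(y_{ij},\, u_{ik} \mid d)}{w_{ijk}}
% so each gradient-ascent step, with learning rate \eta, performs
w_{ijk} \;\leftarrow\; w_{ijk} + \eta \sum_{d \in D} \frac{P_h(y_{ij},\, u_{ik} \mid d)}{w_{ijk}}
% followed by renormalizing so that \sum_j w_{ijk} = 1 for every (i, k).
```

The renormalization step keeps each column of each table a valid conditional distribution after the update.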
Learning Hidden Creatures
• In many experiments we cannot afford to observe all attributes,
or it may be impossible to observe them all.
• In fact, in some cases some random variables may be eluding our
grasp. We need to look carefully to see that they're there.
• The EM algorithm is a general technique that maintains a current
hypothesis h and estimates the missing data according to it. It then
uses both the actually observed data and the estimates of the
missing data to find a (better) new hypothesis h' that becomes the
new h. The whole process is iterated and converges to a local
maximum.
Learning Hidden Creatures
(i am totally making this up)
• Consider the following problem. We have a radio telescope that
receives radiation from three different galaxies, and we want to
estimate their distances.
Learning Hidden Creatures
(i am totally making this up)
• Consider the following problem. We have a radio telescope that
receives radiation from three different galaxies, and we want to
estimate their distances.
The average intensity of
the activity is
proportional to the
distance, but there are
fluctuations that follow a
Gaussian distribution.
Learning Hidden Creatures
(i am totally making this up)
• In the simple case we have only one distribution whose mean we
want to estimate (given data x1, …, xm).
• Then we just take the average (this minimizes the squared errors).
• We can also calculate confidence intervals.
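A minimal sketch of that estimator in Python (function names and the sample readings are my own): the sample average is exactly the μ minimizing Σi (xi − μ)².

```python
def ml_mean(xs):
    """Maximum-likelihood mean of a single Gaussian:
    the sample average, which minimizes sum_i (x_i - mu)^2."""
    return sum(xs) / len(xs)

def squared_error(xs, mu):
    """Sum of squared deviations of the data from a candidate mean."""
    return sum((x - mu) ** 2 for x in xs)

xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical intensity readings
mu = ml_mean(xs)            # 2.5
# The average beats any nearby candidate on squared error:
assert squared_error(xs, mu) < squared_error(xs, mu + 0.1)
assert squared_error(xs, mu) < squared_error(xs, mu - 0.1)
```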
Learning Hidden Creatures
mixture of Gaussians
• But now we have a mixture of Gaussians.
• The key is to imagine the hidden variables
that will make our life easier.
• Every data point xi is now viewed as a triple <xi, zi1, zi2>.
• zi1 is 1 if xi comes from the first Gaussian, 0 otherwise.
Similarly for zi2.
Learning Hidden Creatures
an instance of EM
• But we don't know the zi's!
Otherwise the problem would be just two cases of one Gaussian.
• Find some first estimates for μ1 and μ2.
• E[zij] is the probability that instance xi is generated
by the jth distribution:
E[zij] = exp(−(xi − μj)²/2σ²) / Σn=1,2 exp(−(xi − μn)²/2σ²)
Learning Hidden Creatures
an instance of EM
• Now update the estimates: each mean becomes the
E[zij]-weighted average of the data,
μj ← Σi E[zij] xi / Σi E[zij]
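The two steps together, as a short Python sketch for two Gaussians with known, equal variance (the function name, test data, and initial guesses are my own, not from the lecture):

```python
import math

def em_two_gaussians(xs, mu1, mu2, sigma=1.0, iters=50):
    """EM for a mixture of two equal-variance Gaussians.
    E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)).
    M-step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]."""
    for _ in range(iters):
        # E-step: responsibility of each Gaussian for each point
        resp = []
        for x in xs:
            p1 = math.exp(-(x - mu1) ** 2 / (2 * sigma ** 2))
            p2 = math.exp(-(x - mu2) ** 2 / (2 * sigma ** 2))
            resp.append((p1 / (p1 + p2), p2 / (p1 + p2)))
        # M-step: each mean becomes a responsibility-weighted average
        mu1 = sum(r[0] * x for r, x in zip(resp, xs)) / sum(r[0] for r in resp)
        mu2 = sum(r[1] * x for r, x in zip(resp, xs)) / sum(r[1] for r in resp)
    return mu1, mu2

# Two clusters near 0 and 5, with deliberately bad initial guesses.
xs = [-0.2, 0.0, 0.1, 4.9, 5.0, 5.1]
mu1, mu2 = em_two_gaussians(xs, mu1=1.0, mu2=4.0)
print(round(mu1, 2), round(mu2, 2))  # -0.03 5.0
```

Each iteration re-estimates the hidden zij's from the current means (E-step), then re-fits the means from those estimates (M-step); as the slides note, this converges to a local maximum of the likelihood.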
Homework
Read and understand the derivation of EM