ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 25: CLUSTERING
• Objectives:
Mixture Densities
Maximum Likelihood Estimates
Application to Gaussian Mixture Models
k-Means Clustering
Fuzzy k-Means Clustering
• Resources:
J.B.: EM Estimation
A.M.: GMM Models
D.D.: Clustering
C.B.: Unsupervised Clustering
Wiki: K-Means
A.M.: Hierarchical Clustering
MU: Introduction to Clustering
Java PR Applet
Introduction
• Training procedures that use labeled samples are referred to as supervised.
• Unsupervised procedures use unlabeled data.
• Seven basic reasons why we are interested in unsupervised methods:
1) Collecting and labeling data is very costly and nontrivial (often this is
a research problem in itself).
2) Heuristic methods (application-specific) exist that allow us to improve a
classifier trained using supervised techniques by introducing large
amounts of unlabeled data. This is often faster than labeling data.
3) We would like to exploit “found” data such as that available on the
Internet. Often this data is not truth-marked or is only partially transcribed.
4) Reversal of the training process: train on unlabeled data and then use
supervision to label the groupings.
5) Models often need to be adapted over time.
6) Use unsupervised methods to find features that will be useful for
categorization.
7) Perform rapid exploratory analysis to gain insight into a new problem.
• In this lecture, we begin with parametric models and then step back and
consider nonparametric techniques.
Mixture Densities
• Assume:
The samples come from a known number of c classes.
The prior probabilities, P(ωj), for j = 1, …, c, are known.
The forms of the class-conditional probability densities, p(x|ωj,θj) are known.
The values of the c parameter vectors θ1, …, θc are unknown.
The category labels are unknown.
• The probability density function for the samples is given by:

  p(\mathbf{x} \mid \boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)

  where θ = (θ1, …, θc)^t. The prior probabilities, P(ωj), are called the mixing
  parameters and, without loss of generality, sum to one.
• A density, p(x|θ), is said to be identifiable if θ ≠ θ’ implies there exists an x
such that p(x|θ) ≠ p(x|θ’). (A density is unidentifiable if we cannot recover a
unique θ from an infinite amount of data.)
• Identifiability of θ is a property of the model and not the procedure used to
estimate the model.
• We have already discussed methods to estimate these mixture coefficients.
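• As a concrete illustration (added here, not part of the original slides), the minimal Python sketch below evaluates this mixture density for a hypothetical two-component, one-dimensional Gaussian mixture; the priors, means, and standard deviations are invented values.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-component 1-D Gaussian mixture (illustrative values only).
priors = np.array([0.3, 0.7])    # mixing parameters P(w_j), sum to one
means  = np.array([-2.0, 2.0])   # component parameters theta_j (means)
stds   = np.array([1.0, 1.0])

def mixture_density(x):
    """p(x | theta) = sum_j p(x | w_j, theta_j) * P(w_j)"""
    return sum(P * norm.pdf(x, mu, sd) for P, mu, sd in zip(priors, means, stds))

print(mixture_density(0.0))
```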
Maximum Likelihood Estimates
• Given a set D = {x1, …, xn} of n unlabeled samples drawn independently from
the mixture density, the likelihood of the observed samples is:
  p(D \mid \boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k \mid \boldsymbol{\theta})
• The maximum likelihood estimate is the value of θ that maximizes p(D|θ).
• If we differentiate the log-likelihood with respect to θi:

  \nabla_{\boldsymbol{\theta}_i} \log p(D \mid \boldsymbol{\theta}) = \sum_{k=1}^{n} \frac{1}{p(\mathbf{x}_k \mid \boldsymbol{\theta})}\, \nabla_{\boldsymbol{\theta}_i} \left[ \sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \boldsymbol{\theta}_j)\, P(\omega_j) \right]
• Assume θi and θj are functionally independent if i ≠ j.
• Substitute the posterior:

  P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta}) = \frac{p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)\, P(\omega_i)}{p(\mathbf{x}_k \mid \boldsymbol{\theta})}

• The gradient can then be written as:

  \nabla_{\boldsymbol{\theta}_i} \log p(D \mid \boldsymbol{\theta}) = \sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\theta}_i)
• The gradient must vanish at the value of θi that maximizes the log likelihood.
Therefore, the ML solution must satisfy:
  \sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0 \quad \text{for } i = 1, \ldots, c
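• Because the ML condition is expressed through the posteriors P(ωi | xk, θ), the short sketch below (an added illustration, not from the slides) computes the log-likelihood log p(D|θ) and these posteriors for a hypothetical one-dimensional, two-component Gaussian mixture with unit variances; the sample values and parameters are invented.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical unlabeled samples and current parameter guesses (illustrative only).
x = np.array([-2.5, -1.8, 0.1, 1.9, 2.7])
priors = np.array([0.5, 0.5])
means  = np.array([-2.0, 2.0])

# Component likelihoods p(x_k | w_i, theta_i): shape (n, c).
comp = norm.pdf(x[:, None], means[None, :], 1.0)

# Log-likelihood: log p(D|theta) = sum_k log sum_i p(x_k | w_i, theta_i) P(w_i).
log_likelihood = np.sum(np.log(comp @ priors))

# Posteriors P(w_i | x_k, theta) that appear in the gradient condition.
posteriors = comp * priors / (comp @ priors)[:, None]

print(log_likelihood)
print(posteriors.round(3))
```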
Generalization of the ML Estimate
• We can generalize these results to include the prior probability, P(ωi), among
the unknown quantities.
• The search for the maximum value of p(D|θ) extends over θ and P(ωi), subject
to the constraints:
  P(\omega_i) \geq 0, \quad i = 1, \ldots, c \qquad \text{and} \qquad \sum_{i=1}^{c} P(\omega_i) = 1
• It can be shown that the ML estimates must satisfy:

  \hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})

  \sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i) = 0

  \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \hat{\boldsymbol{\theta}}_j)\, \hat{P}(\omega_j)}
• The first equation simply states that the estimate of the prior is computed by
averaging the posteriors over the entire data set. The second equation just
restates the ML principle that the optimal value of θ produces a maximum (the
gradient of the log-likelihood vanishes). The third equation is the posterior
computation we have seen before in the HMM section of this course.
• So the good news here is that doing the obvious maximizes the posterior.
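• A one-line illustration of the first equation (an added sketch, using an invented posterior matrix): the prior estimate is just the column average of the posteriors over the data set.

```python
import numpy as np

# Hypothetical posterior matrix P_hat(w_i | x_k, theta_hat): rows are samples, columns are classes.
posteriors = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.3, 0.7],
                       [0.1, 0.9]])

# P_hat(w_i) = (1/n) * sum_k P_hat(w_i | x_k, theta_hat)
prior_hat = posteriors.mean(axis=0)
print(prior_hat)   # [0.525 0.475] for this made-up example
```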
Unknown Mean Vectors
• If the only unknown quantities are the mean vectors, μi, we can write:
  \ln p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\mu}_i) = -\ln\!\left[(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}\right] - \frac{1}{2} (\mathbf{x}_k - \boldsymbol{\mu}_i)^t\, \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_k - \boldsymbol{\mu}_i)
and its derivative:
  \nabla_{\boldsymbol{\mu}_i} \ln p(\mathbf{x}_k \mid \omega_i, \boldsymbol{\mu}_i) = \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_k - \boldsymbol{\mu}_i)
• The ML solution must satisfy:
  \sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}})\, \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i) = 0, \quad \text{where } \hat{\boldsymbol{\mu}} = (\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_c)^t
• Rearranging terms:
  \hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}})\, \mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}})}
• But this does not give μ̂i explicitly, since μ̂ appears on both sides, and it
typically has no closed-form solution. Instead, we can iterate the equation as a
hill-climbing update:
  \hat{\boldsymbol{\mu}}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}}(j))\, \mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\mu}}(j))}
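• The sketch below (added for illustration, not from the slides) iterates this update for a one-dimensional, two-component Gaussian mixture in which the priors and unit variances are assumed known; the data and starting guesses are synthetic.

```python
import numpy as np
from scipy.stats import norm

def update_means(x, means, priors, sigma=1.0, n_iter=50):
    """Fixed-point iteration for the unknown-means case (1-D, known priors and variance)."""
    for _ in range(n_iter):
        comp = norm.pdf(x[:, None], means[None, :], sigma)          # p(x_k | w_i, mu_i)
        post = comp * priors / (comp @ priors)[:, None]             # P(w_i | x_k, mu_hat(j))
        means = (post * x[:, None]).sum(axis=0) / post.sum(axis=0)  # mu_hat(j+1)
    return means

# Illustrative synthetic data and starting guesses.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
print(update_means(x, means=np.array([-1.0, 1.0]), priors=np.array([0.5, 0.5])))
```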
Example
• Consider the simple two-component, one-dimensional normal mixture.
• Generate 25 samples sequentially assuming μ1 = -2 and μ2 = 2.
• The likelihood function is
calculated as a function of the
estimates for the two means.
• We see that while the maximum is
achieved near the true means,
there are two peaks of
comparable height:
μ1 = -2.130 and μ2 = 1.668, and
μ1 = 2.085 and μ2 = -1.257.
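• A rough way to reproduce the flavor of this experiment is sketched below; the component weights (1/3 and 2/3) and the random seed are assumptions not stated on the slide, so the resulting peak locations will not match the numbers above exactly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Draw 25 samples from a two-component 1-D mixture with true means -2 and 2.
# The weights (1/3, 2/3) are an assumption; the slide does not state them.
n = 25
labels = rng.random(n) < 1.0 / 3.0
x = np.where(labels, rng.normal(-2, 1, n), rng.normal(2, 1, n))

def log_likelihood(mu1, mu2):
    mix = (1/3) * norm.pdf(x, mu1, 1) + (2/3) * norm.pdf(x, mu2, 1)
    return np.log(mix).sum()

# Evaluate the log-likelihood on a grid of (mu1, mu2) candidates and find the peak.
grid = np.linspace(-4, 4, 81)
surface = np.array([[log_likelihood(m1, m2) for m2 in grid] for m1 in grid])
i, j = np.unravel_index(surface.argmax(), surface.shape)
print("peak near mu1 =", grid[i], ", mu2 =", grid[j])
```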
All Parameters Unknown
• If the means, covariances, and priors are all unknown, the ML principle yields
singular solutions.
• If we only consider solutions in the neighborhood of the largest local
maximum, we can derive estimation equations:
  \hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})

  \hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \mathbf{x}_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})}

  \hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})\, (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}})}

  \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k \mid \omega_i, \hat{\boldsymbol{\theta}}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k \mid \omega_j, \hat{\boldsymbol{\theta}}_j)\, \hat{P}(\omega_j)}
    = \frac{|\hat{\boldsymbol{\Sigma}}_i|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t \hat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\boldsymbol{\Sigma}}_j|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_j)^t \hat{\boldsymbol{\Sigma}}_j^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_j)\right] \hat{P}(\omega_j)}
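• The sketch below (an illustrative implementation, not the course's reference code) applies these reestimation equations as one EM-style step for a d-dimensional Gaussian mixture, using SciPy for the component densities; the synthetic data in the usage portion is invented.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(x, priors, means, covs):
    """One pass of the reestimation equations above for a d-dimensional Gaussian mixture."""
    c = len(priors)
    # Posteriors P_hat(w_i | x_k, theta_hat): shape (n, c).
    like = np.column_stack([multivariate_normal.pdf(x, means[i], covs[i]) for i in range(c)])
    post = like * priors
    post /= post.sum(axis=1, keepdims=True)

    new_priors = post.mean(axis=0)                              # average posterior
    new_means = (post.T @ x) / post.sum(axis=0)[:, None]        # weighted means
    new_covs = []
    for i in range(c):                                          # weighted scatter matrices
        d = x - new_means[i]
        new_covs.append((post[:, i, None] * d).T @ d / post[:, i].sum())
    return new_priors, new_means, np.array(new_covs)

# Usage sketch: two clusters of synthetic 2-D data, a few iterations.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
priors, means, covs = np.array([0.5, 0.5]), np.array([[-1., 0.], [1., 0.]]), np.array([np.eye(2)] * 2)
for _ in range(20):
    priors, means, covs = em_step(x, priors, means, covs)
print(means.round(2))
```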
K-Means Clustering
• An approximate technique to determine the parameters of a mixture
distribution is k-Means: k is the number of cluster centers, c, and “means”
refers to the iterative process for finding the cluster centroids.
• We observe that the probability, P̂(ωi | xk, θ̂), is large when the squared
  Mahalanobis distance, (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t \hat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i), is small.
• Suppose we merely compute the squared Euclidean distance, \|\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i\|^2,
  find the mean μ̂m nearest to xk, and approximate P̂(ωi | xk, θ̂) as:

  \hat{P}(\omega_i \mid \mathbf{x}_k, \hat{\boldsymbol{\theta}}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}
• We can formally define the k-Means clustering algorithm:
  Initialize: select the number of clusters, c, and seed the means, μ1, …, μc.
  Iterate:
  o Classify the n samples according to the nearest mean, μi.
  o Recompute each mean, μi, using the ni samples assigned to cluster i.
  o Until: no change in μi.
  Done: return μ1, …, μc.
• Later we will see this is one case of an iterative optimization algorithm. There
are many ways to cluster, recompute means, merge/split clusters and stop.
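• A minimal NumPy sketch of this algorithm is shown below (added for illustration; the random seeding and the stopping test are just one of the many reasonable choices mentioned above).

```python
import numpy as np

def k_means(x, c, n_iter=100, seed=0):
    """Minimal k-Means: classify to the nearest mean, recompute the means, repeat until stable."""
    rng = np.random.default_rng(seed)
    means = x[rng.choice(len(x), size=c, replace=False)]   # seed the means from the data
    for _ in range(n_iter):
        # Classify each sample according to the nearest mean (squared Euclidean distance).
        labels = np.argmin(((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Recompute each mean from the samples assigned to that cluster.
        new_means = np.array([x[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
                              for i in range(c)])
        if np.allclose(new_means, means):                  # until: no change in the means
            break
        means = new_means
    return means, labels

# Usage sketch on synthetic 2-D data.
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(-3, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
means, labels = k_means(x, c=2)
print(means.round(2))
```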
Fuzzy K-Means Clustering
• In k-Means, each data point is assumed to reside in one and only one
cluster.
• We can allow “fuzzy” membership – a data point can appear in a cluster
with probability P̂(ωi | xk, θ̂).
• We can minimize a heuristic global cost function:
  J_{fuz} = \sum_{i=1}^{c} \sum_{j=1}^{n} \left[\hat{P}(\omega_i \mid \mathbf{x}_j, \hat{\boldsymbol{\theta}})\right]^{b} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2
where b is a free parameter that adjusts the blending of different clusters.
• The probabilities of cluster membership for each point are normalized as:
  \sum_{i=1}^{c} \hat{P}(\omega_i \mid \mathbf{x}_j) = 1
• The relevant reestimation equations are:
  \boldsymbol{\mu}_i = \frac{\sum_{j=1}^{n} \left[\hat{P}(\omega_i \mid \mathbf{x}_j)\right]^{b} \mathbf{x}_j}{\sum_{j=1}^{n} \left[\hat{P}(\omega_i \mid \mathbf{x}_j)\right]^{b}}

  \hat{P}(\omega_i \mid \mathbf{x}_j) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{r=1}^{c} (1/d_{rj})^{1/(b-1)}}, \qquad d_{ij} = \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2
• This can be viewed as a form of soft quantization, and fits nicely with our
general notion of probabilistic modeling and EM estimation.
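• The sketch below (added for illustration) implements these two reestimation equations directly, starting from random normalized memberships; the small eps term is an added safeguard against division by zero when a point coincides with a mean.

```python
import numpy as np

def fuzzy_k_means(x, c, b=2.0, n_iter=100, seed=0, eps=1e-9):
    """Minimal fuzzy k-Means sketch using the two reestimation equations above."""
    rng = np.random.default_rng(seed)
    # Start from random normalized memberships P_hat(w_i | x_j): shape (n, c).
    post = rng.random((len(x), c))
    post /= post.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        w = post ** b
        # Means: membership-weighted averages of the data points.
        means = (w.T @ x) / w.sum(axis=0)[:, None]
        # Squared Euclidean distances d_ij = ||x_j - mu_i||^2 (rows: samples, cols: clusters).
        d = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2) + eps
        # Memberships: (1/d_ij)^(1/(b-1)) normalized over the clusters.
        inv = (1.0 / d) ** (1.0 / (b - 1.0))
        post = inv / inv.sum(axis=1, keepdims=True)
    return means, post
```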
Demonstrations
Summary
• Introduced the concept of unsupervised clustering.
• Reviewed the reestimation equations for ML estimates of mixtures.
• Discussed application to Gaussian mixture distributions.
• Introduced k-Means and Fuzzy k-Means clustering.
• Demonstrated clustering using the Java Pattern Recognition applet.