Pattern
Classification
All materials in these slides were taken
from
Pattern Classification (2nd ed) by R. O.
Duda, P. E. Hart and D. G. Stork, John
Wiley & Sons, 2000
with the permission of the authors and
the publisher
Chapter 10
Unsupervised Learning & Clustering
Introduction
Mixture Densities and Identifiability
ML Estimates
Application to Normal Mixtures
K-means algorithm
Unsupervised Bayesian Learning
Data description and clustering
Criterion function for clustering
Hierarchical clustering
Low-dim reps and multidimensional scaling
(self-organizing maps)
10.1 Introduction
Previously, all our training samples were labeled: learning from such samples is said to be "supervised"
We now investigate a number of "unsupervised" procedures that use unlabeled samples
Collecting and labeling a large set of sample patterns can be costly
We can instead train with large amounts of (less expensive) unlabeled data and only then use supervision to label the groupings found. This is appropriate for large "data mining" applications where the contents of a large database are not known beforehand
This is also appropriate in many
applications when the characteristics of
the patterns can change slowly with time.
Improved performance can be achieved if classifiers running in an unsupervised mode are used
We can use unsupervised methods to
identify features that will then be useful
for categorization
We gain some insight into the nature (or
structure) of the data
10.2 Mixture Densities &
Identifiability
We shall begin with the assumption that the
functional forms for the underlying probability
densities are known and that the only thing that
must be learned is the value of an unknown
parameter vector
We make the following assumptions:
The samples come from a known number c of classes
The prior probabilities P(ωⱼ) for each class are known (j = 1, …, c)
The forms of the class-conditional densities p(x | ωⱼ, θⱼ) (j = 1, …, c) are known
The values of the c parameter vectors θ₁, θ₂, …, θc are unknown
The category labels are unknown

$$p(x \mid \theta) = \sum_{j=1}^{c} \underbrace{p(x \mid \omega_j, \theta_j)}_{\text{component densities}}\ \underbrace{P(\omega_j)}_{\text{mixing parameters}}$$

where θ = (θ₁, θ₂, …, θc)ᵗ
This density function is called a mixture density
Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ.
Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities.
Definition
A density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that p(x | θ) ≠ p(x | θ′)
As a simple example, consider the case where x is binary and P(x | θ) is the mixture:

$$P(x \mid \theta) = \tfrac{1}{2}\,\theta_1^{x}(1-\theta_1)^{1-x} + \tfrac{1}{2}\,\theta_2^{x}(1-\theta_2)^{1-x}
= \begin{cases} \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 1 \\[4pt] 1 - \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 0 \end{cases}$$

Assume that:
P(x = 1 | θ) = 0.6 and P(x = 0 | θ) = 0.4
By substituting these probability values, we obtain:
θ₁ + θ₂ = 1.2
but we cannot determine θ₁ and θ₂ individually.
Thus, we have a case in which the mixture distribution
is completely unidentifiable, and therefore unsupervised
learning is impossible.
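To make this concrete, here is a minimal Python check (not from the text; the two parameter pairs are illustrative) that any θ₁, θ₂ with the same sum 1.2 yield exactly the same observable distribution, so no amount of data can separate them:

```python
# Minimal numeric check of the unidentifiability example: any pair
# (theta1, theta2) with the same sum gives the same Bernoulli mixture.

def mixture_p_x1(theta1, theta2):
    """P(x = 1 | theta) = 0.5*theta1 + 0.5*theta2 for the two-component mixture."""
    return 0.5 * theta1 + 0.5 * theta2

print(mixture_p_x1(0.7, 0.5))   # 0.6
print(mixture_p_x1(0.9, 0.3))   # also 0.6 -> theta is not identifiable
```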
With discrete distributions, if there are too many components in the mixture, there may be more unknowns than independent equations, and identifiability can become a serious problem!
While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

$$p(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x-\theta_2)^2\right]$$

cannot be uniquely identified if P(ω₁) = P(ω₂)
(we cannot recover a unique θ even from an infinite amount of data!)
θ = (θ₁, θ₂) and θ = (θ₂, θ₁) are two possible vectors that can be interchanged without affecting p(x | θ).
Although identifiability can be a problem, we shall always assume that the densities we are dealing with are identifiable!
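A small sketch (NumPy assumed) confirms this label-swapping symmetry: with equal mixing probabilities, interchanging θ₁ and θ₂ leaves the mixture density unchanged at every x.

```python
# Verify that swapping (theta1, theta2) leaves the equal-weight normal
# mixture unchanged, so theta cannot be recovered uniquely.
import numpy as np

def mixture(x, t1, t2):
    g = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
    return 0.5 * g(t1) + 0.5 * g(t2)

x = np.linspace(-5, 5, 11)
print(np.allclose(mixture(x, -2.0, 2.0), mixture(x, 2.0, -2.0)))  # True
```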
10.3 ML Estimates
Suppose that we have a set D = {x₁, …, xₙ} of n unlabeled samples drawn independently from the mixture density

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)$$

(θ is fixed but unknown!)

$$\hat{\theta} = \arg\max_{\theta}\ p(D \mid \theta) \quad \text{with} \quad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

The gradient of the log-likelihood is:

$$\nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)$$
Since the gradient must vanish at the value of θᵢ that maximizes the log-likelihood $l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$, the ML estimate θ̂ᵢ must satisfy the conditions

$$\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0 \qquad (i = 1, \ldots, c) \tag{a}$$

By including the prior probabilities as unknown variables, we finally obtain:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

and

$$\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0$$

where:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}$$
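As an illustration of this last formula, the sketch below (NumPy assumed; one-dimensional normal components with unit variance, and means/priors that are made-up placeholders) computes the posteriors P̂(ωᵢ | xₖ, θ̂) for a small batch of samples:

```python
# Posterior class probabilities P(omega_i | x_k, theta) for 1-D unit-variance
# normal components; means and priors here are illustrative only.
import numpy as np

def responsibilities(x, means, priors):
    """Return an (n, c) array whose (k, i) entry is P(omega_i | x_k, theta)."""
    x = np.asarray(x)[:, None]                                   # shape (n, 1)
    dens = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)  # p(x_k | w_i)
    joint = dens * priors                                        # p(x_k | w_i) P(w_i)
    return joint / joint.sum(axis=1, keepdims=True)

r = responsibilities([-1.5, 0.2, 2.3],
                     means=np.array([-2.0, 2.0]),
                     priors=np.array([0.5, 0.5]))
print(r.sum(axis=1))   # each row sums to 1
```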
10.4 Applications to Normal
Mixtures
p(x | ωᵢ, θᵢ) ~ N(μᵢ, Σᵢ)

Case   μᵢ   Σᵢ   P(ωᵢ)   c
 1     ?    ×     ×      ×
 2     ?    ?     ?      ×
 3     ?    ?     ?      ?

(? = unknown, × = known)
Case 1 = Simplest case
Case 1: Unknown mean vectors
θᵢ = μᵢ, i = 1, …, c
The ML estimate of μ = (μ₁, …, μc) must satisfy:

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})} \tag{1}$$

P(ωᵢ | xₖ, μ̂) is the fraction of those samples having value xₖ that come from the i-th class, and μ̂ᵢ is the average of the samples coming from the i-th class.
Unfortunately, equation (1) does not give μ̂ᵢ explicitly
However, if we have some way of obtaining good initial estimates μ̂ᵢ(0) for the unknown means, then equation (1) can be seen as an iterative process for improving the estimates:

$$\hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))}$$
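One possible implementation of this iteration, assuming NumPy, unit-variance one-dimensional components, known equal priors, and arbitrary initial guesses, is sketched below:

```python
# Iterative mean update mu_i(j+1) from the posteriors P(w_i | x_k, mu(j))
# for a 1-D normal mixture with known equal priors and unit variance.
import numpy as np

def update_means(x, means, priors):
    x = np.asarray(x)[:, None]
    dens = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)
    post = dens * priors
    post /= post.sum(axis=1, keepdims=True)           # P(w_i | x_k, mu(j))
    return (post * x).sum(axis=0) / post.sum(axis=0)  # weighted sample means

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
means = np.array([-0.5, 0.5])                          # initial guesses mu(0)
for _ in range(50):
    means = update_means(x, means, np.array([0.5, 0.5]))
print(means)   # close to (-2, 2) for this synthetic sample
```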
This is a gradient ascent procedure for maximizing the log-likelihood function
Example:
Consider the simple two-component one-dimensional normal mixture

$$p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x-\mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}(x-\mu_2)^2\right]$$

(2 clusters!)
Let's set μ₁ = -2, μ₂ = 2 and draw 25 samples sequentially from this mixture.
The log-likelihood function is:

$$l(\mu_1, \mu_2) = \sum_{k=1}^{n} \ln p(x_k \mid \mu_1, \mu_2)$$
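The experiment can be reproduced approximately with the sketch below (NumPy assumed); since the sample is random, the peak location will only be near the values reported next.

```python
# Draw 25 samples from the 1/3-2/3 mixture with mu1 = -2, mu2 = 2 and locate
# the maximum of the log-likelihood l(mu1, mu2) on a coarse grid.
import numpy as np

rng = np.random.default_rng(1)
comp = rng.random(25) < (1.0 / 3.0)                 # choose component per sample
x = np.where(comp, rng.normal(-2, 1, 25), rng.normal(2, 1, 25))

def loglik(m1, m2):
    p = (np.exp(-0.5 * (x - m1) ** 2) / (3 * np.sqrt(2 * np.pi))
         + 2 * np.exp(-0.5 * (x - m2) ** 2) / (3 * np.sqrt(2 * np.pi)))
    return np.log(p).sum()

grid = np.linspace(-4, 4, 81)
L = np.array([[loglik(a, b) for b in grid] for a in grid])
i, j = np.unravel_index(L.argmax(), L.shape)
print(grid[i], grid[j])                             # grid location of the ML peak
```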
The maximum value of l occurs at:
μ̂₁ = -2.130 and μ̂₂ = 1.668
(which are not far from the true values: μ₁ = -2 and μ₂ = +2)
There is another peak at μ̂₁ = 2.085 and μ̂₂ = -1.257 which has almost the same height, as can be seen in the following figure.
This mixture of normal densities is
identifiable
When the mixture density is not identifiable,
the ML solution is not unique
Case 2: All parameters unknown
No constraints are placed on the
covariance matrix
Let p(x | μ, σ²) be the two-component normal mixture:

$$p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}x^2\right]$$
Suppose μ = x₁; then:

$$p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}x_1^2\right]$$

For the rest of the samples:

$$p(x_k \mid \mu, \sigma^2) \geq \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}x_k^2\right]$$

Finally,

$$p(x_1, \ldots, x_n \mid \mu, \sigma^2) \geq \left\{\frac{1}{\sigma} + \exp\!\left[-\tfrac{1}{2}x_1^2\right]\right\} \frac{1}{(2\sqrt{2\pi})^n} \exp\!\left[-\tfrac{1}{2}\sum_{k=2}^{n} x_k^2\right]$$

As σ → 0, the 1/σ term grows without bound, so the likelihood can be made arbitrarily large and the maximum-likelihood solution becomes singular.
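A quick numeric illustration of this singularity (NumPy assumed, with an arbitrary made-up sample): fixing μ = x₁ and shrinking σ makes the likelihood grow without bound.

```python
# With mu fixed at x1, the two-component likelihood diverges as sigma -> 0.
import numpy as np

x = np.array([0.9, -0.3, 1.7, 0.2, -1.1])        # illustrative sample, mu = x[0]

def likelihood(sigma):
    mu = x[0]
    c1 = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (2 * np.sqrt(2 * np.pi) * sigma)
    c2 = np.exp(-0.5 * x ** 2) / (2 * np.sqrt(2 * np.pi))
    return np.prod(c1 + c2)

for s in (1.0, 0.1, 0.01, 0.001):
    print(s, likelihood(s))                      # grows as sigma shrinks
```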
Adding an assumption
Consider the largest of the finite local maxima of the likelihood function and use ML estimation.
We obtain the following iterative scheme:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

$$\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\,(x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$
where:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}
= \frac{|\hat{\Sigma}_i|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1}(x_k - \hat{\mu}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\!\left[-\tfrac{1}{2}(x_k - \hat{\mu}_j)^t \hat{\Sigma}_j^{-1}(x_k - \hat{\mu}_j)\right] \hat{P}(\omega_j)}$$
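The sketch below (NumPy assumed, with a small ridge added to keep the covariances invertible; not the book's code) implements one pass of these update equations for a d-dimensional normal mixture. A full run would start from rough initial guesses and call em_step repeatedly until the estimates stop changing.

```python
# One iteration of the update equations above for a d-dimensional normal
# mixture with all parameters (priors, means, covariances) unknown.
import numpy as np

def em_step(X, priors, means, covs):
    n, d = X.shape
    c = len(priors)
    post = np.empty((n, c))
    for i in range(c):
        diff = X - means[i]
        inv = np.linalg.inv(covs[i])
        norm = ((2 * np.pi) ** d * np.linalg.det(covs[i])) ** -0.5
        post[:, i] = priors[i] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
    post /= post.sum(axis=1, keepdims=True)            # P^(w_i | x_k, theta^)

    new_priors = post.mean(axis=0)                      # P^(w_i)
    new_means = (post.T @ X) / post.sum(axis=0)[:, None]
    new_covs = []
    for i in range(c):
        diff = X - new_means[i]
        cov = (post[:, i, None] * diff).T @ diff / post[:, i].sum()
        new_covs.append(cov + 1e-6 * np.eye(d))         # small ridge for stability
    return new_priors, new_means, np.array(new_covs)
```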
K-Means Clustering
Goal: find the c mean vectors μ₁, μ₂, …, μc
Replace the squared Mahalanobis distance $(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1}(x_k - \hat{\mu}_i)$ by the squared Euclidean distance $\|x_k - \hat{\mu}_i\|^2$
Find the mean μ̂ₘ nearest to xₖ and approximate P̂(ωᵢ | xₖ, μ̂) as:

$$\hat{P}(\omega_i \mid x_k, \hat{\mu}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$

Use the iterative scheme to find μ̂₁, μ̂₂, …, μ̂c
If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

begin initialize n, c, μ₁, μ₂, …, μc (randomly selected)
    do  classify the n samples according to the nearest μᵢ
        recompute each μᵢ as the mean of the samples assigned to it
    until no change in μᵢ
    return μ₁, μ₂, …, μc
end
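A compact Python version of this pseudocode, assuming NumPy and that no cluster ever becomes empty, might look like this:

```python
# Minimal k-means following the pseudocode above; initialization picks c
# samples as starting means. Assumes no cluster becomes empty.
import numpy as np

def k_means(X, c, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=c, replace=False)]   # initial mu_1..mu_c
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # classify each sample according to the nearest mean (squared Euclidean)
        labels = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        if np.allclose(new_means, means):                   # no change -> stop
            break
        means = new_means
    return means, labels
```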
Considering the example in the previous figure
10.5 Unsupervised Bayesian Learning
Besides the ML estimate, the Bayesian estimation technique can also be used in the unsupervised case: θ is a random variable with a known prior distribution p(θ), and we use the training samples to compute the posterior density p(θ | D)
The number of classes c is known
The prior probabilities P(ωⱼ) for each class are known
The forms of the densities p(x | ωⱼ, θⱼ) are known, but the parameter vector θ is not
Part of the knowledge about θ is contained in a known prior density p(θ)
The rest is contained in the training samples drawn independently from the familiar mixture density

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)$$
• We can compute p(θ | D) as seen previously:

$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)} = \frac{p(x \mid \omega_i, D)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j)}$$

and, passing through the usual formulation introducing the unknown parameter vector θ:

$$p(x \mid \omega_i, D) = \int p(x, \theta \mid \omega_i, D)\, d\theta = \int p(x \mid \theta, \omega_i, D)\, p(\theta \mid \omega_i, D)\, d\theta = \int p(x \mid \omega_i, \theta_i)\, p(\theta \mid D)\, d\theta$$

• Hence, the best estimate of p(x | ωᵢ) is obtained by averaging p(x | ωᵢ, θᵢ) over θᵢ.
• Since everything depends on p(θ | D), this is the central issue of the problem.
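If samples from p(θ | D) were available, this integral could be approximated by a simple Monte Carlo average; the sketch below (NumPy assumed, with a stand-in normal posterior used purely for illustration) shows the idea.

```python
# Approximate p(x | w_i, D) by averaging p(x | w_i, theta_i) over draws from
# an assumed posterior p(theta | D); the posterior here is a stand-in normal.
import numpy as np

rng = np.random.default_rng(3)
theta_draws = rng.normal(-2.0, 0.2, 5000)        # stand-in samples from p(theta | D)

def p_x_given_class(x, theta):
    return np.exp(-0.5 * (x - theta) ** 2) / np.sqrt(2 * np.pi)

x = 0.5
print(p_x_given_class(x, theta_draws).mean())    # Monte Carlo estimate of p(x | w_i, D)
```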
Using Bayes:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$

Assuming the independence of the samples:

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

or, alternatively (denoting by Dⁿ the set of n samples):

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}$$

If p(θ) is almost uniform in the region where p(D | θ) peaks, then p(θ | D) peaks in the same place.
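For a scalar parameter the recursive update can be carried out numerically on a grid; the sketch below (NumPy assumed, with an illustrative one-parameter mixture and a roughly uniform prior) renormalizes the posterior after each sample.

```python
# Grid-based recursion p(theta | D^n) ∝ p(x_n | theta) p(theta | D^{n-1})
# for a single unknown mean theta in a two-component mixture (illustrative).
import numpy as np

theta = np.linspace(-6, 6, 601)                   # discretized parameter axis
posterior = np.ones_like(theta)                   # roughly uniform prior p(theta)
posterior /= posterior.sum()                      # treat as discrete grid weights

def mixture_lik(x, theta):
    # p(x | theta): component 1 has unknown mean theta, component 2 has mean 2
    g = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)
    return 0.5 * g(theta) + 0.5 * g(2.0)

rng = np.random.default_rng(2)
for x in rng.normal(-2, 1, 30):                   # illustrative samples, mean -2
    posterior *= mixture_lik(x, theta)
    posterior /= posterior.sum()                  # renormalize after each sample
print(theta[posterior.argmax()])                  # posterior mode near -2
```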
If the only significant peak occurs at θ = θ̂ and the peak is very sharp, then

$$p(x \mid \omega_i, D) \approx p(x \mid \omega_i, \hat{\theta})$$

and

$$P(\omega_i \mid x, D) \approx \frac{p(x \mid \omega_i, \hat{\theta}_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, \hat{\theta}_j)\, P(\omega_j)}$$
Therefore, the ML estimate is justified.
Both approaches coincide if large amounts of data are
available.
In small sample-size problems they may or may not agree, depending on the form of the distributions
The ML method is typically easier
to implement than the Bayesian one
The formal Bayesian solution to the unsupervised learning of
the parameters of a mixture density is similar to the supervised
learning of the parameters of a component density.
But, there are significant differences: the issue of identifiability,
and the computational complexity
The issue of identifiability
With SL, the lack of identifiability means that we do not obtain a unique parameter vector θ but an equivalence class; this presents no theoretical difficulty, since all members yield the same component density.
With UL, the lack of identifiability means that the mixture cannot be decomposed into its true components:
p(x | Dⁿ) may still converge to p(x), but p(x | ωᵢ, Dⁿ) will not in general converge to p(x | ωᵢ), hence there is a theoretical barrier.
The computational complexity
With SL, sufficient statistics allow the solutions to be computationally feasible
With UL, samples come from a mixture density and there is little hope of finding simple exact solutions for p(D | θ).
Another way of comparing UL and SL is to consider the usual recursive equation with the mixture density written explicitly:

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}
= \frac{\left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\right] p(\theta \mid D^{n-1})}{\int \left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\right] p(\theta \mid D^{n-1})\, d\theta}$$

If we consider the case in which P(ω₁) = 1 and all other prior probabilities are zero, corresponding to the supervised case in which all samples come from class ω₁, then we get
$$p(\theta \mid D^n) = \frac{p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})\, d\theta}$$

Comparing the two equations, we see how observing an additional sample changes the estimate of θ.
Ignoring the denominator, which is independent of θ, the only significant difference is that
in SL, we multiply the "prior" density for θ by the component density p(xₙ | ω₁, θ₁)
in UL, we multiply the "prior" density by the whole mixture $\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)$
Assuming that the sample did come from class ω₁, the effect of not knowing this category is to diminish the influence of xₙ in changing θ.