國立雲林科技大學
National Yunlin University of Science and Technology
k*-Means: A new generalized k-means clustering algorithm
Advisor: Dr. Hsu
Graduate Student: Yu Cheng Chen
Author: Yiu Ming Cheung
Pattern Recognition Letters 24 (2003), Elsevier
Intelligent Database Systems Lab
Outline
Motivation
Objective
Introduction
A metric for data clustering
Rival penalized mechanism analysis of the metric
k*-Means algorithm
Conclusions
Personal Opinion
Motivation
k-means has three major drawbacks:
It implicitly assumes that the data clusters are ball-shaped.
It suffers from the dead-unit problem.
It requires the cluster number to be pre-determined.
Objective
Presenting a generalized k-means algorithm that is
applicable to ellipse-shaped data clusters, avoids the
dead-unit problem, and performs correct clustering
without pre-assigning the exact cluster number.
Introduction
k-means
$$
I(j \mid x_t) =
\begin{cases}
1, & \text{if } j = \arg\min_{1 \le r \le k} \lVert x_t - m_r \rVert^2, \\
0, & \text{otherwise,}
\end{cases}
\qquad (1)
$$

$$
m_w^{\text{new}} = m_w^{\text{old}} + \eta \,\bigl(x_t - m_w^{\text{old}}\bigr), \qquad (2)
$$

where w is the index of the winning seed point and η is a small positive learning rate.
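For concreteness, here is a minimal sketch of this winner-take-all k-means update in Python; the function name and the value of the learning rate eta are illustrative choices, not taken from the paper:

```python
import numpy as np

def kmeans_online_step(x_t, centers, eta=0.05):
    """One competitive-learning step of classical k-means (Eqs. (1)-(2)).

    x_t     : a single input vector, shape (d,)
    centers : current seed points m_1..m_k, shape (k, d)
    eta     : small positive learning rate (illustrative value)
    """
    # Eq. (1): winner-take-all assignment -- the closest seed point wins.
    dists = np.sum((centers - x_t) ** 2, axis=1)
    w = int(np.argmin(dists))

    # Eq. (2): move only the winning seed point m_w toward x_t.
    centers[w] = centers[w] + eta * (x_t - centers[w])
    return w, centers
```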
Introduction
We will present a new clustering technique named the
STep-wise Automatic Rival-penalized (STAR) k-means algorithm (denoted as k*-means hereafter).
The k*-means algorithm consists of two separate steps.
The first one is a pre-processing procedure, which assigns at least one seed
point to each cluster.
The second step then adjusts the units.
A metric for data clustering
Suppose N inputs x_1, x_2, ..., x_N are independently and identically drawn from a mixture-of-Gaussians population:

$$
p^*(x;\Theta^*) = \sum_{j=1}^{k^*} \alpha_j^* \, G\!\left(x \mid m_j^*, \Sigma_j^*\right), \qquad (3)
$$

where k* is the mixture number and $\Theta^* = \{(\alpha_j^*, m_j^*, \Sigma_j^*) \mid 1 \le j \le k^*\}$.
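A small sketch of how the mixture density in Eq. (3) can be evaluated numerically; the component parameters in the usage example are purely illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, alphas, means, covs):
    """Evaluate the Gaussian mixture density p*(x; Theta*) of Eq. (3).

    alphas : mixing proportions alpha_1..alpha_k*, summing to 1
    means  : component means m_1..m_k*
    covs   : component covariance matrices Sigma_1..Sigma_k*
    """
    return sum(a * multivariate_normal(mean=m, cov=S).pdf(x)
               for a, m, S in zip(alphas, means, covs))

# Usage with two components (parameters chosen only for illustration).
alphas = [0.4, 0.6]
means  = [np.zeros(2), np.array([3.0, 3.0])]
covs   = [np.eye(2), np.diag([0.5, 2.0])]
print(mixture_density(np.array([1.0, 1.0]), alphas, means, covs))
```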
A metric for data clustering
Both k* and Θ* are unknown and need to be estimated. We therefore
model the inputs by a density mixture of k Gaussians, p(x; Θ), where k need not equal the true number k*.
A metric for data clustering
We measure the distance between p* and p by the
following Kullback–Leibler divergence function:
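The divergence itself is not reproduced on the slide; in its standard form, the Kullback–Leibler divergence between p* and p reads

$$
KL\!\left(p^* \,\Vert\, p\right) \;=\; \int p^*(x;\Theta^*)\,\ln\frac{p^*(x;\Theta^*)}{p(x;\Theta)}\,\mathrm{d}x .
$$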
A metric for data clustering
It can be seen that minimizing Eq. (8) is equivalent to
the maximum likelihood (ML) learning of Θ, i.e.,
minimizing Eq. (7).
Here, we prefer to perform clustering based on the
winner-take-all principle. That is, we assign an input x
into cluster j according to the rule in Eq. (10).
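The exact form of Eq. (10) is not shown on the slide; the sketch below assumes the usual winner-take-all rule of assigning x_t to the component with the largest weighted Gaussian density, i.e. the largest posterior (function and variable names are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def winner_take_all(x_t, alphas, means, covs):
    """Hard (winner-take-all) assignment of x_t to one mixture component.

    Assumed form of the rule: pick the component j maximizing the weighted
    density alpha_j * G(x_t | m_j, Sigma_j).
    """
    scores = [a * multivariate_normal(mean=m, cov=S).pdf(x_t)
              for a, m, S in zip(alphas, means, covs)]
    j = int(np.argmax(scores))
    # I(j | x_t) = 1 for the winner, 0 for all other components.
    indicator = np.zeros(len(alphas))
    indicator[j] = 1.0
    return j, indicator
```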
A metric for data clustering
Eq. (10) can be further specified as Eq. (11).
Consequently, minimizing Eq. (8) is approximately equivalent to
minimizing the clustering metric of Eq. (14).
A metric for data clustering
As N is large enough, the term
$$
\frac{1}{N}\sum_{t=1}^{N}\ln p^*(x_t;\Theta^*)
$$
approaches a constant that does not depend on Θ, and can therefore be dropped from the minimization.
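Spelling out the reasoning: with the inputs treated as an i.i.d. sample, the divergence splits into this constant term and the negative average log-likelihood, which is why minimizing it coincides with ML learning of Θ (a standard identity, stated here for clarity):

$$
KL(p^*\,\Vert\,p) \;\approx\;
\underbrace{\frac{1}{N}\sum_{t=1}^{N}\ln p^*(x_t;\Theta^*)}_{\text{constant w.r.t. }\Theta}
\;-\; \frac{1}{N}\sum_{t=1}^{N}\ln p(x_t;\Theta).
$$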
Rival penalized mechanism analysis of the metric
k*-Means algorithm
The k*-means algorithm consists of two steps.
The first step lets each cluster acquire at least
one seed point.
The other step adjusts the parameter set Θ by
minimizing Eq. (14) while clustering the data
points by Eq. (11).
k*-Means algorithm
Step 1:
Step 1.1: Randomly initialize the k seed points m_1, ..., m_k.
Step 1.2: For each input, update the winning seed point and penalize its rival (a sketch follows below).
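The concrete update rule of Step 1.2 is not legible on the slide; the sketch below assumes a classical rival-penalized competitive-learning step (winner attracted toward the input, nearest rival slightly repelled), with illustrative learning rates:

```python
import numpy as np

def step1_rival_penalized_update(x_t, centers, eta_w=0.05, eta_r=0.005):
    """One rival-penalized competitive-learning update for Step 1.2 (assumed form).

    x_t     : a single input vector, shape (d,)
    centers : current seed points, shape (k, d), k >= 2
    eta_w, eta_r : illustrative learning rates for winner and rival
    """
    dists = np.sum((centers - x_t) ** 2, axis=1)
    order = np.argsort(dists)
    w, r = int(order[0]), int(order[1])        # winner and its nearest rival

    centers[w] += eta_w * (x_t - centers[w])   # attract the winner toward x_t
    centers[r] -= eta_r * (x_t - centers[r])   # penalize (repel) the rival
    return centers
```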
k*-Means algorithm
Step 2: Initialize α_j = 1/k for j = 1, 2, ..., k, and let Σ_j be
the covariance matrix of those data points assigned to cluster j.
Step 2.1: Given a data point x_t, calculate I(j | x_t) by
Eq. (11).
Step 2.2: Update the winning seed point m_w only (an illustrative sketch follows below).
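The slide omits the concrete update equation of Step 2.2; the following sketch assumes a simple adaptive version that moves only the winner's mean toward x_t and re-estimates the mixing proportions from winning frequencies (names, rates, and the proportion update are ours, not the paper's):

```python
import numpy as np
from scipy.stats import multivariate_normal

def step2_update(x_t, alphas, means, covs, counts, eta=0.05):
    """One adaptation step of Step 2 (illustrative sketch only).

    alphas : current mixing proportions, shape (k,)
    means  : current seed points, shape (k, d)
    covs   : current covariance matrices, length k
    counts : accumulated winning counts per component, shape (k,)
    """
    # Step 2.1: hard assignment I(j | x_t) by the winner-take-all rule (Eq. (11)).
    scores = [a * multivariate_normal(mean=m, cov=S).pdf(x_t)
              for a, m, S in zip(alphas, means, covs)]
    w = int(np.argmax(scores))

    # Step 2.2: update only the winning seed point m_w.
    means[w] = means[w] + eta * (x_t - means[w])

    # Re-estimate mixing proportions from the accumulated winning frequencies.
    counts[w] += 1
    alphas = counts / counts.sum()
    return w, alphas, means
```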
Experimental results
In Experiment 2, we used 2000 data points that are also
drawn from a mixture of three Gaussians.
We randomly initialized six seed points in the input
data space.
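A minimal sketch of how such a data set and seed points could be generated; the Gaussian parameters below are illustrative placeholders, since the paper's exact values are not reproduced on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters only -- not the paper's actual means/covariances.
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([4.0, -4.0])]
covs  = [np.diag([1.0, 0.5]), np.diag([0.5, 1.5]), np.eye(2)]
sizes = [700, 700, 600]                      # 2000 points in total

# Draw from the mixture of three Gaussians and shuffle the sample.
data = np.vstack([rng.multivariate_normal(m, S, n)
                  for m, S, n in zip(means, covs, sizes)])
rng.shuffle(data)

# Six seed points initialized at random within the range of the input data.
lo, hi = data.min(axis=0), data.max(axis=0)
seeds = rng.uniform(lo, hi, size=(6, 2))
```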
Conclusions
Not only is this new algorithm applicable to ellipse-shaped
data clusters as well as ball-shaped ones without the dead-unit problem, but it also performs correct clustering
without pre-determining the exact cluster number.
Personal Opinion
…