國立雲林科技大學
National Yunlin University of Science and Technology
A Two-Way Visualization Method for
Clustered Data
Advisor: Dr. Hsu
Presenter: Keng-Wei Chang
Authors: Yehuda Koren and David Harel
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline
Motivation
Objective
Introduction
Basic Notions
Computing The x-Coordinates
Computing The y-Coordinates
Result
Related Work
Conclusions
Personal Opinion
Motivation
A number of technological developments have led to an explosion of raw data that has to be analyzed
We are especially interested in two families of tools in this domain:
clustering algorithms and data visualization methods
Objective
In this paper, we integrate the two approaches:
hierarchical clustering depicted as a dendrogram
low-dimensional embedding
Introduction
Clustering methods can be broadly classified as hierarchical or partitional
Introduction
Our main interest here is hierarchical clustering
The clustering hierarchy is often visualized as a dendrogram, a full binary tree
The dendrogram has a significant disadvantage:
it does not provide an exploratory visual representation of the data itself
Another issue is that of cluster validity
Introduction
We are particularly interested in methods for achieving a low-dimensional embedding of data:
principal component analysis (PCA)
multidimensional scaling (MDS)
force-directed placement
These methods overcome some limitations of the dendrogram,
but they cannot utilize external clustering information
Introduction
A demonstration of the relative merits of the two approaches:
a dendrogram vs. a low-dimensional embedding
Basic Notions
Given data about n elements {1,…,n}
Relationships between pairs of elements are described by
distances $d_{ij} \ge 0$ or
similarities $w_{ij} \ge 0$
A 2-dimensional embedding of the data
is defined by two vectors $x, y \in \mathbb{R}^n$
The coordinates of element i are $(x_i, y_i)$
Computing The x-Coordinates
The embedding must place each element exactly below its corresponding leaf in the dendrogram
This means that the x-coordinate of each element must match that of its corresponding leaf in the dendrogram
We therefore face the problem of computing the x-coordinates of the dendrogram leaves
in a way that preserves the relationships among the data as much as possible
Computing The x-Coordinates
We exhaust all the existing methods, opting for a twofold process
First, find the best orientation of the dendrogram; this step determines the ordering of the leaves
Second, decide on the exact gaps between consecutive leaves in the ordering
Dendrogram orientation
A dendrogram with n leaves has $2^{n-1}$ different orientations, because the two children of each of its n-1 internal nodes can be swapped independently
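The orientation count can be checked on a small case. The sketch below is illustrative only: it assumes a dendrogram encoded as nested Python tuples with integer leaves (a hypothetical encoding, not taken from the paper) and enumerates every leaf ordering reachable by swapping the children of internal nodes.

```python
from itertools import product

def orientations(tree):
    """Yield every leaf ordering obtainable by swapping children of internal nodes."""
    if not isinstance(tree, tuple):          # a leaf
        yield [tree]
        return
    left, right = tree
    for l_order, r_order in product(list(orientations(left)),
                                    list(orientations(right))):
        yield l_order + r_order              # children kept as drawn
        yield r_order + l_order              # children swapped

# Hypothetical dendrogram with n = 5 leaves and n - 1 = 4 internal nodes.
tree = ((0, 1), (2, (3, 4)))
orders = list(orientations(tree))
print(len(orders))                           # 16, i.e. 2 ** (5 - 1)
```

Each internal node contributes one independent binary choice, which is why the count doubles per internal node.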
Dendrogram orientation
One way of formally defining what should be considered a “good” ordering:
associate a cost function with the dendrogram, such that finding the best ordering is equivalent to optimizing this function
One such function comes from the classical minimum linear arrangement problem, which minimizes
$$\mathrm{LA}_{\mathrm{sim}}(x) \stackrel{\mathrm{def}}{=} \sum_{i,j} w_{ij} \cdot |x_i - x_j|$$
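As a rough illustration of the linear arrangement cost, the sketch below sums w_ij · |x_i − x_j| for two candidate orderings. The similarity matrix and the function name `la_cost` are made up for this example, and each unordered pair is counted once, which is an assumption about how the sum over i, j is meant.

```python
def la_cost(weights, position):
    """Sum of w_ij * |x_i - x_j| over unordered pairs i < j (assumed convention)."""
    n = len(position)
    return sum(weights[i][j] * abs(position[i] - position[j])
               for i in range(n) for j in range(i + 1, n))

# Hypothetical symmetric similarities: elements 0,1 are close, and so are 2,3.
w = [[0, 5, 1, 0],
     [5, 0, 1, 0],
     [1, 1, 0, 4],
     [0, 0, 4, 0]]

good = [1, 2, 3, 4]    # position[i] is the slot of element i
bad = [1, 3, 2, 4]     # separates the similar pairs

print(la_cost(w, good))   # 12
print(la_cost(w, bad))    # 20
```

The ordering that keeps similar elements adjacent has the lower cost, which is the behaviour the minimization is after.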
Dendrogram orientation
In our particular problem we are also faced with an ordering task: finding a permutation of {1, …, n}
However, here we should not consider all possible permutations, but only those that agree with the dendrogram’s structure
This reduces the number of candidate orderings from $n!$ to $2^{n-1}$
Using dynamic programming, the running time is exponential in the dendrogram’s height, not in its size
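To make "agreeing with the dendrogram's structure" concrete, the sketch below treats a permutation as valid when the leaves of every subtree occupy consecutive slots. This is a brute-force check over all permutations, not the dynamic programming algorithm mentioned above, and the nested-tuple tree encoding and helper names are assumptions made for the example.

```python
from itertools import permutations

def leaves(tree):
    """Leaves of a nested-tuple tree, left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def agrees(tree, order):
    """True if every subtree's leaves occupy consecutive slots in `order`."""
    if not isinstance(tree, tuple):
        return True
    slots = sorted(order.index(leaf) for leaf in leaves(tree))
    contiguous = slots[-1] - slots[0] + 1 == len(slots)
    return contiguous and agrees(tree[0], order) and agrees(tree[1], order)

tree = ((0, 1), (2, 3))                      # hypothetical 4-leaf dendrogram
all_orders = list(permutations(range(4)))
valid = [p for p in all_orders if agrees(tree, p)]

print(len(all_orders), len(valid))           # 24 8
```

For four leaves, only 8 of the 24 permutations survive, matching 2^(n-1) with n = 4.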
Dendrogram orientation
We introduce an additional form of the cost function, based on distances, which is maximized:
$$\mathrm{LA}_{\mathrm{dist}}(x) \stackrel{\mathrm{def}}{=} \sum_{i,j} d_{ij} \cdot |x_i - x_j|$$
Dendrogram orientation
Given an ordered dendrogram T and a node v:
Leaves(v): the set of leaves in the subtree rooted at v
Let x be the ordering on the leaves
Let S be Leaves(v)
Let L be the set of leaves to the left of S
Let R be the set of leaves to the right of S
If |L| = l and |S| = s, we have x(L) = {1,…,l},
x(S) = {l+1,…,l+s}, x(R) = {l+s+1,…,n}
Dendrogram orientation
A key concept of the algorithm is the local arrangement cost, defined as:
$$\mathrm{LocalLA}^T(v) \stackrel{\mathrm{def}}{=} \sum_{i,j \in S} w_{ij} \cdot |x_i - x_j| + \sum_{i \in S,\, j \in L} w_{ij} \cdot (x_i - l) + \sum_{i \in S,\, j \in R} w_{ij} \cdot (l + s - x_i) + \sum_{i \in L,\, j \in R} w_{ij} \cdot s$$
where |L| = l, |S| = s, x(L) = {1,…,l}, x(S) = {l+1,…,l+s}, and x(R) = {l+s+1,…,n}
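The local arrangement cost can be read as the part of every pair's length that falls inside the slots occupied by S. The sketch below is one possible reading, with four terms: pairs inside S, pairs from S to L, pairs from S to R, and pairs that cross S from L to R. The function name, the data, and the convention of counting each unordered pair inside S once are assumptions made for illustration.

```python
def local_la(w, order, l, s):
    """Local arrangement cost of a node whose leaves S occupy slots l+1 .. l+s.

    w      symmetric similarity matrix (list of lists)
    order  leaves listed left to right, so order[0] sits at slot 1
    """
    x = {leaf: slot for slot, leaf in enumerate(order, start=1)}
    L, S, R = order[:l], order[l:l + s], order[l + s:]
    # Pairs inside S, each unordered pair counted once (an assumption).
    inside = sum(w[i][j] * abs(x[i] - x[j])
                 for a, i in enumerate(S) for j in S[a + 1:])
    to_left = sum(w[i][j] * (x[i] - l) for i in S for j in L)
    to_right = sum(w[i][j] * (l + s - x[i]) for i in S for j in R)
    across = sum(w[i][j] * s for i in L for j in R)
    return inside + to_left + to_right + across

# Hypothetical similarities for 4 leaves.
w = [[0, 5, 1, 0],
     [5, 0, 1, 0],
     [1, 1, 0, 4],
     [0, 0, 4, 0]]
order = [0, 1, 2, 3]

print(local_la(w, order, 0, 2))   # node covering leaves {0, 1}: 6
print(local_la(w, order, 2, 2))   # node covering leaves {2, 3}: 6
print(local_la(w, order, 0, 4))   # root: 12
```

At the root, L and R are empty, so only the first term remains and the value reduces to the full arrangement cost of the ordering. In this example the two children's local costs also add up to the root's cost, since together their slots cover the whole ordering.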
Dendrogram orientation
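Two additional related terms will be used:
$$\mathrm{LeftCut}^T(v) \stackrel{\mathrm{def}}{=} \sum_{i \in S,\, j \in L} w_{ij}, \qquad \mathrm{RightCut}^T(v) \stackrel{\mathrm{def}}{=} \sum_{i \in S,\, j \in R} w_{ij}$$
Another term that will be used in the algorithm:
$$\mathrm{InnerCut}(v) = \sum_{i \in \mathrm{Leaves}(v.\mathrm{left}),\; j \in \mathrm{Leaves}(v.\mathrm{right})} w_{ij}$$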
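All three cut terms are sums of similarities between two disjoint groups of leaves, so a single helper covers them. The sketch below is a minimal illustration with made-up data; the helper name `cut` and the example node are hypothetical.

```python
def cut(w, group_a, group_b):
    """Total similarity between two disjoint groups of leaves."""
    return sum(w[i][j] for i in group_a for j in group_b)

# Hypothetical similarities for 4 leaves, ordered left to right as 0, 1, 2, 3.
w = [[0, 5, 1, 0],
     [5, 0, 1, 0],
     [1, 1, 0, 4],
     [0, 0, 4, 0]]
order = [0, 1, 2, 3]

# Node v covers slots 3..4, so S = {2, 3}, L = {0, 1}, and R is empty.
l, s = 2, 2
L, S, R = order[:l], order[l:l + s], order[l + s:]

left_cut = cut(w, S, L)        # similarities leaving S to the left
right_cut = cut(w, S, R)       # similarities leaving S to the right
inner_cut = cut(w, [2], [3])   # Leaves(v.left) = {2}, Leaves(v.right) = {3}

print(left_cut, right_cut, inner_cut)   # 2 0 4
```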
Determining coordinates of the leaves
Compute the exact gap between each pair of consecutive leaves
Determining coordinates of the leaves
A better approach is to take a weighted average over all influenced leaf pairs:
$$\mathrm{gap}_i = \left( \sum_{j \le i,\, k > i} \frac{1}{k - j} \right)^{-1} \cdot \sum_{j \le i,\, k > i} \frac{d_{kj}}{k - j}$$
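The sketch below shows one way such a weighted average could be computed. It assumes the reading in which the gap after slot i averages the distances d_kj of all leaf pairs with j ≤ i < k, each weighted by 1/(k − j), so that pairs far apart in the ordering count less. Both that reading and the step that turns gaps into x-coordinates by cumulative summation are assumptions for illustration, not a statement of the authors' implementation.

```python
def gaps(d, order):
    """Gap after each slot i = 1 .. n-1, as a weighted average of pair distances.

    d      symmetric distance matrix (list of lists)
    order  leaves listed left to right, so order[0] sits at slot 1
    """
    n = len(order)
    result = []
    for i in range(1, n):                      # gap between slots i and i+1
        num = den = 0.0
        for j in range(1, i + 1):              # slots on the left side, j <= i
            for k in range(i + 1, n + 1):      # slots on the right side, k > i
                weight = 1.0 / (k - j)         # assumed weighting
                num += weight * d[order[k - 1]][order[j - 1]]
                den += weight
        result.append(num / den)
    return result

def x_coordinates(gap_list):
    """Place the first leaf at 0 and accumulate the gaps (assumed convention)."""
    xs = [0.0]
    for g in gap_list:
        xs.append(xs[-1] + g)
    return xs

# Hypothetical distances between three points lying at 0, 1, and 3 on a line.
d = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
g = gaps(d, [0, 1, 2])
print(g)
print(x_coordinates(g))
```

With only two leaves, the single gap is simply their distance, which is a quick sanity check on the weighting.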
Computing The y-Coordinates
Principal component analysis
Classical multidimensional scaling
Eigen-projection
Stress minimization
Result
Odors dataset
consists of 30 volatile odorous pure chemicals
contains 262 elements; natural clusters: 30
UPGMA agglomerative clustering is used to construct the dendrogram
Result
Iris dataset
an example of discriminant analysis
contains 150 elements; natural clusters: 3
Result
Gene expression data: CDC15-synchronized cell cycle
a much larger dataset of gene-expression data
contains 6113 elements
Related Work
TreeView
dendrogram over a color-coded matrix
Discussion
The method succeeds in integrating two key methods in exploratory data analysis:
cluster analysis and low-dimensional embedding
Two unique properties:
guaranteed separation between any kind of given clusters
the ability to deal with a predefined hierarchical clustering
Personal Opinion
Advantages
─ Successfully integrates two approaches, clustering and low-dimensional embedding
─ Makes analysis more intuitive
Application
─ Real data for clustering and analysis
─ May solve the problem of lacking clustering information
Limitation
─ Cannot show the real shape of clusters