2. The Cluster Ensembles Problem
Consensus Function
Entropy
$$H(p) = H(X) = -\sum_{x} p(x)\,\log_2 p(x) = E\!\left[\log\frac{1}{p(X)}\right]$$
- The average uncertainty of a single random variable; the amount of information the variable carries.
- Properties: $H(X) \ge 0$, and $H(X) = 0$ means the variable carries no information (its outcome is certain).
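A minimal sketch of this definition, assuming labelings are given as plain Python lists as in the examples that follow (the function name is mine):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a labeling, using cluster-size fractions as p(x)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Example with the labeling used throughout these slides:
# H = -(3/7)log2(3/7) - 2 * (2/7)log2(2/7), roughly 1.56 bits
print(entropy([1, 1, 1, 2, 2, 3, 3]))
```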
Entropy
Relationship between joint entropy, conditional entropy, and mutual information:
$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y)$$
Objective Function
Two example labelings of $n = 7$ objects:
$$\lambda^{(1)} = \{1, 1, 1, 2, 2, 3, 3\}, \qquad \lambda^{(3)} = \{1, 1, 2, 2, 3, 3, 3\}$$
Cluster sizes of the two labelings:
$$n^{(1)}_1 = 3,\quad n^{(1)}_2 = 2,\quad n^{(1)}_3 = 2 \qquad\qquad n^{(3)}_1 = 2,\quad n^{(3)}_2 = 2,\quad n^{(3)}_3 = 3$$
Contingency counts $n_{h\ell}$ (objects in cluster $h$ of $\lambda^{(1)}$ and cluster $\ell$ of $\lambda^{(3)}$):
$$n_{11} = 2,\; n_{12} = 1,\; n_{13} = 0 \qquad n_{21} = 0,\; n_{22} = 1,\; n_{23} = 1 \qquad n_{31} = 0,\; n_{32} = 0,\; n_{33} = 2$$
Normalized mutual information between the two labelings:
$$\phi^{(\mathrm{NMI})}(\lambda^{(1)},\lambda^{(3)}) =
\frac{\sum_{h}\sum_{\ell} n_{h\ell}\,\log\dfrac{n\,n_{h\ell}}{n^{(1)}_h\,n^{(3)}_\ell}}
{\sqrt{\Bigl(\sum_h n^{(1)}_h \log\dfrac{n^{(1)}_h}{n}\Bigr)\Bigl(\sum_\ell n^{(3)}_\ell \log\dfrac{n^{(3)}_\ell}{n}\Bigr)}}
= \frac{2\log\frac{14}{6} + \log\frac{7}{6} + \log\frac{7}{4} + \log\frac{7}{6} + 2\log\frac{14}{6}}
{\sqrt{\bigl(3\log\frac{3}{7} + 4\log\frac{2}{7}\bigr)\bigl(3\log\frac{3}{7} + 4\log\frac{2}{7}\bigr)}}
\approx 0.56$$
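A sketch of this NMI computation in Python, following the definition above (the helper name is mine; the `zip`-based contingency counting assumes equal-length label lists):

```python
import math
from collections import Counter

def nmi(la, lb):
    """phi^(NMI) between two labelings: shared information normalized by the
    geometric mean of the two entropies (Strehl & Ghosh style)."""
    n = len(la)
    ca, cb = Counter(la), Counter(lb)
    joint = Counter(zip(la, lb))                      # contingency counts n_hl
    num = sum(nhl * math.log(n * nhl / (ca[h] * cb[l]))
              for (h, l), nhl in joint.items())
    den = math.sqrt(sum(c * math.log(c / n) for c in ca.values())
                    * sum(c * math.log(c / n) for c in cb.values()))
    return num / den if den > 0 else 0.0

l1 = [1, 1, 1, 2, 2, 3, 3]
l3 = [1, 1, 2, 2, 3, 3, 3]
print(nmi(l1, l3))   # roughly 0.56, the value worked out above
```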
Ensemble Clustering
A candidate combined labeling $\lambda^{(\mathrm{new})}$ is evaluated against the ensemble members:
$\lambda^{(1)} = \{1, 1, 1, 2, 2, 3, 3\}$
$\lambda^{(3)} = \{1, 1, 2, 2, 3, 3, 3\}$
$\lambda^{(\mathrm{new})} = \{1, 2, 3, 1, 2, 3, 1\}$
$$\phi^{(\mathrm{NMI})}(\lambda^{(1)},\lambda^{(\mathrm{new})}) =
\frac{\log\frac{7}{9} + 4\log\frac{7}{6} + 2\log\frac{7}{4}}
{\sqrt{\bigl(3\log\frac{3}{7} + 4\log\frac{2}{7}\bigr)\bigl(3\log\frac{3}{7} + 4\log\frac{2}{7}\bigr)}}
\approx 0.20$$
$$\phi^{(\mathrm{NMI})}(\lambda^{(3)},\lambda^{(\mathrm{new})}) =
\frac{\log\frac{7}{9} + 4\log\frac{7}{6} + 2\log\frac{7}{4}}
{\sqrt{\bigl(3\log\frac{3}{7} + 4\log\frac{2}{7}\bigr)\bigl(3\log\frac{3}{7} + 4\log\frac{2}{7}\bigr)}}
\approx 0.20$$
Averaging over the ensemble $\Lambda = \{\lambda^{(1)}, \lambda^{(3)}\}$ gives the objective value of the candidate:
$$\phi^{(\mathrm{ANMI})}(\Lambda,\lambda^{(\mathrm{new})}) =
\frac{1}{2}\Bigl(\phi^{(\mathrm{NMI})}(\lambda^{(1)},\lambda^{(\mathrm{new})}) + \phi^{(\mathrm{NMI})}(\lambda^{(3)},\lambda^{(\mathrm{new})})\Bigr) \approx 0.20$$
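The ANMI objective is just the mean of these pairwise NMI values. A small sketch reusing the `nmi` helper defined above:

```python
def anmi(ensemble, candidate):
    """Average normalized mutual information of a candidate labeling
    with every labeling in the ensemble."""
    return sum(nmi(labels, candidate) for labels in ensemble) / len(ensemble)

ensemble = [[1, 1, 1, 2, 2, 3, 3], [1, 1, 2, 2, 3, 3, 3]]
print(anmi(ensemble, [1, 2, 3, 1, 2, 3, 1]))   # lambda^(new), roughly 0.20
print(anmi(ensemble, [1, 1, 1, 1, 1, 1, 1]))   # lambda^(test): single cluster -> 0.0
```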
For comparison, consider a degenerate candidate $\lambda^{(\mathrm{test})}$ that puts every object into a single cluster:
$\lambda^{(\mathrm{new})} = \{1, 2, 3, 1, 2, 3, 1\}$
$\lambda^{(\mathrm{test})} = \{1, 1, 1, 1, 1, 1, 1\}$
$\lambda^{(1)} = \{1, 1, 1, 2, 2, 3, 3\}$
$\lambda^{(3)} = \{1, 1, 2, 2, 3, 3, 3\}$
Because $\lambda^{(\mathrm{test})}$ assigns all objects to one cluster, it shares no information with any ensemble member, so (taking $0/0$ as $0$ for a single-cluster labeling):
$$\phi^{(\mathrm{NMI})}(\lambda^{(1)},\lambda^{(\mathrm{test})}) = 0, \qquad \phi^{(\mathrm{NMI})}(\lambda^{(3)},\lambda^{(\mathrm{test})}) = 0$$
$$\phi^{(\mathrm{ANMI})}(\Lambda,\lambda^{(\mathrm{test})}) =
\frac{1}{2}\Bigl(\phi^{(\mathrm{NMI})}(\lambda^{(1)},\lambda^{(\mathrm{test})}) + \phi^{(\mathrm{NMI})}(\lambda^{(3)},\lambda^{(\mathrm{test})})\Bigr) = 0$$
Comparing the two candidates:
$$\phi^{(\mathrm{ANMI})}(\Lambda,\lambda^{(\mathrm{new})}) \approx 0.20 \;>\; \phi^{(\mathrm{ANMI})}(\Lambda,\lambda^{(\mathrm{test})}) = 0$$
The objective therefore prefers $\lambda^{(\mathrm{new})}$: the consensus clustering is the candidate labeling that maximizes the average normalized mutual information with the ensemble.
Objective Function with Missing Labels
A labeling that covers only part of the data can still contribute to the objective; it is weighted by the fraction of objects it labels:
$\lambda^{(1)} = \{1, 1, 1, 2, 2, 3, 3\}$, weight $= 1$
$\lambda^{(4)} = \{1, 2, ?, 1, 2, ?, ?\}$, weight $= 4/7$
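One way to read this slide, sketched with the `nmi` helper from above (the details of the generalization are my assumptions, not the paper's exact formulation): score each partial labeling only on the objects it covers and weight it by the covered fraction.

```python
def weighted_anmi(ensemble, candidate, missing=None):
    """ANMI when some ensemble labelings cover only part of the data:
    each labeling is scored on the objects it actually labels, and its NMI
    is weighted by the fraction of objects covered (e.g. 4/7 for lambda^(4))."""
    acc, total_w = 0.0, 0.0
    for labels in ensemble:
        idx = [i for i, x in enumerate(labels) if x != missing]
        w = len(idx) / len(candidate)
        acc += w * nmi([labels[i] for i in idx], [candidate[i] for i in idx])
        total_w += w
    return acc / total_w

ensemble = [[1, 1, 1, 2, 2, 3, 3],           # lambda^(1), weight 1
            [1, 2, None, 1, 2, None, None]]  # lambda^(4), weight 4/7
print(weighted_anmi(ensemble, [1, 1, 1, 2, 2, 3, 3]))
```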
Greedy Optimization
Greedy algorithm
- Reaches a final answer by choosing, at every decision point, the option that looks best at that moment.
- Each local choice is optimal at the time it is made, but the resulting answer may not be the global optimum.
- The result therefore always has to be checked for optimality.
Divide and conquer
- Splits the problem into several smaller subproblems, solves each one separately, and merges the partial results at the end.
- The problem is divided recursively until the pieces are small enough to handle easily.
A greedy local search over the ANMI objective is sketched below.
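A minimal sketch of such a greedy search, reusing the `anmi` helper defined earlier (the initialization and the single-object-move neighborhood are my assumptions; the result is only a local optimum):

```python
def greedy_consensus(ensemble, k, init):
    """Greedy local search: repeatedly relabel single objects whenever the
    move increases ANMI, stopping when no single-object move helps."""
    labels = list(init)
    best = anmi(ensemble, labels)
    improved = True
    while improved:
        improved = False
        for i in range(len(labels)):
            for c in range(1, k + 1):
                if c == labels[i]:
                    continue
                trial = labels[:i] + [c] + labels[i + 1:]
                score = anmi(ensemble, trial)
                if score > best:          # keep only strict improvements
                    labels, best, improved = trial, score, True
    return labels, best

ensemble = [[1, 1, 1, 2, 2, 3, 3], [1, 1, 2, 2, 3, 3, 3]]
print(greedy_consensus(ensemble, k=3, init=ensemble[0]))
```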
3. Efficient Consensus Functions
Consensus Functions
- Greedy Algorithm
- Cluster-based Similarity Partitioning Algorithm (CSPA)
- HyperGraph Partitioning Algorithm (HGPA)
- Meta-Clustering Algorithm (MCLA)
Representing Sets of Clusterings as a Hypergraph
- Transform the given cluster label vectors into a suitable hypergraph representation.
- A hypergraph consists of vertices and hyperedges; here the objects are the vertices and every cluster of every clustering becomes one hyperedge (see the sketch below).
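A small sketch of this representation (the function name and the use of NumPy are my choices): every cluster contributes one binary column, so the matrix stacks the indicator matrices of the individual clusterings.

```python
import numpy as np

def hypergraph_incidence(ensemble):
    """Binary incidence matrix H: one row per object, one column (hyperedge)
    per cluster of each clustering in the ensemble."""
    cols = []
    for labels in ensemble:
        for c in sorted(set(labels)):
            cols.append([1 if x == c else 0 for x in labels])
    return np.array(cols).T

H = hypergraph_incidence([[1, 1, 1, 2, 2, 3, 3],
                          [1, 1, 2, 2, 3, 3, 3],
                          [1, 2, 3, 1, 2, 3, 1]])
print(H.shape)   # (7, 9): 7 vertices, 9 hyperedges
```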
CSPA
A clustering signifies a relationship between objects in the same cluster and can thus be used to establish a measure of pairwise similarity. This induced similarity measure is then used to recluster the objects, yielding a combined clustering.
[Diagram: clustering1, clustering2, clustering3, ... each induce a similarity; the combined similarity is reclustered into the combined clustering.]
- The entry-wise average of r such matrices, representing the r sets of groupings, yields an overall similarity matrix S with a finer resolution.
- The entries of S denote the fraction of clusterings in which two objects are in the same cluster, and can be computed in one sparse matrix multiplication, $S = \frac{1}{r} H H^{\top}$.
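A sketch of this step, reusing `hypergraph_incidence` from above (dense NumPy arrays stand in for the sparse matrices the paper mentions):

```python
def cspa_similarity(ensemble):
    """Co-association matrix: S[i, j] is the fraction of clusterings that put
    objects i and j into the same cluster, computed as (1/r) * H @ H.T."""
    H = hypergraph_incidence(ensemble)
    return (H @ H.T) / len(ensemble)

S = cspa_similarity([[1, 1, 1, 2, 2, 3, 3],
                     [1, 1, 2, 2, 3, 3, 3],
                     [1, 2, 3, 1, 2, 3, 1]])
# S is then handed to any similarity-based partitioner (the paper uses METIS).
print(S)
```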
[Slide: worked example of the binary 0/1 matrices induced by the individual clusterings; the matrix entries did not survive extraction in a readable layout.]
- We can use the similarity matrix to recluster the objects using any reasonable similarity-based clustering algorithm.
- We chose to partition the induced similarity graph using METIS, because of its robust and scalable properties.
- CSPA is the simplest and most obvious heuristic, but its computational and storage complexity are both quadratic in n.
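A sketch of that reclustering step on the matrix S from above, reusing `cspa_similarity` and NumPy from the earlier sketches; SciPy average-linkage clustering on 1 - S stands in here for the METIS graph partitioning the paper actually uses.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cspa(ensemble, k):
    """Recluster the objects from the averaged co-association matrix S."""
    S = cspa_similarity(ensemble)
    dist = squareform(1.0 - S, checks=False)   # condensed distance matrix
    return fcluster(linkage(dist, method="average"), k, criterion="maxclust")

print(cspa([[1, 1, 1, 2, 2, 3, 3],
            [1, 1, 2, 2, 3, 3, 3],
            [1, 2, 3, 1, 2, 3, 1]], k=3))
```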
HGPA
- The cluster ensemble problem is formulated as partitioning the hypergraph by cutting a minimal number of hyperedges.
- All hyperedges are considered to have the same weight. Also, all vertices are equally weighted.
- Note that this includes $n_\ell$-way relationship information (a hyperedge ties together every object of a cluster at once), while CSPA only considers pairwise relationships.
- On the other hand, this means that if the natural data clusters are highly imbalanced, a graph-partitioning-based approach is not appropriate.
- We use the hypergraph partitioning package HMETIS.
- Hypergraph partitioning in general has no provision for partially cut hyperedges.
HMETIS enforces balanced partitions. With the example cluster fractions $\{3/7, 3/7, 2/7\}$ and a balance constraint of the form $k \cdot (\text{cluster fraction}) \le 1.05$:
$$k \cdot \max\left\{\tfrac{3}{7}, \tfrac{3}{7}, \tfrac{2}{7}\right\} \le 1.05 \;\Rightarrow\; k \le 2.45
\qquad
k \cdot \min\left\{\tfrac{3}{7}, \tfrac{3}{7}, \tfrac{2}{7}\right\} \le 1.05 \;\Rightarrow\; k \le 3.67$$
- Cutting a connection that crosses cluster boundaries costs more than cutting one inside a cluster.
- Assuming the total cost is fixed, maximizing the sum of the within-cluster connection weights gives the best partition.
- The best partition is searched for by swapping elements between groups.
MCLA
Meta-cluster $C_1^{(M)} = \{h_3, h_4, h_9\}$ collapses into the meta-hyperedge $h_1^{(M)} = \{v_5, v_6, v_7\}$ with association values $1/3$, $1$, $1$ (the fraction of the meta-cluster's hyperedges that contain each object).
Meta-cluster $C_2^{(M)} = \{h_2, h_6, h_8, h_{11}\}$ collapses into the meta-hyperedge $h_2^{(M)} = \{v_2, v_3, v_4, v_5\}$ with association values $1/4$, $1/4$, $3/4$, $3/4$.
Meta-cluster $C_3^{(M)} = \{h_1, h_5, h_7, h_{10}\}$ collapses into the meta-hyperedge $h_3^{(M)} = \{v_1, v_2, v_3, v_4\}$ with association values $1$, $3/4$, $2/4$, $1/4$.
Object-to-meta-cluster association matrix (rows $v_1, \dots, v_7$; columns $C_1^{(M)}, C_2^{(M)}, C_3^{(M)}$):
v1:  0     0     1
v2:  0     1/4   3/4
v3:  0     1/4   2/4
v4:  0     3/4   1/4
v5:  1/3   3/4   0
v6:  1     0     0
v7:  1     0     0
Assigning each object to the meta-cluster with the largest association yields the combined clustering $\lambda = \{1, 1, 1, 2, 2, 3, 3\}$ (up to renaming of the clusters).
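A compact MCLA-style sketch, reusing `hypergraph_incidence` from above. The paper partitions the meta-graph of hyperedges with METIS; here SciPy average-linkage clustering on Jaccard distances stands in for that step, so this is an approximation of the method rather than the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def mcla(ensemble, k):
    """Group hyperedges into k meta-clusters, collapse each meta-cluster into
    an average association vector, then give every object the label of the
    meta-cluster it is most strongly associated with."""
    H = hypergraph_incidence(ensemble).astype(float)   # objects x hyperedges
    dist = pdist(H.T, metric="jaccard")                # hyperedge dissimilarity
    meta = fcluster(linkage(dist, method="average"), k, criterion="maxclust")
    assoc = np.column_stack([H[:, meta == m].mean(axis=1)
                             for m in np.unique(meta)])
    return assoc.argmax(axis=1) + 1                    # 1-based consensus labels

print(mcla([[1, 1, 1, 2, 2, 3, 3],
            [1, 1, 2, 2, 3, 3, 3],
            [1, 2, 3, 1, 2, 3, 1]], k=3))
```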
Discussion and Comparison
- All three methods give similar results.
- In computational cost (CSPA $O(n^2 k r)$, HGPA $O(n k r)$, MCLA $O(n k^2 r^2)$), HGPA is the fastest.
- CSPA's computation and storage requirements make it impractical for large data sets.
We feed the noisy labelings to the proposed consensus functions. The resulting combined labeling is evaluated in two ways. First, we measure the normalized objective function (ANMI) of the ensemble output with all the individual labels in $\Lambda$. Second, we measure the normalized mutual information of each consensus labeling with the original undistorted labeling using NMI. For better comparison, we added a random label generator as a baseline method. Also, performance measures of a hypothetical consensus function that returns the original labels are included to illustrate maximum performance for low-noise settings.
- This experiment indicates that MCLA should be best suited in terms of time complexity
as well as quality. In the applications and experiments described in the following sections,
we observe that each combining method can result in a higher ANMI than the others for
particular setups. In fact, we found that MCLA tends to be best in low noise/diversity
settings and HGPA/CSPA tend to be better in high noise/diversity settings. This is because
MCLA assumes that there are meaningful cluster correspondences, which is more likely to
be true when there is little noise and less diversity. Thus, it is useful to have all three
methods.
- Note that the supra-consensus function is completely unsupervised and avoids the
problem of selecting the best combiner for a data-set beforehand.
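A minimal sketch of that supra-consensus idea with the `anmi` helper from earlier: run the available consensus functions and keep whichever combined labeling scores highest against the ensemble (the candidate list in the usage comment is hypothetical).

```python
def supra_consensus(ensemble, candidates):
    """Among the combined labelings produced by different consensus functions
    (e.g. the CSPA, HGPA, and MCLA outputs), return the one with the highest
    ANMI against the ensemble -- no supervision is needed to pick a combiner."""
    return max(candidates, key=lambda labels: anmi(ensemble, labels))

# best = supra_consensus(ensemble, [cspa_labels, hgpa_labels, mcla_labels])
```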
4. Consensus Clustering Applications and Experiments
Data
Experiment Data 1: 2D2K
- Artificial data
- 500 points
- Two 2-dimensional Gaussian clusters with means (-0.227, 0.077) and (0.095, 0.323)
- Covariance matrices with 0.1 for all diagonal elements
Experiment Data 2: 8D5K
- Artificial data
- 1000 points
- Five 8-dimensional Gaussian clusters with fixed means
- Covariance matrices with 0.1 for all diagonal elements
Experiment Data 3: PENDIG
- Pen-based recognition of handwritten digits
- 16 spatial features for each of the 7494 training and 3498 test cases
- Ten classes of roughly equal size, corresponding to the digits 0 to 9
Experiment Data 4: Yahoo!
- For text clustering
- The 20 original Yahoo! News categories in the data
- The raw 21839 × 2340 word-document matrix
- Pruning all words that occur less than 0.01 or more than 0.10 times on average, because they are insignificant or too generic, results in d = 2903
Evaluation Criteria
Internal criteria
- Mean squared error criterion
- Mean cut criterion
- When using internal criteria, clustering becomes an optimization problem, and a clusterer can evaluate its own performance and tune its results accordingly.
External criteria
- Labeling, purity, entropy
- While average purity is intuitive to understand, it favors small clusters.
NMI
- Normalized mutual information provides a measure that is impartial with respect to k, as compared to purity and entropy. It reaches its maximum value of 1 only when the two sets of labels have a perfect one-to-one correspondence. We shall use the categorization labels to evaluate the cluster quality by computing NMI, as defined in Equation 3.
1. Feature-Distributed Clustering (FDC)
2. Object-Distributed Clustering (ODC)
3. Robust Centralized Clustering (RCC)
Feature-Distributed Clustering (FDC)
[Diagram: the features are split into subsets 1, 2, and 3, and each subset is clustered separately.]
[Diagram: the labels of the three feature-subset clusterings (Label of 1, Label of 2, Label of 3) are fed into the consensus function, which produces the combined label.]
Object-Distributed Clustering (ODC)
[Diagram: the objects are split into overlapping partitions (a)-(e), numbered 1-5, and each partition is clustered separately.]
- The ODC framework is parameterized by p, the number of partitions, and v, the repetition factor. The repetition factor v > 1 defines the total number of points processed in all p partitions combined to be (approximately) vn.
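A small illustration of that parameterization (the helper and its round-robin assignment are my assumptions, not the paper's procedure): with p partitions and repetition factor v, each object is placed into v of the p partitions, so roughly v·n points are processed in total.

```python
def odc_partitions(n, p, v):
    """Assign each of n objects to v of the p partitions (round-robin),
    so the p partitions together process roughly v * n points."""
    parts = [[] for _ in range(p)]
    for i in range(n):
        for j in range(v):
            parts[(i + j) % p].append(i)
    return parts

parts = odc_partitions(n=12, p=4, v=2)
print([len(x) for x in parts])   # 4 partitions, 24 object slots in total
```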
Robust Centralized Clustering (RCC)
[Diagram: five different clustering methods (method1 through method5) are each applied to the full data.]
[Diagram: the five labelings from method1 through method5 are combined by the consensus function into the combined label.]
The experimental results clearly show that cluster ensembles can be used to increase robustness in risk-intolerant settings. Since it is generally hard to evaluate clusters in high-dimensional problems, a cluster ensemble can be used to 'throw' many models at a problem and then integrate them using a consensus function to yield stable results.
[Plot: NMI versus the number of data points.]
FDC
- Single clustering
- Only a subset of the features is available
- Access to all objects
ODC
- Single clustering
- Access to all features
- Access to overlapping subsets of the objects
RCC
- Multiple clusterings
- Access to all features
- Access to all objects
We will do: access to only a subset of the features, access to overlapping subsets of the objects, and multiple clusterings.
5. Conclusion
Conclusion
- The purpose of this paper was to present the basic problem formulation and explore some application scenarios.
- The experiments indicate that cluster ensembles are indeed very helpful in determining a reasonable value of k for the consensus solution, since the ANMI value peaks around the desirable range.