
A General Model for Relational Clustering

Bo Long and Zhongfei (Mark) Zhang
Computer Science Dept./Watson School, SUNY Binghamton

Xiaoyun Wu
Yahoo! Inc.

Philip S. Yu
IBM Watson Research Center
Multi-type Relational Data (MTRD) is Everywhere!

• Bibliometrics: papers, authors, journals, key words
• Social networks: people, institutions, friendship links
• Biological data: genes, proteins, conditions
• Corporate databases: customers, products, suppliers, shareholders

[Figure: a small relational graph linking Papers and Authors]
Challenges for Clustering!

• Data objects are not identically distributed: heterogeneous data objects (papers, authors).
• Data objects are not independent: heterogeneous data objects are related to each other.
• Hence, no IID assumption holds.
Relational Data → Flat Data?

[Figure: flattening the relational data into one table per object type, e.g.
  Papers:    Paper ID  | word1   | word2   | … | author1 | author2 | …
  Authors:   Author ID | Paper 1 | Paper 2 | …
  Key words: Word ID   | Paper 1 | Paper 2 | …]

• High-dimensional and sparse data
• Data redundancy
Relational Data → Flat Data?

• No interaction between the hidden structures of different types of data objects.
• Difficult to discover the global community structure.

[Figure: a tri-partite graph of users, queries, and Web pages]
A General Model: Collective Factorization on Related Matrices

• Formulate multi-type relational data as a set of related matrices.
• Cluster different types of objects simultaneously by jointly factorizing the related matrices.
• Make use of the interaction among the hidden structures of different types of objects.
Data Representation

• Represent an MTRD set as a set of related matrices (a minimal data-layout sketch follows below):
  – The relation matrix $R^{(ij)}$ denotes the relations between the i-th type of objects and the j-th type of objects.
  – The feature matrix $F^{(i)}$ denotes the feature values of the i-th type of objects.

[Figure: example MTRD graphs; Authors–Papers–Words linked by $R^{(12)}$ and $R^{(23)}$ with a feature matrix $F^{(1)}$, and Users–Movies linked by $R^{(12)}$]
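As a concrete illustration of this representation (my own sketch, not from the slides), the related matrices can be stored as NumPy arrays keyed by type indices; all sizes, dictionary names, and the random fill below are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Three object types: 1 = authors, 2 = papers, 3 = words (sizes are arbitrary).
n1, n2, n3 = 50, 200, 1000

# Relation matrices R[(i, j)] between the i-th and j-th object types:
# R[(1, 2)][a, p] = 1 if author a wrote paper p,
# R[(2, 3)][p, w] = count of word w in paper p.
R = {
    (1, 2): rng.integers(0, 2, size=(n1, n2)).astype(float),
    (2, 3): rng.poisson(0.05, size=(n2, n3)).astype(float),
}

# Feature matrices F[i]: here, a 10-dimensional feature vector per author.
F = {1: rng.standard_normal((n1, 10))}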
Matrix Factorization

• Explore the hidden structure of a data matrix through its factorization:

  $F^{(i)} \approx C^{(i)} B^{(i)}$, where $B^{(i)}$ is the feature basis matrix;
  $R^{(ij)} \approx C^{(i)} A^{(ij)} (C^{(j)})^T$, where $A^{(ij)}$ is the cluster association matrix.
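Reconstructing the dimensions from context (the slide itself only names the factors): with $n_i$ objects of type $i$, $k_i$ clusters, and $f_i$ features,

\[
C^{(i)} \in \{0,1\}^{n_i \times k_i} \ \text{(cluster indicator matrix)}, \quad
B^{(i)} \in \mathbb{R}^{k_i \times f_i} \ \text{(feature basis matrix)}, \quad
A^{(ij)} \in \mathbb{R}^{k_i \times k_j} \ \text{(cluster association matrix)}.
\]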
Model: Collective Factorization on Related Matrices (CFRM)

$\min_{C^{(i)},\, A^{(ij)},\, B^{(i)}} \;\; \sum_{1 \le i < j \le m} w_a^{(ij)} \left\| R^{(ij)} - C^{(i)} A^{(ij)} (C^{(j)})^T \right\|^2 \;+\; \sum_{1 \le i \le m} w_b^{(i)} \left\| F^{(i)} - C^{(i)} B^{(i)} \right\|^2$
CFRM Model: Example

[Figure: a chain of three object types, 1–2–3, with features f attached to type 1]

For this tri-type chain, the CFRM objective is (evaluated in the code sketch below)

$w_b^{(1)} \left\| F^{(1)} - C^{(1)} B^{(1)} \right\|^2 \;+\; w_a^{(12)} \left\| R^{(12)} - C^{(1)} A^{(12)} (C^{(2)})^T \right\|^2 \;+\; w_a^{(23)} \left\| R^{(23)} - C^{(2)} A^{(23)} (C^{(3)})^T \right\|^2$
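To make the objective concrete, here is a minimal NumPy sketch that evaluates this tri-type loss for given factors; the function name, argument layout, and default weights are my own illustrative choices:

import numpy as np

def cfrm_objective(F1, C1, B1, R12, R23, C2, C3, A12, A23,
                   wb1=1.0, wa12=1.0, wa23=1.0):
    # Squared Frobenius norm of a matrix.
    fro2 = lambda X: np.linalg.norm(X, 'fro') ** 2
    return (wb1 * fro2(F1 - C1 @ B1)
            + wa12 * fro2(R12 - C1 @ A12 @ C2.T)
            + wa23 * fro2(R23 - C2 @ A23 @ C3.T))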
Spectral Clustering

• Algorithms that cluster points using eigenvectors of matrices derived from the data.
• They obtain a data representation in a low-dimensional space that can be easily clustered.
• Traditional spectral clustering focuses on homogeneous data.
Main Theorem:

Minimizing the CFRM objective

$\min_{C^{(i)},\, A^{(ij)},\, B^{(i)}} \sum_{1 \le i < j \le m} w_a^{(ij)} \left\| R^{(ij)} - C^{(i)} A^{(ij)} (C^{(j)})^T \right\|^2 + \sum_{1 \le i \le m} w_b^{(i)} \left\| F^{(i)} - C^{(i)} B^{(i)} \right\|^2$

is equivalent to the trace maximization

$\max_{(C^{(i)})^T C^{(i)} = I_{k_i}} \sum_{1 \le i < j \le m} w_a^{(ij)} \operatorname{tr}\!\left( (C^{(i)})^T R^{(ij)} C^{(j)} (C^{(j)})^T (R^{(ij)})^T C^{(i)} \right) + \sum_{1 \le i \le m} w_b^{(i)} \operatorname{tr}\!\left( (C^{(i)})^T F^{(i)} (F^{(i)})^T C^{(i)} \right)$
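One step of the reasoning behind the theorem, reconstructed here since it is not spelled out on the slide: for a fixed orthonormal $C$, the inner minimizations have closed forms that turn each Frobenius term into a trace. For the feature term,

\[
\min_{B} \|F - CB\|^2 \ \text{is attained at}\ B^{*} = C^T F, \qquad
\|F - CC^T F\|^2 = \operatorname{tr}(F^T F) - \operatorname{tr}(C^T F F^T C),
\]

so minimizing over $C$ amounts to maximizing $\operatorname{tr}(C^T F F^T C)$; the relation terms reduce analogously with $A^{*} = (C^{(i)})^T R^{(ij)} C^{(j)}$.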
Algorithm Derivation: Iterative Updating

Fixing all cluster indicator matrices except $C^{(p)}$, the update for $C^{(p)}$ reduces to

$\max_{(C^{(p)})^T C^{(p)} = I_{k_p}} \operatorname{tr}\!\left( (C^{(p)})^T M^{(p)} C^{(p)} \right)$

where (a code sketch of this assembly follows below)

$M^{(p)} = w_b^{(p)} F^{(p)} (F^{(p)})^T \;+\; \sum_{p < j \le m} w_a^{(pj)} R^{(pj)} C^{(j)} (C^{(j)})^T (R^{(pj)})^T \;+\; \sum_{1 \le j < p} w_a^{(jp)} (R^{(jp)})^T C^{(j)} (C^{(j)})^T R^{(jp)}$
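A minimal NumPy sketch of assembling $M^{(p)}$, assuming the storage conventions from the earlier sketch (relations in a dict R keyed by ordered pairs (i, j) with i < j, features in F, current indicators in C); the weight dictionaries wa and wb and all names are my own:

import numpy as np

def build_M(p, n_p, m, R, F, C, wa, wb):
    # M^(p): feature term plus one term for each relation involving type p.
    M = np.zeros((n_p, n_p))
    if p in F:                                # w_b^(p) F^(p) (F^(p))^T
        M += wb[p] * (F[p] @ F[p].T)
    for j in range(p + 1, m + 1):             # w_a^(pj) R^(pj) C^(j) (C^(j))^T (R^(pj))^T
        if (p, j) in R:
            P = R[(p, j)] @ C[j]
            M += wa[(p, j)] * (P @ P.T)
    for j in range(1, p):                     # w_a^(jp) (R^(jp))^T C^(j) (C^(j))^T R^(jp)
        if (j, p) in R:
            P = R[(j, p)].T @ C[j]
            M += wa[(j, p)] * (P @ P.T)
    return M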
Spectral Relaxation

$\max_{(C^{(p)})^T C^{(p)} = I_{k_p}} \operatorname{tr}\!\left( (C^{(p)})^T M^{(p)} C^{(p)} \right)$

• Apply real relaxation to $C^{(p)}$, letting it be an arbitrary orthonormal matrix.
• By the Ky Fan theorem (stated below), the optimal solution is given by the leading $k_p$ eigenvectors of $M^{(p)}$.
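For completeness, the Ky Fan theorem being invoked says that for a symmetric matrix $M$ with eigenvalues $\lambda_1 \ge \dots \ge \lambda_n$,

\[
\max_{X^T X = I_k} \operatorname{tr}(X^T M X) = \sum_{i=1}^{k} \lambda_i,
\]

and the maximum is attained when the columns of $X$ span the subspace of the $k$ leading eigenvectors of $M$.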
Spectral Relational Clustering (SRC)
Spectral Relational Clustering: Example

For the tri-type chain 1–2–3, SRC iterates the following updates (a runnable sketch follows below):

1. Update $C^{(1)}$ as the $k_1$ leading eigenvectors of
   $M^{(1)} = w_a^{(12)} R^{(12)} C^{(2)} (C^{(2)})^T (R^{(12)})^T$
2. Update $C^{(2)}$ as the $k_2$ leading eigenvectors of
   $M^{(2)} = w_a^{(12)} (R^{(12)})^T C^{(1)} (C^{(1)})^T R^{(12)} + w_a^{(23)} R^{(23)} C^{(3)} (C^{(3)})^T (R^{(23)})^T$
3. Update $C^{(3)}$ as the $k_3$ leading eigenvectors of
   $M^{(3)} = w_a^{(23)} (R^{(23)})^T C^{(2)} (C^{(2)})^T R^{(23)}$
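Putting the three updates together, here is a self-contained NumPy sketch of the SRC loop for this tri-type chain; the random initialization, fixed iteration count, and the closing remark about k-means are illustrative assumptions rather than details prescribed by the slides:

import numpy as np

def leading_eigvecs(M, k):
    # eigh returns eigenvalues in ascending order, so the last k columns
    # are the leading eigenvectors.
    _, V = np.linalg.eigh((M + M.T) / 2)   # symmetrize against round-off
    return V[:, -k:]

def src_tritype(R12, R23, k1, k2, k3, wa12=1.0, wa23=1.0, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    n1, n2 = R12.shape
    n3 = R23.shape[1]
    # Random orthonormal initializations via QR.
    C1 = np.linalg.qr(rng.standard_normal((n1, k1)))[0]
    C2 = np.linalg.qr(rng.standard_normal((n2, k2)))[0]
    C3 = np.linalg.qr(rng.standard_normal((n3, k3)))[0]
    for _ in range(iters):
        C1 = leading_eigvecs(wa12 * R12 @ C2 @ C2.T @ R12.T, k1)
        C2 = leading_eigvecs(wa12 * R12.T @ C1 @ C1.T @ R12
                             + wa23 * R23 @ C3 @ C3.T @ R23.T, k2)
        C3 = leading_eigvecs(wa23 * R23.T @ C2 @ C2.T @ R23, k3)
    # Discrete cluster labels are then typically obtained by running k-means
    # on the rows of each C.
    return C1, C2, C3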
Advantages of Spectral Relational Clustering (SRC)

• As simple as traditional spectral approaches.
• Applicable to relational data with various structures.
• Adaptive low-dimensional embedding.
• Efficient: O(tmn²k) for t iterations, m object types, n objects, and k clusters; for sparse data this reduces to O(tmzk), where z denotes the number of non-zero elements.
Special case 1: k-means and spectral clustering

• Flat data: a special MTRD with only one feature matrix F,
  $\min_{C,\, B} \| F - CB \|^2$
• By the main theorem, k-means is equivalent to the trace maximization
  $\max_{C^T C = I_k} \operatorname{tr}( C^T F F^T C )$
Special case 2: Bipartite Spectral Graph Partitioning (BSGP)

• A bipartite graph is a special case of MTRD with one relation matrix R,
  $\min_{(C^{(1)})^T C^{(1)} = I_{k_1},\; (C^{(2)})^T C^{(2)} = I_{k_2}} \left\| R - C^{(1)} A (C^{(2)})^T \right\|^2$
• BSGP restricts the clusters of different types of objects to have one-to-one associations, i.e., it imposes diagonal constraints on A.
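For reference, the classical BSGP recipe (Dhillon, KDD 2001) solves this one-relation special case through an SVD of the degree-normalized relation matrix; below is a hedged sketch of that recipe, with the choice of k-1 singular vectors and the final k-means step over the stacked embedding Z left as illustrative assumptions:

import numpy as np

def bsgp_embedding(R, k, eps=1e-12):
    # Degree-normalize: Rn = D1^{-1/2} R D2^{-1/2}.
    d1 = np.maximum(R.sum(axis=1), eps)   # row (type-1) degrees
    d2 = np.maximum(R.sum(axis=0), eps)   # column (type-2) degrees
    Rn = R / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, _, Vt = np.linalg.svd(Rn, full_matrices=False)
    # Skip the trivial leading singular pair; embed both object types with the
    # next k-1 singular vectors, then cluster the rows of Z with k-means.
    Z = np.vstack([U[:, 1:k] / np.sqrt(d1)[:, None],
                   Vt.T[:, 1:k] / np.sqrt(d2)[:, None]])
    return Z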
Experiments

• Bi-type relational data: document–word data.
• Tri-type relational data: category–document–word data.
• Comparison algorithms:
  – Normalized Cut (NC)
  – Bipartite Spectral Graph Partitioning (BSGP)
  – Mutual Reinforcement K-means (MRK)
  – Consistent Bipartite Graph Co-partitioning (CBGC)
Experimental Results on Bi-type Relational Data

[Figure: NMI comparisons on bi-type relational data; bars for SRC, NC, and BSGP on the Multi2, Multi3, Multi5, Multi8, and Multi10 data sets; y-axis: Normalized Mutual Information, range 0 to 0.8]

[Figure: embedding eigenvectors (u1 vs. u2) of a multi2 data set under NC, BSGP, and SRC]

[Figure: convergence; SRC objective value vs. number of iterations]
Experimental Results on Tri-type Relational Data

[Figure: NMI comparisons on tri-type relational data; bars for SRC, MRK, and CBGC on the BRM, TM1, TM2, and TM3 data sets; y-axis: Normalized Mutual Information, range 0 to 1]
Summary

• Collective Factorization on Related Matrices (CFRM): a general model for MTRD clustering.
• Spectral Relational Clustering (SRC): a novel spectral approach that is
  – simple and applicable to relational data with various structures,
  – an adaptive low-dimensional embedding, and
  – efficient.
• Theoretical analysis and experiments demonstrate the effectiveness and promise of both the model and the algorithm.