
SIGIR 2012
SimFusion+: Extending SimFusion Towards Efficient
Estimation on Large and Dynamic Networks
Weiren Yu¹, Xuemin Lin¹, Wenjie Zhang¹, Ying Zhang¹, Jiajin Le²
¹ University of New South Wales & NICTA, Australia
² Donghua University, China
Contents
1. Introduction
2. Problem Definition
3. Optimization Techniques
4. Experimental Results
1. Introduction
 Many applications require a measure of “similarity” between objects.
[Figure: applications of similarity search: citation of scientific papers (citeseer.com), recommender systems (amazon.com), web search engines (google.com), graph clustering]
SimFusion: A New Link-based Similarity Measure
 Structural Similarity Measures
 PageRank [Page et al., 99]
 SimRank [Jeh and Widom, KDD 02]
 SimFusion similarity
 A new, promising structural measure [Xi et al., SIGIR 05]
 An extension of the co-citation and bibliographic coupling metrics
 Basic Philosophy
 Follows the Reinforcement Assumption:
The similarity between objects is reinforced by the similarity of their related objects.
SimFusion Overview
 Features
 Using a Unified Relationship Matrix (URM) to represent
relationships among heterogeneous data
 Defined recursively and is computed iteratively
 Applicable to any domain with object-to-object relationships
 Challenges
 The URM may incur a trivial solution or a divergence issue in SimFusion.
 Rather costly to compute SimFusion on large graphs
 Naïve iteration: matrix-matrix multiplication
 Requires O(Kn³) time and O(n²) space [Xi et al., SIGIR 05]
 No incremental algorithms when edges are updated
Existing SimFusion: URM and USM
 Data Space: D = {o1, o2, …}, a finite set of data objects (vertices), partitioned into N subspaces D1, …, DN with D = ∪_{i=1}^{N} Di
 Data Relation (edges)
 Intra-type relation R_{i,i} ⊆ Di × Di, carrying info. within one space
 Inter-type relation R_{i,j} ⊆ Di × Dj, carrying info. between spaces
 Unified Relationship Matrix (URM): the N × N block matrix

L_URM = [ λ_{1,1} L_{1,1}   λ_{1,2} L_{1,2}   …   λ_{1,N} L_{1,N} ]
        [ λ_{2,1} L_{2,1}   λ_{2,2} L_{2,2}   …   λ_{2,N} L_{2,N} ]
        [        ⋮                  ⋮          ⋱          ⋮       ]
        [ λ_{N,1} L_{N,1}   λ_{N,2} L_{N,2}   …   λ_{N,N} L_{N,N} ]

with block entries

L_{i,j}(x, y) = 1/n_j,       if N_j(x) = ∅;
                1/|N_j(x)|,  if (x, y) ∈ R_{i,j};
                0,           otherwise.

 λ_{i,j} is the weighting factor between Di and Dj
 Unified Similarity Matrix (USM): S = [s_{p,q}] ∈ R^{n×n}, s.t. S = L · S · L^T.
Example.
Trivial solution: S = [1]_{n×n} !!!
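The collapse to the trivial solution can be reproduced numerically. A minimal sketch (the 3-object row-stochastic L below is an illustrative choice, not the slide's example): iterating S ← L·S·Lᵀ drives every entry of S to the same constant, so after rescaling all similarities become 1.

```python
import numpy as np

# A row-stochastic URM over 3 objects (illustrative choice; any ergodic
# row-stochastic matrix collapses the same way).
L = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

S = np.eye(3)                   # start from self-similarity only
for _ in range(100):
    S = L @ S @ L.T             # the SimFusion iteration S = L·S·L^T

print(np.ptp(S))                # spread (max - min) of the entries: ~0
print(S / S.max())              # after rescaling: the trivial all-ones matrix
```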
SimFusion Similarity on Heterogeneous Domain
[Figure: a heterogeneous domain with three data spaces D1, D2, D3 over data objects v1–v6, linked by intra-type and inter-type relationships; the slide shows the weighting matrix Λ, the resulting URM L_URM with fractional entries, and the USM S_USM ∈ R^{n×n} with unit diagonal]
s.t. S = L · S · L^T.
High complexity: O(Kn³) time, O(n²) space !!!
Contributions
 Revising the existing SimFusion model, avoiding
 the non-semantic (trivial) convergence
 the divergence issue
 Optimizing the computation of SimFusion+
 O(Km) pre-computation time, plus O(1) time and O(n) space
 A better accuracy guarantee
 Incremental computation on edge updates
 O(δn) time and O(n) space for handling δ edge updates
Revised SimFusion
 Motivation: Two issues of the existing SimFusion model
 Trivial Solution on Heterogeneous Domain
 Divergent Solution on Homogeneous Domain
Root cause: the row normalization of the URM !!!
From URM to UAM
 Unified Adjacency Matrix (UAM):

A_{i,j}(x, y) = 1/n_j,  if N_j(x) = ∅;
                1,      if (x, y) ∈ R_{i,j};
                0,      otherwise.
 Example
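The two definitions can be contrasted on a toy graph. The sketch below (homogeneous case with a single data space; the graph and the helper names `urm`/`uam` are our own illustration, not from the paper) builds the row-normalized URM entries 1/|N(x)| and the unnormalized UAM entries, filling dangling rows with 1/n in both:

```python
import numpy as np

# Toy homogeneous graph (one data space, our own example): 4 objects,
# object 3 is dangling, i.e. it has no out-edges.
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
adj = np.zeros((n, n))
for x, y in edges:
    adj[x, y] = 1.0

def urm(adj):
    """Row-normalized URM: L(x, y) = 1/|N(x)| on edges, 1/n on dangling rows."""
    L = adj.copy()
    for x in range(len(L)):
        deg = L[x].sum()
        L[x] = L[x] / deg if deg > 0 else 1.0 / len(L)
    return L

def uam(adj):
    """Unnormalized UAM: A(x, y) = 1 on edges, 1/n on dangling rows."""
    A = adj.copy()
    for x in range(len(A)):
        if A[x].sum() == 0:
            A[x] = 1.0 / len(A)
    return A

print(urm(adj))   # every row sums to 1 (the normalization behind the trivial solution)
print(uam(adj))   # raw edge weights kept; only the dangling row is smoothed
```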
Revised SimFusion+
 Basic Intuition
 Replace the URM with the UAM, deferring row normalization while preserving the reinforcement assumption of the original SimFusion
 Revised SimFusion+ Model
As in the original SimFusion, a final normalization squeezes the similarity scores in S into [0, 1].
Optimizing SimFusion+ Computation
 Conventional Iterative Paradigm
 Matrix-matrix multiplication, requiring O(Kn³) time and O(n²) space
 Our approach: convert SimFusion+ computation into finding the dominant eigenvector of the UAM A.
Pre-compute σ_max(A) only once, and cache it for later reuse.
 Matrix-vector multiplication, requiring O(Km) time and O(n) space
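The matrix-vector paradigm can be sketched as a plain power iteration (a generic sketch, not the paper's full algorithm): each step costs one matrix-vector product, which is O(m) on a sparse A, and only length-n vectors are kept.

```python
import numpy as np

def dominant_eigvec(A, iters=200, tol=1e-12):
    """Power iteration for the dominant eigen-pair (sigma_max, xi_max) of A.
    Each step is one matrix-vector product (O(m) when A is sparse) and
    stores only length-n vectors (O(n) space)."""
    x = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    sigma = 0.0
    for _ in range(iters):
        y = A @ x                        # the only O(m) operation per step
        sigma_new = np.linalg.norm(y)
        x = y / sigma_new
        if abs(sigma_new - sigma) < tol:
            break
        sigma = sigma_new
    return sigma_new, x

# Toy symmetric UAM (illustrative): a triangle, whose sigma_max is 2
A = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.]])
sigma, xi = dominant_eigvec(A)
print(round(sigma, 6))                   # 2.0
```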
Example
[Slide: a worked example on a small UAM A, comparing the conventional iteration with our eigenvector-based approach]
Key Observation
 Kronecker product "⊗": e.g.

X = [ 1  2 ],   Y = [ 5  6 ],
    [ 3  4 ]        [ 7  8 ]

X ⊗ Y = [ 1·Y  2·Y ] = [  5   6  10  12 ]
        [ 3·Y  4·Y ]   [  7   8  14  16 ]
                       [ 15  18  20  24 ]
                       [ 21  24  28  32 ]

 Vec operator: e.g. vec(X) = [1 3 2 4]^T (stacking the columns of X)
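Both operators are available in NumPy; the sketch below reproduces the slide's example and also checks the standard vec-Kronecker identity vec(A·X·B) = (Bᵀ ⊗ A)·vec(X), a textbook fact that underlies the properties on the next slide:

```python
import numpy as np

X = np.array([[1, 2], [3, 4]])
Y = np.array([[5, 6], [7, 8]])

vec = lambda M: M.flatten(order='F')    # column-stacking vec operator

print(np.kron(X, Y))                    # the 4x4 Kronecker product above
print(vec(X))                           # [1 3 2 4]

# The standard vec-Kronecker identity vec(A X B) = (B^T kron A) vec(X),
# checked on random conformable A and B
rng = np.random.default_rng(0)
A, B = rng.random((3, 2)), rng.random((2, 4))
print(np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X)))   # True
```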
Key Observation
 Two important Properties:
P1. vec(A · X · B) = (B^T ⊗ A) · vec(X)
P2. If (λ, ξ) is an eigen-pair of A, then (λ², ξ ⊗ ξ) is an eigen-pair of A ⊗ A.
 Our main idea:
(1) Rewrite the SimFusion+ iteration S ← A · S · A^T as a power iteration on A ⊗ A: by P1, vec(A · S · A^T) = (A ⊗ A) · vec(S).
(2) By P2, the dominant eigenvector of A ⊗ A is ξ_max ⊗ ξ_max, so it suffices to compute the dominant eigenvector ξ_max of the n × n matrix A itself.
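This reduction can be checked on a tiny matrix (a toy symmetric A of our own choosing; P1 and P2 are standard Kronecker-product facts): the dominant eigen-pair of A ⊗ A is the square of A's, so S = ξ·ξᵀ is a fixed direction of the iteration.

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])               # toy symmetric matrix: eigenvalues 1 and 3

w, V = np.linalg.eigh(A)
sigma, xi = w[-1], V[:, -1]            # dominant eigen-pair of A

# P2: (sigma^2, xi kron xi) is an eigen-pair of A kron A
AA = np.kron(A, A)
print(np.allclose(AA @ np.kron(xi, xi), sigma**2 * np.kron(xi, xi)))   # True

# Hence S = xi xi^T is a fixed direction of the iteration S <- A S A^T
S = np.outer(xi, xi)
print(np.allclose(A @ S @ A.T, sigma**2 * S))                          # True
```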
Accuracy Guarantee
 Conventional Iterations: no accuracy guarantee !!!
Question: ||S^(k+1) − S|| ≤ ?
 Our Method: utilize the Arnoldi decomposition to build an order-k orthogonal subspace for the UAM A.
Because T_k is small and almost upper-triangular (upper Hessenberg), computing σ_max(T_k) is far less costly than σ_max(A).
Accuracy Guarantee
 Arnoldi Decomposition: A · Q_k = Q_k · T_k + t_{k+1,k} · q_{k+1} · e_k^T (Q_k with orthonormal columns, T_k upper Hessenberg)
 k-th iterative similarity
 Estimate Error
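The decomposition can be sketched as a textbook Arnoldi iteration (an illustrative re-implementation, not the paper's tuned code): it builds Q with orthonormal columns and an upper-Hessenberg T, and the dominant eigenvalue of the small T_k serves as a cheap Ritz estimate of σ_max(A).

```python
import numpy as np

def arnoldi(A, k, seed=0):
    """Textbook Arnoldi iteration: returns Q (n x (k+1)) with orthonormal
    columns and an upper-Hessenberg T ((k+1) x k) with A @ Q[:, :k] = Q @ T."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, k + 1))
    T = np.zeros((k + 1, k))
    q0 = rng.random(n)
    Q[:, 0] = q0 / np.linalg.norm(q0)
    for j in range(k):
        w = A @ Q[:, j]                      # one matrix-vector product
        for i in range(j + 1):               # Gram-Schmidt against Q[:, :j+1]
            T[i, j] = Q[:, i] @ w
            w = w - T[i, j] * Q[:, i]
        T[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / T[j + 1, j]
    return Q, T

# Toy symmetric matrix standing in for the UAM A (illustrative choice)
A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
Q, T = arnoldi(A, k=2)
print(np.allclose(A @ Q[:, :2], Q @ T))      # True: the Arnoldi relation holds
# The dominant eigenvalue of the small T_k is a cheap Ritz estimate of sigma_max(A)
print(np.max(np.abs(np.linalg.eigvals(T[:2, :2]))))
```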
Example
[Slide: a worked Arnoldi decomposition, steps (1)–(3), on a small UAM A with a given error tolerance]
Edge Update on Dynamic Graphs
 Incremental UAM
Given the old G = (D, R) and a new G' = (D, R'), the incremental UAM ΔA = A' − A is given by a list of edge updates.
 Main idea
Reuse the cached quantities, including the eigen-pair (α_p, ξ_p) of the old A, to compute the new solution; ΔA is a sparse matrix when the number δ of edge updates is small.
 Incrementally computing SimFusion+: O(δn) time, O(n) space
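One way to see why a small δ helps: ΔA has only δ nonzeros, and the old dominant eigenvector is already close to the new one, so restarting the iteration from it converges in very few cheap steps. The warm-start sketch below is illustrative only, not the paper's exact O(δn) algorithm; the graph is our own toy example.

```python
import numpy as np

def warm_start_eigvec(A_new, xi_old, iters=200, tol=1e-12):
    """Re-estimate (sigma_max, xi_max) of the updated matrix by power
    iteration warm-started from the old eigenvector xi_old."""
    x = xi_old / np.linalg.norm(xi_old)
    sigma = 0.0
    for _ in range(iters):
        y = A_new @ x
        sigma_new = np.linalg.norm(y)
        x = y / sigma_new
        if abs(sigma_new - sigma) < tol:
            break
        sigma = sigma_new
    return sigma_new, x

# Old UAM (toy choice): complete graph on 4 objects, sigma_max = 3
A = np.ones((4, 4)) - np.eye(4)
w, V = np.linalg.eigh(A)
xi_old = V[:, -1]                       # old dominant eigenvector

# Edge update: remove the pair (0, 1) and (1, 0); dA has only delta = 2 nonzeros
dA = np.zeros_like(A)
dA[0, 1] = dA[1, 0] = -1.0
sigma_new, xi_new = warm_start_eigvec(A + dA, xi_old)
print(round(sigma_new, 4))              # 2.5616, i.e. (1 + sqrt(17)) / 2
```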
Example
Suppose edges (P1,P2) and (P2,P1) are removed.
Experimental Setting
 Datasets
 Synthetic data (RAND, 0.5M–3.5M)
 Real data (DBLP, WEBKB) [table of dataset statistics not shown]
 Compared Algorithms
 SimFusion+ and IncSimFusion+;
 SF, a SimFusion algorithm via matrix iteration [Xi et al., SIGIR 05];
 CSF, a variant of SF using the PageRank distribution [Cai et al., SIGIR 10];
 SR, a SimRank algorithm via partial sums [Lizorkin et al., VLDBJ 10];
 PR, a P-Rank algorithm encoding both in- and out-links [Zhao et al., CIKM 09]
Experiment (1): Accuracy
On DBLP and WEBKB
SF+ accuracy is consistently stable across different datasets.
SF hardly yields sensible similarities, as all its scores asymptotically approach the same value as K grows.
Experiment (2): CPU Time and Space
On DBLP
SF+ outperforms the other approaches, due to the use of σ_max(T_k).
On WEBKB
Experiment (3): Edge Updates
Varying δ
IncSF+ outperforms SF+ when δ is small, since a small δ preserves the sparseness of the incremental UAM; for larger δ, this advantage diminishes and IncSF+ loses its edge over SF+.
Experiment (4): Effects of the Error Tolerance
A smaller tolerance imposes more iterations for computing T_k and v_k, and hence increases the estimation cost.
Conclusions
 A revised model, SimFusion+, preventing the trivial solution and the divergence issue of the original SimFusion.
 Efficient techniques that improve the time and space of SimFusion+ computation, with accuracy guarantees.
 An incremental algorithm to compute SimFusion+ on dynamic graphs when edges are updated.
Future Work
 Devise vertex-update methods for incrementally computing SimFusion+.
 Parallelize SimFusion+ computation on GPUs.