SIGIR 2012
SimFusion+: Extending SimFusion Towards Efficient
Estimation on Large and Dynamic Networks
Weiren Yu¹, Xuemin Lin¹, Wenjie Zhang¹, Ying Zhang¹, Jiajin Le²
¹ University of New South Wales & NICTA, Australia
² Donghua University, China
Contents
1. Introduction
2. Problem Definition
3. Optimization Techniques
4. Experimental Results
1. Introduction
Many applications require a measure of “similarity” between objects.
[Figure: applications of similarity search, including citation of scientific papers (citeseer.com), recommender systems (amazon.com), web search engines (google.com), and graph clustering.]
SimFusion: A New Link-based Similarity Measure
Structural similarity measures:
  PageRank [Page et al., 99]
  SimRank [Jeh and Widom, KDD 02]
  SimFusion similarity [Xi et al., SIGIR 05]: a promising new structural measure, extending the co-citation and coupling metrics
Basic Philosophy
  Following the Reinforcement Assumption:
  The similarity between two objects is reinforced by the similarity of their related objects.
SimFusion Overview
Features
  Uses a Unified Relationship Matrix (URM) to represent relationships among heterogeneous data
  Defined recursively and computed iteratively
  Applicable to any domain with object-to-object relationships
Challenges
  The URM may incur a trivial solution or a divergence issue in SimFusion.
  Rather costly to compute SimFusion on large graphs
    Naive iteration: matrix-matrix multiplication, requiring O(Kn^3) time and O(n^2) space [Xi et al., SIGIR 05]
  No incremental algorithm when edges are updated
Existing SimFusion: URM and USM
Data space: D = {o1, o2, ...}, a finite set of data objects (vertices), partitioned into subspaces D_1, ..., D_N with D = D_1 ∪ ... ∪ D_N
Data relations (edges)
  Intra-type relation R_{i,i} ⊆ D_i × D_i, carrying information within one subspace
  Inter-type relation R_{i,j} ⊆ D_i × D_j, carrying information between subspaces
Unified Relationship Matrix (URM):
  L_URM = [ λ_{1,1} L_{1,1}   λ_{1,2} L_{1,2}   ...   λ_{1,N} L_{1,N}
            λ_{2,1} L_{2,1}   λ_{2,2} L_{2,2}   ...   λ_{2,N} L_{2,N}
            ...
            λ_{N,1} L_{N,1}   λ_{N,2} L_{N,2}   ...   λ_{N,N} L_{N,N} ]
  where L_{i,j}(x, y) = 1/n_j        if N_j(x) = ∅;
                        1/|N_j(x)|   if (x, y) ∈ R_{i,j};
                        0            otherwise,
  and λ_{i,j} is the weighting factor between D_i and D_j.
Unified Similarity Matrix (USM):
  S = [ s_{1,1} ... s_{1,n} ; ... ; s_{n,1} ... s_{n,n} ] ∈ R^{n×n},  s.t.  S = L S L^T.
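To make the fixed-point definition above concrete, here is a minimal NumPy sketch of the naive SimFusion iteration S ← L S L^T; the initialization S(0) = I and the iteration count K are illustrative assumptions, not taken from the slides.

import numpy as np

def simfusion_naive(L, K=10):
    """Naive SimFusion: iterate S <- L S L^T, starting from S(0) = I (assumed).
    Each step is two n x n matrix products, i.e., O(n^3) time and O(n^2) space."""
    n = L.shape[0]
    S = np.eye(n)
    for _ in range(K):
        S = L @ S @ L.T
    return S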
Example: SimFusion Similarity on a Heterogeneous Domain
[Figure: three data spaces D1, D2, D3 containing data objects v1, ..., v6, linked by intra-type and inter-type relationships, together with the weight matrix Λ, the URM L, and the USM S ∈ R^{n×n} satisfying S = L S L^T.]
Trivial solution !!!  S = [1]_{n×n}
High complexity !!!  O(Kn^3) time, O(n^2) space
Contributions
Revising the existing SimFusion model, avoiding
  the non-semantic convergence
  the divergence issue
Optimizing the computation of SimFusion+
  O(Km) pre-computation time, plus O(1) time and O(n) space
  Better accuracy guarantee
Incremental computation on edge updates
  O(δn) time and O(n) space for handling δ edge updates
Revised SimFusion
Motivation: Two issues of the existing SimFusion model
Trivial Solution on Heterogeneous Domain
Divergent Solution on Homogeneous Domain
Root cause: row normalization of URM !!!
From URM to UAM
Unified Adjacency Matrix (UAM)
  A_{i,j}(x, y) = 1/n_j   if N_j(x) = ∅;
                  1       if (x, y) ∈ R_{i,j};
                  0       otherwise.
Example: [figure omitted]
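A small sketch of how one block A_{i,j} of the UAM defined above could be assembled from an edge list; the function and variable names are illustrative, since the slides give no implementation.

import numpy as np

def uam_block(R_ij, n_i, n_j):
    """One block A_{i,j} of the UAM, following the definition above.
    R_ij: iterable of (x, y) index pairs with x in D_i and y in D_j."""
    A = np.zeros((n_i, n_j))
    for x, y in R_ij:
        A[x, y] = 1.0                   # (x, y) in R_{i,j}
    dangling = A.sum(axis=1) == 0       # rows x with N_j(x) empty
    A[dangling, :] = 1.0 / n_j          # uniform 1/n_j entries for those rows
    return A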
Revised SimFusion+
Basic Intuition
  Replace the URM with the UAM so that "row normalization" is applied in a delayed fashion, while preserving the reinforcement assumption of the original SimFusion.
Revised SimFusion+ Model vs. Original SimFusion
  [Formulas omitted: the delayed normalization squeezes the similarity scores in S into [0, 1].]
Optimizing SimFusion+ Computation
Conventional Iterative Paradigm
  Matrix-matrix multiplication, requiring O(Kn^3) time and O(n^2) space
Our approach: convert SimFusion+ computation into finding the dominant eigenvector of the UAM A.
  Pre-compute σmax(A) only once, and cache it for later reuse
  Matrix-vector multiplication, requiring O(Km) time and O(n) space
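A minimal sketch of the matrix-vector (power-iteration) style of computation described above, on a sparse UAM; the convergence tolerance, iteration cap, and the random test matrix are illustrative assumptions.

import numpy as np
import scipy.sparse as sp

def dominant_eigvec(A, K=100, tol=1e-9):
    """Power iteration on a sparse n x n UAM A with m non-zeros.
    Each step is one sparse matrix-vector product: O(m) time, O(n) space."""
    n = A.shape[0]
    x = np.full(n, 1.0 / np.sqrt(n))
    for _ in range(K):
        y = A @ x
        y /= np.linalg.norm(y)
        if np.linalg.norm(y - x) < tol:
            break
        x = y
    return x

# Usage sketch on a random sparse nonnegative matrix (illustrative only):
A = sp.random(1000, 1000, density=1e-3, format="csr", random_state=0)
x = dominant_eigvec(A)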
Example
[Worked comparison of the conventional iteration vs. the eigenvector-based approach omitted.]
Key Observation
Kronecker product "⊗":
  e.g.  X = [ 1 2 ; 3 4 ],  Y = [ 5 6 ; 7 8 ],
        X ⊗ Y = [  5  6 10 12
                   7  8 14 16
                  15 18 20 24
                  21 24 28 32 ]
Vec operator:
  e.g.  vec(X) = [1 3 2 4]^T
Key Observation
Two important properties of "⊗" and vec(·): P1 and P2 [formulas omitted].
Our main idea: use P1 and P2 to recast the SimFusion+ iteration as a power iteration over the UAM, whose fixed point gives the dominant eigenvector [formulas (1) and (2) omitted].
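The slides do not spell out P1 and P2; a standard Kronecker/vec identity of the kind presumably used here is vec(A S A^T) = (A ⊗ A) vec(S), with column-stacking vec as in the example above. A quick NumPy check of that identity:

import numpy as np

# Check vec(A S A^T) = (A ⊗ A) vec(S), using column-stacking vec (order='F'),
# consistent with vec(X) = [1 3 2 4]^T in the example above.
rng = np.random.default_rng(0)
A = rng.random((3, 3))
S = rng.random((3, 3))

lhs = (A @ S @ A.T).flatten(order="F")
rhs = np.kron(A, A) @ S.flatten(order="F")
assert np.allclose(lhs, rhs)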
Accuracy Guarantee
Conventional iterations: no accuracy guarantee !!!
Question:  || S(k+1) – S || ≤ ?
Our Method: use an Arnoldi decomposition to build an order-k orthogonal subspace for the UAM A.
  Because Tk is small and almost upper-triangular (upper Hessenberg), computing σmax(Tk) is far less costly than computing σmax(A).
Accuracy Guarantee
Arnoldi decomposition: [formula omitted]
k-th iterative similarity: [formula omitted]
Error estimate: [formula omitted]
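For reference, a compact sketch of the standard Arnoldi iteration that produces the orthonormal basis and the small upper-Hessenberg matrix Tk referred to above; the paper's specific error-bound formula is not reproduced here.

import numpy as np

def arnoldi(A_matvec, v0, k):
    """Standard Arnoldi iteration: returns V (n x (k+1)) with orthonormal
    columns spanning the order-k Krylov subspace of A, and the (k+1) x k
    upper-Hessenberg matrix T satisfying A @ V[:, :k] = V @ T."""
    n = v0.shape[0]
    V = np.zeros((n, k + 1))
    T = np.zeros((k + 1, k))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(k):
        w = A_matvec(V[:, j])
        for i in range(j + 1):            # modified Gram-Schmidt against V[:, :j+1]
            T[i, j] = V[:, i] @ w
            w = w - T[i, j] * V[:, i]
        T[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / T[j + 1, j]     # assumes no breakdown (T[j+1, j] > 0)
    return V, T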
Example
[Worked Arnoldi decomposition and error estimate omitted.]
Edge Update on Dynamic Graphs
Incremental UAM
  Given the old graph G = (D, R) and a new graph G' = (D, R'), the incremental UAM is the list of edge updates, i.e., the difference between the new UAM and the old UAM A.
Main idea
  Reuse the incremental UAM and the eigen-pair (αp, ξp) of the old A to compute the new result.
  The incremental UAM is a sparse matrix when the number δ of edge updates is small.
Incrementally computing SimFusion+:
  O(δn) time
  O(n) space
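The slides omit the exact incremental formula; purely as an illustration of the reuse idea (a generic strategy, not necessarily the authors' algorithm), here is a sketch that warm-starts a power iteration from the old dominant eigenvector and applies the sparse update without rebuilding the UAM.

import numpy as np

def warm_start_eigvec(A_old, delta_A, xi_old, steps=5):
    """Refresh the dominant eigenvector after edge updates.
    A_old: old (sparse) UAM; delta_A: sparse difference (new UAM minus old UAM);
    xi_old: dominant eigenvector of A_old, reused as the starting vector.
    Each step adds only work proportional to nnz(delta_A) on top of the old product."""
    x = xi_old / np.linalg.norm(xi_old)
    for _ in range(steps):
        x = A_old @ x + delta_A @ x      # (A_old + delta_A) @ x, keeping delta_A sparse
        x /= np.linalg.norm(x)
    return x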
Example
Suppose edges (P1,P2) and (P2,P1) are removed.
Experimental Setting
Datasets
  Synthetic data (RAND, 0.5M-3.5M)
  Real data (DBLP, WEBKB) [statistics tables omitted]
Compared Algorithms
  SimFusion+ (SF+) and IncSimFusion+ (IncSF+);
  SF, a SimFusion algorithm via matrix iteration [Xi et al., SIGIR 05];
  CSF, a variant of SF using the PageRank distribution [Cai et al., SIGIR 10];
  SR, a SimRank algorithm via partial sums [Lizorkin et al., VLDBJ 10];
  PR, a P-Rank algorithm encoding both in- and out-links [Zhao et al., CIKM 09].
Experiment (1): Accuracy
On DBLP and WEBKB
  SF+ accuracy is consistently stable across datasets.
  SF hardly yields sensible similarities, as all of its similarity scores asymptotically approach the same value as K grows.
Experiment (2): CPU Time and Space
On DBLP and WEBKB
  SF+ outperforms the other approaches, due to the use of σmax(Tk). [Plots omitted.]
Experiment (3): Edge Updates
Varying δ
  IncSF+ outperforms SF+ when δ is small, because a small δ preserves the sparseness of the incremental UAM; for larger δ, IncSF+ loses this advantage.
Experiment (4): Effects of [parameter omitted]
  A small choice of this parameter imposes more iterations on computing Tk and vk, and hence increases the estimation cost.
Conclusions
A revision of SimFusion, SimFusion+, that prevents the trivial solution and the divergence issue of the original model.
Efficient techniques to improve the time and space of SimFusion+ computation, with accuracy guarantees.
An incremental algorithm to compute SimFusion+ on dynamic graphs when edges are updated.
Future Work
  Devise vertex-update methods for incrementally computing SimFusion+.
  Extend SimFusion+ computation to parallel execution on GPUs.