Efficient Subgraph Search over Large Uncertain Graphs

Ye Yuan¹, Guoren Wang¹, Haixun Wang², Lei Chen³
1. Northeastern University, China
2. Microsoft Research Asia
3. HKUST
Outline
Ⅰ. Background
Ⅱ. Problem Definition
Ⅲ. Query Processing Framework
Ⅳ. Solutions
Ⅴ. Conclusions
Background
Graphs are a complex data structure used in many real applications.

Bioinformatics:
- Gene regulatory networks
- Yeast PPI networks
Background
Chemical compounds (e.g., the benzene ring) and compound databases
Background
Social networks:
- EntityCube
- Web 2.0 applications
Background
In these applications, graph data may be noisy and incomplete, which leads to uncertain graphs.

The STRING database (http://string-db.org) is a data source of PPIs whose uncertain edges are derived from biological experiments.

In visual pattern recognition, uncertain graphs are used to model visual objects.

In social networks, uncertain links represent possible relationships or the strength of influence between people.

Therefore, it is important to study query processing on large uncertain graphs.
Problem Definition
Probabilistic subgraph search

Uncertain graph:
- Vertex uncertainty: each vertex carries an existence probability.
- Edge uncertainty: each edge carries an existence probability conditioned on its two endpoints existing.
[Figure: example uncertain graph g with vertices 1 (A, 0.8), 2 (A, 0.6), 3 (B, 0.9) and edges (1,2) labeled b (0.9), (1,3) labeled b (0.7), (2,3) labeled a (0.5).]
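For concreteness, here is one possible encoding of this model (our own illustration, not the authors' code), which the later sketches reuse:

```python
# Vertex id -> (label, existence probability)
VERTICES = {1: ("A", 0.8), 2: ("A", 0.6), 3: ("B", 0.9)}

# (u, v) -> (edge label, existence probability given both endpoints exist)
EDGES = {(1, 2): ("b", 0.9), (1, 3): ("b", 0.7), (2, 3): ("a", 0.5)}
```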
Problem Definition
Probabilistic subgraph search
Possible worlds: all combinations of the existence of the uncertain vertices and edges.
[Figure: the 18 possible worlds (1)-(18) of the example graph g, each a certain graph over a subset of the vertices 1, 2, 3 and the surviving edges among them, annotated with its probability, e.g. the empty world has probability 0.008 = (1-0.8)(1-0.6)(1-0.9); the probabilities of all 18 worlds sum to 1.]
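To make the notion concrete, a brute-force sketch (ours, not the authors' code) that enumerates the example's 18 possible worlds and checks that their probabilities sum to 1:

```python
from itertools import product

# The example graph from the slides (our encoding):
VERTICES = {1: ("A", 0.8), 2: ("A", 0.6), 3: ("B", 0.9)}
EDGES = {(1, 2): ("b", 0.9), (1, 3): ("b", 0.7), (2, 3): ("a", 0.5)}

def possible_worlds():
    vids = list(VERTICES)
    for v_flags in product([True, False], repeat=len(vids)):
        alive = {v for v, keep in zip(vids, v_flags) if keep}
        p = 1.0
        for v, keep in zip(vids, v_flags):
            p *= VERTICES[v][1] if keep else 1 - VERTICES[v][1]
        # Only edges whose endpoints both survive can exist in this world.
        cand = [e for e in EDGES if e[0] in alive and e[1] in alive]
        for e_flags in product([True, False], repeat=len(cand)):
            pw = p
            for e, keep in zip(cand, e_flags):
                pw *= EDGES[e][1] if keep else 1 - EDGES[e][1]
            kept = [e for e, keep in zip(cand, e_flags) if keep]
            yield alive, kept, pw

worlds = list(possible_worlds())
print(len(worlds))                   # 18, as in the figure
print(sum(p for _, _, p in worlds))  # 1.0 (up to float rounding)
```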
Problem Definition
Probabilistic subgraph search

Given: an uncertain graph database G = {g1, g2, …, gn}, a query graph q, and a probability threshold ε.

Query: find all gi ∈ G such that the subgraph isomorphism probability between q and gi is not smaller than ε.

Subgraph isomorphism probability (SIP):
The SIP between q and gi is the sum of the probabilities of those possible worlds of gi to which q is subgraph isomorphic.
Problem Definition
Probabilistic subgraph search

Subgraph isomorphism probability (SIP):

[Figure: the uncertain graph g and the query q = (A)-a-(B); q is subgraph isomorphic to possible worlds (7), (14), (15), (17), and (18), so Pr(q ⊆ g) = 0.054 + 0.00648 + 0.13608 + 0.05832 + 0.01512 = 0.27.]

It is #P-complete to calculate the SIP.
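As a sanity check on the 0.27 figure, a self-contained brute-force sketch (ours, not the authors' code) that sums the probabilities of the worlds containing q; the exponential loop over worlds is precisely why exact SIP computation is #P-complete:

```python
from itertools import chain, combinations

VERTICES = {1: ("A", 0.8), 2: ("A", 0.6), 3: ("B", 0.9)}
EDGES = {(1, 2): ("b", 0.9), (1, 3): ("b", 0.7), (2, 3): ("a", 0.5)}

def subsets(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

sip = 0.0
for alive in map(set, subsets(VERTICES)):
    p_v = 1.0
    for v, (_lbl, pv) in VERTICES.items():
        p_v *= pv if v in alive else 1 - pv
    cand = [e for e in EDGES if e[0] in alive and e[1] in alive]
    for kept in map(set, subsets(cand)):
        p = p_v
        for e in cand:
            p *= EDGES[e][1] if e in kept else 1 - EDGES[e][1]
        # q = (A)-a-(B) is present iff some surviving edge labeled 'a'
        # connects an A-labeled vertex to a B-labeled vertex.
        if any(EDGES[e][0] == "a" and
               {VERTICES[e[0]][0], VERTICES[e[1]][0]} == {"A", "B"}
               for e in kept):
            sip += p
print(round(sip, 5))  # 0.27, matching the slide
```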
Probabilistic subgraph query processing framework
Naïve method: sequentially scan the database and, for each gi, decide whether the SIP between q and gi is not smaller than the threshold ε.
- Deciding whether g1 is subgraph isomorphic to g2: NP-complete
- Calculating the SIP: #P-complete

The naïve method is therefore very costly and infeasible!
Probabilistic subgraph query processing framework
Filter-and-verification:

{g1, g2, …, gn} → Filtering (with query q) → candidates {g'1, g'2, …, g'm} → Verification → answers {g''1, g''2, …, g''k}
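A schematic sketch of this framework (our own; filter_fn and exact_sip are hypothetical callables standing in for the index-based filter and the exact verifier):

```python
def probabilistic_subgraph_search(database, q, eps, filter_fn, exact_sip):
    """filter_fn(g, q, eps) -> 'prune' | 'accept' | 'unknown' (index-based);
    exact_sip(g, q) -> exact subgraph isomorphism probability (#P-hard)."""
    answers, candidates = [], []
    for g in database:
        verdict = filter_fn(g, q, eps)
        if verdict == "accept":        # a lower bound already met eps
            answers.append(g)
        elif verdict == "unknown":     # bounds were inconclusive
            candidates.append(g)       # 'prune' verdicts are dropped here
    # The expensive exact computation runs only on surviving candidates.
    answers.extend(g for g in candidates if exact_sip(g, q) >= eps)
    return answers
```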
Solutions
Filtering: structural pruning
Principle: if we remove all uncertainty from g and the resulting certain graph gc still does not contain q, then the original uncertain graph cannot contain q.
[Figure: the example uncertain graph g, its certain version gc, and a query q that is not contained in gc.]
Theorem: if q ⊄ gc, then Pr(q ⊆ g) = 0.
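A sketch of this structural filter (our illustration, not necessarily the authors' implementation), using networkx for the label-preserving subgraph isomorphism test:

```python
import networkx as nx
from networkx.algorithms import isomorphism as iso

def structurally_prunable(g_vertices, g_edges, q_vertices, q_edges):
    """g_vertices: {v: (label, prob)}, g_edges: {(u, v): (label, prob)};
    q_vertices: {v: label}, q_edges: {(u, v): label}."""
    gc = nx.Graph()                                 # certain version of g
    for v, (label, _prob) in g_vertices.items():    # drop vertex probabilities
        gc.add_node(v, label=label)
    for (u, v), (label, _prob) in g_edges.items():  # drop edge probabilities
        gc.add_edge(u, v, label=label)
    q = nx.Graph()
    for v, label in q_vertices.items():
        q.add_node(v, label=label)
    for (u, v), label in q_edges.items():
        q.add_edge(u, v, label=label)
    gm = iso.GraphMatcher(
        gc, q,
        node_match=iso.categorical_node_match("label", None),
        edge_match=iso.categorical_edge_match("label", None))
    # No label-preserving embedding of q in gc => Pr(q ⊆ g) = 0.
    return not gm.subgraph_is_monomorphic()
```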
Solutions
Probabilistic pruning: let f be a feature of gc, i.e., f ⊆ gc.

Rule 1: if f ⊆ q and UpperB(Pr(f ⊆ g)) < ε, then g is pruned.
Because f ⊆ q implies Pr(q ⊆ g) ≤ Pr(f ⊆ g) < ε.
[Figure: an example uncertain graph, a feature f with UpperB(Pr(f ⊆ g)) = 0.6, and a query q containing f, illustrating Rule 1.]
Solutions
Rule 2: if q ⊆ f and LowerB(Pr(f ⊆ g)) ≥ ε, then g is an answer.
Because q ⊆ f implies Pr(q ⊆ g) ≥ Pr(f ⊆ g) ≥ ε.
[Figure: an example uncertain graph and a feature (A)-a-(B) with LowerB(Pr(f ⊆ g)) = 0.2 that contains the query q, illustrating Rule 2.]
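Both rules combined, as a minimal sketch (ours; the containment flags and bounds are assumed to come from offline index construction):

```python
def apply_pruning_rules(f_subgraph_of_q, q_subgraph_of_f, lower_b, upper_b, eps):
    """Bounds lower_b <= Pr(f ⊆ g) <= upper_b are precomputed in the index."""
    if f_subgraph_of_q and upper_b < eps:
        return "prune"   # Rule 1: Pr(q ⊆ g) <= Pr(f ⊆ g) <= upper_b < eps
    if q_subgraph_of_f and lower_b >= eps:
        return "accept"  # Rule 2: Pr(q ⊆ g) >= Pr(f ⊆ g) >= lower_b >= eps
    return "unknown"     # this feature is inconclusive for g
```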
Two main issues for probabilistic pruning:
1. How do we derive lower and upper bounds on the SIP?
2. How do we select features with strong pruning power?
Solutions
Technique 1: calculation of lower and upper bounds
Lemma: let Bf1, …, Bf|Ef| be all embeddings of f in gc; then Pr(f ⊆ g) = Pr(Bf1 ∪ … ∪ Bf|Ef|).

UpperB(Pr(f ⊆ g)):

Pr(f ⊆ g) = Pr(Bf1 ∪ … ∪ Bf|Ef|) = 1 − Pr(¬Bf1 ∩ … ∩ ¬Bf|Ef|)

Pr(f ⊆ g) ≤ 1 − ∏_{i=1}^{|Ef|} (1 − Pr(Bfi)) = UpperB(f)
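A minimal sketch of this bound (ours), where each embedding probability Pr(Bfi) is the product of the vertex and edge probabilities along that embedding:

```python
def upper_bound(embedding_probs):
    """embedding_probs[i] = Pr(Bfi) for the i-th embedding of f in gc."""
    miss_all = 1.0
    for p in embedding_probs:
        miss_all *= 1.0 - p      # probability that embedding i is absent
    return 1.0 - miss_all        # UpperB(f) = 1 - prod(1 - Pr(Bfi))

print(upper_bound([0.3, 0.4]))   # 0.58 for two embeddings with 0.3 and 0.4
```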
Solutions
Technique 1: calculation of lower and upper bounds
LowerB(Pr(f ⊆ g)): let IN ⊆ {1, …, |Ef|} index a set of pairwise-independent (disjoint) embeddings of f; then

Pr(f ⊆ g) = Pr(∪_{i=1}^{|Ef|} Bfi) ≥ Pr(∪_{i∈IN} Bfi) = 1 − ∏_{i∈IN} (1 − Pr(Bfi)) = LowerB(f)

Tightest LowerB(f):
[Figure: the three embeddings EM1, EM2, EM3 of feature f2 in graph 002, and the embedding graph bG whose nodes are the embeddings and whose edges connect independent (disjoint) embeddings.]
Finding the tightest lower bound converts into computing the maximum weight clique of the embedding graph bG, which is NP-hard; a greedy sketch follows.
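Since maximum weight clique is NP-hard, a cheap heuristic can still produce a valid (if looser) lower bound. A minimal sketch (our assumption, not the paper's exact algorithm), where each embedding is given as its probability plus the set of uncertain vertices/edges it uses:

```python
def lower_bound(embeddings):
    """embeddings: list of (probability, set of underlying vertex/edge ids).
    Greedily keep high-probability embeddings that stay pairwise disjoint;
    disjoint embeddings are independent, so the product formula is valid."""
    chosen = []
    for p, elems in sorted(embeddings, key=lambda e: -e[0]):
        if all(elems.isdisjoint(c_elems) for _, c_elems in chosen):
            chosen.append((p, elems))
    miss_all = 1.0
    for p, _ in chosen:
        miss_all *= 1.0 - p
    return 1.0 - miss_all        # LowerB(f) over the chosen independent set
```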
Solutions
Technique 1: calculation of lower and upper bounds
Exact values vs. upper and lower bounds:

[Plots: left panel "Value" shows the probability of Exact, UpperBound, and LowerBound vs. database size (50-250); right panel "Computing time" shows calculation time in seconds (log scale, 0.1-1000) vs. database size.]
Solutions
Technique 2: optimal feature selection

If we indexed all features, the index would have the most pruning power, but querying such an index would be very costly. We therefore select a small number of features with the greatest pruning power.

Cost model:
Max gain = sequential-scan cost − index-query cost

Maximum set coverage is NP-complete; we use the greedy algorithm to approximate it, as sketched below.
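A minimal sketch of the greedy selection (ours; we model each feature's gain as the set of (graph, query) pairs it resolves, a simplification of the paper's cost model):

```python
def select_features(gains, budget):
    """gains: {feature: set of (graph, query) pairs the feature resolves}.
    Repeatedly take the feature with the largest marginal coverage; the
    greedy algorithm achieves the classic 1 - 1/e approximation guarantee
    for maximum coverage."""
    selected, covered = [], set()
    for _ in range(budget):
        best = max(gains, key=lambda f: len(gains[f] - covered), default=None)
        if best is None or not gains[best] - covered:
            break                      # no remaining feature adds coverage
        selected.append(best)
        covered |= gains.pop(best)
    return selected
```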
Solutions
Technique 2: optimal feature selection

Maximum coverage: the greedy algorithm approximates the optimal index within a factor of 1 − 1/e.

Feature matrix (each entry is the (LowerB, UpperB) pair of Pr(f ⊆ g)):

        001             002
f1      (0.19, 0.19)    (0.27, 0.49)
f2      (0.27, 0.27)    (0.4, 0.49)
f3      0               (0.01, 0.11)

[Figure: queries q1, q2, q3 with thresholds 0.6, 0.2, and 0.5 matched against the feature matrix to decide which features enter the probabilistic index.]
Solutions
Probabilistic index:
- Construct a string for each feature.
- Construct a prefix tree over all feature strings.
- Construct an inverted list (ID-list) for each leaf node, storing each graph's bounds, e.g. ID-list(fa) = {<g1, 0.2, 0.6>, <g2, 0.4, 0.7>, …}.

[Figure: prefix tree with root and feature nodes fa, fb, fc, fd, each leaf carrying its ID-list, e.g. ID-list(fd) = {<g2, 0.3, 0.8>, <g4, 0.4, 0.6>, …}.]
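A minimal sketch of this index (ours; the field layout follows the slide, the class and method names are our own):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.id_list = []    # [(graph_id, lower_b, upper_b), ...]

class ProbabilisticIndex:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, feature_string, graph_id, lower_b, upper_b):
        node = self.root
        for ch in feature_string:
            node = node.children.setdefault(ch, TrieNode())
        node.id_list.append((graph_id, lower_b, upper_b))

    def lookup(self, feature_string):
        node = self.root
        for ch in feature_string:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.id_list

idx = ProbabilisticIndex()
idx.insert("fa", "g1", 0.2, 0.6)   # mirrors the slide's ID-list example
print(idx.lookup("fa"))            # [('g1', 0.2, 0.6)]
```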
Solutions
Verification: iterative bound pruning

Lemma: Pr(q ⊆ g) = Pr(Bq1 ∪ … ∪ Bq|Eq|)

Unfolding by the inclusion-exclusion principle, let

S_i = Σ_{J ⊆ {1,…,|Eq|}, |J| = i} Pr(∩_{j∈J} Bqj)

so that

Pr(q ⊆ g) = Σ_{i=1}^{|Eq|} (−1)^{i−1} S_i

Truncating after i terms gives alternating bounds:
Pr(q ⊆ g) ≤ Σ_{w=1}^{i} (−1)^{w−1} S_w   if i is odd
Pr(q ⊆ g) ≥ Σ_{w=1}^{i} (−1)^{w−1} S_w   if i is even

Iterative bound pruning: compute S_1, S_2, … and stop as soon as a truncated bound already decides whether Pr(q ⊆ g) meets the threshold ε.
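A minimal sketch of this verification loop (ours; joint_prob is an assumed callable returning Pr(∩_{j∈J} Bqj) for an index set J of embeddings):

```python
from itertools import combinations

def verify(n_embeddings, joint_prob, eps):
    """Iterative inclusion-exclusion with early termination (Bonferroni)."""
    partial = 0.0
    for i in range(1, n_embeddings + 1):
        s_i = sum(joint_prob(J) for J in combinations(range(n_embeddings), i))
        partial += s_i if i % 2 == 1 else -s_i
        if i % 2 == 1 and partial < eps:
            return False   # odd truncation is an upper bound: already < eps
        if i % 2 == 0 and partial >= eps:
            return True    # even truncation is a lower bound: already >= eps
    return partial >= eps  # full inclusion-exclusion value reached
```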
Solutions
Performance Evaluation
Real dataset: uncertain PPI networks
- 1,500 uncertain graphs
- 332 vertices and 584 edges on average
- Average probability: 0.367

Synthetic dataset: AIDS dataset
- Probabilities generated from a Gaussian distribution
- 10k uncertain graphs
- 24.3 vertices and 26.5 edges on average
Solutions
Performance Evaluation
Results on real dataset
[Plots: response time in seconds (log scale) and candidate size vs. query size (q50-q250), comparing PIndex/PFiltering against Non-PF and SCAN.]
Solutions
Performance Evaluation
Results on real dataset
[Plots: feature number (log scale) and response time in seconds (log scale) vs. the number of distinct labels (250 down to 50), comparing PFiltering against Non-PF.]
Solutions
Performance Evaluation
Response and Construction time
[Plots: response time in seconds (log scale) and index construction time in seconds vs. database size (2k-10k), comparing SFiltering, PFiltering, and E-Bound.]
Solutions
Performance Evaluation
Results on synthetic dataset
[Plots: feature number (log scale) and index size in MB vs. the Gaussian mean and variance parameters (0.3-0.7), comparing SFiltering and PFiltering.]
Conclusion
We propose the first efficient solution for threshold-based probabilistic subgraph search over uncertain graph databases.

We employ a filter-and-verification framework and develop probability bounds for filtering.

We design a cost model to select the minimum number of features with the largest pruning power.

We demonstrate the effectiveness of our solution through experiments on real and synthetic datasets.
Thanks!