Network Motif Discovery: A GPU Approach
Wenqing Lin†,‡, Xiaokui Xiao‡, Xing Xie§, Xiao-Li Li†
Massive Graphs
• Protein-Protein Interaction Networks
• Human Disease Networks
Subgraphs
Network Motif Discovery
• Frequency f
  – In a given graph G, subgraph s1 has frequency 1 and subgraph s2 has frequency 5.
• Random graphs G1, G2, ……

    subgraph | frequency in G1 | frequency in G2 | ……
    s1       | 1               | 0               | ……
    s2       | 5               | 5               | ……
Network Motif Discovery
• A network motif of G is a subgraph that satisfies the condition:
  – σ > 0 and (f − f̄) / σ ≥ θ,
  – where f is the subgraph's frequency in G, f̄ and σ are the mean and standard deviation of its frequency over the random graphs, and θ > 0 is a user-defined threshold.
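For illustration (hypothetical numbers): if f = 5 in G while the mean and standard deviation over the random graphs are f̄ = 1.2 and σ = 0.9, then (f − f̄) / σ = (5 − 1.2) / 0.9 ≈ 4.2, so the subgraph is reported as a motif for any threshold θ ≤ 4.2.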
• Objective
– Find all size-k network motifs of G.
• Applications
– Study functionalities in brain networks [PLoS Biol. 04]
– Protein function prediction in protein interaction networks [KDD 06, ICDE 07]
– Using network motifs to identify application protocols [GlobeCom 09]
– Assessing participation in online engineering communities [ASEE 14]
– etc.
Summary of CPU-Based Methods
• Two phases of computation
1. Subgraph enumeration in G.
   – Enumerate subgraphs within the k-hop neighborhood nbr(v1, k) of each vertex v1.
   – For each enumerated subgraph, e.g., (v1, v2, v3): compute its adjacency matrix (AM) and search the AM-index; if the AM exists there, immediately update the CL-index; if not, compute the canonical labeling (CL), search the CL-index, and update/insert the entry.
2. Frequency estimation in random graphs G’
• Generate a random graph G’.
• For each subgraph g of G, compute its frequency in G’ by reusing the AM-index and CL-index.
Summary of CPU-Based Methods
• An example of AMs and CLs
  – The enumeration on a sample graph (vertices v1–v7) finds, among others, the subgraphs (v1, v2, v3) and (v1, v4, v5).
  – Adjacency matrix of (v1, v2, v3): rows v1 = (0, 0, 0), v2 = (1, 0, 1), v3 = (0, 0, 0).
  – Adjacency matrix of (v1, v4, v5): rows v1 = (0, 1, 1), v4 = (0, 0, 0), v5 = (0, 0, 0).
• Canonical Labeling
  – Concatenating each row of an adjacency matrix subsequently gives a vector; the canonical labeling is the vertex ordering whose vector is the largest.
  – Both subgraphs above have the canonical labeling (0, 1, 1, 0, 0, 0, 0, 0, 0), so they are isomorphic.
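To make the canonical labeling concrete, here is a minimal brute-force sketch (hypothetical code, not the optimized routine of the CPU-based methods): it tries every vertex permutation of a small directed subgraph and keeps the largest row-concatenated bit vector.

```cuda
// Brute-force canonical labeling of a small directed subgraph (n <= 8).
// adj[i][j] == 1 iff there is an edge from vertex i to vertex j.
#include <algorithm>
#include <cstdint>
#include <vector>

uint64_t canonicalLabel(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> perm(n);
    for (int i = 0; i < n; ++i) perm[i] = i;
    uint64_t best = 0;
    do {
        // Concatenate the rows of the permuted adjacency matrix into a bit vector.
        uint64_t vec = 0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                vec = (vec << 1) | (uint64_t)adj[perm[i]][perm[j]];
        best = std::max(best, vec);  // the canonical labeling is the largest vector
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}
// Both subgraphs of the example produce the same label, e.g.
// canonicalLabel({{0,0,0},{1,0,1},{0,0,0}}) == canonicalLabel({{0,1,1},{0,0,0},{0,0,0}}),
// which is how isomorphic subgraphs are recognized.
```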
Summary of Our Method
• CPU: performs subgraph enumeration on G, maintaining the AM-index and CL-index and deriving a matching order for each subgraph g; a random graph generator produces the random graphs G’.
• For each subgraph g, compute its frequency in each G’ on the GPU by a subgraph matching algorithm.
• The inputs g and G’ are shipped to the GPU over PCI-Express, and the frequency f(g) is shipped back.
Graphics Processing Units (GPUs)
• A GPU has several multiprocessors (MP).
• Each MP has a large number of stream processors (SP), working in a SIMD manner.
(Figure: the CPU cores and main memory connect via PCI-Express to the GPU global memory; the GPU comprises several MPs, each containing many SPs, i.e., the GPU cores.)
Nvidia CUDA Programming
• Kernels
  – Run on the GPU, invoked by the CPU.
• Thread Hierarchy
  – Each thread executes a kernel on a GPU core.
  – Threads are organized into blocks, each of which runs on an MP; the blocks of threads form a grid.
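A minimal CUDA sketch (illustrative only, not the paper's code) of these notions: the CPU invokes the scale kernel on a grid of blocks, and each thread processes one array element on a GPU core.

```cuda
#include <cuda_runtime.h>

// Kernel: runs on the GPU, one thread per array element.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    int threadsPerBlock = 256;                                 // a block of threads
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // a grid of blocks
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);            // invoked by the CPU
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```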
Nvidia CUDA Programming
Given a kernel code: If A then B else C
• Branch Divergence
  – Different execution paths are executed sequentially: thread T1 runs A then B while thread T2 runs A then C, and each path waits for the other to finish.
• Memory Coalescing
  – Requests to consecutive GPU global memory locations are combined into one access.
  – But t non-contiguous memory locations require t random accesses.
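These two effects can be illustrated with toy kernels (my sketch, not the paper's): the first contains the divergent "If A then B else C" pattern; the other two contrast coalesced and scattered loads.

```cuda
// Branch divergence: threads of one warp that take different paths are serialized.
__global__ void divergent(const int* a, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (a[i] % 2 == 0)          // "If A then B else C"
        out[i] = a[i] * 2;      // path B: these threads run while the others wait
    else
        out[i] = a[i] + 1;      // path C: executed afterwards by the remaining threads
}

// Memory coalescing: consecutive addresses are served by one access ...
__global__ void coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // thread i reads address i
}

// ... but t scattered addresses cost t random accesses.
__global__ void strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(int)(((long long)i * stride) % n)];
}
```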
Difficulties in GPU Translations
• Computing canonical labeling involves complicated execution paths.
  – Incurs significant overheads from branch divergence.
• Random accesses to the AM-index and CL-index.
  – Defeat memory coalescing.
  – The index size is too large to fit in GPU global memory.
• Examining subgraphs with different structures is inefficient.
  – Workload imbalance among GPU threads.
Our Solutions
• Parallelize frequency estimation on the GPU.
  – Frequency estimation for many subgraphs over many random graphs is significantly costly.
• Iteratively compute the frequency of g, matching one vertex per iteration (see the sketch after this list).
  – Start from size-2 partial matches, e.g., <v1, v2>.
  – Expansion: each GPU thread generates one size-3 candidate by inspecting one neighbor, e.g., the neighbors {…, v3, v4, v5, …} of v2 yield the candidates <v1, v2, v3>, <v1, v2, v4>, <v1, v2, v5>.
  – Verification: each GPU thread checks one neighbor to verify one size-3 candidate, e.g., the neighbors {…, v1, v2, …} of v5 confirm <v1, v2, v5>; the surviving size-3 graphs are <v1, v2, v3> and <v1, v2, v5>.
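A simplified CUDA sketch of the expansion step (hypothetical data layout: CSR adjacency for G', candidates flattened row-wise, and a precomputed work list, e.g., built by a prefix sum over neighbor counts, assigning each thread one (candidate, neighbor) pair):

```cuda
__global__ void expandCandidates(
        const int* rowPtr, const int* colIdx,     // CSR adjacency lists of G'
        const int* cand, int candSize,            // size-i candidates, flattened
        const int* pairCand, const int* pairNbr,  // thread -> (candidate id, neighbor slot)
        int numPairs,
        int* outCand) {                           // size-(i+1) candidates
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numPairs) return;
    int c = pairCand[t];                          // the partial match this thread extends
    int last = cand[c * candSize + candSize - 1]; // its most recently matched vertex
    int v = colIdx[rowPtr[last] + pairNbr[t]];    // inspect one neighbor of that vertex
    for (int j = 0; j < candSize; ++j)            // copy the size-i prefix ...
        outCand[t * (candSize + 1) + j] = cand[c * candSize + j];
    outCand[t * (candSize + 1) + candSize] = v;   // ... and append the new vertex
}
```

A companion verification kernel would likewise assign one neighbor check per thread and compact the surviving candidates.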
Roadmap
• Preliminaries
• Solution Overview
• Optimizations
• Experimental Evaluation
• Conclusion
Handling Large Candidates
• A straightforward solution
– Store candidates in main memory.
(Figure: the size-i candidates C reside in main memory and are transferred over PCI-Express to GPU global memory, where the size-(i+1) candidates C* are produced.)
• This incurs numerous data transfers between the main memory and the global memory.
Handling Large Candidates
• An advanced solution: memory-driven.
• Given available GPU global memory M′: store a subset of the size-i candidates in λM′ of it and expand them into size-(i+1) candidates in the remaining memory M′′ = (1−λ)M′; recurse on M′′.
• Each subset j yields a partial frequency fj*(g), and f(g) = f1*(g) + f2*(g) + ……
• Total number of subsets (where |Cᵢ| is the number of size-i candidates and d bounds the number of expansions per candidate):
    ∑_{j=i}^{k−2} [ |Cᵢ|·d^(j−i) / (M·λ·(1−λ)^(j−i)) + |Cᵢ|·d^(j−i−1) / (M·(1−λ)^(j−i−1)) ]
• Minimized when λ = 1/(k−i).
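As an illustration (hypothetical numbers): for k = 6 and i = 2, the optimum is λ = 1/(6−2) = 1/4, i.e., a quarter of the available memory holds the current candidate subset and the remainder receives its expansions.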
Handling Multiple Graphs
• When the random graph contains a small number of vertices and edges, or the frequency of the subgraph is small, we can process multiple graphs on the GPU at a time.
  – μ random graphs G1’, G2’, ……, Gμ’ reside in main memory; their size-2 candidates reside in global memory.
• The total number of subsets is minimized when μ·M_G = M / (k−2), where M_G is the memory occupied by one random graph.
Matching Tree
• The matching orders of any two graphs may share a common prefix; organize all matching orders into a matching tree.
• Traverse the matching tree by DFS.
• Re-use the computation performed at ancestor nodes.
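A host-side sketch of this idea (hypothetical structure, not the paper's exact layout): inserting every subgraph's matching order into a trie makes shared prefixes explicit, so a DFS over the trie can reuse partial matches computed at ancestor nodes for all subgraphs below them.

```cuda
#include <map>
#include <vector>

// One node of the matching tree: a step in a matching order.
struct MatchingTreeNode {
    std::map<int, MatchingTreeNode*> child;  // next steps, keyed by step id
    std::vector<int> graphs;                 // subgraphs whose matching order ends here
};

// Insert one subgraph's matching order, sharing any existing prefix.
void insertOrder(MatchingTreeNode* root, const std::vector<int>& order, int graphId) {
    MatchingTreeNode* cur = root;
    for (int step : order) {
        auto it = cur->child.find(step);
        if (it == cur->child.end())
            it = cur->child.emplace(step, new MatchingTreeNode()).first;
        cur = it->second;
    }
    cur->graphs.push_back(graphId);
}
```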
Roadmap
• Preliminaries
• Solution Overview
• Optimizations
• Experimental Evaluation
• Conclusion
Experimental Evaluation
• Datasets
    dataset | |V|    | |E|     | avg deg | max out-deg | max in-deg
    YE      | 688    | 1,079   | 3.14    | 71          | 13
    HS      | 1,509  | 5,598   | 7.42    | 71          | 45
    YP      | 2,361  | 6,646   | 5.63    | 64          | 47
    MM      | 4,293  | 7,987   | 3.72    | 91          | 111
    DM      | 6,303  | 18,224  | 5.78    | 88          | 122
    AT      | 9,216  | 50,669  | 11.00   | 58          | 89
    CE      | 17,179 | 124,599 | 14.51   | 67          | 107
• Devices
    name                           | # of cores | core freq. | GPU memory | price (USD)
    Intel Xeon E5645 CPU           | 6          | 2400MHz    | N/A        | 513.39
    Nvidia Quadro 2000 GPU (Q2000) | 192        | 625MHz     | 1GB        | 277.77
    Nvidia Tesla K20 GPU (K20)     | 2496       | 706MHz     | 5GB        | 2695.00
• Default settings: device K20, motif size k = 6, r = 1000 random graphs.
Effects of optimizations
• NA: all optimization methods are disabled.
• DC: only the divide-and-conquer method is enabled.
• DC-GM: only the divide-and-conquer and matching tree methods are enabled.
• Device: K20.
Comparisons with CPU-Based Techniques
(Figure: comparisons against the CPU-based techniques from BMC Bio. 09, PLoS One 12, PLoS One 13, and JPDC 12; NemoGPU on K20.)
K20 vs. Q2000
(Figures: (a) improvements over DistributeNM; (b) performance-price ratio.)
Conclusion
• The first GPU-based solution to the problem of network motif discovery.
• Mitigates GPUs' limitations in terms of the computation power per GPU core.
• A number of efficient and effective optimization techniques.
• Several orders of magnitude more cost-effective than CPU-based approaches.
Thank you!