
Performance Evaluation of Breadth-First Search
on Intel Xeon Phi
Alexander Frolov, Elizaveta Golovina, and Alexander Semenov
OAO “NICEVT”, Varshavskoe shosse, 125, 117587 Moscow, Russia
{frolov,golovina,semenov}@nicevt.ru,
WWW home page: http://www.dislab.org
Abstract. Breadth-First Search (BFS) is one of the most important
kernels in graph computing. It is the main kernel of the Graph500 rating
that evaluates performance of large supercomputers and multiprocessor
nodes in terms of traversed edges per second (TEPS). In this paper we
present the results of BFS performance evaluation on a recently released
high-performance Intel Xeon Phi coprocessor. We examine previously
proposed Queue-based and Read-based approaches to BFS implementation. We also apply several optimization techniques, such as manual
loop unrolling and prefetching, that significantly improve performance
on Intel Xeon Phi. On a representative graph set Intel Xeon Phi 7120P
demonstrates 178 % maximal and 137 % average speedup as compared
to the Intel Xeon E5-2660 processor. We achieved 4366 MTEPS on Intel
Xeon Phi 7120P for the graph with scale 25 and have the 89th place
on the November 2013 Graph500 list. This is the fourth place among
research teams in the class of single node x86-based systems.
Keywords: Intel Xeon Phi, Breadth-First Search, graph algorithms
1 Introduction
Large-scale graph processing is a relatively new and fast-growing application area in HPC. It is usually characterized by large datasets stored in the memory of compute nodes and by low spatial and temporal locality of memory access. This renders ineffective many hardware and software optimizations designed for regular access patterns, such as hardware prefetching, multilevel data cache hierarchies, TLBs, and DDR burst-mode operations.
Heterogeneous computing with GPUs and many-core coprocessors, such as Nvidia Kepler, AMD FireStream, and Intel Xeon Phi, has become widespread in HPC, especially in large supercomputers holding top positions on the Top500 list. For example, the current No. 1 system (as of November 2013), the Tianhe-2 supercomputer, has 48,000 Intel Xeon Phi coprocessors.
GPUs were originally designed for applications that are well suited to a streaming architecture. However, significant efforts have recently been made to optimize irregular problems, such as graph processing, for high performance on GPUs. The Intel Xeon Phi coprocessor has a many-core multithreaded architecture and is more versatile than GPUs. At first glance, however, high performance can only be achieved by using its 512-bit vector arithmetic instructions, and vectorization of graph applications seems difficult.
In this paper we present a performance evaluation of several Breadth-First Search algorithms implemented on the Intel Xeon Phi coprocessor and the Intel Xeon Sandy Bridge-EP processor.
2 Breadth-First Search Algorithms
Breadth-First Search (BFS) is an important building block of many graph applications. BFS is the main kernel of the Graph500 [1] rating, which evaluates the performance of large supercomputers and multiprocessor nodes. Starting from the source vertex, the frontier expands outwards at each step, visiting all of the vertices at the same depth before visiting any at the next depth. The set of vertices at the same depth is called a level.
We investigate two approaches to BFS parallelization on two different multicore architectures: Intel Sandy Bridge and Intel Many Integrated Core (MIC). We consider several algorithms in each approach:
1. Queue-based approach
(a) naive algorithm
(b) block algorithm
2. Read-based approach
(a) top-down algorithm
(b) hybrid algorithm
The first approach, called Queue-based, relies on queue-type data structures for workload distribution and represents a conventional technique of multithreaded programming [2, 3]. The second, called Read-based, relies on iteratively reading an array containing the level numbers of the graph vertices and was originally proposed for GPU architectures [2]. The two approaches organize parallel graph processing in fundamentally different ways, and, as we show below, this is reflected in the obtained performance results.
In all implementations we store the graph as an adjacency matrix in Compressed Row Storage (CRS) format. All algorithms are implemented using C++
and OpenMP.
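To make the data layout concrete, below is a minimal C++ sketch of a CRS graph as we use the term here; the type and field names (Graph, row_ptr, adj) are our own illustration rather than the authors' code, and the later sketches in this paper reuse this struct.

#include <cstdint>
#include <vector>

// CRS adjacency: the neighbors of vertex v are stored contiguously in
// adj[row_ptr[v] .. row_ptr[v+1]); row_ptr has |V|+1 entries, adj has |E|.
struct Graph {
    std::vector<uint32_t> row_ptr;
    std::vector<uint32_t> adj;
};

// Example: the degree of a vertex is the length of its CRS row.
inline uint32_t degree(const Graph& g, uint32_t v) {
    return g.row_ptr[v + 1] - g.row_ptr[v];
}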
2.1 Queue-based Approach
We examine two algorithms representing the Queue-based approach: a naive
Queue-based algorithm and a block Queue-based algorithm.
In the naive Queue-based algorithm vertex numbers of the current level are
stored in the Q array, while vertex numbers of the next level are added to the
Qnext array. Each vertex in Q is processed to determine if it has any unvisited
neighbors by testing all of its neighbors. All unvisited neighbors are marked
as visited in the marked array and added to Qnext to be processed at the next
level. Then Q and Qnext are swapped and the next level is processed. Maintaining consistency of Qnext requires the atomic operation __sync_fetch_and_add to avoid interference with other threads when adding a new vertex to Qnext. The algorithm is presented in Fig. 1. In order to reduce the use of the atomic __sync_fetch_and_add operation, in the block Queue-based algorithm [4] each thread allocates a portion of Qnext of size k for adding vertices to the next level. When the portion is full, the thread allocates another portion using the same atomic operation. This optimization reduces the number of atomic operations by a factor of k (a hedged C++/OpenMP sketch of this blocking scheme is given after Fig. 1).
Qcounter = 1                                   // initialization of the Q vertex counter
Q[0] = r                                       // Q array initialization
marked[r] = 1                                  // mark source vertex r
while Qcounter > 0                             // while the current level is not empty
    Qnext_counter = 0                          // zero the next level counter
    #pragma omp parallel                       // parallel level processing
    for all vertex in Q do
        // for all vertices w that are neighbors of vertex
        for all w : (vertex, w) in E do
            if marked[w] == 0 then             // if w is unvisited
                // add w to Qnext
                Qnext[__sync_fetch_and_add(Qnext_counter, 1)] = w
                marked[w] = 1                  // mark w
            end if
        end for
    end for
    // switch to the next level: swap Q and Qnext, Qcounter = Qnext_counter
    swap(Q, Qnext)
end while

Fig. 1: Naive Queue-based algorithm
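The following is a hedged C++/OpenMP sketch of the blocking idea, not the authors' code: each thread collects newly discovered vertices in a local buffer of size k and reserves space in Qnext with one __sync_fetch_and_add per flush instead of one per vertex. The Graph struct is the CRS sketch given earlier; BLOCK, the benign duplicate-enqueue race, and the flush-based variant of the block allocation are our assumptions.

#include <cstdint>
#include <vector>
// Assumes the CRS struct Graph { row_ptr, adj } sketched earlier.

constexpr uint32_t BLOCK = 64;                 // portion size k

void expand_level(const Graph& g,
                  const std::vector<uint32_t>& Q, uint32_t q_count,
                  std::vector<uint32_t>& Qnext, uint32_t& qnext_count,
                  std::vector<uint8_t>& marked)
{
    #pragma omp parallel
    {
        uint32_t buf[BLOCK];                   // thread-local portion
        uint32_t used = 0;
        auto flush = [&]() {                   // reserve 'used' slots with one atomic
            uint32_t base = __sync_fetch_and_add(&qnext_count, used);
            for (uint32_t j = 0; j < used; ++j) Qnext[base + j] = buf[j];
            used = 0;
        };
        #pragma omp for schedule(dynamic, 64) nowait
        for (uint32_t i = 0; i < q_count; ++i) {
            uint32_t v = Q[i];
            for (uint32_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
                uint32_t w = g.adj[e];
                if (marked[w] == 0) {          // benign race: w may be enqueued twice,
                    marked[w] = 1;             // the traversal remains correct
                    buf[used++] = w;
                    if (used == BLOCK) flush();
                }
            }
        }
        if (used > 0) flush();                 // flush the partially filled portion
    }
}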
2.2 Read-based Approach
For the Read-based approach we examine two algorithms: a top-down algorithm and a hybrid bottom-up algorithm.
Originally the Read-based approach was developed for GPUs. Its main idea is to use a single array called levels for workload distribution. The size of levels equals the number of vertices in the graph. Each element of levels contains the level number of the corresponding vertex, or −1 for unvisited vertices. At each level, levels is scanned and the vertices belonging to the current level are detected and processed: their neighbor lists are read and the next level number is stored in levels for their unvisited neighbors. This algorithm is called the top-down algorithm (Fig. 2).
Another elegant idea for BFS implementation was proposed in [5]. It was observed that many real-world graphs (such as social networks) have the following property: when a considerable part of the search is done, at some levels only a small number of unvisited neighbors is left, i.e. a large amount of vertex processing is useless. For these levels it is efficient to use a bottom-up search: the neighbors of all unvisited vertices are examined, and if any neighbor of an unvisited vertex belongs to the current level, then the vertex becomes a child of that neighbor in the search tree, and there is no need to examine the remaining neighbors of the vertex. Processing of a single level with the bottom-up algorithm is presented in Fig. 3.
We propose a hybrid bottom-up algorithm. In this algorithm we use the top-down Read-based algorithm for some levels (unlike [5], where the Queue-based approach is used for the top-down phase), and for the other levels we use the bottom-up algorithm; a hedged sketch of one possible way to combine the two kernels is given after Fig. 3.
 1  curLevel = 0                      // level number initialization
 2  levels[r] = curLevel              // source vertex r will be processed at level number 0
 3  levelVertsCount = 1               // number of vertices at the current level
 4  while levelVertsCount > 0         // while there are any vertices at the current level
 5      levelVertsCount = 0
 6      // parallel level processing
 7      #pragma omp parallel for reduction(+:levelVertsCount)
 8      for all vertex in V do        // for all graph vertices
 9          // ignore vertices that are not at the current level
10          if levels[vertex] == curLevel then
11              // for all vertices w that are neighbors of vertex
12              for all w : (vertex, w) in E do
13                  if levels[w] == -1 then       // if w is unvisited
14                      levels[w] = curLevel + 1  // mark w for the next level
15                      levelVertsCount = levelVertsCount + 1
16                  end if
17              end for
18          end if
19      end for
20      curLevel = curLevel + 1
21  end while

Fig. 2: Top-down algorithm, Read-based approach
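For reference, the level scan of Fig. 2 can be transcribed into C++/OpenMP roughly as follows; this is a sketch under our own naming (top_down_level, the Graph struct from the earlier CRS sketch), not the authors' exact code.

#include <cstdint>
#include <vector>
// Assumes the CRS struct Graph { row_ptr, adj } sketched earlier;
// levels[v] holds the BFS level of v, or -1 if v is unvisited.

// Processes one level; returns the number of vertices placed on level cur_level + 1.
uint32_t top_down_level(const Graph& g, std::vector<int32_t>& levels, int32_t cur_level)
{
    const uint32_t n = static_cast<uint32_t>(g.row_ptr.size()) - 1;
    uint32_t found = 0;
    #pragma omp parallel for reduction(+:found) schedule(dynamic, 1024)
    for (uint32_t v = 0; v < n; ++v) {
        if (levels[v] != cur_level) continue;          // sequential scan of levels
        for (uint32_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
            uint32_t w = g.adj[e];
            if (levels[w] == -1) {                     // random access into levels
                levels[w] = cur_level + 1;             // benign race, as in Fig. 2:
                ++found;                               // 'found' only needs to be > 0
            }
        }
    }
    return found;
}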
// parallel level processing
#pragma omp parallel for reduction(+:levelVertsCount)
for all vertex in V do                         // for all graph vertices
    if levels[vertex] == -1 then               // if vertex is unvisited
        // for all vertices w that are neighbors of vertex
        for all w : (vertex, w) in E do
            if levels[w] == curLevel then      // if w is at the current level
                levels[vertex] = curLevel + 1  // mark vertex for the next level
                levelVertsCount = levelVertsCount + 1
                break                          // quit the inner for loop
            end if
        end for
    end if
end for

Fig. 3: Level processing using the bottom-up algorithm
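The paper text does not fix the rule for choosing between the two level kernels, so the driver below is only an illustrative sketch: it switches to the bottom-up kernel when the current frontier exceeds a fixed fraction of all vertices, which is one common heuristic; the threshold alpha and the function names are our assumptions, not the authors' choices.

#include <cstdint>
#include <vector>
// Assumes the CRS struct Graph and top_down_level() from the earlier sketches,
// plus an analogous bottom_up_level() implementing the loop of Fig. 3.
uint32_t top_down_level(const Graph& g, std::vector<int32_t>& levels, int32_t cur_level);
uint32_t bottom_up_level(const Graph& g, std::vector<int32_t>& levels, int32_t cur_level);

void hybrid_bfs(const Graph& g, uint32_t root, std::vector<int32_t>& levels)
{
    const uint32_t n = static_cast<uint32_t>(g.row_ptr.size()) - 1;
    const double alpha = 0.05;                   // illustrative switching threshold
    levels.assign(n, -1);
    levels[root] = 0;
    uint32_t frontier = 1;
    for (int32_t lvl = 0; frontier > 0; ++lvl) {
        bool big_frontier = frontier > alpha * n;                   // large frontier:
        frontier = big_frontier ? bottom_up_level(g, levels, lvl)   // scan the unvisited side
                                : top_down_level(g, levels, lvl);
    }
}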
3 Performance Analysis
Table 1 provides configuration details of the hardware platforms used for the BFS performance evaluation. The first is a single-socket Intel Xeon Sandy Bridge-EP server platform, and the other two are the Intel Xeon Phi 5110P (Phi-5110P) and Intel Xeon Phi 7120P (Phi-7120P) coprocessor add-on cards. Both Phi-5110P and Phi-7120P are implementations of the same coprocessor core; Phi-7120P was released in the second quarter of 2013, six months later than Phi-5110P. Phi-7120P has an increased frequency enabled by Intel Turbo Boost Technology and twice as much memory as Phi-5110P.
Performance of the BFS algorithms is measured in Millions of Traversed
Edges per Second (MTEPS). We use both Intel Xeon Phi platforms in native
mode, i.e. computation is exclusively performed on Intel Xeon Phi without using
the host CPU.
3.1 Performance Evaluation
The performance of the investigated algorithms on a Uniform Random graph with 134 M vertices and average vertex degree 8 on Sandy Bridge-EP and on Phi-5110P is presented in Fig. 4 and Fig. 5. The performance of the naive Queue-based algorithm on Phi-5110P is 20 times lower than on Sandy Bridge-EP when a single thread is used. Thus high performance on Intel Xeon Phi can only be obtained with a large number of threads. However, the scalability of the naive and block Queue-based algorithms is very poor, despite the reduction of atomic operation usage in the latter.
The Read-based algorithms show significantly better scalability and performance than the Queue-based algorithms, as shown in Fig. 4 and Fig. 5.
Table 1: System specifications

                          Sandy Bridge-EP       Phi-5110P          Phi-7120P
Intel Model               Xeon E5-2660          Xeon Phi 5110P     Xeon Phi 7120P
CPU speed, GHz            2.2                   1.05               1.238
Number of sockets         1                     1                  1
Number of cores           8                     60                 61
Threads per core          2                     4                  4
Data cache sizes          64 Kb*/2 Mb*/20 Mb    32 Kb*/512 Kb*     32 Kb*/512 Kb*
Memory size, GB           32                    8                  16
Memory type               DDR3                  GDDR5              GDDR5
Memory bandwidth, GB/s    51                    352                352
Memory latency, ticks     200                   300                350
* – per one core
Fig. 4: BFS performance of Sandy Bridge-EP on Uniform Random graph with 134 M vertices and average vertex degree 8 (performance in MTEPS vs. number of threads, 1–16; curves: simple, block, read, read+prefetch, hybrid, hybrid+prefetch, hybrid+prefetch+relabel)
This improvement can be explained by several advantages of the Read-based approach over the Queue-based one. First, the Read-based algorithms do not use atomic operations, which heavily limit performance scalability on a large number of cores and threads. Second, the Q and Qnext arrays are not used, so memory usage is reduced and, as a result, the cache is used more efficiently. Finally, the spatial locality of the memory access pattern in the Read-based approach is much higher than in the Queue-based approach [2]. Indeed, at each level of the Read-based algorithms the levels array is read sequentially, and the CRS packed neighbor array is accessed with monotonically increasing indices with possible jumps over contiguous memory portions.
Fig. 5: BFS performance of Phi-5110P on Uniform Random graph with 134 M vertices and average vertex degree 8 (performance in MTEPS vs. number of threads, up to 240; curves: naive, block, top-down, top-down+prefetch, hybrid, hybrid+prefetch, hybrid+prefetch+relabel)
A sequential access pattern permits better utilization of memory bandwidth, raises cache efficiency, allows the hardware prefetcher to work effectively, and reduces the number of TLB misses. In addition, the threads share the CRS neighbors array better, since it is not loaded into the cores' caches repeatedly.
It may seem that the Read-based algorithms perform redundant work by processing the whole levels array at each level. But since sequential memory bandwidth is very high and the processing time of vertices that are not at the current level is negligible, this overhead is very small.
However, there is still random access in the Read-based approach. In line 10 of the top-down algorithm in Fig. 2 the index used to access the levels array is the vertex w taken from the CRS neighbors array, and for many graphs w is effectively random. A memory bandwidth comparison for sequential (vectorized) access and random access on Sandy Bridge-EP and Phi-5110P is presented in Table 2. The performance results for vectorized memory access are taken from [6] for the SE10P coprocessor (a pre-release MIC card). All other results were obtained using DISBench [7]. Table 2 shows that a random memory access pattern is very expensive. The reasons are the following. When a stream of random addresses is issued and only one data word is requested from each cache line, the efficiency of the integrated DRAM controller is greatly reduced. First, the amount of useful data transmitted over the memory bus drops. Second, the memory chips require additional commands to handle a stream of non-consecutive addresses.
The hybrid algorithm considerably speeds up the processing of some levels, and as a result we achieve even better performance, see Fig. 4 and Fig. 5. However, random access is present in the hybrid algorithm as well.
Table 2: Memory bandwidth (in GB/s) for read and write operations on Sandy Bridge-EP and Phi-5110P for sequential and random access patterns

                  Sequential access        Random access
                  Read        Write        Read        Write
Sandy Bridge-EP   42          19           3.3         2.2
Phi-5110P         183         160          3.8         3.4

3.2 Optimizations
To optimize the performance of our algorithms on Intel Xeon Phi we first had to identify the bottleneck: it could be either the bandwidth or the latency limitation of random access. We introduced manual unrolling of the loop in line 8 of the top-down algorithm (Fig. 2) and manual prefetching of levels[w] into the cache using the _mm_prefetch intrinsic. Together these techniques (the top-down+prefetch algorithm) provide a 2.1 times increase of single-thread performance on Phi-5110P, and 1.52 times for 240 threads, see Fig. 5.
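As a rough illustration of what this optimization can look like in the top-down kernel (our sketch, not the authors' code; the prefetch distance is an arbitrary tuning parameter, and the manual loop unrolling applied in the paper is omitted for brevity), the inner neighbor loop can issue _mm_prefetch for a levels entry a few iterations ahead:

#include <cstdint>
#include <vector>
#include <immintrin.h>                          // _mm_prefetch
// Assumes the CRS struct Graph and the levels convention from the earlier sketches.

uint32_t top_down_level_prefetch(const Graph& g, std::vector<int32_t>& levels,
                                 int32_t cur_level)
{
    const uint32_t n = static_cast<uint32_t>(g.row_ptr.size()) - 1;
    const uint32_t DIST = 8;                    // prefetch distance (tuning parameter)
    uint32_t found = 0;
    #pragma omp parallel for reduction(+:found) schedule(dynamic, 1024)
    for (uint32_t v = 0; v < n; ++v) {
        if (levels[v] != cur_level) continue;
        const uint32_t begin = g.row_ptr[v], end = g.row_ptr[v + 1];
        for (uint32_t e = begin; e < end; ++e) {
            if (e + DIST < end)                 // software prefetch of a future levels[w]
                _mm_prefetch(reinterpret_cast<const char*>(&levels[g.adj[e + DIST]]),
                             _MM_HINT_T0);
            uint32_t w = g.adj[e];
            if (levels[w] == -1) {
                levels[w] = cur_level + 1;
                ++found;
            }
        }
    }
    return found;
}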
This improvement shows that the bottleneck of the top-down algorithm on Intel Xeon Phi is the memory latency of random access. For the top-down+prefetch algorithm the bottleneck is most likely the maximal rate of random accesses, which is determined by the memory bandwidth.
At the same time, on Sandy Bridge-EP the performance of the top-down+prefetch algorithm equals the performance of the top-down algorithm. In other words, manual loop unrolling and data prefetching had no effect. We assume that this can be explained by the high quality of the code generated by the Intel C compiler.
We applied the same optimizations to the hybrid algorithm; we call the result the hybrid+prefetch algorithm.
Another possible way to increase performance is to improve the data locality of accesses to the levels array (for example, in line 10 of the top-down algorithm, Fig. 2). This can be done by preprocessing the adjacency matrix into band form with the reverse Cuthill-McKee algorithm [8]. Since the rows of the matrix in the Read-based approach are processed sequentially, in accordance with the sequential processing of the levels array, the cache hit rate for accesses to the levels array increases. In addition, the neighbor adjacency lists are sorted to reduce TLB misses. We call the hybrid+prefetch algorithm with the preprocessing described above hybrid+prefetch+relabel. On Phi-5110P its performance is 12 % higher than that of the hybrid+prefetch algorithm.
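The reverse Cuthill-McKee reordering itself is a standard procedure described in [8], so we only sketch the neighbor-list sorting step mentioned above; the function name and the parallel per-row sorting are our own illustration, not the authors' code.

#include <algorithm>
#include <cstdint>
#include <vector>
// Assumes the CRS struct Graph sketched earlier; called once after relabeling.

void sort_neighbor_lists(Graph& g)
{
    const uint32_t n = static_cast<uint32_t>(g.row_ptr.size()) - 1;
    #pragma omp parallel for schedule(dynamic, 256)
    for (uint32_t v = 0; v < n; ++v)
        std::sort(g.adj.begin() + g.row_ptr[v],       // ascending neighbor ids keep
                  g.adj.begin() + g.row_ptr[v + 1]);  // accesses within a row monotonic
}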
We used the libhugetlbfs library for large-page support on Phi-5110P, but it gave no performance gain.
3.3 Performance Comparison
We evaluated the performance of Intel Xeon Phi and Intel Xeon Sandy Bridge-EP using Uniform Random graphs, RMAT [9] graphs with (A, B, C) = (0.45, 0.25, 0.15), default Graph500 Kronecker graphs [1], and default SSCA2 graphs [10]; see Table 3. Below, Random-k and RMAT-k denote Uniform Random and RMAT graphs with average degree k.
The performance of the best algorithms (hybrid+prefetch+relabel or hybrid+prefetch) on Phi-7120P (244 threads) and Sandy Bridge-EP (16 threads) on four graph types is presented in Fig. 6. The X-axis shows the number of vertices in the graph. For each graph type on Phi-7120P the best performance among the variants with manual loop unrolling factors of 2, 4, 8, and 16 is given.
On Sandy Bridge-EP the performance rapidly increases for smaller data sizes until the data no longer fit in the cache, after which the performance gradually drops. Therefore Sandy Bridge-EP is not very efficient for data-intensive problems with large datasets. The performance on Phi-7120P, on the other hand, increases slowly and at some point overtakes Sandy Bridge-EP. Intel Xeon Phi is better suited to massive parallelism and the use of its high memory bandwidth.
Intel Xeon Phi 7120P has 16 GB of memory compared to 8 GB on Intel Xeon
Phi 5110P. Overall performance comparison of Sandy Bridge-EP, Phi-5110P and
Phi-7120P on maximal graphs fitting in 8 GB memory of Phi-5110P is presented
in Fig. 7. The maximal, average, and minimal performance on Phi-5110P and Phi-7120P is 134 %, 98 %, 71 % and 165 %, 121 %, 89 % of Sandy Bridge-EP performance, respectively. Overall performance comparison of Sandy Bridge-EP and Phi-7120P
on maximal graphs fitting in 16 GB memory is presented in Fig. 8. Maximal,
average and minimal performance on Phi-7120P are 178 %, 137 % and 100 % of
Sandy Bridge-EP performance. The increased frequency and memory capacity of
Phi-7120P compared to Phi-5110P make Intel Xeon Phi 7120P rather attractive.
We achieved a 4366 MTEPS Graph500 result on Intel Xeon Phi 7120P on the graph with scale 25, which takes the 89th place on the November 2013 list. There is no other Intel Xeon Phi system in this Graph500 list, and this is the fourth place among research teams in the class of single-node x86-based systems. A performance comparison of single-node x86-based systems from different research teams is presented in Table 4.
4 Conclusion
In this paper we studied two different approaches to Breadth-First Search (BFS) implementation: the conventional Queue-based approach and the stream-type Read-based approach. We experimentally showed that on the Intel Xeon Sandy Bridge-EP and Intel Xeon Phi processors better performance is obtained with the Read-based algorithms. The Read-based approach is characterized by the absence of atomic operations, intensive use of high memory bandwidth for sequential access, and the presence of random access. Several optimization techniques, such as manual loop unrolling and prefetching, were applied to the Read-based algorithms; they significantly improved performance on Intel Xeon Phi while showing no significant effect on Intel Xeon Sandy Bridge-EP.
Fig. 6: Performance comparison of Intel Xeon Sandy Bridge-EP and Intel Xeon Phi 7120P on Uniform Random and RMAT graphs of various sizes (performance in MTEPS vs. SCALE, where the number of vertices in the graph is 2^SCALE; curves: Random-8, Random-32, RMAT-8 and RMAT-32, each on SB and on MIC-7120P)
Fig. 7: Performance comparison of Intel Xeon Sandy Bridge-EP, Intel Xeon Phi 5110P and Intel Xeon Phi 7120P on maximal graphs fitting in 8 GB memory of Phi-5110P (relative performance, % of Sandy Bridge-EP; graphs: random-8-27, random-32-25, random-64-24, RMAT-8-26, RMAT-32-24, RMAT-64-23, SSCA2-25, graph500-25)
Fig. 8: Performance comparison of Intel Xeon Sandy Bridge-EP and Intel Xeon Phi 7120P on maximal graphs fitting in 16 GB memory of Phi-7120P (relative performance, % of Sandy Bridge-EP; graphs: random-8-28, random-32-26, random-64-25, RMAT-8-27, RMAT-32-25, RMAT-64-24, SSCA2-26, Graph500-25)
Table 3: Graphs used for evaluation

Abbreviation   Graph            # Vertices   # Edges (M)   Degree   Directed   Generator
random-k-n     Uniform Random   2^n          k*2^n         k        N          Own
RMAT-k-n       RMAT             2^n          k*2^n         k        Y          [11]
SSCA2-25       SSCA2            2^25         267.8         8.0      Y          [12]
SSCA2-26       SSCA2            2^26         720.1         10.7     Y          [12]
graph500-25    Kronecker        2^25         1047.2        31.2     N          [1]
Table 4: Graph500 performance (November 2013) of single node x86-based systems for different research teams

Organization                                 System                               Graph500 Scale   GTEPS
Chuo University                              4x Intel(R) Xeon(R) CPU E5-4650      27               31.6
University of Tsukuba                        Xeon E5-2650 v2, GeForce GTX TITAN   25               17.2
National University of Defense Technology    SMP, x86-based (?)                   24               9.7
DISLab, NICEVT / svetcorp.net                Intel Xeon Phi 7120P                 25               4.4
In general, BFS performance on Intel Xeon Phi increases with graph size, in contrast to Intel Xeon Sandy Bridge-EP. Intel Xeon Phi 7120P has 16 GB of memory compared to 8 GB on Intel Xeon Phi 5110P. On the 8 largest graphs fitting in 8 GB, the average performance of Intel Xeon Phi 5110P and Intel Xeon Phi 7120P is 98 % and 121 % of Intel Sandy Bridge-EP performance. On the 8 largest graphs fitting in the 16 GB of Intel Xeon Phi 7120P, the average performance is 137 %. The maximal performance of Intel Xeon Phi 7120P relative to Intel Xeon Sandy Bridge-EP is 178 %, and the minimal is 100.31 %.
To the best of our knowledge, as of February 2014 we are the first to publish BFS performance results on Intel Xeon Phi. The paper [4] presents BFS scalability results on a 32-core Intel Xeon Phi prototype called Knights Ferry. However, that paper contains neither absolute performance results nor a comparison with traditional Intel Xeon processors.
On the Graph500 benchmark we achieved 4366 MTEPS on Intel Xeon Phi
7120P on the graph with scale 25, and have the 89th place on the November
2013 list. There is no other Intel Xeon Phi in this Graph500 list, and this is
the fourth place among the research teams in the class of single node x86-based
systems.
The authors would like to thank the Svet Computers [13] company for providing the IntellectDigital SciPhi 470 desktop supercomputer with an Intel Xeon Phi 7120P coprocessor.
References
1. Graph500 benchmark. URL: http://www.graph500.org
2. Hong, S., Oguntebi, T., Olukotun, K.: Efficient Parallel Graph Exploration on Multi-Core CPU and GPU. In: Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, pp. 78–88 (2011)
3. Agarwal, V., Petrini, F., Pasetto, D., Bader, D.: Scalable Graph Exploration on Multicore Processors. In: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11 (2011)
4. Saule, E., Catalyurek, U.: An Early Evaluation of the Scalability of Graph Algorithms on the Intel MIC Architecture. In: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pp. 1629–1639 (2012)
5. Beamer, S., Asanović, K., Patterson, D.: Direction-optimizing Breadth-First Search. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 10 pages (2012)
6. Saule, E., Kaya, K., Catalyurek, U.: Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi. arXiv:1302.1078, 5 Feb 2013. URL: http://gdcms.sysu.edu.cn/docs/20130329144208946523.pdf
7. Frolov, A., Gilmendinov, M.: DISBench: Benchmark for Memory Performance Evaluation of Multicore Multiprocessors. Accepted to the 12th International Conference PaCT 2013, St. Petersburg, Russia, September 30 – October 4, 2013
8. Cuthill, E., McKee, J.: Reducing the Bandwidth of Sparse Symmetric Matrices. In: Proceedings of the 1969 24th National Conference, pp. 157–172 (1969)
9. Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: A Recursive Model for Graph Mining. SIAM Data Mining (2004)
10. Bader, D., Madduri, K.: Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors. In: The 12th International Conference on High Performance Computing (HiPC 2005), pp. 465–476 (2005)
11. Bader, D., Madduri, K.: SNAP: Small-world Network Analysis and Partitioning: An Open-source Parallel Graph Framework for the Exploration of Large-scale Networks. In: International Parallel and Distributed Processing Symposium (IPDPS) (2008)
12. Bader, D., Madduri, K.: GTgraph: A Synthetic Graph Generator Suite. URL: http://www.cse.psu.edu/~madduri/software/GTgraph/ (2006)
13. Svet Computers company. URL: http://svetcorp.net