Parallel processing of Breadth First Search by Tightly Coupled Accelerators

Takahiro Kaneda
Keio University
3-14-1 Hiyoshi, Yokohama, 223-8522, Japan
[email protected]

Takuji Mitsuishi
Keio University
3-14-1 Hiyoshi, Yokohama, 223-8522, Japan
[email protected]

Yuki Katsuta
Keio University
3-14-1 Hiyoshi, Yokohama, 223-8522, Japan
[email protected]

Takuya Kuhara
Keio University
3-14-1 Hiyoshi, Yokohama, 223-8522, Japan
[email protected]

Toshihiro Hanawa
The University of Tokyo
5-1-5 Kashiwanoha, Kashiwa, 277-8589, Japan
[email protected]

Hideharu Amano
Keio University
3-14-1 Hiyoshi, Yokohama, 223-8522, Japan
[email protected]

Taisuke Boku
University of Tsukuba
1-1-1 Tennodai, Tsukuba, 305-8573, Japan
[email protected]
ABSTRACT
The Tightly Coupled Accelerators (TCA) architecture connects a number of graphics processing units (GPUs) directly through PCI Express using dedicated switches called PEACH2 (PCI Express Adaptive Communication Hub Ver. 2). By making the best use of the low-latency communication supported by PEACH2, the breadth-first search (BFS) algorithm from Graph500, which requires frequent communication between GPUs, was implemented on a multiple-GPU system. Using the BFS with the TCA, 1.58 times better performance was achieved than with a common implementation using MPI.
Keywords
GPU, Cluster, Tightly coupled accelerators architecture, PEACH2

1. INTRODUCTION
In recent years, due to the spread of general purpose computation on graphics processing units (GPUs), heterogeneous clusters with multiple hosts, each equipped with GPUs, have become the mainstream of high performance computing systems. Such systems are expected to be used for the recently emerging big data processing as well as for numerical computation. However, such heterogeneous clusters cause a large latency between GPUs communicating across nodes by indirect communication via the memory of the host CPUs. In non-numerical computation procedures such as graph processing, small data communication is frequently required, and the communication latency tends to bottleneck the performance improvement obtained by using the multiple GPUs provided in heterogeneous clusters.

The Tightly Coupled Accelerators (TCA) architecture [1][2] provides direct data communication between accelerators connected to different nodes. PEACH2 (PCI Express Adaptive Communication Hub Ver. 2) is a realization of the TCA architecture implemented on a field-programmable gate array (FPGA), and it uses PCI Express (PCIe), which commonly connects a host CPU and accelerators, as a network link. The hardwired logic on the FPGA of PEACH2 provides a direct memory access (DMA) controller (DMAC) that can handle PCIe transfers directly. Using PEACH2 enables a double ring network to be formed with the PCIe, which was originally designed only for a tree network with a single host CPU as a root. The memories of the host CPU and the attached GPUs which are connected by PEACH2 are mapped into a single address space of the PCIe, and data can be transferred by write accesses to the addresses. The current PEACH2 is implemented on Altera's Stratix IV FPGA, operates on a 250 MHz clock, and the minimum latency between two GPUs is only 2.3 μs, much smaller than that of MVAPICH2 with Infiniband.

HA-PACS/TCA is a testbed built as a proof-of-concept system for the TCA architecture in the University of Tsukuba's Center for Computational Sciences, and it has started to be used for scientific computation. However, the low latency communication it provides can be used most efficiently for non-numerical computing rather than for large scale scientific computing programs, which require a higher bandwidth to cope with large block data transfers. Recently, big data computing functions such as graph analysis have come to require short, frequent messages, and these are difficult to handle with standard Infiniband-connected heterogeneous clusters. In the work reported in this paper, we implemented the breadth-first search (BFS) algorithm from Graph500 on the HA-PACS/TCA and compared the performance obtained with that obtained when using MPI over Infiniband.

This work was presented in part at the International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2015), Boston, MA, USA, June 1-2, 2015.
Table 1: Specifications of PEACH2
FPGA family: Stratix IV GX
FPGA chip: EP4SGX530NF45C2
Process technology: 40 nm
Port: PCIe Gen2 x8, 4 ports
Max bandwidth of a port: 4 GB/sec
Max frequency: 250 MHz
Internal bus width: 128 bit
On-board DRAM: DDR3 512 MByte
2. COMMUNICATION USING PEACH2
2.1 PEACH2
Although PCIe has been commonly used in recent CPUs, it is designed for I/O networks with a tree structure, which makes it difficult to connect multiple nodes directly. However, the PEACH2 switching hub enables PCIe to be used
as a low-latency network. Figure 1 shows a block diagram
of a PEACH2 chip implemented on an FPGA. There are
four ports on it; port N is an endpoint for PCIe Gen2 x8
and is plugged into the PCIe connector on the host CPU
board. A node is formed by connecting GPUs to the same
back-plane board. Multiple nodes are connected as a ring
network formed with two ports: Port E, an endpoint, and
Port W, a root complex. The remaining Port S is a switchable x16 port that works as an x8 port and is used to connect
two ring networks, each of which can connect 8 nodes. The
routing function embedded in the FPGA determines the destination port merely by checking the destination address of
a PCIe packet to form a single address space for data transfers. The DMAC supports sophisticated block data transfers
in the address space. Table 1 shows the details of PEACH2.
It was implemented with Altera's Stratix IV and operates
on a 250MHz system clock. A 512MByte DDR3 SDRAM is
provided on the board.
2.2 Communication with PEACH2
Figure 2 shows a multi-node system connected with PEACH2.
Since 8 nodes form a ring by using Ports E and W, 16 nodes
can be connected with Port S in total. If more nodes need
to be connected, the next level network using Infiniband is
required. Figure 4 shows an example of the shared address
setting for 16 nodes.
The 512 GByte total address region is split, and a 32 GByte address region is assigned to each node. The routing function provided in PEACH2 has control registers for the address mask as
well as for lower bound and upper bound, and the destination port is statically determined by checking the address
with the address mask. On PEACH2, memory accesses to
remote nodes are restricted to memory write requests. Instead of memory read, which is difficult to implement efficiently, the proxy write mechanism can achieve the same
effect by using driver support.
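As an illustration only (the concrete register layout and window base of PEACH2 are not given here, so the mask width, base address, and node numbering below are our assumptions), the static routing decision described above can be thought of as a simple shift-and-mask on the PCIe address:

// Illustrative sketch only: assumes a 512 GByte shared window split into
// 32 GByte (2^35 byte) regions, one per node, as described in the text.
// The real PEACH2 routing registers (mask / lower bound / upper bound)
// are configured by the driver; this only mimics the address check.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kRegionBits = 35;            // 32 GByte per node
constexpr uint64_t kWindowBase = 0x8000000000;  // hypothetical window base

// Returns the destination node index for a PCIe address inside the window.
static int dest_node(uint64_t pcie_addr) {
    return (int)((pcie_addr - kWindowBase) >> kRegionBits);
}

int main() {
    // An address at some offset inside node 3's region.
    uint64_t addr = kWindowBase + 3ull * (1ull << kRegionBits) + 0x1000;
    printf("address 0x%llx -> node %d\n", (unsigned long long)addr, dest_node(addr));
    return 0;
}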
Figure 1: Block diagram of PEACH2

Figure 2: Multi-node system connected with PEACH2
The PEACH2 provides two types of communication: PIO
and DMA. The former is useful for short message transfers,
while the latter can only perform a store operation to remote nodes. In order to enable PIO communication, the
PCIe region assigned to the PEACH2 is mapped into user space by an mmap interface provided by the device driver.
The PEACH2 DMAC supports enhanced DMA functions,
including chaining using descriptors and block stride data
transfer.
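The exact interface of the PEACH2 driver is not shown in the paper, so the following is only a schematic sketch of how such a PIO window is typically obtained; the device file name, window size, and offset are hypothetical, and only the standard POSIX open()/mmap() calls are assumed.

// Schematic PIO mapping sketch (hypothetical device node and sizes).
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    // Hypothetical device file exported by the PEACH2 driver.
    int fd = open("/dev/peach2", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t win_size = 1 << 20;  // assumed 1 MByte PIO window
    // Map the PCIe region assigned to PEACH2 into user space.
    volatile uint32_t *win = (volatile uint32_t *)
        mmap(NULL, win_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (win == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // A PIO transfer is then just a store into the mapped window;
    // remote accesses are restricted to writes, as described above.
    win[0] = 0xdeadbeef;

    munmap((void *)win, win_size);
    close(fd);
    return 0;
}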
The TCA system specifications are shown in Table 2. Figure 4 shows the physical configuration of the two connected nodes. An FPGA board (that of PEACH2), two Ivy Bridge processors, and four NVIDIA K20X GPUs are installed on each node. While the PEACH2 is connected by PCIe Gen2 x8, the GPUs are connected by a PCIe Gen2 x16 bus. Multiple nodes in the cluster are connected by PEACH2, and each node has a two-step configuration for connecting to a higher level network via the Infiniband standard. Figure 1 shows a block diagram of the entire system.

Table 2: Specifications of HA-PACS/TCA
CPU: Intel Xeon E5-2680 v2 (Ivy Bridge-EP)
Num of Cores: 20 Cores/Node (10 Cores/Socket, 2 Sockets)
Clock: 2.8 GHz
Peak performance: 364 TFLOPS
PCI-Express: Generation 3, 80 Lanes (40 Lanes/CPU)
Memory: 128 GB, DDR3 1866 MHz, 4 channels/Socket, 119.4 GByte/s/Node
GPU: NVIDIA Tesla K20X
Num of GPUs: 4 GPUs/Node
GPU Memory: 24 GByte/Node (6 GByte/GPU)
Node Cluster Connection: Infiniband QDR 2 rail (Mellanox ConnectX-3 dual head)
TCA board: Stratix IV 530 GX

Figure 3: PCIe address spaces of PEACH2
Table 3: Evaluation environment
CPU: Intel(R) Xeon(R) CPU E5-2680
Clock: 2.80 GHz
Memory: 128 GB
GPU: NVIDIA Tesla K20m
GPU Memory: 5 GB
PEACH2: Stratix IV EP4SGX530
OS: CentOS 6.4
Host Compiler: GCC 4.4.7
CUDA: Toolkit 6.0
MPI: MVAPICH2-GDR 2.0
Library: CUB v.1.3.2
Figure 4: PCIe address spaces of PEACH2
2.3 Details of the target system
Table 3 shows the system we used in the implementation. It is an experimental cluster system built as a prototype of HA-PACS/TCA in the University of Tsukuba's Center for Computational Sciences. It provides almost the same configuration as the HA-PACS/TCA available for use at the center, which in 2013 took third place in the Green500, the ranking of the most energy efficient supercomputers in the world. For performance comparison, it provides both PEACH2 and Infiniband for each node. The communication performance of PEACH2 was reported in [3]; the GPU-to-GPU ping-pong latency is only 2.3 μs, as opposed to 6.5 μs for MVAPICH2-GDR 2.0 using the GPUDirect RDMA option with Mellanox Infiniband. The maximum bandwidth between GPUs across nodes is about 2.3 GBytes/sec.
3. BREADTH-FIRST SEARCH AND ITS PARALLEL PROCESSING

3.1 Level synchronized BFS
Breadth-first search (BFS) is an algorithm with which every vertex of a graph is visited in breadth-first order. Each vertex is labeled with its parent or its distance from the source vertex. Here, the label of each vertex is represented by the parent, as in the Graph500 benchmark [4]. The target graph is represented by an adjacency matrix in the compressed sparse row (CSR) sparse matrix format. We used level-synchronized BFS, which is a representative parallel BFS, and processed it as shown in Algorithm 1.
CQ holds the vertices of the current depth level, while NQ holds the vertices of the next depth level. The array visited records whether each vertex has been visited or not. The search results are stored in the array pred as the parents' identifiers; if a vertex has no parent, -1 is held in pred.
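As a concrete illustration (the struct and field names below are ours, not from the benchmark code), the CSR graph and the BFS working arrays described above can be held on the GPU as follows:

// Minimal sketch of the data layout assumed by the pseudocode.
// row_offsets/column_indices form the CSR adjacency matrix; pred holds the
// parent of each vertex (-1 if not yet visited); cq/nq are the current and
// next frontier queues. All pointers are device pointers.
#include <cuda_runtime.h>

struct CsrGraph {
    int       num_vertices;
    long long num_edges;
    int      *row_offsets;     // size num_vertices + 1
    int      *column_indices;  // size num_edges
};

struct BfsState {
    int *pred;     // parent of each vertex, -1 if unvisited
    int *cq;       // CQ: vertices of the current level
    int *nq;       // NQ: vertices of the next level
    int *cq_size;  // number of valid entries in cq
    int *nq_size;  // number of valid entries in nq
};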
3.2 Related work
A lot of research has been reported on the parallel execution of level-synchronized BFS with a single GPU or multiple GPUs. Harish et al. proposed algorithms for a single GPU [5], and we extended the method for multiple GPUs. Mastrostefano [6] extended it to multi-GPU systems and proposed a method for reducing the communication. Mitsuishi et al. [7] improved it for multi-GPU systems with poor communication capacity. Suzumura et al. implemented level-synchronized BFS on the TSUBAME2.0 supercomputer at the Tokyo Institute of Technology [8]. For executing the BFS on a large scale supercomputer, they used the 2D partitioning-based BFS, which places processors, vertices, and adjacency matrices in a two-dimensional array to improve the performance.
Algorithm 1 Level-synchronized BFS
1: for all vertices v in parallel do
2:   pred[v] ← −1
3: end for
4: pred[r] ← 0
5: Enqueue(CQ, r)
6: while CQ ≠ Empty do
7:   NQ ← Empty
8:   for all u in CQ in parallel do
9:     u ← Dequeue(CQ)
10:    for each v adjacent to u in parallel do
11:      if pred[v] = −1 then
12:        pred[v] ← u
13:        Enqueue(NQ, v)
14:      end if
15:    end for
16:  end for
17:  swap(CQ, NQ)
18: end while
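A minimal CUDA sketch of the body of the while loop in Algorithm 1 (lines 8-16) is shown below. It is our own illustration, not the implementation used in the paper, and it resolves the concurrent updates of pred and NQ with atomics.

// One level of the level-synchronized BFS: each thread takes one vertex u
// from CQ, scans its CSR adjacency list, and appends newly discovered
// vertices to NQ. Illustrative sketch only.
__global__ void bfs_level(const int *row_offsets, const int *column_indices,
                          int *pred, const int *cq, int cq_size,
                          int *nq, int *nq_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= cq_size) return;
    int u = cq[i];                               // Dequeue(CQ)
    for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
        int v = column_indices[e];               // each v adjacent to u
        // pred[v] = -1 means "not visited yet"; claim v atomically.
        if (atomicCAS(&pred[v], -1, u) == -1) {
            int pos = atomicAdd(nq_size, 1);     // Enqueue(NQ, v)
            nq[pos] = v;
        }
    }
}

On the host, this kernel is launched once per level; CQ and NQ are then swapped and the NQ counter is reset to zero, exactly as in lines 6-18 of Algorithm 1.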
4. DESIGN AND IMPLEMENTATION
In implementing the level-synchronized BFS, we adopted a standard approach for comparing the TCA architecture and an MPI-connected GPU cluster.
4.1 The BFS in a multi-GPU system
Algorithm 1 for a single GPU can be extended to multi-GPU systems by replacing CQ and NQ with the bit-vector arrays in_queue and out_queue, respectively.
1. Each GPU has an adjacency matrix in the CSR sparse matrix format. First, the root is stored in in_queue, and the corresponding entry of bfs_tree is marked as visited. All other vertices are set to be unvisited by storing 0. The content of out_queue is also initialized to 0.

2. Each GPU checks an assigned vertex u. If u is unvisited, go to the next step. Otherwise, check the next vertex.

3. Check all neighboring vertices of u. If the corresponding location of in_queue is 1, update bfs_tree, and write 1 into visited and out_queue (a CUDA sketch of steps 2 and 3 is shown after this list).

4. Gather the data in out_queue of all GPUs and make an in_queue. If all values in in_queue are 0, the search is finished.

5. Go to step 2.
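The following CUDA kernel is a sketch of steps 2 and 3 above for the bit-vector formulation; it is our illustration, not the implementation used in the paper, and it uses one int per vertex instead of a packed bit-vector to keep the code simple.

// Steps 2-3: each thread handles one locally assigned, still unvisited
// vertex u; if any neighbor of u is in in_queue (the current level), u is
// adopted: bfs_tree[u] records the parent, and u is marked in visited and
// out_queue.
__global__ void bfs_step(const int *row_offsets, const int *column_indices,
                         const int *in_queue, int *out_queue,
                         int *visited, int *bfs_tree,
                         int v_begin, int v_end) {
    int u = v_begin + blockIdx.x * blockDim.x + threadIdx.x;
    if (u >= v_end || visited[u]) return;        // step 2: skip visited vertices
    for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
        int w = column_indices[e];
        if (in_queue[w]) {                       // step 3: neighbor in current level
            bfs_tree[u]  = w;                    // record the parent
            visited[u]   = 1;
            out_queue[u] = 1;                    // u joins the next level
            break;
        }
    }
}

Step 4 then corresponds to the out_queue exchange between GPUs described below.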
This algorithm includes communication between GPUs connected to different nodes for exchanging the data in out_queue. When BFS is executed in a multi-GPU system, GPUs need to communicate with other GPUs connected to different nodes, and this communication across nodes becomes the performance bottleneck. To reduce the communication overhead, the amount of transferred data is compressed by using the replicated-CSR method. In common Infiniband clusters, this data exchange is done by using MPI functions. In the case of TCA, we can use an application programming interface (API) that uses the shared address space supported by PEACH2.
4.2 Communication data size reduction
Each out_queue is a bit-vector in which each location corresponds to the index of a vertex. That is, we need to transfer an out_queue whose size is equal to (all vertices / the number of GPUs) bits. Here, we compress the size of the exchanged data with the method shown in Figure 5.

Figure 5: An example of bit-vector compression
First, a scan_array is made by performing a scan calculation that accumulates the number of "1"s in out_queue. If there is a position where the number in scan_array changes, it means that there is a "1" in out_queue at that position. Thus, we can make a transfer_array that only includes the positions of the "1"s, and it is transferred until a "-1" is found. A GPU can restore the original out_queue from the received data. Note that this compression can be performed for the out_queue entries of all vertices in parallel. This method is especially advantageous when the target graph is sparse. The compression algorithm is shown in Algorithm 2.
Algorithm 2 Compression algorithm
1: scan_array ← perform a scan calculation for out_queue of all vertices in parallel
2: transfer_num ← scan_array[last]
3: if out_queue[index] = 1 then (for all index in parallel)
4:   transfer_array[scan_array[index]] ← index
5: end if
6: Transfer transfer_array
7: Restore in_queue in the flag (bit-vector) form from the received data
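As a sketch only (the actual kernels of the paper are not shown), Algorithm 2 maps naturally onto an exclusive prefix sum followed by a scatter; the CUB library listed in Table 3 provides such a scan. Function and buffer names below are ours.

// Bit-vector compression sketch: an exclusive scan over out_queue gives the
// output position of every set entry, then a scatter kernel builds
// transfer_array. The number of indices to send equals the total count of
// set entries (scan[n-1] + out_queue[n-1] for an exclusive scan).
#include <cub/cub.cuh>

__global__ void scatter_set_bits(const int *out_queue, const int *scan_array,
                                 int *transfer_array, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && out_queue[i] == 1)
        transfer_array[scan_array[i]] = i;   // store the vertex index
}

void compress(const int *d_out_queue, int *d_scan, int *d_transfer, int n) {
    void *d_temp = nullptr; size_t temp_bytes = 0;
    // First call only computes the required temporary storage size.
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_out_queue, d_scan, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_out_queue, d_scan, n);

    int threads = 256, blocks = (n + threads - 1) / threads;
    scatter_set_bits<<<blocks, threads>>>(d_out_queue, d_scan, d_transfer, n);
    cudaFree(d_temp);
}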
Note that this compression is effective only for TCA communication, which supports quick transfers of small data. For MPI communication, MPI_Allgather(), which only supports fixed-length transfers, is advantageous on Infiniband GPU clusters.
4.3 The communication path in a GPU
In the target multi-GPU system, two GPUs form a node and two nodes are connected with TCA or MPI. First, the two GPUs in the same node communicate with each other by using the CUDA API. Then, GPUs connected to different nodes exchange the data by using the TCA API or MPI, as shown in Figure 6. For a GPU, all data can be gathered with two intra-node data transfers and one inter-node data transfer.
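The sketch below illustrates this two-step gather for the MPI path only (the TCA API calls are not listed in the paper, so they are omitted). It assumes one process per node that manages both local GPUs, which is our assumption rather than a stated detail; intra-node exchange uses cudaMemcpyPeer between the two GPUs of a node, and the inter-node exchange uses MPI_Allgather on the packed out_queue buffers.

// Illustrative gather for one BFS level on a 2-node x 2-GPU configuration.
// d_seg[g] is GPU g's out_queue segment (seg_words ints each); d_node_buf
// (on GPU 0) holds both local segments; d_all (on GPU 0) receives the
// combined segments of both nodes.
#include <cuda_runtime.h>
#include <mpi.h>

void exchange_out_queue(int *d_seg[2], int *d_node_buf, int *d_all,
                        int seg_words, MPI_Comm inter_node_comm) {
    // Step 1 (intra-node): collect both local segments on GPU 0.
    cudaSetDevice(0);
    cudaMemcpy(d_node_buf, d_seg[0], seg_words * sizeof(int),
               cudaMemcpyDeviceToDevice);
    cudaMemcpyPeer(d_node_buf + seg_words, 0 /*dst dev*/,
                   d_seg[1], 1 /*src dev*/, seg_words * sizeof(int));

    // Step 2 (inter-node): gather both nodes' combined segments.
    // MVAPICH2-GDR can take device pointers directly (CUDA-aware MPI).
    MPI_Allgather(d_node_buf, 2 * seg_words, MPI_INT,
                  d_all, 2 * seg_words, MPI_INT, inter_node_comm);
}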
Figure 6: Communication path
5. EVALUATION
In this section, we show how we evaluated and compared
the performance of the BFS with TCA and that with MPI.
Graphs from the Graph500 benchmark are used as target
graphs.
5.1 Evaluation environment
We used an experimental cluster with two nodes in the University of Tsukuba's Center for Computational Sciences. The node specifications are shown in Table 3. In the system, the two nodes are connected with TCA by using two PEACH2 boards, one attached to each node. Each host also has Infiniband as an independent network, which enables us to compare the two networks with exactly the same node configuration.
5.2 Graph500
For the measurements we used Graph500, which is a benchmark for measuring performance by evaluating the processing time of a graph search.

A graph is generated with two parameters called scale and edgefactor. The scale determines the number of vertices of the graph: the number of vertices = 2^scale. The edgefactor determines the number of edges: the number of edges = the number of vertices × edgefactor.
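For example, with scale = 16 and edgefactor = 64, a graph has 2^16 = 65,536 vertices and 65,536 × 64 ≈ 4.2 million edges.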
Performance is represented by the number of edges traversed in a second. This measure is called TEPS (Traversed
Edges Per Second); a larger TEPS means better performance.
5.3 Evaluation result
Figures 7, 8, and 9 show the results obtained for the BFS while changing the scale and the edgefactor. All figures show that the BFS performance becomes better as the scale becomes larger. It is obvious that a lot of threads can work in parallel for a large scale. In addition, the BFS using TCA is more advantageous than that using MPI at large scales. When scale = 16 and edgefactor = 64, the performance of the BFS using TCA is 1.58 times faster than that with MPI. Table 4 shows the reduction ratio values obtained; we achieved a reduction of roughly 40-60%. Figure 10 shows the execution time when the data are exchanged between nodes. The execution time and the target communication time are shown for "tca" and "mpi"; the former includes cudaMemcpyDtoH, cudaMemcpyHtoD, cudaMemset, and the kernels for data reduction in TCA, while the latter includes cudaMemcpyDtoH, cudaMemcpyHtoD, and cudaMemset in MPI.
Figure 10 makes it clear that the run time for the TCA version is shorter in many cases. When the scale is small, there is large variation in the data, and the results may therefore change for each run. However, even these small advantages will have a positive effect on the execution performance.

In the BFS, searching accounts for 90% of the execution time, and communication for less than 10%. It is therefore unreasonable to assume that the data reduction and the low latency communication achieved with TCA are the only reasons for the improved performance that was obtained. Another possible reason is that the data exchange interleaves a process such as the reduction with a large number of consecutive data write operations. However, it is difficult to analyze this in detail. In the work we report here, we measured the communication time with MPI_Wtime() and the other components by using nvprof. However, the measurement data obtained are not sufficiently accurate for conducting a detailed analysis.

When 9 <= scale <= 10, the BFS with MPI outperformed that with TCA. This might come from the fact that in this case the execution time needed for the reduction cancels the positive effects of the small latency.
Table 4: Average ratio after data reduction (%)

Figure 7: edgefactor = 16
Figure 8: edgefactor = 32
Figure 9: edgefactor = 64
Figure 10: Execution time

6. CONCLUSION
We implemented the breadth-first search (BFS) algorithm on a multiple graphics processing unit (GPU) cluster using the Tightly Coupled Accelerators (TCA) architecture and optimized the communication for the TCA by making use of its lower latency compared with MPI over Infiniband. For a graph from Graph500, the BFS with the TCA achieved 1.58 times the performance of that with MPI.
In future work, we will need to evaluate the performance
obtained by carrying out the following optimization procedures:
• Optimizing for larger scale,
• optimizing for a larger number of nodes,
• using additional techniques to reduce data size,
• using the new application programming interface (API)
for PEACH2, and
• developing more accurate profiling tools.
These procedures will be necessary because currently the
scale and the number of corresponding nodes are too small
for the system that the TCA targets. In addition, there are still a lot of data reduction methods that can be applied, and the newly developed API for PEACH2 is more suitable for a higher performance implementation. Finally, more accurate
profiling tools will need to be developed.
7. ACKNOWLEDGEMENT
The present study is supported in part by the JST/CREST
program entitled “Research and Development on Unified
Environment of Accelerated Computing and Interconnection
for Post-Petascale Era” in the research area of “Development of System Software Technologies for post-Peta Scale
High Performance Computing”.
8. REFERENCES
[1] S. Otani, H. Kondo, T. Hanawa, S. Miura, and T. Boku. PEACH: A multicore communication system on chip with PCI Express. IEEE Micro, pp. 39–50, 2011.
[2] T. Hanawa, Y. Kodama, T. Boku, and M. Sato. Tightly Coupled Accelerators architecture for minimizing communication latency among accelerators. In IEEE 27th IPDPSW, pp. 1030–1039, 2013.
[3] Y. Kodama, T. Hanawa, T. Boku, and M. Sato. PEACH2: An FPGA-based PCIe network device for Tightly Coupled Accelerators. In Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014), pp. 5–10, June 2014.
[4] Graph 500. http://www.graph500.org/.
[5] P. Harish and P. J. Narayanan. Accelerating Large Graph Algorithms on the GPU Using CUDA. In HiPC 2007, pp. 197–208, 2007.
[6] E. Mastrostefano. Large Graphs on Multi-GPUs. PhD thesis, Sapienza University of Rome, 2013.
[7] T. Mitsuishi, S. Nomura, J. Suzuki, Y. Hayashi, M. Kan, and H. Amano. Accelerating Breadth First Search on GPU-BOX. In International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART'14), July 2014.
[8] T. Suzumura, K. Ueno, H. Sato, K. Fujisawa, and S. Matsuoka. Performance Evaluation of Graph500 on Large-Scale Distributed Environment. In IISWC, pp. 149–158, Nov. 2011.