Parallel Processing of Breadth First Search by Tightly Coupled Accelerators

Takahiro Kaneda (Keio University), Takuji Mitsuishi (Keio University), Yuki Katsuta (Keio University), Takuya Kuhara (Keio University), Toshihiro Hanawa (The University of Tokyo), Hideharu Amano (Keio University), Taisuke Boku (University of Tsukuba)

Keio University, 3-14-1 Hiyoshi, Yokohama, 223-8522, Japan
The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, 277-8589, Japan
University of Tsukuba, 1-1-1 Tennodai, Tsukuba, 305-8573, Japan

ABSTRACT
The Tightly Coupled Accelerators (TCA) architecture connects a number of graphics processing units (GPUs) directly through PCI Express using dedicated switches called PEACH2 (PCI Express Adaptive Communication Hub Ver. 2). By making the best use of the low-latency communication supported by PEACH2, the breadth-first search (BFS) algorithm from Graph500, which requires frequent communication between GPUs, was implemented on a multi-GPU system. Using the BFS with the TCA, 1.58 times better performance was achieved than with a common implementation using MPI.

Keywords
GPU, Cluster, Tightly coupled accelerators architecture, PEACH2

1. INTRODUCTION
In recent years, due to the spread of general-purpose computation on graphics processing units (GPUs), heterogeneous clusters with multiple hosts, each equipped with GPUs, have become the mainstream of high performance computing systems. Such systems are expected to be used for the recently emerging big data processing as well as for numerical computation. However, such heterogeneous clusters incur a large latency between GPUs communicating across nodes, because the communication goes indirectly through the memory of the host CPUs. In non-numerical computation such as graph processing, small data transfers are frequently required, and the communication latency tends to bottleneck the performance improvement obtained by using the multiple GPUs provided in heterogeneous clusters.

The Tightly Coupled Accelerators (TCA) architecture [1][2] provides direct data communication between accelerators connected to different nodes. PEACH2 (PCI Express Adaptive Communication Hub Ver. 2) is a realization of the TCA architecture implemented on a field-programmable gate array (FPGA), and it uses PCI Express (PCIe), which commonly connects a host CPU and its accelerators, as a network link. The hardwired logic on the FPGA of PEACH2 provides a direct memory access (DMA) controller (DMAC) that can handle PCIe transfers directly. PEACH2 enables a double ring network to be formed with PCIe, which was originally designed only for a tree network with a single host CPU as its root. The memories of the host CPUs and attached GPUs connected by PEACH2 are mapped into a single PCIe address space, and data can be transferred by write accesses to that space. The current PEACH2 is implemented on Altera's Stratix IV FPGA, operates on a 250 MHz clock, and the minimum latency between two GPUs is only 2.3 µs, much smaller than that of MVAPICH2 over Infiniband.

HA-PACS/TCA is a testbed built as a proof-of-concept system for the TCA architecture at the University of Tsukuba's Center for Computational Sciences, and it has started to be used for scientific computation. However, the low-latency communication it provides can be exploited most efficiently by non-numerical computing rather than by large-scale scientific computing programs, which require a higher bandwidth to cope with large block data transfers.
Recently, big data computing workloads such as graph analysis have come to require short, frequent messages, and these are difficult to handle efficiently on standard Infiniband-connected heterogeneous clusters. In the work reported in this paper, we implemented the breadth-first search (BFS) algorithm from Graph500 on the HA-PACS/TCA and compared the performance obtained with that obtained when using MPI over Infiniband.

This work was presented in part at the International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART2015), Boston, MA, USA, June 1-2, 2015.

2. COMMUNICATION USING PEACH2

2.1 PEACH2
Although PCIe is commonly used by recent CPUs, it is designed for I/O networks with a tree structure, which makes it difficult to connect multiple nodes directly. However, the PEACH2 switching hub enables PCIe to be used as a low-latency network. Figure 1 shows a block diagram of a PEACH2 chip implemented on an FPGA. There are four ports on it; Port N is an endpoint for PCIe Gen2 x8 and is plugged into the PCIe connector on the host CPU board. A node is formed by connecting GPUs to the same back-plane board. Multiple nodes are connected as a ring network formed with two ports: Port E, an endpoint, and Port W, a root complex. The remaining Port S is a switchable x16 port that works as an x8 port and is used to connect two ring networks, each of which can connect 8 nodes. The routing function embedded in the FPGA determines the destination port merely by checking the destination address of a PCIe packet, so that a single address space is formed for data transfers. The DMAC supports sophisticated block data transfers in this address space.

Table 1 shows the details of PEACH2. It is implemented with Altera's Stratix IV and operates on a 250 MHz system clock. A 512 MByte DDR3 SDRAM is provided on the board.

Table 1: Specifications of PEACH2
  FPGA family: Stratix IV GX
  FPGA chip: EP4SGX530NF45C2
  Process technology: 40 nm
  Ports: PCIe Gen2 x8, 4 ports
  Max bandwidth of a port: 4 GB/sec
  Max frequency: 250 MHz
  Internal bus width: 128 bit
  On-board DRAM: DDR3 512 MByte

2.2 Communication with PEACH2
Figure 2 shows a multi-node system connected with PEACH2. Since 8 nodes form a ring by using Ports E and W, 16 nodes can be connected with Port S in total. If more nodes need to be connected, a next-level network using Infiniband is required. Figure 4 shows an example of the shared address setting for 16 nodes. The 512 GByte total address region is split, and a 32 GByte address region is assigned to each node. The routing function provided in PEACH2 has control registers for an address mask as well as for lower and upper bounds, and the destination port is statically determined by checking the address against the address mask. On PEACH2, memory accesses to remote nodes are restricted to memory write requests. Instead of memory read, which is difficult to implement efficiently, a proxy write mechanism achieves the same effect with driver support.

Figure 1: Block diagram of PEACH2
Figure 2: Multi-node system connected with PEACH2
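To make the routing rule above concrete, the following host-side sketch shows how a destination port could be chosen by comparing a packet address against per-port mask and bound values. This is only an illustration of the decision logic described in the text: in PEACH2 the mask and bounds live in hardware control registers, and the structure, port numbering, and window sizes below are assumptions, not the actual register layout.

#include <cstdint>
#include <cstdio>

// Hypothetical per-port routing rule: an address mask plus lower/upper
// bounds select the destination port, as described in Section 2.2.
struct PortRule {
    uint64_t mask;   // mask applied to the destination address
    uint64_t lower;  // lower bound of the address window
    uint64_t upper;  // upper bound of the address window
    int      port;   // destination port id (illustrative)
};

// Static routing decision: the first rule whose window contains the
// masked destination address determines the output port.
int route(uint64_t dest_addr, const PortRule* rules, int n_rules) {
    for (int i = 0; i < n_rules; ++i) {
        uint64_t a = dest_addr & rules[i].mask;
        if (a >= rules[i].lower && a <= rules[i].upper)
            return rules[i].port;
    }
    return -1;  // no match: fall back to the default (host) port
}

int main() {
    // Example: a 512 GByte space split into 32 GByte windows per node.
    const uint64_t kWindow = 32ull << 30;
    PortRule rules[] = {
        { ~0ull, 0 * kWindow,  1 * kWindow - 1, 0 },  // local node
        { ~0ull, 1 * kWindow,  8 * kWindow - 1, 1 },  // ring (Port E/W)
        { ~0ull, 8 * kWindow, 16 * kWindow - 1, 3 },  // second ring (Port S)
    };
    printf("port = %d\n", route(3 * kWindow + 0x1000, rules, 3));
    return 0;
}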
PEACH2 provides two types of communication: PIO and DMA. The former is useful for short message transfers, while the latter can only perform store operations to remote nodes. In order to enable PIO communication, the PCIe region assigned to PEACH2 is mapped into user space through the device driver with an mmap interface. The PEACH2 DMAC supports enhanced DMA functions, including chaining by descriptors and block-stride data transfers.

The TCA system specifications are shown in Table 2, and Figure 4 shows how the two nodes are physically connected. An FPGA board (that of PEACH2), two Ivy Bridge processors, and four NVIDIA K20X GPUs are installed in each node. While PEACH2 is attached via PCIe Gen2 x8, the GPUs are connected by a PCIe Gen2 x16 bus. Multiple nodes in the cluster are connected by PEACH2, and each node has a two-step configuration for connecting to a higher-level network via the Infiniband standard. Figure 2 shows a block diagram of the entire system.

Table 2: Specifications of HA-PACS/TCA
  CPU: Intel Xeon E5-2680 v2 (Ivy Bridge-EP)
  Num of cores: 20 cores/node (10 cores/socket x 2 sockets)
  Clock: 2.8 GHz
  Peak spec: 364 TFLOPS
  PCI-Express: Generation 3, 80 lanes (40 lanes/CPU)
  Memory: 128 GB, DDR3 1866 MHz, 4 channels/socket, 119.4 GByte/s/node
  GPU: NVIDIA Tesla K20X
  Num of GPUs: 4 GPUs/node
  GPU memory: 24 GByte/node (6 GByte/GPU)
  Cluster connection: Infiniband QDR 2 rail (Mellanox ConnectX-3 dual head)
  TCA board: Stratix IV 530 GX

Figure 3: PCIe address spaces of PEACH2
Figure 4: PCIe address spaces of PEACH2

2.3 Details of the target system
Table 3 shows the system we used in the implementation. It is an experimental cluster system built as a prototype of HA-PACS/TCA in the University of Tsukuba's Center for Computational Sciences. It provides almost the same configuration as the HA-PACS/TCA available at the center, which in 2013 took third place in the Green500, the ranking of the most energy-efficient supercomputers in the world. For performance comparison, it provides both PEACH2 and Infiniband on each node. The communication performance of PEACH2 was reported in [3]; the GPU-to-GPU ping-pong latency is only 2.3 µs, as opposed to 6.5 µs for MVAPICH2-GDR 2.0 using the GPUDirect RDMA option with Mellanox Infiniband. The maximum bandwidth between GPUs across nodes is about 2.3 GBytes/sec.

Table 3: Evaluation environment
  CPU: Intel Xeon E5-2680
  Clock: 2.80 GHz
  Memory: 128 GB
  GPU: NVIDIA Tesla K20m
  GPU memory: 5 GB
  PEACH2: Stratix IV EP4SGX530
  OS: CentOS 6.4
  Host compiler: GCC 4.4.7
  CUDA Toolkit: 6.0
  MPI: MVAPICH2-GDR 2.0
  Library: CUB v1.3.2

3. BREADTH-FIRST SEARCH AND ITS PARALLEL PROCESSING

3.1 Level-synchronized BFS
Breadth-first search (BFS) is an algorithm in which every vertex of a graph is visited in breadth-first order. Each vertex is labeled with its parent or with its distance from the source vertex. Here, the label of each vertex is represented by the parent, as in the Graph500 benchmark [4]. The target graph is represented by an adjacency matrix in the compressed sparse row (CSR) sparse matrix format. We used level-synchronized BFS, a representative parallel BFS, and processed it as shown in the pseudocode of Algorithm 1. A queue CQ holds the vertices of the current depth level, while NQ holds the vertices of the next depth level. The array visited records whether a vertex has been visited or not. The search results are stored in the array pred as the parents' identifiers; if a vertex has no parent, -1 is held in pred.
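To make the data layout concrete, the sketch below shows a minimal CSR adjacency structure together with the pred initialization used by the level-synchronized BFS. It is an illustration only: names such as CsrGraph and for_each_neighbor are ours, not those of the implementation.

#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal CSR (compressed sparse row) adjacency structure.
struct CsrGraph {
    std::vector<int64_t> row_offsets;  // size = num_vertices + 1
    std::vector<int64_t> col_indices;  // size = num_edges (neighbor lists)
};

// The neighbors of vertex u occupy the half-open range
// [row_offsets[u], row_offsets[u+1]) of col_indices.
template <typename F>
void for_each_neighbor(const CsrGraph& g, int64_t u, F&& visit) {
    for (int64_t e = g.row_offsets[u]; e < g.row_offsets[u + 1]; ++e)
        visit(g.col_indices[e]);
}

int main() {
    // Undirected triangle 0-1-2 plus an isolated vertex 3.
    CsrGraph g;
    g.row_offsets = {0, 2, 4, 6, 6};
    g.col_indices = {1, 2, 0, 2, 0, 1};

    // pred is initialized to -1 (no parent) as in Algorithm 1; the root r
    // is labeled before the search starts.
    std::vector<int64_t> pred(4, -1);
    int64_t r = 0;
    pred[r] = 0;

    for_each_neighbor(g, r, [&](int64_t v) {
        std::printf("neighbor of %lld: %lld\n", (long long)r, (long long)v);
    });
    return 0;
}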
3.2 Related work
A lot of research has been reported on the parallel execution of level-synchronized BFS on a single GPU or on multiple GPUs. Harish et al. proposed algorithms for a single GPU [5], and we extended their method to multiple GPUs. Mastrostefano [6] extended it to multi-GPU systems and proposed a method for reducing the communication. Mitsuishi et al. [7] improved it for multi-GPU systems with poor communication capacity. Suzumura et al. implemented level-synchronized BFS on the TSUBAME2.0 supercomputer at the Tokyo Institute of Technology [8]. For executing the BFS on a large-scale supercomputer, they used 2D partitioning-based BFS, which places processors, vertices and adjacency matrices in a two-dimensional array to improve the performance.

Algorithm 1 Level-synchronized BFS
 1: for all vertices v in parallel do
 2:   pred[v] <- -1
 3: end for
 4: pred[r] <- 0
 5: Enqueue(CQ, r)
 6: while CQ != empty do
 7:   NQ <- empty
 8:   for all u in CQ in parallel do
 9:     for each v adjacent to u in parallel do
10:       if pred[v] = -1 then
11:         pred[v] <- u
12:         Enqueue(NQ, v)
13:       end if
14:     end for
15:   end for
16:   swap(CQ, NQ)
17: end while

4. DESIGN AND IMPLEMENTATION
In implementing the level-synchronized BFS, we adopted a standard approach so that the TCA architecture and an MPI-connected GPU cluster could be compared.

4.1 The BFS in a multi-GPU system
Algorithm 1 for a single GPU can be extended to multi-GPU systems by replacing CQ and NQ with the bit-vector arrays in_queue and out_queue, respectively. The search proceeds as follows; a code sketch of steps 2 and 3 is given at the end of this subsection.

1. Each GPU has an adjacency matrix in the CSR sparse matrix format. First, the root is stored in in_queue, and the corresponding entry of bfs_tree is marked as visited. All other vertices are set to unvisited by storing 0. The content of out_queue is also initialized to 0.

2. Each GPU checks an assigned vertex u. If u is unvisited, it goes to the next step. Otherwise, it checks the next vertex.

3. The GPU checks all neighbors of u. If the corresponding location of in_queue is 1, it updates bfs_tree and writes 1 into visited and out_queue.

4. The out_queue data of all GPUs are gathered to form an in_queue. If all values in in_queue are 0, the search is finished.

5. Go back to step 2.

This algorithm includes communication between GPUs connected to different nodes for exchanging the data in out_queue. When BFS is executed in a multi-GPU system, GPUs need to communicate with GPUs connected to other nodes, and this inter-node communication becomes the performance bottleneck. To reduce the communication overhead, the amount of transferred data is compressed by using the replicated-CSR method. In common Infiniband clusters, this data exchange is done with MPI functions. In the case of TCA, we can use an application programming interface (API) that uses the shared data space supported by PEACH2.
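As a concrete illustration of steps 2 and 3, the following CUDA kernel sketches one BFS level over the bit-vector frontiers. It is a minimal sketch, not our actual implementation: it assumes one thread per assigned vertex, packs the bit-vectors into 32-bit words, and omits the local/global index translation of the multi-GPU partition.

#include <cuda_runtime.h>

// One BFS level (steps 2 and 3): each thread takes one assigned vertex u.
// If u is still unvisited and one of its neighbors is in the current
// frontier (in_queue), u records that neighbor as its parent and joins the
// next frontier (out_queue). Because neighboring vertices share a 32-bit
// word of the bit-vectors, atomicOr is used for the updates.
__global__ void bfs_level(const long long* row_offsets,   // CSR, size n+1
                          const long long* col_indices,   // CSR neighbor lists
                          const unsigned int* in_queue,   // current frontier
                          unsigned int* out_queue,        // next frontier
                          unsigned int* visited,
                          long long* bfs_tree,            // parent of each vertex
                          long long n)                    // assigned vertices
{
    long long u = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (u >= n) return;

    // Step 2: skip vertices that have already been visited.
    if (visited[u >> 5] & (1u << (u & 31))) return;

    // Step 3: scan the neighbors of u.
    for (long long e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
        long long v = col_indices[e];
        if (in_queue[v >> 5] & (1u << (v & 31))) {
            bfs_tree[u] = v;                               // record the parent
            atomicOr(&visited[u >> 5],   1u << (u & 31));  // mark visited
            atomicOr(&out_queue[u >> 5], 1u << (u & 31));  // enter next frontier
            break;                                         // one parent suffices
        }
    }
}

// Launch example: bfs_level<<<(n + 255) / 256, 256>>>(...);

After this kernel, the out_queue bit-vectors are exchanged between GPUs as described in the next subsection.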
4.2 Communication data size reduction
Each out_queue is a bit-vector whose positions correspond to vertex indices. That is, we would need to transfer an out_queue whose size is equal to (all vertices / the number of GPUs). Here, we compress the exchanged data with the method shown in Figure 5.

Figure 5: An example of bit-vector compression

First, a scan_array is made by performing a scan calculation that accumulates the number of 1s in out_queue. If there is a position where the value in scan_array changes, it means that there is a 1 in out_queue at that position. Thus, we can build a transfer_array that includes only the positions of the 1s, and it is transferred until a -1 terminator is found. The receiving GPU can restore the original out_queue from the received data. Note that this compression can be performed for the out_queue entries of all vertices in parallel. This method is especially advantageous when the target graph is sparse. The compression procedure is shown in Algorithm 2.

Algorithm 2 Compression algorithm
1: scan_array <- scan calculation over out_queue for all vertices in parallel
2: transfer_num <- scan_array[last]
3: if out_queue[index] = 1 then, in parallel
4:   transfer_array[scan_array[index]] <- index
5: end if
6: Transfer transfer_array
7: Restore the received data into in_queue in the flag form

Note that this compression is only effective for TCA communication, which supports quick transfers of small data. For MPI communication, MPI_Allgather(), which only supports fixed-length transfers, is advantageous on Infiniband GPU clusters.
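As a sketch of how Algorithm 2 maps onto the GPU, the code below uses CUB (listed in Table 3) for the scan and a simple scatter kernel to build transfer_array. It is an illustration under simplifying assumptions, not our exact implementation: out_queue is treated as one integer flag per vertex rather than a packed bit-vector, an exclusive scan is used (so transfer_num adds the last flag), and the -1 terminator handling is omitted.

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Scatter step of Algorithm 2: every set position of out_queue is written
// to transfer_array at the offset given by the exclusive scan.
__global__ void scatter(const int* out_queue, const int* scan_array,
                        int* transfer_array, int num_vertices)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < num_vertices && out_queue[index] == 1)
        transfer_array[scan_array[index]] = index;
}

// Compress out_queue into transfer_array and return transfer_num.
// All pointers are device pointers.
int compress(const int* d_out_queue, int* d_scan_array,
             int* d_transfer_array, int num_vertices)
{
    // Exclusive prefix sum over the 0/1 flags with CUB.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes,
                                  d_out_queue, d_scan_array, num_vertices);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes,
                                  d_out_queue, d_scan_array, num_vertices);
    cudaFree(d_temp);

    // transfer_num = scan_array[last] + out_queue[last] for an exclusive scan.
    int last_scan = 0, last_flag = 0;
    cudaMemcpy(&last_scan, d_scan_array + num_vertices - 1, sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaMemcpy(&last_flag, d_out_queue + num_vertices - 1, sizeof(int),
               cudaMemcpyDeviceToHost);
    int transfer_num = last_scan + last_flag;

    // Fill transfer_array with the vertex indices whose flag is 1.
    scatter<<<(num_vertices + 255) / 256, 256>>>(d_out_queue, d_scan_array,
                                                 d_transfer_array, num_vertices);
    return transfer_num;
}

The resulting transfer_array entries, transfer_num in total, are what is written to the remote node through the PEACH2 shared address space or sent with MPI.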
4.3 The communication path in a GPU
In the target multi-GPU system, two GPUs form a node and two nodes are connected with TCA or MPI. First, the two GPUs in the same node communicate with each other by using the CUDA API. Then, GPUs connected to different nodes exchange data by using the TCA API or MPI, as shown in Figure 6. For a GPU, all data can be gathered with two intra-node data transfers and one inter-node data transfer.

Figure 6: Communication path

5. EVALUATION
In this section, we show how we evaluated and compared the performance of the BFS with TCA and with MPI. Graphs from the Graph500 benchmark are used as target graphs.

5.1 Evaluation environment
We used an experimental cluster with two nodes in the University of Tsukuba's Center for Computational Sciences. The node specifications are shown in Table 3. In the system, the two nodes are connected with TCA by using the PEACH2 boards attached to each node. Each host also has Infiniband as an independent network, which enables us to compare two networks with exactly the same node configuration.

5.2 Graph500
For the measurements we used the Graph500 benchmark, which measures performance by evaluating the processing time of a graph search. A graph is generated with two parameters called scale and edgefactor. The scale determines the number of vertices of the graph: number of vertices = 2^scale. The edgefactor determines the number of edges: number of edges = number of vertices * edgefactor. For example, scale = 16 and edgefactor = 64 give 2^16 = 65,536 vertices and about 4.2 million edges. Performance is represented by the number of edges traversed in a second; this measure is called TEPS (Traversed Edges Per Second), and a larger TEPS means better performance.

5.3 Evaluation result
Figures 7, 8, and 9 show the results obtained for the BFS while changing the scale and the edgefactor. All figures show that the BFS performance becomes better as the scale becomes larger; obviously, more threads can work in parallel at a larger scale. In addition, the BFS using TCA is more advantageous than that using MPI at large scale. When scale = 16 and edgefactor = 64, the BFS using TCA is 1.58 times faster than that with MPI. Table 4 shows the reduction ratios obtained with the compression; we achieved a reduction of roughly 40-60%. Figure 10 shows the execution time when the data are exchanged between nodes. The execution time and the targeted communication time are shown for "tca" and "mpi"; the former includes cudaMemcpyDtoH, cudaMemcpyHtoD, cudaMemset and the kernels for data reduction in TCA, and the latter includes cudaMemcpyDtoH, cudaMemcpyHtoD and cudaMemset for MPI. Figure 10 makes it clear that the run time of the TCA version is shorter in many cases.

Figure 7: edgefactor = 16
Figure 8: edgefactor = 32
Figure 9: edgefactor = 64
Figure 10: Execution time
Table 4: Average ratio after data reduction (%)

When the scale is small, there is a large variation in the data, and the results may therefore change from run to run. However, even these small advantages have a positive effect on the execution performance. In the BFS, searching accounts for about 90% of the execution time and communication for less than 10%. It is unreasonable to assume that the data reduction and the low-latency communication achieved with TCA are the only reasons for the improved performance. Another possible reason is the way the data exchange interleaves a process such as the reduction with a large number of consecutive data write operations; however, this is difficult to analyze in detail. In this work, we measured the communication time with MPI_Wtime() and the other times with nvprof, but the measurement data obtained are not sufficiently accurate for a detailed analysis. When 9 <= scale <= 10, the BFS with MPI outperformed that with TCA. This might come from the fact that, in this range, the execution time needed for the reduction cancels out the positive effect of the small latency.

6. CONCLUSION
We implemented the breadth-first search (BFS) algorithm on a multiple-GPU cluster using the Tightly Coupled Accelerators (TCA) architecture and optimized the communication for the TCA by making use of its lower latency compared with MPI over Infiniband. For a graph from Graph500, the BFS with the TCA achieved 1.58 times the performance of that with MPI. In future work, we will need to evaluate the performance obtained by carrying out the following optimization procedures:

• optimizing for larger scale,
• optimizing for a larger number of nodes,
• using additional techniques to reduce the data size,
• using the new application programming interface (API) for PEACH2, and
• developing more accurate profiling tools.

These procedures will be necessary because the scale and the number of nodes used here are still too small for the system that the TCA targets. In addition, there are still many data reduction methods that can be applied, and the newly developed API for PEACH2 is more suitable for a higher-performance implementation. Finally, more accurate profiling tools will need to be developed.

7. ACKNOWLEDGEMENT
The present study is supported in part by the JST/CREST program entitled "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era" in the research area of "Development of System Software Technologies for post-Peta Scale High Performance Computing".
8. REFERENCES
[1] S. Otani, H. Kondo, T. Hanawa, S. Miura, and T. Boku. PEACH: A multicore communication system on chip with PCI Express. IEEE Micro, pp. 39-50, 2011.
[2] T. Hanawa, Y. Kodama, T. Boku, and M. Sato. Tightly Coupled Accelerators architecture for minimizing communication latency among accelerators. In IEEE 27th IPDPSW, pp. 1030-1039, 2013.
[3] Yuetsu Kodama, Toshihiro Hanawa, Taisuke Boku, and Mitsuhisa Sato. PEACH2: An FPGA-based PCIe network device for Tightly Coupled Accelerators. In Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2014), pp. 5-10, June 2014.
[4] Graph 500. http://www.graph500.org/.
[5] Pawan Harish and P. J. Narayanan. Accelerating Large Graph Algorithms on the GPU Using CUDA. In HiPC 2007, pp. 197-208, 2007.
[6] Enrico Mastrostefano. Large Graphs on multi-GPUs. PhD thesis, Sapienza University of Rome, 2013.
[7] Takuji Mitsuishi, Shimpei Nomura, Jun Suzuki, Yuki Hayashi, Masaki Kan, and Hideharu Amano. Accelerating breadth first search on GPU-BOX. In International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, HEART'14, July 2014.
[8] Toyotaro Suzumura, Koji Ueno, Hitoshi Sato, Katsuki Fujisawa, and Satoshi Matsuoka. Performance Evaluation of Graph500 on Large-Scale Distributed Environment. In IISWC, pp. 149-158, Nov 2011.