Interconnection Network for Tightly Coupled Accelerators Architecture
Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato
Center for Computational Sciences, University of Tsukuba, Japan
Hot Interconnects 21, Aug 22, 2013

What is "Tightly Coupled Accelerators (TCA)"?
- Concept: direct connection between accelerators (GPUs) across nodes
- PCIe is used as the communication link between accelerators on different nodes
  - Eliminates extra memory copies to the host
  - Reduces latency and improves strong scaling for scientific applications with small data sizes
  - PCIe simply transfers packets, so direct device-to-device (P2P) communication is available
- PEACH2: PCI Express Adaptive Communication Hub ver. 2
  - To configure TCA, each node is connected to other nodes through a PEACH2 chip

Design policy of PEACH2
- Implemented on an FPGA with four PCIe Gen2 IP blocks
  - Altera Stratix IV GX
  - Prototyping and flexible enhancement
- Sufficient communication bandwidth
  - PCI Express Gen2 x8 on each port
- Sophisticated DMA controller
  - Chaining DMA
- Latency reduction
  - Hardwired logic
  - Low-overhead routing mechanism
    - Efficient address mapping in the PCIe address area using unused bits
    - Simple comparator to decide the output port
- Not only a proof-of-concept implementation: it will also be used for production runs in a GPU cluster

TCA node structure example
- PEACH2 can access all GPUs
  - NVIDIA Kepler architecture + CUDA 5.0 "GPUDirect Support for RDMA"
  - Performance over QPI is quite bad, so only GPU0 and GPU1 are supported
- Connection among 3 nodes using PEACH2
- [Figure: node block diagram showing two Xeon E5 CPUs linked by QPI; PEACH2 attached via PCIe Gen2 x8; GPU0 through GPU3 (NVIDIA K20/K20X, Kepler architecture) via PCIe Gen2 x16; an InfiniBand HCA via PCIe Gen3 x8; PEACH2, GPU0, and GPU1 share a single PCIe address space under one CPU]

Overview of PEACH2 chip
- Fully compatible with the PCIe Gen2 spec
  - Root and Endpoint must be paired on each link according to the PCIe spec; the ports are selectable between Root and Endpoint
- Port N: connected to the host and GPUs (Endpoint)
- Ports E and W: form the ring topology
- Port S: connected to the other ring
- Write-only except Port N
  - Instead, a proxy write on the remote node realizes a pseudo-read
- [Figure: chip block diagram with DMAC, NIOS (CPU), routing function, and memory; Port N faces the CPU and GPU side, Ports E and W connect to neighboring PEACH2 chips (Root Complex/Endpoint), Port S connects to a PEACH2 on the other ring]

Communication by PEACH2
- PIO
  - The CPU can store data to the remote node directly using mmap
- DMA
  - Chaining mode
    - DMA requests are prepared as DMA descriptors chained in host memory; each descriptor holds source, destination, next pointer, length, and flags (descriptor 0 through descriptor (n-1) are linked)
    - Multiple DMA transactions are executed automatically by hardware according to the descriptors (a descriptor-chain sketch follows this list)
  - Register mode
    - Up to 16 requests
    - No overhead to transfer descriptors from the host
  - Block-stride transfer function
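The chaining-mode description above names only the descriptor fields (source, destination, next, length, flags). As a minimal sketch of what building such a chain on the host could look like, the C fragment below links several requests so that one DMA kick drives them all. The struct layout, field widths, flag bit, and the build_chain helper are illustrative assumptions, not the actual PEACH2 descriptor format or driver API.

```c
#include <stddef.h>
#include <stdint.h>

/* One chaining-mode DMA request.  The field set follows the slide;
 * the widths and packing are hypothetical, not the real PEACH2 layout. */
struct peach2_desc {
    uint64_t src;    /* PCIe address of the source buffer            */
    uint64_t dst;    /* PCIe address of the destination buffer       */
    uint64_t next;   /* bus address of the next descriptor, 0 = end  */
    uint32_t len;    /* transfer length in bytes                     */
    uint32_t flags;  /* control bits, e.g. end-of-chain marker       */
};

#define PEACH2_DESC_LAST 0x1u   /* hypothetical end-of-chain flag */

/* Link n requests into a single chain in host memory.  The hardware
 * then walks the list and issues every transfer from one DMA kick,
 * with no per-transfer host intervention. */
static void build_chain(struct peach2_desc *d, size_t n,
                        const uint64_t *src, const uint64_t *dst,
                        const uint32_t *len, uint64_t desc_bus_addr)
{
    for (size_t i = 0; i < n; i++) {
        d[i].src   = src[i];
        d[i].dst   = dst[i];
        d[i].len   = len[i];
        d[i].flags = 0;
        /* "next" holds the bus address of the following descriptor */
        d[i].next  = desc_bus_addr + (i + 1) * sizeof(*d);
    }
    d[n - 1].next  = 0;                 /* terminate the chain */
    d[n - 1].flags = PEACH2_DESC_LAST;
}
```

Because the chain lives in host memory and is walked by the hardware, the same descriptors can be reused across iterations with only a kick, which the stencil example later in the talk relies on.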
PEACH2 board (production version for HA-PACS/TCA)
- PCI Express Gen2 x8 peripheral board, compatible with the PCIe spec
- Main board + sub board
- FPGA: Altera Stratix IV 530GX
- Most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
- PCI Express x8 card edge
- DDR3 SDRAM
- Power supplies for the various voltages
- PCIe x16 cable connector and PCIe x8 cable connector
- [Photos: top view and side view of the board]

Performance evaluation
- Environment: 8-node GPU cluster (TCAMINI)
  - CPU: Intel Xeon E5 (SandyBridge-EP) 2.6 GHz x 2 sockets
  - MB: SuperMicro X9DRG-QF
  - Memory: DDR3 128 GB
  - OS: CentOS 6.3 (kernel 2.6.32-279.22.1.el6.x86_64)
  - GPU: NVIDIA K20, GDDR5 5 GB x 1
  - CUDA: 5.0, NVIDIA-Linux-x86_64-310.32
  - PEACH2 board: Altera Stratix IV 530GX
  - MPI: MVAPICH2 1.9 with IB FDR10

Evaluation items
- Ping-pong performance between nodes
  - Latency and bandwidth, written as an application
  - Comparison with MVAPICH2 1.9 (with CUDA support) for GPU-GPU communication
- To let another device access GPU memory, the "GPUDirect Support for RDMA" API in CUDA 5 is used
  - A special driver named "TCA p2p driver" was developed to enable the memory mapping; a "PEACH2 driver" to control the board was also developed

Ping-pong latency
- Minimum latency (nearest-neighbor communication)
  - PIO: CPU to CPU: 0.9 us
  - DMA: CPU to CPU: 1.9 us; GPU to GPU: 2.3 us (cf. MVAPICH2 1.9: 19 us)
- PIO gives the best performance for small messages (< 64 bytes)
- Forwarding overhead: 200-300 ns per hop
- [Figures: latency vs. data size (8 B to 32 KB) for PIO, DMA (CPU), and DMA (GPU), and for DMA over 0 to 3 hops]

Ping-pong bandwidth
- Max. 3.5 GByte/s, about 95% of the theoretical peak; curves for different hop counts converge to the same peak
  - With Max Payload Size = 256 bytes, the theoretical peak is 4 GByte/s × 256 / (256 + 16 + 2 + 4 + 1 + 1) = 3.66 GByte/s
- GPU-to-GPU DMA saturates at about 880 MByte/s
  - The PCIe switch embedded in SandyBridge does not have enough resources for PCIe device reads
  - On IvyBridge the performance is expected to improve to the same level as the CPU-to-CPU case
- [Figure: bandwidth vs. data size (8 B to 256 KB) for DMA direct and 1 to 3 hops, PEACH2 GPUDirect, and MVAPICH2 GPU-GPU]

Programming for a TCA cluster
- Data transfer to a remote GPU within TCA can be treated like transfer to a local GPU
- Particularly suitable for stencil computation
  - Good performance for nearest-neighbor communication thanks to the direct network
  - Chaining DMA can bundle the data transfers for all halo planes (see the sketch below)
    - XY-plane: contiguous array
    - XZ-plane: block stride
    - YZ-plane: stride
  - In each iteration the DMA descriptors can be reused, so only a DMA kick operation is needed
  - Bundling into one DMA chain improves strong scaling with small data sizes
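To make the halo-plane mapping concrete, the sketch below shows how the three faces of a 3-D array laid out as a[NZ][NY][NX] (x fastest) correspond to the contiguous, block-stride, and stride patterns named above. The array sizes, the stride_xfer structure, and the printout are illustrative assumptions, not an actual TCA API.

```c
#include <stddef.h>
#include <stdio.h>

#define NX 256      /* fastest-varying dimension */
#define NY 256
#define NZ 256

/* One block-stride transfer request: `nblocks` runs of `block_len`
 * contiguous elements whose starts are `stride` elements apart. */
struct stride_xfer {
    size_t offset;     /* element offset of the first block      */
    size_t block_len;  /* contiguous elements per block          */
    size_t nblocks;    /* number of blocks                       */
    size_t stride;     /* element distance between block starts  */
};

int main(void)
{
    /* For a[NZ][NY][NX], element (z, y, x) sits at (z*NY + y)*NX + x. */
    struct stride_xfer xy = { 0, (size_t)NX * NY, 1, 0 };             /* XY face at z = 0: contiguous   */
    struct stride_xfer xz = { 0, NX, NZ, (size_t)NX * NY };           /* XZ face at y = 0: block stride */
    struct stride_xfer yz = { 0, 1, (size_t)NY * NZ, NX };            /* YZ face at x = 0: stride       */

    struct stride_xfer faces[3] = { xy, xz, yz };
    const char *names[3] = { "XY (contiguous)", "XZ (block stride)", "YZ (stride)" };

    /* With chaining DMA these three requests would be encoded as
     * descriptors once and re-used every iteration with a single kick. */
    for (int i = 0; i < 3; i++)
        printf("%-18s offset %zu, %zu block(s) of %zu element(s), stride %zu\n",
               names[i], faces[i].offset, faces[i].nblocks,
               faces[i].block_len, faces[i].stride);
    return 0;
}
```

Folding all three faces into one descriptor chain is what lets the per-iteration communication cost collapse to a single DMA start, which is where the strong-scaling benefit for small problem sizes comes from.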
Related work
- Non-Transparent Bridge (NTB)
  - NTB adds a bridging function to a downstream port of a PCIe switch
  - Inflexible: the host must recognize it during the BIOS scan
  - Not defined in the PCIe standard, so implementations are incompatible across vendors
- APEnet+ (Italy)
  - GPU direct copy using Fermi GPUs; a protocol different from TCA is used
  - Latency between GPUs is around 5 us (?)
  - Original 3-D torus network with QSFP+ cables
- MVAPICH2 + GPUDirect
  - CUDA 5, Kepler
  - Latency between GPUs was reported as 4.5 us in June, but it is not released yet (4Q of 2013?)

Summary
- TCA: Tightly Coupled Accelerators
  - TCA enables direct communication among accelerators as an element technology and becomes a basic technology for next-generation accelerated computing in the exa-scale era
- PEACH2 board: an implementation realizing TCA with PCIe technology
  - Bandwidth: max. 3.5 GByte/s between CPUs (over 95% of the theoretical peak)
  - Minimum latency: 0.9 us (PIO), 1.9 us (DMA between CPUs), 2.3 us (DMA between GPUs)
  - GPU-GPU communication across nodes was demonstrated on the 8-node cluster
  - In the ping-pong program (the timing loop is sketched below), PEACH2 achieves lower latency than existing technology such as MVAPICH2 for small data sizes
- HA-PACS/TCA with 64 nodes will be installed at the end of Oct. 2013
  - A proof system of the TCA architecture with 4 GPUs per node
  - Development of HPC applications using TCA, and production runs
    - Performance comparison between TCA and InfiniBand
    - A hybrid communication scheme using TCA and InfiniBand will be investigated
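For reference, the skeleton below shows the usual structure of the ping-pong measurement behind the quoted latencies: time many round trips and report half the average round-trip time as the one-way latency. The tca_put/tca_wait stubs are hypothetical placeholders for whichever primitive is being measured (PEACH2 PIO or DMA, or an MPI call); they are not the TCA or PEACH2 driver API.

```c
#include <stddef.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000

/* No-op stubs so the sketch compiles; a real benchmark would issue a
 * PEACH2 PIO store or DMA kick here and poll for the peer's reply. */
static void tca_put(size_t size)  { (void)size; }
static void tca_wait(size_t size) { (void)size; }

/* One-way latency in microseconds for messages of `size` bytes,
 * derived from ITERS round trips. */
static double pingpong_latency_usec(size_t size)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        tca_put(size);           /* ping: send to the neighbor */
        tca_wait(size);          /* pong: wait for the reply   */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed_us = (t1.tv_sec - t0.tv_sec) * 1e6
                      + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
    return elapsed_us / ITERS / 2.0;
}

int main(void)
{
    /* Sweep message sizes over the same range as the slides (8 B to 32 KB). */
    for (size_t size = 8; size <= 32768; size *= 2)
        printf("%6zu bytes: %.3f usec\n", size, pingpong_latency_usec(size));
    return 0;
}
```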