
Interconnection Network for Tightly Coupled Accelerators Architecture
Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato
Center for Computational Sciences, University of Tsukuba, Japan
Hot Interconnects 21, Aug 22, 2013
What is “Tightly Coupled Accelerators (TCA)”?
Concept:
- Direct connection between accelerators (GPUs) across nodes
  - Eliminates extra memory copies to the host
  - Reduces latency and improves strong scaling for scientific applications with small data sizes
- PCIe is used as the communication link between accelerators on different nodes
  - PCIe just performs packet transfer, and direct device P2P communication is available
- PEACH2: PCI Express Adaptive Communication Hub ver. 2
  - To configure TCA, each node is connected to the other nodes through the PEACH2 chip
Design policy of PEACH2
- Implemented on an FPGA with four PCIe Gen2 IP blocks
  - Altera Stratix IV GX
  - Prototyping and flexible enhancement
- Sufficient communication bandwidth
  - PCI Express Gen2 x8 for each port
- Sophisticated DMA controller
  - Chaining DMA
- Latency reduction
  - Hardwired logic
  - Low-overhead routing mechanism
    - Efficient address mapping into the PCIe address space using unused bits
    - Simple comparator decides the output port (see the sketch after this slide)
- Not only a proof-of-concept implementation: it will also be used for production runs on a GPU cluster.
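The routing bullet above can be made concrete with a small sketch. This is only an illustration under assumed parameters: the position and width of the node-ID field in the PCIe address, the 8-node ring size, and the port selection policy are hypothetical, not the actual PEACH2 layout.

```c
/*
 * Sketch of the "simple comparator" routing idea: a node ID is encoded
 * in otherwise-unused high bits of the PCIe address window, so the
 * output port is chosen by comparing that field with the local node ID.
 * Field positions and the ring layout are illustrative assumptions.
 */
#include <stdint.h>

enum port { PORT_N, PORT_E, PORT_W, PORT_S };  /* host, ring, ring, inter-ring */

#define DEST_SHIFT 35u        /* assumed position of the destination-node field */
#define DEST_MASK  0x1Fu      /* assumed 5-bit node ID */

static enum port route(uint64_t pcie_addr, unsigned my_id)
{
    unsigned dest = (unsigned)((pcie_addr >> DEST_SHIFT) & DEST_MASK);

    if (dest == my_id)
        return PORT_N;                        /* deliver to the local host/GPUs */
    if ((dest >> 3) != (my_id >> 3))
        return PORT_S;                        /* destination is on the other ring */
    /* same ring of 8 nodes: pick the shorter direction around the ring */
    return ((dest - my_id) & 0x7u) <= 4 ? PORT_E : PORT_W;
}
```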
TCA node structure example
- PEACH2 can access all GPUs
  - NVIDIA Kepler architecture + CUDA 5.0 “GPUDirect Support for RDMA”
  - Performance over QPI is quite bad => PEACH2 supports only GPU0 and GPU1
- Connection among 3 nodes using PEACH2
[Figure: example TCA node. Two Xeon E5 CPUs are connected by QPI and share a single PCIe address space. PEACH2 attaches via PCIe Gen2 x8 links (toward the host and toward neighboring nodes), the GPUs (NVIDIA K20/K20X, Kepler architecture) attach via PCIe Gen2 x16 (GPU0/GPU1 on the PEACH2 side, GPU2/GPU3 on the other CPU), and an InfiniBand HCA attaches via PCIe Gen3 x8.]
Overview of PEACH2 chip
- Fully compatible with the PCIe Gen2 spec.
  - Root and Endpoint must be paired according to the PCIe spec.
- Port N: connected to the host and GPUs (Endpoint)
- Port E and Port W: form the ring topology
- Port S: connected to the other ring
- Selectable between Root and Endpoint
- Write-only, except for Port N
  - Instead, a proxy write on the remote node realizes a pseudo-read.
[Figure: PEACH2 block diagram. The chip contains a DMAC, a NIOS (CPU), the routing function, and memory. Port N faces the CPU & GPU side (Endpoint); Port W and Port E connect to neighboring PEACH2 chips (Endpoint / Root Complex); Port S connects to the other ring (Root Complex / Endpoint).]
Communication by PEACH2
- PIO
  - The CPU can store data to a remote node directly using mmap.
- DMA
  - Chaining mode (see the sketch after this slide)
    - DMA requests are prepared as DMA descriptors chained in host memory.
    - Multiple DMA transactions are performed automatically by the hardware, following the descriptor chain.
  - Register mode
    - Up to 16 requests
    - No overhead to transfer descriptors from the host
  - Block-stride transfer function
[Figure: a chain of DMA descriptors, Descriptor0 through Descriptor(n-1); each descriptor holds Source, Destination, Next, Length, and Flags fields.]
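As a concrete illustration of the chaining mode, here is a minimal sketch in C. The field names of the descriptor (Source, Destination, Next, Length, Flags) follow the figure, but the exact field widths, the doorbell register, and the helper names are assumptions, not the real PEACH2 interface.

```c
/*
 * Minimal sketch of chaining DMA: descriptors sit in host memory and are
 * linked through the "next" field; the hardware walks the chain after a
 * single kick.  Layout and register names are hypothetical.
 */
#include <stdint.h>

struct tca_dma_desc {              /* assumed layout, following the figure */
    uint64_t src;                  /* PCIe address of the source buffer            */
    uint64_t dst;                  /* PCIe address of the destination buffer       */
    uint64_t next;                 /* address of the next descriptor, 0 = end      */
    uint32_t len;                  /* transfer length in bytes                     */
    uint32_t flags;                /* e.g. interrupt-on-completion, stride mode    */
};

/* Build a chain of n transfers; descs[] must be in DMA-visible host memory. */
static void build_chain(struct tca_dma_desc *descs, uint64_t descs_bus_addr,
                        const uint64_t *src, const uint64_t *dst,
                        const uint32_t *len, int n)
{
    for (int i = 0; i < n; i++) {
        descs[i].src   = src[i];
        descs[i].dst   = dst[i];
        descs[i].len   = len[i];
        descs[i].flags = 0;
        descs[i].next  = (i + 1 < n)
                       ? descs_bus_addr + (i + 1) * sizeof(struct tca_dma_desc)
                       : 0;        /* terminate the chain */
    }
}

/* One MMIO write ("kick") starts the whole chain; the same chain can be
 * reused on the next iteration without being rewritten. */
static void kick(volatile uint64_t *doorbell, uint64_t first_desc_bus_addr)
{
    *doorbell = first_desc_bus_addr;   /* hypothetical doorbell register */
}
```

Because the chain lives in host memory and is walked by the hardware, the host issues only one doorbell write per chain, which is what keeps the DMA start overhead low.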
PEACH2 board (production version for HA-PACS/TCA)
- PCI Express Gen2 x8 peripheral board
  - Compatible with the PCIe spec.
[Photos: top view and side view of the board.]
PEACH2 board (production version for HA-PACS/TCA)
[Annotated board photo: main board + sub board; FPGA (Altera Stratix IV 530GX); PCI Express x8 card edge; DDR3 SDRAM; power supplies for the various voltages; PCIe x16 cable connector; PCIe x8 cable connector. Most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz).]
Performance Evaluation
- Environment: 8-node GPU cluster (TCAMINI)
  - CPU: Intel Xeon E5 (SandyBridge-EP) 2.6 GHz x 2 sockets
  - MB: SuperMicro X9DRG-QF
  - Memory: DDR3 128 GB
  - OS: CentOS 6.3 (kernel 2.6.32-279.22.1.el6.x86_64)
  - GPU: NVIDIA K20, GDDR5 5 GB x1
  - CUDA: 5.0, NVIDIA-Linux-x86_64-310.32
  - PEACH2 board: Altera Stratix IV 530GX
  - MPI: MVAPICH2 1.9 with IB FDR10
Evaluation items
- Ping-pong performance between nodes
  - Latency and bandwidth
  - Written as an application (a sketch of the measurement loop follows this slide)
  - Compared with MVAPICH2 1.9 (with CUDA support) for GPU-GPU communication
- To access GPU memory from another device, the “GPUDirect Support for RDMA” API in CUDA 5 is used.
  - A special driver, the “TCA p2p driver”, was developed to enable the memory mapping.
  - A “PEACH2 driver” to control the board was also developed.
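A rough sketch of the ping-pong measurement mentioned above, in C. The tca_dma_put / tca_wait_flag wrappers are hypothetical placeholders for the PEACH2 driver interface; only the measurement structure (time many round trips, halve the average) reflects the slides.

```c
#include <stddef.h>
#include <time.h>

#define ITERS 1000

/* hypothetical wrappers around the PEACH2 driver (assumed, not a real API) */
extern void tca_dma_put(int peer, const void *src, void *dst, size_t len);
extern void tca_wait_flag(volatile int *flag);  /* spin until the peer's write arrives */

/* Returns the one-way latency in microseconds for messages of size len. */
double pingpong_latency(int peer, void *sbuf, void *rbuf,
                        volatile int *flag, size_t len)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        tca_dma_put(peer, sbuf, rbuf, len);  /* payload + completion flag to the peer */
        tca_wait_flag(flag);                 /* peer echoes the message back          */
        *flag = 0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6
                + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    return usec / ITERS / 2.0;               /* half the round trip = one-way latency */
}
```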
Ping-pong Latency
- Minimum latency (nearest-neighbor communication)
  - PIO: CPU to CPU: 0.9 us
  - DMA: CPU to CPU: 1.9 us; GPU to GPU: 2.3 us (cf. MVAPICH2 1.9: 19 us)
[Plot: latency (usec) vs. data size (8 B to 32 KB) for PIO, DMA (CPU), and DMA (GPU); PIO gives good performance below 64 bytes.]
Ping-pong Latency
- Minimum latency (nearest-neighbor communication)
  - PIO: CPU to CPU: 0.9 us
  - DMA: CPU to CPU: 1.9 us; GPU to GPU: 2.3 us (cf. MVAPICH2 1.9: 19 us)
- Forwarding overhead: 200~300 nsec
[Plot: latency (usec) vs. data size (8 B to 32 KB) for CPU DMA transfers: direct, 1 hop, 2 hops, and 3 hops.]
Ping-pong Bandwidth
- Max. 3.5 GByte/sec (CPU to CPU)
  - 95% of the theoretical peak
  - Converges to the same peak for all hop counts
  - Max Payload Size = 256 bytes, so the theoretical peak is
    4 Gbyte/sec × 256 / (256 + 16 + 2 + 4 + 1 + 1) = 3.66 Gbyte/s
    (a worked breakdown follows this slide)
- GPU to GPU DMA saturates at up to 880 MByte/sec
  - The PCIe switch embedded in SandyBridge does not have enough resources for PCIe device reads.
  - In IvyBridge the performance is expected to improve, to about the same level as the CPU-to-CPU case.
[Plot: bandwidth (MBytes/sec, up to 4000) vs. data size (8 B to 256 KB) for DMA direct, 1 hop, 2 hops, 3 hops, GPU Direct, and MVAPICH2 GPU; the GPU curves level off around 880 Mbyte/s.]
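To make the theoretical-peak figure above explicit: PCIe Gen2 x8 delivers 4 Gbyte/s per direction after 8b/10b coding (8 lanes × 5 GT/s × 8/10), and the overhead bytes in the slide's formula match the usual per-TLP framing. The breakdown below is an interpretation (assuming a 4DW TLP header), not taken verbatim from the slide.

```latex
% Per-TLP bytes on the wire for a 256-byte payload (assumed breakdown):
%   STP framing (1) + sequence number (2) + 4DW TLP header (16)
%   + payload (256) + LCRC (4) + END framing (1) = 280 bytes
\[
  4\,\mathrm{Gbyte/s}\times\frac{256}{256+16+2+4+1+1}
  = 4\,\mathrm{Gbyte/s}\times\frac{256}{280}
  \approx 3.66\,\mathrm{Gbyte/s}
\]
```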
Programming for TCA cluster
- Data transfer to a remote GPU within TCA can be treated like transfer to a local GPU.
- In particular, it is suitable for stencil computation
  - Good performance for nearest-neighbor communication due to the direct network
  - Chaining DMA can bundle the data transfers for all of the halo planes into one DMA (see the sketch after this slide)
    - XY-plane: contiguous array
    - XZ-plane: block stride
    - YZ-plane: stride
  - In each iteration, the DMA descriptors can be reused; only a DMA kick operation is needed
  => Improves strong scaling with small data sizes
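A sketch of how the halo transfers might be bundled into one reusable chain. The tca_desc_* and tca_kick helpers are hypothetical stand-ins for the PEACH2 driver interface, and the per-plane source/destination offsets are omitted for brevity; only the structure (build the chain once, kick it every iteration) reflects the slides.

```c
/*
 * Bundle the halo exchange of a 3-D stencil into one chained DMA.
 * Helper names and signatures are assumptions, not the real API.
 */
#include <stddef.h>
#include <stdint.h>

extern void tca_desc_contig(int slot, uint64_t src, uint64_t dst, size_t bytes);
extern void tca_desc_block_stride(int slot, uint64_t src, uint64_t dst,
                                  size_t block, size_t stride, int count);
extern void tca_chain_close(int last_slot);   /* terminate the chain       */
extern void tca_kick(void);                   /* start the whole chain     */
extern void tca_wait(void);                   /* wait for chain completion */

/* nx, ny, nz: local grid size; double-precision elements assumed */
void setup_halo_chain(uint64_t src_base, uint64_t dst_base,
                      size_t nx, size_t ny, size_t nz)
{
    const size_t e = sizeof(double);

    /* XY-plane: one contiguous block of nx*ny elements */
    tca_desc_contig(0, src_base, dst_base, nx * ny * e);

    /* XZ-plane: nz blocks of nx elements, separated by a full XY plane */
    tca_desc_block_stride(1, src_base, dst_base, nx * e, nx * ny * e, (int)nz);

    /* YZ-plane: fine-grained stride; one z-slice shown here — ny single
       elements separated by a row of nx elements.  A full halo would chain
       one such descriptor per z-slice (or use a nested-stride descriptor). */
    tca_desc_block_stride(2, src_base, dst_base, e, nx * e, (int)ny);

    tca_chain_close(2);
}

void timestep(void)
{
    /* ... compute ... */
    tca_kick();   /* one kick sends all halo planes; descriptors are reused */
    tca_wait();
}
```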
Related Work
- Non-Transparent Bridge (NTB)
  - NTB adds a bridge function to a downstream port of a PCI-E switch.
  - Inflexible: the host must recognize it during the BIOS scan.
  - It is not defined in the PCI-E standard, and implementations are incompatible across vendors.
- APEnet+ (Italy)
  - GPU direct copy using Fermi GPUs; a different protocol from TCA is used.
  - Latency between GPUs is around 5 us(?)
  - Custom 3-D torus network, QSFP+ cables
- MVAPICH2 + GPUDirect
  - CUDA 5, Kepler
  - Latency between GPUs was reported as 4.5 us in June, but it is not yet released (4Q of 2013?)
Summary
- TCA: Tightly Coupled Accelerators
  - TCA enables direct communication among accelerators; as an element technology, it is intended to become a basic technology for next-generation accelerated computing in the exa-scale era.
- PEACH2 board: an implementation realizing TCA using PCIe technology
  - Bandwidth: max. 3.5 Gbyte/sec between CPUs (over 95% of the theoretical peak)
  - Minimum latency: 0.9 us (PIO), 1.9 us (DMA between CPUs), 2.3 us (DMA between GPUs)
  - GPU-GPU communication across nodes was demonstrated on an 8-node cluster.
  - With the ping-pong program, PEACH2 achieves lower latency than existing technology such as MVAPICH2 for small data sizes.
- HA-PACS/TCA with 64 nodes will be installed at the end of October 2013.
  - A proof system of the TCA architecture with 4 GPUs per node
  - Development of HPC applications using TCA, and production runs
    - Performance comparison between TCA and InfiniBand
    - A hybrid communication scheme combining TCA and InfiniBand will be investigated.