
Performance Analysis and Evaluation of
InfiniBand FDR and 40GigE RoCE on HPC
and Cloud Computing Systems
Jerome Vienne, Jitong Chen, Md. Wasi-ur Rahman,
Nusrat S. Islam, Hari Subramoni,
Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University
IEEE Hot Interconnects
August 23, 2012
Outline
1. Introduction
2. Cloud Computing Applications
3. Performance Analysis and Evaluation
4. Conclusion
Introduction
• Commodity clusters continue to be very popular in HPC and cloud environments
• HPC applications and cloud computing middleware (e.g., Hadoop) have varying communication and computation characteristics
• The new PCIe Gen3 interface can now deliver speeds up to 128 Gbps
• High-performance interconnects are capable of delivering speeds up to 54 Gbps
• Mellanox's new ConnectX-3 adapters provide FDR (54 Gbps) in IB mode and 40 GigE in RoCE mode
Overview of Network Protocol Stacks
[Diagram: network protocol stack options for HPC / cloud computing applications. Applications in user space use either Sockets or Verbs. Sockets traffic passes through kernel-space TCP/IP, either over an Ethernet driver to a 10/40 GigE Ethernet adapter and switch, or over IPoIB to a QDR/FDR InfiniBand adapter and switch. Verbs traffic uses RDMA directly, either as RoCE through a 10/40 GigE Ethernet adapter and switch, or as native IB through a QDR/FDR InfiniBand adapter and switch.]
• RoCE: allows InfiniBand RDMA to run over Ethernet
• ConnectX-2: 10 GigE in RoCE mode or QDR (32 Gbps) in IB mode
• ConnectX-3: 40 GigE in RoCE mode or FDR (54 Gbps) in IB mode
Problem Statement
• How much benefit can the user of an HPC / Cloud installation
hope to see by utilizing IB FDR / RoCE 40 GigE over IB
QDR and RoCE 10 GigE interconnects, respectively?
• How does InfiniBand compare with RoCE in terms of
performance?
Cloud Computing Middleware
• Cloud computing has gained significant momentum and popularity
• These deployments require the highest performance and reliability available
• Apache Hadoop is the most popular framework for running applications on large clusters built of commodity hardware
• Major components of Hadoop used in this study:
• HDFS
• HBase
HDFS
• Hadoop Distributed File System (HDFS) is the underlying
file system of the Hadoop framework
• HDFS is designed for storing very large files on clusters of
commodity hardware
• Two main types of nodes:
• NameNode: responsible for storing and managing the
metadata
• DataNode: acts as storage for HDFS files
• Files are usually divided into fixed-sized (64 MB) blocks and
stored as independent units
• Each block is also replicated to multiple (typically three)
DataNodes in order to provide fault tolerance and availability
HBase and YCSB
HBase
• Developed as part of the Apache Hadoop project
• Java-based database
• Runs on top of the Hadoop framework
• Used to host very large tables with many billions of entries
• Provides capabilities similar to Google’s BigTable
Yahoo! Cloud Serving Benchmark
• Used as our workload
• Facilitates performance comparisons of different key/value-pair and cloud
data serving systems
• Defines a core set of benchmarks for four widely used systems: HBase, Cassandra, PNUTS, and a simple sharded MySQL implementation
Experimental Testbed
4-node InfiniBand Linux cluster
• 16 cores/node – 2 Intel Sandy Bridge-EP 2.6 GHz CPUs
• 32 GB main memory, 20 MB L3 shared cache
• One PCIe Gen3 interface (128 Gbps)
• Vendor-modified version of OFED based on OFED-1.5.3
Network Equipment
• IB cards:
• ConnectX-2 QDR (32 Gbps) / 10 GigE
• ConnectX-3 FDR (54 Gbps) / 40 GigE
• 36-port Mellanox FDR switch used to connect all the nodes
Performance Results
Network Level Performance
• Latency
• Bandwidth
MPI Level Performance
• Point-to-point MPI
• MPI Collectives
• NAS Parallel Benchmarks
Impact on Cloud Computing Middleware
• HDFS Write using TestDFSIO
• HBase Get and Put throughput
Latency
[Figure: latency (us) vs. message size (2 B to 4 KB) for RoCE (10 GigE), RoCE (40 GigE), IB (QDR), and IB (FDR); a 52% improvement is annotated.]
• Network level latency benchmark (ib_send_lat)
• IB FDR provides best performance
• IB QDR gives better latency than 40GigE
Bandwidth
[Figure: bandwidth vs. message size for RoCE (10 GigE), RoCE (40 GigE), IB (QDR), and IB (FDR); improvements of 84% and 298% are annotated.]
• Network level bandwidth benchmark (ib_send_bw)
• 40 GigE gives better bandwidth than IB QDR
• Encoding: IB QDR uses 8b/10b, 40 GigE uses 64b/66b (worked rates below)
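As a rough illustration (not from the slides; standard per-lane signaling rates are assumed), the effective data rate is the signaling rate scaled by the encoding efficiency:

\[ \text{IB QDR: } 4 \times 10~\text{Gbps} \times \tfrac{8}{10} = 32~\text{Gbps} \]
\[ \text{IB FDR: } 4 \times 14.0625~\text{Gbps} \times \tfrac{64}{66} \approx 54.5~\text{Gbps} \]
\[ \text{40 GigE: } 4 \times 10.3125~\text{Gbps} \times \tfrac{64}{66} = 40~\text{Gbps} \]

This is why RoCE over 40 GigE can deliver higher large-message bandwidth than IB QDR despite QDR's higher raw signaling rate.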
MVAPICH2 Software
High Performance MPI Library for IB and 10/40GE
• Used by more than 1,930 organizations in 68 countries
• More than 124,000 direct downloads from OSU site
• Empowering many TOP500 clusters
• 11th ranked 81,920-core cluster (Pleiades) at NASA
• 14th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute
of Technology
• 40th ranked 62,976-core cluster (Ranger) at TACC
• Available with the software stacks of many IB, 10/40 GigE, and
server vendors, including the OpenFabrics Enterprise
Distribution (OFED)
• Also supports uDAPL device (for networks supporting
uDAPL)
• http://mvapich.cse.ohio-state.edu/
Point-to-point MPI: Latency
[Figure: MPI latency (us) vs. message size (0 B to 4 KB) for RoCE (10 GigE), RoCE (40 GigE), IB (QDR), and IB (FDR).]
• MPI level latency benchmark (OMB: osu_latency); a ping-pong sketch follows below
• IB FDR provides best performance
• IB QDR gives better latency than 10/40GigE
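For reference, osu_latency reports half the average round-trip time of a blocking ping-pong between two ranks. The following is a minimal sketch of that pattern, not the OMB code itself; the message size and iteration count are illustrative assumptions, and the real benchmark adds warm-up iterations and sweeps over message sizes.

/* Minimal MPI ping-pong latency sketch (illustrative; not the OMB source).
 * Rank 0 sends a message to rank 1, which echoes it back; one-way latency
 * is half the averaged round-trip time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 10000;      /* assumed iteration count */
    const int size  = 8;          /* assumed message size in bytes */
    int rank;
    char *buf;
    double t_start, t_end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = (char *)calloc(size, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    t_start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    if (rank == 0)
        printf("average one-way latency: %.2f us\n",
               (t_end - t_start) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}

Compile with an MPI compiler wrapper (e.g., mpicc) and run with two ranks placed on different nodes so the measurement crosses the network.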
Point-to-point MPI: Bandwidth
[Figure: MPI bandwidth (million bytes/s) vs. message size for RoCE (10 GigE), RoCE (40 GigE), IB (QDR), and IB (FDR); improvements of 86% and 298% are annotated.]
• MPI level bandwidth benchmark (OMB: osu_bw); a windowed-send sketch follows below
• 40 GigE gives better bandwidth than IB QDR
• Encoding: IB QDR (8b/10b) vs RoCE 40 GigE (64b/66b)
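osu_bw keeps the link busy by posting a window of non-blocking sends before synchronizing. A simplified sketch of that pattern follows; the window size, message size, and iteration count are assumptions, and the OMB implementation differs in its details.

/* Minimal MPI streaming-bandwidth sketch (illustrative; not the OMB source).
 * Rank 0 posts a window of non-blocking sends; rank 1 posts matching
 * receives and returns a small ack once the whole window has arrived. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64          /* assumed in-flight messages per window */
#define ITERS  100         /* assumed number of windows */
#define MSG    (1 << 20)   /* assumed message size: 1 MB */

int main(int argc, char **argv)
{
    int rank;
    char *buf;
    MPI_Request req[WINDOW];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* one slot per in-flight message so receive buffers never overlap */
    buf = (char *)malloc((size_t)MSG * WINDOW);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf + (size_t)w * MSG, MSG, MPI_CHAR, 1, 0,
                          MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            /* small ack tells the sender the whole window was consumed */
            MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf + (size_t)w * MSG, MSG, MPI_CHAR, 0, 0,
                          MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth: %.1f million bytes/s\n",
               (double)MSG * WINDOW * ITERS / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Using a separate buffer slot per in-flight message keeps the concurrently posted receive buffers non-overlapping, as the MPI standard requires.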
MPI Collective: Scatter
[Figure: MPI_Scatter latency (ms) vs. message size (4 KB to 1 MB) for RoCE (10 GigE), RoCE (40 GigE), IB (QDR), and IB (FDR).]
• MPI level collective benchmark (OMB: osu_scatter); a minimal sketch follows below
• IB FDR provides best performance
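The measured operation is MPI_Scatter: the root hands a distinct chunk of its send buffer to every rank. A minimal timed sketch is shown below; the per-rank chunk size is an assumption, and osu_scatter averages over many iterations and message sizes rather than timing a single call.

/* Minimal MPI_Scatter timing sketch (illustrative; not the OMB source).
 * The root scatters one chunk to each rank and the elapsed time of a
 * single scatter is reported. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int chunk = 4096;       /* assumed bytes delivered to each rank */
    int rank, nprocs;
    char *sendbuf = NULL, *recvbuf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0)                /* send buffer is only significant at the root */
        sendbuf = (char *)calloc((size_t)chunk * nprocs, 1);
    recvbuf = (char *)malloc(chunk);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    MPI_Scatter(sendbuf, chunk, MPI_CHAR, recvbuf, chunk, MPI_CHAR,
                0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("scatter of %d bytes per rank took %.2f us\n",
               chunk, (t1 - t0) * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Only the root needs a populated send buffer; the non-root ranks may pass NULL for it.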
NAS Parallel Benchmarks Class C
Benchmark   IB (QDR)   IB (FDR)   RoCE (10 GigE)   RoCE (40 GigE)
FT          9.96 s     8.80 s     14.39 s          9.71 s
IS          0.80 s     0.64 s     1.32 s           0.71 s
MG          2.02 s     1.98 s     2.20 s           1.99 s
BT          24.79 s    24.74 s    26.23 s          24.83 s
• Designed to mimic computation and data movement in CFD
applications
• FT, IS: Communication bound
• MG, BT: Computation bound
TestDFSIO
• File system benchmark that measures the I/O performance of HDFS
• HDFS Write is more network-sensitive than HDFS Read (which occurs locally within a node in most cases)
• In a sequential write, each map task opens a file and writes a specific amount of data to it
• A single reduce task aggregates the results of all the map tasks
Protocol
• We start two map tasks each writing a file to three DataNodes
• We vary the file size from 1 GB to 10 GB
• We measure the throughput of sequential write reported by TestDFSIO
HDFS Write Operation
[Figure: HDFS sequential write throughput (MegaBytes/s) vs. file size (1 GB to 9 GB) for IPoIB (FDR), IPoIB (QDR), Sockets (40 GigE), and Sockets (10 GigE).]
• Due to the higher bandwidth of the IPoIB (FDR) system, sequential write achieves better throughput than IPoIB (QDR) for all file sizes
• Up to 19% benefit for IPoIB (FDR) over IPoIB (QDR)
• Sequential write throughput improves by 31% with Sockets (40 GigE) compared to Sockets (10 GigE)
HBase Evaluation
• Using YCSB as our workload, we perform 100% Get, 100% Put, and a 50% Get / 50% Put mix of operations
• The HBase Get operation requires less network communication
• HBase Put creates more network traffic (all data are written to both the MemStore and HDFS)
• The mixed Get/Put workload also generates network traffic (some old data in the MemStore are replaced by new data each time)
• Three regionservers are used.
• The regionservers communicate with the master (HDFS NameNode) and the HBase client through the underlying interconnect
• Regionservers are usually configured to reside on the same nodes as the HDFS DataNodes to improve data locality
• For these workloads, 320,000 records are inserted into and read from HBase
HBase Get and Put Throughput
[Figure: throughput (kilo ops/s) for Pure-Get, Get-Put-Mix, and Pure-Put operations with IPoIB (FDR), IPoIB (QDR), Sockets (40 GigE), and Sockets (10 GigE).]
• 9% benefit for IPoIB (FDR) with Get-Put-Mix
• Up to 10% benefit for IPoIB (FDR) with 100% Put
• Overall, IPoIB (FDR) performs 25% better than Sockets (40 GigE)
Performance Characterization
Four modes compared: IB/IPoIB FDR, IB/IPoIB QDR, RoCE/Sockets 40 GigE, RoCE/Sockets 10 GigE
Network Level Performance: IB FDR > RoCE 40 GigE > IB QDR > RoCE 10 GigE
Performance of HPC Applications: IB FDR > RoCE 40 GigE > IB QDR > RoCE 10 GigE
Performance of Cloud Computing Middleware: IPoIB FDR > IPoIB QDR > Sockets 40 GigE > Sockets 10 GigE
Conclusion
• Carried out a comprehensive performance evaluation of four possible modes of communication
• The latest InfiniBand FDR interconnect gives the best performance
• At the network level and for HPC applications, RoCE 40 GigE performs better than IB QDR
• For cloud computing middleware, IPoIB QDR performs better than RoCE 40 GigE (Sockets)
Thanks for your attention
Questions?
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page
http://mvapich.cse.ohio-state.edu/