Performance Analysis and Evaluation of InfiniBand FDR and 40GigE RoCE on HPC and Cloud Computing Systems

Jerome Vienne, Jitong Chen, Md. Wasi-ur-Rahman, Nusrat S. Islam, Hari Subramoni, Dhabaleswar K. (DK) Panda
Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
IEEE Hot Interconnects, August 23, 2012

Outline
1 Introduction
2 Cloud Computing Applications
3 Performance Analysis and Evaluation
4 Conclusion

1 Introduction

Introduction
• Commodity clusters continue to be very popular in HPC and clouds
• HPC applications and cloud computing middleware (e.g., Hadoop) have varying communication and computation characteristics
• The new PCIe Gen3 interface can deliver speeds of up to 128 Gbps
• High-performance interconnects can now deliver speeds of up to 54 Gbps
• Mellanox's new ConnectX-3 adapters offer FDR (54 Gbps) InfiniBand and 40 GigE RoCE

Overview of Network Protocol Stacks
[Figure: the four communication modes compared in this study. HPC and cloud computing applications use either the sockets interface (kernel-space TCP/IP, over a 10/40 GigE Ethernet driver or over IPoIB on a QDR/FDR InfiniBand adapter) or the verbs interface (user-space RDMA, over RoCE on a 10/40 GigE Ethernet adapter or native IB on a QDR/FDR InfiniBand adapter).]
• RoCE allows the RDMA semantics of InfiniBand to run over Ethernet (a minimal verbs device-query sketch follows this slide)
• ConnectX-2: 10 GigE in RoCE mode or QDR (32 Gbps) in IB mode
• ConnectX-3: 40 GigE in RoCE mode or FDR (54 Gbps) in IB mode
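To make the verbs path concrete, here is a minimal sketch of how an application can enumerate RDMA-capable adapters with libibverbs; it is not taken from the slides, and the port number (1) and the build command are assumptions. Both InfiniBand HCAs and RoCE-capable Ethernet NICs appear through the same verbs interface; only the reported link layer differs.

/* Minimal sketch (not from the slides): enumerate RDMA-capable devices with
 * libibverbs. IB HCAs and RoCE-capable Ethernet NICs use the same interface;
 * the port's link layer tells them apart.
 * Build (assumption): gcc rdma_devices.c -o rdma_devices -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list) {
        fprintf(stderr, "no RDMA devices found or verbs unavailable\n");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(dev_list[i]);
        if (!ctx)
            continue;

        struct ibv_port_attr port;
        if (!ibv_query_port(ctx, 1, &port)) {   /* port 1 assumed */
            /* link_layer distinguishes native IB from (RoCE) Ethernet */
            printf("%s: link layer %s, port state %d\n",
                   ibv_get_device_name(dev_list[i]),
                   port.link_layer == IBV_LINK_LAYER_ETHERNET ?
                       "Ethernet (RoCE)" : "InfiniBand",
                   port.state);
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(dev_list);
    return 0;
}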
Problem Statement
• How much benefit can the user of an HPC or cloud installation expect from IB FDR and RoCE 40 GigE over IB QDR and RoCE 10 GigE, respectively?
• How does InfiniBand compare with RoCE in terms of performance?

2 Cloud Computing Applications

Cloud Computing Middleware
• Cloud computing economies have gained significant momentum and popularity
• They require the highest performance and reliability available
• Apache Hadoop is the most popular framework for running applications on large clusters built of commodity hardware
• Major Hadoop components used in this study: HDFS and HBase

HDFS
• The Hadoop Distributed File System (HDFS) is the underlying file system of the Hadoop framework
• Designed for storing very large files on clusters of commodity hardware
• Two main types of nodes:
  • NameNode: responsible for storing and managing the metadata
  • DataNode: acts as storage for HDFS files
• Files are divided into fixed-size (typically 64 MB) blocks and stored as independent units
• Each block is replicated to multiple (typically three) DataNodes to provide fault tolerance and availability

HBase and YCSB
HBase
• Developed as part of the Apache Hadoop project
• Java-based database that runs on top of the Hadoop framework
• Used to host very large tables with many billions of entries
• Provides capabilities similar to Google's BigTable
Yahoo! Cloud Serving Benchmark (YCSB)
• Used as our workload
• Facilitates performance comparisons of different key/value-pair and cloud data serving systems
• Defines a core set of benchmarks for four widely used systems: HBase, Cassandra, PNUTS, and a simple sharded MySQL implementation

3 Performance Analysis and Evaluation

Experimental Testbed
4-node InfiniBand Linux cluster
• 16 cores/node: two Intel Sandy Bridge-EP 2.6 GHz CPUs
• 32 GB main memory, 20 MB shared L3 cache
• One PCIe Gen3 interface (128 Gbps)
• Vendor-modified version of OFED based on OFED-1.5.3
Network equipment
• IB cards: ConnectX-2 QDR (32 Gbps) / 10 GigE and ConnectX-3 FDR (54 Gbps) / 40 GigE
• 36-port Mellanox FDR switch used to connect all the nodes

Performance Results
Network-level performance
• Latency
• Bandwidth
MPI-level performance
• Point-to-point MPI
• MPI collectives
• NAS Parallel Benchmarks
Impact on cloud computing middleware
• HDFS write using TestDFSIO
• HBase Get and Put throughput

Latency
[Figure: network-level latency (µs) vs. message size (2 B to 4 KB) for RoCE (10 GigE), RoCE (40 GigE), IB (QDR), and IB (FDR); a 52% improvement is highlighted.]
• Network-level latency benchmark (ib_send_lat)
• IB FDR provides the best performance
• IB QDR gives better latency than 40 GigE

Bandwidth
[Figure: network-level bandwidth vs. message size for the same four configurations; improvements of 84% and 298% are highlighted.]
• Network-level bandwidth benchmark (ib_send_bw)
• 40 GigE gives better bandwidth than IB QDR
• Encoding: IB QDR uses 8b/10b, while 40 GigE uses 64b/66b (see the back-of-the-envelope sketch below)
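The encoding bullet is the key to the bandwidth ordering: 8b/10b encoding gives up 20% of QDR's signaling rate, while 64b/66b gives up only about 3%. A small back-of-the-envelope sketch, using standard per-lane signaling figures that are assumed rather than quoted on the slides:

/* Back-of-the-envelope sketch (per-lane signaling rates are standard values,
 * not from the slides): effective data rate = signaling rate x encoding
 * efficiency. This is why RoCE 40 GigE can out-bandwidth IB QDR even though
 * both use four roughly 10 Gbps lanes.
 */
#include <stdio.h>

int main(void)
{
    struct { const char *name; double lanes, gbps_per_lane, efficiency; } link[] = {
        { "IB QDR (8b/10b)",          4, 10.0,     8.0 / 10.0 },
        { "IB FDR (64b/66b)",         4, 14.0625, 64.0 / 66.0 },
        { "40 GigE / RoCE (64b/66b)", 4, 10.3125, 64.0 / 66.0 },
        { "10 GigE / RoCE (64b/66b)", 1, 10.3125, 64.0 / 66.0 },
    };

    for (int i = 0; i < 4; i++) {
        double signaling = link[i].lanes * link[i].gbps_per_lane;
        printf("%-26s signaling %6.2f Gbps -> effective %6.2f Gbps\n",
               link[i].name, signaling, signaling * link[i].efficiency);
    }
    return 0;
}

Run, this prints effective rates of roughly 32, 54.5, 40, and 10 Gbps, matching the 32 Gbps (QDR) and 54 Gbps (FDR) figures quoted on the earlier slides.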
MVAPICH2 Software
High-performance MPI library for InfiniBand and 10/40GigE
• Used by more than 1,930 organizations in 68 countries
• More than 124,000 direct downloads from the OSU site
• Empowering many TOP500 clusters:
  • 11th-ranked 81,920-core cluster (Pleiades) at NASA
  • 14th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
  • 40th-ranked 62,976-core cluster (Ranger) at TACC
• Available with the software stacks of many IB, 10/40GigE, and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
• Also supports the uDAPL device (for networks supporting uDAPL)
• http://mvapich.cse.ohio-state.edu/

Point-to-point MPI: Latency
[Figure: MPI-level latency (µs) vs. message size (0 B to 4 KB) for RoCE (10 GigE), RoCE (40 GigE), IB (QDR), and IB (FDR).]
• MPI-level latency benchmark (OMB: osu_latency)
• IB FDR provides the best performance
• IB QDR gives better latency than 10/40 GigE

Point-to-point MPI: Bandwidth
[Figure: MPI-level bandwidth (million bytes/s) vs. message size for the same four configurations; improvements of 86% and 298% are highlighted.]
• MPI-level bandwidth benchmark (OMB: osu_bw)
• 40 GigE gives better bandwidth than IB QDR
• Encoding: IB QDR (8b/10b) vs. RoCE 40 GigE (64b/66b)
• A simplified ping-pong sketch illustrating the point-to-point measurement follows this slide
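As a rough illustration of how such point-to-point numbers are obtained, the following is a simplified ping-pong in the spirit of osu_latency: ranks 0 and 1 bounce a message back and forth and report half the round-trip time. It is a stand-in rather than the actual OSU Micro-Benchmarks code, and the iteration counts, message-size range, and build/run commands are assumptions.

/* Simplified MPI ping-pong latency sketch (not the OSU Micro-Benchmarks code).
 * Build/run (assumption): mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS   1000
#define WARMUP  100

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int size = 1; size <= 4096; size *= 2) {
        char *buf = calloc(size, 1);
        double start = 0.0;

        for (int i = 0; i < ITERS + WARMUP; i++) {
            if (i == WARMUP)
                start = MPI_Wtime();          /* exclude warm-up iterations */
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0) {
            /* one-way latency = average round-trip time / 2 */
            double usec = (MPI_Wtime() - start) * 1e6 / ITERS / 2.0;
            printf("%6d bytes : %8.2f us\n", size, usec);
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}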
MPI Collective: Scatter
[Figure: MPI scatter latency (ms) vs. message size (4 KB to 1 MB) for RoCE (10 GigE), RoCE (40 GigE), IB (QDR), and IB (FDR).]
• MPI-level collective benchmark (OMB: osu_scatter)
• IB FDR provides the best performance

NAS Parallel Benchmarks (Class C)

                    FT        IS       MG       BT
  IB (QDR)          9.96 s    0.80 s   2.02 s   24.79 s
  IB (FDR)          8.80 s    0.64 s   1.98 s   24.74 s
  RoCE (10 GigE)   14.39 s    1.32 s   2.20 s   26.23 s
  RoCE (40 GigE)    9.71 s    0.71 s   1.99 s   24.83 s

• Designed to mimic the computation and data movement of CFD applications
• FT, IS: communication bound
• MG, BT: computation bound

TestDFSIO
• File system benchmark that measures the I/O performance of HDFS
• HDFS write is more network sensitive than HDFS read, which occurs locally on a node in most cases
• In a sequential write, each map task opens a file and writes a specific amount of data to it
• A single reduce task aggregates the results of all the map tasks
Protocol
• We start two map tasks, each writing a file to three DataNodes
• We vary the file size from 1 GB to 10 GB
• We measure the sequential-write throughput reported by TestDFSIO

HDFS Write Operation
[Figure: sequential-write throughput (MB/s) vs. file size (1 GB to 10 GB) for IPoIB (FDR), IPoIB (QDR), Sockets (40 GigE), and Sockets (10 GigE).]
• Thanks to the higher bandwidth of IPoIB (FDR), sequential write achieves better throughput than IPoIB (QDR) for all file sizes
• Up to 19% benefit for IPoIB (FDR) over IPoIB (QDR)
• Sequential-write throughput improves by up to 31% with Sockets (40 GigE) compared to Sockets (10 GigE)

HBase Evaluation
• Using YCSB as our workload, we perform 100% Get, 100% Put, and a 50% Get / 50% Put mix
• The HBase Get operation requires less network communication
• HBase Put creates more network traffic (all data are written to both the MemStore and HDFS)
• The Get-Put mix also generates network traffic (some old data in the MemStore are replaced by new ones each time)
• Three RegionServers are used
• The RegionServers communicate with the master (HDFS NameNode) and the HBase client through the underlying interconnect
• RegionServers are usually configured to reside on the same nodes as the HDFS DataNodes to improve data locality
• For these workloads, 320,000 records are inserted into and read from HBase

HBase Get and Put Throughput
[Figure: throughput (kilo ops/s) for Pure-Get, Get-Put-Mix, and Pure-Put operations with IPoIB (FDR), IPoIB (QDR), Sockets (40 GigE), and Sockets (10 GigE).]
• 9% benefit for IPoIB (FDR) with the Get-Put mix
• Up to 10% benefit for IPoIB (FDR) with 100% Put
• Overall, IPoIB (FDR) is 25% better than Sockets (40 GigE)

Performance Characterization
• Network-level performance: IB FDR > RoCE 40 GigE > IB QDR > RoCE 10 GigE
• HPC application performance: IB FDR > RoCE 40 GigE > IB QDR > RoCE 10 GigE
• Cloud computing middleware performance: IPoIB FDR > IPoIB QDR > Sockets 40 GigE > Sockets 10 GigE

4 Conclusion

Conclusion
• Carried out a comprehensive performance evaluation of four possible modes of communication
• The latest InfiniBand FDR interconnect gives the best performance
• At the network level and for HPC applications, RoCE 40 GigE performs better than IB QDR
• For cloud computing middleware, IPoIB QDR performs better than RoCE 40 GigE

Thanks for your attention. Questions?
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH web page: http://mvapich.cse.ohio-state.edu/