Performance Evaluation of Open MPI on Cray XE/XK Systems Samuel K. Gutierrez – LANL Nathan T. Hjelm – LANL Manjunath Gorentla Venkata – ORNL Richard L. Graham – ORNL Hot Interconnects 2012 Aug 23, 2012 UNCLASSIFIED Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 1 A Collaborative Effort U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 2 Outline 1. Open MPI Overview 2. Gemini Overview 3. Protocols Overview 4. Test Environment 5. Results 6. Conclusions/Future Work U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 3 First Things First – Open MPI Overview Open-Source Implementation of the MPI-2 Standard Developed and Maintained By • • • Academia Industry National Laboratories Supports a Range of High-Performance Network APIs • • • • • Verbs (Infiniband, RoCE, iWarp) PSM (QLogic/Intel HCAs) MXM (Mellanox HCAs) Portals (Cray SeaStar, Infiniband) uGNI (Cray Gemini, Cray Ares) U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 4 Open MPI’s Plugin Architecture – A High-level Overview1 User Application MPI API Modular Component Architecture (MCA) U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 5 Open MPI’s Plugin Architecture – Main Code Sections1 Open MPI Layer (OMPI) • Open Run-Time Environment (ORTE) • MPI API and Support Logic Run-Time System Open Portability Access Layer (OPAL) • OS-Specific/Utility Code OMPI ORTE OPAL Operating System U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 6 The Gemini System Interconnect3 – An Overview Network Used by the Cray XE and XK System Families • Titan, Cielo, Hopper Successor to the Cray SeaStar* Network Interconnect 3D Torus Network Built of Gemini ASICs Gemini ASIC • • Provides 10 Torus Connections – 2 x (+X, -X, +Z, -Z) – 1 x (+Y, -Y) Provides 2 NICs and a 48-port Router U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 7 OB1 PML High-Level Protocol Overview Eager Message Protocol • Remote Get Protocol • • 3 Protocol Messages: RNDV + segment, RDMA, FIN Used When Remote Get protocol is not Available Remote Get Fallback (New) • • 2 Protocol Messages: RGET (ready to send + segment), FIN Available When Registration Cache is Enabled and BTL Implements Get RDMA Pipeline Protocol (Put) • • Uses BTL buffered, inline, and in-place send protocols Essentially a Rendezvous Fallback Initiated by the Receiver During Remote Get Protocol if BTL Get Protocol is not Available Rendezvous (no RDMA) U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 8 uGNI BTL Overview Protocols • Send — — • Get — — • Uses FMA Or BTE Available Only if Source And Destination Segments Are 4-Byte aligned and a Multiple of 4-Bytes in Size Put — — In-place Send for Small Messages Directly Using Small Message Protocol (SMSG) Buffered Send Using Get for Larger Eager Messages (Eager Get) Uses Fast Memory Access (FMA) or Byte Transport Engine (BTE) No Alignment Restrictions Lazy Connection Establishment • Resource Utilization Directly Related to Application Communication Characteristics U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 9 uGNI BTL Eager Get Protocol Details (Send) BTL Send Called Size Larger Than SMSG Limit? No Send Data Using SMSGSendwTag Yes Send GET_INIT With SMSGSendwTag Send Complete RDMA_ COMPLETE Progress Remote SMSG U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 10 uGNI BTL Eager Get Protocol Details (Receive) OPAL Progress Progress Remote SMSG GET_INIT Size Smaller Than FMA Limit No Start RDMA GET Yes Start FMA GET Progress Local FMA/RDMA CQ Get Complete Send RDMA_COMPLETE with SmsgSendwTag Receive Complete U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 11 Vader BTL Overview MPICH Nemesis-like Design • • Copy Backend Changes Based on Message Size • • E.g. bcopy [a,b) - memcpy Otherwise User Tunable with Good Defaults Cross-Process Memory Mapping Allows for RDMA-Like Semantics • • • • Lock-Free Message Queues “Fast Boxes” – I.e. Per-Peer Receive Queues for Short Messages Copy-In/Copy-Out (CICO) Avoided No Backing Store Required Heavy Use of Registration Cache to Amortize Attach Latency Exposes Both Put and Get Interfaces to PML Layer XPMEM Support Requires Kernel Patch and User-Level Library • Already Available and Leveraged by Cray’s Native MPI Implementation U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 12 Test Environment Testing Platforms • • Microbenchmarks • • • • Cielito - 1088 Core XE6 Cielo - 142,304 Core XE6 NetPIPE – Measure Lat/BW Benchmark AMG2006 – Algebraic Multi-grid Solver LAMMPS – Classical Molecular Dynamics Code All Microbenchmarks Were Run on Live Production System Launcher • orterun U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 13 NetPIPE Latency on XE6 (on ASIC) Netpipe Latency 4096 Vendor MPI Open MPI 1024 Latency (usec) 256 64 16 4 1 1 32 1024 32768 Message Size (Bytes) U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 1.04858e+06 3.35544e+07 Slide 14 NetPIPE Bandwidth on XE6 (on ASIC) Netpipe Bandwidth 6000 Vendor MPI Open MPI Uni-directional Bandwidth (MB/sec) 5000 4000 3000 2000 1000 0 1 32 1024 32768 Message Size (Bytes) U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 1.04858e+06 3.35544e+07 Slide 15 Microbenchmark – LAMMPS LAMMPS ASC Benchmark for Scaled-size Lennard-Jones Liquid 98 Vendor MPI Open MPI Loop Time (100 Steps, Seconds) 96 94 92 90 88 86 84 82 1 4 16 64 256 Cores 1024 U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 4096 16384 65536 Slide 16 Microbenchmark – AMG2006 AMG2006 ASC Benchmark (3D 7-Point Laplace Problem on a Cube) 4.5e+10 Vendor MPI Open MPI System Size * Iterations / Solve Phase Time 4e+10 3.5e+10 3e+10 2.5e+10 2e+10 1.5e+10 1e+10 5e+09 0 0 5000 10000 15000 20000 25000 Cores 30000 U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 35000 40000 45000 50000 Slide 17 Conclusion and Ongoing/Future Work Conclusion • Bandwidth, Latency, and Scalability Similar to Vendor MPI Implementation Stabilization/Optimization • • • Improve Launch Scalability (Over a Minute to Launch 131072 MPI Tasks) Investigating New Protocols (Shared Message Queue-- MSGQ) Reduce Memory Requirements Improved Collective Performance Using uGNI Atomics Work with Friendly Testers Prepare for General Release in Open MPI 1.7.0 U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 18 Thanks! U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 19 Questions? Questions? Comments? U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 20 References [1] Open MPI. 13 Feb. 2012 <open-mpi.org>. [2] R. Alverson, et al., “The Gemini System Interconnect,” in High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on, Aug. 2010, pp. 83 –87. U N C L A S S I F I E D – LA-UR-12-24229 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA Thursday, August 23, 12 Slide 21
© Copyright 2026 Paperzz