OpenMPI Cray XE

Performance Evaluation of Open MPI
on Cray XE/XK Systems
Samuel K. Gutierrez – LANL
Nathan T. Hjelm – LANL
Manjunath Gorentla Venkata – ORNL
Richard L. Graham – ORNL
Hot Interconnects 2012
Aug 23, 2012
UNCLASSIFIED
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 1
A Collaborative Effort
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 2
Outline
1. Open MPI Overview
2. Gemini Overview
3. Protocols Overview
4. Test Environment
5. Results
6. Conclusions/Future Work
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 3
First Things First – Open MPI Overview

Open-Source Implementation of the MPI-2 Standard

Developed and Maintained By
•
•
•

Academia
Industry
National Laboratories
Supports a Range of High-Performance Network APIs
•
•
•
•
•
Verbs (Infiniband, RoCE, iWarp)
PSM (QLogic/Intel HCAs)
MXM (Mellanox HCAs)
Portals (Cray SeaStar, Infiniband)
uGNI (Cray Gemini, Cray Ares)
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 4
Open MPI’s Plugin Architecture – A High-level Overview1
User Application
MPI API
Modular Component Architecture (MCA)
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 5
Open MPI’s Plugin Architecture – Main Code Sections1

Open MPI Layer (OMPI)
•

Open Run-Time Environment (ORTE)
•

MPI API and Support Logic
Run-Time System
Open Portability Access Layer (OPAL)
•
OS-Specific/Utility Code
OMPI
ORTE
OPAL
Operating System
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 6
The Gemini System Interconnect3 – An Overview

Network Used by the Cray XE and XK System Families
•
Titan, Cielo, Hopper

Successor to the Cray SeaStar* Network Interconnect

3D Torus Network Built of Gemini ASICs

Gemini ASIC
•
•
Provides 10 Torus Connections – 2 x (+X, -X, +Z, -Z) – 1 x (+Y, -Y)
Provides 2 NICs and a 48-port Router
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 7
OB1 PML High-Level Protocol Overview

Eager Message Protocol
•

Remote Get Protocol
•
•

3 Protocol Messages: RNDV + segment, RDMA, FIN
Used When Remote Get protocol is not Available
Remote Get Fallback (New)
•
•

2 Protocol Messages: RGET (ready to send + segment), FIN
Available When Registration Cache is Enabled and BTL Implements Get
RDMA Pipeline Protocol (Put)
•
•

Uses BTL buffered, inline, and in-place send protocols
Essentially a Rendezvous
Fallback Initiated by the Receiver During Remote Get Protocol if BTL Get Protocol
is not Available
Rendezvous (no RDMA)
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 8
uGNI BTL Overview

Protocols
• Send
—
—
•
Get
—
—
•
Uses FMA Or BTE
Available Only if Source And Destination Segments Are 4-Byte aligned and a
Multiple of 4-Bytes in Size
Put
—
—

In-place Send for Small Messages Directly Using Small Message Protocol
(SMSG)
Buffered Send Using Get for Larger Eager Messages (Eager Get)
Uses Fast Memory Access (FMA) or Byte Transport Engine (BTE)
No Alignment Restrictions
Lazy Connection Establishment
•
Resource Utilization Directly Related to Application Communication Characteristics
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 9
uGNI BTL Eager Get Protocol Details (Send)
BTL Send Called
Size Larger
Than SMSG
Limit?
No
Send Data Using
SMSGSendwTag
Yes
Send GET_INIT With
SMSGSendwTag
Send Complete
RDMA_ COMPLETE
Progress Remote
SMSG
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 10
uGNI BTL Eager Get Protocol Details (Receive)
OPAL Progress
Progress Remote
SMSG
GET_INIT
Size
Smaller
Than FMA
Limit
No
Start RDMA
GET
Yes
Start FMA GET
Progress Local
FMA/RDMA CQ
Get Complete
Send
RDMA_COMPLETE
with SmsgSendwTag
Receive
Complete
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 11
Vader BTL Overview

MPICH Nemesis-like Design
•
•

Copy Backend Changes Based on Message Size
•
•

E.g. bcopy [a,b) - memcpy Otherwise
User Tunable with Good Defaults
Cross-Process Memory Mapping Allows for RDMA-Like Semantics
•
•
•
•

Lock-Free Message Queues
“Fast Boxes” – I.e. Per-Peer Receive Queues for Short Messages
Copy-In/Copy-Out (CICO) Avoided
No Backing Store Required
Heavy Use of Registration Cache to Amortize Attach Latency
Exposes Both Put and Get Interfaces to PML Layer
XPMEM Support Requires Kernel Patch and User-Level Library
•
Already Available and Leveraged by Cray’s Native MPI Implementation
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 12
Test Environment

Testing Platforms
•
•

Microbenchmarks
•
•
•
•

Cielito - 1088 Core XE6
Cielo - 142,304 Core XE6
NetPIPE – Measure Lat/BW Benchmark
AMG2006 – Algebraic Multi-grid Solver
LAMMPS – Classical Molecular Dynamics Code
All Microbenchmarks Were Run on Live Production System
Launcher
• orterun
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 13
NetPIPE Latency on XE6 (on ASIC)
Netpipe Latency
4096
Vendor MPI
Open MPI
1024
Latency (usec)
256
64
16
4
1
1
32
1024
32768
Message Size (Bytes)
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
1.04858e+06
3.35544e+07
Slide 14
NetPIPE Bandwidth on XE6 (on ASIC)
Netpipe Bandwidth
6000
Vendor MPI
Open MPI
Uni-directional Bandwidth (MB/sec)
5000
4000
3000
2000
1000
0
1
32
1024
32768
Message Size (Bytes)
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
1.04858e+06
3.35544e+07
Slide 15
Microbenchmark – LAMMPS
LAMMPS ASC Benchmark for Scaled-size Lennard-Jones Liquid
98
Vendor MPI
Open MPI
Loop Time (100 Steps, Seconds)
96
94
92
90
88
86
84
82
1
4
16
64
256
Cores
1024
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
4096
16384
65536
Slide 16
Microbenchmark – AMG2006
AMG2006 ASC Benchmark (3D 7-Point Laplace Problem on a Cube)
4.5e+10
Vendor MPI
Open MPI
System Size * Iterations / Solve Phase Time
4e+10
3.5e+10
3e+10
2.5e+10
2e+10
1.5e+10
1e+10
5e+09
0
0
5000
10000
15000
20000
25000
Cores
30000
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
35000
40000
45000
50000
Slide 17
Conclusion and Ongoing/Future Work

Conclusion
• Bandwidth, Latency, and Scalability Similar to Vendor MPI Implementation

Stabilization/Optimization
•
•
•
Improve Launch Scalability (Over a Minute to Launch 131072 MPI Tasks)
Investigating New Protocols (Shared Message Queue-- MSGQ)
Reduce Memory Requirements

Improved Collective Performance Using uGNI Atomics

Work with Friendly Testers

Prepare for General Release in Open MPI 1.7.0
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 18
Thanks!
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 19
Questions?

Questions?

Comments?
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 20
References

[1] Open MPI. 13 Feb. 2012 <open-mpi.org>.

[2] R. Alverson, et al., “The Gemini System Interconnect,” in High
Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium
on, Aug. 2010, pp. 83 –87.
U N C L A S S I F I E D – LA-UR-12-24229
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
Thursday, August 23, 12
Slide 21