
Creating an Exascale
Ecosystem for Science
Presented to:
HPC Saudi 2017
Jeffrey A. Nichols
Associate Laboratory Director
Computing and Computational Sciences
March 14, 2017
ORNL is managed by UT-Battelle
for the US Department of Energy
Our vision
Sustain leadership and scientific impact in computing
and computational sciences
• Provide world’s most powerful open resources for scalable computing
and simulation, data and analytics at any scale, and scalable
infrastructure for science
• Follow a well-defined path for maintaining world leadership
in these critical areas
• Attract the brightest talent and partnerships from all over the world
• Deliver leading-edge science relevant to missions
of DOE and key federal and state agencies
• Invest in cross-cutting partnerships
with industry
• Provide unique opportunity for innovation
based on multiagency collaboration
• Invest in education and training
Oak Ridge Leadership Computing Facility
(OLCF) is one of the world’s most powerful
computing facilities
Titan
• Peak performance: 27 PF/s
• Memory: 710 TB
• Disk bandwidth: 240 GB/s
• Square feet: 5,000
• Power: 8.8 MW

Gaea
• Peak performance: 1.1 PF/s
• Memory: 240 TB
• Disk bandwidth: 104 GB/s
• Square feet: 1,600
• Power: 2.2 MW

Darter
• Peak performance: 240.9 TF/s
• Memory: 22.6 TB
• Disk bandwidth: 30 GB/s

Beacon
• Peak performance: 210 TF/s
• Memory: 12 TB
• Disk bandwidth: 56 GB/s

Data storage
• Spider file system
− 40 PB capacity
− >1 TB/s bandwidth
• HPSS archive
− 240 PB capacity
− 6 tape libraries

Data analytics/visualization
• LENS cluster
• Ewok cluster
• EVEREST visualization facility
• uRiKA data appliance

Networks
• ESnet, 100 Gbps
• Internet2, 100 Gbps
• Private dark fibre
Our Compute and Data Environment for Science
(CADES) provides a shared infrastructure to help
solve big science problems
• Facilities and programs served: Spallation Neutron Source, Leadership Computing Facility, Center for Nanophase Materials Sciences, Basic Energy Sciences, Atmospheric Radiation Measurement, ALICE, etc.
• Shared infrastructure: file systems, networking, etc.
• Compute resources: XK7 Cray condos; Cray GX for graph analytics; condos, clusters, and hybrid clouds; shared-memory Ultraviolet (UV); UT CADES; future technologies beyond Moore's Law
CADES connects the modes of
discovery
In silico investigation
1. Model: capturing the physical processes
2. Formulation: mapping to solvers and computing
3. Execution: forward process in simulation, or iteration for convergence

Empirical/experimental investigation
A. Experiment design: synthesis or control
B. Alignment: data capture into staged structures
C. Analytics at scale: machine learning
CADES Deployment
• CADES Open
• CADES Moderate
• OIC: ~10,000 OIC cores, hosting several ORNL projects and several other smaller projects
• Condo and cloud resources: ~6,000 and ~5,000 cores of integrated condos on InfiniBand; ~5,000 cores of hybrid, expandable cloud; ~3,000 cores of XK7 Cray condos
• Unique heterogeneous platforms: SGI UV, Urika-GD/XA (GX)
• Storage and networking: 5 PB+ high-speed storage, object store, large-scale storage, high-speed interconnects
• Security and access: attested PHI enclave, integrated with UCAMS and XCAMS
Big Compute + Analytics (OLCF and CADES)
coupled to Big Science Data
• BEAM user tier: users connect over HTTPS
• BEAM web and data tier (CADES): storage and MySQL database
• Scientific instrument tier (IFIR/CNMS resources): Scanning Tunneling Microscopy (STM), Scanning Probe Microscopy (SPM), and Scanning Transmission Electron Microscopy (STEM); local data/artifacts move to CADES over high-speed secure data transfer
• Supercomputing tier: CADES cluster computing
Distributed cloud-based architecture
• Scanning probe microscope
• CADES VM web/data server
• CADES compute clusters
• CADES data storage
• DOE HPC cloud: Titan, Edison, and Hopper
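The tiers above move data over HTTPS and high-speed secure transfer, but the slides themselves show no code. Purely as a hypothetical sketch, assuming a libcurl client and a made-up upload endpoint (the URL, file name, and choice of library are illustrative and not part of BEAM), an instrument-side push to the web/data tier might look like this:

```c
/* Hypothetical illustration only: pushing one instrument data file to a
 * web/data tier over HTTPS with libcurl. The endpoint URL and file name
 * are made up; this is not BEAM's actual transfer mechanism. */
#include <curl/curl.h>
#include <stdio.h>

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    FILE *fp = fopen("scan_0001.h5", "rb");   /* local data/artifact from the instrument */
    if (!fp) { curl_easy_cleanup(curl); return 1; }

    /* Hypothetical upload endpoint on the web/data server. */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://beam-data.example.org/upload/scan_0001.h5");
    curl_easy_setopt(curl, CURLOPT_UPLOAD, 1L);     /* send the body with HTTP PUT */
    curl_easy_setopt(curl, CURLOPT_READDATA, fp);   /* default read callback streams the file */

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "transfer failed: %s\n", curl_easy_strerror(rc));

    fclose(fp);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```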
ORNL’s computing ecosystem must
integrate data analysis and simulation
capabilities
• Simulation and data are critical to DOE
• Both need more computing capability
• Both have similar hardware technology requirements
– High bandwidth to memory
– Efficient processing
– Very fast I/O
• Different machine balance may be required
• Computing connects experiment and theory
– Big data: analyzing and managing large, complex data sets from experiments, observation, or simulation and sharing them with a community
– Simulation: used to implement theory; helps with understanding and prediction
2017 OLCF Leadership System
“The Smartest Supercomputer on the Planet”
Hybrid CPU/GPU
architecture
• Vendor: IBM (Prime) / NVIDIA™ /
Mellanox Technologies®
• At least 5X Titan’s Application
Performance
• Total System Memory >6 PB
HBM, DDR, and non-volatile
• Dual-rail Mellanox® InfiniBand full,
non-blocking fat-tree interconnect
• IBM Elastic Storage (GPFS™) – 2.5
TB/s I/O and 250 PB disk capacity
Approximately 4,600 nodes,
each with:
• Multiple IBM POWER9 CPUs and
multiple NVIDIA Tesla® GPUs using
the NVIDIA Volta architecture
• CPUs and GPUs connected with
high speed NVLink
• Large coherent memory: over
512 GB (HBM + DDR4)
– all directly addressable from the
CPUs and GPUs
• An additional 800 GB of NVRAM,
which can be configured as
either a burst buffer or as extended
memory
• Over 40 TF peak performance
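The node above is described only in hardware terms. As a purely illustrative sketch of how an application might spread one loop across the multiple GPUs of such a hybrid node, here is a minimal OpenACC example in C; the array size, the kernel, and the per-GPU slicing are assumptions for illustration, not Summit-specific or OLCF-provided code.

```c
/* Illustrative sketch: slicing one loop across the GPUs of a single hybrid
 * node with OpenACC. Problem size and kernel are placeholders. */
#include <openacc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int N = 6 << 20;                 /* illustrative problem size */
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int ngpus = acc_get_num_devices(acc_device_nvidia);
    int nslices = (ngpus > 0) ? ngpus : 1; /* fall back to one host slice if no GPU is found */

    /* Give each GPU a contiguous slice of the arrays and launch asynchronously. */
    for (int g = 0; g < nslices; ++g) {
        if (ngpus > 0)
            acc_set_device_num(g, acc_device_nvidia);
        int lo  = (int)((long)N * g / nslices);
        int len = (int)((long)N * (g + 1) / nslices) - lo;
        #pragma acc parallel loop copyin(x[lo:len]) copy(y[lo:len]) async(1)
        for (int i = lo; i < lo + len; ++i)
            y[i] += 2.0f * x[i];
    }

    /* Wait for every device to finish before touching y on the host. */
    for (int g = 0; g < ngpus; ++g) {
        acc_set_device_num(g, acc_device_nvidia);
        #pragma acc wait
    }

    printf("y[0] = %f (expected 4.0)\n", y[0]);
    free(x); free(y);
    return 0;
}
```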
Summit will replace Titan as the OLCF’s
leadership supercomputer
Feature | Titan | Summit
Application performance | Baseline | 5-10x Titan
Number of nodes | 18,688 | ~4,600
Node performance | 1.4 TF | >40 TF
Memory per node | 38 GB DDR3 + 6 GB GDDR5 | 512 GB DDR4 + HBM
NV memory per node | 0 | 800 GB
Total system memory | 710 TB | >6 PB DDR4 + HBM + non-volatile
System interconnect (node injection bandwidth) | Gemini (6.4 GB/s) | Dual-rail EDR-IB (23 GB/s) or dual-rail HDR-IB (48 GB/s)
Interconnect topology | 3D torus | Non-blocking fat tree
Processors | 1 AMD Opteron™ + 1 NVIDIA Kepler™ | 2 IBM POWER9™ + 6 NVIDIA Volta™
File system | 32 PB, 1 TB/s, Lustre® | 250 PB, 2.5 TB/s, GPFS™
Peak power consumption | 9 MW | 13 MW

Compared with Titan, Summit has:
• Many fewer nodes
• Much more powerful nodes
• Much more memory per node and total system memory
• A faster interconnect
• Much higher bandwidth between CPUs and GPUs
• A much larger and faster file system
ECP aims to transform the HPC ecosystem
and make major contributions to the nation
• Develop applications that will tackle a broad spectrum of mission critical
problems of unprecedented complexity with unprecedented performance
• Contribute to the economic competitiveness of the nation
• Support national security
• Develop a software stack, in collaboration with vendors, that is exascale-capable and is usable on smaller systems by industry and academia
• Train a large cadre of computational scientists, engineers, and computer
scientists who will be an asset to the nation long after the end of ECP
• Partner with vendors to develop computer architectures that support exascale
applications
• Revitalize the US HPC vendor industry
• Demonstrate the value of comprehensive co-design
The ECP Plan of Record
• A 7-year project that follows the holistic/co-design approach,
which runs through 2023 (including 12 months of schedule
contingency)
• Enable an initial exascale system based on advanced
architecture and delivered in 2021
• Enable capable exascale systems, based on ECP R&D,
delivered in 2022 and deployed in 2023 as part of NNSA
and SC facility upgrades
• Acquisition of the exascale systems is outside the ECP
scope and will be carried out by the DOE-SC and NNSA-ASC
supercomputing facilities
Transition to higher trajectory with
advanced architecture
Chart: computing capability versus time, 2017-2027. The first exascale advanced-architecture system arrives in 2021; capable exascale systems follow in 2022-2023, lifting computing capability onto a roughly 5X-10X higher trajectory.
Reaching the elevated trajectory will require
advanced and innovative architectures
To reach the elevated trajectory, advanced architectures must be developed that make a big leap in:
– Parallelism
– Memory and storage
– Reliability
– Energy consumption

The exascale advanced-architecture developments benefit all future U.S. systems on the higher trajectory.

In addition to traditional modeling and simulation applications, the exascale advanced architecture will need to solve emerging data science and machine learning problems.
ECP follows a holistic approach that uses
co-design and integration to achieve capable
exascale
ECP comprises four integrated focus areas:
• Application Development: science and mission applications
• Software Technology: a scalable and productive software stack
• Hardware Technology: hardware technology elements
• Exascale Systems: integrated exascale supercomputers

The co-design diagram spans the full stack: applications co-design; correctness, visualization, and data analysis; programming models, development environment, and runtimes; math libraries and frameworks; tools; workflows; resilience; system software (resource management, threading, scheduling, monitoring, and control); memory and burst buffer; data management, I/O, and file system; node OS and runtimes; and the hardware interface.

ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development
Planned outcomes of the ECP
• Important applications running at exascale in 2021, producing useful
results
• A full suite of mission and science applications ready to run on the
2023 capable exascale systems
• A large cadre of computational scientists, engineers, and computer
scientists with deep expertise in exascale computing, who will be an
asset to the nation long after the end of ECP
• An integrated software stack that supports exascale applications
• Results of PathForward R&D contracts with vendors that are
integrated into exascale systems and are in vendors’ product
roadmaps
• Industry and mission-critical applications prepared for a more diverse and sophisticated set of computing technologies, carrying U.S. supercomputing well into the future
The Oak Ridge Leadership Computing
Facility is on a well-defined path to
exascale
Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Titan and beyond rely on hierarchical parallelism with very powerful nodes, programmed with MPI plus thread-level parallelism through OpenACC or OpenMP, plus vectors (a minimal sketch follows the list below).
• Jaguar (2010): 2.3 PF, multi-core CPU, 7 MW; scaled to 300,000 cores
• Titan (2012): 27 PF, hybrid GPU/CPU, 9 MW
• Summit (2017): 200 PF, hybrid GPU/CPU, 13 MW
• OLCF-5 (2021-2022): 5-10× Summit, ~20-50 MW
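The programming model named above combines MPI across nodes with thread-level parallelism through OpenACC or OpenMP within a node. The following minimal sketch in C illustrates that combination; the vector length and the axpy-style kernel are assumptions for illustration, not code taken from any OLCF application.

```c
/* Minimal sketch of hierarchical parallelism: MPI across nodes,
 * OpenACC (or OpenMP) offload within a node. Sizes and kernel are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;                 /* per-rank local problem size (illustrative) */
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* Thread/vector-level parallelism on the node (GPU offload on Titan/Summit). */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] += 3.0 * x[i];

    /* MPI handles the distributed-memory level across nodes. */
    double local = 0.0, global = 0.0;
    for (int i = 0; i < N; ++i) local += y[i];
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global);

    free(x); free(y);
    MPI_Finalize();
    return 0;
}
```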
Summary
• ORNL has a long history in high-performance computing for
science, delivering many first-of-a-kind systems that were among
the world's most powerful
computers. We will continue this
as a core competency of the
laboratory
• Delivering an ecosystem focused
on the integration of computing
and data into instruments of
science and engineering
• This ecosystem delivers
important, time-critical science
with enormous impacts
Questions?