Creating an Exascale Ecosystem for Science
Presented to: HPC Saudi 2017
Jeffrey A. Nichols, Associate Laboratory Director, Computing and Computational Sciences
March 14, 2017
ORNL is managed by UT-Battelle for the US Department of Energy

Our vision: sustain leadership and scientific impact in computing and computational sciences
• Provide the world's most powerful open resources for scalable computing and simulation, data and analytics at any scale, and scalable infrastructure for science
• Follow a well-defined path for maintaining world leadership in these critical areas
• Attract the brightest talent and partnerships from all over the world
• Deliver leading-edge science relevant to missions of DOE and key federal and state agencies
• Invest in cross-cutting partnerships with industry
• Provide unique opportunities for innovation based on multiagency collaboration
• Invest in education and training

The Oak Ridge Leadership Computing Facility (OLCF) is one of the world's most powerful computing facilities
• Titan: 27 PF/s peak performance, 710 TB memory, 240 GB/s disk bandwidth, 5,000 square feet, 8.8 MW power
• Gaea: 1.1 PF/s peak performance, 240 TB memory, 104 GB/s disk bandwidth, 1,600 square feet, 2.2 MW power
• Darter: 240.9 TF/s peak performance, 22.6 TB memory, 30 GB/s disk bandwidth
• Beacon: 210 TF/s peak performance, 12 TB memory, 56 GB/s disk bandwidth
• Data storage: Spider file system (40 PB capacity, >1 TB/s bandwidth) and HPSS archive (240 PB capacity, 6 tape libraries); see the I/O sketch below
• Data analytics/visualization: LENS cluster, Ewok cluster, EVEREST visualization facility, uRiKA data appliance
• Networks: ESnet (100 Gbps), Internet2 (100 Gbps), private dark fibre
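High-bandwidth parallel file systems such as Spider are typically driven through collective parallel I/O rather than one file per process. The sketch below is a minimal, generic MPI-IO example of that pattern in C; the file name, block size, and data layout are illustrative assumptions, not OLCF settings.

```c
/* Minimal sketch: each MPI rank writes its block of a distributed array
 * to one shared file with collective MPI-IO, the usual way applications
 * exploit a parallel file system. File name and block size are assumed. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const MPI_Offset block = 1 << 20;           /* 1 Mi doubles per rank (assumed) */
    double *buf = malloc(block * sizeof(double));
    for (MPI_Offset i = 0; i < block; i++)
        buf[i] = (double)rank;                  /* fill with rank id for illustration */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "snapshot.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: rank r writes at byte offset r * block * sizeof(double). */
    MPI_Offset offset = (MPI_Offset)rank * block * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, (int)block, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```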
Our Compute and Data Environment for Science (CADES) provides a shared infrastructure to help solve big science problems
• Data-generating facilities and programs served include the Spallation Neutron Source, the Leadership Computing Facility, the Center for Nanophase Materials Sciences, ALICE, Basic Energy Sciences, and Atmospheric Radiation Measurement
• Shared infrastructure (file systems, networking, etc.) ties together the Cray XK7, CADES and UT-CADES resources, condos, clusters, and hybrid clouds, the shared-memory SGI Ultraviolet (UV), Urika graph-analytics platforms (GX), and future technologies beyond Moore's law

CADES connects the modes of discovery
• In silico investigation: (1) Model: capturing the physical processes; (2) Formulation: mapping to solvers and computing; (3) Execution: the forward process in simulation, or iteration for convergence
• Empirical/experimental investigation: (A) Experiment design: synthesis or control; (B) Alignment: data capture into staged structures; (C) Analytics at scale: machine learning

CADES deployment (CADES Open, CADES Moderate, and OIC)
• ~6,000 cores of integrated condos on InfiniBand and ~5,000 cores of hybrid, expandable cloud
• ~5,000 cores of integrated condos on InfiniBand with an attested PHI enclave, integrated with UCAMS and XCAMS
• ~10,000 OIC cores, several ORNL projects on OIC, and several other smaller projects
• ~3,000 cores of XK7 Cray condos
• SGI UV and Urika-GD/XA/GX platforms
• 5 PB+ high-speed storage and object store
• Hybrid cloud, unique heterogeneous platforms, large-scale storage, and high-speed interconnects

Big Compute + Analytics (OLCF and CADES) coupled to big science data: the BEAM architecture
• BEAM user tier
• CADES BEAM web and data tier: HTTPS services, storage, and a MySQL database
• Scientific instrument tier: scanning tunneling microscopy (STM), scanning probe microscopy (SPM), scanning transmission electron microscopy (STEM), and IFIR/CNMS resources, with high-speed secure transfer of data and artifacts
• Supercomputing tier: CADES cluster computing

Distributed cloud-based architecture: a scanning probe microscope feeds a CADES VM web/data server, CADES compute clusters, and CADES data storage, with access to the DOE HPC cloud (Titan, Edison, and Hopper)

ORNL's computing ecosystem must integrate data analysis and simulation capabilities
• Simulation and data are both critical to DOE, and both need more computing capability
• Both have similar hardware technology requirements: high bandwidth to memory, efficient processing, and very fast I/O
• A different machine balance may be required
• Big data: analyzing and managing large, complex data sets from experiments, observation, or simulation and sharing them with a community
• Simulation: used to implement theory; helps with understanding and prediction

2017 OLCF Leadership System: "The Smartest Supercomputer on the Planet"
Hybrid CPU/GPU architecture
• Vendor: IBM (prime) / NVIDIA / Mellanox Technologies
• At least 5× Titan's application performance
• Total system memory >6 PB (HBM, DDR, and non-volatile)
• Dual-rail Mellanox InfiniBand full, non-blocking fat-tree interconnect
• IBM Elastic Storage (GPFS): 2.5 TB/s I/O and 250 PB disk capacity
Approximately 4,600 nodes, each with:
• Multiple IBM POWER9 CPUs and multiple NVIDIA Tesla GPUs using the NVIDIA Volta architecture
• CPUs and GPUs connected with high-speed NVLink
• Large coherent memory: over 512 GB (HBM + DDR4), all directly addressable from the CPUs and GPUs
• An additional 800 GB of NVRAM, configurable as either a burst buffer or extended memory
• Over 40 TF peak performance

Summit will replace Titan as the OLCF's leadership supercomputer
Feature comparison (Titan → Summit):
• Application performance: baseline → 5-10× Titan
• Number of nodes: 18,688 → ~4,600
• Node performance: 1.4 TF → >40 TF
• Memory per node: 38 GB DDR3 + 6 GB GDDR5 → 512 GB DDR4 + HBM
• NV memory per node: 0 → 800 GB
• Total system memory: 710 TB → >6 PB (DDR4 + HBM + non-volatile)
• System interconnect (node injection bandwidth): Gemini (6.4 GB/s) → dual-rail EDR-IB (23 GB/s) or dual-rail HDR-IB (48 GB/s)
• Interconnect topology: 3D torus → non-blocking fat tree
• Processors per node: 1 AMD Opteron + 1 NVIDIA Kepler → 2 IBM POWER9 + 6 NVIDIA Volta
• File system: 32 PB at 1 TB/s (Lustre) → 250 PB at 2.5 TB/s (GPFS)
• Peak power consumption: 9 MW → 13 MW
In short: many fewer but much more powerful nodes, much more memory per node and in total, a faster interconnect, much higher bandwidth between CPUs and GPUs, and a much larger and faster file system
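Applications exploit such hybrid CPU/GPU nodes by offloading compute-intensive loops to the accelerators while the CPUs orchestrate the run. The sketch below is a minimal, generic OpenACC offload example in C, not code from any OLCF application; the kernel and problem size are illustrative assumptions.

```c
/* Minimal sketch of GPU offload on a hybrid CPU/GPU node using OpenACC.
 * The vector update is chosen only to illustrate the offload pattern;
 * the problem size is an arbitrary assumption. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1 << 24;                    /* ~16 M elements (assumed) */
    double *x = malloc(n * sizeof(double));
    double *y = malloc(n * sizeof(double));

    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Copy x in, copy y both ways, and run the loop on the accelerator.
     * On nodes with coherent CPU+GPU memory the data clauses can often be
     * relaxed, but keeping them explicit stays portable to Titan-style nodes. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (long i = 0; i < n; i++)
        y[i] = 2.5 * x[i] + y[i];              /* DAXPY-style update */

    printf("y[0] = %f\n", y[0]);               /* expect 4.5 */
    free(x);
    free(y);
    return 0;
}
```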
ECP aims to transform the HPC ecosystem and make major contributions to the nation
• Develop applications that will tackle a broad spectrum of mission-critical problems of unprecedented complexity with unprecedented performance
• Contribute to the economic competitiveness of the nation
• Support national security
• Develop a software stack, in collaboration with vendors, that is exascale-capable and usable on smaller systems by industry and academia
• Train a large cadre of computational scientists, engineers, and computer scientists who will be an asset to the nation long after the end of ECP
• Partner with vendors to develop computer architectures that support exascale applications
• Revitalize the US HPC vendor industry
• Demonstrate the value of comprehensive co-design

The ECP Plan of Record
• A 7-year project that follows the holistic/co-design approach and runs through 2023 (including 12 months of schedule contingency)
• Enable an initial exascale system based on an advanced architecture, delivered in 2021
• Enable capable exascale systems, based on ECP R&D, delivered in 2022 and deployed in 2023 as part of NNSA and SC facility upgrades
• Acquisition of the exascale systems is outside the ECP scope; it will be carried out by the DOE-SC and NNSA-ASC supercomputing facilities

Transition to a higher trajectory with advanced architecture
[Figure: computing capability versus time (2017-2027), showing a transition from the current trajectory to a higher one, with the first exascale advanced-architecture system in 2021, capable exascale systems in 2022-2023, and 5X and 10X capability markers.]

Reaching the elevated trajectory will require advanced and innovative architectures
• Advanced architectures must make a big leap in parallelism, memory and storage, reliability, and energy consumption
• The exascale advanced-architecture developments benefit all future U.S. systems on the higher trajectory
• The exascale advanced architecture will also need to solve emerging data science and machine learning problems in addition to traditional modeling and simulation applications

ECP follows a holistic approach that uses co-design and integration to achieve capable exascale
• Application Development: science and mission applications and applications co-design
• Software Technology: a scalable and productive software stack, including programming models, development environment, and runtimes; tools; math libraries and frameworks; data management, memory, I/O, burst buffer, and file system; system software for resource management, threading, scheduling, monitoring, and control; node OS and runtimes; workflows; and correctness, resilience, data analysis, and visualization
• Hardware Technology: hardware technology elements and the hardware interface
• Exascale Systems: integrated exascale supercomputers
ECP's work encompasses applications, system software, hardware technologies and architectures, and workforce development

Planned outcomes of the ECP
• Important applications running at exascale in 2021, producing useful results
• A full suite of mission and science applications ready to run on the 2023 capable exascale systems
• A large cadre of computational scientists, engineers, and computer scientists with deep expertise in exascale computing, who will be an asset to the nation long after the end of ECP
• An integrated software stack that supports exascale applications
• Results of PathForward R&D contracts with vendors that are integrated into exascale systems and are in vendors' product roadmaps
• Industry and mission-critical applications prepared for a more diverse and sophisticated set of computing technologies, carrying U.S. supercomputing well into the future

The Oak Ridge Leadership Computing Facility is on a well-defined path to exascale
Titan and beyond: hierarchical parallelism with very powerful nodes. Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism: MPI across nodes plus thread-level parallelism through OpenACC or OpenMP, plus vectors (a minimal sketch follows the timeline below). Jaguar scaled to 300,000 cores.
• Jaguar (2010): 2.3 PF, multi-core CPU, 7 MW
• Titan (2012): 27 PF, hybrid GPU/CPU, 9 MW
• Summit (2017): 200 PF, hybrid GPU/CPU, 13 MW
• OLCF-5 (2021-2022): 5-10× Summit, ~20-50 MW
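A minimal, generic C sketch of that hierarchical model (MPI between nodes, OpenMP threads within a node, and a vectorizable inner loop) is shown below; the reduction kernel and problem size are illustrative assumptions, not an OLCF application kernel.

```c
/* Minimal sketch of hierarchical parallelism: MPI between ranks,
 * OpenMP threads within a rank, and a vectorizable inner loop.
 * Problem size and the reduction kernel are illustrative assumptions. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n = 1 << 22;                    /* local elements per rank (assumed) */
    double *a = malloc(n * sizeof(double));
    for (long i = 0; i < n; i++)
        a[i] = 1.0 / (double)(i + 1 + rank);

    /* Thread-level parallelism; 'simd' asks the compiler to vectorize the loop. */
    double local = 0.0;
    #pragma omp parallel for simd reduction(+:local)
    for (long i = 0; i < n; i++)
        local += a[i] * a[i];

    /* Node-level parallelism: combine the per-rank partial sums. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks: %d, threads per rank: %d, sum of squares: %e\n",
               nranks, omp_get_max_threads(), global);

    free(a);
    MPI_Finalize();
    return 0;
}
```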
Summary
• ORNL has a long history in high-performance computing for science, delivering many first-of-a-kind systems that were among the world's most powerful computers. We will continue this as a core competency of the laboratory
• We are delivering an ecosystem focused on the integration of computing and data into instruments of science and engineering
• This ecosystem delivers important, time-critical science with enormous impacts

Questions?