Environment and Climate Change Canada HPC Renewal Project: Procurement Results
17th Workshop on HPC in Meteorology, ECMWF, Reading, UK
Alain St-Denis & Luc Corbeil
October 2016

Outline
• Background
• History
• Scope
• RFP
• Outcome

HPC Renewal for ECCC – Background
• Environment Canada is highly dependent on HPC in the delivery of its mandate: the simulation of environmental forecasts for the health, safety, security and economic well-being of Canadians.
• The contract with IBM is expiring, with few remaining options to extend.
• Linked to the Meteorological Service of Canada (MSC) Renewal Treasury Board Submission:
  ♦ Component 1: Monitoring Networks
  ♦ Component 2: Supercomputing Capacity
  ♦ Component 3: Weather Warnings and Forecast System
• Joint ECCC–SSC submission for Supercomputing Capacity.

New Player: Shared Services Canada
• Created in 2012 to take responsibility for email, networks and data centres for the whole Government of Canada.
• The supercomputing IT people working for ECCC were transferred to SSC.
• The scope of the HPC team expanded to all science departments.
• As in any reorganization, there are challenges and opportunities!

Shared Services Canada – Our Mandate
Shared Services Canada was formed to consolidate and streamline the delivery of IT infrastructure services, specifically email, data centre and network services. Our mandate is to do this so that federal organizations and their stakeholders have access to reliable, efficient and secure IT infrastructure services at the best possible value.
[Diagram: SSC Services support Departmental Programs, which in turn support Service to Canadians.]
SSC will innovate, ensure full value for money and achieve service excellence!
A Bit of History
• ECCC has been using a supercomputer for weather forecasting and atmospheric science for more than half a century.
[Chart: peak and sustained performance by year, in millions of floating point operations per second (log scale), from the Bendix G20, IBM 360/65, CDC 7600 and 176, and Cray X-MP systems, through the NEC SX-3, SX-4, SX-5 and SX-6 series, to the IBM Power4, Power5 and Power7.]

A Bit of (More Recent) History
• Request for Information (Fall 2012)
• Invitation to Qualify (Fall 2013, 4 bidders qualified)
• Review and Refine Requirements (Summer 2014)
• Request for Proposal (November 2014 – June 2015)
• Treasury Board Approval (April 2016)
• Contract Award (May 27, 2016)

Scope
In replacement of:
• Supercomputer clusters: two 8,192-core Power7 clusters
• Pre/Post-Processing clusters (PPP): two 640-core custom x86 clusters
• Global Parallel Storage (Site-Store): CNFS and ESS clusters
• Near-Line Storage (HP-NLS): StorNext-based archiving cluster
• Home directories: NetApp home directories
As well as:
• Hosting of the Solution
• High Performance Interconnects
• Software & tools
• Maintenance & Support
• Training & Conversion support
• On-going High Availability

ECCC Supercomputing Procurement Requirements
• Contract for a Hosted HPC Solution: 8.5 years + one 2.5-year option (transition year + two upgrades + one optional upgrade)
• Flexible options for additional needs
• No more than 70 km between Hall A, Hall B and Dorval
• On-going availability
• Connectivity between the HPC Solution Data Halls and the Dorval NCF
[Diagram: Solution Data Hall A and Solution Data Hall B, connected by redundant inter-hall links (x2).]

High Level Architecture
[Diagram: Data Flow – Logical View. Each solution data hall hosts a supercomputer, a pre/post-processing cluster, home, scratch and site-store file systems, and an HP-NLS cache; the two halls are connected for HPN data transfer, storage synchronization and out-of-band management, with data feeds to and from the NCF and SCF.]

Outcome
• IBM was awarded the contract
  ♦ Evaluation based on benchmark performance on a fixed budget
• IBM's proposal for the initial system:
  ♦ Supercomputer: Cray XC40, Intel Broadwell, Sonexion Lustre storage
  ♦ PPP: Cray CS400, Intel Broadwell
  ♦ Site-Store and Homes: IBM Elastic Storage Server (ESS, GPFS-based)
  ♦ HP-NLS: based on IBM High Performance Storage System (HPSS)

Sizing
• Computing: about 35,000 Intel Broadwell cores per data hall
  ♦ Supercomputer and PPP combined
• More than 40 PB of disk storage (a quick arithmetic check follows the comparison chart below):
  ♦ 2.5 PB scratch storage per supercomputer (one per data hall)
  ♦ 18 PB site store per data hall
  ♦ 1.1 PB disk cache in front of the archive per data hall
• More than 230 PB of tape storage (two copies)

Comparison
[Chart: increase factors (0–6) of the new solution versus the current systems, for HP-NLS storage (vs current tape capacity), scratch storage (vs P7), site-store and homes storage (vs current), core count, sustained TFlops and peak TFlops (supercomputer and PPP, vs P7 and the current PPP).]
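As referenced in the Sizing slide, a back-of-the-envelope check that the quoted per-hall disk figures are consistent with the "more than 40 PB" total. The figures come from the slide; the variable names are ours:

```python
# Disk storage quoted per solution data hall (from the Sizing slide).
scratch_pb = 2.5        # scratch per supercomputer (one per hall)
site_store_pb = 18.0    # site store per hall
archive_cache_pb = 1.1  # disk cache in front of the HP-NLS archive, per hall

per_hall_pb = scratch_pb + site_store_pb + archive_cache_pb
total_pb = 2 * per_hall_pb  # two solution data halls

print(f"{per_hall_pb:.1f} PB per hall, {total_pb:.1f} PB total")
# Output: 21.6 PB per hall, 43.2 PB total -- consistent with "more than 40 PB".
```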
The Newest Addition to a Long History
[Chart: Historical Performance, EC Supercomputers – peak and sustained performance on a log scale, from the Bendix G20, IBM 360/65, CDC 7600 and 176, Cray 1S, X-MP/28 and X-MP/416, through the NEC SX-3/44, SX-3/44R, SX-4/16, SX-4/80M3, SX-5/32M2 and SX-6/80M10, to the IBM P4, P5, P7 and the new IBM/Cray XC40.]

Resulting Architecture
[Diagram of the resulting architecture.]

HPC Implementation Milestones: Delivery to Acceptance
• Data Hall and Hosting Site Certification Inspection
• Functionality Testing (IT infrastructure)
• Security Accreditation
• Performance Testing
• Conversion of operational codes (Automated Environmental Analysis & Production, AEAPPS)
• Meeting the above triggers a 30-day availability test
[Timeline: Functionality Testing → Performance Testing → Conversion → RFU → Acceptance]

Challenge
• Change the supercomputer clusters, the PPP clusters, the archiving system and the homes, all at once. This has never been done.
• A lot of preparation work has been done ahead of time:
  ♦ Most codes have already been ported to the Intel architecture
  ♦ Our General Purpose Science Clusters are available for PPP migration work
    – Linux containers are being leveraged to smooth the transition (a sketch follows the closing slide)

Thank you! Questions?
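As an aside on the container point in the Challenge slide: a minimal sketch of how a legacy pre/post-processing job might be wrapped in a Linux container so that its runtime environment travels with it onto the new clusters. The image name, mount points and job script here are illustrative assumptions, not the actual ECCC setup.

```python
# Hypothetical sketch: run a legacy PPP job inside a Linux container so the
# old runtime environment follows the job during migration. Image name,
# mount points and job script are illustrative, not ECCC's configuration.
import subprocess

def run_in_container(job_script, image="ppp-legacy:latest"):
    """Launch a job script in a container, bind-mounting home and site store."""
    cmd = [
        "docker", "run", "--rm",
        "-v", "/home:/home",              # home directories
        "-v", "/site-store:/site-store",  # shared parallel storage
        image,
        "bash", job_script,
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    rc = run_in_container("/home/ops/jobs/post_process.sh")
    print(f"container job exited with status {rc}")
```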