Environment and Climate Change Canada HPC Renewal Project: Procurement Results
17th Workshop on HPC in Meteorology
ECMWF, Reading, UK
Alain St-Denis & Luc Corbeil
October 2016
One Team – One Culture – One Purpose – One SSC
Outline
• Background
• History
• Scope
• RFP
• Outcome
HPC Renewal for ECCC: Background
• Environment and Climate Change Canada is highly dependent on HPC to deliver its mandate: environmental forecast simulations for the health, safety, security and economic well-being of Canadians.
• Contract with IBM expiring, with few remaining options to extend.
• Linked to the Meteorological Service of Canada (MSC) Renewal Treasury Board Submission:
  – Component 1: Monitoring Networks
  – Component 2: Supercomputing Capacity
  – Component 3: Weather Warnings and Forecast System
• Joint ECCC-SSC submission for Supercomputing Capacity.
New player: Shared Services Canada
• Created in 2012 to take responsibility for email, networks and data centres for the whole Government of Canada.
• Supercomputing IT staff working for ECCC transferred to SSC.
• Scope of the HPC team expanded to all science departments.
• As in any reorganization, there are challenges and opportunities!
Shared Services Canada – Our Mandate
Shared Services Canada was formed to consolidate and streamline the delivery of IT infrastructure services, specifically email, data centre and network services. Our mandate is to do this so that federal organizations and their stakeholders have access to reliable, efficient and secure IT infrastructure services at the best possible value.
[Diagram: SSC services underpin departmental programs, which deliver service to Canadians]
SSC will Innovate, ensure full Value for Money and achieve Service Excellence!
A Bit of History
• ECCC has been using supercomputers for weather forecasting and atmospheric science for more than half a century.
[Chart: peak and sustained performance, in millions of floating-point operations per second, of EC supercomputers by year: Bendix G20, IBM 360/65, CDC 7600 and 176, Cray X-MP 28 and X-MP 4-16, NEC SX-3/44, SX-3/44R, SX-4/16, SX-4/80M3, SX-5/32M2 and SX-6/80M10, and IBM Power4, Power5 and Power7]
A Bit of (More Recent) History
• Request for Information (Fall 2012)
• Invitation to Qualify (Fall 2013, 4 bidders qualified)
• Review and Refine Requirements (Summer 2014)
• Requests for Proposal (November 2014 – June 2015)
• Treasury Board Approval (April 2016)
• Contract Award (May 27, 2016)
Scope
Scope                                   In replacement of
Supercomputer clusters                  Two 8192-core P7 clusters
Pre/Post-Processing clusters (PPP)      Two 640-core x86 custom clusters
Global Parallel Storage (Site-Store)    CNFS and ESS clusters
Near-Line Storage (HP-NLS)              StorNext-based archiving cluster
Home directories                        NetApp home directories
As well as:
• Hosting of the Solution
• High Performance Interconnects
• Software & tools
• Maintenance & Support
• Training & Conversion support
• On-going High Availability
ECCC Supercomputing Procurement Requirements
• Contract for Hosted HPC Solution: 8.5 years + one 2.5-year option (transition year + two upgrades + one optional)
• Flexible options for additional needs
• No more than 70 km between Hall A, Hall B & Dorval (see the latency sketch below)
• Connectivity between HPC Solution Data Halls and Dorval (NCF)
• On-going availability
[Diagram: Solution Data Hall A and Solution Data Hall B joined by inter-hall links (x2), each hall also linked (x2) to the NCF in Dorval]
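A note on the 70 km cap: with storage kept synchronized between the halls, fiber round-trip time bounds how tightly the sites can be coupled. A minimal back-of-the-envelope sketch in Python; the ~200,000 km/s speed of light in fiber is a standard figure, and the synchronous-replication rationale is our reading of the requirement, not stated on the slide:

    # Rough fiber propagation delay over the maximum allowed hall separation.
    # Assumes light travels at roughly 2/3 c in fiber (~200,000 km/s) and
    # ignores switching and forwarding delays, so real latencies are higher.
    FIBER_KM_PER_MS = 200.0  # ~200 km of fiber per millisecond

    def round_trip_ms(distance_km: float) -> float:
        """One network round trip over a fiber path of the given length."""
        return 2 * distance_km / FIBER_KM_PER_MS

    print(f"70 km round trip: {round_trip_ms(70):.2f} ms")  # ~0.70 ms

At that scale, propagation delay stays well under a millisecond, which keeps cross-hall storage synchronization practical.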
High Level Architecture
[Diagram: SCF data flow, logical view (LPT, HPN/DADS, SSC, 2014-10-07). Each Solution Data Hall (A and B) holds Home, Scratch, a Site Store, a Supercomputer, a Pre/Post-Processing cluster and an HP-NLS cache. The halls are connected by HPN data-transfer, storage-synchronization and out-of-band management links, with data feeds to and from the NCF.]
Outcome
• IBM was awarded the contract
  – Evaluation based on benchmark performance on a fixed budget
• IBM's proposal for the initial system:
  – Supercomputer: Cray XC-40, Intel Broadwell, Sonexion Lustre storage
  – PPP: Cray CS-400, Intel Broadwell
  – Site-Store and Homes: IBM Elastic Storage Server (ESS, GPFS-based)
  – HP-NLS: based on IBM High Performance Storage System (HPSS)
Sizing
• Computing
  – About 35,000 Intel Broadwell cores per Data Hall
    ♦ Supercomputer and PPP combined
• More than 40 PB of disk storage (tallied in the sketch below)
  – 2.5 PB scratch storage per supercomputer (one per data hall)
  – 18 PB site store per data hall
  – 1.1 PB disk cache to the archive per data hall
• More than 230 PB of tape storage (two copies)
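The per-hall disk figures are consistent with the 40 PB headline; a quick tally, assuming (as the slide states) one supercomputer and hence one scratch file system per hall:

    # Tally of disk storage across the two data halls, in petabytes.
    PER_HALL_PB = {
        "scratch (one supercomputer per hall)": 2.5,
        "site store": 18.0,
        "archive disk cache": 1.1,
    }
    total_pb = 2 * sum(PER_HALL_PB.values())  # two data halls
    print(f"total disk: {total_pb:.1f} PB")   # 43.2 PB, i.e. "more than 40 PB"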
Comparison
[Chart: increase factors (roughly 1x to 6x) of the new solution over the current systems, across six measures: HP-NLS storage vs current tape capacity (PB); scratch storage vs P7 (PB); Site-Store and homes storage vs current (PB); and core count, sustained TFlops and peak TFlops of Supercomputer plus PPP vs P7 and the current PPP]
The Newest Addition to a Long History
[Chart: Historical Performance, EC Supercomputers (Flops), peak and sustained, from the Bendix G20, IBM 360/65, CDC 7600 and CDC 176, Cray 1S, Cray XMP-28 and XMP 416, NEC SX-3/44, SX-3/44R, SX-4/16, SX-4/80M3, SX-5/32M2 and SX-6/80M10, through IBM P4, P5 and P7, to the new IBM/XC-40]
Resulting Architecture
[Architecture diagram]
HPC Implementation Milestones: Delivery to Acceptance
• Data Hall and Hosting Site Certification Inspection
• Functionality Testing (IT infrastructure)
• Security Accreditation
• Performance Testing
• Conversion of operational codes (Automated Environmental Analysis & Production (AEAPPS))
• Meeting the above triggers a 30-day availability test (illustrated below)
[Flow: Functionality Testing → Performance Testing → Conversion → RFU → Acceptance]
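For context on what the 30-day availability test implies: the slides do not give the contracted availability target, so the figure below is purely illustrative (the 99% target is an assumption, not from the contract):

    # Allowed downtime during a 30-day availability test at a hypothetical
    # availability target. The 99% figure is assumed, not from the contract.
    TEST_HOURS = 30 * 24

    def allowed_downtime_hours(target: float) -> float:
        """Downtime budget, in hours, for the whole test window."""
        return TEST_HOURS * (1.0 - target)

    print(f"{allowed_downtime_hours(0.99):.1f} h over 30 days")  # 7.2 h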
Challenge
• Change the Supercomputer clusters, PPP clusters, archiving system and home directories. All at once. This has never been done before.
  – A lot of preparation work has been done ahead of time
    ♦ Most codes have already been ported to the Intel architecture
    ♦ Our General Purpose Science Clusters are available for PPP migration work
      – Linux containers are being leveraged to smooth the transition (a sketch follows below)
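As a sketch of the container approach mentioned above: running the legacy userland inside a container lets the same binaries and libraries follow the PPP codes from the old clusters to the new hardware. Everything here is illustrative, assuming a Docker-style runtime and a hypothetical legacy-environment image; the project's actual tooling is not specified on the slide:

    # Illustrative only: run a legacy PPP build/test step inside a container so
    # the old userland travels with the code during migration. The image name,
    # mount point and runtime ("docker") are hypothetical placeholders.
    import subprocess

    def run_in_container(image: str, command: list, workdir: str) -> int:
        """Run `command` inside `image` with the job's work area bind-mounted."""
        return subprocess.call(
            ["docker", "run", "--rm",
             "-v", f"{workdir}:/work",  # expose the job's files to the container
             "-w", "/work",             # start in the mounted work area
             image] + command
        )

    # e.g. rebuild and test a post-processing tool against the old userland:
    # run_in_container("legacy-ppp-env:latest", ["make", "test"], "/home/me/ppp")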
Thank you!
Questions?