Application Scalability and High Productivity Computing

Nicholas J. Wright, John Shalf, Harvey Wasserman
Advanced Technologies Group, NERSC/LBNL
NERSC: National Energy Research Scientific Computing Center
• Mission: Accelerate the pace of scientific
discovery by providing high performance
computing, information, data, and
communications services for all DOE Office
of Science (SC) research.
• The production computing facility for DOE
SC.
• Berkeley Lab Computing Sciences
Directorate
– Computational Research Division (CRD), ESnet
– NERSC
NERSC is the Primary Computing
Center for DOE Office of Science
• NERSC serves a large population
– Over 3,000 users, 400 projects, 500 codes
• NERSC Serves DOE SC Mission
–Allocated by DOE program managers
–Not limited to largest scale jobs
–Not open to non-DOE applications
• Strategy: Science First
–Requirements workshops by office
–Procurements based on science codes
–Partnerships with vendors to meet
science requirements
[Pie chart of NERSC allocations by science area: Physics, Chemistry, Fusion, Materials, Math + CS, Climate, Lattice Gauge, Astrophysics, Combustion, Life Sciences, Other]
NERSC Systems for Science
Large-Scale Computing Systems
Franklin (NERSC-5): Cray XT4
• 9,532 compute nodes; 38,128 cores
• ~25 Tflop/s on applications; 356 Tflop/s peak
Hopper (NERSC-6): Cray XE6
• Phase 1: Cray XT5, 668 nodes, 5344 cores
• Phase 2: 1.25 Pflop/s peak (late 2010 delivery)
Clusters
140 Tflops total
Carver
• IBM iDataplex cluster
PDSF (HEP/NP)
• ~1K core throughput cluster
Magellan Cloud testbed
• IBM iDataplex cluster
GenePool (JGI)
• ~5K core throughput cluster
NERSC Global
Filesystem (NGF)
Uses IBM’s GPFS
• 1.5 PB capacity
• 5.5 GB/s of bandwidth
HPSS Archival Storage
• 40 PB capacity
• 4 Tape libraries
• 150 TB disk cache
Analytics
Euclid
• 512 GB shared memory
Dirac GPU testbed
• 48 nodes
NERSC Roadmap
[Chart: peak Teraflop/s (log scale) vs. year, 2006-2020]
• Franklin (N5): 19 TF sustained, 101 TF peak
• Franklin (N5) + QC: 36 TF sustained, 352 TF peak
• Hopper (N6): >1 PF peak
• NERSC-7: 10 PF peak
• NERSC-8: 100 PF peak
• NERSC-9: 1 EF peak
How do we ensure that users' performance follows this trend and that their productivity is unaffected?
Users expect a 10x improvement in capability every 3-4 years.
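As a rough check of that expectation against the roadmap figures above (an illustrative calculation, not from the slides): going from Franklin's 352 TF peak to a 1 EF peak is a factor of
\[
  \frac{10^{6}\ \text{TF}}{352\ \text{TF}} \approx 2.8 \times 10^{3} \approx 10^{3.45},
\]
and spreading that growth over the roughly 11 years between the two systems gives a factor of 10 about every 11/3.45 ≈ 3.2 years, consistent with the stated 3-4 year cadence.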
Hardware Trends: The Multicore Era
• Moore's Law continues unabated
• Power constraints mean that core counts, not clock speeds, will double every 18 months
• Memory capacity is not doubling at the same rate, so GB/core will decrease
Power is the Leading Design Constraint
Figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith
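One way to quantify the squeeze on memory per core (a sketch; the 3-year memory-doubling period is an assumption chosen for illustration, not a figure from the slides): if cores per node double every 1.5 years while memory capacity per node doubles only every 3 years, then
\[
  \frac{\text{GB}}{\text{core}}(t) \;=\; \frac{M_{0}\, 2^{t/3}}{C_{0}\, 2^{t/1.5}} \;=\; \frac{M_{0}}{C_{0}}\, 2^{-t/3},
\]
where M_0 and C_0 are today's memory capacity and core count; under these assumptions the memory available to each core is halved roughly every three years.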
… and the power costs will still be staggering
From Peter Kogge, DARPA Exascale Study
$1M per megawatt per year! (with CHEAP power)
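For scale (an illustrative calculation; the electricity rate and the 20 MW machine size are assumptions, not numbers from the slide): one megawatt sustained for a year is
\[
  1\ \text{MW} \times 8760\ \text{h} \approx 8.8 \times 10^{6}\ \text{kWh},
\]
which at roughly 11 cents per kWh comes to about $1M, so a hypothetical 20 MW exascale-class machine would cost on the order of $20M per year for electricity alone.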
Changing Notion of "System Balance"
• If you pay 5% more to double the FPUs and get a 10% improvement, it's a win, despite lowering your % of peak performance (see the worked arithmetic below)
• If you pay 2x more for memory BW (power or cost) and get 35% more performance, it's a net loss, even though % of peak looks better
• Real example: we can give up ALL of the flops to improve memory bandwidth by 20% on the 2018 system
• We have a fixed budget
– Sustained-to-peak FLOP rate is the wrong metric if FLOPs are cheap
– Balance involves balancing your checkbook and balancing your power budget
– Requires application co-design to make the right trade-offs
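Making the budget arithmetic explicit (a restatement of the two bullets above, nothing more): performance per unit cost changes by
\[
  \frac{1.10}{1.05} \approx 1.05
\]
in the first case (about a 5% gain, hence a win) and by
\[
  \frac{1.35}{2.0} \approx 0.68
\]
in the second (about a 32% loss), so under a fixed budget the cheaper resource wins even though it makes the percent-of-peak figure look worse.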
Summary: Technology Trends
• Number of cores: increasing rapidly
– Flops will be "free"
• Memory capacity per core: decreasing
• Memory bandwidth per core: decreasing
• Network bandwidth per core: decreasing
• I/O bandwidth: not keeping pace
Navigating Technology Phase Transitions
[Chart: the NERSC roadmap (peak Teraflop/s vs. year, 2006-2020), with Franklin (N5), Franklin (N5) + QC, Hopper (N6, >1 PF peak), NERSC-7 (10 PF peak), NERSC-8 (100 PF peak), and NERSC-9 (1 EF peak), annotated with successive programming-model eras: COTS/MPP + MPI; COTS/MPP + MPI (+ OpenMP); GPU CUDA/OpenCL or manycore (BG/Q, R); Exascale + ???]
Application Scalability
How can a user continue to be
productive in the face of these
disruptive technology trends?
Source of Workload Information
• Documents
– 2005 DOE Greenbook
– 2006-2010 NERSC Plan
– LCF Studies and Reports
– Workshop Reports
– 2008 NERSC assessment
• Allocations analysis
• User discussion
New Model for Collecting
Requirements
• Joint DOE Program Office / NERSC
Workshops
• Modeled after ESnet method
– Two workshops per year
– Describe science-based needs over 3-5 years
• Case study narratives
– First workshop is BER, May 7-8
Numerical Methods at NERSC
(Caveat: survey data from ERCAP requests)
[Chart: "Methods at NERSC", percentage of the 400 total projects using each numerical method]
Application Trends
• Weak scaling
– Time to solution is often a non-linear function of problem size
• Strong scaling
– Latency or serial fraction will get you in the end (see Amdahl's law below)
• Add features to models: "new" weak scaling
[Charts: performance vs. "processors" under weak and strong scaling]
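The strong-scaling warning is Amdahl's law, stated here for reference (the formula is standard, not taken from the slide): with serial fraction s and P processors, the speedup is bounded by
\[
  S(P) \;=\; \frac{1}{s + (1 - s)/P} \;\le\; \frac{1}{s},
\]
so even a 1% serial fraction caps speedup at 100x no matter how many cores are added. Weak scaling sidesteps the bound by growing the problem with P; adding new physics or features to a model grows the work in the same way, which is why it acts as a "new" form of weak scaling.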
Develop Best Practices in Multicore Programming
NERSC/Cray Programming Models "Center of Excellence" combines:
• LBNL strength in languages, tuning, performance analysis
• Cray strength in languages, compilers, benchmarking
Goals:
• Immediate goal is training material for Hopper users: hybrid OpenMP/MPI (see the sketch below)
• Long-term input into the exascale programming model
[Chart: fvCAM (240 cores on Jaguar), time (sec) and memory per node (GB) vs. cores per MPI process (1, 2, 3, 6, 12) = OpenMP thread parallelism]
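A minimal sketch of the hybrid MPI + OpenMP pattern the training material targets (illustrative only; the loop, problem size, and file name are placeholders, not code from the Center of Excellence):

/* hybrid.c: a few MPI ranks per node, several OpenMP threads per rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0, global_sum = 0.0;

    /* Threads share this rank's slice of the work, so large read-only
       data is held once per MPI process instead of once per core. */
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = rank; i < 10000000; i += nranks)
        local_sum += 1.0 / (1.0 + (double)i);

    /* The master thread handles inter-node communication. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (ranks = %d, threads per rank = %d)\n",
               global_sum, nranks, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}

With a GCC-based MPI stack this would typically be built as "mpicc -fopenmp hybrid.c" and launched with OMP_NUM_THREADS set to the desired threads per rank; the charts on this slide and the next vary exactly that ratio of threads per MPI process.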
Develop Best Practices in Multicore Programming
PARATEC (768 cores on Jaguar)
Conclusions so far:
• Mixed OpenMP/MPI saves significant memory (see the illustrative arithmetic below)
• Running time impact varies with application
• 1 MPI process per socket is often good
Run on Hopper next:
• 12 vs. 6 cores per socket
• Gemini vs. SeaStar
[Chart: PARATEC time (sec) and memory per node (GB) vs. cores per MPI process (1, 2, 3, 6, 12) = OpenMP thread parallelism]
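A rough model of why hybrid saves memory (an illustration, not measured PARATEC data): suppose each MPI process holds R GB of replicated data (tables, buffers, halo copies) plus its share of a problem of total size D GB that is distributed evenly over all ranks. With p ranks per node, memory per node is approximately
\[
  M_{\text{node}} \;\approx\; p\,R \;+\; \frac{D}{\text{nodes}},
\]
so moving from 12 single-threaded ranks per node to 2 ranks of 6 threads each leaves the distributed share per node unchanged but cuts the replicated term p·R by a factor of 6, which is the kind of saving the memory curve shows.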
Co-Design
Eating our own dogfood
Inserting Scientific Apps into the
Hardware Development Process
• Research Accelerator for Multi-Processors
(RAMP)
– Simulate hardware before it is built!
– Break slow feedback loop for system designs
– Enables tightly coupled hardware/software/science
co-design (not possible using conventional approach)
Summary
• Disruptive technology changes are coming
• By exploring
– new programming models (and revisiting old ones)
– hardware/software co-design
• we hope to ensure that scientists' productivity remains high!