Accelerated Prediction of Polar Ice and Global Ocean (APPIGO): Overview
Phil Jones (LANL)
Eric Chassignet (FSU)
Elizabeth Hunke, Rob Aulwes (LANL)
Alan Wallcraft, Tim Campbell (NRL-SSC)
Mohamed Iskandarani, Ben Kirtman (Univ. Miami)
Arctic Prediction
• Polar amplification
  – Rapid ice loss, feedbacks
  – Impacts on global weather
• Human activities
  – Infrastructure, coastal erosion, permafrost melt
  – Resource extraction
  – Shipping
  – Security/safety, staging
• Regime change
  – Thin ice leads to more variability
[Photo: Shell Kulluk Arctic oil rig runs aground in Gulf of Alaska (USCG photo)]
[Photo: LNG carrier Ob River in winter crossing (with icebreakers)]
[Image: news headline "Trump: ISIS route into N. America"]
Interagency Arctic efforts
• Earth System Prediction Capability (ESPC) Focus Area
  – Sea ice prediction: up to seasonal
  – Seasonal prediction: Broncos vs Carolina in the Super Bowl
• Sea Ice Prediction Network (SIPN)
  – Sea Ice Outlook
• This project – enabling better prediction through model performance
APPIGO
• Enhance performance of Arctic forecast models on advanced architectures, with a focus on:
  – Los Alamos CICE – sea ice model
  – HYCOM – global ocean model
  – WaveWatch III – wave model
  – Components of the Arctic Cap Nowcast/Forecast System (ACNFS) and Global Ocean Forecast System (GOFS)
Proposed Approach
• Refactoring: incremental
  – Profile (see the timing sketch below)
  – Accelerate sections (initially slower)
  – Expand accelerated sections
  – Can test along the way
  – Try directive-based and other approaches
• Optimized
  – Best possible for specific kernels
• Abstractions, larger-scale changes (data structures)
• In parallel: optimized operator library
• Stennis (HYCOM, Phi/many-core), LANL (GPU, CICE, HYCOM), Miami (operators), FSU (validation, science)
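A minimal timing sketch of the "profile" step above, under the assumption that coarse wall-clock timers around candidate sections are enough to rank them before turning to a full profiler; the "dynamics" and "transport" loops are placeholders, not model code:

/* Minimal profiling sketch (assumption: coarse wall-clock timers suffice to
 * rank candidate sections before a full profiler run).  The loops below are
 * placeholders, not model code. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

static double seconds(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + 1.0e-9 * (double)(b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    struct timespec t0, t1, t2;
    volatile double s = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 1; i <= 20000000L; i++) s += 1.0 / (double)i;    /* "dynamics"  */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (long i = 1; i <= 5000000L;  i++) s += 1.0e-9 * (double)i; /* "transport" */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("dynamics  : %.3f s\n", seconds(t0, t1));
    printf("transport : %.3f s\n", seconds(t1, t2));
    return 0;
}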
APPIGO proposed timeline
• Year 1
  – Initial profiling
  – Initial acceleration (deceleration!)
    o CICE: GPU
    o HYCOM: GPU, Phi (MIC)
    o WW3: hybrid scalability
  – Begin operator libs
• Year 2
  – Continued optimization
  – Expand accelerated regions (change sign)
  – Abstractions, operator lib
• Year 3
  – Deploy in models and validate with science
Progress to Date
Focus on CICE: Challenges
• CICE
  – Dynamics (EVP rheology)
  – Transport
  – Column physics (thermo, ridging, etc.)
• Quasi-2D
  – Number of levels and thickness classes is small
• Parallelism
  – Not enough in the horizontal domain decomposition alone (see the sketch below)
• Computational intensity
  – Maybe not enough work for efficient kernels
  – BGC and new improvements help
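Since the horizontal decomposition alone does not expose enough parallelism, one common remedy is to collapse the small thickness-category loop with the horizontal-cell loop so an accelerator sees a larger iteration space. The following is a hypothetical C/OpenACC sketch of that idea with toy array names and a toy update; it is not CICE code:

/* Hypothetical C/OpenACC sketch: collapse the small category loop with the
 * horizontal-cell loop so the device sees NCAT*NCELLS iterations instead of
 * only NCELLS.  Array names and the update itself are illustrative only. */
#define NCAT   5       /* thickness categories: small in CICE             */
#define NCELLS 10000   /* horizontal cells owned by this rank (toy size)  */

void update_categories(double *vice, const double *growth)
{
    /* collapse(2) merges both loops into one parallel iteration space */
    #pragma acc parallel loop collapse(2) \
        copy(vice[0:NCAT*NCELLS]) copyin(growth[0:NCAT*NCELLS])
    for (int n = 0; n < NCAT; n++)
        for (int i = 0; i < NCELLS; i++)
            vice[n * NCELLS + i] += growth[n * NCELLS + i];
}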
Accelerating CICE with OpenACC
• Focused on dynamics
• Halo updates presented a significant challenge
  – Attempted to use GPUDirect to avoid extra GPU-CPU data transfers
• What we tried (see the sketch below)
  – Refactored loops to get more computation onto the GPU
  – Fused separate kernels
  – Used OpenACC streams to get concurrent execution and hide data-transfer latencies
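A hypothetical C/OpenACC sketch of two of the techniques listed above, using toy arrays rather than CICE dynamics: a fused kernel that replaces two small launches, and independent kernels placed on separate async queues so they (and any update transfers issued on those queues) can overlap:

/* Hypothetical C/OpenACC sketch (not CICE source): one fused kernel instead of
 * two small launches, plus two async queues for concurrent execution. */
#define N 100000

void fused_and_async(double *strain, double *stress, double *w,
                     const double *u, const double *v)
{
    #pragma acc data copyout(strain[0:N], stress[0:N]) copy(w[0:N]) \
                     copyin(u[0:N], v[0:N])
    {
        /* (1) kernel fusion: strain and stress updated in a single launch */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < N; i++) {
            strain[i] = u[i] - v[i];
            stress[i] = 0.5 * strain[i];
        }

        /* (2) an independent kernel on a second queue can run concurrently;
         *     "update ... async(2)" transfers could be hidden the same way */
        #pragma acc parallel loop async(2)
        for (int i = 0; i < N; i++)
            w[i] += 0.1 * v[i];

        #pragma acc wait   /* join both queues before the data region copies out */
    }
}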
HYCOM Progress: Large Benchmark
• Standard DoD HPCMP HYCOM 1/25 global benchmark
  – 9000 by 6595 by 32 layers
• Includes typical I/O and data sampling
• Benchmark updated from HYCOM version 2.2.27 to 2.2.98
  – Land masks in place of do-loop land avoidance (see the sketch below)
  – Dynamic vs. static memory allocation
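A hypothetical sketch of the land-mask change in C terms (the real change is in HYCOM's Fortran source): instead of irregular per-row loop bounds that skip land, sweep the whole regular array and let a 0/1 mask zero out land contributions, which keeps the loops simple and vectorizable:

/* Hypothetical sketch of "land masks in place of do-loop land avoidance":
 * sweep the whole regular array and zero land contributions with a 0/1 mask
 * instead of using irregular per-row loop bounds.  Toy names and sizes. */
#define NI 128
#define NJ 64

void apply_tendency(double temp[NJ][NI],
                    const double dtemp[NJ][NI],
                    const double ocn_mask[NJ][NI])   /* 1.0 = ocean, 0.0 = land */
{
    for (int j = 0; j < NJ; j++)
        for (int i = 0; i < NI; i++)                 /* regular, vectorizable sweep */
            temp[j][i] += ocn_mask[j][i] * dtemp[j][i];
}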
HYCOM Progress: Large Benchmark
• On the Cray XC40:
  – Using huge pages improves performance by about 3%
  – Making the first dimension of all arrays a multiple of 8 saved 3-6% (see the sketch below)
    o Change a single number in the run-time patch.input file
    o ifort -align array64byte
• [Figure: total core hours per model day vs. number of cores]
  – 3 generations of Xeon cores
    o No single-core improvement, but 8 vs 12 vs 16 cores per socket
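A hypothetical C translation of the padding idea: round the fastest-varying dimension up to a multiple of 8 doubles (64 bytes) and allocate 64-byte-aligned storage so every row starts on a cache-line / vector-register boundary. In HYCOM itself this is one number in the run-time patch.input file plus ifort -align array64byte; the toy dimensions below stand in for the 9000 by 6595 benchmark grid:

/* Hypothetical C sketch of the padding/alignment idea: pad the fastest-varying
 * dimension to a multiple of 8 doubles (64 bytes) and allocate 64-byte aligned
 * memory so every row starts on a cache-line / vector-register boundary. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t ni = 1001, nj = 64;              /* toy grid; real one is 9000 x 6595 */
    const size_t ni_pad = (ni + 7) & ~(size_t)7;  /* round up to a multiple of 8       */

    /* ni_pad*nj*sizeof(double) is a multiple of 64, as aligned_alloc (C11) requires */
    double *field = aligned_alloc(64, ni_pad * nj * sizeof *field);
    if (!field) return 1;

    printf("ni = %zu padded to %zu; row stride = %zu bytes (multiple of 64)\n",
           ni, ni_pad, ni_pad * sizeof *field);

    free(field);
    return 0;
}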
HYCOM on Xeon Phi
• Standard gx1v6 HYCOM benchmark run in native mode on 48 cores of a single 5120D Phi attached to the Navy DSRC's Cray XC30
  – No additional code optimization
  – Compared to 24 cores of a single Xeon E5-2697v2 node
    o Individual subroutines run 6 to 13 times slower
    o Overall, 10 times slower
  – Memory capacity is too small
  – I/O is very slow
    o Native mode is not practical
• Decided not to optimize for Knights Corner; Knights Landing is very different
• Self-hosted Knights Landing nodes
  – Up to 72 cores per socket, lots of memory
  – Scalability of 1/25 global HYCOM makes this a good target
    o May need additional vector (AVX-512F) optimization
    o I/O must perform well
Validation Case
• CESM test case
  – HYCOM (2.2.35), CICE
  – Implementation of flux exchange
  – HYCOM, CICE in G compset
• Three 50-year experiments
  – CORE v2 forcing
  – HYCOM in CESM w/ CICE
  – POP in CESM w/ CICE
  – HYCOM standalone w/ CICE
Lessons Learned
• Hosted accelerators suck
• Programming models, software stack immature
  – Inability to even build at the Hackathon a year ago
• Substantial improvement
  – Can build and run to break-even at the 2015 Hackathon
  – OpenACC can compete with CUDA, 2-3x speedup
    o Based on ACME atmosphere experience
  – GPU Direct
• Need to expand accelerated regions beyond a single routine to gain performance (see the sketch below)
• We have learned a great deal and obtained valuable experience
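A hypothetical C/OpenACC sketch of what expanding the accelerated region beyond a single routine can look like: an unstructured enter/exit data region keeps the state resident on the device across several routine calls, so the run pays one upload and one download instead of a transfer pair per call. Toy routines and arithmetic, not model code:

/* Hypothetical C/OpenACC sketch (not model code): keep state resident on the
 * device across routine calls with an unstructured data region. */
#define N 100000

static double state[N];

void dynamics_step(void)
{
    #pragma acc parallel loop present(state)
    for (int i = 0; i < N; i++)
        state[i] *= 0.99;              /* stand-in for a dynamics update */
}

void thermo_step(void)
{
    #pragma acc parallel loop present(state)
    for (int i = 0; i < N; i++)
        state[i] += 1.0e-3;            /* stand-in for column physics    */
}

void run_steps(int nsteps)
{
    #pragma acc enter data copyin(state)    /* single upload                  */
    for (int k = 0; k < nsteps; k++) {
        dynamics_step();                    /* data stays device-resident     */
        thermo_step();
    }
    #pragma acc exit data copyout(state)    /* single download at the end     */
}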
APPIGO Final Year
• CICE
  – Continue, expand OpenACC work
  – Column physics
• HYCOM
  – Revisit OpenACC
  – Continue work toward Intel Phi
• Continue validation/comparison
  – Coupled and uncoupled
APPIGO Continuation?
• Focus on path to operational ESPC model
  – Continued optimization, but focus on coverage, incorporation into production models
  – CICE, HYCOM on Phi (threading), GPU (OpenACC)
  – WWIII?
• Science application
  – Use coupled sims to understand Arctic regime change
• Throw Mo under the bus: abandon stencils
  – Too fine granularity