Accelerated Prediction of Polar Ice and Global Ocean (APPIGO): Overview

Phil Jones (LANL); Eric Chassignet (FSU); Elizabeth Hunke, Rob Aulwes (LANL); Alan Wallcraft, Tim Campbell (NRL-SSC); Mohamed Iskandarani, Ben Kirtman (Univ. Miami)

Arctic Prediction
• Polar amplification
  – Rapid ice loss, feedbacks
  – Impacts on global weather
• Human activities
  – Infrastructure, coastal erosion, permafrost melt
  – Resource extraction
  – Shipping
  – Security/safety, staging
• Regime change
  – Thin ice leads to more variability
[Slide images: Shell Kulluk Arctic oil rig runs aground in the Gulf of Alaska (USCG photo); LNG carrier Ob River on a winter crossing (with icebreakers); headline "Trump: ISIS route into N. America"]

Interagency Arctic Efforts
• Earth System Prediction Capability (ESPC) Focus Area
  – Sea ice prediction: up to seasonal
  – Seasonal prediction: Broncos vs. Carolina in the Super Bowl
• Sea Ice Prediction Network (SIPN)
  – Sea Ice Outlook
• This project
  – Enabling better prediction through model performance

APPIGO
• Enhance the performance of Arctic forecast models on advanced architectures, with a focus on:
  – Los Alamos CICE – sea ice model
  – HYCOM – global ocean model
  – WaveWatch III – wave model
  – Components of the Arctic Cap Nowcast/Forecast System (ACNFS) and Global Ocean Forecast System (GOFS)

Proposed Approach
• Refactoring: incremental (a minimal directive sketch follows the Progress to Date heading below)
  – Profile
  – Accelerate sections (initially slower)
  – Expand accelerated sections
  – Can test along the way
  – Try directives and other approaches
• Optimized: best possible performance for specific kernels
• Abstractions and larger-scale changes (data structures)
• In parallel: optimized operator library
• Roles: Stennis (HYCOM, Phi/many-core), LANL (GPU, CICE, HYCOM), Miami (operators), FSU (validation, science)

APPIGO Proposed Timeline
• Year 1
  – Initial profiling
  – Initial acceleration (deceleration!)
    o CICE: GPU
    o HYCOM: GPU, Phi (MIC)
    o WW3: hybrid scalability
  – Begin operator libraries
• Year 2
  – Continued optimization
  – Expand accelerated regions (change sign)
  – Abstractions, operator library
• Year 3
  – Deploy in models and validate with science

Progress to Date
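The incremental path above (profile a hotspot, accelerate it with directives, accept an initial slowdown, then expand) can be illustrated with a short C/OpenACC sketch. This is a hypothetical example only, not APPIGO, CICE, or HYCOM code: the routine name update_field and the arrays divu, area, and strength are placeholders chosen for illustration.

    /* Hypothetical first-pass acceleration of one profiled hot loop.  The
     * explicit copyin/copyout clauses make the host<->device traffic
     * visible, which is why this first step often runs slower than the
     * CPU version until the data region is later widened to cover more
     * kernels. */
    void update_field(int n, const double *divu, const double *area,
                      double *strength)
    {
        #pragma acc parallel loop copyin(divu[0:n], area[0:n]) copyout(strength[0:n])
        for (int i = 0; i < n; ++i) {
            strength[i] = divu[i] * area[i];
        }
    }

Built with any OpenACC-capable compiler (for example pgcc -acc; the exact toolchain is an assumption here), the loop still produces results that can be checked against the CPU version, which is what makes the "can test along the way" step of the plan possible.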
Focus on CICE: Challenges
• CICE components
  – Dynamics (EVP rheology)
  – Transport
  – Column physics (thermodynamics, ridging, etc.)
• Quasi-2D problem
  – Number of vertical levels and thickness classes is small
• Parallelism
  – Not enough in the horizontal domain decomposition alone
• Computational intensity
  – Perhaps not enough work for efficient kernels
  – BGC and other new improvements help

Accelerating CICE with OpenACC
• Focused on dynamics
• Halo updates presented a significant challenge
  – Attempted to use GPUDirect to avoid extra GPU-CPU data transfers
• What we tried
  – Refactored loops to get more computation onto the GPU
  – Fused separate kernels
  – Used OpenACC streams (async queues) for concurrent execution and to hide data-transfer latencies

HYCOM Progress: Large Benchmark
• Standard DoD HPCMP HYCOM 1/25° global benchmark
  – 9000 by 6595 horizontal grid points by 32 layers
• Includes typical I/O and data sampling
• Benchmark updated from HYCOM version 2.2.27 to 2.2.98
  – Land masks in place of do-loop land avoidance
  – Dynamic vs. static memory allocation
• On the Cray XC40:
  – Using huge pages improves performance by about 3%
  – Making the first dimension of all arrays a multiple of 8 saved 3-6%
    o Change a single number in the run-time patch.input file
    o ifort -align array64byte
[Plot: total core-hours per model day vs. number of cores, across three generations of Xeon cores; no single-core improvement, but 8 vs. 12 vs. 16 cores per socket]

HYCOM on Xeon Phi
• Standard gx1v6 HYCOM benchmark run in native mode on 48 cores of a single 5120D Phi attached to the Navy DSRC's Cray XC30
  – No additional code optimization
  – Compared to 24 cores of a single Xeon E5-2697v2 node
    o Individual subroutines run 6 to 13 times slower
    o Overall, 10 times slower
  – Memory capacity is too small
  – I/O is very slow
    o Native mode is not practical
• Decided not to optimize for Knights Corner; Knights Landing is very different
• Self-hosted Knights Landing nodes
  – Up to 72 cores per socket, lots of memory
  – Scalability of the 1/25° global HYCOM makes this a good target
    o May need additional vector (AVX-512F) optimization
    o I/O must perform well

Validation Case
• CESM test case
  – HYCOM (2.2.35), CICE
  – Implementation of flux exchange
  – HYCOM, CICE in the G compset
• Three 50-year experiments with CORE v2 forcing
  – HYCOM in CESM with CICE
  – POP in CESM with CICE
  – Standalone HYCOM with CICE

Lessons Learned
• Hosted accelerators suck
• Programming models and the software stack are immature
  – Unable even to build at the Hackathon a year ago
• Substantial improvement since then
  – Could build and run to break-even at the 2015 Hackathon
  – OpenACC can compete with CUDA: 2-3x speedup
    o Based on ACME atmosphere experience
  – GPUDirect
• Need to expand accelerated regions beyond a single routine to gain performance (a sketch follows at the end of this deck)
• We have learned a great deal and obtained valuable experience

APPIGO Final Year
• CICE
  – Continue and expand the OpenACC work
  – Column physics
• HYCOM
  – Revisit OpenACC
  – Continue work toward Intel Phi
• Continue validation/comparison
  – Coupled and uncoupled

APPIGO Continuation?
• Focus on the path to an operational ESPC model
  – Continued optimization, but with a focus on coverage and incorporation into production models
  – CICE, HYCOM on Phi (threading) and GPU (OpenACC)
  – WWIII?
• Science application
  – Use coupled simulations to understand Arctic regime change
• Throw Mo under the bus: abandon stencils
  – Too fine a granularity
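Two of the techniques named above (widening accelerated data regions beyond a single routine, and using OpenACC async queues/streams to hide transfer latency behind computation) are sketched below in C/OpenACC. This is only an illustration under assumed conditions, not CICE code: the grid size NX x NY, the halo width, the field names u and tend, and the placement of the omitted MPI halo exchange are all assumptions.

    #include <stddef.h>

    #define NX   1024   /* assumed local grid size  */
    #define NY   1024
    #define HALO 2      /* assumed halo width       */

    /* Sketch: (1) a structured data region keeps the fields resident on
     * the GPU across several kernels instead of paying per-routine
     * transfer costs; (2) async queues overlap the boundary download
     * (which would feed an MPI halo exchange, not shown) with the
     * interior update. */
    void step(double *u, const double *tend)
    {
        const size_t n = (size_t)NX * NY;

        #pragma acc data copy(u[0:n]) copyin(tend[0:n])
        {
            /* 1. Update the rows that neighbours will need first
             *    (top and bottom strips of width 2*HALO), on queue 1. */
            #pragma acc parallel loop collapse(2) async(1)
            for (int j = 0; j < 2 * HALO; ++j)
                for (int i = 0; i < NX; ++i) {
                    u[(size_t)j * NX + i]            += tend[(size_t)j * NX + i];
                    u[(size_t)(NY - 1 - j) * NX + i] += tend[(size_t)(NY - 1 - j) * NX + i];
                }

            /* 2. Download those strips on the same queue so a host-side
             *    halo exchange could start as soon as they arrive ... */
            #pragma acc update host(u[0:2*HALO*NX], u[(NY-2*HALO)*NX:2*HALO*NX]) async(1)

            /* 3. ... while the interior is updated concurrently on queue 2. */
            #pragma acc parallel loop collapse(2) async(2)
            for (int j = 2 * HALO; j < NY - 2 * HALO; ++j)
                for (int i = 0; i < NX; ++i)
                    u[(size_t)j * NX + i] += tend[(size_t)j * NX + i];

            /* 4. Join both queues before the data region copies u out. */
            #pragma acc wait
        }
    }

In a production code the MPI exchange (or a GPUDirect transfer, as attempted in the project) would sit between step 2 and the final wait; the point of the sketch is only the queue and data-region structure, which is what "expand accelerated regions beyond a single routine" refers to.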