ECSS Final Report

Title: Hydrological model development and application for a refined understanding of Arctic hydrology
PI: Liljedahl
PI Institution: University of Alaska, Fairbanks
ECSS Consultant(s): Laura Carrington, SDSC
Allocation Start/End dates:

Abstract: The observed changes in the Arctic hydrological system are a product of several interacting processes and complex feedback mechanisms occurring across multiple spatial and temporal scales. This justifies the need for physically based models if we are to effectively address the impact of climate change on cold-region watershed hydrology. Over the last ~5 years we have worked to implement permafrost and frozen ground into the hydrological model WaSiM, which has been used extensively in temperate regions (over 50 peer-reviewed publications). Our current and recent efforts include refining the computational efficiency of the code, as well as applications to several basins in Arctic and sub-Arctic Alaska. We are currently working under five different NSF-supported research projects (NSF-ARCSS and EPSCoR) that depend heavily on our WaSiM efforts to assess the impact of climate change on Arctic hydrology.

Executive Summary: The focus of the project was to investigate the overall performance of the code. The first step was to profile the computational performance and determine in which routines the time was being spent. For the initial input, one main function was AltitudeCorrection, embedded in the surfacerouting function, which accounted for ~30% of the overall runtime. My work focused on optimizing the main loop in that function, reducing the overall runtime by about 20% (5,590 s original vs. 4,455 s optimized). Discussions with the developers about the discovered hotspots and the time spent in the surfacerouting function resulted in a shift to examining a new input; in addition, the developer provided a new algorithm that reduced the time spent in the surfacerouting function by ~10X.
The performance analysis of the second input and new algorithm identified additional optimizations, which resulted in a ~9% reduction in runtime. The examination of scalability, for both OpenMP and MPI, is ongoing. The OpenMP scalability initially showed issues in a few routines (e.g., a 2-3X slowdown going from 1 to 6 OMP threads). Some of the issues were resolved using the OpenMP reduction clause, which already existed in some parts of the code; investigation of the remaining issues is ongoing. Work on MPI scalability has just begun, with performance data collection completed and discussions with the team starting.

Statement by the PI: The analyses Laura Carrington conducted on the WaSiM source code were very helpful in finding some performance bottlenecks which were not related to the parallelization (such as in the surface routing and unsaturated soil zone sub-models), as well as in pointing to some performance issues that are related to the implementation of the MPI and OpenMP parallelization techniques. This was the first time someone with a strong background in supercomputer programming looked at the source code with a focus on parallel performance. A continued effort will surely give us many more hints on how to improve the overall, and in particular the parallel, performance of the model.

Chronology: Describe how the ECSS project was initiated (user submitted a proposal, a ticket, etc.) and the steps used to address the problem as the ECSS project was executed.

The XRAC committee suggested this as an ECSS project, and the PI was contacted with details about ECSS and how the project worked. The ECSS staff identified areas to be addressed and began working with the PI's team.

Technical Details: Describe the problem in technical terms along with possible solutions and the actual solution. Include "before and after" performance charts etc. as applicable.
Computational time analysis:

Input #1 & code v4: Using the binary instrumentation tool PEBIL, I instrumented the entry and exit of all functions with timers. The instrumentation produced similar results for data sets of medium and longer run length, which meant performance analysis could be done on the shorter-running case to save time. The instrumentation identified the top time-consuming functions, and fine-grain timers were then placed within each of those functions to determine where the time was spent. The focus was AltitudeCorrection, which is called from GetFixPoint. The optimization performed on that routine and its nested loops was to restructure the computationally intensive for-loop so that no if-then branches remain inside the loop, which allows the compiler to optimize better and vectorize the loop. That routine accounted for ~30% or more of the computational time; the optimization reduced its time from ~2,400 s to ~1,400 s, for an overall runtime improvement of ~20% (4,455 s vs. 5,590 s). I will continue to look at some of the computational hotspots, as well as examine what the compiler report says about the changes I made.

Input #2 & code v10: Input #2 is a more complex input requiring "spin up" runs to create accurate initial conditions. V10 replaced the main algorithm in the most expensive computational function with an implicit method, which reduced the overall runtime by ~10X. A similar analysis was performed on the code to identify where the computational time was being spent. The fine-grain timers identified two routines, CalcSE (294 s = 12%) and CalcdEOverdT (288 s = 13%), that were very similar in structure and looped over the data in a similar pattern. The two routines were combined into a single routine, which resulted in an overall reduction of ~212 s, or ~9%, in the computational time.
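The fusion of the two similarly structured routines can be sketched as follows. This is a hypothetical illustration: the loop bodies and names (calc_a, calc_b, theta) are invented placeholders, not the actual CalcSE / CalcdEOverdT computations from WaSiM.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: two separate passes over the same grid data, mimicking the
// structure of the two similar routines identified by the timers.
void calc_a(const std::vector<double>& theta, std::vector<double>& se) {
    for (std::size_t i = 0; i < theta.size(); ++i)
        se[i] = theta[i] * theta[i];          // placeholder computation
}

void calc_b(const std::vector<double>& theta, std::vector<double>& dedt) {
    for (std::size_t i = 0; i < theta.size(); ++i)
        dedt[i] = 2.0 * theta[i];             // placeholder computation
}

// After: one fused pass. Each cell's data is loaded once and reused,
// roughly halving the memory traffic for loops that are memory-bound.
void calc_fused(const std::vector<double>& theta,
                std::vector<double>& se, std::vector<double>& dedt) {
    for (std::size_t i = 0; i < theta.size(); ++i) {
        double t = theta[i];                  // single load per cell
        se[i] = t * t;
        dedt[i] = 2.0 * t;
    }
}
```

The fused version produces identical results to the two separate passes; the savings come purely from traversing the data once instead of twice.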
OpenMP Scaling: After manually inserting timers around key computational areas in the code, it was run on 48 cores while varying the number of OMP threads from 1 to 6. The plots below compare a 1-OMP-thread run to a 6-OMP-thread run. Each bar is an MPI rank, and the colors in the bar represent the percent of time in each timer. Both runs use 48 cores: the first plot is 48 MPI tasks with 1 OMP thread per task, and the second is 8 MPI tasks with 6 OMP threads per task. Since the bar height is 100% and not scaled to total runtime, the runtime comparison is given separately: 1 OMP thread: 3,135 s; 6 OMP threads: 11,575 s (~2.8X slower). One would hope that the runtimes of the two would be similar, with perhaps the MPI-only run (omp1) running ~10% faster, so the large slowdown indicated a performance issue. The timers are located in unsatzon and the surface routing routine; note that in the first plot the top color is "MISSING", i.e., time not captured by the manually inserted timers (there were 20 in the code). Below, the two plots are replotted with the only change being that the first plot includes just the first 8 MPI tasks (ranks 0-7, omitting 8-47), to make the comparison easier to visualize.

The key points illustrated by the plots are load imbalance in the unsatzon t0 timer and in the surface routing t17 timer. For reference, t0 wraps the first loop in unsatzonmodellklasse::run; for comparison, t0 omp1 is ~2,205 s and t0 omp6 is ~3,008 s, which is about 30% slower but not the ~2.8X slowdown seen in the total runtime. So while 30% isn't great, it isn't bad either. The t17 timer is located at the point in surface routing where this comment appears: "// Gauss Seidel fully implicit iteration (stabel, but many iterations needed); May be extended later to Crank-Nicholson, if that turns out to be equally stable but faster". In this portion of the code, the average time for t17 is ~11 s with omp1 and ~2,475 s with omp6, so this is where performance suffers when using OMP.
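The manual timer scheme described above can be sketched as follows. This is a minimal, hypothetical version: the timer-slot array mirrors the report's t0..t19 naming, std::chrono stands in for whatever clock the real instrumentation used, and the loop body is a placeholder, not WaSiM's unsatzon code.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Seconds accumulated per manually inserted timer slot (t0..t19).
static double g_timer[20] = {0.0};

// Stand-in for the first loop of unsatzonmodellklasse::run, wrapped by t0.
// The OpenMP pragma is ignored when compiled without OpenMP support.
double run_t0(const std::vector<double>& v) {
    auto start = std::chrono::steady_clock::now();

    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        sum += v[i] * 0.5;                    // placeholder loop body

    std::chrono::duration<double> dt =
        std::chrono::steady_clock::now() - start;
    g_timer[0] += dt.count();                 // accumulate into slot t0
    return sum;
}

// "MISSING" time is whatever share of the total wall time no timer captured.
double missing_time(double total_runtime) {
    double captured = 0.0;
    for (double t : g_timer) captured += t;
    return total_runtime - captured;
}
```

Comparing the per-slot totals between a 1-thread and a 6-thread run is what localizes a slowdown to a specific region (e.g., t17) rather than leaving it smeared across the whole runtime.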
Discussions with the developers identified an OpenMP reduction clause that could be utilized on Comet. When the reduction was applied in other areas of the code, the results below were obtained; the MISSING time also becomes a problem, at ~280 s with omp1 vs. ~5,640 s with omp6, growing from ~10% of the runtime to ~50% when using OMP. The -D_OMP3 directive turns on the use of OpenMP reduction, which is not available in all OpenMP installations. The results of using the OpenMP reduction are illustrated in the plots below. The first two graphs compare the time breakdown for omp6 and omp6 with the -D_OMP3 flag (i.e., OMP reduction); the big change is the MISSING time shrinking, with nothing else changing. Next is the comparison of the overall runtime (so the y-axis is runtime) of omp1 vs. omp6 with -D_OMP3. The reduction in MISSING time is large, but omp6 still runs ~2X slower than omp1 due to the inefficiency in t17. This identified an OpenMP bottleneck in the routine wrapped by the t17 timer. Three potential issues are being investigated (targeted for the extension):
1. Does use of the OpenMP parallelization result in slower convergence or more iterations?
2. Is there something within the OpenMP usage that is causing serialization?
3. Is there something about thread placement and scheduling on Comet that is causing the performance bottleneck?

MPI Scaling: The performance analysis of the scalability of the code focused on how the time spent in communication changes as the application is run on more cores. A series of MPI profiling runs was performed to determine which routines consume more time as the code scales to more cores. The plots below show the breakdown of computational and communication time for a series of core counts. The first plot looks at the change in runtime, as well as the change in percent communication time, as the core count increases.
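The scaling comparisons above can be reduced to two simple metrics: strong-scaling efficiency relative to a baseline core count, and the communication fraction of the runtime. The helper names and the numbers in the assertions below are invented for illustration, not the measured WaSiM runtimes.

```cpp
#include <cassert>

// Strong-scaling efficiency relative to a baseline run: 1.0 means the
// runtime drops in exact proportion to the added cores; values below 1.0
// indicate degraded scaling (e.g., growing communication cost).
double strong_scaling_efficiency(double t_base, int cores_base,
                                 double t_n, int cores_n) {
    return (t_base * cores_base) / (t_n * cores_n);
}

// Fraction of total runtime spent in communication (MPI send/recv/reduce
// plus sync/wait time), as tracked in the profiling plots.
double comm_fraction(double comm_time, double total_time) {
    return comm_time / total_time;
}
```

Plotting these two quantities against core count is one way to separate "the code stopped scaling" from "the code is spending its extra cores waiting in MPI".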
The first plot indicates that the runtime at the larger core count of 144 is not scaling as well as at the lower core counts, due to increasing time spent in communication. Detailed plots of two of the runs, one at 48 cores and one at 144 cores, are shown below. The plots show runtime on the y-axis, with individual MPI ranks on the x-axis. Each bar breaks down the time spent in computation and in the different MPI events, with a further breakdown of time spent sending data versus time waiting to send or receive data (i.e., sync). The plots show how the time spent in MPI reduce grows as a percent of the runtime as the core count increases. The plot for 144 cores indicates that the increase is a result of sync time, not of the sending/receiving/computing of the data. Both this and the MPI waitall time identify areas for further investigation, which will be tackled as part of the extension.

Collaboration details: The team members are:
ECSS Staff: Laura Carrington – worked on performance analysis of the code and some small optimizations.
PI's team:
Anna Liljedahl (PI) – coordination of the project and contributions to the simulations/inputs identified for analysis.
Ronald Daanen – aided in getting the application running with both inputs (including the spin-up phase) and provided information about the application's performance.
Jörg Schulla – main developer of the code; provided 5 new versions of the code, including a new algorithm that significantly reduced the runtime.
Anne Gaedeke – provided input data for the application on Comet, as well as the setup for running on Comet.

Outcome and Recommendations: The final product of the project is the current version of the code, which includes all optimizations/changes. The recommendation is to continue to investigate the MPI performance, via the MPI reduce sync time and the Waitall time, possibly utilizing some overlap of communication with computation.
In addition, identifying the OpenMP bottleneck will be important on many-core architectures like the Intel Xeon Phi.

Lessons Learned: Describe any lessons learned while completing this project that other ECSS staff (or XSEDE staff) may find of value. Describe any plans to write an advanced topic article for documentation or to create a tutorial based on lessons learned in this project. Consider volunteering to give a talk at the monthly ECSS symposium.

Impact: What impact does the completion of this project have on the project team, XSEDE and the community as a whole?

The WaSiM hydrological model is used by over 50 universities and institutes. The optimizations and changes made to the code to improve performance will benefit those users.

Publications: List any publications that have been (or may be) generated from this project or any conferences at which it might be relevant to present this work.