ECSS Final Report
Title: Hydrological model development and application for a refined understanding of Arctic
hydrology
PI: Liljedahl
PI Institution: University of Alaska, Fairbanks
ECSS Consultant(s): Laura Carrington SDSC
Allocation Start/End dates:
Abstract:
The observed changes in the Arctic hydrological system are a product of several interacting
processes and complex feedback mechanisms occurring across multiple spatial and temporal scales. This
justifies the need for physically based models if we are to effectively address the impact of climate change
on cold-region watershed hydrology. Over the last ~5 years we have worked to implement permafrost and
frozen ground into the hydrological model WaSiM, which has been extensively used in temperate
regions (over 50 peer-reviewed publications). Our current and recent efforts include refining the
computational efficiency of the code as well as applications to several basins in Arctic and sub-Arctic
Alaska. We are currently working under five different NSF-supported research projects (NSF-ARCSS and
EPSCoR) that depend heavily upon our WaSiM efforts to assess the impact of climate change on
Arctic hydrology.
Executive Summary:
The focus of the project was to investigate the overall performance of the WaSiM code,
starting with the computational performance and determining in which routines the time was being
spent. For the initial input, one main hotspot was the AltitudeCorrection function embedded in the
surfacerouting function, which accounted for ~30% of the overall runtime. My work focused on
optimizing the main loop in that function, reducing the overall runtime by about 20% (5590 s original
vs. 4455 s optimized). Discussions with the developers about the discovered hotspots, and the time
spent in the surfacerouting function in particular, resulted in a shift to examine a new input; in
addition, the developer provided a new algorithm that reduced the time spent in the surfacerouting
function by ~10X. The performance analysis of the second input and new algorithm identified
additional optimizations, which resulted in a further ~9% reduction in runtime.
The examination of both OpenMP and MPI scalability is ongoing. The OpenMP scalability
initially showed issues in a few routines (e.g., a 2-3X slowdown going from 1 to 6 OMP threads).
Some of the issues were resolved using the OpenMP reduction support that already existed in some
parts of the code; investigation of the remaining issues is ongoing. The MPI scalability work has just
begun, with performance data collection completed and discussion with the team beginning.
Statement by the PI:
The analyses Laura Carrington conducted on the WaSiM source code were very helpful in finding
some performance bottlenecks which were not related to the parallelization (like in the surface routing
and unsaturated soil zone sub-models) as well as in pointing to some performance issues that are
related to the implementation of the MPI and OpenMP parallelization techniques. This was the first time
someone with a strong background in supercomputer programming looked at the source code with a
focus on parallel performance. A continued effort will surely give us many more hints on how to improve
the overall and in particular the parallel performance of the model.
Chronology:
Describe how the ECSS project was initiated (user submitted a proposal, a ticket, etc) and the
steps used to address the problem as the ECSS project was executed.
The XRAC committee suggested this as an ECSS project, and the PI was contacted with the details
about ECSS and how the project worked. The ECSS staff identified areas to be addressed and began
working with the PI’s team.
Technical Details:
Describe the problem in technical terms along with possible solutions and the actual solution.
Include “before and after” performance charts etc. as applicable.
Computational time analysis:
Input #1 & code v4:
Using the binary instrumentation tool PEBIL, the entry and exit of all functions were
instrumented with timers. The timers produced similar results for a medium-length and a longer-length
data set, which meant the performance analysis could be done on the shorter-running case to save time.
The instrumentation identified the top functions consuming time, and fine-grain timers were placed
within each function to determine where the time was spent. The focus was on AltitudeCorrection,
which is called from GetFixPoint. The optimization performed on that routine and its nested loops was
to break out the computationally intensive for-loop so that no if-thens remain inside the loop, which
allows the compiler to better optimize and vectorize the loop. That routine accounts for ~30% or more
of the computational time; the results of the optimization on the computational time are:
Before: ~2400 s
After: ~1400 s
Overall runtime improvement: ~20% (4455 s vs. 5590 s).
I will continue to look at some of the computational hotspots and see what I can do as well as
see what the compiler report is saying about the changes I made.
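The loop transformation can be sketched as follows. This is a minimal illustration with hypothetical names and logic, not the actual WaSiM routines: a loop-invariant branch is hoisted out of the hot loop so that each branch becomes a simple loop body the compiler can vectorize.

```cpp
#include <vector>

// Before: the loop-invariant branch is re-evaluated on every iteration,
// which inhibits vectorization of the loop body.
double correct_before(const std::vector<double>& t,
                      const std::vector<double>& dz,
                      bool useGradient, double lapse) {
    double sum = 0.0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        if (useGradient)
            sum += t[i] + lapse * dz[i];
        else
            sum += t[i];
    }
    return sum;
}

// After: the invariant test is evaluated once, outside the loop; each
// branch is now a straight-line loop the compiler can vectorize.
double correct_after(const std::vector<double>& t,
                     const std::vector<double>& dz,
                     bool useGradient, double lapse) {
    double sum = 0.0;
    if (useGradient) {
        for (std::size_t i = 0; i < t.size(); ++i)
            sum += t[i] + lapse * dz[i];
    } else {
        for (std::size_t i = 0; i < t.size(); ++i)
            sum += t[i];
    }
    return sum;
}
```

Both versions compute the same result; only the branch placement changes, which is what lets the compiler's vectorizer handle the loop.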
Input #2 & code v10:
Input #2 is a more complex input requiring “spin up” runs to create accurate initial conditions.
V10 replaced the main algorithm in the most expensive computational function with an implicit method
that reduced the overall runtime by ~10X. A similar analysis was performed on the code to identify where
the computational time was being spent. The fine-grain timers identified two routines, CalcSE (294 s =
12%) and CalcdEOverdT (288 s = 13%), that were very similar in their structure and looped over the data
in a similar pattern. The two routines were combined into a single routine, which resulted in an overall
reduction of ~212 s, or a ~9% reduction in the computational time.
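The fusion idea can be sketched as follows. This is an illustrative example with hypothetical names and per-cell logic, not the actual WaSiM functions: two routines that sweep the same data are merged so the data is traversed, and loaded from memory, once per time step instead of twice.

```cpp
#include <vector>

// Hypothetical per-cell state; illustrative only.
struct Cell { double se; double dEdT; };

// Before: two separate sweeps over the same cells.
void calcSE(std::vector<Cell>& cells, double dt) {
    for (auto& c : cells)
        c.se += dt * c.dEdT;       // first pass over all cells
}
void calcdEOverdT(std::vector<Cell>& cells, double k) {
    for (auto& c : cells)
        c.dEdT = k * c.se;         // second pass over the same cells
}

// After: one fused sweep; each cell is loaded and updated once.
// Valid here because each cell's updates depend only on that cell.
void calcFused(std::vector<Cell>& cells, double dt, double k) {
    for (auto& c : cells) {
        c.se += dt * c.dEdT;
        c.dEdT = k * c.se;
    }
}
```

The fusion is legal because the cells are independent; the fused loop roughly halves the memory traffic, which is consistent with the ~9% overall reduction observed.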
OpenMP Scaling:
After manually inserting timers around key computational areas in the code, it was run on 48
cores while varying the number of OMP threads from 1 to 6. The plots below compare the 1-OMP-thread
run to the 6-OMP-thread run.
Each bar is an MPI rank, and the colors in the bar represent the percent of time in each timer.
Both runs use 48 cores: the first plot is 48 cores with 48 MPI tasks and 1 OMP thread per task; the
second plot is 48 cores with 8 MPI tasks and 6 OMP threads per task. Since each bar's height is 100%
and not scaled to total runtime, the runtime comparison is given separately:
 1 OMP thread: 3,135 s
 6 OMP threads: 11,575 s (~2.8X slower)
One would hope that the runtimes of the two would be similar, with maybe the MPI-only run
(omp1) running ~10% faster, so the large slowdown indicated a performance issue.
The timers are located in the unsatzon and surface routing routines. Note that in the first plot the top
color is "MISSING", i.e., time that is not captured by the manually inserted timers (there were 20 in the
code). Below, the two plots are replotted with the only change being that the first plot includes only the
first 8 MPI tasks (ranks 0-7; ranks 8-47 are not plotted), just to make the comparison easier to visualize:
The key points illustrated by the plots are:
 Load imbalance in the unsatzon t0 timer and the surface routing t17 timer. For reference, t0 wraps
the first loop in unsatzonmodellklasse::run; for comparison, t0 omp1 ~2205 s versus t0 omp6 ~3008 s,
which is about 30% slower but not the ~2.8X slowdown seen in the total runtime. So while 30% isn't
great, it isn't bad either. The t17 timer is located at the point in the surface routing where this
comment appears:
"// Gauss Seidel fully implicit iteration (stabel, but many
iterations needed); May be extended later to Crank-Nicholson, if
that turns out to be equally stable but faster"
In this portion of the code the average time is t17 omp1 ~11 s versus t17 omp6 ~2475 s, so this is
where performance suffers under OMP. Discussions with the developers identified an OpenMP
reduction call that could be utilized on Comet; utilizing the reduction call in other areas of the code
produced the results below.
 The MISSING time also becomes a problem: omp1 ~280 s versus omp6 ~5640 s, i.e., it grows from
~10% of the runtime to ~50% of the runtime when using OMP.
Utilizing the -D_OMP3 directive turns on use of OpenMP reduction, which is not available on all
installations of OpenMP. The results from utilizing the OpenMP reduction are illustrated in the plots
below. The first two graphs compare the time breakdown for omp6 and omp6 with the -D_OMP3 flag
(i.e., OMP reduction). The big change is the MISSING time shrinking, with nothing else changing.
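The reduction pattern enabled by -D_OMP3 can be sketched as follows. This is an illustrative example with a hypothetical function name, not WaSiM code: without a reduction clause, threads must serialize updates to a shared accumulator (e.g., via a critical section); with reduction(+:sum), each thread keeps a private partial sum that is combined once at the end of the parallel region.

```cpp
#include <vector>

// Sum a field in parallel. With reduction(+:sum) each thread
// accumulates privately, avoiding serialization on the shared
// variable. Without OpenMP the pragma is ignored and the loop
// runs serially, producing the same result.
double balance_sum(const std::vector<double>& flux) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < static_cast<long>(flux.size()); ++i)
        sum += flux[i];
    return sum;
}
```

Because the flag is a compile-time option, installations whose OpenMP runtime lacks the needed reduction support can simply build without -D_OMP3.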
Next is the comparison of the overall runtime (so the y-axis is runtime) of omp1 vs. omp6 with
-D_OMP3. The reduction in the MISSING time is large, but omp6 still runs ~2X slower than omp1 due
to the inefficiency in t17.
This identified that there is an OpenMP bottleneck in the routine wrapped by the t17 timer.
There are three potential issues being investigated (targeted for the extension):
1. Does use of the OpenMP parallelization result in slower convergence or more iterations?
2. Is there something within the OpenMP usage that is causing serialization?
3. Is there something about the thread placement and scheduling on Comet that is resulting in
the performance bottleneck?
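Issue 2 is plausible given the code comment quoted above: a Gauss-Seidel sweep carries a dependency from one cell's update to the next, which resists naive OpenMP parallelization. The sketch below illustrates the general point on a 1-D relaxation with hypothetical functions; it is not the WaSiM surface routing code. A Jacobi-style sweep reads only old values and parallelizes trivially, though it may need more iterations to converge.

```cpp
#include <vector>

// Gauss-Seidel: x[i] reads x[i-1] as updated in THIS sweep, a
// loop-carried dependency. Parallelizing this loop naively changes
// the results or forces threads to serialize.
void gauss_seidel_sweep(std::vector<double>& x) {
    for (std::size_t i = 1; i + 1 < x.size(); ++i)
        x[i] = 0.5 * (x[i - 1] + x[i + 1]);   // x[i-1] is the NEW value
}

// Jacobi: all reads come from the previous iterate, so each i is
// independent and the loop is safe to parallelize.
void jacobi_sweep(std::vector<double>& x) {
    std::vector<double> old = x;
    // #pragma omp parallel for               // safe here, unlike above
    for (std::size_t i = 1; i + 1 < x.size(); ++i)
        x[i] = 0.5 * (old[i - 1] + old[i + 1]);
}
```

Running one sweep of each on the same input generally gives different interiors, which is exactly the dependency that makes the Gauss-Seidel variant hard to thread.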
MPI Scaling
The performance analysis of the scalability of the code focused on how the time spent in
communication changes as the application is run with more cores. A series of MPI profiling runs were
performed to determine which routines consume more time as the code scales to more cores. The plots
below show the breakdown of computational and communication time for a series of core counts. The
first plot is just looking at the change in runtime as well as change in % communication time as the core
count increases.
The first plot indicates that the runtime at the larger core count of 144 is not scaling as well as at
lower core counts, due to increasing time spent in communication. Detailed plots of two of the runs,
one at 48 cores and one at 144 cores, are shown below.
The plots show the runtime on the y-axis, with the x-axis being individual MPI ranks. Each bar
breaks down the time spent in computation and in the different MPI events, with a further breakdown
of time to send data versus time waiting to send or receive data (i.e., sync). The plots show how the time
spent in MPI reduce increases as a percent of the runtime as the core count increases. The plot for 144
cores indicates that the increase in time is a result of sync time and not the sending/receiving/computing
of the data. Both this and the MPI waitall time identify areas for further investigation, which will be
tackled as part of the extension.
Collaboration details:
The team members are:
ECSS Staff:
Laura Carrington worked on performance analysis of the code and some small optimizations.
PIs team:
Anna Liljedahl (PI) – coordination of the project and contributions to the simulations/inputs identified
for analysis.
Ronald Daanen – aided in getting application running with both inputs (including spin up phase)
and providing information about performance of application.
Jörg Schulla – main developer of the code and provided 5 new versions of the code including a
new algorithm that significantly reduced the runtime.
Anne Gaedeke – provided input data for application on Comet as well as setup for running on
Comet.
Outcome and Recommendations:
The final product of the project is the current version of the code, which includes all
optimizations/changes. The recommendation is to continue to investigate the MPI performance, via
the MPI reduction sync time and the Waitall time, possibly utilizing some overlap of communication
and computation. In addition, identifying the OpenMP bottleneck will be important on many-core
architectures like the Intel Xeon Phi.
Lessons Learned:
Describe any lessons learned while completing this project that other ECSS staff (or XSEDE staff)
may find of value. Describe any plans to write an advanced topic article for documentation or to create a
tutorial based on lessons learned in this project. Consider volunteering to give a talk at the monthly ECSS
symposium.
Impact:
What impact does the completion of this project have on the project team, XSEDE and the
community as a whole?
The WaSiM hydrological model is used by over 50 universities and institutes. The optimizations
and changes made to the code to improve performance will benefit those users.
Publications:
List any publications that have been (or may be) generated from this project or any conferences
at which it might be relevant to present this work.