Evaluation of WRF scaling to several thousand cores on the Yellowstone Supercomputer
Christopher G. Kruse, Davide Del Vento,
National Center for Atmospheric Research, Boulder, Colorado
Raffaele Montuoro,
Texas A&M University, College Station, Texas
Mark Lubin, and
Intel Corporation, Folsom, California
Scott McMillan
Intel Corporation, Champaign, Illinois
ABSTRACT
Benchmarking and scaling assessments were performed on the Yellowstone supercomputer at the
NCAR-Wyoming Supercomputing Center using the Weather Research and Forecasting model.
Two large test cases simulating Hurricane Katrina at 1-km and 3-km resolutions were used as
workloads for benchmarking. MPI-only and hybrid MPI-OpenMP parallelizations of the WRF
model were compared and found to deliver comparable simulation speeds (simulated time/wall clock time) when
two or more MPI tasks were placed per node and cores were not oversubscribed with threads. Intel and IBM
Parallel Environment MPI implementations were also tested and compared. Simulation speed was
found to scale nearly linearly through 16K cores, with appreciable, though sublinear, gains at higher core
counts. While compute time decreased with increasing core counts, the time to complete operations involving
I/O (e.g., processing of initial and boundary conditions, writing output) using default I/O settings increased
with core count, overwhelming the gains in computation time beyond 2K cores for the 1-km case.
1. Introduction
The Weather Research and Forecasting (WRF) model is
known to scale well to large numbers of cores on a variety of
architectures (http://mmm.ucar.edu/wrf/WG2/benchv3),
including specialized architectures such as BlueGene/L
(Michalakes et al. 2008). The objective of this study
was to demonstrate WRF scalability to several thousand cores on commodity supercomputers using Intel compilers on the Yellowstone supercomputer at
the NCAR-Wyoming Supercomputing Center (NWSC)
(NCAR 2013). Scalability depends on a suitably sized
workload. The standard WRF benchmark workloads
(http://mmm.ucar.edu/wrf/WG2/benchv3) are too small
to show reasonable scalability to several thousand cores.
Therefore, new workloads based on high-resolution simulations of Hurricane Katrina were used for this study.
2. The WRF Model
WRF is an open-source numerical weather prediction (NWP) model used in operational, research, and educational settings. It is a mesoscale model, but is still suitable for simulating phenomena with scales ranging from meters to thousands of kilometers with proper parameterization selection. WRF has been collaboratively developed by the National Center for Atmospheric Research (NCAR), the National Oceanic and Atmospheric Administration (NOAA), the Air Force Weather Agency (AFWA), the Naval Research Laboratory (NRL), the University of Oklahoma, and the Federal Aviation Administration (FAA). WRF is currently in operational use at NCAR, the National Centers for Environmental Prediction (NCEP), AFWA, and other centers and has a large user base in the environmental sciences.
WRF features two dynamical solvers, a variational ensemble data assimilation system (Barker et al. 2012), and a software architecture allowing for computational parallelism with distributed and/or shared memory systems. The two dynamical solvers are the Advanced Research WRF (ARW) solver, developed and maintained by the Mesoscale and Microscale Meteorology (MMM) division at NCAR, and the Nonhydrostatic Mesoscale Model (NMM) solver developed by NCEP. It has an extensible design, which makes it possible to easily add physics, chemistry,
hydrology models, and other features contributed by the
research community. It can be configured to simulate real-world cases, that is, to produce actual weather simulations using observations to provide initial and boundary conditions and topography. Idealized cases (e.g., two-dimensional cases with specified terrain and atmosphere, or cases where an analytical solution is known) can also be simulated and are useful for both educational and testing
purposes. In this study, the ARW solver was used to simulate Hurricane Katrina. For an extensive description of
WRF version 3, see Skamarock et al. (2008).
The WRF model was at version 3.3.1 (released 22
September 2011) at the time this work started; however,
performance and scaling presented in this paper were evaluated using the latest version (3.5, released 18 April 2013).
a. Compilation and Cluster
WRF version 3.5 was built using Intel Composer XE
2013 (Intel 2013a), and was compiled in distributed memory and hybrid distributed/shared memory modes. The
performances of both builds were investigated. Version 4.2
of the netCDF library was used with this version of WRF.
The key compiler options were “-O3 -align all -fno-alias
-fp-model precise”.
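For reference, these options are typically applied through the compiler-flag variables in the configure.wrf file generated by WRF's configure script; the sketch below assumes the standard FCOPTIM and FCBASEOPTS variables and is illustrative rather than a listing of the actual build configuration used here.
FCOPTIM    = -O3
FCBASEOPTS = -align all -fno-alias -fp-model precise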
Distributed and shared memory parallelism is implemented with the Message Passing Interface (MPI) and
the OpenMP application programming interface (API), respectively. MPI is “a message-passing library interface
specification” (MPI 2012) and is the dominant programming model for high performance computing on distributed
memory systems. MPI itself is not an implementation;
there are multiple implementations of the MPI specification. The OpenMP API is a collection of compiler
directives, library routines, and environment variables which can be used to specify shared memory parallelism (OpenMP Architecture Review Board 2011). Intel MPI Library (Intel 2013b) version 4.1.0.030 and IBM Parallel Environment (PE) Developer Edition (Quintero et al. 2013)
version 1.2.0.9 MPI implementations were benchmarked in
this study with Intel compilers. The Intel implementation
of the OpenMP API specification version 3.1 was used for
shared memory parallelism.
The benchmark workload was run on the new Yellowstone supercomputer at NWSC located just west of
Cheyenne, Wyoming. Yellowstone consists of 4,518 compute nodes, each with two Intel Xeon E5-2670 (“Sandy Bridge”) processors (eight cores each), for a total of 72,288
cores. The nodes are connected using a fat-tree topology
InfiniBand network. Default settings were used for both
MPI implementations, with exceptions for the large runs
(≥4K cores) with the Intel MPI Library. DAPL version
2.0.36 was built for Yellowstone and used with the Intel
MPI Library. For the large runs with the Intel MPI Library, the DAPL Unreliable Datagram mode (DAPL UD) was used. DAPL UD improves scalability by requiring significantly less memory for connection management and is more tolerant of transient network issues. For these runs, the default DAPL UD values were modified to further reduce memory usage and increase the timeout settings (see Appendix A).
Minor modifications were made to the WRF source code in order to facilitate this study. In order to run WRF on 64K cores, the WRF source had to be modified, as it is limited to 10,000 MPI processes by default. To use more MPI processes, the value of RSL_MAXPROC was increased in external/RSL_LITE/rsl_lite.h, and all instances of rsl.out.%04d and rsl.error.%04d in external/RSL_LITE/c_code.c were modified accordingly. Additionally, the Known Problems and Fixes patch on the WRF Model Users Site (http://www.mmm.ucar.edu/wrf/users/wrfv3.5/knownprob-3.5.html, accessed 19 June 2013) was applied.
Disk I/O was problematic at core counts at and above 16K using default namelist settings. In order to investigate scaling of simulation speed, defined as the ratio of simulated time to wall clock time, namelist variables were modified to prevent writing output. These modifications were useful because the default I/O algorithms took many hours at high core counts (shown and discussed below) and writing output was not needed to evaluate simulation speed. While not the focus of this study, it should be noted that there exist a few useful namelist options to address the anti-scaling of algorithms involving disk I/O within WRF (e.g., splitting netCDF history files per MPI task, parallel netCDF support, parallel HDF5 support).
3. The Katrina Workload
Sensitivity studies of simulated strength, track, and
structure of Hurricane Katrina to atmospheric model resolution performed at Texas A&M University (Patricola et al.
2012) provided the opportunity to use a reliable set of simulation cases of a suitable size for scaling WRF to thousands
of cores. Input initial and boundary conditions for those
cases were obtained from NCEP Climate Forecast System
Reanalysis (Saha et al. 2010) data over a region encompassing the Gulf of Mexico, with 00 UTC on 25 August
2005 as the starting simulation time (Hurricane Katrina
made landfall on the gulf coast on 29 August 2005).
To be able to investigate performance scaling of WRF
on nearly the entire Yellowstone supercomputer, two cases
with high horizontal resolutions of 3-km and 1-km were
selected. In the 3-km workload, the horizontal domain
contained 1,396 x 1,384 (NX x NY) horizontal grid points
and the integration was performed using a time step of
10 seconds. The 1-km workload used 3,665 x 2,894 horizontal grid points with a time step of 3 seconds. In both
workloads, 35 terrain-following vertical levels were specified, with the 1-km workload containing approximately 5.5 times more grid points than the 3-km workload. Physical parameterizations used with these simulations are provided in Appendix B.
4. Results and Discussion
a. Hybrid Parallelization
Hybrid parallelization of the Hurricane Katrina workload was investigated on 256 nodes of Yellowstone. In this case, WRF was compiled with the distributed and shared memory option (“dm+sm”) and the compiler flags mentioned above. The number of MPI tasks per node and the number of OpenMP threads per task were selected such that cores were not oversubscribed with threads or tasks. Figure 1 shows the result of increasing the number of tasks per node while correspondingly decreasing the number of threads per task.
Initially, performance was degraded using any number of threads per task > 1, with simulation speed decreasing with increasing thread counts per task (Fig. 1, No Binding). This performance degradation was attributed to a lack of task affinity, where tasks could have been randomly moved at runtime by the operating system onto either
of the two sockets in each node, causing context switching and thrashing of the cache. To address this issue, the environment variable “MP_TASK_AFFINITY” was set to “core:${OMP_NUM_THREADS}” in order to bind each task, and the threads it spawns, to a dedicated set of cores within a single socket. With task binding enabled, performance was significantly increased for the runs with numbers of threads per task > 1 (Fig. 1, Binding).
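For reference, a minimal sketch of how these settings might appear in a job script for runs with eight OpenMP threads per task is given below; the export syntax is illustrative, while the two variables and their values follow the description above.
export OMP_NUM_THREADS=8
export MP_TASK_AFFINITY=core:${OMP_NUM_THREADS}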
Maximum simulation speeds for each combination of tasks and threads were comparable, except where only one task was subscribed to each node with thread binding. The highest simulation speed was achieved with eight tasks per node and two threads per task; however, the performance was highly variable. The least variable runs were achieved using 2 tasks per node with 8 threads per task and 16 tasks per node with 1 thread per task. The runs using two tasks per node, effectively one task per socket, were 4% faster than the effectively “MPI-only” runs.
Additionally, performances of “MPI-only” runs conducted using the hybrid (“dm+sm”) and MPI-only (“dm”) builds, using only the IBM PE implementation, were compared (not shown). Simulation speeds between the two builds for a given core count were comparable, with differences of approximately 2%. Given the run-to-run variability discussed below, it was concluded that the hybrid and MPI-only builds of WRF were comparable, with no significant performance difference in terms of simulation speed.
While the highest simulation speeds were achieved using hybrid parallelization, variability was also high except for the runs using two tasks per node. Given the comparable speeds between the “MPI-only” runs and the two-task hybrid runs, MPI-only parallelization was chosen to assess scaling to several thousand cores on Yellowstone, in order to conform to previous benchmark assessments. Scaling of WRF using two tasks per node will be investigated further in the future. MPI-only builds of WRF were used to conduct the scaling assessment presented below.
Fig. 1. Simulation speeds (simulated time/wall clock time) of the Katrina 1-km workload on 256 nodes of the Yellowstone supercomputer with different combinations of the number of MPI tasks per node and the number of OpenMP threads per task. Runs with and without thread binding are shown.
b. Scaling Assessment
Figure 2 shows the scalability of the Katrina 1-km workload simulation speed on the Yellowstone cluster up to 65,536 cores. It is important to note that timings for disk I/O and processing of initial and boundary conditions are not included in the calculation of simulation speed in Fig. 2. For individual runs, the average simulation speed over all time steps was calculated and used to compare between runs. For both MPI implementations, four runs were conducted at core counts of 16K and less. Three benchmark runs were conducted using both implementations at 32K cores, while only three runs were successfully conducted at 64K cores (two IBM PE MPI, one Intel MPI). Only the fastest runs using 16K cores and less were included in Fig. 2, whereas all runs using 32K and 64K cores were included in the figure.
Run-to-run variability was small using the Intel MPI Library,
with relative standard deviations (standard deviation/mean) in simulation speed near 3.5% for the 16K and 32K runs and less than 0.2% for smaller core count runs. Run-to-run variability was generally higher using the IBM PE MPI implementation, with relative standard deviations as high as 14% and no clear trend with core count. It was not concluded that the Intel MPI implementation is better with respect to variability, given the small number of runs conducted with both implementations. In earlier runs, even larger variabilities were observed, which were thought to be due to a variety of issues including defective InfiniBand cables, other network issues, and occasional contention due to other jobs running on the machine. In one instance, it was found that a single bad node not identified by the system health checks slowed down all the runs that happened to use it by about 50%. All these issues were particularly prevalent during the first months after the opening of NWSC.
Fig. 2. Scaling assessment of the Katrina 1-km workload on Yellowstone.
The Katrina 1-km workload scaled approximately linearly through 16K cores (≈647 grid points assigned to each core) for both MPI libraries (Fig. 2). Between 16K and 64K cores, the scaling ceased to be linear, but appreciable increases in simulation speed were still observed. This reduced slope is a result of how WRF parallelizes a given workload. Domains are decomposed into horizontal “patches,” with each MPI task responsible for a single patch. With one task per core, increasing the number of
cores decreases the number of grid points assigned to a particular core. Calculations conducted on perimeter or halo
points in each patch require communication with neighboring patches during each time step. Computation time can
be overwhelmed by communication time at each time step
when the number of grid points per patch becomes small
enough, which decreases model performance.
Per core performance was investigated using both workloads in terms of horizontal grid points per core. Since different time steps were used between the two workloads, the
3-km workload was advanced seven seconds further in time
than the 1-km workload each time step, resulting in simulation speeds not being comparable. Instead, the number of
time steps completed per second was used to evaluate performance and compare fairly between the two workloads.
It is important to note that the number of vertical levels was the same between the two workloads, which would
result in the same number of total grid points assigned to
each processor for a given number of horizontal grid points
per core. Also, the physical parameterizations were identical between the two workloads. While these data might
be used to approximate performance of a different simulation with a given number of horizontal grid points, cores,
and vertical levels, the performance of a run with different parameterization selections will differ.
Figure 3 illustrates the comparison in scaling between the two workloads. Scaling is nearly linear for both workloads when at least ≈500 grid points are assigned to each core, with comparable performances. When patches are reduced to fewer than ≈500 grid points per core, performances of both workloads do not scale linearly, but appreciable gains in performance
were observed. Scaling at these numbers of grid points per
core was more nonlinear for the 1-km workload, which was
attributed to this workload requiring a significant portion
of the supercomputer compared to the 3-km workload. The
3-km workload likely had a more closely grouped network
topology, reducing the communication time per time step
relative to the 1-km workload.
While the computation time (simulation speed) decreased (increased) with increasing numbers of cores
(Fig. 4c, Fig. 3), it was found that the relationship between total time to complete the workload and core count
was neither linear nor monotonic (Fig. 4a). Total time decreased between 512 and 2K cores and increased beyond
2K cores. While the computation time does indeed scale
well with increasing numbers of cores, increases in time for
processing the input file and boundary conditions (Fig. 4b)
and for writing a single output file (Fig. 4d, note the differing vertical scales) overwhelm the benefits of increasing
numbers of cores beyond 2K cores on Yellowstone.
To address anti-scaling of writing output, history file
splitting was enabled by setting the namelist variable “io_form_history” to “102”, resulting in one output file per MPI task. Using 8K tasks on Yellowstone,
the time to write out one time step was decreased from approximately 45 minutes to 0.24 seconds. Similarly, initialization time can also be reduced by creating individual input files for each MPI task, accomplished by setting “io_form_input” also to “102” and rerunning “real.exe”. Initializing WRF with many input files was not investigated. While the time for WRF routines involving disk I/O can be significantly decreased by splitting up the output files, as of WRF v3.5 there are no supported post-processing utilities available that handle these split files.
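For illustration, a minimal sketch of how these options might appear in the &time_control section of namelist.input is given below; the section and file names follow standard WRF namelist conventions and are assumptions here rather than a listing of the namelist actually used in this study.
&time_control
 io_form_history = 102,
 io_form_input   = 102,
/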
Fig. 3. Per core scaling assessment in terms of number of
time steps completed per second. Performances for both
resolutions of the Katrina workload are presented, with
both workloads having 35 vertical levels and identical physical parameterizations. The linear fit is approximate, being
fit “by eye” to emphasize deviations from linear scaling.
Fig. 4. Timing for WRF as a function of the number of cores,
broken down by (a) total time, (b) processing time (time to
process the input file and boundary conditions), (c) compute time (time to simulate one hour), and (d) time to
write one output file.
5. Conclusions
WRF is a widely used NWP model. The ability to
perform larger and more detailed simulations depends
on the ability of the code to effectively use large-scale
computational resources, which are becoming increasingly
common even at relatively small supercomputing centers
(www.top500.org). This paper demonstrates the efficient
parallel implementation of WRF, the high performance
scalability of both the Intel and IBM PE MPI implementations, and the scalability of HPC systems, such as Yellowstone at NWSC with Intel Sandy Bridge processors.
Running WRF at very high resolutions for large spatial
domains and extremely large core counts presents unique
challenges and may not yet be appropriate for daily production runs. With respect to the WRF model, initialization and I/O times need to scale better. Future work will focus
on domain decomposition and investigating resulting communication to computation ratios and model performance.
Also, scaling with two tasks per node and eight threads per task, with task binding, will be investigated to see whether the small improvements noted at 4K cores are applicable to all
core counts.
Acknowledgments.
The authors are very grateful to C. M. Patricola, P.
Chang, and R. Saravanan (TAMU) for kindly providing
the Katrina workload. Dave Gill was also extremely helpful with feedback and suggestions to address anti-scaling of
model initialization and writing output. Additionally, R.
Montuoro would like to acknowledge support from the Office of Science (BER) and the U.S. Department of Energy,
through Grant No. DE-SC0006824, and C. Kruse would like
to acknowledge support through the Summer Internships
in Parallel Computational Science program from Rich Loft,
NCAR, and the NSF.
APPENDIX A
Environment variable settings used for large runs with Intel MPI and DAPL
The following environment variables were specified to reduce memory usage and to address connection issues encountered using ≥4K cores.
I_MPI_DEBUG=2
I_MPI_FALLBACK=disable
I_MPI_FABRICS=shm:dapl
I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx4_0-1u
I_MPI_DAPL_UD=on
I_MPI_DAPL_UD_RNDV_EP_NUM=2
I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=65536
I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
I_MPI_DAPL_UD_CONN_EVD_SIZE=2048
I_MPI_DAPL_UD_REQUEST_QUEUE_SIZE=80
DAPL_UCM_REP_TIME=16000
DAPL_UCM_RTU_TIME=8000
DAPL_UCM_RETRY=10
DAPL_UCM_QP_SIZE=8000
DAPL_UCM_CQ_SIZE=8000
DAPL_UCM_TX_BURST=100
DAPL_MAX_INLINE=64
DAPL_ACK_RETRY=10
DAPL_ACK_TIMER=20
DAPL_RNR_TIMER=12
DAPL_RNR_RETRY=10
APPENDIX B
Physical parameterization namelist settings for both Katrina workloads
&physics
 mp_physics           = 2,
 ra_lw_physics        = 4,
 ra_sw_physics        = 5,
 radt                 = 3,
 ra_call_offset       = 0,
 cam_abs_freq_s       = 21600,
 levsiz               = 59,
 paerlev              = 29,
 cam_abs_dim1         = 4,
 cam_abs_dim2         = 35,
 sf_sfclay_physics    = 1,
 sf_surface_physics   = 2,
 bl_pbl_physics       = 1,
 bldt                 = 0,
 cu_physics           = 0,
 cudt                 = 5,
 isfflx               = 1,
 ifsnow               = 0,
 icloud               = 1,
 surface_input_source = 1,
 num_soil_layers      = 4,
 sf_urban_physics     = 0,
 maxiens              = 1,
 maxens               = 3,
 maxens2              = 3,
 maxens3              = 16,
 ensdim               = 144,
 sst_update           = 1,
 usemonalb            = .true.,
 fractional_seaice    = 1,
/
REFERENCES
Barker, D., et al., 2012: The Weather Research and Forecasting model’s community variational/ensemble data
assimilation system. Bull. Amer. Meteor. Soc., 93, 831–
843.
Intel, 2013a: Intel Composer XE 2013. http://software.intel.com/en-us/intel-composer-xe.
Intel, 2013b: Intel MPI Library. http://www.intel.com/go/mpi.
Michalakes, J., et al., 2008: WRF nature run. J. of
Physics: Conference Series, 125, doi:10.1088/1742-6596/125/1/012022.
MPI, 2012: MPI: A Message-Passing Interface Standard, version 3.0. http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf.
NCAR, 2013: NCAR-Wyoming Supercomputing Center.
http://www2.cisl.ucar.edu/resources/yellowstone.
OpenMP Architecture Review Board, 2011: OpenMP Application Program Interface, version 3.1. http://www.openmp.org/mp-documents/OpenMP3.1.pdf.
Patricola, C. M., P. Chang, R. Saravanan, and R. Montuoro, 2012: The effect of the atmosphere-ocean-wave interactions and model resolution on Hurricane Katrina in a coupled regional climate model. Geophys. Res. Abs., 14, http://meetingorganizer.copernicus.org/EGU2012/EGU2012-11855.pdf.
Quintero, D., A. Chaudhri, F. Dong, J. Higino, P. Mayes, K. Sacilotto de Souza, W. Moschetta, and X. T. Xu, 2013: IBM Parallel Environment (PE) Developer Edition. http://www.redbooks.ibm.com/redbooks/pdfs/sg248075.pdf.
Saha, S., et al., 2010: The NCEP climate forecast system
reanalysis. Bull. Amer. Meteor. Soc., 91, 1015–1057.
Skamarock, W. C., et al., 2008: A description of the Advanced Research WRF version 3. NCAR Tech. Note NCAR/TN-475+STR, 113 pp. [Available online at http://www.mmm.ucar.edu/wrf/users/docs/arw_v3.pdf.]