Profiling Methodology and Performance Tuning of the Met Office Unified Model for Weather and Climate Simulations

Peter E. Strazdins∗, Margaret Kahn†, Joerg Henrichs‡, Tim Pugh§, Mike Rezny¶
∗ School of Computer Science, The Australian National University, Email: [email protected]
† NCI National Facility, Australia, Email: [email protected]
‡ Oracle, Email: [email protected]
§ Bureau of Meteorology, Australia, Email: [email protected]
¶ Monash Weather and Climate, Monash University, Email: [email protected]

Abstract—Global weather and climate modelling is a compute-intensive task that is mission-critical to government departments concerned with meteorology and climate change. The dominant component of these models is a global atmosphere model. One such model, the Met Office Unified Model (MetUM), is widely used in both Europe and Australia for this purpose. This paper describes our experiences in developing an efficient profiling methodology and scalability analysis of the MetUM version 7.5 at both low-scale and high-scale atmosphere grid resolutions. Variability within the execution of the MetUM and variability of the run-time of identical jobs on a highly shared cluster are taken into account. The methodology uses a lightweight profiler internal to the MetUM, which we have enhanced to have minimal overhead and which enables accurate profiling with only a relatively modest usage of processor time. At high-scale resolution, the MetUM scaled to core counts of 2048, with load imbalance accounting for a significant fraction of the loss from ideal performance. Recent patches have removed two relatively small sources of inefficiency. Tuning internal segment size parameters gave a modest performance improvement at low-scale resolution (such as those used in climate simulations); this was, however, not significant at higher scales. Near-square process grid configurations tended to give the best performance. Byte-swapping optimizations vastly improved I/O performance, which in turn has a large impact on performance in operational runs.

Keywords-weather prediction, climate modelling, parallel computing, performance analysis, high performance computing.

I. INTRODUCTION

The Met Office Unified Model (MetUM, hereafter referred to as simply 'UM') [1] is a global atmospheric model developed by the UK Met Office (MetO), which has been in operational use since the early 1990s. Its name signifies that it can be used for both weather and climate prediction. As well as in the UK, the UM is used for both of these purposes in South Korea, India, South Africa, and New Zealand; recently it has also been adopted by the Bureau of Meteorology (BoM) in Australia [2]. When configured for modelling the whole earth's atmosphere, the UM models the atmosphere using a rectangular grid with variable distance between grid points to account for the curvature of the earth. The N512L70 grid has 1024 × 769 × 70 grid points in the east-west, north-south and vertical dimensions, respectively. This grid and the N320L70 grid (640 × 481 × 70) are typically used for weather prediction; smaller grids such as the N96L38 (192 × 145 × 38) are used for climate prediction. In general, doubling the horizontal resolution has an 8-fold effect on the computational work. This is because the timestep may also need to be halved, as a restriction of the model is that mass may not be transported (e.g. through advection) across more than one grid point in a single simulated timestep.
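The factor of 8 can be made explicit (this is our own back-of-the-envelope reading of the statement above, with the vertical resolution held fixed): doubling the horizontal resolution doubles the number of grid points in each horizontal direction, and halving the timestep doubles the number of timesteps, so

\[
\frac{(2N_x)\,(2N_y)\,N_z \cdot (2N_t)}{N_x\,N_y\,N_z\,N_t} \;=\; 8,
\]

where N_x and N_y are the numbers of east-west and north-south grid points, N_z the number of vertical levels, and N_t the number of timesteps.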
For current operational use in weather prediction, the BoM uses UM codes (version 6.4) on a global earth grid at a resolution of N144L50. For 2011, it is desired to use at first an N320L70 grid and later an N512L70 grid for greater accuracy (forecast 'skill') [3]; the operational target is 24 hours of model simulation time in 500 seconds or less, using fewer than 1000 cores on an Intel/Infiniband cluster. Such a short run time is required for timely delivery of forecasts and for 'ensemble' simulations, in which, for example, 24 simulations, each with perturbed input data, are performed to generate a representative sample of the possible future atmospheric states for probability assessment. It should be noted that the operational simulation time for weather prediction is typically 10 days; a 24 hour target is deemed sufficient to project to this time. The version of the UM to be used is 7.5 (released by the UK Met Office in April 2010) [4], which contains the first implementation of multi-threaded MPI code [5] and is optimized for IBM Power processors.

The Australian Community Climate and Earth System Simulation (ACCESS) project [6] is a joint venture between CSIRO, the BoM and five Australian universities. The project is founded on the principle that, by using a common model infrastructure, larger climate science questions can be tackled by the national community. ACCESS is scheduled to contribute to the fifth assessment of the Intergovernmental Panel on Climate Change in 2011. Here an N96L38 grid is typically used, with the UM as the computationally dominant component, coupled to ocean, land surface and sea ice models. The target core count is 96 for the UM, and 25 cores for the other models. Next-generation short-term UK climate models are scheduled to use an N216L85 grid, projected to scale up to an N320L70 grid in 2012 [7].

Thus, large amounts of supercomputer time at sites across Australia and beyond will soon be committed to running the UM. Without an understanding of its performance, users are likely to run the code in 'out-of-the-box' configurations, potentially making poor use of available resources. This paper reports our experiences in profiling and tuning UM 7.5 code performance at the N96L38, N320L70 and N512L70 grid resolutions, with the motivation of understanding and improving performance in order to meet the above targets. We used the vayu cluster at the National Computational Infrastructure (NCI) National Facility [8], which is an Intel/Infiniband cluster.

This paper is organized as follows. Section II describes the Unified Model and its code structure. It also describes its internal timer module, which becomes an important tool in our performance evaluation methodology. Section III gives relevant details of the vayu cluster, and the constraints we have for this project on its usage. Section IV describes our preliminary experiences in performance analysis of the UM on the vayu cluster. From a clearer understanding of the properties and limitations of both, a performance evaluation methodology is proposed in Section V. Results involving a scaling analysis are given in Section VI and those involving parameter tuning are given in Section VII. Related work is discussed in Section VIII, with concluding remarks from our experiences given in Section IX.

II. THE UNIFIED MODEL

Part of the Unified Model's widespread appeal is its flexibility, allowing it to be used for both weather and climate prediction (short and long timescales).
It can be configured to run as a regional or a global model, and can be coupled with ocean and other earth system models [9]. A UM benchmark is configured using the Unified Model User Interface (UMUI). This produces a directory containing the source code and data files. According to the configuration, compilation is conditional (using cpp directives). Preprocessing is used to select at compile time one from a number of different algorithms for major sections of the model, and to select architectural parameters (e.g. little- vs. big-endian). Once built, it is difficult to run the executable on input data substantially different (e.g. a different grid resolution) from what it was configured for.

Figure 1. Surface temperature input data of the model (date 27/05/10)

Like other atmospheric models, many of the components of the UM are highly data-dependent; thus, any benchmark must use realistic atmospheric data. For global atmosphere simulation, the main input data file is a 'dump file' containing a snapshot of the entire atmospheric state. Figure 1 gives the surface temperature from the input data file used for the N320L70 grid; this file is approximately 1.5 GB in size. When used in production, the model dumps statistics on the atmosphere's state over the simulation; this is handled by the Spatial and Temporal Averaging and Storage Handling (STASH) sub-system. Other input files include configurations for STASH, specification of the short-wave and long-wave radiation parameters, and namelist files for configuring other aspects of the simulation (these last have approximately one thousand parameters). When running on a grid of P × Q distributed memory processors, the horizontal dimensions of the grid are divided evenly over the processor grid. As well as its own block of grid-point columns, each processor will also have a 'halo' of surrounding grid points.

A. UM Code Structure

The UM 7.5 codes consist of approximately 900,000 lines of Fortran 90 source code. Reflecting the long evolution of the model, predominantly Fortran 77 features are used. The code is pre-processed via cpp; this is mainly used for code declarations which are often repeated, such as common block declarations and 'standard' sub-lists of parameter names used for subroutine calls. The top-level control routine for the model is called um_shell() [9]. This initializes the parameters used for the simulations (namelist files initializing common blocks and environment variables). The main simulation routine, u_model(), is then called. This allocates the major arrays and reads in the data files. It then iterates over the required timesteps, controlling I/O and transfer of data to model couplers (if required). At each iteration, the atm_step() routine is called, which undertakes the simulation of the atmosphere (both dynamics and physics) over a single timestep. For large grids, the computationally dominant part of this process is applying a solver to the Helmholtz pressure-temperature equation for acoustic wave dynamics: this is performed using a Generalized Conjugate Residual (GCR) iterative solver on a tridiagonal 'linear system'. This system is essentially that of the vertical dimension, but replicated across all horizontal mesh points. The dominant communication patterns in this part of the computation are collective communications (row, column and global), principally barriers and reductions.
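As a rough illustration of the call structure described in this subsection, the following is a heavily simplified, hypothetical sketch of the driver hierarchy (routine names follow the description above, but the bodies and arguments are placeholders; this is not UM code):

```fortran
! Hypothetical sketch of the UM driver hierarchy described in Section II-A.
! The real routines take many arguments and are driven by ~1000 namelist
! parameters; here everything is reduced to stubs.
program um_shell_sketch
  implicit none
  ! (namelist reading / common-block initialisation by um_shell omitted)
  call u_model(n_steps=12)           ! e.g. a short benchmark run of 12 timesteps

contains

  subroutine u_model(n_steps)
    integer, intent(in) :: n_steps
    integer :: step
    call read_dump()                 ! read the initial atmospheric state
    do step = 1, n_steps
       call atm_step(step)           ! dynamics + physics for one timestep
       ! STASH output and coupler exchanges would also be driven from here
    end do
  end subroutine u_model

  subroutine read_dump()
  end subroutine read_dump

  subroutine atm_step(step)
    integer, intent(in) :: step
    ! semi-Lagrangian advection, physics (radiation, convection, ...),
    ! and the Helmholtz pressure-temperature solve (GCR iterations)
  end subroutine atm_step

end program um_shell_sketch
```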
Other parts of the computation also involve collective communications, together with a significant amount of point-to-point communication with the eight neighboring horizontal points (NS, EW and diagonals); the latter form the so-called 'halo exchanges'.

B. Internal Profiler

The UM includes a profiling module which may be activated through namelist configuration [4]. It times the computation of the major subroutines using high-resolution wallclock and CPU timer functions. As well as recording the time for a whole subroutine (called an 'inclusive timer'), it also measures the parts of the subroutine excluding the calls that it makes which are themselves timed (called a 'non-inclusive' timer). u_model() is the top-level non-inclusive timer; the sum of all non-inclusive timers thus gives the time of the call to u_model(), which is effectively the total execution time for the simulation. Approximately one hundred of each kind of timer are profiled. For the timers that are called globally, all processors are synchronized at the start of a timer call. As well as giving time totals for each such timer (subroutine), the profiler also records the number of calls made. Statistics of the time totals across all processing elements (PEs) are also gathered; these include the mean, median, maximum and minimum times across processes. A 'parallel speedup' for each subroutine is also provided; this is defined to be the total CPU time divided by the maximum wall time (this is not meaningful on systems where MPI wait time is included in the CPU time, such as vayu). The output is sorted in terms of wallclock times. Figure 2 indicates these statistics for a 9-timestep UM simulation on an N512L70 grid; the read_dump() subroutine, which reads the dump file, is called only once; most others are called 9 times. It can be seen that execution time is dominated by the Helmholtz solver.

An estimate of load imbalance can be obtained from the differences between the maximum and average of each timer across all processors. This will be an underestimate in the sense that the waiting time in communication due to load imbalance within a function will not be included; nonetheless, it still provides a useful lower bound. The overhead of the timer routine itself is also recorded (it has been negligible in all our experiments). However, this does not include the synchronization at the start of the timer call; that does, however, get included in the time of the enclosing ('non-inclusive') timer. The timer output is given at the end of a run; it is also possible to get timer output for a number of intermediate regions by inserting the appropriate timer routine calls in the source code.

III. THE vayu CLUSTER

The vayu cluster has 1492 nodes consisting of Sun X6275 blades, each with two quad-core 2.93 GHz Intel X5570 Nehalem processors and cache sizes of 32 KB (per core) / 256 KB (per core) / 8 MB (per socket). The available physical memory per node is 24 Gbytes, excepting 48 nodes which have 48 Gbytes. The interconnect is single-plane QDR Infiniband, with a measured MPI latency of 2.0 µs and unidirectional bandwidth of 2600 MB/s per node [8]. The nodes are configured with a CentOS 5.4 Linux distribution with the 2.6.32 kernel. Parallel jobs are launched using a locally modified version of the Portable Batch System (PBS). Maximum walltime and virtual memory limits must be specified upon submission. The scheduler (by default) allocates 8 consecutively numbered MPI processes to each node allocated to the job.
Where possible, contiguous sequences of nodes are allocated to a job to minimize interference between jobs. Jobs may be suspended, e.g. to make room for a job requiring a large number of cores. A typical snapshot of the main job queue (taken at noon in mid-December) is: 1216 running jobs (465 suspended), 280 queued jobs, 11776 cpus in use. Run-time files used in our benchmarks, such as the dump file and output from runs, are on a Lustre file system. Our 'project' was restricted to use no more than 2048 cores and was allocated a few thousand CPU hours of cluster usage.

IV. PRELIMINARY EXPERIENCES ON THE N320L70 BENCHMARK

Our initial experiments were run with a UM N320L70 benchmark compiled with the Intel Fortran compiler 11.1.072 and using OpenMPI 1.4.1 with processor affinity enabled. In order to simplify profiling, STASH output was disabled. Due to memory constraints, it was not possible to run the benchmarks on fewer than 2 nodes.

Figure 2. Extract of timer output: wallclock time statistics for an N512L70 benchmark with 64 processes for a 1.5 simulated-hour run

ROUTINE          MEAN    MEDIAN    SD    % of mean    MAX    (PE)    MIN    (PE)
PE_Helmholtz     206.97  206.98   0.05     0.02%     207.02   (7)   206.85  (48)
SL_Full_wind      27.54   24.09   6.91    25.09%      44.78   (3)    23.99  (12)
ATM_STEP          36.39   38.53   9.46    25.99%      44.60  (44)    12.35   (2)
SL_Thermo         25.38   26.60   3.45    13.58%      31.15   (3)    21.05  (44)
READDUMP          24.18   24.36   1.12     4.62%      24.37  (58)    15.76   (0)
Convect           16.64   18.14   2.22    13.34%      18.87   (0)    12.48  (63)
LW Rad            10.19   10.21   0.20     1.96%      10.65  (14)     9.72  (40)
Atmos_Physics2     7.66    6.85   1.92    25.07%      11.72  (53)     5.20  (33)
LS Rain            5.23    4.91   1.48    28.22%       9.70   (2)     2.68  (40)
SW Rad             5.41    7.11   3.27    60.47%       8.90  (59)     1.09   (0)

To profile individual subroutines, a tool such as Oracle Solaris Studio collect, in conjunction with MPI profiling and hardware performance counters, would have been ideal. This would in principle enable a performance analysis based on both communication and memory hierarchy issues. However, the Linux kernel available at that time did not support performance counters. While we were able to successfully use collect for the N96L38 benchmark for walltime profiling, we were unable to do so for the N320L70, due to the sheer size of the generated data. The MPI profiling tool IPM [10] was also attempted; this only completed successfully on 256 or fewer cores. For a 5 hour simulation on a 16 × 16 process grid, IPM reported that 44% of the run time was spent in MPI communication, dominated by MPI_Barrier() (14%), MPI_Scatterv() (9%, called only from a subroutine called q_pos_ctl()), and MPI_Allreduce() (7%). It became evident that the size and complexity of the N320L70 benchmark would require a more lightweight approach. The UM internal profiler, as described in Section II-B, was employed. To conserve usage quotas, the simulations were generally limited to 2 simulated hours (12 timesteps).

A. Variability Analysis

We observed a variability in run times of up to 50% for repeated jobs; however, these were bimodal, with a number of 'fast' runs clustered within 5–10% of each other. Within each run, each iteration of atm_step() also showed some variability, with iterations k, where k mod 6 = 0, taking significantly more than the others, and iteration 0 taking between 2 and 5 times more than them all. The former effect is due to this benchmark having been configured to do extra processing at the start of each simulated hour. Figure 3 illustrates this phenomenon for runs of two hours simulated time. A few runs of up to five hours were attempted; there were no qualitative differences between the second and subsequent hours.
The causes of time variability between runs were almost entirely factors affecting the whole of the computation. The most likely causes are cross-application contention [11] (other concurrent jobs heavily communicating over the same switch(es) as the UM job) and operating-system-related causes (page coloring, NUMA effects and interrupts). Unfortunately, none of these effects was under our control at the time.

Figure 3. Variation in walltime (s) in consecutive calls to atm_step() for two simulated hours of an N320L70 benchmark on a 32 × 32 grid (run totals: run 1 t=54, run 2 t=53, run 3 t=59, run 4 t=47, run 5 t=52). Iterations in the GCR solver are on a scale from 0 to 50.

The read_dump() routine took between 25 and 29 seconds of the total; this routine scales negatively. Within atm_step(), the only routine to scale negatively was q_pos_ctl(), taking between 12 and 15 seconds of the total run-time (at 32 nodes). (q_pos_ctl() corrects the moisture content field, which can become negative as a result of previous calculations; conservation of the overall level of moisture should be preserved, especially for climate simulations.) The minimal job time for Figure 3 was ≈ 80 s; this can be extrapolated to give ≈ 260 s for a 24 hour run, corresponding to about 75 CPU hours.

After modifying the UM internal profiler to record which call to a particular routine was the most expensive, we saw that the first call was generally the most expensive when there were significant variations, and indeed this was the case with q_pos_ctl(). From code inspection, this routine used gather and scatter communications (and was the only routine to do so to a significant extent); we conjectured that the first call had extra overhead in setting up OpenMPI connections. The second possible reason was that upon the first iteration, memory-mapped pages in physical memory from the previous job were being flushed. We repeated the tests, adding the OpenMPI flag -mca mpi_preconnect_mpi 1 and using the memhog utility at the start of the job to flush out memory-mapped pages before u_model() was called. The first iteration was now only within a factor of two of the average, with the impact of q_pos_ctl() reduced by a factor of about one half.

Finally, the effect of the variation between non-hourly iterations of atm_step() deserves some comment. The most obvious cause is differences in the number of iterations required by the Helmholtz solver. These are listed in the bottom row of Figure 3 (they are the same for every run); they correlate only weakly with the observed times per call.

V. PROFILING METHODOLOGY

From the above experiences, and taking into account the resource limitations for our project, we designed a profiling methodology based on the following principles:
1) The profiling infrastructure should have minimal effect on runtime.
2) The profiling should be as efficient as possible, i.e. consume a minimum of resources while providing an accurate prediction of long-term simulations.
3) The methodology should take into account (as far as possible) the variability of the vayu system.

Our methodology is thus to:
• take at least 5 timings for each configuration (process grid and any other tunable parameters). The minimum will generally be taken as the true timing, as it represents the run with the least interference.
When varying parameters of the simulation, a single run for each setting is submitted to the cluster, and repeat runs do not start until the previous one has completed. The rationale behind this is that, if a 'quiet' period occurs, all settings will have a chance to run under those conditions.
• examine the scalability of the major subroutines and estimate the degree of load imbalance.
• run the simulation for a minimal number of timesteps in order to be able to project the performance of the target run-time. This was done by dividing the simulation into an initial period (reading the dump file and the first 6 iterations) and a 'warmed period' (the next 12 iterations). The warmed period should cover an integral number of simulated hours, in order to take into account the hourly variability observed in Section IV. Two hours appear sufficient to smooth out the effects of other sources, such as the variability of the execution time of individual timesteps.
• reduce the overhead of profiling, and measure (or estimate) its extent. This is particularly important over the 'warmed period', which is used to make the projections.
• individually time each iteration within u_model(), for later analysis.

Accordingly, we modified the UM internal profiler to measure the warmed and total periods. To reduce overhead (in the warmed period), the global synchronizations invoked when a timer started were limited to those routines which exceeded a certain threshold of the fraction of walltime in the initial period. We chose a threshold of 3%, which resulted in a reduction in the number of synchronizations by a factor of approximately ten. This number was recorded, and at the end of the simulation, a number of consecutive synchronizations were timed in order to get an estimate of the synchronization overhead. The load imbalance estimate for the warmed period was then calculated as the averaged time (over each process) spent in synchronizations during that period, minus the estimated overhead. As a defensive programming measure, total times for the execution and the warmed period were independently recorded. These agreed to within less than 1% with the sum of the non-inclusive timers.

It turned out that suppressing synchronizations for certain functions was far more complicated to implement efficiently than it at first appeared. This is because some processes invoke timers for subroutines (e.g. polar filtering) that other processes do not. A compromise was to let process 0 decide which timers should be suppressed, and to broadcast this information to the other processes at the beginning of the warmed period (a schematic sketch of this scheme is given below).

VI. SCALABILITY ANALYSIS

This section gives profiling and scalability analysis for the UM on both the N320L70 and N512L70 grids. q_pos_ctl() was identified in Section IV as being a bottleneck; rather than attempting to optimize it, we subsequently obtained from the MetO the recent Parallel Suite 24 (PS24) patches, which had addressed this problem. These were then merged into the UM 7.5 codes. The benchmarks were configured to select an algorithm for q_pos_ctl() which conserved the moisture field within the same level; this avoided some of the expensive collective communications in the original code. The codes were built under the conditions described in Section IV. The original and PS24 codes both used the same dump files for both the N320L70 and N512L70 benchmarks. Each timestep corresponded to 12 simulated minutes in both cases.
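Before turning to the results, the selective synchronization described in Section V can be pictured with the following hypothetical sketch (our illustration of the idea, not the UM timer module; names and the data layout are invented):

```fortran
! Hypothetical sketch of the selective timer synchronisation of Section V
! (not the UM timer module; names and data layout are invented).  Process 0
! decides, from the initial period, which timers keep their entry barrier
! (those above a 3% walltime threshold) and broadcasts that decision; the
! time then spent in these barriers gives a lower bound on load imbalance.
module timer_sketch
  use mpi
  implicit none
  integer, parameter :: max_timers = 128
  real(8) :: init_time(max_timers) = 0.0d0   ! per-timer walltime, initial period
  logical :: sync_on(max_timers)   = .true.  ! keep the entry barrier?
  real(8) :: barrier_time          = 0.0d0   ! accumulated barrier wait (this PE)
contains

  ! Called once, at the start of the warmed period.
  subroutine choose_synced_timers(init_total, comm)
    real(8), intent(in) :: init_total
    integer, intent(in) :: comm
    integer :: rank, ierr
    call MPI_Comm_rank(comm, rank, ierr)
    if (rank == 0) sync_on = (init_time > 0.03d0 * init_total)
    call MPI_Bcast(sync_on, max_timers, MPI_LOGICAL, 0, comm, ierr)
  end subroutine choose_synced_timers

  ! Timer entry: synchronize only the expensive, globally-called timers.
  subroutine timer_start(id, comm, t0)
    integer, intent(in)  :: id, comm
    real(8), intent(out) :: t0
    integer :: ierr
    real(8) :: tb
    if (sync_on(id)) then
       tb = MPI_Wtime()
       call MPI_Barrier(comm, ierr)
       barrier_time = barrier_time + (MPI_Wtime() - tb)
    end if
    t0 = MPI_Wtime()
  end subroutine timer_start
end module timer_sketch
```

In this picture, the load imbalance estimate of Section V would be the mean of barrier_time over all processes for the warmed period, minus the measured cost of the same number of uncontended barriers.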
It was observed, however, that the PS24 codes did not have any periodic pattern in the time taken for each timestep. This is unlike the original codes, in which every sixth timestep took appreciably longer, as described in Section IV-A.

Figure 4 gives the speedups corresponding to the N512L70 benchmark on the PS24 codes. The runs are over three simulated hours and generate no STASH output. The grid configurations were chosen to be near-square (either a 1:1 aspect ratio, or 1:1.5 where this was not possible). Assuming the degree of communication north-south and east-west is similar, a near-square configuration should optimize the surface-to-volume effects in the grid decomposition, and hence minimize nearest-neighbor communication (for a fixed core count P × Q, the halo perimeter of each local subdomain, proportional to Nx/P + Ny/Q, is smallest when the subdomain is close to square). For fewer than 128 cores, the scaling was linear or very slightly super-linear; to amplify scaling effects at higher core counts, these results were omitted from Figure 4. The job times, the time for the call to u_model(), and the warmed period times are given. The latter is given for both the minimum and average times. Despite there being a difference of at least 15%, due to factors described in Section IV-A, the averaged times show similar scaling behavior to the minimum. The time to call u_model() scales more poorly, due to the shortness of the simulated time period, and the effect of calling read_dump() and the 'warming' of the benchmark. An analysis of the difference between the job time and the u_model() time for 1024 cores (100 s vs 87 s) yielded the following breakdown: 2 s to launch process 0, 4 s to launch all processes, 6 s to read the namelist files and 1 s to exit the UM and 'clean up' standard output files. It can be noted from Figure 4 that this difference increases with increasing core counts.

Figure 4. Scaling of UM 7.5 + PS24 patches on the N512L70 benchmark; 't16' is the time in seconds for 16 cores (curves: job - min, t16=2324; u_model - min, t16=2307; warm - min, t16=1464; warm - av, t16=1603; and ideal).

Figure 5. Scaling of the minimum 'warmed' time of UM 7.5 with and without the PS24 patches on the N320L70 and N512L70 benchmarks; 't16' is the time in seconds for 16 cores (curves: N512+PS, t16=1464; N320+PS, t16=481; N512, t16=1793; N320, t16=493; and ideal).

Figure 5 gives the speedups corresponding to the warmed period for all four benchmarks. While all N320L70 benchmarks are still scaling above 1000 cores, the N512L70 benchmarks scale slightly better (the missing points for the non-PS24 codes above 1500 cores are due to bus errors). This is not surprising in the sense that the N512L70 grid will have a factor of 1.6 better surface-to-volume ratio than the N320L70. It can be seen, particularly from the N320L70 curves, that the PS24 codes scale better.

Table I shows the execution profile for the N512L70 benchmarks at 1536 cores for the 'non-inclusive timers' corresponding to the major subroutines. No routine scales perfectly; hence there is potential for worthwhile performance improvement in all of them. PE_Helmholtz is the dominant routine.

Table I. N512L70 subroutine wall-time profile with 32 × 48 = 1536 cores. 's' is the speedup (over 16 cores), '% tw' is the % of warmed period time, '% im.' is the fraction of time due to load imbalance.

                        UM 7.5                  UM 7.5 + PS24
function            s    % tw   % im.        s    % tw   % im.
PE_Helmholtz      59.4    51      0        61.6    48      0
atm_step          15.1    11     27        11.4    25     20
q_pos_ctl          0.8    12      0         5.4     0     10
SL_Full_wind      50.8    10     47        49.1    13     46
SL_Thermo         52.7     6     20        53.3     9     19
Convect           28.0     9     54        66.2     5     37
SF_EXPL            6.7     6     69        35.4     1     75
Atmos_Physics2    12.9     5     71        52.9     1     28
total             52.1   100     17        55.8   100     30
The parts of atm_step() that are not part of the other listed functions also have poor scaling, with significant load imbalance. Considering that 96 is the ideal scaling at this point, the overhead of SL_Full_wind appears to be almost entirely due to load imbalance. Load imbalance is also high in Convect and SF_EXPL; this is probably due to latitudinal variations, which are difficult to avoid with a fixed decomposition. read_dump() at 1536 cores on the N512L70 grid has an execution time of 24 s, corresponding to an inverted speedup of 0.18! Its absolute execution time is ≈ 25 s. While it is only called once per simulation, it is also a target for optimization. Overall, load imbalance effects are quite significant (recall that our methodology gives an underestimate of the real load imbalance). It can also be seen that the PS24 patches give significantly better performance in q_pos_ctl() and SF_EXPL.

Figure 6 shows the load-imbalance distribution across the processor grid over hours 2 and 3 of an N320L70 simulation.

Figure 6. Visualization of load imbalance for an N320L70 simulation on a 16 × 24 processor grid ('0' = 0%, '9' = 15% of run-time) using the dump in Figure 1. The data is from time spent in MPI barriers in the updated timer module; thus a large value indicates the process had less work to do. Lighter loads over the high northern latitudes indicate the effects of extra processing in the equatorial regions due to convection.

Figure 7. Scaling of the averaged 'warmed time' with no affinity, process affinity, and process+NUMA affinity for the PS24 codes on the N512L70 benchmark; 't16' is the time in seconds for 16 cores (curves: p aff., t16=1603; none, t16=2043; p+m aff., t16=1807; and ideal).

A. Methodology Evaluation

To evaluate the profiling methodology of Section V, some 24 hour simulations were performed on both the N512L70 benchmark with the PS24 patches, and on the N320L70 benchmark without the PS24 patches (recall that the latter showed more variability in the time to execute a timestep). We chose our 'operational target' core count of 960 cores in a near-square aspect ratio. Defining $t$ to be the job time, $t_{2:24}$ to be the warmed time (after the 1st simulated hour), and $t_{2:3}$ to be the time for the second and third simulated hours, the formula $t' = t - t_{2:24} + 11.5\,t_{2:3}$ gives the projected runtime from runs using the above methodology. The following table gives this data for three runs.

run            t      t_{2:24}   t_{2:3}     t'
N512L70       527       39.12       6.5     524.3
N320L70 (1)   224      163.8       13.7     214.8
N320L70 (2)   237      174.9       14.9     233.6

For N320L70 run 1, there was an anomaly at timestep 55, taking 7.9 s; on run 2, there was one at timestep 134 of 4.2 s (the average timestep is about 1.2 s). If these anomalies are removed, the above results show that the timings for the second and third simulated hours are quite an accurate predictor of the full simulation time.
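One way to read the projection formula (our interpretation; the factor is not derived explicitly above): the warmed period of a 24-hour run covers the 23 simulated hours after the first, and $t_{2:3}$ covers 2 of them, so the measured warmed time is replaced by a scaled-up copy of the 2-hour sample,

\[
t' \;=\; t \;-\; t_{2:24} \;+\; \frac{23}{2}\, t_{2:3} \;=\; t \;-\; t_{2:24} \;+\; 11.5\, t_{2:3} .
\]

Agreement between $t'$ and $t$ then indicates that hours 2–3 are representative of the remainder of the simulation.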
It was found that the overhead of the profiler (synchronization overhead, gathering of data) was less than 1% of the warmed time period in all cases. The printing of output began to take more than this above 512 cores; however, this had no effect on the 'warmed time' measurements.

B. Analysis of Process and NUMA Affinity Effects

The previous results used OpenMPI 1.4.1 with process affinity flags. A locally modified version of OpenMPI 1.4.3, with both process and NUMA affinity available (by default), became available after the data in the previous sections was collected. To analyse the effects of these factors on execution time and its variability, scaling measurements using the methodology of Section V were repeated for OpenMPI 1.4.3 and for 1.4.1 without process affinity. The results are shown in Figure 7. The configuration with both kinds of affinity is overall the fastest; it is 18% faster (in minimum 'warmed' time) than the others at 2048 cores. Figure 8 indicates the reduction in variability due to affinity, using the normalized error from the average, $\sum_{i=1}^{n} |t_i - \bar{t}|\,/\,(n\bar{t})$. While the number of measurements (n = 5) is only sufficient to observe general trends, we can see that there is no clear correlation of the error with the number of cores. We can also see that process affinity reduces variability by 20%, but NUMA affinity reduces it further by a factor of 4!

Figure 8. Normalized average error of the 'warmed time' with no affinity, process affinity, and process+NUMA affinity for the PS24 codes on the N512L70 benchmark (average errors: p aff. 0.098, none 0.124, p+m aff. 0.029).

VII. TUNING OF THE MODEL

This section describes some experiments in tuning the UM 7.5 codes, based on varying configuration parameters and compilation details.

A. Segment Size Parameters

The performance of the N96L38 benchmark is qualitatively different from that of the larger benchmarks, since a 48-hour simulation (with each timestep representing 30 minutes) can be done in under 5 minutes using only 16 cores. (For pragmatic reasons, UM version 7.3 was used for this benchmark; however, we expect no significant difference in the behavior of the relevant routines in UM 7.5.) Of more interest is improving computational performance, since this resolution is typical of those used for climate runs, which involve simulations on the order of years.

Among the few performance-related parameters provided by the UM are the segment size parameters for three of the atmospheric physics functions, which control the sizes of the segments of atmospheric columns that are processed together. Their values affect memory hierarchy performance, and by default they are tuned for an IBM Power 6 system. Table II gives the performance improvement from tuning these parameters for a 2 × 8 process grid. These results are based on the average of 5 runs.

Table II. Segment size parameter settings for the N96L38 benchmark.

subroutine                setting            improvement
                        def.   opt.       % subr.   % total
short-wave radiation     80     32           14        0.3
long-wave radiation      80     25           25        1.2
convection               80     16           25        2.2

These routines originally accounted for 21% of the total run time; the tuning resulted in a relatively modest total reduction of 3.7%. For a 2 × 4 grid, we obtained identical settings, with similar improvements in walltime. Figure 9 shows the tuning of these parameters, again with averages taken over 5 runs (a schematic illustration of such column segmenting is given below).
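The following is a schematic, hypothetical illustration of why such segment sizes matter: physics routines process the horizontal points in segments of atmospheric columns, and the per-segment workspace should stay cache-resident. This is our illustration of the general idea, not UM code; the routine name and array shapes are invented.

```fortran
! Schematic segmented loop over atmospheric columns (hypothetical, not UM
! code).  The per-segment workspace holds nlev * seg_size values, so the
! segment size controls how much data stays hot in cache while a physics
! routine makes several passes over the same columns.
subroutine physics_by_segments(npoints, nlev, seg_size, field)
  implicit none
  integer, intent(in)    :: npoints, nlev, seg_size
  real(8), intent(inout) :: field(nlev, npoints)
  real(8) :: work(nlev, seg_size)
  integer :: first, last, n

  do first = 1, npoints, seg_size
     last = min(first + seg_size - 1, npoints)
     n    = last - first + 1
     work(:, 1:n) = field(:, first:last)      ! gather one segment of columns
     ! ... column physics on work(:, 1:n), reused while cache-resident ...
     field(:, first:last) = work(:, 1:n)      ! scatter the results back
  end do
end subroutine physics_by_segments
```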
Our methodology was to use the UM internal profiler described in Section II-B, and to use the same segment size for all three parameters on each run. It may be noted that once the parameter value increases beyond 16, performance is relatively insensitive to the value of these parameters. For the N320L70 benchmark using the PS24 patches, the effect of all of these routines is, however, much smaller, accounting for 6–8% of the run time for core counts ranging from 16 to 1536. An experiment on an intermediate core count (384) showed a 1% improvement in the warmed period time with an optimal parameter setting of 16 in all three cases. To conserve machine usage, settings were limited to multiples of 8 up to 80. The minimum of 5 timings (over the warmed period) was used for this result.

Figure 9. Segment size parameter tuning of the N96L38 benchmark.

B. Process Grid Configuration

The results presented in Section VI used near-square process grid settings. The following table gives the execution time in seconds for various process grids with 480 cores for the PS24 patches on the N512L70 grid:

grid:    10 × 24   12 × 20   15 × 16   20 × 12   24 × 10
t (s):    100.1      97.0      97.4      98.0      98.2

and those for 960 cores are given in the following table:

grid:    16 × 60   20 × 48   24 × 40   30 × 32   40 × 24   48 × 20
t (s):     29.6      30.2      28.8      27.6      27.7      28.0

As before, these results were for three simulated hours, with the minimum timing being given. The above results show that the near-square configurations give slightly better performance.

C. I/O Performance

As noted in Section VI, the read_dump() routine took a significant amount of the execution time, even compared with a full simulation run. Part of this inefficiency was traced to the C byte-swapping code which is required for I/O on a little-endian machine (the MetO mainly test the UM on IBM Power machines, where this code is not used). The default configuration from the UMUI will have this code compiled without optimization. Simply compiling this code with icc -O3, rather than the default gcc with no optimization, reduced read_dump() execution time by 42% at 16 cores (the absolute value of the improvement remained constant at higher core counts).

With STASH output on, the endian swapping has a dramatic effect on total benchmark performance. For a 24 hour simulation on 800 cores on the BoM cluster (virtually identical to vayu except for having a faster Infiniband switch), compiling with icc -O3, using an x86-optimized assembler byte-swap routine, and avoiding endian conversion altogether improved total execution time by 42%, 58% and 58% respectively. This brought the simulation time down to 1085 s.

VIII. RELATED WORK

One of the few papers in the open literature to analyze the performance of the UM is [12]. However, this describes the very first parallelization schemes, almost two decades ago, and the results do not reflect the current state of the UM. Also, it concentrates on the data assimilation rather than the weather simulation aspects of the UM.

The UM developers at the UK Met Office produce internal reports on the profiling and scalability of the UM. These are made available to the UM community. A scalability analysis of UM 7.6 with the PS24 patches is made in [4]. The N512L70 benchmark over three simulated days with STASH output off scaled up to 1536 cores, achieving a speedup of 7.7 over 96 cores; this was on an IBM Power 6 cluster. With STASH output on, it scaled to 1024 cores, with a speedup of 5.
For climate simulations, with the UM coupled to the NEMO ocean model by the OASIS3 coupler, the N216L85 model scaled to 18 cores, yielding a speedup of 1.5 over 6 cores. A profile of internal subroutines using the notion of 'parallel speedup', as described in Section II-B, was also given; this showed that q_pos_ctl() still scaled the most poorly. However, the results do not show how important the poorly scaling subroutines were to the run-time.

An earlier report studies the effect of using OpenMP within UM 7.4 [5]. Results for an N512L70 benchmark on a system with 2-core Power 6 processors and 2-way hardware threading showed that utilizing 2 threads per MPI task, with two MPI tasks per core (to take advantage of hardware threading), gave a 10% improvement over pure MPI (one process per core). This was for 2 nodes; the advantage steadily reduced to 2% at 64 nodes.

There are a number of papers on profiling and performance tuning of other atmospheric models. [13] shows linear scaling up to 24 nodes on a dual-socket quad-core Opteron and Infiniband cluster; this is in terms of the number of (large) WRF jobs completed per day. MPI profiles of the distribution of the number of messages per message size interval (for various core counts up to 64) were given. A comparison between OpenMPI 1.3 and HP MPI 2.2.7 showed linear scaling up to 24 nodes, with OpenMPI having slightly better scaling and generally 10% greater overall performance.

A more detailed scaling analysis of WRF is given in [10]. The benchmark used a 1500 × 1201 × 35 regional US grid run for 240 timesteps. The performance modelling tool IPM was used to give separate scaling analyses of computation vs communication, with the former scaling super-linearly due to "algorithmic limitations", and the latter scaling sub-linearly. The larger part of the communication was due to nearest-neighbor communication in the 'halo exchanges', differing from our result of the majority of the time being spent in barriers and other collectives due to load imbalance.

CAM is a widely used atmospheric model with a similar development history to the UM [14]. It also uses a rectangular grid but has more flexible methods for decomposing data over processes (including decomposing the vertical dimension), and it supports hybrid MPI / OpenMP parallelism. It also has a general 'chunking' scheme for blocks of atmospheric columns, providing a more generic solution to the segment sizes of the UM described in Section VII-A. The runtime selection of a variety of chunk load-balancing schemes and communication protocols (e.g. different MPI collective and point-to-point algorithms) is also available. [14] presents a methodology to efficiently traverse the large parameter space and find an optimal set of values. Results are given for a variety of systems using a benchmark with a 256 × 128 × 26 grid. The main result, on an IBM p690 cluster, represented a speedup of 12 on 512 cores over 16 cores. The metric used was the number of model hours simulated per day. Their chunk size results are similar to ours in Section VII-A, finding an optimal chunk size of around 16 for a variety of platforms, with the run-time largely insensitive provided the chunk size is greater than 10. Load balancing scheme selection can improve runtime by 10% in some instances. However, in all of these works, the results were obtained over long simulation runs, representing a very large amount of compute resources required to generate them.
IX. CONCLUSIONS AND FUTURE WORK

The Unified Model has a highly complex code base with large and expanding user groups around the world. Atmospheric simulations on large-resolution grids using the UM are very large computations. Current profiling tools have difficulty in dealing with these simulations. Even on a highly timeshared cluster with high variability between the execution times of identical jobs, it is still possible to accurately and efficiently profile the UM on large benchmarks by using an extended internal subroutine profiler. By reducing synchronizations within the profiler, it is possible to reduce profiler overhead to a negligible extent, while retaining the ability to estimate load imbalance. This is based on the idea of using small parts of the simulation (after 'warming') to project the run-time for a longer simulation. Care has to be taken to ensure the period is representative, due to effects such as extra work being done at periodically-spaced timesteps. Care also has to be taken to remove one-time effects, as these lead to a pessimistic projection of the scaling behavior.

The UM 7.5 codes scale well up to 2048 cores for the N512L70 benchmark, with slightly lower scaling for the smaller N320L70 benchmark. The PS24 patches improved two subroutines which were scaling poorly; however, there remain 6 subroutines in which optimization efforts are worthwhile (each with the potential to improve the overall runtime by 2% or more). However, three of these suffer largely from load imbalance, and might require widespread changes, such as dynamic load balancing across the north-south direction. As noted in Section VI, the codes with the PS24 patches do not exhibit any periodic behavior per iteration, and it is possible to use a more efficient methodology (i.e. 1.5 hours of simulated time, with a 1 hour simulated-time warmed period).

Performance tuning opportunities have largely arisen from the fact that the UM has been tuned for Power 6 based platforms. Segment size parameters may be tuned to bring modest improvements for the N96L38 data grid, currently used for climate simulations. For larger data grids, the proportion of time spent in routines affected by these parameters is reduced, and the advantage becomes negligible. For moderate to large core counts, near-square processor grids give slightly better performance. I/O performance is significant; optimizing or avoiding the associated byte-swap operation gives a dramatic performance improvement when STASH output is enabled. OpenMPI process and NUMA affinity parameters have also been shown to have a very significant effect, both in improving performance and in reducing the variability of measurements.

Future work includes extending the UM internal profiler to use hardware event performance counters in order to better understand the memory performance of the codes, which is also expected to have a large impact on performance. Obtaining a profile of MPI subroutines at large core counts is also needed, in order to understand the time spent in communication (as opposed to load imbalance). We plan to combine IPM with our extended internal profiler, using the IPM regions facility to get a breakdown of MPI and hardware event performance counters across the major subroutines. A more detailed investigation into the constituent components of the dominant Helmholtz solver is also planned. The results of Section VII-C indicate we are already within a factor of two of the 'operational target' with the N512L70 benchmark.
Apart from waiting for a cluster based on the Intel Sandy Bridge processor (for which the target is intended), the most promising opportunity for optimization is in using mixed OpenMP / MPI parallelism, and improving the latter within each node. This should reduce MPI communication, due to there being fewer MPI processes. The more detailed profiling methodology mentioned above would be helpful in predicting whether this has any benefit.

ACKNOWLEDGMENTS

The authors thank Les Logan for preparing the UM 7.5 benchmarks. We thank Les and Martin Dix for technical advice on the UM. We also thank David Singleton, Robin Humble and Judy Jenkinson for technical support on the vayu cluster; in particular, we thank David Singleton for adding NUMA affinity to the local version of OpenMPI 1.4.3. We thank Paul Selwood for performance advice on the UM and for making available the PS24 patches. We finally thank the NCI National Facility for the usage of the vayu cluster.

REFERENCES

[1] T. Davies, M. J. P. Cullen, A. J. Malcolm, M. H. Mawson, A. Staniforth, A. A. White, and N. Wood, "A new dynamical core for the Met Office's global and regional modelling of the atmosphere," Q. J. R. Meteorol. Soc., vol. 131, pp. 1759–1782, 2005.
[2] The Unified Model Collaboration. [Online]. Available: http://www.metoffice.gov.uk/research/collaboration/um-collaboration
[3] R. Buizza, D. S. Richardson, and T. N. Palmer, "Benefits of increased resolution in the ECMWF ensemble system and comparison with poor-man's ensembles," Quarterly Journal of the Royal Meteorological Society, no. 589, pp. 1269–1288, 2003.
[4] A. Malcolm, M. Glover, and P. Selwood, "Scalability of the Unified Model," UK Met Office, Tech. Rep., Jun. 2010.
[5] M. Glover, A. Malcolm, and P. Selwood, "OpenMP scaling tests," UK Met Office, Tech. Rep., Nov. 2009.
[6] (2010) The Australian Community Climate and Earth-System Simulator. [Online]. Available: http://www.accessimulator.org.au/
[7] S. Milton, A. Arribas, M. Bell, and R. Swinbank, "Seamless Forecasting," UK Met Office, Tech. Rep., Nov. 2009.
[8] (2010) The NCI Cluster vayu System Details. [Online]. Available: http://nf.nci.org.au/facilities/vayu/hardware.php
[9] UK Met Office, "Unified Model: Basic User Guide (version 7.3)."
[10] N. J. Wright, W. Pfeiffer, and A. Snavely, "Characterizing Parallel Scaling of Scientific Applications using IPM," in Proc. of the 10th LCI International Conference on High-Performance Clustered Computing, Mar. 2009.
[11] D. Skinner and W. Kramer, "Understanding the causes of performance variability in HPC workloads," in Proc. of the IEEE International Symposium on Workload Characterization, 2005, pp. 137–149.
[12] B. J. N. Wylie and P. D. Surry, "Meteorological Data Assimilation: Migration to Massively Parallel Systems," in Proc. of Scalable Computing: Cray User Group Conference. Montreux: Cray Inc, 1993.
[13] G. Shainer et al., "Weather Research and Forecast (WRF) Model: Performance and Profiling Analysis on Advanced Multi-core HPC Clusters," in Proc. of the 10th LCI International Conference on High-Performance Clustered Computing, Mar. 2009.
[14] P. H. Worley, "Benchmarking using the Community Atmospheric Model," in Proc. of the 2006 SPEC Benchmark Workshop, Jan. 2006.