General information - Meso-NH

PRACE Final Report Form – Preparatory Call B
1. General information
1.1. Proposal ID
2010PA1429
1.2. Type of preparatory proposal granted
B – Code development and optimization by applicant (without PRACE support).
1.3. Period of access to the PRACE facilities
1/05/2013 to 31/10/2013
1.4. Name of the PRACE facility assigned
JUQUEEN, HERMIT & CURIE Hybrid
2. Project information
2.1. Project name to which corresponds the developed and optimized code
Scaling MESONH to next gen IBM, Cray & GPU architecture for PETASCALE simulation.
2.2. Research field
Earth Sciences and Environment
2.3. Institutions and research team members
Dr Juan ESCOBAR MUNOZ: Computer Scientist
Dr Jean-Pierre CHABOUREAU: Physicist
at
Laboratoire d'Aérologie / CNRS / OMP / Université Paul Sabatier, Toulouse
2.4. Summary of the project interest (Maximum 300 words)
Please fill in the field with the same text used in the application form.
MesoNH is the non-hydrostatic mesoscale atmospheric model of the French
research community. It has been jointly developed by the Laboratoire d'Aérologie
(UMR 5560 UPS/CNRS) and by CNRM-GAME (URA 1357 CNRS/Météo-France).
MesoNH now runs in production mode on Tier-0 computers on up to 8K cores.
The goal of this project is to prepare MesoNH for next-generation architectures
(Cray, NVIDIA GPU and IBM BG/Q) and to gain one order of magnitude in the
scalability of the code in production runs.
Development will concentrate primarily on continuing the port of MESONH to GPUs
with the PGI/OpenACC toolkit on multi-node clusters. The second goal of the
project is to test the scalability of new algorithms for the pressure solver
(such as multigrid) to replace the present quasi-spectral algorithm.
3. Main features of the code
3.1. Name of the code or codes.
MESO-NH
3.2. Type of the code distribution (e.g. opensource, commercial, academic…etc.)
Academic; license agreement (free of charge) with CNRM and LA.
3.3. Computational problem executed (e.g. N-body problem, Navier-stokes
equation…etc.)
Non-hydrostatic Atmospheric Model
3.4. Computational method (e.g. FEM, FVM, PIC, spectral methods…etc.)
Finite Differences + Quasi Spectral + Iterative Krylov Methods
3.5. Kind of parallelism used (e.g. MPI, OpenMP, MPI/OpenMP, pthreads, embarrassingly
parallelism…etc.)
Domain Decomposition with MPI
3.6. Main libraries used (e.g. FFTW, MKL, BLAS, LAPACK…etc.), version and language
(Fortran, C, C++…etc.). Did you use the /usr/local one?
Fortran 90 + MPI, vectorized code, all array syntax.
ifort & pgf90 + OpenACC under test.
Global Arrays library to speed up the I/O.
3.7. Which other software did you use on the PRACE machines? Did you use
some post-processing or pre-processing tools?
Our own graphics tools using ncarg/NCL.
NetCDF tools: NCO, ncview, etc.
Some tests with ParaView (in progress).
4. Compilation step
4.1. Which other software did you use on the PRACE machines? Did you use
some post-processing or pre-processing tools?
Module environment, cvs, rsync.
4.2. How is the program compiled? (e.g. makefile, script…etc.)
Makefile + script
4.3. Difficulties met to compile, if any, and how they were tackled.
On Cray HERMIT, some Cray compiler bugs were reported to HLRS support.
On IBM BG/Q, the MPI library was missing support for an all-INTEGER*8 interface.
4.4. Which version of the compiler and version of the MPI library did you use?
The default ones. On Cray, tests were done with the Cray and Intel compilers.
On CURIE, PGI + OpenACC.
4.5. Did you use any tools to study the behavior of your code? (e.g. debugger,
profiler…etc.)
TotalView & Scalasca on JUQUEEN.
MPI timing: subroutine profiling with MPI_WTIME / CPU_TIME.
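As an illustration, a minimal sketch of this kind of timing wrapper is shown below (routine and variable names are hypothetical, not the actual MesoNH profiling code):

! Minimal timing sketch: wall-clock time via MPI_WTIME, CPU time via the
! CPU_TIME intrinsic, reduced to keep the slowest rank for this routine.
SUBROUTINE PROFILED_STEP(KCOMM)
  USE MPI
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: KCOMM
  DOUBLE PRECISION :: ZWALL0, ZWALL1, ZWALL, ZWALLMAX
  REAL    :: ZCPU0, ZCPU1
  INTEGER :: IERR
  ZWALL0 = MPI_WTIME()
  CALL CPU_TIME(ZCPU0)
  ! ... body of the routine being profiled ...
  CALL CPU_TIME(ZCPU1)
  ZWALL1 = MPI_WTIME()
  ZWALL  = ZWALL1 - ZWALL0
  CALL MPI_REDUCE(ZWALL, ZWALLMAX, 1, MPI_DOUBLE_PRECISION, &
                  MPI_MAX, 0, KCOMM, IERR)
  ! rank 0 can then print ZWALLMAX and ZCPU1-ZCPU0 for this routine
END SUBROUTINE PROFILED_STEP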
5. Execution step
5.1. How is the program launched?
Batch jobs with runjob/aprun/mpirun.
5.2. Difficulties met to launch the code, if any, and how they were tackled.
We had difficulties launching the code on more than 100,000 cores on IBM BG/Q, due
to problems of I/O scalability and excessive memory consumption. See Section 8 for
the optimizations done.
On CURIE Hybrid, due to the amount of effort spent on recoding for the BG/Q
architecture, not much progress was made.
6. Communication patterns
6.1. If you know which are the main communication patterns used in your code
configuration, select the ones from the mentioned below:
Many point to point communications
Many collective communications
Barrier
Reduction
Broadcast
Scatter/gather (I/O)
All to all
7. Scalability testing
7.1. Summary of the obtained results from the scalability testing (Maximum 500
words)
Show the scaling behavior of your application. What progress did you achieve? Does it fulfill
your expectations? If not, what were the reasons?
The previous scalability record for MesoNH, up to 128K cores, was obtained on the
IBM BG/P JUGENE in a previous PRACE preparatory project.
During this new PRACE preparatory access project, the MesoNH code was ported to
IBM BG/Q machines using multiple allocations: TURING at IDRIS (100K hours),
JUQUEEN at Jülich (200K hours) and MIRA at Argonne Lab. (4 million hours):
- The I/O scalability was demonstrated up to 80 GB/sec on 500K cores on Mira.
- The MesoNH code scalability was demonstrated up to 500,000 MPI ranks x 4 threads
= 2 million threads on IBM BG/Q.
- On HERMIT, in a first test on the Cray XE6 architecture, the MesoNH code was
scaled up to 60,000 cores.
- On CURIE Hybrid, a first mixed OpenACC/MPI version ran on 4 GPUs
(time was lacking, but work is in progress for next year).
7.2. Images or graphics showing results from the scalability testing (Minimum
resolution of 300 dpi)
Please attach the images to this form.
Figure 1: I/O scalability of MesoNH up to 84 GB/sec on 500K cores on IBM BG/Q
Figure 2: MesoNH scalability up to 60 TeraFlops & 2 million threads on IBM BG/Q
7.3. Data to deploy scalability curves
7.4. Publications or reports regarding the scalability testing. (Format: Author(s).
“Title”. Publication, volume, issue, page, month year)



Chaboureau and Escobar. TURING GENCI Grand Challenge, 2013.
Escobar. GENCI MesoNH support project 1605, annual report, 2013.
Paoli. Argonne Lab. Director's Discretionary Allocation on Mira, Meso_CCS_DD13.
8. Development and optimization
8.1. Summary of the obtained results from the enabling process (Maximum 500
words)
Please describe the effort you spent. What progress did you achieve? Please describe in detail
which enabling work was performed (porting, work on algorithms, I/O…etc.). Which problems
did you experience?
Optimization of Meso-NH on IBM BG/Q
The objective of this project was to test the capability of Meso-NH to run beyond
the limit of 130,000 cores that was previously achieved on Jugene IBM BG/P in
Germany. This could allow performing simulations on computational meshes of up
to 4096³, i.e. 68 billion grid points (our present "record" was 2048³ or 8 billion grid
points and 16,000 nodes in DUAL mode on Intrepid). With such a computational
mesh, Meso-NH could potentially run on 1 million MPI processes (the entire Mira
machine, i.e. 740,000 cores).
To achieve these objectives, we tried to improve the performance of Meso-NH
by working on two specific drawbacks: the scalability of I/O and the memory
consumption of the code.
1) I/O scalability
We first tested an alternative procedure for treating I/O in Meso-NH that is based
on the Global Arrays library (instead of classical MPI SEND/RECV). This is a very
attractive method for transposing 3D data into N independent files, each of them
containing a subset of 2D data corresponding to a number of horizontal planes. This
procedure was tested on up to 32,768 nodes and turned out to be 20 to 30 times
faster than data transposition based on pure MPI subroutines! Figure 1 shows a
good scaling of the I/O performance up to 524,288 cores with 32 threads per node,
with a peak I/O speed of about 84 GB/sec.
As a side note, we found that a certain number of jobs remained in deadlock; this
problem was solved by setting two environment variables as follows:
export ARMCI_STRIDED_METHOD=IOV
export ARMCI_IOV_METHOD=CONSRV
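As an illustration only, a toy sketch of such a Global-Arrays-based gather of horizontal planes is given below (fixed-form .F source, preprocessed for the GA include files; the grid sizes, names and single I/O rank are assumptions, not the actual MesoNH I/O code):

!     Toy sketch: every rank owns a block of a distributed 3D field;
!     an I/O rank then gets 2D horizontal planes one by one, as done
!     before writing each group of levels to its own file.
      program ga_io_sketch
      implicit none
#include "mafdecls.fh"
#include "global.fh"
#include "mpif.h"
      integer NX, NY, NZ
      parameter (NX=64, NY=64, NZ=32)
      integer dims(3), chunk(3), lo(3), hi(3), ld(2)
      integer g_a, me, ierr, k
      double precision plane(NX, NY)
      logical ok

      call mpi_init(ierr)
      call ga_initialize()
      ok = ma_init(MT_F_DBL, 100000, 100000)

!     create the distributed 3D field (GA chooses the distribution)
      dims(1) = NX
      dims(2) = NY
      dims(3) = NZ
      chunk(1) = -1
      chunk(2) = -1
      chunk(3) = -1
      ok = nga_create(MT_F_DBL, 3, dims, 'field3d', chunk, g_a)

      me = ga_nodeid()
!     each rank would nga_put() its local sub-domain here, then sync
      call ga_sync()

      if (me .eq. 0) then
!        the I/O rank gets 2D slabs, one horizontal plane at a time
         do k = 1, NZ
            lo(1) = 1
            lo(2) = 1
            lo(3) = k
            hi(1) = NX
            hi(2) = NY
            hi(3) = k
            ld(1) = NX
            ld(2) = NY
            call nga_get(g_a, lo, hi, plane, ld)
!           ... write "plane" to the corresponding 2D file here ...
         end do
      end if

      ok = ga_destroy(g_a)
      call ga_terminate()
      call mpi_finalize(ierr)
      end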
2) Memory consumption
One of the recurrent problems with Meso-NH is the memory consumption, which in
some cases increases instead of decreasing with the number of cores and ultimately
leads to a loss of scalability of the code.
- One of the sources of the problem comes again from the I/O processes. So far, all
I/O buffers were allocated on a few dedicated I/O processes. Since only a small
part of the MPI processes is involved in the I/O (512 to 1024 out of a total of tens of
thousands of processes), we opted for an allocation of the buffers in the SHARED
MEMORY available on BG/Q, which does not limit the memory size to 1 GB per core.
Using this procedure, we can use 2 GB per node by setting
export BG_SHAREDMEMSIZE=+2048
export BG_MAPCOMMONHEAP=1
and, since only 1 core per node is dedicated to I/O, we can allocate the totality of the
node memory.
The use of SHARED MEMORY is not standard in Fortran 90, hence we coded some C
subroutines interfaced using the Fortran 90 module iso_c_binding. We then replaced
ALLOCATE(BUFFER(IN)) by a call to a new routine, shm_alloc, together with
c_f_pointer. The latter is a function provided by iso_c_binding that allows linking a
Fortran pointer to a memory zone issued from any C subroutine. With this
functionality we could run on 128,000 cores on a 4096x4096x1024 grid, i.e. 16 billion
grid points.
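As an illustration, a minimal sketch of this mechanism is given below; the module and routine names are hypothetical and malloc stands in for the real C routine that returns memory from the BG/Q shared segment (this is not the actual MesoNH shm_alloc code):

! Minimal sketch (illustrative only): a C allocator interfaced through
! iso_c_binding, with c_f_pointer mapping the returned C pointer onto a
! Fortran array pointer, replacing ALLOCATE(BUFFER(IN)).
MODULE MODE_SHM_ALLOC_SKETCH
  USE, INTRINSIC :: ISO_C_BINDING
  IMPLICIT NONE
  INTERFACE
    ! malloc used as a stand-in for the real shared-memory C routine
    TYPE(C_PTR) FUNCTION C_MALLOC(KSIZE) BIND(C, NAME='malloc')
      IMPORT :: C_PTR, C_SIZE_T
      INTEGER(C_SIZE_T), VALUE :: KSIZE
    END FUNCTION C_MALLOC
  END INTERFACE
CONTAINS
  SUBROUTINE SHM_ALLOC(PBUFFER, KN)
    REAL, POINTER, INTENT(OUT) :: PBUFFER(:)
    INTEGER,       INTENT(IN)  :: KN
    TYPE(C_PTR) :: ZCPTR
    ! allocate KN reals on the C side ...
    ZCPTR = C_MALLOC(INT(KN, C_SIZE_T) * C_SIZEOF(0.0))
    ! ... and link the Fortran pointer to that memory zone
    CALL C_F_POINTER(ZCPTR, PBUFFER, [KN])
  END SUBROUTINE SHM_ALLOC
END MODULE MODE_SHM_ALLOC_SKETCH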
- When we tried to run on 250,000 cores, the code still remained in deadlock. We identified
that the problem came from the data structure of some arrays whose allocated memory
increases with the number of MPI processes. Once again, we fixed the problem by using the
SHARED MEMORY allocation on core 0 of each node, which divided the size of these arrays by 16.
After several exchanges with the ALCF support team, we could finally run, at the end of
September 2013, a series of jobs with up to 32,768 nodes on MIRA, i.e. 524,288 cores (or
over 2 million threads using hyperthreading). These results are summarized in Fig. 2, which
shows a reasonable scaling between 4,096 and 32,768 nodes.
8.2. If applicable, which tools did you use to analyze your code? (e.g. Scalasca,
Vampir…etc.) (Maximum 500 words)
Many TotalView debugger sessions on all of the IBM BG/Q machines at our disposal
(Scalasca & TAU were used in a previous project).
8.3. What are the main actions that you did for optimization or improvement of
your code on the PRACE machines? What feature was to be optimized? What
was the bottleneck? What solution did you use (if any)? (Maximum 500 words)
See 8.1
8.4. Publications or reports regarding the development and optimization. (Format:
Author(s). “Title”. Publication, volume, issue, page, month year)
9. Results on Input/Output
9.1. Size of the data and/or the number of files. (Maximum 300 words)
See Section 8.1.
For the grid size tested, 4096x4096x1024 points = 16 billion grid points.
The input file is about 1.7 TB (divided into 512 2D slabs) and is written/read in less
than one minute.
9.2. Please, let us know if you used some MPI-IO features. (Maximum 300 words)
No.
But we tested parallel HDF5. The scalability is reasonably good up to 1 rack of
IBM BG/Q, but beyond that it is catastrophic: the time doubles with each doubling
of the number of cores!
10. Main results
What are your conclusions? What do you think of the usability of the assigned PRACE system?
MesoNH scaled up to 2 million threads (500K MPI ranks) and 60 TeraFlops on IBM
BG/Q.
First port to Cray XE6 up to 60K cores, and work in progress on the hybridization
with OpenACC.
4 million hours were spent on debugging/testing alone for this purpose, thanks to the
Argonne Lab. Director's Discretionary Allocation access on Mira.
200K hours on JUQUEEN (= 1 job of half an hour on 400K cores) was not enough!
11. Feedback and technical deployment
11.1. Feedback on the centers/PRACE mechanism (Maximum 500 words)
On HERMIT, the "1-month evanescent" working space is not usable for any project,
especially a preparatory one!
For users the feeling is very insecure: "Try hard to use our computer and generate
your data... and if you succeed, I will destroy them"?!
11.2. Explanation of how the computer time was used compared with the work
plan presented in the proposal. Justification of discrepancies, especially if the
computer time was not completely used. (Maximum 500 words)
11.3. Please, let us know if you plan to apply for a regular PRACE project? If not,
explain us why. (Maximum 500 words)
Not this year, not enough manpower (an INCITE one at Argonne Lab. instead).