PRACE Final Report Form – Preparatory Call B

1. General information

1.1. Proposal ID
2010PA1429

1.2. Type of preparatory proposal granted
B – Code development and optimization by the applicant (without PRACE support).

1.3. Period of access to the PRACE facilities
1/05/2013 to 31/10/2013

1.4. Name of the PRACE facility assigned
JUQUEEN, HERMIT & CURIE (hybrid nodes)

2. Project information

2.1. Project name to which corresponds the developed and optimized code
Scaling MESONH to next-generation IBM, Cray & GPU architectures for petascale simulation.

2.2. Research field
Astrophysics
Engineering and Energy
Earth Sciences and Environment
Mathematics and Computer Science
Medicine and Life Sciences
Chemistry and Materials
Fundamental Physics
Finance and Social Science
Linguistics and Encryption

2.3. Institutions and research team members
Dr Juan ESCOBAR MUNOZ, Computer Scientist
Dr Jean-Pierre CHABOUREAU, Physicist
Laboratoire d'Aérologie / CNRS / OMP / Université Paul Sabatier, Toulouse

2.4. Summary of the project interest (Maximum 300 words)
Please fill in the field with the same text used in the application form.
MesoNH is the non-hydrostatic mesoscale atmospheric model of the French research community. It has been jointly developed by the Laboratoire d'Aérologie (UMR 5560 UPS/CNRS) and by CNRM-GAME (URA 1357 CNRS/Météo-France). MesoNH now runs in production mode on Tier-0 computers on up to 8K cores. The goal of this project is to prepare MesoNH for the next-generation architectures (Cray, NVIDIA GPU and IBM BG/Q) and to gain one order of magnitude in the scalability of the code in production runs. Development will concentrate in priority on continuing the port of MESONH to GPUs with the PGI/OpenACC toolkit on multi-node clusters. The second goal of the project is to test the scalability of new algorithms for the pressure solver equation (such as multigrid) to replace the present quasi-spectral algorithm.

3. Main features of the code

3.1. Name of the code or codes.
MESO-NH

3.2. Type of the code distribution (e.g. opensource, commercial, academic…etc.)
Academic; license agreement (free of charge) with CNRM and LA.

3.3. Computational problem executed (e.g. N-body problem, Navier-Stokes equations…etc.)
Non-hydrostatic atmospheric model.

3.4. Computational method (e.g. FEM, FVM, PIC, spectral methods…etc.)
Finite differences + quasi-spectral + iterative Krylov methods.

3.5. Kind of parallelism used (e.g. MPI, OpenMP, MPI/OpenMP, pthreads, embarrassingly parallel…etc.)
Domain decomposition with MPI (an illustrative sketch is given at the end of this section).

3.6. Main libraries used (e.g. FFTW, MKL, BLAS, LAPACK…etc.), version and language (Fortran, C, C++…etc.). Did you use the /usr/local one?
Fortran 90 + MPI, vectorized code, all in array syntax. ifort and pgf90 + OpenACC under test. Global Arrays library to speed up the I/O.

3.7. Which other software did you use on the PRACE machines? Did you use some post-processing or pre-processing tools?
Our own graphics tools based on NCARG/NCL. NetCDF tools, NCO, ncview, etc. Some tests with ParaView (in progress).
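Illustrative note to 3.5: the sketch below shows, in generic form, the technique named there, i.e. a 2D Cartesian MPI process grid with a one-point halo exchange between neighbouring subdomains. All names, sizes and the exchange pattern are hypothetical and are not taken from the MESO-NH sources.

! Minimal illustrative sketch (not MESO-NH code): 2D Cartesian domain
! decomposition with MPI and a halo exchange in the x-direction.
program halo_sketch
   use mpi
   implicit none
   integer, parameter :: nx = 32, ny = 32, nz = 64   ! local subdomain size (hypothetical)
   integer :: ierr, comm2d, rank, nprocs
   integer :: dims(2)
   logical :: periods(2)
   integer :: west, east
   double precision :: field(0:nx+1, 0:ny+1, nz)     ! local array with a 1-point halo

   call mpi_init(ierr)
   call mpi_comm_size(mpi_comm_world, nprocs, ierr)

   ! Let MPI choose a balanced 2D process grid, then build a Cartesian communicator.
   dims = 0
   call mpi_dims_create(nprocs, 2, dims, ierr)
   periods = .false.
   call mpi_cart_create(mpi_comm_world, 2, dims, periods, .true., comm2d, ierr)
   call mpi_comm_rank(comm2d, rank, ierr)
   call mpi_cart_shift(comm2d, 0, 1, west, east, ierr)

   field = rank

   ! Exchange one halo plane (a y-z slice) with the east and west neighbours.
   call mpi_sendrecv(field(nx,  1:ny, 1:nz), ny*nz, mpi_double_precision, east, 10, &
                     field(0,   1:ny, 1:nz), ny*nz, mpi_double_precision, west, 10, &
                     comm2d, mpi_status_ignore, ierr)
   call mpi_sendrecv(field(1,   1:ny, 1:nz), ny*nz, mpi_double_precision, west, 20, &
                     field(nx+1,1:ny, 1:nz), ny*nz, mpi_double_precision, east, 20, &
                     comm2d, mpi_status_ignore, ierr)

   call mpi_finalize(ierr)
end program halo_sketch

Each rank owns one subdomain extended by a halo that is refreshed before the finite-difference operators are applied.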
4. Compilation step

4.1. Which other software did you use on the PRACE machines? Did you use some post-processing or pre-processing tools?
Module environment, CVS, rsync.

4.2. How is the program compiled? (e.g. makefile, script…etc.)
Makefile + script.

4.3. Difficulties met to compile, if any, and how they were tackled.
On Cray HERMIT, some Cray compiler bugs, reported to HLRS support. On IBM BG/Q, missing MPI support for the full INTEGER*8 interface.

4.4. Which version of the compiler and version of the MPI library did you use?
The default one. On Cray, tests were done with the Cray and Intel compilers. On Curie, PGI + OpenACC.

4.5. Did you use any tools to study the behavior of your code? (e.g. debugger, profiler…etc.)
Totalview and Scalasca on JUQUEEN. MPI timing: subroutine profiling with MPI_WTIME / CPU_TIME (as sketched below).
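Note to 4.5: a minimal sketch of the hand-made subroutine profiling with MPI_WTIME mentioned above, reduced over the ranks to obtain the slowest and the mean time. The routine name, the dummy work loop and the output format are illustrative only, not taken from MESO-NH.

! Illustrative only (not MESO-NH code): timing a code region with MPI_WTIME
! and reducing the result to get the slowest and average time over all ranks.
subroutine time_region_demo(comm)
   use mpi
   implicit none
   integer, intent(in) :: comm
   integer :: ierr, rank, nprocs, i
   double precision :: t0, t1, tloc, tmax, tsum, work

   call mpi_comm_rank(comm, rank, ierr)
   call mpi_comm_size(comm, nprocs, ierr)

   t0 = mpi_wtime()
   ! ... the region to be profiled goes here; dummy work as a placeholder ...
   work = 0.0d0
   do i = 1, 1000000
      work = work + sqrt(dble(i))
   end do
   t1 = mpi_wtime()

   tloc = t1 - t0
   call mpi_reduce(tloc, tmax, 1, mpi_double_precision, mpi_max, 0, comm, ierr)
   call mpi_reduce(tloc, tsum, 1, mpi_double_precision, mpi_sum, 0, comm, ierr)
   if (rank == 0) print '(a,f10.4,a,f10.4,a)', 'region time: max=', tmax, &
                        ' s, mean=', tsum/nprocs, ' s'
end subroutine time_region_demo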
5. Execution step

5.1. How is the program launched?
Batch jobs with runjob/aprun/mpirun.

5.2. Difficulties met to launch the code, if any, and how they were tackled.
Difficulties to launch the code on more than 100,000 cores on IBM BG/Q, due to problems of I/O scalability and too much memory consumption; see the next sections for the optimization done. On the Curie hybrid nodes, because of the amount of effort spent on recoding for the BG/Q architecture, not much progress was made.

6. Communication patterns

6.1. If you know which are the main communication patterns used in your code configuration, select the ones from the mentioned below:
Many point to point communications
Many collective communications
Barrier
Reduction
Broadcast
Scatter/gather (I/O)
All to all

7. Scalability testing

7.1. Summary of the obtained results from the scalability testing (Maximum 500 words)
Show the scaling behavior of your application. What progress did you achieve? Does it fulfill your expectations? If not, what were the reasons?
The previous scalability record for MesoNH, up to 128K cores, was obtained on the IBM BG/P JUGENE in a previous PRACE preparatory project. During this new PRACE preparatory access project, the MesoNH code was ported, through multiple projects, to the IBM BG/Q machines TURING at IDRIS (100K hours), JUQUEEN at Jülich (200K hours) and MIRA at Argonne Lab. (4 million hours):
- The I/O scalability was demonstrated up to 80 GB/sec on 500K cores on Mira.
- The scalability of the MesoNH code was demonstrated up to 500,000 MPI ranks x 4 OpenMP threads = 2 million threads on IBM BG/Q.
- On HERMIT, a first test on the Cray XE6 architecture, the MesoNH code was scaled up to 60,000 cores.
- On the Curie hybrid nodes, a first mixed OpenACC/MPI version ran on 4 GPUs (time ran short, but work continues next year).

7.2. Images or graphics showing results from the scalability testing (Minimum resolution of 300 dpi)
Please attach the images to this form.
Figure 1: I/O scalability of MesoNH, up to 84 GB/sec on 500K cores of IBM BG/Q.
Figure 2: MesoNH scalability, up to 60 teraflops and 2 million threads on IBM BG/Q.

7.3. Data to deploy scalability curves

7.4. Publications or reports regarding the scalability testing. (Format: Author(s). “Title”. Publication, volume, issue, page, month year)
Chaboureau and Escobar. TURING GENCI Grand Challenge, 2013.
Escobar. GENCI MesoNH support project 1605, annual report, 2013.
Paoli. Argonne Lab. Director's Discretionary Allocation on Mira, Meso_CCS_DD13.

8. Development and optimization

8.1. Summary of the obtained results from the enabling process (Maximum 500 words)
Please describe the effort you spent. What progress did you achieve? Please describe in detail which enabling work was performed (porting, work on algorithms, I/O…etc.). Which problems did you experience?

Optimization of Meso-NH on IBM BG/Q

The objective of this project was to test the capability of Meso-NH to run beyond the limit of 130,000 cores previously achieved on the JUGENE IBM BG/P in Germany. This would allow simulations on computational meshes of up to 4096³, i.e. 68 billion grid points (our present "record" was 2048³, or about 8 billion grid points, on 16,000 nodes in DUAL mode on Intrepid). With such a computational mesh, Meso-NH could potentially run on 1 million MPI processes (the entire Mira machine, i.e. 786,432 cores). To achieve these objectives we tried to improve the performance of Meso-NH by working on two specific drawbacks: the scalability of the I/O and the memory consumption of the code.

1) I/O scalability

We first tested an alternative procedure for treating the I/O in Meso-NH, based on the Global Arrays library (instead of classical MPI SEND/RECV). This is a very attractive method for transposing 3D data into N independent files, each of them containing a subset of 2D data corresponding to a number of horizontal planes. This procedure was tested on up to 32,768 nodes and turned out to be 20 to 30 times faster than the data transposition based on pure MPI subroutines. Figure 1 shows good scaling of the I/O performance up to 524,288 cores with 32 threads per node, with a peak I/O speed of about 84 GB/sec. As a side note, we found that a certain number of jobs remained in deadlock; this problem was solved by setting two environment variables as follows:

export ARMCI_STRIDED_METHOD=IOV
export ARMCI_IOV_METHOD=CONSRV

2) Memory consumption

One of the recurrent problems with Meso-NH is its memory consumption, which in some cases increases instead of decreasing with the number of cores and ultimately leads to a loss of scalability of the code.

- One source of the problem comes again from the I/O processes. So far, all I/O buffers were allocated by a few dedicated I/O processes. Since only a small part of the MPI processes are involved in the I/O (512 to 1024 out of a total of tens of thousands of processes), we opted for allocating the buffers in the SHARED MEMORY available on BG/Q, which is not limited to the 1 GB available per core. With this procedure we can use 2 GB per node by setting

export BG_SHAREDMEMSIZE=+2048
export BG_MAPCOMMONHEAP=1

and, since only 1 core per node is dedicated to I/O, we can allocate practically the whole node memory. The use of SHARED MEMORY is not standard in Fortran 90, hence we coded some C subroutines interfaced through the Fortran module iso_c_binding. We then replaced ALLOCATE(BUFFER(N)) by a new shm_alloc function combined with c_f_pointer, a procedure provided by iso_c_binding that links a Fortran pointer to a memory zone obtained from any C subroutine (an illustrative sketch of this mechanism is given at the end of this answer). With this functionality we could run on 128,000 cores on a 4096x4096x1024 mesh, i.e. about 17 billion grid points.

- When we tried to run on 250,000 cores, the code still remained in deadlock. We identified that the problem came from the data structure of some arrays whose allocated memory increases with the number of MPI processes. Once again, we fixed the problem by using the SHARED MEMORY allocation on core 0 of each node, which divided the size of these arrays by 16. After several exchanges with the ALCF support team, we could finally run, at the end of September 2013, a series of jobs with up to 32,768 nodes on MIRA, i.e. 524,288 cores (or over 2 million threads using hyperthreading). These data are summarized in Fig. 2, which shows reasonable scaling between 4,096 and 32,768 nodes.
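Note to 8.1: a minimal, self-contained sketch of the c_f_pointer mechanism described above, i.e. attaching a Fortran pointer to memory obtained from a C allocation routine instead of using ALLOCATE. For simplicity the C side here is the standard malloc/free; the actual shm_alloc in Meso-NH draws from the BG/Q shared-memory segment instead, so all names and details below are only indicative.

! Illustrative sketch only: attach a Fortran pointer to memory obtained from C,
! in the spirit of the shm_alloc replacement of ALLOCATE for the I/O buffers.
! Plain malloc/free stand in here for the BG/Q shared-memory allocation.
module shm_alloc_sketch
   use iso_c_binding
   implicit none
   interface
      function c_malloc(nbytes) bind(c, name='malloc') result(ptr)
         import :: c_ptr, c_size_t
         integer(c_size_t), value :: nbytes
         type(c_ptr) :: ptr
      end function c_malloc
      subroutine c_free(ptr) bind(c, name='free')
         import :: c_ptr
         type(c_ptr), value :: ptr
      end subroutine c_free
   end interface
contains
   subroutine shm_alloc_real64(buffer, n, raw)
      real(c_double), pointer, intent(out) :: buffer(:)
      integer, intent(in)                  :: n
      type(c_ptr), intent(out)             :: raw
      raw = c_malloc(int(n, c_size_t) * c_sizeof(0.0_c_double))
      ! c_f_pointer links the Fortran pointer BUFFER to the C memory zone.
      call c_f_pointer(raw, buffer, [n])
   end subroutine shm_alloc_real64
end module shm_alloc_sketch

program demo
   use iso_c_binding
   use shm_alloc_sketch
   implicit none
   real(c_double), pointer :: buffer(:)
   type(c_ptr) :: raw
   call shm_alloc_real64(buffer, 1000, raw)   ! instead of ALLOCATE(BUFFER(1000))
   buffer = 0.0_c_double
   call c_free(raw)
end program demo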
8.2. If applicable, which tools did you use to analyze your code? (e.g. Scalasca, Vampir…etc.) (Maximum 500 words)
A lot of Totalview debugger sessions on all of the IBM BG/Q machines available to us (Scalasca and TAU were used in a previous project).

8.3. What are the main actions that you did for optimization or improvement of your code on the PRACE machines? What feature was to be optimized? What was the bottleneck? What solution did you use (if any)? (Maximum 500 words)
See 8.1.

8.4. Publications or reports regarding the development and optimization. (Format: Author(s). “Title”. Publication, volume, issue, page, month year)

9. Results on Input/Output

9.1. Size of the data and/or the number of files. (Maximum 300 words)
See 8.1. For the grid size tested, 4096x4096x1024 points (about 17 billion grid points), the input file is about 1.7 TB (divided into 512 2D slabs) and is written/read in less than 1 minute.

9.2. Please let us know if you used some MPI-IO features. (Maximum 300 words)
No, but we have tested parallel HDF5. The scalability is reasonably good up to 1 rack of IBM BG/Q, but beyond that it is catastrophic: the time doubles with each doubling of the number of cores. (An illustrative sketch of the parallel HDF5 file-access setup is appended at the end of this report.)

10. Main results
What are your conclusions? What do you think of the usability of the assigned PRACE system?
MesoNH scaled up to 2 million threads (500K MPI ranks) and 60 teraflops on IBM BG/Q. A first port to Cray XE6 reached up to 60K cores, and work is in progress on the hybridization with OpenACC. 4 million hours were spent on debugging and testing for this purpose, thanks to the Argonne Lab. Director's Discretionary Allocation on Mira. The 200K hours on JUQUEEN, i.e. one job of half an hour on 400K cores, were not enough!

11. Feedback and technical deployment

11.1. Feedback on the centers/PRACE mechanism (Maximum 500 words)
On HERMIT, the "one-month evanescent" working space is not usable for any project, especially a preparatory one. For users the feeling is very insecure: "Try hard to use our computer and generate your data... and if you succeed, we will destroy them?!"

11.2. Explanation of how the computer time was used compared with the work plan presented in the proposal. Justification of discrepancies, especially if the computer time was not completely used. (Maximum 500 words)

11.3. Please let us know if you plan to apply for a regular PRACE project. If not, please explain why. (Maximum 500 words)
Not this year: not enough manpower (it is devoted to the Argonne Lab. INCITE project).
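Appended note to 9.2: a minimal sketch of how a single shared file is opened for parallel access with the HDF5 Fortran API (MPI-IO driver), which is the mechanism exercised in the parallel HDF5 test mentioned there. The file name and the bare create/close sequence are illustrative, not the actual test code.

! Illustrative sketch only: collective creation of one shared HDF5 file
! using the MPI-IO virtual file driver.
program phdf5_sketch
   use mpi
   use hdf5
   implicit none
   integer        :: mpierr, hdferr
   integer(hid_t) :: fapl_id, file_id

   call mpi_init(mpierr)
   call h5open_f(hdferr)

   ! File-access property list selecting the MPI-IO driver.
   call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
   call h5pset_fapl_mpio_f(fapl_id, mpi_comm_world, mpi_info_null, hdferr)

   ! All ranks create/open the same file collectively (hypothetical name).
   call h5fcreate_f('test_parallel.h5', H5F_ACC_TRUNC_F, file_id, hdferr, &
                    access_prp=fapl_id)

   ! ... collective dataset writes would go here ...

   call h5pclose_f(fapl_id, hdferr)
   call h5fclose_f(file_id, hdferr)
   call h5close_f(hdferr)
   call mpi_finalize(mpierr)
end program phdf5_sketch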