Calcul Québec — Introduction to Scientific Computing

Objectives
• Familiarize new users with the concepts of High Performance Computing (HPC).
• Outline the knowledge needed to use our infrastructure.
• Our analysts are your best assets. Please contact them!
  • [email protected] (cottos, briaree, altix, hades)
  • [email protected] (mpII, msII)
  • [email protected] (colosse, guillimin)

Outline
• Distinction between HPC and desktop computing
• Understanding your applications
• Understanding the infrastructure
• Understanding the batch queue systems
• Registration, access to resources and usage policies
• Using the infrastructure
• Best practices

Distinction between HPC and desktop computing

Definitions – Building Blocks
A compute cluster is composed of multiple servers, also known as compute nodes, linked by a network.
[Diagram: compute nodes connected by a network]

Definitions – Building Blocks
The login node permits users to interact with the cluster: they can compile, test, transfer files, etc. This node is used by multiple users at the same time and is a shared resource.
[Diagram: a login node in front of several compute nodes]

Definitions – Building Blocks
A compute node is a server similar to an office computer: processors, memory, an I/O controller, a network interface and a disk. We shall see what sets them apart and how to choose between them.
[Diagram: processors, memory, I/O controller, network and disk inside a compute node]

Definitions – Building Blocks
A processor is composed of multiple independent compute cores. It also contains a memory cache that is smaller but faster than the main system memory.
[Diagram: compute cores sharing a cache inside a processor]

Definitions – Building Blocks
Each core is composed of processing units and registers. Registers are small but very fast memory spaces. Their number and characteristics vary between systems.
[Diagram: processing units and registers inside a core]

Definitions – Units
The base unit is the bit, noted « b ». A bit has two possible values: 0 or 1. Computers never manipulate the value of a single bit. Here are several examples of commonly used units:
• Byte (octet): composed of 8 bits, noted « B »
• Character: generally composed of 1 byte, ex: 01100001 in ASCII is « a »
• Integer: generally composed of 32 bits (4 bytes), ex: 00000000000000000000000001001101 represents 77

Definitions – Units
These units are based on powers of 2, not powers of 10. The units frequently used are:
8 b = 1 Byte
1024 B = 1 kB
1024 kB = 1 MB
1024 MB = 1 GB
1024 GB = 1 TB
Caution! According to the international standard, binary multiples should be noted with an « i », ex: kB → KiB.

Definitions – Bandwidth
Bandwidth is a measure of the quantity of information that can be transferred per unit of time. This measure is valid when the quantity of data being transferred is large.
[Figure: Node 1 sends 1 GB to Node 2 over the network in 48 seconds: 1024 MB / 48 sec = 21.3 MB/s]

Definitions – Latency
Latency corresponds to the minimum communication time. It is measured as the time it takes to transfer a very small quantity of data.
[Figure: Node 1 sends a single byte to Node 2 over the network in 7 seconds: latency = 7 s]
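Latency and bandwidth can be combined into a rough estimate of communication time. As a sketch (our own simplification, ignoring protocol overhead and network contention), the time to transfer a message of size S is approximately

    T(S) ≈ latency + S / bandwidth

With the values of the two figures above (latency = 7 s, bandwidth ≈ 21.3 MB/s), transferring 1 GB takes about 7 + 1024 / 21.3 ≈ 55 s, while transferring a single byte costs essentially only the latency.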
Characteristics – Networking
The servers used for HPC are characterized by high-performance networks. Here are some examples of networks and their characteristics:

Type                 Latency (µs)   Bandwidth (Gb/s)
ethernet 100 Mb/s    30             0.098
ethernet 1 Gb/s      30             1
ethernet 10 Gb/s     30             10
InfiniBand SDR       ~2             10
InfiniBand DDR       ~2             20
InfiniBand QDR       ~2             40
NUMAlink 4           ~1             25

Characteristics – Storage
The storage and file systems differ greatly from one site to another. HPC centres use storage arrays with parallel file systems.

Type                 Latency (ms, sync/async)   Bandwidth (MB/s, 1 file, sync/async)   Capacity (TB)
SATA (theoretical)   1                          ~120                                   3
SSD (theoretical)    0.1                        ~250                                   0.25
SATA (ext3)          1 / 0.01                   50 / 100                               3
Mp2 (Lustre)         75 / 0.5                   75 / 350                               500
Briarée (GPFS)       15 / 0.1                   450 / 1500                             256
Colosse (Lustre)     100 / 2                    100 / 600                              1000
Guillimin (GPFS)     0.5 / 0.1                  600 / 1900                             2000

These measurements were made on systems in production. The performance varies greatly as a function of time.

Characteristics – Size
• Colosse (Univ. Laval)
• Guillimin (McGill/ÉTS)
• Mammouth (Univ. de Sherbrooke)
• Briarée (Univ. de Montréal)

Characteristics – Shared Resources
A queuing system permits the sharing of resources and the application of usage policies. We describe queuing systems in more detail in another section.
[Figure: on the compute cluster, the job starts after a 00:40 wait in the queue and terminates at 01:30; on an office computer, the same job terminates at 03:00]

Understanding your application

Performance – Compute Intensive
The performance of compute cores and of memory accesses is described in terms of cycles. For example, a 3 GHz processor is able to perform 3 000 000 000 cycles per second.
Processors work with a stream of instructions. Each instruction requires a different number of cycles (depending upon the processor).

Instruction (32-bit real, Sandy Bridge)   Cycles
+                                         4
*                                         6
/                                         10-24
sqrt()                                    12-26
sin()                                     64-100

Performance – Compute Intensive
Modern processors (cores) divide the work into steps, as on an assembly line. This functionality, called a pipeline, accelerates the processing of instructions. For example, to add a=1.0 and b=2.0, the following steps are used:
• decode the instruction
• obtain the registers of a and b
• add a and b
• place the result in a register

Performance – Compute Intensive
Therefore, if we do c1 = a1+b1, c2 = a2+b2 and c3 = a3+b3, the pipeline overlaps the steps of successive instructions: while instruction 1 is obtaining its registers, instruction 2 is being decoded, and so on.
[Diagram: pipeline stages over time — DIi: decode instruction i; ORai,bi: obtain registers; ai+bi: add; SRci: save result in register]

Performance – Compute Intensive
Another important functionality of modern processors is vectorization. It combines several data values and performs a single operation on them.
Ex: we want to add 1.0 to the values r1=1.0, r2=1.0, r3=1.0, r4=1.0, grouped in the vector x1={1.0,1.0,1.0,1.0}.
conventional:
r5 = r1+1.0
r6 = r2+1.0
r7 = r3+1.0
r8 = r4+1.0
4 instructions!
vectorized:
x2 = x1+{1.0,1.0,1.0,1.0}
1 instruction!

Performance – Memory Access
The organization of memory accesses strongly affects application performance.
[Figure: access time in processor cycles as a function of data size — registers ≈ 3 cycles, cache ≈ 15 cycles, RAM ≈ 146 cycles]

Performance – Disk Access
How a file is written is very important for software performance. HPC storage generally performs best with large files and large read/write operations.
[Figure: bandwidth (MB/s) as a function of I/O size (kB), comparing RAID + GPFS with SATA + ext3, for large writes (up to ~35 MB) and for small writes (up to ~35 kB)]
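As a rough illustration (a sketch, not a benchmark of any particular Calcul Québec system; the file names are placeholders), one can observe the effect of the write size with dd in the scratch space:

#!/bin/bash
# Compare many small writes with large streaming writes (128 MB total each).
# Run this in scratch space, not in $HOME.
cd $SCRATCH

# 32768 writes of 4 kB each
dd if=/dev/zero of=test_small bs=4k count=32768 conv=fdatasync

# 128 writes of 1 MB each
dd if=/dev/zero of=test_large bs=1M count=128 conv=fdatasync

# dd reports the elapsed time and the effective bandwidth for each case
rm -f test_small test_large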
Serial Computations
A serial computation is a sequence of instructions that are executed one after another.
A = 1 1 1 1 1 1 1 1 1 1
B = 1 2 3 4 5 6 7 8 9 10
C = 2 3 4 5 6 7 8 9 10 11
initialisation, loop over the index i (from 1 to 10): Ai = 1, Bi = i
calculate the sum, loop over the index i (from 1 to 10): Ci = Ai + Bi
[Diagram: the iterations i = 1, 2, …, 10 of each loop are executed one after the other in time]

Parallel Computations
A parallel computation is a sequence of instructions of which several are executed at the same time.
A = 1 1 1 1 1 1 1 1 1 1
B = 1 2 3 4 5 6 7 8 9 10
C = 2 3 4 5 6 7 8 9 10 11
initialisation, loop over the index i (from 1 to 10): Ai = 1, Bi = i
calculate the sum, loop i (1 to 5): Ci = Ai + Bi and, at the same time, loop j (6 to 10): Cj = Aj + Bj
[Diagram: the iterations i = 1…5 and j = 6…10 are executed simultaneously, halving the time of the second loop]

Parallel Computations – Why?
The frequency of processors has not increased in the last 10 years! Therefore, if we want more compute power, it is necessary to parallelize.
The memory available on a single server can be insufficient. It is then necessary to use more compute nodes and to distribute the data and the work on these.
It is a way to be competitive!

Parallel Computing – Implications
Parallelizing an application is not easy. There are several possible difficulties:
• Algorithms that are performant in serial computations are generally not the most performant in parallel
• The organization of the data and of the work is not simple
• The memory is not necessarily accessible from all the child processes
• The network now affects the performance

Parallelism and Memory
When all processors have access to the same memory, the memory is said to be shared. Conversely, if each processor only sees a portion of the memory, the memory is said to be distributed.
[Diagram: shared memory vs. distributed memory]
Nowadays, almost all systems have a shared memory component.

Parallelism and Communications
In a distributed memory application, communications are needed to transfer data between the processing threads. The organization of these communications is important for the performance. Here is an example, mail delivery:
• Option 1: can bring 10 letters in 10 minutes — latency = 10 minutes, bandwidth ≈ 0.02 letters/second
• Option 2: can bring 1 million letters in 60 minutes — latency = 60 minutes, bandwidth ≈ 300 letters/second

Parallelism and Communications
So if I have one letter to send:
• Option 1: takes 1 trip, therefore 10 minutes
• Option 2: takes 1 trip, therefore 60 minutes
And if I have 10 000 letters:
• Option 1: takes 1000 trips, therefore almost 7 days
• Option 2: takes 1 trip, therefore 60 minutes

Difficulties of Parallelism
Certain algorithms cannot be parallelized or are not efficient in parallel. When that is the case, it is necessary to approach the problem with a different method.
Ex: dependencies
Loop over i: ai = ai-1 + ai-2
Ex: too little work
Loop over i from 1 to 10: ai = ai + 2

Difficulties of Parallelism
Two execution threads can access the same memory at almost the same time. In this case, one can have a race condition. There are methods to synchronize these accesses, but they result in a degradation of performance.
[Diagram: in the sequential section, a=12; in the parallel section, one thread executes « if a > 10: a=0 » while another executes « if a > 10: a=1 »; back in the sequential section, is a equal to 0 or 1?]
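As a minimal sketch (our own illustration, not taken from the course material; file and function names are arbitrary), the same kind of race can be reproduced in the shell: two processes update a shared counter without synchronization, and updates are lost.

#!/bin/bash
# Two background loops increment the same counter file with no synchronization.
echo 0 > counter.txt
increment() {
    for i in $(seq 1 1000); do
        n=$(cat counter.txt)            # read the current value
        echo $((n + 1)) > counter.txt   # write it back; another process may
    done                                # have updated the file in between
}
increment & increment &
wait
cat counter.txt   # usually well below 2000: the lost updates are the race
rm -f counter.txt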
Difficulties of Parallelism
During the synchronization of accesses, it can happen that all the execution threads wait for an event that none of them can create. This problem is named a deadlock.
Ex: with a = 0 and b = 0, every thread in the parallel section executes
Infinite loop:
    if a = 1:
        b = b + 1
end loop
Since no thread ever sets a to 1, they all wait forever.

Difficulties of Parallelism
In parallel code, it is in general impossible to determine the order of execution of the instructions. Consequently, if one repeats the same calculation multiple times, one can find differences in the numerical errors.
An example in single precision:
10000.0 + 3.14159 - 10000.0 - 3.14159 = 0.000011
10000.0 - 10000.0 + 3.14159 - 3.14159 = 0.0

Difficulties of Parallelism
The distribution of the work performed in parallel is important for the performance and is sometimes difficult to optimise.
[Diagram: with a perfect distribution, all threads finish at the same time; out of balance, some threads sit idle while the slowest one finishes]

Understanding the infrastructure

Compute Clusters
[Table: for each cluster (Cottos, Altix, Briarée, Hadès, Guillimin, Colosse, Ms II, Mp II, Psi), the slide lists the processors (Intel Xeon, Intel Westmere, Itanium, AMD Opteron, nVidia GPUs) and their frequency, the number of cores and nodes, the memory and cores per node, the network (Ethernet, InfiniBand DDR/QDR, NUMAlink) and the scratch storage. See the documentation on our web sites for the detailed specifications.]

Understanding the queuing systems

Queuing Systems – Why?
• Maximize the usage of the available resources;
• Avoid tasks that can affect each other;
• Moderate the usage of resources according to the defined policies and allocations.
The launching of interactive jobs is prohibited on the compute cluster servers.

Queuing Systems – Nomenclature
Job submission system: the user interface that permits submitting jobs and interacting with them. In Calcul Québec the following systems are utilised:
• Torque (Altix, Cottos, Briarée, MsII, MpII, Psi)
• Moab (Guillimin, Colosse*)
• Oracle Grid Engine – OGE (Colosse)
Scheduler: the software that calculates the priority of each job and applies the site policies. In Calcul Québec the following systems are utilised:
• Maui (Altix, Cottos, Briarée, MsII, MpII, Psi)
• Moab (Guillimin, Colosse*)
• Oracle Grid Engine – OGE (Colosse)

Queuing Systems – Priority
The scheduler establishes the priority of the jobs so that the target allocation of resources can be reached. Therefore, if the recent usage by a group is less than its target, the priority of its jobs increases; otherwise, the priority decreases.
Factors that determine the job priority:
• the time spent waiting in the queue;
• the recent utilisation of the group (including decay factors as a function of time);
• the resource allocation of the group.

Queuing Systems – Parameters
When submitting a job it is important to specify:
• the total memory,
• the number of cores,
• the duration (this permits jobs to pass into « holes » in the schedule!),
• the desired queue.
Each cluster possesses a group of queues with different properties (number of concurrent jobs, duration of jobs, maximum number of processors, etc.). To learn the details, see the documentation on our web sites or contact our analysts. These requirements are passed on the command line or in the job script, as in the sketch below.
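For example, on a cluster using Torque (a sketch only: the queue name, script name and values are hypothetical and differ from one cluster to another), these requirements map onto qsub options:

# Request 2 hours, 4 cores on one node, 14 GB of memory, in a queue named
# "courte" (all values are examples; adjust them to your cluster).
qsub -q courte -l walltime=02:00:00 -l nodes=1:ppn=4 -l mem=14gb my_script.sh

The same options appear later in this document as #PBS directives inside the submission scripts.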
Queuing Systems
[Figure: backfill scheduling — jobs 1 to 6 occupy CPUs 1 to 14 for their requested durations; a reservation is made for a priority job, and short waiting jobs can be started in the remaining « holes » before it]

Registration, access to resources and policies

Registration with Compute Canada
Compute Canada is the organization which federates the regional HPC consortia, of which Calcul Québec is a part. The first step to use the resources of Calcul Québec is to register with Compute Canada:
https://ccdb.computecanada.org/account_application
This step must first be taken by the professor who leads a group, and then by each sponsored member (students, post-docs, researchers, external collaborators). Each user must be registered in the database.

Registration with Calcul Québec
Currently, registration is done through the former consortia.
RQCHP (Altix, Briarée, Cottos, MpII, MsII, Psi): from the RQCHP website, the sponsor selects on which RQCHP resource to open an account for each user that they supervise.
https://rqchp.ca/servers_accounts
CLUMEQ (Colosse, Guillimin): from the website of the Compute Canada database (CCDB), each user can request the creation of an account at CLUMEQ; accounts are automatically configured for access to Colosse and Guillimin.
https://ccdb.computecanada.org/me/facilities

Acceptable Use Policy
By obtaining an account with Compute Canada, one agrees to abide by the following policies:
1. An account holder is responsible for all activity associated with their account.
2. An account holder must not share their account with others or try to access another user's account. Access credentials must be kept private and secure.
3. Compute Canada resources must only be used for projects/programs for which they have been duly allocated.
4. Compute Canada resources must be used in an efficient and considerate fashion.

Acceptable Use Policy
5. Compute Canada resources must not be used for illegal purposes.
6. An account holder must respect the privacy of their own data, of other users' data and of the underlying systems' data.
7. An account holder must provide reporting information in a timely manner and cite Compute Canada in all publications that result from work undertaken with Compute Canada resources.
8. An account holder must observe the computing policies in effect at the relevant centre and at their home institution.
9. An account holder may lose access if any of these policies are transgressed.
https://ccdb.computecanada.org/security/accept_aup

Use of Resources

Obtaining SSH
• Linux/UNIX, Mac OS X: available by default (www.openssh.org)
• Windows:
  • cygwin (for graphics, you will need an X-emulator or X-server)
  • Xming X Server (http://straightrunning.com/XmingNotes/)
  • PuTTY (putty.exe)
  • Tunnelier (http://www.bitvise.com/tunnelier)
  • see http://www.openssh.org/windows.html

Using SSH
• Connection in a terminal:
ssh -X [email protected]
The login node name is obtained following the activation of your account, or via the support web pages.
• Transferring files:
scp local_file [email protected]:destination
sftp [email protected]
• Tips and tricks (see the sketch below):
SSH keys (ssh-keygen) avoid retyping your password. Important: never use passphrase-less SSH keys!
A configuration file (.ssh/config) can store your connection options.
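A minimal sketch of both tips (the user name, host alias and login node name below are placeholders, not actual Calcul Québec values):

# Generate an SSH key pair protected by a passphrase (never leave it empty):
ssh-keygen -t rsa -b 4096
# Copy the public key to the cluster (replace the user and host names with
# the ones you received when your account was activated):
ssh-copy-id username@login-node.example.ca

# Example ~/.ssh/config entry, so that "ssh cluster" uses these options:
# Host cluster
#     HostName login-node.example.ca
#     User username
#     ForwardX11 yes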
Software
Software used by more than one user is generally installed centrally on each system by the analysts. Versioning and dependencies are handled by a tool called module.

Software
A module contains the information needed to modify a user's environment so as to use a given version of a software package.
• List the modules currently loaded: module list
• List the modules currently available: module avail
• Add (remove) a module from your environment: module add (rm) module_name
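A short example session (a sketch: the module name and version are hypothetical; use module avail to see what is actually installed on your cluster):

module avail             # what is installed?
module add gcc/4.7.2     # load a hypothetical compiler version
module list              # confirm what is loaded
module rm gcc/4.7.2      # remove it from the environment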
Storage Utilization
The environment variable $HOME refers to the default directory of each user. This directory is sometimes protected by regular back-ups. When connecting with ssh, the user arrives at this location.
A directory named « scratch » is available for most production work. This directory is not backed up, is of large capacity and has high performance.

Storage Utilization
MpII, MsII, Altix, Cottos, Briarée and Psi:
- the variable $SCRATCH indicates the location of the scratch space;
- $HOME is backed up.
Guillimin:
- /sb/scratch/username is the scratch space of each user;
- /sb/project/RAP_ID provides some small persistent space per group;
- $HOME is backed up.
Colosse:
- typing "colosse-info" in a terminal tells the user their RAP_ID;
- /scratch/RAP_ID/ is the scratch space of the project;
- $HOME is not backed up.

Job Submission
• Set the options for the job.
• Write the script to run.
• Submit the script.
• In the next cycle of resource allocation, the scheduler determines the job priority.
• The jobs with the highest priority are executed first if the requested resources are available.
• Queuing of the jobs is possible; the calculated priority increases with the time spent waiting.
• Job execution.
• Return of the standard output and standard error of the job.

Job Submission – Briarée
Queue      Maximum duration (h)
normale    168
courte     48
hp         168
hpcourte   48
longue     336
test       1
Constraints / notes: all queues together are limited to 2520 cores per group; individual queues have additional limits such as 36 jobs max / user, 1416 cores max / user, 4 nodes max / job, 171 nodes max / job, 180 cores max / user (60 nodes available), 72 jobs max / user, 8 jobs max / user, 2052 cores max / user (4 nodes available). See the documentation for the limits of each queue.

Job Submission – Briarée
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=4
#PBS -l mem=14gb
cd $SCRATCH/my_directory
module load module_used
./execution

The node obtained is reserved for the user; the next job does the same, if possible. One can add #PBS -q courte, but by default the submission system chooses the queue based upon what is requested.
qsub script : submit the script
qstat -u user_name : see the status of my jobs

Job Submission – Guillimin
Queue   Maximum duration (h)   Constraints / notes
sw      720                    Serial queue: 2:1 blocking network, 36 GB memory per node, 600 nodes
hb      720                    Non-blocking network: 24 GB memory per node, 400 nodes
lm      720                    Non-blocking network: 72 GB memory per node, 200 nodes
debug   2
Maximum of 1280 cores per job (by default). Per group, there is a maximum number of core-seconds for running jobs, which allows flexibility between many short jobs and fewer longer jobs.

Job Submission – Guillimin
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=4
#PBS -q lm
cd $SCRATCH/my_directory
module load module_needed
./execution

One specifies the queue to which the job is submitted based upon the memory and other requirements.
msub -q queue_name script : submit the job
showq -u user_name : see the state of my jobs
checkjob -v jobID : see detailed job information

Job Submission – Colosse
Queue   Maximum duration (h)   Constraints / notes
short   24                     256 cores maximum
med     48                     128 cores maximum
long    168
test    ¼                      16 cores maximum

Job Submission – Colosse
#!/bin/bash
#$ -l h_rt=7200
#$ -pe default 8
#$ -P abc-000-00
cd $SCRATCH/my_directory
module load module_needed
./execution

A job obtains complete nodes; the number of requested cores is therefore a multiple of 8. One can add #$ -q short, but by default the submission system chooses the queue based upon the requested resources.
colosse-info : obtain your project identifier (abc-000-00)
qsub script : submit the script
qstat -u user_name : see the status of my jobs

Job Submission – MpII
Queue     Maximum duration (h)   Constraints / notes
qwork     120
qfbb      120                    portion with non-blocking network
qfat256   120                    20 nodes available (48 cores per node)
qfat512   48                     2 nodes available (48 cores per node)
The size of the jobs that can be executed depends on the allocation and on the other jobs in the queue.
Ex: if there are 2400 cores available and 3 jobs in the queue:
- group 1: allocation = 100 → can use 1200 cores
- group 2: allocation = 50 → can use 600 cores
- group 3: allocation = 50 → can use 600 cores

Job Submission – MpII
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1
#PBS -l mem=14gb
#PBS -q qwork@mp2
cd $SCRATCH/my_directory
module load module_needed
./execution

A job obtains a complete node; the request is made in whole nodes (here nodes=1).
qsub script : submit the job script
qstat -u user_name : see the status of my jobs

Best Practices

Grouping Tasks
It is sometimes inefficient to launch many jobs one by one:
For i from 1 to 100: qsub -l nodes=1:ppn=1 -l walltime=...
This approach is potentially inefficient and should be avoided:
• certain systems limit the number of jobs;
• certain systems allocate whole nodes to jobs.
It is often better to group several executions in a single job, for example:
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1
#PBS -l mem=14gb
#PBS -q qwork@mp2
module load module_needed
cd $SCRATCH/my_directory1
./execution &
cd $SCRATCH/my_directory2
./execution &
wait

Grouping Tasks
Methods and tools are available by which job parameters can be automatically adjusted when running programs repeatedly; they help optimize the submission of related jobs.
On Guillimin, Colosse and Briarée: the submission or scheduling systems support job arrays, which simplify the submission of identical workloads that operate on different sets of parameters or data (see the sketch below).
On MpII: the grouping of jobs can be automated through the use of bqtools.
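As a sketch of a Torque-style job array (the array size, directory layout and #PBS values here are hypothetical; Moab and OGE use slightly different syntax):

#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1
#PBS -t 1-100
# $PBS_ARRAYID takes the values 1 to 100, one per array element;
# each element runs the same program in its own directory.
cd $SCRATCH/my_directory$PBS_ARRAYID
module load module_needed
./execution

Submitted once with qsub, this creates 100 related jobs instead of 100 separate submissions.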
Job Duration
Estimate your job execution time: a job that requests less time waits less time in the queue! The shorter queues generally allow the user to run more jobs. In addition, there is less risk of suffering the effects of a job failure. If you do not know how to estimate it, the analysts can help you.

Storage
For handling large files, use the scratch space of the systems. It is generally preferable to use large block reads and writes. Sometimes it is useful to use the disks local to the compute nodes where the jobs are running. Contact the analysts to obtain advice!

Contacting the Analysts
• [email protected] (cottos, briaree, altix, hades)
• [email protected] (mpII, msII)
• [email protected] (colosse, guillimin)

Useful Documentation and Support Links
• cottos, briaree, altix, hades, mpII, msII: https://rqchp.ca and select 'Documentation'
• colosse, guillimin: https://www.clumeq.ca and select 'Support'