
Calcul Québec
Introduction to Scientific Computing
Objectives
• Familiarize new users with the concepts of High Performance Computing (HPC).
• Outline the knowledge needed to use our infrastructure.
• Our analysts are your best asset. Please contact them!
  •  [email protected] (cottos, briaree, altix, hades)
  •  [email protected] (mpII, msII)
  •  [email protected] (colosse, guillimin)
Outline
• Distinction between HPC and desktop computing
• Understanding your applications
• Understanding the infrastructure
• Understanding the batch queue systems
• Registration, access to resources and usage policies
• Using the infrastructure
• Best practices
Distinction between HPC and desktop computing
Definitions – Building Blocks
A compute cluster is composed of multiple servers, also known as compute nodes.
[Diagram: compute nodes interconnected by a network.]
Definitions – Building Blocks
The login node permits users to interact with the cluster: they can compile, test, transfer files, etc. This node is used by multiple users at the same time and is a shared resource.
[Diagram: a login node connected to several compute nodes.]
Definitions – Building Blocks
A compute node server is similar to an office computer. We shall see what sets them apart and how to choose between them.
[Diagram: a compute node containing processors, memory, an I/O controller, a network interface and a disk.]
Definitions – Building Blocks
A processor is composed of multiple independent* compute cores. It also contains a memory cache that is smaller but faster than the main system memory.
[Diagram: a processor with multiple compute cores and a cache between the cores and main memory.]
Definitions – Building Blocks
Each core is composed of processing units and registers. Registers are small but very fast memory spaces. Their number and characteristics vary between systems.
[Diagram: processing units and registers inside a core, connected to the memory cache and main memory.]
Definitions – Units
The base unit is the bit, noted « b ». A bit has two possible values: 0 or 1.
Computers never manipulate the value of a single bit. Here are several commonly used units:
•  Byte (octet): composed of 8 bits, noted « B »
•  Character: generally composed of 1 byte
ex: 01100001 in ASCII is « a »
•  Integer: generally composed of 32 bits (4 bytes)
ex: 00000000000000000000000001001101 represents 77
Definitions – Units
Binary units are based on powers of 2, not powers of 10.
The units frequently used are:
8 b = 1 Byte
1024 B = 1 kB
1024 kB = 1 MB
1024 MB = 1 GB
1024 GB = 1 TB
Caution! According to the international standard, these binary units should be noted with an « i », ex: kB → KiB.
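For instance, at the giga scale the two conventions differ by about 7 %:
1 GiB = 1024 × 1024 × 1024 B = 1 073 741 824 B, whereas 1 GB (decimal, SI) = 1 000 000 000 B.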
Definitions – Bandwidth
Bandwidth is a measure of the quantity of information that can be transferred per unit of time. This measure is valid if the quantity of data being transferred is large.
[Diagram: node 1 sends 1 GB across the network to node 2; the transfer takes 48 seconds.]
1024 MB / 48 sec = 21.3 MB/s
Definitions – Latency
Latency corresponds to the minimum communication time. It is measured as the time it takes to transfer a small quantity of data.
[Diagram: node 1 sends 1 B across the network to node 2; the transfer takes 7 seconds.]
latency = 7 seconds
Characteristics – Networking
The servers used for HPC are characterized by high-performance networks. Here are some examples of networks and their characteristics:

Type                 Latency (µs)   Bandwidth (Gb/s)
Ethernet 100 Mb/s    30             0.098
Ethernet 1 Gb/s      30             1
Ethernet 10 Gb/s     30             10
InfiniBand SDR       ~2             10
InfiniBand DDR       ~2             20
InfiniBand QDR       ~2             40
NUMAlink 4           ~1             25
Characteristics – Storage
Storage and file systems differ greatly from site to site. HPC centres use storage arrays with parallel file systems.

Type                 Latency (ms)    Bandwidth (MB/s)        Capacity (TB)
                     sync / async    1 file, sync / async
SATA – theoretical   1               ~120                    3
SSD – theoretical    0.1             ~250                    0.25
SATA – (ext3)        1 / 0.01        50 / 100                3
Mp2 (lustre)         75 / 0.5        75 / 350                500
Briarée (gpfs)       15 / 0.1        450 / 1500              256
Colosse (lustre)     100 / 2         100 / 600               1000
Guillimin (gpfs)     0.5 / 0.1       600 / 1900              2000

These measurements were made on systems in production. The performance varies greatly as a function of time.
Characteristics – Size
Colosse (Univ. Laval)
Guillimin (McGill/ÉTS)
Mammouth (Univ. de Sherbrooke)
Briarée (Univ. de Montréal)
Characteristics – Shared Resources
A queuing system permits the sharing of resources and the application of usage policies. We describe queuing systems in more detail in another section.
[Diagram: timeline from 00:00 to 03:00 comparing the same job on a compute cluster and on an office computer. On the cluster the job waits 00:40 in the queue but terminates at 01:30; on the office computer it starts immediately but terminates at 03:00.]
Understanding your applications
Performance – Compute Intensive
The performance of compute cores and memory access is described in terms of cycles. For example, a 3 GHz processor is able to perform 3 000 000 000 cycles per second.
The processors work with a stream of instructions. Each instruction requires a different number of cycles (depending upon the processor).

Instruction   Cycles (32-bit real, Sandy Bridge)
+             4
*             6
/             10-24
sqrt()        12-26
sin()         64-100
Performance – Compute Intensive
Modern processors (cores) divide the work into steps, as on an assembly line. This feature, called a pipeline, accelerates the processing of instructions.
For example, to add a=1.0 and b=2.0, we can use the following steps:
•  decode the instruction
•  obtain the registers of a and b
•  add a and b
•  place the result in a register
Performance – Compute Intensive
Therefore, if we do c1 = a1+b1 and c2 = a2+b2 and c3 = a3+b3, the pipeline functions as follows (pipeline depth runs down, time runs right):

DI1    OR a1,b1    a1+b1       SR c1
       DI2         OR a2,b2    a2+b2      SR c2
                   DI3         OR a3,b3   a3+b3    SR c3

DIi : decode instruction i
ORi : obtain registers i
SRi : save register i
Performance – Compute Intensive
Another important feature of modern processors is vectorization. It combines several data values and performs a single operation on them.
Ex: We want to add 1.0 to each of the values r1=1.0, r2=1.0, r3=1.0, r4=1.0, stored together in the vector register x1={1.0,1.0,1.0,1.0}.

conventional:
r5 = r1+1.0
r6 = r2+1.0
r7 = r3+1.0
r8 = r4+1.0
4 instructions!

vectorized:
x2 = x1+{1.0,1.0,1.0,1.0}
1 instruction!
Performance – Memory Access
The organization of memory access strongly affects application performance.
[Plot: access time in processor cycles as a function of the size of the data accessed (1 B to 262144 B): registers ~3 cycles, cache ~15 cycles, RAM ~146 cycles.]
Performance – Disk Access
How a file is written is very important for software performance. HPC storage generally performs best with large files.
[Plots: write bandwidth (MB/s) as a function of file size (kB) for sata + ext3 and raid + gpfs, for files up to 35 000 kB and a zoom on files up to 35 kB.]
Serial Computations
It is a sequence of instructions that are executed one after another.
Example: initialisation, loop over the index i (from 1 to 10): Ai = 1, Bi = i; then calculate the sum, loop over the index i (from 1 to 10): Ci = Ai + Bi.
Result: A = (1,1,...,1), B = (1,2,...,10), C = (2,3,...,11).
The two loops run serially: i = 1, 2, ..., 10 for the initialisation, then i = 1, 2, ..., 10 for the sum.
Parallel Computations
It is a set of instructions that are executed at the same time.
Example: initialisation, loop over the index i (from 1 to 10): Ai = 1, Bi = i; then calculate the sum with two parallel streams, loop i (1 to 5) and loop j (6 to 10): Ci = Ai + Bi and Cj = Aj + Bj.
Result: A = (1,1,...,1), B = (1,2,...,10), C = (2,3,...,11).
The initialisation runs serially (i = 1, 2, ..., 10); the two sum loops then advance together: i=1 j=6, i=2 j=7, ..., i=5 j=10.
Parallel Computations – Why?
Processor frequencies have not increased in the last 10 years! Therefore, if we want more compute power, it is necessary to parallelize.
The memory available on a single server can be insufficient. It is then necessary to use more compute nodes and distribute the data and work across them.
It is a way to be competitive!
Parallel Computing - Implications
Parallelizing an application is not easy. There are several possible difficulties:
• Algorithms that perform well in serial computations are generally not the best in parallel
• The organization of the data and work is not simple
• The memory is not necessarily accessible from all the child processes
• The network now affects the performance
Parallelism and Memory
When all processors have access to the same memory, it is said to be shared. Conversely, if each processor sees only a portion of the memory, it is said to be distributed.
[Diagram: shared memory (all processors attached to one memory) versus distributed memory (each processor with its own memory).]
Nowadays, almost all systems have a shared-memory component.
Parallelism and Communications
In the case of a distributed-memory application, communications are needed to transfer data between the processing threads. The organization of these communications is important for performance.
Here is an example: mail delivery.
The first carrier can bring 10 letters in 10 minutes:
- latency = 10 minutes
- bandwidth = 0.02 letters/second
The second carrier can bring 1 million letters in 60 minutes:
- latency = 60 minutes
- bandwidth = 300 letters/second
Parallelism and Communications
So if I have one letter to send:
- the first carrier takes 1 trip, therefore 10 minutes
- the second carrier takes 1 trip, therefore 60 minutes
And if I have 10 000 letters:
- the first carrier takes 1 000 trips, therefore almost 7 days
- the second carrier takes 1 trip, therefore 60 minutes
Difficulties of Parallelism
Certain algorithms cannot be parallelized or are not efficient in parallel. When that is the case, it is necessary to approach the problem using a different method.
Ex: dependencies
Loop over i:
a[i] = a[i-1] + a[i-2]
Ex: too little work
Loop over i from 1 to 10:
a[i] = a[i] + 2
Difficulties of Parallelism
Two execution threads can access the same memory at almost the same time. In this case, one can have a race condition. There are methods to synchronize these accesses, but they result in a degradation of performance.
sequential section: a = 12
parallel section, thread 1: if a > 10 : a = 0
parallel section, thread 2: if a > 10 : a = 1
sequential section: a = 0 or 1?
Difficulties of Parallelism
During synchronization of accesses, it can occur that all the execution threads await an event that none of them can create. This problem is named deadlock.
sequential section: a = 0 and b = 0
parallel section, each thread executes:
Infinite loop:
if a = 1: b = b+1
end loop
Every thread waits for a to become 1, but no thread ever sets it: the program hangs.
Difficulties of Parallelism
In parallel code, it is in general impossible to determine the order of execution of the instructions. Therefore, if one repeats the same calculation multiple times, one can find differences in the numerical errors.
An example in single precision:
10000.0 + 3.14159 - 10000.0 - 3.14159 = 0.000011
10000.0 - 10000.0 + 3.14159 - 3.14159 = 0.0
Difficulties of Parallelism
The distribution of work performed in parallel is important
for performance and is sometimes difficult to optimise.
[Diagram: timelines of several threads; with a perfect distribution all threads finish together, while an unbalanced distribution leaves some threads idle.]
Understanding the infrastructure
Compute Clusters
[Table: specifications of the Calcul Québec clusters (Cottos, Altix, Briarée, Hadès, Guillimin, Colosse, MsII, MpII, Psi): processor types (Intel Xeon, Intel Westmere, Itanium, AMD Opteron, nVidia GTX GPUs) and clock frequencies, core and node counts, memory and cores per node, interconnect (Ethernet, InfiniBand DDR/QDR, NUMAlink, ScaleMP) and storage capacity.]
Understanding the queuing systems
Queuing Systems – Why?
• Maximize the usage of available resources;
• Avoid tasks that can affect each other;
• Moderate the usage of resources according to defined
policies and allocations.
The launching of interactive jobs is prohibited on the
compute cluster servers.
Queuing Systems – Nomenclature
Job submission system: the user interface that permits job submission and interaction with jobs. In Calcul Québec the following systems are utilised:
•  Torque (Altix, Cottos, Briarée, MsII, MpII, Psi)
•  Moab (Guillimin, Colosse*)
•  Oracle Grid Engine – OGE (Colosse)
Scheduler: the software that calculates the priority of each job and applies the site policies. In Calcul Québec the following systems are utilised:
•  Maui (Altix, Cottos, Briarée, MsII, MpII, Psi)
•  Moab (Guillimin, Colosse*)
•  Oracle Grid Engine – OGE (Colosse)
Queuing Systems - Priority
The scheduler establishes the priority of jobs so that the
target allocation of resources can be reached. Therefore, if
the recent usage by a group is less than the target, then the
job priority increases; otherwise, the priority decreases.
Factors that determine the job priority:
• time waiting in the queue;
• recent group utilisation (including decay factors as a
function of time);
• the resource allocation of the group.
Queuing Systems - Parameters
When submitting a job it is important to specify:
•  the total memory,
•  the number of cores,
•  the duration (an accurate duration permits jobs to slip into scheduling « holes »!),
•  the desired queue.
Each cluster possesses a group of queues with different properties (number of concurrent jobs, duration of jobs, maximum number of processors, etc.).
To learn the details, see the documentation on our web sites or contact our analysts.
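On a Torque-based system, for example, these parameters can be passed as options to qsub (the values and queue name are illustrative):

qsub -l walltime=12:00:00 -l nodes=2:ppn=8 -l mem=24gb -q courte script.sh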
[Diagram: a schedule of CPUs 1-14 over time; jobs 1-6 occupy blocks of CPUs for their requested durations, a reservation is held for a priority job, and waiting jobs backfill into the remaining « holes ».]
Registration, access to resources and policies
Registration with Compute Canada
Compute Canada is the organization which federates the regional HPC consortia, of which Calcul Québec is a part.
The first step to use the resources of Calcul Québec is to register with Compute Canada:
https://ccdb.computecanada.org/account_application
This step must first be taken by the professor who leads a group, and then by each sponsored member (students, post-docs, researchers, external collaborators).
Each user must be registered in the database.
Registration with Calcul Québec
Currently, registration is done through the old existing consortia.
RQCHP (Altix, Briarée, Cottos, MpII, MsII, Psi):
From the RQCHP website, the sponsor selects the RQCHP resources on which to open an account for each user that they supervise.
https://rqchp.ca/servers_accounts
CLUMEQ (Colosse, Guillimin):
From the website of the Compute Canada database (CCDB), each user can request the creation of an account at CLUMEQ and is automatically configured for access to Colosse and Guillimin.
https://ccdb.computecanada.org/me/facilities
Acceptable Use Policy
By obtaining an account with Compute Canada, one must abide by the following policies:
1.  An account holder is responsible for all activity associated with their account.
2.  An account holder must not share their account with others or try to access another user's account. Access credentials must be kept private and secure.
3.  Compute Canada resources must only be used for projects/programs for which they have been duly allocated.
4.  Compute Canada resources must be used in an efficient and considerate fashion.
5.  Compute Canada resources must not be used for illegal purposes.
6.  An account holder must respect the privacy of their own data, that of other users, and that of the underlying systems.
7.  An account holder must provide reporting information in a timely manner and cite Compute Canada in all publications that result from work undertaken with Compute Canada resources.
8.  An account holder must observe the computing policies in effect at the relevant centre and at their home institution.
9.  An account holder may lose access if any of these policies are transgressed.
https://ccdb.computecanada.org/security/accept_aup
Use of Resources
Obtaining SSH
•  Linux/UNIX, Mac OS X: OpenSSH is generally already available (www.openssh.org).
•  Windows:
   •  cygwin (for graphics, you will need an X-emulator or X-server)
   •  Xming X Server (http://straightrunning.com/XmingNotes/)
   •  PuTTY (putty.exe)
   •  Tunnelier (http://www.bitvise.com/tunnelier)
   •  see http://www.openssh.org/windows.html
Using SSH
•  Connection in a terminal:
ssh -X [email protected]
(The login node name is obtained following activation of your account, or via the support web pages.)
•  Transferring files:
scp local_file [email protected]:destination
sftp [email protected]
•  Tips and tricks:
SSH keys (ssh-keygen) save you from retyping your password.
Important: never use passphrase-less SSH keys!
Configuration file (.ssh/config); a sketch follows.
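A minimal sketch of key-based access (host and user names are placeholders):

ssh-keygen -t rsa -b 4096          # generate a key pair; choose a non-empty passphrase
ssh-copy-id [email protected]  # install the public key on the login node
# Optional entry in ~/.ssh/config, so that « ssh cluster » suffices:
# Host cluster
#     HostName nom.du.serveur.ca
#     User username
#     ForwardX11 yes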
Software
Software used by more than one user is generally installed centrally on each system by the analysts. Versioning and dependencies are handled by a tool called module.
A module contains information that permits the modification of a user's environment so as to use a given version of the software.
• List the modules currently in use:
module list
• List the modules currently available:
module avail
• Add (remove) a module from your environment:
module add (rm) module_name
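A typical session might look like this (the module name and version are illustrative; use module avail to see what exists on your system):

module avail            # find the desired software
module add gcc/4.7.0    # load a specific version
module list             # confirm that it is loaded
module rm gcc/4.7.0     # remove it when no longer needed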
Storage Utilization
The environment variable $HOME refers to the default directory of each user. This directory is sometimes protected by regular back-ups. An SSH connection lands the user in this location.
A directory named « scratch » is available for most production work. This directory is not backed up, has a large capacity and offers high performance.
Storage Utilization
MpII, MsII, Altix, Cottos, Briarée and Psi:
- the variable $SCRATCH indicates the location of scratch.
- $HOME is backed up.
Guillimin:
- /sb/scratch/username is the scratch for each user.
- /sb/project/RAP_ID provides some small persistent space per group.
- $HOME is backed up.
Colosse:
Typing « colosse-info » in a terminal will tell the user their RAP_ID.
- /scratch/RAP_ID/ is the scratch for the project.
- $HOME is not backed up.
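As a sketch (directory and file names are illustrative), a run is staged on scratch rather than in $HOME:

mkdir -p $SCRATCH/myrun                      # working directory on the fast file system
cp $HOME/project/input.dat $SCRATCH/myrun/   # stage the input data
cd $SCRATCH/myrun
$HOME/project/execution                      # run with input and output on scratch
cp results.dat $HOME/project/                # copy precious results back to $HOME (backed up on most systems)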
Job Submission
•  Set the options for the job.
•  Write a script to run.
•  Submit the script.
At the next cycle of resource allocation, the scheduler determines the job priority:
•  The jobs with the highest priority are executed first if the requested resources are available.
•  Queuing of the jobs is possible.
•  The calculated priority increases with the time spent waiting.
•  Job execution.
•  Return of the standard output and standard error of the job.
Job Submission - Briarée
Queue name   Maximum duration (h)   Constraints / notes
normale      168                    36 jobs max / user; 1416 cores max / user; 4 nodes max / job
courte       48                     72 jobs max / user; 4 nodes max / job
hp           168                    171 nodes max / job
hpcourte     48                     171 nodes max / job; 2052 cores max / user
longue       336                    180 cores max / user; 60 nodes available
test         1                      8 jobs max / user; 4 nodes available
All queues: 2520 cores per group.
Job Submission - Briarée
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=4
#PBS -l mem=14gb
cd $SCRATCH/my_directory
module load module_used
./execution

The obtained node is reserved for the user. The next job does the same if possible. One can add #PBS -q courte, but by default the submission system chooses the queue based upon what is requested.

qsub script : submit the script
qstat -u user_name : see the status of my jobs
Job Submission - Guillimin
Queue name   Maximum duration (h)   Constraints / notes
sw           720                    Serial queue: 2:1 blocking network, 36 GB memory per node, 600 nodes
hb           720                    Non-blocking network: 24 GB memory per node, 400 nodes
lm           720                    Non-blocking network: 72 GB memory per node, 200 nodes
debug        2                      Maximum of 1280 cores per job (default)
Per group, there is a maximum number of core-seconds for running jobs. This allows flexibility between many short-duration jobs and fewer longer-duration jobs.
Job Submission - Guillimin
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=4
#PBS -q lm
cd $SCRATCH/my_directory
module load module_needed
./execution

One can specify the queue to which the job is submitted based upon the memory and other requirements.

msub -q queue_name script : submit the job
showq -u user_name : see the state of my jobs
checkjob -v jobID : see detailed job information
Job Submission - Colosse
Queue name   Maximum duration (h)   Constraints / notes
short        24                     256 cores maximum
med          48                     128 cores maximum
long         168
test         ¼                      16 cores maximum
Job Submission - Colosse
#!/bin/bash
#$ -l h_rt=7200
#$ -pe default 8
#$ -P abc-000-00
cd $SCRATCH/my_directory
module load module_needed
./execution

A job obtains a complete node. The number of requested cores is therefore a multiple of 8. One can add #$ -q short, but by default the submission system chooses the queue based upon the requested resources.

colosse-info : obtain your abc-000-00 project ID
qsub script : submit the script
qstat -u user_name : see the status of my jobs
Job Submission - MpII
Queue name   Maximum duration (h)   Constraints / notes
qwork        120
qfbb         120                    portion with non-blocking network
qfat256      120                    20 nodes available (48 cores per node)
qfat512      48                     2 nodes available (48 cores per node)
The size of the jobs that can be executed depends on the allocation and the other jobs in the queue.
Ex: if there are 2400 cores available and 3 jobs in the queue:
- group 1: allocation = 100 → can use 1200 cores
- group 2: allocation = 50 → can use 600 cores
- group 3: allocation = 50 → can use 600 cores
Job Submission - MpII
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1
#PBS -l mem=14gb
#PBS -q qwork@mp2
cd $SCRATCH/my_directory
module load module_needed
./execution

A job obtains a complete node; the number of requested cores (ppn) must be left at 1.

qsub script : submit the job script
qstat -u user_name : see the status of my jobs
Best Practices
Grouping Tasks
It is sometimes inefficient to launch many jobs one by one:
For i from 1 to 100:
qsub -l nodes=1:ppn=1 -l walltime=...
This approach is potentially inefficient and should be avoided:
•  Certain systems limit the number of jobs
•  Certain systems allocate whole nodes to jobs
Several serial tasks can instead be grouped in a single job:
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1
#PBS -l mem=14gb
#PBS -q qwork@mp2
module load module_needed
cd $SCRATCH/my_directory1
./execution &
cd $SCRATCH/my_directory2
./execution &
wait
Grouping Tasks
Methods and tools are available to automatically adjust job parameters when running programs repeatedly, and they can help optimize the submission of related jobs.
On Guillimin, Colosse and Briarée:
The submission or scheduler systems support job arrays, which simplify the submission of identical workloads that operate on different sets of parameters or data (see the sketch below).
On MpII:
The grouping of jobs can be automated through the use of bqtools.
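As a minimal sketch for a Torque-based system such as Briarée (directory names are illustrative), a job array replaces 100 individual submissions:

#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l nodes=1:ppn=1
#PBS -t 1-100
# $PBS_ARRAYID takes a different value, 1 to 100, in each array element
cd $SCRATCH/my_directory$PBS_ARRAYID
./execution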
Job Duration
Estimate your job's execution time. A job that requests less time waits less in the queue!
The shorter queues generally allow the user to run more jobs. In addition, there is less risk of suffering the effects of a job failure.
If you do not know how to estimate it, the analysts can help you.
Storage
For handling large files, use the scratch space on the systems.
It is generally preferable to use large-block reads and writes.
Sometimes it is useful to use the disks local to the compute nodes where the jobs are running, as in the sketch below.
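As a sketch (the local disk path is system-dependent; /tmp is used here for illustration):

cp $SCRATCH/myrun/input.dat /tmp/     # stage the input on the node-local disk
cd /tmp
$SCRATCH/myrun/execution              # small, repeated I/O operations stay local
cp /tmp/output.dat $SCRATCH/myrun/    # copy the results back before the job ends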
Contact the analysts to obtain advice!
Contacting the Analysts
•  [email protected] (cottos, briaree, altix, hades)
•  [email protected] (mpII, msII)
•  [email protected] (colosse, guillimin)
Useful Documentation and Support Links
•  cottos, briaree, altix, hades, mpII, msII:
   https://rqchp.ca and select ‘Documentation’
•  colosse, guillimin:
   https://www.clumeq.ca and select ‘Support’