
Introduction to SLURM
scitas.epfl.ch
October 9, 2014
Bellatrix
Frontend at bellatrix.epfl.ch
16 x 2.2 GHz cores per node
424 nodes with 32GB
Infiniband QDR network
The batch system is SLURM
1 / 36
Castor
Frontend at castor.epfl.ch
16 x 2.6 GHz cores per node
50 nodes with 64GB
2 nodes with 256GB
For sequential jobs (Matlab etc.)
The batch system is SLURM
RedHat 6.5
2 / 36
Deneb (October 2014)
Frontend at deneb.epfl.ch
16 x 2.6 GHz cores per node
376 nodes with 64GB
8 nodes with 256GB
2 nodes with 512GB and 32 cores
16 nodes with 4 Nvidia K40 GPUs
Infiniband QDR network
3 / 36
Storage
/home
filesystem has per user quotas
will be backed up - keep important things there
(source code, results and theses)
/scratch
high performance "temporary" space
is not backed up
is organised by laboratory
4 / 36
Connection
Start the X server (automatic on a Mac)
Open a terminal
ssh -Y <username>@castor.epfl.ch
Try the following commands:
id
pwd
quota
ls /scratch/<group>/<username>
5 / 36
The batch system
Goal: to take a list of jobs and execute them when
appropriate resources become available
SCITAS uses SLURM on its clusters:
http://slurm.schedmd.com
The configuration depends on the purpose of the cluster
(serial vs parallel)
6 / 36
sbatch
The fundamental command is sbatch
sbatch submits jobs to the batch system
Suggested workflow:
create a short job-script
submit it to the batch system
7 / 36
sbatch - exercise
Copy the first two examples to your home directory
cp /scratch/examples/ex1.run .
cp /scratch/examples/ex2.run .
Open the file ex1.run with your editor of choice
8 / 36
ex1.run
#!/bin/bash
#SBATCH --workdir /scratch/<group>/<username>
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 1024
sleep 10
echo "hello from $(hostname)"
sleep 10
9 / 36
ex1.run
#SBATCH is a directive to the batch system
--nodes 1
the number of nodes to use - on Castor this is limited to 1
--ntasks 1
the number of tasks (in an MPI sense) to run per job
--cpus-per-task 1
the number of cores per aforementioned task
--mem 4096
the memory required per node in MB
--time 12:00:00   # 12 hours
--time 2-6        # two days and six hours
the time required
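Put together, a job header using these directives might look like the sketch below (the values are purely illustrative, not recommendations):
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 4096
#SBATCH --time 12:00:00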
10 / 36
Running ex1.run
The job is assigned a default runtime of 15 minutes
$ sbatch ex1.run
Submitted batch job 439
$ cat /scratch/<group>/<username>/slurm-439.out
hello from c03
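The 15-minute default can be overridden at submission time; any #SBATCH option may also be given on the sbatch command line, for example:
sbatch --time 00:30:00 ex1.run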
11 / 36
What went on?
sacct -j <JOB_ID>
sacct -l -j <JOB_ID>
Or more usefully:
Sjob <JOB_ID>
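If the output of "sacct -l" is too wide, selected fields can be requested instead; the names below are standard sacct columns and this is just one possible selection:
sacct -j <JOB_ID> --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode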
12 / 36
Cancelling jobs
To cancel a specific job:
scancel <JOB_ID>
To cancel all your jobs:
scancel -u <username>
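scancel also accepts filters, so one can, for example, cancel only jobs that are still waiting (the --state option is standard scancel, shown here as an illustration):
scancel -u <username> --state=PENDING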
13 / 36
ex2.run
#!/bin/bash
#SBATCH --workdir /scratch/<group>/<username>
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --mem 122880
#SBATCH --time 00:30:00
/scratch/examples/linpack/runme_1_45k
14 / 36
What’s going on?
squeue
squeue -u <username>
Squeue
Sjob <JOB_ID>
scontrol -d show job <JOB_ID>
sinfo
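For example, to list only your pending jobs and to check the nodes of one partition (the partition name "serial" is taken from the scontrol output later in these slides), one might run:
squeue -u <username> -t PENDING
sinfo -p serial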
15 / 36
squeue and Squeue
squeue
Job states and reasons: Pending (Resources, Priority), Running
squeue | grep <JOB_ID>
squeue -j <JOB_ID>
Squeue <JOB_ID>
16 / 36
Sjob
$ Sjob <JOB_ID>
       JobID    JobName    Cluster    Account  Partition  Timelimit      User      Group
------------ ---------- ---------- ---------- ---------- ---------- --------- ----------
31006           ex1.run     castor  scitas-ge     serial   00:15:00     jmenu  scitas-ge
31006.batch       batch     castor  scitas-ge

             Submit            Eligible               Start                 End
------------------- ------------------- ------------------- -------------------
2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:56:08
2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:55:48 2014-05-12T15:56:08

   Elapsed ExitCode      State
---------- -------- ----------
  00:00:20      0:0  COMPLETED
  00:00:20      0:0  COMPLETED

     NCPUS   NTasks        NodeList    UserCPU  SystemCPU     AveCPU  MaxVMSize
---------- -------- --------------- ---------- ---------- ---------- ----------
         1                      c04   00:00:00  00:00.001
         1        1             c04   00:00:00  00:00.001   00:00:00    207016K
17 / 36
scontrol
$ scontrol -d show job <JOB_ID>
$ scontrol -d show job 400
JobId=400 Name=s1.job
UserId=user(123456) GroupId=group(654321)
Priority=111 Account=scitas-ge QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:03:39 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2014-03-06T09:45:27 EligibleTime=2014-03-06T09:45:27
StartTime=2014-03-06T09:45:27 EndTime=2014-03-06T10:00:27
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=serial AllocNode:Sid=castor:106310
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c03
BatchHost=c03
NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=c03 CPU_IDs=0 Mem=1024
MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/<user>/jobs/s1.job
WorkDir=/scratch/<group>/<user>
18 / 36
Modules
Modules make your life easier
module avail
module show <take your pick>
module load <take your pick>
module list
module purge
module list
19 / 36
ex3.run - Mathematica
Copy the following files to your chosen directory:
cp /scratch/examples/ex3.run .
cp /scratch/examples/mathematica.in .
Submit 'ex3.run' to the batch system and see what happens...
20 / 36
ex3.run
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --nodes 1
#SBATCH --mem 4096
#SBATCH --time 00:05:00
echo STARTING AT `date`
module purge
module load mathematica/9.0.1
math < mathematica.in
echo FINISHED at `date`
21 / 36
Compiling ex4.* source files
Copy the following files to your chosen directory:
/scratch/examples/ex4_README.txt
/scratch/examples/ex4.c
/scratch/examples/ex4.cxx
/scratch/examples/ex4.f90
/scratch/examples/ex4.run
Then compile them with:
module load intelmpi/4.1.3
mpiicc -o ex4_c ex4.c
mpiicpc -o ex4_cxx ex4.cxx
mpiifort -o ex4_f90 ex4.f90
22 / 36
The 3 methods to get interactive access 1/3
To schedule an allocation, use salloc with exactly the same
resource options as sbatch
You then arrive at a new prompt, still on the submission node;
with srun you can run commands on the allocated resources
eroche@castor:hello > salloc -N 1 -n 2
salloc: Granted job allocation 1234
bash-4.1$ hostname
castor
bash-4.1$ srun hostname
c03
c03
23 / 36
The 3 methods to get interactive access 2/3
To get a prompt on the machine one needs to use the “--pty”
option with “srun” and then “bash -i” (or “tcsh -i”) to get
the shell:
eroche@castor > salloc -N 1 -n 1
salloc: Granted job allocation 1235
eroche@castor > srun --pty bash -i
bash-4.1$ hostname
c03
24 / 36
The 3 methods to get interactive access 3/3
This is the least elegant but it is the method by which one can run
X11 applications:
eroche@bellatrix > salloc -n 1 -c 16 -N 1
salloc: Granted job allocation 1236
bash-4.1$ srun hostname
c04
bash-4.1$ ssh -Y c04
eroche@c04 >
25 / 36
Dynamic libs used in an application
“ldd” displays the libraries an executable file depends on:
jmenu@castor:~/COURS > ldd ex4_f90
linux-vdso.so.1 => (0x00007fff4b905000)
libmpigf.so.4 =>
/opt/software/intel/14.0.1/intel64/lib/libmpigf.so.4
(0x00007f556cf88000)
libmpi.so.4 => /opt/software/intel/14.0.1/intel64/lib/libmpi.so.4
(0x00007f556c91c000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003807e00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003808a00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003808200000)
libm.so.6 => /lib64/libm.so.6 (0x0000003807600000)
libc.so.6 => /lib64/libc.so.6 (0x0000003807a00000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000380c600000)
/lib64/ld-linux-x86-64.so.2 (0x0000003807200000)
26 / 36
ex4.run
#!/bin/bash
...
module purge
module load intelmpi/4.1.3
module list
echo
LAUNCH_DIR=/scratch/scitas-ge/jmenu
EXECUTABLE="./ex4_f90"
echo "--> LAUNCH_DIR = ${LAUNCH_DIR}"
echo "--> EXECUTABLE = ${EXECUTABLE}"
echo
echo "--> ${EXECUTABLE} depends on the following dynamic libraries:"
ldd ${EXECUTABLE}
echo
cd ${LAUNCH_DIR}
srun ${EXECUTABLE}
...
27 / 36
The debug QoS
In order to have priority access for debugging
sbatch --qos debug ex1.run
Limits on Castor:
30 minutes walltime
1 job per user
16 cores between all users
To display the available QoS’s:
sacctmgr show qos
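The QoS can also be requested from inside the job script rather than on the command line; for example, adding the directive below to ex1.run (keeping the walltime under the 30-minute limit) submits it to the debug QoS:
#SBATCH --qos debug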
28 / 36
cgroups (Castor)
General:
cgroups (“control groups”) is a Linux kernel feature to limit,
account, and isolate resource usage (CPU, memory, disk I/O,
etc.) of process groups
SLURM:
Linux cgroups apply constraints to the CPUs and memory that
can be used by a job
They are automatically generated using the resource requests
given to SLURM
They are automatically destroyed at the end of the job, thus
releasing all resources used
Even if there is physical memory available a task will be killed if
it tries to exceed the limits of the cgroup!
29 / 36
System process view
Two tasks running on the same node with “ps auxf”
root  177873  slurmstepd: [1072]
user  177877   \_ /bin/bash /var/spool/slurmd/job01072/slurm_script
user  177908       \_ sleep 10
root  177890  slurmstepd: [1073]
user  177894   \_ /bin/bash /var/spool/slurmd/job01073/slurm_script
user  177970       \_ sleep 10
Check memory, thread and core usage with “htop”
30 / 36
Fair share 1/3
The scheduler is configured to give all groups a share of the
computing power
Within each group the members have an equal share by default:
jmenu@castor:~ > sacctmgr show association where account=lacal \
    format=Account,Cluster,User,GrpNodes,QOS,DefaultQOS,Share tree

   Account    Cluster       User GrpNodes           QOS   Def QOS  Share
---------- ---------- ---------- -------- ------------- --------- ------
     lacal     castor                            normal                1
     lacal     castor   aabecker          debug,normal     normal      1
     lacal     castor   kleinjun          debug,normal     normal      1
     lacal     castor   knikitin          debug,normal     normal      1
Priority is based on recent usage
this is forgotten about with time (half life)
fair share comes into play when the resources are heavily used
31 / 36
Fair share 2/3
Job priority is a weighted sum of various factors:
jmenu@castor:~ > sprio -w
  JOBID   PRIORITY        AGE  FAIRSHARE        QOS
  Weights                1000      10000     100000

jmenu@bellatrix:~ > sprio -w
  JOBID   PRIORITY        AGE  FAIRSHARE    JOBSIZE        QOS
  Weights                1000      10000        100     100000

To compare jobs' priorities:
jmenu@castor:~ > sprio -j80833,77613
  JOBID   PRIORITY        AGE  FAIRSHARE        QOS
  77613        145        146          0          0
  80833       9204         93       9111          0
32 / 36
Fair share 3/3
FairShare values range from 0.0 to 1.0:
Value    Meaning
≈ 0.0    you used many more resources than you were granted
0.5      you got what you paid for
≈ 1.0    you used nearly no resources
jmenu@bellatrix:~ > sshare -a -A lacal
Accounts requested:
        : lacal

             Account       User Raw Shares Norm Shares   Raw Usage Effectv Usage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
lacal                                  666    0.097869  1357691548      0.256328   0.162771
 lacal                 boissaye          1    0.016312           0      0.042721   0.162771
 lacal                   janson          1    0.016312           0      0.042721   0.162771
 lacal                  jetchev          1    0.016312           0      0.042721   0.162771
 lacal                 kleinjun          1    0.016312  1357691548      0.256328   0.000019
 lacal                 pbottine          1    0.016312           0      0.042721   0.162771
 lacal                  saltini          1    0.016312           0      0.042721   0.162771
More information at:
http://schedmd.com/slurmdocs/priority_multifactor.html
33 / 36
Helping yourself
man pages are your friend!
man sbatch
man sacct
man gcc
module load intel/14.0.1
man ifort
34 / 36
Getting help
If you still have problems then send a message to:
[email protected]
Please start the subject with HPC
for automatic routing to the HPC team
Please give as much information as possible including:
the jobid
the directory location and name of the submission script
where the “slurm-*.out” file is to be found
how the “sbatch” command was used to submit it
the output from “env” and “module list” commands
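One simple way to capture the last two items is to redirect them to files that you attach to the message (a sketch; depending on the modules version, "module list" may write to standard error, hence the redirection):
env > env.txt
module list > modules.txt 2>&1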
35 / 36
Appendix
Change your shell at:
https://dinfo.epfl.ch/cgi-bin/accountprefs
SCITAS web site:
http://scitas.epfl.ch
36 / 36