
Juropa3 ZEA-1 Partition
Batch System – Maui/Torque
User's Manual
2 Jul 2013 @ JSC
Chrysovalantis Paschoulas | [email protected]
1. System Information
Juropa3 is a new small cluster at JSC. It is divided into two partitions: the experimental partition,
which will be used mainly for tests and experiments, and the ZEA-1 partition, which belongs to the ZEA-1 group.
Cluster Information
The Juropa3 ZEA-1 partition consists of one master/login node and 16 compute nodes. Two of the 16
compute nodes have more memory installed (fat nodes). Here is a table with the node specifications:
Nodes  Hostname                     CPU                 Phys. Cores  VCores  RAM     Description           Attributes*
1      juropa3z.zam.kfa-juelich.de  Intel Xeon E5-2650  16           32      128 GB  Master/Login node     -
       (local: j3l01)
2      j3c0[29-30]                  Intel Xeon E5-2650  16           32      256 GB  Fat compute nodes     bigmem
14     j3c0[39-52]                  Intel Xeon E5-2650  16           32      128 GB  Normal compute nodes  normal
* The compute nodes have been assigned attributes that can be used in the job scripts to distinguish the fat nodes from the
normal nodes.
Node juropa3z.zam.kfa-juelich.de is the login and master node of this partition. Users log in
there to compile code and submit jobs. However, important services also run on this node, such as the
batch system servers, the LDAP server and the NFS server.
On the Juropa3 ZEA-1 partition the batch system is a combination of Torque and Maui:
Torque is the resource manager and Maui is the scheduler.
Local Disks
All compute nodes of the cluster are diskless and load the whole OS image into main memory at boot
time. After a request from the ZEA-1 group, we have installed local disks on the compute nodes, offering a
local file system so that their software can take advantage of Linux dynamic caching. Here is a
table with information about the local file system:
Mount Point  File-system  Size
/data        ext4         ~ 1 TB
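For example, a job script can stage its input on the node-local disk before starting the computation. This is only a sketch; it assumes that users have write permission under /data, and the file name is purely illustrative:
# illustrative job-script fragment: stage input data on the node-local disk
cp $PBS_O_WORKDIR/input.dat /data/
# ... run the application reading its input from /data ...
rm /data/input.dat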
Access to the cluster
Users can connect to the login node with the ssh command:
> ssh <username>@juropa3z.zam.kfa-juelich.de
2. Commands
Here is a list of Maui and Torque commands. For more information please use the man pages (or
the “--help” option).
Maui Commands
Command    Description
canceljob  cancel an existing job
checkjob   display job state, resource requirements, environment, constraints, credentials, history, allocated resources and resource utilization
showbf     show resource availability for jobs with specific resource requirements
showq      display a detailed prioritized list of active and idle jobs
showstart  show the estimated start time of idle jobs
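For example, a typical check of the queue and of a specific job could look like this (the job ID is illustrative):
> showq
> checkjob 562
> showstart 562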
Torque Commands
Command   Description
pbsnodes  view/modify batch status of compute nodes
qalter    modify queued batch jobs
qdel      delete/cancel batch jobs
qhold     hold batch jobs
qrls      release batch job holds
qrun      start a batch job
qstat     view queues and jobs
qsub      submit jobs
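For example, a user could list the nodes and the jobs and then cancel one of their own jobs (the job ID is illustrative):
> pbsnodes -a
> qstat -a
> qdel 562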
3. Compilers
On the Juropa3 ZEA-1 partition we offer wrapper commands to the users in order to compile and execute
parallel MPI jobs (as on Juropa2). The current wrappers are:
mpicc, mpicxx, mpif77, mpif90
Users can choose the compiler version using the module command.
Some useful compiler options:
-openmp   enables OpenMP
-g        creates debugging information
-L        path to libraries for the linker
-O[0-3]   optimization levels
Compile examples:
a) MPI program in C++:
mpicxx -O2 program.cpp -o mpi_program
b) Hybrid MPI/OpenMP program in C:
mpicc -openmp -o exe_program code_program.c
To execute a parallel application you can use the mpiexec command.
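For example, the MPI program compiled above could be started on 4 processes as follows (an illustrative invocation; the available mpiexec options depend on the installed MPI):
> mpiexec -np=4 ./mpi_program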
4. Modules
All available software on the cluster (compilers, tools, libraries, etc.) is provided in the form of
modules. In order to use the desired software, users have to use the module command. With this
command the user can load or unload a software package or a specific version of it. By
default some modules are preloaded for all users. Here is a list of useful options:
Command                      Description
module list                  print a list of all currently loaded modules
module avail                 display all available modules
module load <module name>    load a module
module unload <module name>  unload a module
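For example, to see which compiler modules exist and then load a specific version (the module name and version below are only illustrative; use the names reported by module avail):
> module avail
> module load intel/13.1
> module list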
5. Job Scripts
Users can submit jobs using the qsub command. In the job scripts, the qsub parameters are defined
with #PBS directives. (The options are the same as on Juropa2, but since we use Maui instead of Moab,
the job scripts contain #PBS directives instead of #MSUB.)
In the job script you can define the number of nodes and the number of processors that will be used to run
a parallel program. To distinguish the fat nodes from the normal nodes we have defined two attributes
for the resource manager: bigmem for the fat nodes and normal for the other nodes. For example, if
you want to use 1 fat node with one task and 4 normal nodes with 32 tasks per node, you have to give:
#PBS -l nodes=1:bigmem:ppn=1+4:normal:ppn=32
With these options the master node (the node that will run the MPI task with rank 0) will always be
the fat node. NOTE: you have to put the fat node first in the list.
To define the walltime of the job (30 minutes in this example) you have to give this option:
#PBS -l walltime=00:30:00
If you don't define any walltime, the default value is INFINITY, which means that the batch system
will let the job run forever. (Also, if you request a walltime longer than 100 days, the walltime will be set to
INFINITY.)
Here is a list with useful options of qsub:
Option                      Description
-l nodes=<num>[:attribute]  number of nodes [compute node attribute]
-l ppn=<num>                processes per node
-l walltime=<hh:mm:ss>      requested wall-clock time (default: INFINITY)
-j oe                       combine stderr and stdout
-M <email address>          send email to this address
-m eab                      send email on end, abort or begin
-N <name>                   name of the job
-v tpt=<num threads>        number of OpenMP threads
-I                          start an interactive job
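As an illustration, a job-script header combining several of these options could look like this (the job name, resource request and email address are only examples):
#PBS -N ExampleJob
#PBS -l nodes=2:normal:ppn=32
#PBS -l walltime=00:30:00
#PBS -j oe
#PBS -M [email protected]
#PBS -m eab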
NOTE: The batch system is configured with only one default queue, named “batch”. Users do not
have to specify a queue when submitting jobs, because all jobs are submitted to
the default queue.
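Submitting a job script is then a single command; qsub prints the ID of the new job (the script name and job ID below are illustrative):
> qsub jobscript.sh
562.j3l01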
Here are some examples of job scripts:
A) Normal MPI job without using the resource manager's node attributes.
#!/bin/bash
#PBS -N TestJob1
#PBS -l nodes=8:ppn=32
#PBS -l walltime=01:00:00
#
cd $PBS_O_WORKDIR
mpiexec -np=256 <exe program>
Here we have an MPI program using 8 compute nodes and 32 processors per node, with one thread per
processor. The compute nodes provide 16 hardware cores and 32 virtual cores with SMT. On each
VCore one MPI task with one execution thread will be running. There is no restriction on which
compute nodes will be used, so it is possible for this job to get a random mix of fat and normal nodes.
B) MPI job using the resource manager's node attributes.
#!/bin/bash
#PBS -N TestJob2
#PBS -l nodes=1:bigmem:ppn=1+4:normal:ppn=16
#PBS -l walltime=01:00:00
#
...
mpiexec -np=65 <exe program>
Here we have a parallel MPI program that will use one fat node, with one MPI task on one HW core
and one execution thread, and 4 normal compute nodes with 16 MPI tasks per node. The total number
of MPI tasks is 65.
C) Hybrid program using MPI and OpenMP
#!/bin/bash
#PBS -N TestJobHybrid
#PBS -l nodes=6:normal:ppn=32
#PBS -v tpt=8
...
#
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
mpiexec -np=24 --exports=OMP_NUM_THREADS <exe program>
Here we have a parallel MPI program that also uses OpenMP. The job will run on 6 normal compute
nodes using all VCores per node. On each node we will have 4 MPI tasks with 8 OpenMP threads per
task. The total number of MPI tasks is 24. We didn't define any walltime limit, so the job will run forever.
Working Directory
The default initial working directory for a job is configured to be the home directory of the user. So,
when a job starts, the initial working directory of the job script will always be the user's home directory.
There are two ways to change this behavior:
1. Use the “-d” option of qsub. Here is an example of this option in a job script:
#PBS -d /home/group_dir/user_dir/current_dir
2. In the job script, call “cd $PBS_O_WORKDIR”. The environment variable $PBS_O_WORKDIR is
always set to the directory from which qsub was called. Here is an example:
cd $PBS_O_WORKDIR
6. Interactive Jobs
In order to start an interactive job the user has to use the “-I” option of qsub. The same qsub
options as in the batch scripts can be used. Here is an example of starting an interactive job:
[userx@j3l01 jobs]$ qsub -I -l nodes=1:bigmem:ppn=1+2:normal:ppn=8,walltime=00:05:00
qsub: waiting for job 562.j3l01 to start
qsub: job 562.j3l01 ready
[userx@j3c030 ~]$
...
In this example we start an interactive job running on one fat node using one core (one MPI task) and
two normal nodes with 8 cores (8 MPI tasks) per node. The requested walltime of this job is five
minutes. As we can see above, the qsub command returns the job ID and then gives the user a
command prompt on the first compute node in the list (in our case the fat node). Afterwards the
user is free to run his applications (e.g. with mpiexec).
TIP: While the interactive job is running, the user can check the job and see its details with the command
qstat -f <job ID>
Here is an example:
[userx@j3c030 ~]$ qstat -f 562.j3l01