
MPI Jobs in HTCondor
Greg Thain
INFN Workshop 2016
Overview
Some remarks about MPI
Running MPI in vanilla slots
Running MPI jobs in the parallel universe
Setting up the parallel universe
First rule of MPI jobs
› First rule of running MPI jobs:
DON’T!
Problems with MPI (in general)
› Difficult to schedule: the Tetris problem
› Fragile: one bad node ruins your day
› Difficult to know how big to make them
If you must…
Some best practices:
Keep MPI jobs as small as possible
Keep MPI jobs as uniform as possible
Try to self-checkpoint MPI jobs
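For example, a wrapper script can restart from the newest checkpoint if one exists. This is only a minimal sketch: the flag --restart-from and the file name checkpoint.dat are hypothetical and depend entirely on your application.
#!/bin/sh
# resume from a checkpoint if one was left behind, otherwise start fresh
if [ -f checkpoint.dat ]; then
    mpiexec -n 8 ./myapp --restart-from checkpoint.dat
else
    mpiexec -n 8 ./myapp
fi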
The best way to run MPI jobs in HTCondor:
In the vanilla universe:
Best way to run MPI jobs in HTCondor
universe = vanilla
executable = my_wrapper_script.sh
request_cpus = 8
log = log
output = output
error = error
…
queue
How to get an 8 cpu slot?
› Two ways: static slots, or partitionable slots
Static slots with 8 cpus
# condor_config
NUM_SLOTS_TYPE_1 = 3
SLOT_TYPE_1 = cpus=8
Best way to run MPI jobs in HTCondor
$ condor_status
Name          OpSys   Arch     State      Activity  LoadAv  Mem     ActvtyTime
slot1@chevre  LINUX   X86_64   Unclaimed  Idle      0.350   273066  0+
slot2@chevre  LINUX   X86_64   Unclaimed  Idle      0.000   273066  0+
slot3@chevre  LINUX   X86_64   Unclaimed  Idle      0.000   273066  0+
Best way to run MPI jobs in HTCondor
$ condor_status -af cpus
8
8
8
OK, now what?
universe = vanilla
executable = my_wrapper_script.sh
request_cpus = 8
log = log
output = output
error = error
…
queue
my_wrapper_script.sh
#!/bin/sh
# do something that uses all 8 cores, e.g.:
make -j 8
./my-openmp-exe …
mpiexec -n 8 myapp
Gotchas with this approach
› Startd must know the cpu sizes a priori
› Job must fit on one machine
› … but this does have the best performance
Don’t forget to xfer helpers
universe = vanilla
executable = my_wrapper_script.sh
transfer_input_files = mpiexec, real-exe, etc.
request_cpus = 8
log = log
output = output
error = error
…
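The transferred helpers land in the job's scratch directory, so the wrapper should call them by ./ path. A minimal sketch, reusing the placeholder names mpiexec and real-exe from the submit file above:
#!/bin/sh
# helpers were transferred into the scratch directory next to this script
chmod +x ./mpiexec ./real-exe   # make sure the execute bits survived transfer
./mpiexec -n 8 ./real-exe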
Partitionable slots: The big idea
› One “partitionable” slot
› From which “dynamic” slots are made
› When dynamic slots exit, they are merged back into the “partitionable” slot
› The split happens at claim time
(cont)
› Partitionable slots split on
Cpu
Disk
Memory
(Maybe more later)
› When you are out of one, you’re out of slots
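The split is driven by what the job requests. A minimal submit-file sketch (the memory and disk numbers are only illustrative; memory is in MB, disk in KB):
request_cpus   = 8
request_memory = 2048
request_disk   = 1048576
# a dynamic slot with 8 cpus, 2 GB memory and 1 GB disk is carved off at claim time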
3 types of slots
› Static (e.g. the usual kind)
› Partitionable (e.g. leftovers)
› Dynamic (the usable ones)
Dynamically created
But once created, static
How to configure
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true
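One way to sanity-check the knobs on the execute node after a condor_reconfig; the output shown is what we would expect, not captured from a real pool:
$ condor_config_val SLOT_TYPE_1 SLOT_TYPE_1_PARTITIONABLE
cpus=100%
true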
Looks like
$ condor_status
Name     OpSys   Arch     State      Activity  LoadAv  Mem    ActvtyTime
slot1@c  LINUX   X86_64   Unclaimed  Idle      0.110   8192

               Total  Owner  Claimed  Unclaimed  Matched
X86_64/LINUX       1      0        0          1        0
       Total       1      0        0          1        0
When running
$ condor_status
Name        OpSys   Arch     State      Activity  LoadAv  Mem    ActvtyTime
slot1@c     LINUX   X86_64   Unclaimed  Idle      0.110   4096
slot1_1@c   LINUX   X86_64   Claimed    Busy      0.000   1024
slot1_2@c   LINUX   X86_64   Claimed    Busy      0.000   2048
slot1_3@c   LINUX   X86_64   Claimed    Busy      0.000   1024
No changes to submit file
universe = vanilla
executable = my_wrapper_script.sh
transfer_input_files = mpiexec, real-exe, etc.
request_cpus = 8
log = log
output = output
error = error
…
If Job > one machine
› Parallel Universe for jobs > one machine
› Big hammer
Basic idea
1) Job requests more than one slot
2) Schedd gathers up matching slots
3) Gives all slots to one shadow
4) Runs one job on all slots
Implications
› May take > 1 negotiation cycle to get slots
› Possible deadlock if > 1 schedd
› Interactions with serial jobs?
Results
› Parallel slots must prefer parallel jobs over serial
Via machine RANK
› Parallel slots must be marked dedicated to one specific schedd
› Slots are Claimed/Idle while being gathered
With a timeout
Verification
condor_status -af Name DedicatedScheduler
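On a correctly configured pool, expect something like the following (the host names here are made up):
$ condor_status -af Name DedicatedScheduler
slot1@node01.example.org DedicatedScheduler@submit.example.org
slot2@node01.example.org DedicatedScheduler@submit.example.org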
Parallel Universe configuration
DedicatedScheduler = \
   "DedicatedScheduler@full.host.name"
STARTD_ATTRS = DedicatedScheduler \
$(STARTD_ATTRS)
RANK = Scheduler =?= $(DedicatedScheduler)
submit file
universe = parallel
executable = my_wrapper_script.sh
transfer_input_files = mpiexec, real-exe, etc.
machine_count = 8
log = log
output = output.$(NODE)
error = error.$(NODE)
…
What does this do?
1) Get 8 slots that match the job's requirements
2) Launch the executable on each node
3) Wait for node 0 to exit
(a shutdown policy of WAIT_FOR_ALL changes this; see the sketch below)
4) Shut everything down
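To wait for every node instead of just node 0, the submit file can set the policy as a custom attribute; a one-line sketch:
# keep the job up until all nodes have exited
+ParallelShutdownPolicy = "WAIT_FOR_ALL"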
How do I launch MPI
› Each node launches a password-less sshd
› And uses condor_chirp to share info
› Then mpiexec uses ssh to launch on every node
How do I launch MPI
› This is on you
Many implementations of MPI
› We provide some example scripts
etc/examples/mp1script
etc/examples/openmpiscript
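A common pattern (a sketch, not a tested submit file) is to make the example script the executable and hand it the real MPI binary as an argument; real-exe is the placeholder from earlier, and arg1 arg2 stand in for whatever arguments your program takes:
universe              = parallel
executable            = openmpiscript
arguments             = real-exe arg1 arg2
transfer_input_files  = real-exe
machine_count         = 8
should_transfer_files = yes
when_to_transfer_output = on_exit
queue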
Parallel job still a job
› All condor tools work with it
condor_rm
condor_hold
condor_release
ccb
condor_ssh_to_job
Etc.
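For example, against a parallel job whose (made-up) cluster id is 123:
$ condor_ssh_to_job 123.0     # interactive shell next to the running job
$ condor_hold 123.0
$ condor_release 123.0
$ condor_rm 123.0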
Surprises with Parallel Uni
› Scheduling is mostly FIFO
Can override with first-fit
No fair share
› Accounting is a bit funny
Summary
MPI jobs are un-HTC
Two ways to run them in HTCondor: vanilla universe or parallel universe
Prefer to keep each job on one machine