MPI Jobs in HTCondor
Greg Thain
INFN Workshop 2016

Overview
› Some remarks about MPI
› Running MPI in vanilla slots
› Running MPI jobs in the parallel universe
› Setting up the parallel universe

First rule of MPI jobs
› The first rule of running MPI jobs: DON'T!

Problems with MPI (in general)
› Difficult to schedule: the tetris problem
› Fragile: one bad node ruins your day
› Difficult to know how big to make them

If you must…
Some best practices:
› Keep MPI jobs as small as possible
› Keep MPI jobs as uniform as possible
› Try to self-checkpoint MPI jobs

The best way to run MPI jobs in HTCondor
› In the vanilla universe:

    universe     = vanilla
    executable   = my_wrapper_script.sh
    request_cpus = 8
    log          = log
    output       = output
    error        = error
    …
    queue

How to get an 8 cpu slot?
› Two ways

Static slots with 8 cpus

    # condor_config
    NUM_SLOTS_TYPE_1 = 3
    SLOT_TYPE_1 = cpus=8

Which looks like this:

    $ condor_status
    Name          OpSys  Arch    State      Activity  LoadAv  Mem     ActvtyTime
    slot1@chevre  LINUX  X86_64  Unclaimed  Idle      0.350   273066  0+
    slot2@chevre  LINUX  X86_64  Unclaimed  Idle      0.000   273066  0+
    slot3@chevre  LINUX  X86_64  Unclaimed  Idle      0.000   273066  0+

    $ condor_status -af cpus
    8
    8
    8

OK, now what?

    universe     = vanilla
    executable   = my_wrapper_script.sh
    request_cpus = 8
    log          = log
    output       = output
    error        = error
    …
    queue

my_wrapper_script.sh

    #!/bin/sh
    # do something that uses 8 cores:
    make -j 8
    openmp-exec
    …
    mpiexec -n 8 myapp

(A slightly more defensive variant of this wrapper is sketched below, after the Implications slide.)

Gotchas with this approach
› Startd must know cpu sizes a priori
› Job must fit on one machine
› …but this does have the best performance

Don't forget to xfer helpers

    universe             = vanilla
    executable           = my_wrapper_script.sh
    transfer_input_files = mpiexec, real-exe, etc.
    request_cpus         = 8
    log                  = log
    output               = output
    error                = error
    …

Partitionable slots: The big idea
› One "partitionable" slot
› From which "dynamic" slots are made
› When a dynamic slot exits, it is merged back into the partitionable slot
› The split happens at claim time

Partitionable slots (cont)
› Partitionable slots split on:
      Cpus
      Disk
      Memory
      (maybe more later)
› When you are out of one, you're out of slots

3 types of slots
› Static (the usual kind)
› Partitionable (the leftovers)
› Dynamic (the usable ones)
      Dynamically created
      But once created, static

How to configure

    NUM_SLOTS = 1
    NUM_SLOTS_TYPE_1 = 1
    SLOT_TYPE_1 = cpus=100%
    SLOT_TYPE_1_PARTITIONABLE = true

Looks like

    $ condor_status
    Name     OpSys  Arch    State      Activity  LoadAv  Mem
    slot1@c  LINUX  X86_64  Unclaimed  Idle      0.110   8192

                  Total  Owner  Claimed  Unclaimed  Matched
    X86_64/LINUX      1      0        0          1        0
           Total      1      0        0          1        0

When running

    $ condor_status
    Name       OpSys  Arch    State      Activity  LoadAv  Mem
    slot1@c    LINUX  X86_64  Unclaimed  Idle      0.110   4096
    slot1_1@c  LINUX  X86_64  Claimed    Busy      0.000   1024
    slot1_2@c  LINUX  X86_64  Claimed    Busy      0.000   2048
    slot1_3@c  LINUX  X86_64  Claimed    Busy      0.000   1024

No changes to submit file

    universe             = vanilla
    executable           = my_wrapper_script.sh
    transfer_input_files = mpiexec, real-exe, etc.
    request_cpus         = 8
    log                  = log
    output               = output
    error                = error
    …

If Job > one machine
› The parallel universe is for jobs bigger than one machine
› Big hammer

Basic idea
1) Job requests more than one slot
2) Schedd gathers up matching slots
3) Gives all slots to one shadow
4) Runs one job on all slots

Implications
› May take more than one negotiation cycle to get all the slots
› Possible deadlock if more than one schedd is involved
› Interactions with serial jobs?
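As promised on the my_wrapper_script.sh slide, here is a minimal sketch of a slightly more defensive single-machine wrapper. It assumes mpiexec and the real binary (myapp is a placeholder name) were shipped via transfer_input_files, and instead of hard-coding "8" it reads the provisioned cpu count from the machine ad that HTCondor drops into the job's scratch directory.

    #!/bin/sh
    # Sketch only: single-machine MPI wrapper for the vanilla universe.
    # Assumes ./mpiexec and ./myapp arrived via transfer_input_files.

    # HTCondor points $_CONDOR_MACHINE_AD at the machine ClassAd file in the
    # scratch directory; pull the Cpus attribute out of it.
    NCPUS=$(awk '/^Cpus = / {print $3; exit}' "$_CONDOR_MACHINE_AD")

    # Make sure the transferred helpers are executable.
    chmod +x ./mpiexec ./myapp

    # Fall back to 8 (the request_cpus value above) if the lookup failed.
    exec ./mpiexec -n "${NCPUS:-8}" ./myapp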
Results
› Parallel slots must prefer parallel jobs over serial jobs
      (via machine RANK)
› Parallel slots must be marked as dedicated to one specific schedd
› Slots are claimed/idle while being gathered
      (with a timeout)

Verification

    condor_status -af Name DedicatedScheduler

Parallel Universe configuration

    DedicatedScheduler = \
        "DedicatedScheduler@full.hostname.of.schedd"
    STARTD_ATTRS = DedicatedScheduler \
        $(STARTD_ATTRS)
    RANK = Scheduler =?= $(DedicatedScheduler)

Submit file

    universe             = parallel
    executable           = my_wrapper_script.sh
    transfer_input_files = mpiexec, real-exe, etc.
    machine_count        = 8
    log                  = log
    output               = output.$(NODE)
    error                = error.$(NODE)
    …

What does this do?
1) Gets 8 slots that match the requirements
2) Launches the executable on each node
3) Waits for node 0 to exit
      (unless +ParallelShutdownPolicy = "WAIT_FOR_ALL" is set)
4) Shuts everything down

How do I launch MPI?
› Each node launches a password-less sshd
› And uses condor_chirp to share contact info
› Then mpiexec uses ssh to reach every node

How do I launch MPI?
› This is on you
      (there are many implementations of MPI)
› We provide some example scripts:
      etc/examples/mp1script
      etc/examples/openmpiscript

A parallel job is still a job
› All condor tools work with it:
      condor_rm
      condor_hold
      condor_release
      CCB
      condor_ssh_to_job
      etc.

Surprises with the parallel universe
› Scheduling is mostly FIFO
      Can override with first-fit
      No fair-share
› Accounting is a bit funny

Summary
› MPI jobs are un-HTC
› There are two ways to run them in condor
› Prefer to keep each job on one machine
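As a footnote to the "How do I launch MPI?" slides: the shipped etc/examples/mp1script and openmpiscript are the real starting points, but their overall shape is roughly the sketch below. It relies on the _CONDOR_PROCNO and _CONDOR_NPROCS environment variables that the parallel universe sets on each node; the hosts file and myapp are placeholders, an OpenMPI-style mpiexec is assumed for the --hostfile flag, and all of the sshd/condor_chirp plumbing that the real scripts do is only hinted at in comments.

    #!/bin/sh
    # Rough sketch of the shape of a parallel-universe MPI launcher.
    # The parallel universe sets _CONDOR_PROCNO (this node's rank) and
    # _CONDOR_NPROCS (total number of nodes) in each node's environment.

    if [ "${_CONDOR_PROCNO:-0}" != "0" ]; then
        # Non-zero nodes: in the real scripts this is where a password-less
        # sshd is started and its contact info is published via condor_chirp.
        # Then the node just keeps its slot alive until node 0 finishes.
        sleep 86400
    else
        # Node 0: once the other nodes' contact info has been collected
        # (again via condor_chirp in the real scripts) into a hosts file,
        # launch the actual MPI job across all of the slots.
        exec mpiexec -n "${_CONDOR_NPROCS:-1}" --hostfile hosts ./myapp
    fi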