The Bologna Batch System: Flexible Policy with Condor

UK Condor Week
NeSC – Scotland – Oct 2004
Condor Team
Computer Sciences Department
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
The Bologna Batch System
› Custom batch scheduling system for local users at INFN in Bologna, Italy.
  - INFN: “Istituto Nazionale di Fisica Nucleare”
  - Dr. Paolo Mazzanti initiated the idea.
› Implemented on a small subset of machines within the larger nationwide INFN Condor pool:
  - INFN Condor Pool: ~300 CPUs
  - INFN-Bologna Condor: ~100 CPUs
  - Bologna Batch System: ~50 CPUs
Where We Started
› Basic Condor Policy
  - Opportunistic resources
    • Jobs only run when machines are otherwise idle
    • Jobs can be preempted for machine owners or higher-priority users
  - Fair-share across INFN pool
    • Highest priority user in the pool gets first crack at a given resource
    • The more you use, the worse your priority becomes
› Some problems:
  - Long-running vanilla jobs (with no checkpointing) were frequently preempted before running to completion
  - Users dislike waiting for a resource if they only want to run a short job
  - High-priority users from other INFN sites run on local resources while lower-priority local users wait
BBS Policy Requirements
› Prioritize local work
  - Share resources, but run outside jobs as backfill
› Treat local servers as “dedicated” resources for local jobs, but “opportunistic” resources for other jobs.
  - Run outside Condor jobs only if the server is idle.
  - Run local batch jobs regardless of other system load or console activity.
  - Preempt outside Condor jobs to allow local batch jobs to run, but don’t preempt local jobs for outside work.
BBS Policy Requirements
› Ensure resource availability for both short and long-running jobs
  - Prioritize short batch jobs so that they are never kept waiting by long batch jobs.
  - Prevent long batch jobs from being preempted or starved by short jobs.
› Never waste resources
  - No idle CPUs when jobs are waiting to run!
  - No preemption of vanilla jobs!
    • Preemption is ideal if you can checkpoint, but here we can’t…
A Contradiction!
› No way to guarantee resource availability for short or long jobs without “reserving” some CPUs for each…
› ...But no way to avoid idle CPUs without allowing them to start any kind of job:
  - If CPUs reserved for short jobs are used for long jobs, they become unavailable to run short jobs.
  - If CPUs reserved for short jobs are not used for long jobs, they’re being wasted when there are no short jobs to run.
› What to do, what to do…
A Solution!
› Allow resources to be temporarily overcommitted
  - We treat one CPU as two…
  - On a two-CPU machine, define four Condor VMs (virtual machines): two for short jobs and two for long jobs.
› Allow jobs to be suspended rather than preempted
  - Think of it as “checkpointing to swap”…
› OR allow jobs to be “de-prioritized” temporarily
  - If memory is adequate, allow “suspended” long jobs to continue running at a poor OS priority and steal cycles whenever “active” short jobs are busy doing I/O.
Everybody wins!
› Short jobs start right away on dedicated “short” VMs.
› Long jobs aren’t preempted by short jobs, but rather suspend temporarily or run at a lower priority.
› Outside jobs run only when no Bologna jobs are waiting.
› All CPUs are available to all types of jobs.
  - No idle CPUs when jobs are waiting.
Okay, how?
› Flipside of flexibility is complexity!
› It’s pretty cool that Condor allows you to combine dedicated and opportunistic scheduling in one system, but it takes a bit of work to get it all set up…
› Luckily for y’all, we’ve already done the hard part, and now you can copy it.
Copy it from where?
› Bologna Batch System document
  - http://www.cs.wisc.edu/~pfc/bbs.doc
› A detailed walk-through of the specific policies and the necessary Condor configuration to make each one work.
› Line by line examples of how we implemented each.
› What’s in it? Let’s take a look…
First Step: No hand waving!!
› Bologna Batch Jobs are specially-designated jobs which may run only on specially-designated Bologna Batch Servers.
› Only users in Bologna may submit Bologna Batch Jobs.
› Bologna Batch Jobs must be vanilla-universe jobs (and therefore are not capable of checkpointing and resuming), and thus once they start they must not be preempted for other jobs.
› Bologna Batch Servers prefer Bologna Batch Jobs over other Condor jobs, and will start Bologna Batch Jobs regardless of system load or console activity.
› There are two types of Bologna Batch Jobs, short-running and long-running. Bologna Batch Jobs are assumed to be short-running unless they are explicitly labeled as long-running when they are submitted.
› A short-running Bologna Batch Job must not be forced to wait for the completion of a long-running Bologna Batch Job before starting.
› When short and long-running Bologna Batch Jobs are running simultaneously on the same physical machine, the short-running job processes should run at a lower (better) OS priority than the long-running jobs.
› A short-running Bologna Batch Job may only run for one hour, after which point it should be killed and removed from the queue.
› Bologna Batch Jobs have priority over other Condor jobs. This means two things: other jobs must never preempt Bologna Batch Jobs, and Bologna Batch Jobs must always immediately preempt other jobs.
Review…
› Job
  - Requirements
› Machine
  - START
  - PREEMPT
  - RANK
  - WANT_SUSPEND, …
  - JOB_RENICE_INCREMENT
› PREEMPTION_REQUIREMENTS
› STARTD_EXPRS, SUBMIT_EXPRS
Requirement #1, “Bologna Batch Jobs are specially-designated jobs which may run only on specially-designated Bologna Batch Servers.”
To identify the servers, place into the local Condor config:
    BolognaBatchServer = True
    STARTD_EXPRS = $(STARTD_EXPRS) BolognaBatchServer
Users identify Bologna Batch Jobs by inserting the following line into their job submit description files:
    +BolognaBatchJob = True
Now that Bologna Batch Jobs and Servers can identify one another, users ensure that Bologna Batch Jobs run only on Bologna Batch Servers by specifying a job requirement:
    Requirements = (BolognaBatchServer == True)
Requirement #2, “Only users in Bologna may submit Bologna Batch Jobs.”
Each Bologna Batch Server double-checks the origin of a job claiming to be a Bologna Batch Job:
    IsBBJob = ( TARGET.BolognaBatchJob =?= True \
                && TARGET.SUBMIT_SITE_DOMAIN == $(SUBMIT_SITE_DOMAIN) )
SUBMIT_SITE_DOMAIN is an attribute that INFN defines on all machines, and which they previously configured the Condor schedd to automatically add to each job’s classad. Individual Condor users are not able to override it:
    SUBMIT_SITE_DOMAIN = "$(UID_DOMAIN)"
    SUBMIT_EXPRS = $(SUBMIT_EXPRS) SUBMIT_SITE_DOMAIN
Requirement #3, “BB Jobs must be vanilla-universe jobs, and thus once they start they must not be preempted”
Next we modified each Bologna Batch Server’s WANT_SUSPEND_VANILLA and PREEMPT expressions, which Condor uses to decide when to suspend or preempt a vanilla job, so that INFN’s default preemption policy would only affect non-Bologna Batch Jobs:
    IsNotBBJob = ( $(IsBBJob) =!= True )
    WANT_SUSPEND_VANILLA = ( $(IsNotBBJob) && ($(WANT_SUSPEND_VANILLA)) )
    PREEMPT = ( $(IsNotBBJob) && ($(PREEMPT)) )
Requirement #4, “Bologna Batch Servers prefer Bologna Batch Jobs over other Condor jobs, and will start Bologna Batch Jobs regardless of system load or console activity”
RANK = $(IsBBJob)
INFN_START = ( (LoadAvg - CondorLoadAvg) < 0.3 \
&& KeyboardIdle > (15 * 60) \
&& TotalCondorLoadAvg <= 1.0 )
START = ( $(IsBBJob) || ($(INFN_START)) )
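To see how these three expressions combine, here is a minimal Python sketch of the same logic. It is only an illustration: the function names and the sample machine readings are hypothetical, and the ranking assumes that a true RANK is treated as 1.0 and a false one as 0.0.

```python
def infn_start(load_avg, condor_load_avg, keyboard_idle, total_condor_load_avg):
    """Mirror of INFN_START: the machine must look idle apart from Condor's own load."""
    return ((load_avg - condor_load_avg) < 0.3
            and keyboard_idle > 15 * 60
            and total_condor_load_avg <= 1.0)


def start(is_bb_job, machine):
    """Mirror of START = ( $(IsBBJob) || ($(INFN_START)) )."""
    return is_bb_job or infn_start(**machine)


def rank(is_bb_job):
    """Mirror of RANK = $(IsBBJob); assumes true counts as 1.0 and false as 0.0."""
    return 1.0 if is_bb_job else 0.0


# Hypothetical readings: someone is typing on "busy", nobody is using "idle".
busy = dict(load_avg=1.4, condor_load_avg=0.0, keyboard_idle=30,
            total_condor_load_avg=0.0)
idle = dict(load_avg=0.05, condor_load_avg=0.0, keyboard_idle=3600,
            total_condor_load_avg=0.0)

print(start(True, busy))    # Bologna Batch Job on a busy server
print(start(False, busy))   # outside job on a busy server
print(start(False, idle))   # outside job on an idle server
```

The Bologna Batch Job bypasses the load and keyboard checks entirely, while the outside job is admitted only as backfill on an idle server.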
Requirement #5, “There are two types of Bologna Batch Jobs, short-running and long-running. Bologna Batch Jobs are assumed to be short-running unless they are explicitly labeled as long-running when they are submitted.”
Declare long-running jobs by placing the following into the submit file:
    +LongRunningJob = True
Then, in the config file, take advantage of the meta-operators (=?= and =!=), which treat a missing attribute as simply “not True”:
    IsLongBBJob = ( $(IsBBJob) && TARGET.LongRunningJob =?= True )
    IsShortBBJob = ( $(IsBBJob) && TARGET.LongRunningJob =!= True )
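The reason the meta-operators matter is that most jobs never define LongRunningJob at all. A small Python sketch of the idea follows; the UNDEFINED sentinel and the helper names are hypothetical stand-ins, not Condor code, and the comparison is simplified to "same type and same value".

```python
UNDEFINED = object()  # stand-in for an attribute that is missing from the job ad


def meta_equal(a, b):
    """Sketch of =?= : true only when both sides are identical, never undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def meta_not_equal(a, b):
    """Sketch of =!= : the plain negation of =?=."""
    return not meta_equal(a, b)


def classify(job, is_bb_job):
    """Apply the IsLongBBJob / IsShortBBJob definitions to a job ad (a dict)."""
    long_flag = job.get("LongRunningJob", UNDEFINED)
    if is_bb_job and meta_equal(long_flag, True):
        return "long"
    if is_bb_job and meta_not_equal(long_flag, True):
        return "short"
    return "not-bb"


print(classify({}, True))                         # unlabeled job
print(classify({"LongRunningJob": True}, True))   # explicitly labeled job
```

An unlabeled Bologna Batch Job falls into the short class by default, which is exactly the behaviour the requirement asks for.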
Requirement #6, “A short-running Bologna Batch Job must not be forced to wait for the completion of a long-running Bologna Batch Job before starting.”
Declare more Virtual Machines than there are actual CPUs (dual CPU = 2 short VMs and 2 long VMs, 4 in total):
NUM_SHORT_RUNNING_VMS = 2
IsShortRunningVM = (VirtualMachineID <= $(NUM_SHORT_RUNNING_VMS))
IsLongRunningVM = (VirtualMachineID > $(NUM_SHORT_RUNNING_VMS))
Change the start expression:
SHORT_RUNNING_VM_START = ( $(IsShortBBJob) \
|| ( $(IsNotBBJob) && $(INFN_START) ) )
LONG_RUNNING_VM_START = $(IsLongBBJob)
START = ( ( $(IsShortRunningVM) && $(SHORT_RUNNING_VM_START) ) \
|| ( $(IsLongRunningVM) && $(LONG_RUNNING_VM_START) ) )
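As a rough illustration of the resulting START expression, the Python sketch below lists which VMs accept which kind of job. The four-VM total follows the earlier "two-CPU machine, four VMs" slide; the function names and the string labels for job kinds are hypothetical.

```python
NUM_SHORT_RUNNING_VMS = 2


def vm_start(vm_id, job_kind, infn_start):
    """Sketch of START per VM; job_kind is 'short', 'long' or 'outside'."""
    is_short_vm = vm_id <= NUM_SHORT_RUNNING_VMS
    if is_short_vm:
        # SHORT_RUNNING_VM_START: short BB jobs, or outside jobs as backfill
        return job_kind == "short" or (job_kind == "outside" and infn_start)
    # LONG_RUNNING_VM_START: only long BB jobs
    return job_kind == "long"


def accepting_vms(job_kind, infn_start, total_vms=4):
    """List the VM ids (1-based, like VirtualMachineID) that would start the job."""
    return [vm for vm in range(1, total_vms + 1)
            if vm_start(vm, job_kind, infn_start)]


print(accepting_vms("short", infn_start=False))    # [1, 2]
print(accepting_vms("long", infn_start=False))     # [3, 4]
print(accepting_vms("outside", infn_start=True))   # [1, 2]
print(accepting_vms("outside", infn_start=False))  # []
```

Because short and long jobs never compete for the same VM, a short job always finds a free slot even when both long VMs are occupied.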
Requirement #7, “When short and long-running BB Jobs are running simultaneously on the same physical machine, the short-running job processes should run at a lower (better) OS priority than the long-running jobs.”
    JOB_RENICE_INCREMENT = \
      ( 5 + ( 10 * ( LongRunningJob =?= True \
                     || BolognaBatchJob =!= True ) ) )
If LongRunningJob is true in the job classad, the expression evaluates to (5 + (10 * 1)), or 15.
If LongRunningJob is undefined or false in the job classad, but BolognaBatchJob is true, the expression evaluates to (5 + (10 * 0)), or 5.
If neither is defined, the expression evaluates to (5 + (10 * 1)), or 15.
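The three cases can be checked with a few lines of Python. This is only a sketch of the arithmetic, with the job ad modelled as a plain dict and a missing key standing in for an undefined attribute.

```python
def job_renice_increment(job):
    """Sketch of JOB_RENICE_INCREMENT for a job ad given as a dict."""
    long_running = job.get("LongRunningJob") is True      # LongRunningJob =?= True
    not_bb = job.get("BolognaBatchJob") is not True       # BolognaBatchJob =!= True
    return 5 + 10 * int(long_running or not_bb)


print(job_renice_increment({"BolognaBatchJob": True, "LongRunningJob": True}))  # 15
print(job_renice_increment({"BolognaBatchJob": True}))                          # 5
print(job_renice_increment({}))                                                 # 15
```

Only short Bologna Batch Jobs get the smaller increment, so they win the CPU whenever they share a physical machine with long or outside jobs.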
Requirement #8, “A short-running Bologna Batch Job may only run for one hour, after which point it should be killed and removed from the queue.”
In the config file, extend PREEMPT so that short jobs are evicted after an hour, and extend the short VM's START expression so that a job which has already used its hour is not started again:
    PREEMPT = ( ( $(IsNotBBJob) && ($(PREEMPT)) ) \
                || ( $(IsShortBBJob) && ($(ActivityTimer) > 60*60) ) )
    SHORT_RUNNING_VM_START = ( ( $(IsShortBBJob) \
                                 && (RemoteWallClockTime < 60*60) =!= False ) \
                               || ( $(IsNotBBJob) && ($(INFN_START)) ) )
To remove the job from the queue, add to the job ad:
    Periodic_Remove = ( LongRunningJob =!= True \
                        && (RemoteWallClockTime > 60*60) )
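A short Python sketch of the two time checks may help. The removal test follows the form used in the bbs_submit wrapper script (RemoteWallClockTime > 60*60); the function names are hypothetical, and a missing RemoteWallClockTime is treated as "no time used yet", which is an assumption of this sketch.

```python
ONE_HOUR = 60 * 60


def may_start_short(job):
    """Sketch of (RemoteWallClockTime < 60*60) =!= False.

    True when the attribute is missing or still under one hour.
    """
    wall = job.get("RemoteWallClockTime")
    return wall is None or wall < ONE_HOUR


def periodic_remove(job):
    """Sketch of the removal test: not a long job and over one hour of wall time."""
    is_long = job.get("LongRunningJob") is True
    wall = job.get("RemoteWallClockTime", 0)  # assumption: missing means 0 seconds
    return (not is_long) and wall > ONE_HOUR


print(periodic_remove({"RemoteWallClockTime": 1800}))                        # False
print(periodic_remove({"RemoteWallClockTime": 3601}))                        # True
print(periodic_remove({"LongRunningJob": True, "RemoteWallClockTime": 36000}))  # False
```

The `=!= False` form in the START expression is what lets a brand-new job, which has no wall-clock time recorded yet, pass the check.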
Requirement #9, “Bologna Batch Jobs have priority over other Condor jobs: other jobs must never preempt BB Jobs, and BB Jobs must always immediately preempt other jobs.”
RANK has already been dealt with; now for priority preemption:
    INFN_PREEMPTION_REQUIREMENTS = \
      ( $(StateTimer) > (2 * (60 * 60)) \
        && RemoteUserPrio > SubmittorPrio * 1.2 )
    PREEMPTION_REQUIREMENTS = \
      ( ( BolognaBatchServer =!= True && $(INFN_PREEMPTION_REQUIREMENTS) ) \
        || ( BolognaBatchServer =?= True \
             && ( BolognaBatchJob =!= True \
                  && ( TARGET.BolognaBatchJob =?= True \
                       || $(INFN_PREEMPTION_REQUIREMENTS) ) ) ) )
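The nested expression is easier to follow as a decision function. The Python sketch below assumes one particular reading: the unqualified BolognaBatchJob describes the job currently running on the machine, and TARGET.BolognaBatchJob describes the candidate job that wants the slot. The function and parameter names are hypothetical.

```python
def infn_preemption(state_timer, remote_user_prio, submittor_prio):
    """Sketch of INFN_PREEMPTION_REQUIREMENTS (times in seconds)."""
    return (state_timer > 2 * 60 * 60
            and remote_user_prio > submittor_prio * 1.2)


def preemption_ok(server_is_bb, running_is_bb, candidate_is_bb, infn_ok):
    """Sketch of PREEMPTION_REQUIREMENTS under the reading described above."""
    if not server_is_bb:
        # Ordinary machine: INFN's default fair-share rule applies unchanged.
        return infn_ok
    # Bologna Batch Server: a running BB job is never preempted; anything else
    # yields at once to a BB job, or to others under the default rule.
    return (not running_is_bb) and (candidate_is_bb or infn_ok)


print(preemption_ok(True, True, True, True))     # running BB job is protected
print(preemption_ok(True, False, True, False))   # BB job preempts outside job
print(preemption_ok(True, False, False, False))  # outside vs outside, rule not met
```

Note that the first case is refused even when the candidate is itself a Bologna Batch Job, which matches the "no preemption of vanilla jobs" goal.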
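Wrap condor_submit to make it easy for users
bbs_submit_short / bbs_submit_long
    #!/bin/sh
    # Append the server requirement to every vanilla job submitted through this wrapper.
    _CONDOR_APPEND_REQ_VANILLA='(BolognaBatchServer == True)'
    export _CONDOR_APPEND_REQ_VANILLA
    # Tag the job as a Bologna Batch Job, force the vanilla universe, and
    # remove jobs not labeled long-running once they exceed one hour.
    condor_submit -a '+BolognaBatchJob = True' \
        -a 'should_transfer_files = IF_NEEDED' \
        -a 'when_to_transfer_output = ON_EXIT' \
        -a 'universe = vanilla' \
        -a 'periodic_remove = ( LongRunningJob =!= True && (RemoteWallClockTime > 60*60) )' \
        "$@"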
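For sites that prefer a single wrapper over two shell scripts, the same idea can be sketched in Python. This is an illustration under assumptions: the slide does not show how the long variant differs, so the extra `+LongRunningJob = True` argument is a guess based on Requirement #5, and the function names are hypothetical. The sketch only builds the command and environment; it does not run condor_submit.

```python
PERIODIC_REMOVE = ("periodic_remove = ( LongRunningJob =!= True "
                   "&& (RemoteWallClockTime > 60*60) )")


def build_submit_command(user_args, long_running=False):
    """Build the condor_submit argument list used by the wrapper."""
    cmd = ["condor_submit",
           "-a", "+BolognaBatchJob = True",
           "-a", "should_transfer_files = IF_NEEDED",
           "-a", "when_to_transfer_output = ON_EXIT",
           "-a", "universe = vanilla",
           "-a", PERIODIC_REMOVE]
    if long_running:
        # Assumption: the long variant labels the job as in Requirement #5.
        cmd += ["-a", "+LongRunningJob = True"]
    return cmd + list(user_args)


def build_environment(base_env):
    """Copy the environment and add the server requirement for vanilla jobs."""
    env = dict(base_env)
    env["_CONDOR_APPEND_REQ_VANILLA"] = "(BolognaBatchServer == True)"
    return env


short_cmd = build_submit_command(["job.sub"])
long_cmd = build_submit_command(["job.sub"], long_running=True)
print(short_cmd[-1])
print("+LongRunningJob = True" in long_cmd)
```

Passing the list and environment to `subprocess.run(cmd, env=env)` would then submit the job; keeping the arguments as a list avoids the quoting problems that a shell's `$*` can introduce.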
Simple for Users
› Although the policy is complicated, the interface for users is kept simple:
  - Users call bbs_submit_long or bbs_submit_short, just as they would condor_submit…
    • Short jobs start quickly, but those that run for >1 hour are killed.
    • Long jobs will run to completion...
  - bbs_submit_* scripts automatically add the appropriate classad attributes to the job to take advantage of the long or short running VMs on Bologna Batch Servers.
Any Questions?
› Email me at [email protected].
› Check the Bologna Batch System document at http://www.cs.wisc.edu/~pfc/bbs.doc
Thanks!