The Maui scheduler for use with PBS.

Maui
HepSysMan, 1st/2nd July 2004
Batch System
[Diagram: the batch server (pbs_server) holds the configuration, the job queue and the state table. Users submit and control jobs with qsub, qdel and qstat. The server starts and stops jobs on the execution hosts, each running pbs_mom, and collects job and node status back from them. A scheduler plug-in, driven by additional cluster configuration, decides which jobs the server should start.]
Maui scheduler
• Seems to originate at the Maui High Performance Computing Centre (MHPCC), http://www.mhpcc.edu
• But now available from http://www.supercluster.org/maui/ (based in Covered Bridge Canyon, Utah)
Maui/PBS Integration
[martin@masternode martin]$ qmgr
Max open servers: 4
Qmgr: list server
Server masternode
    server_state = Idle
    scheduling = False
    default_queue = dque
    log_events = 127
    mail_from = adm
    query_other_jobs = True
    resources_default.walltime = 00:01:00
    scheduler_iteration = 60
    node_pack = False
    pbs_version = OpenPBS_2.4
# maui.cfg 3.2
#
# 18/5/04 built by maui with extras added by xCAT and the 12Mar04 version
#
SERVERHOST        masternode
# primary admin must be first in list
ADMIN1            root
RMCFG[base]       TYPE=PBS
RMPOLLINTERVAL    00:01:00
SERVERPORT        42559
SERVERMODE        NORMAL
Maui Philosophy (1)
• Maui is particularly concerned about scheduling
multiprocessor jobs
• How do you arrange a matching set of processors
to be simultaneously available for a single job ?
• Maui tries to plan the execution of such jobs at a
particular time when it expects sufficient
processors to be available - on the basis of the job
maximum walltime parameters.
• It establishes reservations on a set of processors
for a job – ensuring all the processors are free at
the planned time
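
A minimal sketch of the planning step in Python (the job list, field names and function are hypothetical, not Maui's implementation): given each running job's remaining maximum walltime, find the earliest moment at which enough processors are guaranteed free, which is where the reservation is placed.

# Remaining walltimes of running jobs, in minutes (invented data).
running = {"12340": 30, "12341": 55, "12342": 10, "12343": 45, "12344": 20}

def earliest_start(running, free_now, procs_needed):
    """Earliest time (minutes from now) when procs_needed processors are free,
    assuming one job per processor and that jobs end at their walltime limits."""
    if free_now >= procs_needed:
        return 0
    ends = sorted(running.values())
    # The last of the (procs_needed - free_now) earliest expiries sets the time.
    return ends[procs_needed - free_now - 1]

# Plan a reservation for a 5-processor job: all 5 processors are only
# guaranteed free once the longest of the 5 earliest walltimes expires.
print(earliest_start(running, free_now=0, procs_needed=5))  # -> 55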
Reservations
[Diagram: processors (cpu axis) plotted against walltime. Several running jobs (12340, 12341, 12343, 12344, ...) each occupy a processor; a reservation for multiprocessor job 12345 is placed on every processor, beginning where each running job's maximum walltime expires, so that all the processors come free together at the planned start time of job 12345.]
Maui Philosophy (2)
• As the reservations take effect, more and more processors become idle as the planned job time approaches
• A scheme called backfill tries to exploit these idle processors by running short single- or few-processor jobs out of priority order in the gaps (see the sketch below)
• Maximum efficiency is achieved by scheduling big jobs first and running small jobs in the gaps! Perhaps not what the users really want?
• Maui really cares about walltimes
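
A minimal sketch of the backfill test (hypothetical names, not Maui's code): a waiting job may jump the priority order only if its declared walltime guarantees it is finished before the reservation starts, which is why accurate walltimes matter so much.

def can_backfill(now, job_walltime, reservation_start):
    # The job is killed at its walltime limit, so it is safe to start it
    # in the gap if and only if it must be gone before the reservation.
    return now + job_walltime <= reservation_start

print(can_backfill(now=0, job_walltime=50, reservation_start=55))  # True
print(can_backfill(now=0, job_walltime=60, reservation_start=55))  # False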
Job Priority (1)
• Jobs are selected for execution in priority order
• Priority is calculated as a linear combination of
factors based on
– Credentials: who, class/queue, ...
– Fair share
– Resources requested
– Waiting time
– Target service level, e.g. maximum wait
• Most sites would have most coefficients set to 0
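
As a minimal sketch of that linear combination (the weights and per-job factor values here are invented; in real Maui each factor is itself a weighted sum of subcomponents):

# Site-configured weights; most left at 0, as on most sites.
weights = {"cred": 0, "fairshare": 100, "resource": 0, "queuetime": 1, "target": 0}
# Per-job factor values, e.g. minutes waited, fairshare delta, ...
factors = {"cred": 0.0, "fairshare": 2.5, "resource": 0.0, "queuetime": 340.0, "target": 0.0}

priority = sum(weights[k] * factors[k] for k in weights)
print(priority)  # 100*2.5 + 1*340 = 590.0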
Sample Priority Component
• An excerpt from the Maui manual:

5.1.2.2 Fairshare (FS) Component

Fairshare components allow a site to favor jobs based on short-term historical usage. The Fairshare Overview describes the configuration and use of fairshare in detail.

After the brief reprieve from complexity found in the QOS factor, we come to the Fairshare factor. This factor is used to adjust a job's priority based on the historical percentage system utilization of the job's user, group, account, or QOS. This allows you to 'steer' the workload toward a particular usage mix across user, group, account, and QOS dimensions. The fairshare priority factor calculation is

Priority += FSWEIGHT * MIN(FSCAP, (
    FSUSERWEIGHT    * DeltaUserFSUsage +
    FSGROUPWEIGHT   * DeltaGroupFSUsage +
    FSACCOUNTWEIGHT * DeltaAccountFSUsage +
    FSQOSWEIGHT     * DeltaQOSFSUsage +
    FSCLASSWEIGHT   * DeltaClassFSUsage))

All '*WEIGHT' parameters above are specified on a per-partition basis in the maui.cfg file. The 'Delta*Usage' components represent the difference in actual fairshare usage from a fairshare usage target. Actual fairshare usage is determined based on historical usage over the timeframe specified in the fairshare configuration. The target usage can be either a target, floor, or ceiling value as specified in the fairshare config file. The fairshare documentation covers this in detail but an example should help obfuscate things completely. Consider the following information associated with calculating the fairshare factor for job X.
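
A minimal sketch of that calculation in Python (all weights, caps and usage deltas invented; the sign convention is chosen here so that usage below target raises priority):

FSWEIGHT, FSCAP = 100, 1000
weights = {"user": 1, "group": 1, "account": 0, "qos": 0, "class": 0}
# Delta*FSUsage: taken here as (target - actual) usage, in percent.
delta = {"user": 5.0, "group": -2.5, "account": 0.0, "qos": 0.0, "class": 0.0}

increment = FSWEIGHT * min(FSCAP, sum(weights[k] * delta[k] for k in weights))
print(increment)  # 100 * min(1000, 5.0 - 2.5) = 250.0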
Job Priority (2)
• Multiple queues/classes are but one factor in Maui's calculations and decisions
• Jobs are normally given a whole cpu or even a whole execution host
• Priorities are recalculated on every Maui iteration, say one per minute
• Jobs selected for backfill can bypass higher-priority jobs
Fairness
• Jobs can be given priority increments or decrements according to whether the recent usage of their user/group/... is below or above its fairshare target
• There is a selection of throttling parameters to prevent various forms of excessive behaviour: max jobs, max submission rate, ... (see the fragment below)
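
A hedged maui.cfg fragment showing both mechanisms (parameter names are from the Maui 3.2 documentation; the values and the user name are invented):

# Fairshare: usage averaged over 7 one-day windows, decaying 20% per window
FSPOLICY        DEDICATEDPS
FSINTERVAL      24:00:00
FSDEPTH         7
FSDECAY         0.80
FSWEIGHT        100
FSUSERWEIGHT    1
USERCFG[fred]   FSTARGET=10.0

# Throttling: limit active jobs and processors per user
USERCFG[DEFAULT]  MAXJOB=64 MAXPROC=32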
Reservations
• The administrator can set manual reservations, handy for shutting a node down at a particular time
• Standing reservations repeat, e.g. ScotGRID-Glasgow reserves a few nodes for short jobs 08:00-20:00 every day (see the fragment below)
  – Backfill allows jobs of up to 12 hours on these nodes during the night
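
A hedged maui.cfg fragment for such a standing reservation (the SRCFG syntax is from the Maui 3.2 documentation; node names, class name and times are invented):

# Three nodes reserved for the 'short' class, 08:00-20:00 every day
SRCFG[short]  HOSTLIST=node29,node30,node31
SRCFG[short]  CLASSLIST=short
SRCFG[short]  DAYS=ALL
SRCFG[short]  STARTTIME=08:00:00 ENDTIME=20:00:00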
Node selection
• Some heterogeneity in the cluster may require all processors for a job to come from some subset for best performance, e.g. sharing a Myrinet switch
• Some constraints on node selection based on ownership may be demanded
• Maui has additional cluster configuration settings that can define sets of execution hosts as partitions (a simple member list) or as nodesets (a set defined by a common node feature), as in the fragment below
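
A hedged maui.cfg fragment for both schemes (parameter names from the Maui documentation; node names, partition names and features invented):

# Nodesets: each job confined to nodes sharing one feature, e.g. a switch
NODESETPOLICY     ONEOF
NODESETATTRIBUTE  FEATURE
NODESETLIST       myrinet ethernet

# Partitions: simple membership lists (jobs do not span partitions)
NODECFG[node01]   PARTITION=owned
NODECFG[node02]   PARTITION=shared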
Simulation
• Maui has a scheme for recording a usage profile over some period, e.g. a week
• The profile can then be played back against a different Maui configuration in simulation mode to test new settings (see the fragment below)
• Quite a few “under construction” sections in the manual about this
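
A hedged maui.cfg fragment for playback (parameter names from the Maui simulation documentation; the trace file names are invented):

# Replay a recorded week against trial settings instead of the live cluster
SERVERMODE             SIMULATION
SIMRESOURCETRACEFILE   traces/resource.20040621
SIMWORKLOADTRACEFILE   traces/workload.20040621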
Resource Allocation Manager
• “Payment” for usage
• Maui can interwork with the QBank resource allocation manager
  – http://www.emsl.pnl.gov/docs/mscf/qbank/
  – from Pacific Northwest National Laboratory (PNNL) in Richland, Washington
  – reserves payment before the job runs (a lien) and takes actual payment for the resources used after the job (see the sketch below)
• May be important when a cluster is funded from many sources and value for money needs to be proved
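
A minimal sketch of the lien idea (entirely hypothetical, not QBank's actual interface): funds are reserved for the job's worst-case cost before it starts, and the actual cost is charged afterwards.

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.liens = {}          # job id -> reserved amount

    def reserve(self, job_id, max_cost):
        # Lien: refuse the job if worst-case cost exceeds unreserved funds.
        if max_cost > self.balance - sum(self.liens.values()):
            raise RuntimeError("insufficient funds")
        self.liens[job_id] = max_cost

    def settle(self, job_id, actual_cost):
        # After the job: release the lien and charge what was really used.
        self.liens.pop(job_id)
        self.balance -= actual_cost

acct = Account(1000)
acct.reserve("12345", max_cost=120)   # e.g. processors * maximum walltime
acct.settle("12345", actual_cost=75)  # the job finished early
print(acct.balance)                   # -> 925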
ScotGRID-Glasgow Experience (1)
– OpenPBS and Maui built and configured by IBM’s eXtreme Cluster Administration Toolkit (xCAT)
  • http://www.xcat.org
  • xCAT is not a product, more a kit of parts supplied to IBM customers to operate Linux clusters, some of it Open Source
  • xCAT includes scripts to build OpenPBS and Maui according to the xCAT scheme
– Fairshare used to balance between user groups
  • Calculated with respect to an average over 7 days, decaying 20% per day
  • Most effective with a steady demand across all users/groups; less good when job submission comes in peaks and troughs
ScotGRID-Glasgow Experience (2)
• Standing reservation for short jobs during the daytime
  – Currently 3 nodes with a maximum walltime of 1 hour
  – Intended for development/test runs
  – Grid monitoring test jobs
• No experience yet of multiprocessor jobs, simulation, or resource allocation management
• The Bioinformatics group demonstrated that Maui has a compiled-in limit of 4096 on the maximum number of jobs that can be in the queue!
ScotGRID-Glasgow Experience (3)
• Maui documentation is extensive but not completely comprehensive
• Maui is not keen on error messages
• The priority calculation is hard to get to grips with
• A misbehaving pbs_mom hangs both OpenPBS and Maui
  – ssh allnodes service pbs status
  – hope to use Ganglia (http://ganglia.sourceforge.net/) to spot cases where a whole execution host is in trouble
• Ganglia’s gmetad (which aggregates local data) contributes a load average of ~1 on our 1 GHz PIII; looks like gmetad needs its own cpu
Grid (1)
• The EDG (and LCG?) job submission system relies on sites giving an estimate of the time before a job would start to execute: FIFO behaviour
• Maui does not execute jobs in submission order: non-FIFO behaviour
• So the Resource Broker (RB) gets an unreliable estimate
Grid (2)
• GridPP have a batch solution replacing OpenPBS with Torque and Maui; see the words of Steve Traylen at
  – http://www.gridpp.ac.uk/tb-support/faq/torque.html
• A Google search on “Maui lcg rpm” reveals many other sites getting into Maui