Job Scheduling with High Performance Computers
by
Andrew Botting
Submitted to the School of Information Technology and Mathematical
Science in partial fulfillment of the requirements for the degree of
Bachelor of Computing (Honours)
at the
UNIVERSITY of BALLARAT
November 2003
© University of Ballarat, School of ITMS 2003. All rights reserved.
Certified by: Glenn Stevens
Lecturer, School of ITMS
Thesis Supervisor
Certified by: David Bannon
Systems Manager, Victorian Partnership for Advanced Computing
Thesis Supervisor
Job Scheduling with High Performance Computers
by
Andrew Botting
Submitted to the School of Information Technology and Mathematical Science on
November 12, 2003, in partial fulfillment of the requirements for the degree of Bachelor of
Computing (Honours)
Abstract
The Victorian Partnership for Advanced Computing (VPAC) installed a Linux based high performance cluster for solving complex computational problems. The question this raises is: what is an effective technique to schedule jobs for execution in this environment?

Scheduling jobs as they arrive (First-Come-First-Serve) is a fair way to schedule jobs, but it can lead to fragmentation and low system utilisation while the system slowly gathers the resources needed to service the next job in the queue. One answer to this problem is the use of backfilling. Backfilling allows jobs to be executed out of order and, by making intelligent scheduling decisions, makes better use of system resources.

Job workloads from the University of Utah's 'Icebox' cluster were used to perform three scheduling experiments to optimise the job scheduler. The backfilling methods supported by Maui were tested to find their effect on the job data, and were found to reduce job turnaround time and increase job throughput. The effect of wallclock accuracy on the supported backfilling methods was also tested, and the results showed that accurate wallclock estimates from users can increase system utilisation by 20%.

The third experiment tested the expansion factor of short jobs executing for less than two hours, by creating a standing reservation. The results showed that with the Icebox data, the more nodes dedicated to the reservation, the lower the expansion factor became. A standing reservation size of 32 nodes proved to be the best solution for reducing the expansion factor of short jobs without a significant increase in the expansion factor of larger jobs.
Thesis Supervisor: Glenn Stevens
Title: Lecturer, School of ITMS
Thesis Supervisor: David Bannon
Title: Systems Manager, Victorian Partnership for Advanced Computing
Acknowledgments
I would like to thank:
• My Family for their wonderful support throughout this year.
• Bek Trotter for her understanding, patience, support and assistance throughout the
year.
• Glenn Stevens for his supervision and guidance of this project.
• David Bannon, Chris Samuel and everyone at VPAC for making me part of the team,
and allowing me to use their facilities.
Contents

1 Introduction
  1.1 Introduction
  1.2 Research question
  1.3 Methodology
  1.4 The Organisation Of The Thesis

2 Background
  2.1 Introduction
  2.2 Victorian Partnership for Advanced Computing
    2.2.1 VPAC Hardware Profile
    2.2.2 Policies
    2.2.3 VPAC's System Options
  2.3 Conclusion

3 Literature Review
  3.1 Introduction
  3.2 Parallel High Performance Computing
    3.2.1 High performance computing architectures
    3.2.2 Early High Performance Computers
  3.3 Batch Processing System
    3.3.1 Resource Managers
    3.3.2 VPAC's Choice of a Batch Processing System
  3.4 Job scheduling
    3.4.1 The Scheduling Issue
  3.5 Job scheduling strategies
  3.6 Specific Job Schedulers
    3.6.1 FIFO Scheduler
    3.6.2 Maui Scheduler
  3.7 Conclusion

4 Methodology
  4.1 Introduction
  4.2 Selected research methodology
    4.2.1 Initial construction
    4.2.2 Data Analysis
    4.2.3 Develop Simulations
    4.2.4 Optimisation
  4.3 Conclusion

5 Analysis of Results
  5.1 Introduction
  5.2 Aims
    5.2.1 Workload Profile Comparison
  5.3 Scheduling Experiments
    5.3.1 Experiment 1: Backfilling
    5.3.2 Experiment 2: Wallclock Accuracy
    5.3.3 Experiment 3: Standing Reservation for Short Jobs
  5.4 Conclusion

6 Conclusion
  6.1 Introduction
  6.2 Results
  6.3 Problems Encountered
  6.4 Further Research

List of Tables

2.1 Institution usage percentages
2.2 Priority levels on Grendel
5.1 Job Data on Grendel
5.2 Job Data on Brecca
5.3 Job Data on Icebox

List of Figures

2.1 VPAC's Linux cluster, 'Brecca'
3.1 First-Come-First-Serve
3.2 Backfilling
5.1 Grendel Workload Profile
5.2 Brecca Workload Profile
5.3 Icebox Workload Profile
5.4 Number of Completed Jobs: Icebox
5.5 Wallclock effect on Backfilling measuring Dedicated Processor Hours: Icebox
5.6 Effect of Standing Reservation for Short Jobs: Icebox
Chapter 1
Introduction
1.1 Introduction

High performance computing has been defined as computing resources much greater in magnitude of computing power than what is normally found on one's desk (Zenios, 1999). High performance computers exist in several forms, the most common being symmetric multiprocessor systems, massively parallel systems, vector systems and clusters of workstations.
Clustering offers significant price/performance advantages for many high-performance
workloads. Linux clusters can further extend these advantages by harnessing low-cost servers
and Open Source software (IBM Corporation, 2003). This paper aims to investigate cluster
optimisation techniques, by testing them and applying them to the VPAC cluster.
1.2 Research question
The research that will be conducted asks ”What is the most efficient technique to schedule
jobs on VPAC’s high performance computer?”
In an attempt to answer this question, several subquestions that will be addressed are:
• What are VPAC’s requirements?
• How is efficiency measured?
• What job scheduling techniques are available?
• What techniques will be used, and why?
1.3 Methodology
The methodology for this project will be a test and measure approach. Data will be collected
and analysed with the results being used to optimise the job scheduling algorithm. This will
be discussed in more detail in Chapter 4.
1.4 The Organisation Of The Thesis
Chapter 2 will describe the background of this project. Chapter 3 will examine literature
surrounding the topic area. Chapter 4 outlines the methodology being employed, and discusses why this approach has been taken. Chapter 5 analyses and discusses the results of
the project, and Chapter 6 draws some conclusions regarding the success of the project.
Chapter 2
Background
2.1 Introduction

This chapter is an overview of the background to this project. It outlines the VPAC organisation, its goals and objectives, and the policies which will need to be addressed in a cluster scheduling solution.
2.2 Victorian Partnership for Advanced Computing

The Victorian Partnership for Advanced Computing (VPAC) is an organisation established in 2000 by six Victorian universities: La Trobe University; Monash University; RMIT University; Swinburne University of Technology; The University of Ballarat and The University of Melbourne. VPAC's goal is to provide high performance computing (HPC) facilities to its member universities. By combining the resources and skills
of the member universities, VPAC can deliver high-performance computing resources which
these universities could not provide in isolation. VPAC receives funding from its six founding
universities, the Australian Partnership for Advanced Computing (APAC) and the Federal
Government in the form of a $6million science and technology grant over three years. VPAC
is highly committed to research and development, industry training and support in the area
of high performance computing. More information can be found at the VPAC website:
http://www.vpac.org.
2.2.1 VPAC Hardware Profile

VPAC's current system is a Compaq AlphaServer SC with 128 nodes and 1.4Tb of RAM. Its name is 'Grendel' and it is used by all supporting universities. Grendel is a batch processing system, allowing jobs to be submitted to a queue and processed once the requested resources become available.
2.2.2 Policies
Jobs submitted to Grendel are given a priority level depending on which category they
best fit. These categories use a quota system which guarantees the supporting universities
are allocated their entitled share of computational time. (See Table 2.1) The quotas are
calculated by the size of the contribution made by each university. VPAC not only enforces
a quota on the university, but also a project quota. Each project registered on Grendel has
a percentage of the total quota of the institution, creating a split-level quota system. Table 2.2 shows the priority levels currently enforced on Grendel.

    Institution                             Usage
    Monash University                       21.25%
    University of Melbourne                 21.25%
    La Trobe University                     15.94%
    RMIT                                    15.94%
    Swinburne University of Technology       7.97%
    University of Ballarat                   2.66%
    VPAC                                     5.00%

Table 2.1: Institution usage percentages

    Job category                                          Priority
    Running Parallel                                      94
    Running Parallel Under Quota but Institute Over       90
    Running Parallel Over Quota                           84
    Running Single CPU                                    70
    Suspended Parallel                                    80
    Suspended Single CPU                                  72
    Queued Parallel Under Quota                           75
    Queued Single Under Quota                             60
    Queued Parallel Proj Under Quota but Institute Over   65
    Queued Parallel Over Quota                            60
    Queued Single Over Quota                              50

Table 2.2: Priority levels on Grendel
2.2.3 VPAC's System Options

Two years after the purchase of Grendel, VPAC decided that more computing power was needed. After weighing up their options, VPAC chose a Linux based IBM cluster. The IBM system was chosen for two reasons: the proposal was a very good price for the computational performance of the system, and IBM was interested in setting up a business relationship with VPAC in the field of research and development.
VPAC eventually decided on the IBM eServer xSeries 335 Intel based Linux cluster (see
Figure 2.1). The agreement made between IBM and VPAC was that the system would
perform at a minimum of 600 Gigaflops. After applying some optimisation techniques, such as the Intel compilers and the MPICH message passing library, which worked directly with the Myrinet network interface, VPAC successfully achieved a LINPACK mark of 631 Gigaflops.
Figure 2.1: VPAC’s Linux cluster, ’Brecca’
2.3 Conclusion
This chapter outlined the background on which this project is based, including VPAC, its choice of system, and the existing setup and the policies which govern it. The next chapter
will examine the literature surrounding the area of clusters and job scheduling.
Chapter 3
Literature Review
3.1 Introduction

This chapter provides an overview of high performance computing and of job scheduling algorithms and solutions. It presents some solutions to job scheduling issues, along with some of the available resource manager and job scheduling packages.
3.2 Parallel High Performance Computing
High performance computing has been defined as computing resources, much greater in
magnitude of computing power than what is normally found on one’s desk (Zenios, 1999).
Often, applications require amounts of computing power that desktop computers simply cannot fulfill. In the early 1980s it was believed that the only way to increase computing power was to build faster processors. However, constraints exist which
hinder the development of processors. From this need, parallel computing evolved. Parallel
high performance computers are the result of connecting multiple processors together and
coordinating their efforts (Buyya, 1999). Parallel HPC’s also provide the benefit of being
scalable, so that more and more processors can be added, increasing the computing power.
Pfister (1998) defines a cluster as ”a type of parallel or distributed system that: consists of
a collection of interconnected whole computers, and is used as a single, unified computing
resource.” High performance technical computing is the idea of dedicating available CPU’s to a single job, rather than all jobs sharing a slice of the aggregate computing power. This is often found in a research environment.
3.2.1 High performance computing architectures
Many system architectures have emerged in the parallel computing area. They are classified
by the arrangement of the processors, memory and their interconnect. Some of the most
common systems are:
• Vector
• Massively Parallel Processors (MPP)
• Symmetric Multiprocessor (SMP)
• Distributed Systems and Clusters
Vector Computers
Vector computers have specially designed processors optimised for arithmetic operations on elements of arrays, known as vectors. This meant that large amounts of mathematical data could be handled much faster than by other types of processors, though vector CPU's could be slowed down when it came to complex instructions. Machines such as the Cray series of HPC's used this architecture, and were the fastest of their time (Aspen Systems, 2003).
Massively Parallel Processors
Massively parallel machines consist of a collection of separate units, known as nodes. The nodes operate totally independently of each other and are connected by a high speed network. Each node can resemble a desktop system in some respects, because each contains hardware such as hard drives and memory and runs its own copy of the operating system.
Symmetric Multiprocessor
Symmetric multiprocessor machines contain two or more processors which work independently of each other. They are connected by a very high speed, low latency bus, most often a motherboard. The processors share all hardware, including memory and the operating system. SMP machines are unable to scale well due to bus and memory bandwidth limitations.
Distributed Systems and Clusters
Distributed systems and clusters are similar to MPP machines in that they consist of totally separate nodes. The difference is that they can often be normal desktop systems connected by a standard networking interconnect. Some variations of this can be seen in setups such as the Cluster of Workstations (COW). The COW model is sometimes used by companies with many desktop machines which have very low utilisation. When these desktops are not in use, they can be controlled by a central server to form a very large parallel computer.
3.2.2 Early High Performance Computers
The Illiac IV
One of the first parallel high performance computers was the Illiac IV. The project started in 1965 and had its first successful tests in 1976. The initial predicted cost was US$8 million in 1966, but this escalated to $31 million by 1972, and the predicted 1000 MFLOPS ended up being closer to 15 MFLOPS. The computer was a failure as a production system, but it paved the way for research into the area (Wikipedia, 2003).
The Cray-1
The Cray-1 was the first machine manufactured by the Cray company, in 1976. It was revolutionary for its time and paved the way for many more Cray machines.
The Beowulf Cluster
The Beowulf cluster project was started in late 1993 by Donald Becker and Thomas Sterling. Their idea was to reduce the cost of HPC's by using Commodity-Off-The-Shelf (COTS) components coupled together with standard networking interconnect. The original Beowulf cluster was sixteen 486DX4 processors connected by 10Mbit/s channel bonded Ethernet. A single 10Mbit/s card could not provide enough network bandwidth, so Becker rewrote his Ethernet drivers for Linux and built a "channel bonded" Ethernet where the network traffic was striped across two or more Ethernet cards. The two keys to the success of this system were the availability of cheap COTS hardware and the maturity and robustness of Linux. By decoupling the hardware and software, the Beowulf cluster remains vendor independent, and by using open source software, programmers have a guarantee that the programs they write will run on future Beowulf clusters. From the progress made by the Beowulf community, researchers now recognise Beowulf clusters as a genre of their own within the HPC community. The Beowulf architecture is not quite an MPP and not quite a COW; it falls somewhere between the two (Merkey, 2003).
3.3 Batch Processing System
A batch processing system provides users with a mechanism for submitting, launching and
tracking jobs on a shared resource (Supercluster Research and Development Group, 2002).
Batch systems attempt to share a HPC’s resources in a fair and efficient manner within three
main areas:
• Traffic Control
• Site Policies
• Optimisations
Traffic Control
A batch processing system is responsible for controlling jobs. If jobs contend for a system's resources, system slowdown results. The traffic control system defines and allocates particular resources to particular jobs, ensuring that jobs do not interfere with each other.
Site Policies
When a HPC is installed, it is usually installed for a particular purpose. The site policy defines rules which govern how the system should be used, how much it is used and by whom.
Optimisations
When the demand on a HPC becomes greater than the supply, intelligent decisions about job scheduling can make better use of the system. It is the role of the job scheduler to make these decisions.
3.3.1 Resource Managers

A resource manager is a server which implements a batch processing system. Two notable resource managers are:
• Network Queue System (NQS)
• Portable Batch Scheduler (PBS)
Network Queue System
Batch queuing started with the Network Queue System (NQS), which was developed at NASA's Ames Research Facility. It was designed to run on their Cray 2 and Cray Y-MP supercomputers and to select a good job mix to achieve high machine utilisation. It was the first of its kind, and soon became the de facto standard. Jobs submitted to NQS had to state their memory and CPU requirements, and were placed in a suitable queue. The scheduler would then select jobs based on these properties to create an efficient mix.

The limitations of NQS soon showed. It was not configurable enough for tuning purposes, and it did not support parallel processing architectures. These limitations prompted work on a new package.
The Portable Batch Scheduler
The Portable Batch Scheduler (PBS) is a batch processing system, also designed at NASA's Ames Research Facility, to fill the need for a resource management system which could handle parallel jobs. Jobs are placed onto the queue, and the job scheduler component of PBS decides which resources should be allocated to each job. The term resources is generally used to describe HPC resources, meaning CPU's, memory and hard disk. The PBS server then sends the required information to the nodes running a PBS daemon called the 'mom'. The mom processes the job, and sends the required job information back to the server. PBS includes several of its own schedulers, the primary one being the First-In-First-Out (FIFO) scheduler. The focus of PBS development has shifted toward resource management for clusters, as they have become a much more viable option for many organisations. PBS is now owned by Altair.
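As a concrete illustration, a job is handed to PBS as a shell script whose #PBS directives state the resources the job requires. A minimal sketch (the job name, resource values and program are hypothetical, not taken from VPAC's setup):

    #!/bin/sh
    #PBS -N my_job               # job name
    #PBS -l nodes=4:ppn=2        # request 4 nodes with 2 processors each
    #PBS -l walltime=02:00:00    # estimated execution time (walltime)
    cd $PBS_O_WORKDIR            # run from the directory the job was submitted from
    mpirun -np 8 ./my_program    # launch the parallel program on the allocated CPU's

The script would be submitted with a command such as 'qsub my_job.sh'; the scheduler then decides when, and on which nodes, it runs.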
The three versions of PBS currently available are:
• OpenPBS
• PBSPro
• ScalablePBS
OpenPBS
OpenPBS is a free version of PBS. Its source code is readily available, but the latest release is not licensed under the GNU General Public Licence, so users are not permitted to modify and redistribute it. Many sites using OpenPBS have created patches tailoring it to their needs, and made their patches available over the Internet. More information about OpenPBS can be found at: http://www.openpbs.com.
PBSPro
PBS Pro is a commercial version of PBS which requires users to purchase it for their machines. It provides many enhancements over the free version, including product support,
which may make it a viable option for many sites. More information about PBSPro can be
found at: http://www.pbspro.com.
ScalablePBS
ScalablePBS differs from both OpenPBS and PBSPro. The other two versions of PBS are owned and distributed by Veridian Systems, whereas ScalablePBS is the child of the supercluster.org group. Many bugs were found in OpenPBS which could cause the batch system to fail, so a stable, robust version of PBS was needed. The supercluster.org group therefore took the source of OpenPBS at a point where the licence allowed redistribution and applied many OpenPBS patches which had been created by various cluster sites. This resulted in a more stable and scalable PBS. More information about ScalablePBS can be found at: http://supercluster.org/projects/pbs.
3.3.2 VPAC's Choice of a Batch Processing System
ScalablePBS was chosen by VPAC for the following reasons:
• Free.
• Open solution (Source code available).
• Linux and cluster support.
• Stable and robust.
3.4 Job scheduling
Job scheduling on a parallel system is the activity of assigning a limited number of system resources to a process so that it can be executed. The term resources in this context relates to memory, hard disk space and CPU time.
3.4.1 The Scheduling Issue
The two main objectives of the job scheduler are as follows:
• Increase system utilisation.
• Minimise job turnaround time.
High performance computers cost a great deal of money, and to gain a reasonable return on investment on a system of this capacity, the system must be utilised as highly as possible. Therefore, the efficiency of the job scheduler is a major concern for the machine owner. On the other hand, users want a short turnaround time for their jobs. A balance between system utilisation and fairness then needs to be established. Several job scheduling strategies exist, and are examined below.
3.5 Job scheduling strategies
Depending on the policies of the HPC site, many systems use space sharing, that is, the concurrent execution of jobs (Schwiegelshohn and Yahyapour, 2000). This means that at
any one time, a HPC can be concurrently executing many different jobs from many different
users. Each job could request any number of available HPC resources and run for any length
of time. Due to the combination of users and jobs, a varied and unpredictable workload may
exist. Therefore, a HPC system requires some way to schedule these jobs.
Extensive research has been undertaken in this area, and also in the conception of many
scheduling algorithms, with very few of these actually being implemented in real scheduling
applications (Schwiegelshohn and Yahyapour, 1998).
Shortest Job First
The shortest job first (SJF) strategy attempts to execute the shortest jobs (jobs that will use a shorter amount of CPU time) before longer jobs. The rationale behind this scheduling algorithm is that if a shorter job waits for a longer job, both jobs will have a long response time; if the shorter job runs first, it will have a shorter response time and therefore reduce the overall average response time. The major disadvantage of this method is that long jobs are penalised, and job starvation may occur if the system continues to receive short jobs.
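In its simplest form, SJF is just a sort of the waiting queue by estimated runtime. A minimal sketch in Python (the job list is made up for illustration):

    # Shortest Job First: dispatch the jobs with the smallest runtime estimate first.
    # This minimises average response time but can starve long jobs.
    jobs = [("sim_a", 3600), ("test_b", 120), ("render_c", 86400)]  # (name, est. seconds)

    def sjf_order(queue):
        return sorted(queue, key=lambda job: job[1])

    print(sjf_order(jobs))  # [('test_b', 120), ('sim_a', 3600), ('render_c', 86400)]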
Time Slicing/Processor Sharing
Time slicing (or processor sharing) is a simple approach to job scheduling, whereby the resources of a system are treated as one unit and each job is given a slice of the resources (Majumdar et al., 1988). This job scheduling strategy differs from many others in that jobs do not have exclusive access to their assigned resources. In a time sliced system, the more jobs are started on the system, the fewer resources are available for existing jobs.
First-Come-First-Serve
The first-come-first-serve (FCFS) scheduling algorithm is a simple way to schedule jobs. The
FCFS algorithm processes each job as it is submitted to the batch scheduler. An advantage
of this algorithm is that it has low overhead and is not biased on the basis of job type, but
it suffers from low efficiency because no strategic decisions are made about space filling.
For example, consider a cluster of 16 CPU's running at high utilisation, with 14 CPU's in use. The next job in the queue requests 12 CPU's, so 10 more CPU's would need to become free before it could execute. As running jobs complete, more of the cluster's resources are left idle waiting for the required number of CPU's to become available; up to 11 processors could be left idle until the large job in the queue has a chance to execute (see Figure 3.1). This unused processing power is known as fragmentation.
Figure 3.1: First-Come-First-Serve (a large job waits while running jobs complete, leaving CPU's idle between t1 and t2)
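The fragmentation in this example can be made concrete with a small Python sketch (the release pattern of the running jobs is invented; only the 16 CPU cluster and the 12 CPU head-of-queue job come from the example above):

    # FCFS fragmentation: CPU's freed by finishing jobs sit idle until the
    # 12 CPU head-of-queue job can be satisfied all at once.
    free = 2                   # 14 of 16 CPU's are busy, 2 are free
    releases = [4, 3, 2, 4]    # CPU's freed as running jobs complete, in order
    needed = 12                # head-of-queue job under strict FCFS

    for freed in releases:
        free += freed
        if free >= needed:
            print(f"large job starts; {free} CPU's were free")
            break
        print(f"{free} CPU's idle, still waiting for {needed}")

Run as written, this prints 6, 9 and then 11 idle CPU's before the large job finally starts, matching the 'up to 11 processors idle' figure above.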
Priority
To extend the FCFS scheduling algorithm, many schedulers use a priority scheme. When a job is submitted to a queue, it is assigned a priority depending on its characteristics. Priorities are a useful way of allowing an organisation to adhere to its goals or political agendas, by giving higher priority to certain users, groups, types of jobs, quotas of users, etc.
Backfilling
Backfilling is a scheduling technique which allows jobs to be run out of their priority order to make better use of a machine's resources. It aims to provide high system utilisation by searching for free slots for processing, and fitting suitable jobs into these slots without delaying them (Streit, 2001).
In the FCFS example (see Figure 3.1), many of the system's resources were unused between t=1 and t=2 because of the large job reservation. While this reservation allowed the large job to be processed as soon as the processors became available, much of the system is left idle. Backfilling allows small jobs to run by allocating the CPU's which would normally have been left idle. For the scheduler to successfully backfill jobs, users must submit an estimate of job execution time with their job. This is known as a walltime estimate. The walltime estimate is used by the scheduler to guarantee a latest possible completion time for jobs. The backfilling method uses the walltime estimate to schedule the required resources for the highest priority job in the queue as soon as the resources become available. While that job waits for its resources, smaller jobs whose walltime estimates fit within the unused time between the current time and the time when the high priority job is scheduled to begin are given the opportunity to run, and hence use resources that would otherwise be wasted by the FCFS algorithm. Since additional jobs are processed without any delay to the start of the larger, priority job, overall system utilisation is increased. Figure 3.2 illustrates how the backfill method works: a large job is scheduled at t=2, and three jobs, all with walltime estimates which allow them to run in the free resources between t=1 and t=2, are executed.
The wallclock estimate is used to schedule large jobs and also to backfill the small jobs,
hence the accuracy of the wallclock estimate can affect the success of backfilling. If a wallclock
estimate is too short, the resource manager will kill the job once the wallclock time is reached.
On the other hand, if the wallclock estimate is much larger than the actual duration of the
job, it may be overlooked when the scheduler is looking for backfilling candidates.
Figure 3.2: Backfilling (three short jobs are backfilled into the CPU's left idle between t1 and t2, ahead of the large job scheduled at t2)
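A minimal Python sketch of this decision, assuming the reserved start time of the highest priority job has already been computed (the field names and the queue are illustrative, not Maui's internals):

    # Backfill: a waiting job may start early only if it fits in the free CPU's
    # and its walltime estimate expires before the large job's reserved start.
    def backfill_candidates(queue, free_cpus, now, reserved_start):
        window = reserved_start - now   # idle time before the large job starts
        return [job for job in queue
                if job["procs"] <= free_cpus and job["walltime"] <= window]

    queue = [{"name": "small1", "procs": 2, "walltime": 600},
             {"name": "small2", "procs": 4, "walltime": 7200}]
    print(backfill_candidates(queue, free_cpus=4, now=0, reserved_start=3600))
    # only 'small1' qualifies: 'small2' would delay the reserved large job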
Variable Partitioning
Many of the high performance computers require the users to submit the number of processors require for their job. When the processors become free, they are allocated to the
job, and run until completion or until it is killed. This is known as variable partitioning.
22
(Feitelson, 1994)
Preemption
Preemption is the ability to pause a running job and relocate it to another node set, where it is resumed as normal. Although this method seems quite simple in nature, it causes issues for specialty networking hardware. For instance, hardware such as Myrinet bypasses much of the operating system kernel by interfacing directly to the hardware, so there is no simple way to pause and relocate traffic while data is in transit.
Preemption is common in Vector machines, such as the Cray or NEC systems.
Priority and Quality of Service
Priority and Quality of Service is a method used for ’fairness’. The goals of this strategy are
to increase the availability of the machine for all users. By implementing a quota system,
where every user is assigned a quota of processing time, heavy users over their quota will
be penalised in terms of priority. A job is then ’weighted’ by a number relating to priority.
Depending on the job scheduler, this weight should equate to a faster turnaround.
3.6 Specific Job Schedulers
Although many job scheduling applications exist, we are looking at those which are usable
within the Portable Batch System.
3.6.1 FIFO Scheduler
The FIFO scheduler is a simple FCFS scheduler which is supplied with PBS. It is designed to be a starting point for further scheduler development, but many sites use it as their primary scheduler. The FIFO scheduler was the primary scheduler at VPAC until August 2003, when Maui was implemented.
3.6.2 Maui Scheduler
The Maui scheduler is a highly configurable and effective batch scheduler, currently being
used at many leading HPC facilities, including the University of Utah’s ’Icebox’ cluster. The
Maui scheduler was designed as a HPC scheduler with advanced features such as backfilling,
fairshare scheduling, multiple fairness policies, dynamic prioritization, dedicated consumable resource tracking and enforcement, and a very extensive advance reservation system
(Jackson, 1999).
Maui uses a calculated priority to make its job scheduling decisions. The priorities of all jobs in the queue are dynamically recalculated on each scheduling cycle. Maui gives the system administrator full access to define the priorities as he or she sees fit. The priority calculation is described below.
    Priority = QueueTimeWeight x QueueTimeFactor
             + FSWeight        x FSFactor
             + XFactorWeight   x XFactor
             + UrgencyWeight   x UrgFactor
             + QOSWeight       x J->QOS
             + BypassWeight    x J->Bypass
             + ResourceWeight  x J->MinProcs
Each of the weight values is defined in the Maui configuration file, to tailor Maui to the exact needs of the workload or organisation.
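As an illustration, the calculation is a plain weighted sum. A Python sketch with invented weights and factor values (in reality the weights come from the configuration file and the factors from each job):

    # Maui-style priority: a weighted sum of per-job factors,
    # recomputed for every queued job on each scheduling cycle.
    weights = {"queuetime": 1, "fs": 1, "xfactor": 10,
               "urgency": 0, "qos": 1, "bypass": 1, "resource": 1}

    def priority(job):
        return (weights["queuetime"] * job["queuetime_factor"]
                + weights["fs"]       * job["fs_factor"]
                + weights["xfactor"]  * job["xfactor"]
                + weights["urgency"]  * job["urg_factor"]
                + weights["qos"]      * job["qos"]
                + weights["bypass"]   * job["bypass"]
                + weights["resource"] * job["min_procs"])

    job = {"queuetime_factor": 120, "fs_factor": -5, "xfactor": 1.2,
           "urg_factor": 0, "qos": 0, "bypass": 3, "min_procs": 8}
    print(priority(job))  # 138.0

Each of the factors is described below.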
QueueTimeFactor
The QueueTimeFactor is a value determined by the amount of time a job has been waiting in
a queue to run. Therefore, the longer a job has been waiting, the higher its priority becomes.
FSFactor
The FSFactor is a fair share value, calculated from the historical CPU usage of the user, group and account associated with a job. Maui allows a CPU usage target to be set, so the usage of the machine can be fairly divided among users, groups and accounts.
XFactor
XFactor means expansion factor. Increasing this weight has the effect of pushing short-running jobs toward the top of the queue.
UrgFactor
The UrgFactor is an urgency value which is used to push jobs to completion within certain XFactor limits. When a job needs to be processed as soon as possible, without regard for other jobs in the queue, this value will ensure that it gets instant attention.
QOS
The Quality of Service (QOS) facility allows a site to give special privileges to certain users, groups or accounts. By increasing this weight, the scheduler gives a higher priority to jobs belonging to that user, group or account's QOS.
Backfilling
Every time backfilling occurs, the jobs in the queue with a higher priority have their Bypass
value incremented by one. This is to ensure that job starvation is minimised.
MinProcs
MinProcs is another value to prevent job starvation. Without it, jobs requiring many CPU's would be left starving.
The research by Bode et al. (2000) shows that an increase in system utilisation can be gained by using the Maui scheduler over the default PBS scheduler, FIFO. On the basis of research such as this, and of its advanced fairsharing capabilities, Maui is the selected scheduler of VPAC. Maui will ensure that the organisational goals for the cluster system are met, and each organisation, group and account will be monitored to ensure that they receive their entitlement of computing power.
3.7 Conclusion
This chapter has given an overview of the literature available for this topic area. It has introduced parallel high performance computing, some early parallel machines, batch processing
systems and job schedulers and some algorithms associated with them.
Chapter 4
Methodology
4.1 Introduction
This chapter will examine the research methodology selected for this project. It will include constraints of the project, the available research methodology options and the selected
methodology.
4.2 Selected research methodology
The aims of this project are to set up an effective job queueing and scheduling mechanism
for the VPAC high performance Linux cluster, using a test and measure approach.
Using the Maui scheduler plug-in to ScalablePBS, a simple scheduling system shall be
created using some estimated values. This system will allow users to log in and submit their
jobs to PBS, and then Maui will decide which jobs are run in which order, based on its
configuration.
The methodology used in this project consists of 4 phases:
Initial construction Build a simple PBS and Maui configuration.
Data analysis Examine the workload data and assess its suitability for the optimisation
simulations.
Develop simulations Using Maui, construct a simulation environment for simulating optimisation techniques.
Results analysis Analyse the data and draw conclusion about the effectiveness of the optimisation.
4.2.1 Initial construction
This phase covers the initial software setup and installation. The steps involved in this process are:
Resource manager
The first step in setting up a batch processing system is to set up a resource manager. The
original resource manager used at VPAC was OpenPBS 2.3.16 but after testing it for several
months, it was found to be unstable. ScalablePBS is the current resource manager and has
proved to be considerably more stable and scalable than OpenPBS.
During this phase, the default PBS scheduler, FIFO, will be used until the Maui scheduler has been tested and is ready for use.
Job Scheduler
The Maui scheduler was set up and installed into the system and tested. By starting with
a default configuration, Maui will interface with PBS to obtain and record job and node
information. While in test mode, Maui will not disrupt the scheduling of jobs while PBS’s
FIFO scheduler is running. This should allow for any configuration or software problems to
be detected before running ’live’.
4.2.2 Data Analysis
This phase of the project examines the data and assesses its suitability. The workloads of Grendel, Brecca and Icebox are compared, and a conclusion is made about the suitability of simulating the Icebox data.
4.2.3 Develop Simulations
This phase examines the job scheduling optimisation techniques, and develops a simulation
framework for running the optimisation experiment simulations.
The following experiments were conducted on the Icebox data:
• Backfill comparison
• Wallclock accuracy test
• Standing reservation for short jobs
Backfill Comparison
This experiment is designed to evaluate the effectiveness of the Maui backfilling methods on
the Icebox job data. Many papers such as Talby and Feitelson (1999) and Schweigelshohn
and Yahyapour (1998) have investigated backfilling methods, but neither have used the Maui
scheduler. Maui uses two main variations of backfilling: Firstfit and Bestfit. The Firstfit
method selects the first job in the queue that will fit in the available resource space, while
the Bestfit method selects a job to maximise the selected backfill criteria based on either
processor usage, processor second usage or seconds. The data will be simulated by running
many copies of Maui, each stopping at a point in time, and using either of no backfilling,
Firstfit or Bestfit backfilling. A graph of this will be generated for an easy comparison
between the backfilling methods. The metric that we will be using will be the number of
completed jobs in that time.
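The difference between the two selections can be sketched in Python as follows (a simplification: a single backfill window, with the Bestfit criterion fixed to processor usage):

    # Firstfit: take the first queued job that fits the backfill window.
    # Bestfit: among all fitting jobs, take the one maximising the chosen
    # criterion (here, processors used).
    def fits(job, free_cpus, window):
        return job["procs"] <= free_cpus and job["walltime"] <= window

    def firstfit(queue, free_cpus, window):
        for job in queue:
            if fits(job, free_cpus, window):
                return job
        return None

    def bestfit(queue, free_cpus, window):
        candidates = [j for j in queue if fits(j, free_cpus, window)]
        return max(candidates, key=lambda j: j["procs"], default=None)

    queue = [{"name": "a", "procs": 1, "walltime": 300},
             {"name": "b", "procs": 6, "walltime": 900}]
    print(firstfit(queue, 8, 1000)["name"])  # a
    print(bestfit(queue, 8, 1000)["name"])   # b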
Wallclock Accuracy Test
The wallclock accuracy test is designed to evaluate the effect that the predicted wallclock time has on the backfilling scheduling algorithm. When scheduling jobs, Maui uses the wallclock time which users submit with their jobs as an estimate of how long each job will run. From this value, Maui can make a basic prediction of when the job will finish, so that it can attempt to schedule more jobs after the current jobs have finished. Maui can then guarantee the latest time that a job can start and finish. For Maui to guarantee that jobs will finish by a certain time, it must enforce the wallclock limit on running jobs. If a job runs longer than its specified wallclock time, it will be killed to ensure that it does not delay queued jobs. Therefore, users are advised to slightly increase this wallclock value to ensure that their job is not killed accidentally. The simulations for this experiment test from 10 percent to 100 percent accuracy, in steps of 10 percent. In each round of simulations, the simulator tests Firstfit backfilling, Bestfit backfilling and no backfilling. The results will then be graphed to give an indication of how important wallclock accuracy is to backfilling. The default backfilling metric for Bestfit is to best fit the available resources by processors, and this value was used throughout the simulations.
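For reference, a simulation run of this kind is driven from the Maui configuration file. A hypothetical fragment is shown below; the parameter names are as recalled from the Maui Scheduler Administrator's Guide (and SIMWCACCURACY as discussed in Chapter 6), so the exact spellings should be checked against the documentation:

    # maui.cfg fragment for one simulation run (illustrative values)
    SERVERMODE            SIMULATION
    SIMWORKLOADTRACEFILE  icebox.trace    # job trace to replay
    SIMRESOURCETRACEFILE  icebox.nodes    # node/resource description
    SIMWCACCURACY         40              # force 40% wallclock accuracy
    BACKFILLPOLICY        FIRSTFIT        # or BESTFIT, or NONE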
Standing Reservation for Short Jobs
The aim behind this experiment is to reduce the expansion factor for short running jobs.
Expansion factor is defined as:
Expansion Factor = 1 + Queue Wait Time / Job Execution Duration
For example, researchers want a quick turnaround time for their testing jobs while they refine their processes and data. A short turnaround time for these short jobs is essential for the researchers to progress with their research. By defining a block of nodes dedicated exclusively to short jobs, these jobs can run not only in the dedicated block but also on the rest of the system.

Jobs were split into three categories: small, medium and large. Jobs running for up to two hours are defined as small, medium jobs run between 2 and 8 hours, and any job running longer than 8 hours is defined as a large job.
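A small Python sketch of this classification and of the expansion factor defined above (times in hours, values made up):

    # Expansion factor = 1 + (queue wait time / job execution duration).
    def xfactor(wait_h, run_h):
        return 1 + wait_h / run_h

    def job_class(run_h):
        if run_h <= 2:
            return "small"
        return "medium" if run_h <= 8 else "large"

    # A 1 hour test job that waited 30 hours has a very poor expansion factor,
    print(job_class(1.0), xfactor(30.0, 1.0))    # small 31.0
    # while the same wait hurts a 24 hour job far less.
    print(job_class(24.0), xfactor(30.0, 24.0))  # large 2.25

This is why the standing reservation targets short jobs: a given wait inflates their expansion factor the most.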
4.2.4 Optimisation
Using the information found from the data analysis, the values governing the current scheduling system will be updated. The data collected in the earlier phase will then be put into the
Maui scheduler to output a figure that should indicate whether the optimisation would have
been successful on that particular data set.
4.3 Conclusion
This chapter has outlined the methodology to be used in this project, and has explored the experiments to be performed. The results of these experiments are presented in the next chapter.
Chapter 5
Analysis of Results
5.1 Introduction
This chapter addresses the research question by presenting, analysing and discussing the results from simulations of the described scheduling techniques. The quality of the gathered data is examined first, followed by the experimental results. Finally, a conclusion is made about the scheduling techniques used.
5.2 Aims
The research question proposed is ”What is an efficient technique to schedule jobs on VPAC’s
high performance computer?” To answer this question, and the questions it raises, a series
of simulations were run to test various scheduling techniques. Data from the University of
Utah’s cluster, Icebox, was used to perform simulations of three job scheduling techniques.
The methodology behind these experiments is found in Chapter 4.
5.2.1 Workload Profile Comparison
At the time the simulations were undertaken, Brecca, VPAC's Linux cluster, was still in development and therefore underutilised. This underutilisation is due to factors such as Brecca still being tested for stability, and many libraries and applications used on Grendel not yet having been ported over to Brecca. The workload profile of Brecca is expected to slowly change to resemble the current workload profile of Grendel as users from VPAC's old system are migrated over.
The usefulness and validity of the simulations undertaken as part of this project depend on using job scheduling data similar in profile to the workload of a fully loaded Brecca at some time in the future. Brecca is the successor to Grendel and will inherit Grendel's workload once Grendel is decommissioned. Hence the most accurate prediction is the recent workload history of Grendel.
The workload profile of Grendel is shown in Figure 5.1 and Table 5.1. This data was
gathered through scripts run on Grendel to extract job data on a daily basis and although
this data is an accurate reflection of Grendel’s workload, it is not suitable for Maui simulation
because it does not comply with the stringent format the Maui simulator requires.
    Processors   Total
    1              695
    2              212
    4              303
    8              103
    16              14
    32               7
    64               0
    128              0
    Total         1334

Table 5.1: Job Data on Grendel

Figure 5.1: Grendel Workload Profile (number of jobs by number of requested CPU's)

Job profile data was gathered from Brecca over a period between October 2003 and November 2003, with a total of 7524 jobs being processed in that time. This data is shown
in Table 5.2 and Figure 5.2. This data is also unsuitable for simulation because it was gathered during a period of commissioning and has a definite predominance of single processor jobs.

    Processors  0:02:00  0:04:00  0:08:00  0:16:00  0:32:00  1:04:00  2:08:00  4:16:00  8:32:00  17:04:00  34:08:00  Total
    1                 0        0        3     2552       21       69       18        0      219        13      3193   6088
    2                58        9        0      102        0        0        9        0        0         0       372    550
    4                25       18       14       66        0        0        3        0        0         8       322    456
    8                 5        0        0       28        0        0        9        0        0         7       144    193
    16                0        0        0       23        0        0        0        3        0         0        33     59
    32               21        0        0        4        0        0        0        0        3        11       104    143
    64                0        0        0        0        0        0        0        0        0         7        28     35
    128               0        0        0        0        0        0        0        0        0         0         0      0
    Total           109       27       17     2775       21       69       39        3      222        46      4196   7524

Table 5.2: Job Data on Brecca (rows: processors requested; columns: job time bins, h:mm:ss)

Figure 5.2: Brecca Workload Profile (number of jobs by number of requested CPU's)
Data gathered from the University of Utah's Center for High Performance Computing cluster, Icebox, was obtained from the supercluster.org tracefile repository (located at http://supercluster.org/research/traces). An analysis of this data is shown in Figure 5.3 and Table 5.3.

    Processors  0:02:00  0:04:00  0:08:00  0:16:00  0:32:00  1:04:00  2:08:00  4:16:00  8:32:00  17:04:00  34:08:00  Total
    1                71       18       77      373      244      674      642      378      469       445      3851   7242
    2               128        1      166      100       78       74       54       56       65        74       665   1461
    4                97        0       60      211      116      204      104      109      165       270      1838   3174
    8                21        0       20       28       23      145      257      755      950       687      2128   5014
    16               25        0        5       86        9       77       63       58      457       145       538   1463
    32                0        0        0        5        6       34       15       85       89        82       885   1201
    64                0        0        0        2        0       10        3        1        3         2        27     48
    128               0        0        0        0        0        0        2        0        0         0         5      7
    Total           342       19      328      805      476     1218     1140     1442     2198      1705      9937  19610

Table 5.3: Job Data on Icebox (rows: processors requested; columns: job time bins, h:mm:ss)

Figure 5.3: Icebox Workload Profile (number of jobs by number of requested CPU's)

The workload profiles of Grendel and Icebox show strong similarities; both profiles have a predominant number of single processor jobs and a large number of 2, 4 and 8 processor jobs. It is the mix of single processor and multiprocessor jobs that causes resource fragmentation and therefore provides an opportunity for optimisation using backfilling and a standing reservation for short jobs. The similarity between the Icebox and Grendel workload profiles justifies using the Icebox data for the experiments of this project.
5.3 Scheduling Experiments

5.3.1 Experiment 1: Backfilling
The aim of this experiment is to evaluate the effectiveness of the two backfilling methods supported by Maui. The methodology behind this experiment is found in section 4.2.3. Three rounds of simulations were run, each round modifying the backfilling method used. The first round of simulations disabled backfilling, the second round tested the First Fit algorithm
while the third round tested the Best Fit algorithm. Within each simulation round, the number of days for each simulation was increased from 10 to 90 days, in 10-day steps. In total, 27 simulations were executed using the Icebox data. The results of the simulations for the Icebox data can be seen in Figure 5.4.

This graph (Figure 5.4) shows the number of completed jobs, from 5 to 90 days, comparing the two backfilling methods and backfilling disabled. The results show almost twice as many jobs being processed in the same time period when either of the two backfilling methods was used, compared with no backfilling. It is interesting to note that the two backfilling methods showed very similar results, although Firstfit performed marginally better.
The results shown here demonstrate the effect that backfilling is having on this workload.

Figure 5.4: Number of Completed Jobs: Icebox (completed jobs over the simulated period for no backfilling, Firstfit and Bestfit)

When backfilling is disabled, the scheduler uses a FCFS algorithm. The results
in Figure 5.4 highlight the inefficiency of this algorithm. To understand how this situation
arises, consider the following example.
The next job in the queue requests more resources than are currently available, and
therefore must wait until those resources are freed. If that job requests a substantial number
of CPU’s, a considerable amount of resources will remain idle until the requested job can
start. In the two backfilling tests we see short jobs processed in the resource gaps during the period that the large job waits for the total amount of requested resources to become available. Figure 3.2 in Chapter 3 shows an example of this. Although Figure 5.4 shows that backfilling has doubled the throughput of jobs, this does not necessarily translate into a doubling of system utilisation. The increase in job throughput is the direct result of the backfilling algorithm assigning idle resources to short jobs to reduce fragmentation.
5.3.2 Experiment 2: Wallclock Accuracy
This experiment aims to test the effect that wallclock accuracy has on the backfilling technique. The methodology behind this experiment can be found in section 4.2.3.
Figure 5.5: Wallclock effect on Backfilling measuring Dedicated Processor Hours: Icebox (dedicated processor hours, in percent, against wallclock accuracy from 10% to 100%, for Firstfit and Bestfit)
In this experiment, the wallclock accuracy was tested against dedicated processor hours.
This serves as a good metric for system utilisation. The graph (Figure 5.5) shows that
the predicted wallclock accuracy does have an impact on the utilisation of the cluster. At
10% accuracy, it seems that the two backfilling methods can provide a system utilisation of
around 75%, while at 100% accuracy, the scheduler can operate at 98% system utilisation.
This can be expected, as the scheduler uses the wallclock accuracy when scheduling jobs
using backfilling. The scheduler makes the best decisions it can by using the wallclock value
as predicted by the users, but this value cannot be controlled. The accuracy of the wallclock
value predicted can only be improved by experience and knowledge of the user. As stated
in the Maui Scheduler Administrator’s Guide, reasonable wallclock accuracy is around 40%.
The average wallclock accuracy of the Icebox data was 30%.
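The 30% figure is a mean of per-job accuracy, where a job's accuracy is usually defined as its actual runtime divided by its estimated wallclock time. A Python sketch of how such a figure can be computed from a trace (the job values are made up):

    # Wallclock accuracy of a job = actual runtime / user's wallclock estimate.
    jobs = [(600, 3600), (7200, 7200), (1800, 14400)]  # (actual_s, estimated_s)

    accuracies = [actual / estimate for actual, estimate in jobs]
    print(sum(accuracies) / len(accuracies))  # mean accuracy, about 0.43 here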
5.3.3 Experiment 3: Standing Reservation for Short Jobs
A technique used in HPC is to give special privileges to short running jobs, so that they
may be processed quickly. Often short jobs are queued by users as a test before launching
a much longer job. For example, researchers want a quick turnaround time for their testing
jobs, while they refine their processes and data. A short turnaround time for these short
jobs is essential for the researchers to progress with their research. By defining a block of
resources for dedicated use by short jobs, their turnaround time can be reduced significantly.
This technique is quantified by the expansion factor. The expansion factor is relative to the size of the job and the duration it waits in the queue, so to achieve an overall smaller expansion factor, short jobs need to be given priority to start over larger jobs. The aim of this experiment is to reduce the expansion factor for short jobs, which should in turn reduce the overall expansion factor for all jobs. The methodology behind this experiment is found in section 4.2.3.
From the graph of results (Figure 5.6) it can be seen that increasing the number of processors dedicated to the short job standing reservation decreases the expansion factor.
Figure 5.6: Effect of Standing Reservation for Short Jobs: Icebox (average expansion factor for short, medium and long jobs against the number of reserved CPU's, from 4 to 256)
From the simulation, the best combination for reducing the expansion factor for short jobs appears to be around the 32 or 64 node mark. At these points the expansion factor for the short jobs is small, but it does not significantly affect the expansion factor for larger jobs.
5.4 Conclusion
This chapter has presented the results of the experiments discussed in the last chapter. A
conclusion of the results is made in the next chapter.
Chapter 6
Conclusion
6.1 Introduction
This chapter discusses the conclusions made from the results in the last chapter, discusses
any problems encountered throughout the project and suggests ideas for further research in
this area.
6.2 Results
The research question proposed is "What is an efficient technique to schedule jobs on VPAC's high performance computer?" By testing the backfilling algorithm, the wallclock accuracy when using backfilling, and a standing reservation for short jobs, we see that these optimisation techniques have proved significant.
The backfilling tests showed nearly twice as many jobs being processed in the same time period using either Firstfit or Bestfit backfilling, compared to using no backfilling. Neither Firstfit nor Bestfit showed any great advantage over the other, demonstrating that the particular method of backfilling was not an important factor. An important factor in the success of backfilling is the wallclock accuracy, which was also tested. The accuracy of the wallclock estimate had a direct relationship with system utilisation: with a wallclock accuracy of 100%, the utilisation of the system ran at an average of 98%. Although this shows a great increase in performance, it would be unrealistic for users to predict their job duration to this degree. Regardless of the wallclock accuracy, backfilling has proved itself a significant optimisation in job scheduling.
The short job reservation proved successful by reducing the turnaround time for short jobs. The average turnaround time for these short jobs dropped from over 30 hours down to 2 hours at the 32 processor mark on the Icebox data, without penalty to medium or large jobs. This ensures that short jobs, designed for testing, can be processed with a quick turnaround time, allowing researchers to reduce wasted computing power and make better use of their time.
6.3 Problems Encountered
Although the Maui Scheduler team claim that Maui is 'the most advanced scheduler in the world', it still has a long way to go. From a simulation perspective, it does support many nice features, but they are not always successful. This is often due to a lack of documentation, and to documentation that is incorrect. For example, fixing the wallclock accuracy to measure the effect that it has on backfilling was a difficult task. Initially, this value was stumbled upon in a paper about simulations with Maui (Jackson et al., 2001). The value quoted there was 'SIMWCA', but the actual parameter was 'SIMWCACCURACY', which was obtained from a Maui log with the log detail set to 9. Many issues like this were encountered, and many emails were exchanged with the supercluster.org group. Their responses were exceptional, but would not have been needed if the relevant documentation existed.
6.4 Further Research
In this project, several techniques were used not only to increase system utilisation but also to reduce turnaround time for short jobs. One aspect that was not investigated as part of this project was user feedback. Throughout the simulations, a constant job depth of 32 jobs was used, ensuring that at all times there were 32 jobs waiting in the queue. Although this was sufficient for testing backfilling, it did not consider the fact that users generally do not submit more jobs until they have the results from their previous jobs. Therefore, the queue would not necessarily have maintained a constant depth of 32 jobs. Realistically, the submission of jobs could be much more varied, and possibly affected by factors such as day of week, time of day or even an event such as a research conference. Some papers (Feitelson and Nitzberg, 1995; Hotovy, 1996) discuss workload evaluations, and Jackson et al. (2001) discuss this issue further. For further research, this issue of varied job submission could be investigated. In the case of VPAC, once the machine reaches a job/work saturation point, data could be collected and further optimisation could be done. This could give a better understanding of the workload, and provide a more robust optimisation.
Bibliography
Aspen Systems (2003). The era of supercomputing. Retrieved 21st November, 2003 from http://www.aspsys.com/clusters/beowulf/history.

Bode, B., Halstead, D. M., Kendall, R., and Lei, Z. (2000). The Portable Batch Scheduler and the Maui Scheduler on Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta.

Buyya, R. (1999). High Performance Cluster Computing: Architectures and Systems. Prentice Hall PTR, NJ, USA.

Feitelson, D. (1994). Job scheduling in multiprogrammed parallel systems. Technical report, IBM Research Report RC.

Feitelson, D. and Nitzberg, B. (1995). Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860. In Feitelson, D. G. and Rudolph, L., editors, Proceedings of IPPS '95 Workshop on Job Scheduling Strategies for Parallel Processing, volume 949, pages 337–360. Springer.

Hotovy, S. (1996). Workload evolution on the Cornell Theory Center IBM SP2. In Feitelson, D. G. and Rudolph, L., editors, Proceedings of IPPS '96 Workshop on Job Scheduling Strategies for Parallel Processing, pages 27–40. Springer-Verlag.

IBM Corporation (2003). IBM eServer Cluster 1350 Description. IBM Corporation.

Jackson, D. B. (1999). Advanced scheduling of Linux clusters using Maui. Retrieved 25th August from http://supercluster.org/research/papers/xlinux99.html.

Jackson, D. B., Jackson, H., and Snell, Q. O. (2001). Simulation based HPC workload analysis. In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS-01), San Francisco, CA, April 23-27, 2001. IEEE Computer Society.

Majumdar, S., Eager, D. L., and Bunt, R. B. (1988). Scheduling in multiprogrammed parallel systems. In Proceedings of the 1988 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 104–113. ACM Press.

Merkey, P. (2003). Beowulf history. Retrieved 21st November, 2003 from http://www.beowulf.org/beowulf/history.html.

Pfister, G. F. (1998). In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing. Prentice Hall, 2nd edition.

Schwiegelshohn, U. and Yahyapour, R. (1998). Analysis of first-come-first-serve parallel job scheduling. In Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms).

Schwiegelshohn, U. and Yahyapour, R. (2000). Fairness in parallel job scheduling. Journal of Scheduling.

Streit, A. (2001). On job scheduling for HPC clusters and the dynP scheduler. Lecture Notes in Computer Science, 2228.

Supercluster Research and Development Group (2002). Maui Scheduler Administrator's Guide. Supercluster Research and Development Group.

Talby, D. and Feitelson, D. (1999). Supporting priorities and improving utilization of the IBM SP scheduler using slack-based backfilling. In Proceedings of the 13th International Parallel Processing Symposium.

Wikipedia (2003). ILLIAC IV. Retrieved 4th June, 2003 from http://www.wikipedia.org/wiki/ILLIAC_IV.

Zenios, S. A. (1999). High-performance computing in finance: The last 10 years and the next. Parallel Computing, 25:2149–2175.