Job Scheduling with High Performance Computers

by Andrew Botting

Submitted to the School of Information Technology and Mathematical Science on November 12, 2003, in partial fulfillment of the requirements for the degree of Bachelor of Computing (Honours) at the UNIVERSITY OF BALLARAT.

© University of Ballarat, School of ITMS 2003. All rights reserved.

Certified by: Glenn Stevens, Lecturer, School of ITMS, Thesis Supervisor

Certified by: David Bannon, Systems Manager, Victorian Partnership for Advanced Computing, Thesis Supervisor

Abstract

The Victorian Partnership for Advanced Computing (VPAC) installed a Linux-based high performance cluster for solving complex computational problems. The question this raises is: what is an effective technique for scheduling jobs for execution in this environment? Scheduling jobs as they arrive (First-Come-First-Serve) is a fair way to schedule jobs, but it can lead to fragmentation and low system utilisation while the system slowly gathers the resources needed to service the next job in the queue. One answer to this problem is the use of backfilling. Backfilling allows jobs to be executed out of order and, by making intelligent scheduling decisions, makes better use of system resources. Job workloads from the University of Utah's 'Icebox' cluster were used to perform three scheduling experiments aimed at optimising the job scheduler. The backfilling methods supported by Maui were tested to find their effect on the job data, and were found to reduce job turnaround time and increase job throughput. The effect of wallclock accuracy on the supported backfilling methods was also tested, and the results showed that accurate wallclock estimates from users can increase system utilisation by 20%. The third experiment tested the expansion factor of short jobs executing for less than two hours, by creating a standing reservation. The results showed that with the Icebox data, the more nodes dedicated to the reservation, the lower the expansion factor became. A standing reservation of 32 nodes proved to be the best solution for reducing the expansion factor of short jobs without a significant increase in the expansion factor of larger jobs.

Thesis Supervisor: Glenn Stevens
Title: Lecturer, School of ITMS

Thesis Supervisor: David Bannon
Title: Systems Manager, Victorian Partnership for Advanced Computing

Acknowledgments

I would like to thank:

• My family for their wonderful support throughout this year.
• Bek Trotter for her understanding, patience, support and assistance throughout the year.
• Glenn Stevens for his supervision and guidance of this project.
• David Bannon, Chris Samuel and everyone at VPAC for making me part of the team, and allowing me to use their facilities.

Contents

1 Introduction
  1.1 Introduction
  1.2 Research question
  1.3 Methodology
  1.4 The Organisation Of The Thesis

2 Background
  2.1 Introduction
  2.2 Victorian Partnership for Advanced Computing
    2.2.1 VPAC Hardware Profile
    2.2.2 Policies
    2.2.3 VPAC's System Options
  2.3 Conclusion

3 Literature Review
  3.1 Introduction
  3.2 Parallel High Performance Computing
    3.2.1 High performance computing architectures
    3.2.2 Early High Performance Computers
  3.3 Batch Processing System
    3.3.1 Resource Managers
    3.3.2 VPAC's Choice of a Batch Processing System
  3.4 Job scheduling
    3.4.1 The Scheduling Issue
  3.5 Job scheduling strategies
  3.6 Specific Job Schedulers
    3.6.1 FIFO Scheduler
    3.6.2 Maui Scheduler
  3.7 Conclusion

4 Methodology
  4.1 Introduction
  4.2 Selected research methodology
    4.2.1 Initial construction
    4.2.2 Data Analysis
    4.2.3 Develop Simulations
    4.2.4 Optimisation
  4.3 Conclusion

5 Analysis of Results
  5.1 Introduction
  5.2 Aims
    5.2.1 Workload Profile Comparison
  5.3 Scheduling Experiments
    5.3.1 Experiment 1: Backfilling
    5.3.2 Experiment 2: Wallclock Accuracy
    5.3.3 Experiment 3: Standing Reservation for Short Jobs
  5.4 Conclusion

6 Conclusion
  6.1 Introduction
  6.2 Results
  6.3 Problems Encountered
  6.4 Further Research

List of Tables

2.1 Institution usage percentages
2.2 Priority levels on Grendel
5.1 Job Data on Grendel
5.2 Job Data on Brecca
5.3 Job Data on Icebox

List of Figures

2.1 VPAC's Linux cluster, 'Brecca'
3.1 First-Come-First-Serve
3.2 Backfilling
5.1 Grendel Workload Profile
5.2 Brecca Workload Profile
5.3 Icebox Workload Profile
5.4 Number of Completed Jobs: Icebox
5.5 Wallclock effect on Backfilling measuring Dedicated Processor Hours: Icebox
5.6 Effect of Standing Reservation for Short Jobs: Icebox

Chapter 1

Introduction

1.1 Introduction

High performance computing has been defined as computing resources much greater in magnitude of computing power than those normally found on one's desk (Zenios, 1999). High performance computers exist in several forms, the most common being symmetric multiprocessor systems, massively parallel systems, vector systems and clusters of workstations. Clustering offers significant price/performance advantages for many high-performance workloads. Linux clusters can further extend these advantages by harnessing low-cost servers and Open Source software (IBM Corporation, 2003). This thesis investigates cluster optimisation techniques by testing them and applying them to the VPAC cluster.

1.2 Research question

The research that will be conducted asks "What is the most efficient technique to schedule jobs on VPAC's high performance computer?" In an attempt to answer this question, several sub-questions will be addressed:

• What are VPAC's requirements?
• How is efficiency measured?
• What job scheduling techniques are available?
• What techniques will be used, and why?

1.3 Methodology

The methodology for this project will be a test and measure approach. Data will be collected and analysed, with the results being used to optimise the job scheduling algorithm. This is discussed in more detail in Chapter 4.

1.4 The Organisation Of The Thesis

Chapter 2 describes the background of this project. Chapter 3 examines literature surrounding the topic area. Chapter 4 outlines the methodology being employed, and discusses why this approach has been taken. Chapter 5 analyses and discusses the results of the project, and Chapter 6 draws some conclusions regarding the success of the project.

Chapter 2

Background

2.1 Introduction

This chapter is an overview of the background to this project. It outlines the VPAC organisation, its goals and objectives, and the policies which will need to be addressed in a cluster scheduling solution.

2.2 Victorian Partnership for Advanced Computing

The Victorian Partnership for Advanced Computing (VPAC) is an organisation established in 2000 by six Victorian universities: La Trobe University, Monash University, RMIT University, Swinburne University of Technology, The University of Ballarat and The University of Melbourne. VPAC's goal is to provide high performance computing (HPC) facilities to its member universities.
By combining the resources and skills of the member universities, VPAC can deliver high-performance computing resources which these universities could not provide in isolation. VPAC receives funding from its six founding universities, the Australian Partnership for Advanced Computing (APAC) and the Federal Government, in the form of a $6 million science and technology grant over three years. VPAC is highly committed to research and development, industry training and support in the area of high performance computing. More information can be found at the VPAC website: http://www.vpac.org.

2.2.1 VPAC Hardware Profile

VPAC's current system is a Compaq AlphaServer SC with 128 nodes and 1.4 TB of RAM. Its name is 'Grendel', and it is used by all supporting universities. Grendel is a batch processing system, allowing jobs to be submitted to a queue and processed once the requested resources become available.

2.2.2 Policies

Jobs submitted to Grendel are given a priority level depending on which category they best fit. These categories use a quota system which guarantees the supporting universities are allocated their entitled share of computational time (see Table 2.1). The quotas are calculated from the size of the contribution made by each university. VPAC not only enforces a quota on each university, but also a project quota: each project registered on Grendel has a percentage of the total quota of its institution, creating a split-level quota system. Table 2.2 shows the priority levels currently enforced on Grendel.

Institution                          Usage percentage
Monash University                    21.25%
University of Melbourne              21.25%
La Trobe University                  15.94%
RMIT                                 15.94%
Swinburne University of Technology    7.97%
University of Ballarat                2.66%
VPAC                                  5.00%

Table 2.1: Institution usage percentages

Job category                                          Priority
Running Parallel                                      94
Running Parallel Under Quota but Institute Over       90
Running Parallel Over Quota                           84
Running Single CPU                                    70
Suspended Parallel                                    80
Suspended Single CPU                                  72
Queued Parallel Under Quota                           75
Queued Single Under Quota                             60
Queued Parallel Proj Under Quota but Institute Over   65
Queued Parallel Over Quota                            60
Queued Single Over Quota                              50

Table 2.2: Priority levels on Grendel

2.2.3 VPAC's System Options

Two years after the purchase of Grendel, VPAC decided that more computing power was needed. After weighing up their options, VPAC chose a Linux-based IBM cluster. The IBM system was chosen for two reasons: the proposal offered a very good price for the computational performance of the system, and IBM was interested in setting up a business relationship with VPAC in the field of research and development. VPAC eventually decided on the IBM eServer xSeries 335 Intel-based Linux cluster (see Figure 2.1). The agreement made between IBM and VPAC was that the system would perform at a minimum of 600 Gigaflops. After applying some optimisation techniques, such as the Intel compilers and the MPICH message passing library working directly with the Myrinet network interface, VPAC achieved a LINPACK mark of 631 Gigaflops.

Figure 2.1: VPAC's Linux cluster, 'Brecca'

2.3 Conclusion

This chapter outlined the background on which this project is based, including VPAC, their choice of system, their existing set up and the policies which govern it. The next chapter will examine the literature surrounding the area of clusters and job scheduling.

Chapter 3

Literature Review

3.1 Introduction

This chapter provides an overview of high performance computing and job scheduling algorithms and solutions.
It presents some solutions to job scheduling issues, along with some of the resource managers and job scheduling packages available.

3.2 Parallel High Performance Computing

High performance computing has been defined as computing resources much greater in magnitude of computing power than those normally found on one's desk (Zenios, 1999). Many applications require amounts of computing power that desktop computers simply cannot fulfill. In the early 1980s it was believed that the only way to increase computing power was to build faster processors. However, constraints exist which hinder the development of faster processors, and from this need parallel computing evolved. Parallel high performance computers are the result of connecting multiple processors together and coordinating their efforts (Buyya, 1999). Parallel HPCs also provide the benefit of being scalable: more processors can be added, increasing the computing power. Pfister (1998) defines a cluster as "a type of parallel or distributed system that: consists of a collection of interconnected whole computers, and is used as a single, unified computing resource." High performance technical computing is the idea of dedicating available CPUs to a single job, rather than all jobs sharing a slice of the aggregate computing power. This is often found in research environments.

3.2.1 High performance computing architectures

Many system architectures have emerged in the parallel computing area. They are classified by the arrangement of the processors, the memory and their interconnect. Some of the most common systems are:

• Vector
• Massively Parallel Processors (MPP)
• Symmetric Multiprocessor (SMP)
• Distributed Systems and Clusters

Vector Computers

Vector computers have specially designed processors optimised for arithmetic operations on elements of arrays, known as vectors. This meant that large amounts of mathematical data could be handled much faster than by other types of processors, though vector CPUs could be slowed down by complex instructions. Machines such as the Cray series of HPCs used this architecture and were the fastest of their time (Aspen Systems, 2003).

Massively Parallel Processors

Massively parallel machines consist of a collection of separate units, known as nodes. The nodes operate totally independently of each other and are connected by a high speed network. Each node can resemble a desktop system in some respects, because it contains hardware such as hard drives and memory, and its own copy of the operating system.

Symmetric Multiprocessor

Symmetric multiprocessor machines contain two or more processors which work independently of each other. They are connected by a very high speed, low latency bus, most often a motherboard. The processors share all hardware, including memory and the operating system. SMP machines are unable to scale well due to their bus and memory bandwidth limitations.

Distributed Systems and Clusters

Distributed systems and clusters are similar to MPP machines in that they consist of totally separate nodes. The difference is that they can often be normal desktop systems connected by a standard networking interconnect. One variation of this is the Cluster of Workstations (COW). The COW model is sometimes used by companies with many desktop machines which have very low utilisation: when these desktops are not in use, they can be controlled by a central server to become a very large parallel computer.
3.2.2 Early High Performance Computers

The Illiac IV

One of the first parallel high performance computers was the Illiac IV. The project started in 1965 and ran successful tests in 1976. The initial predicted cost was US$8 million in 1966, but this escalated to $31 million by 1972, and the predicted 1000 MFLOPS ended up being closer to 15 MFLOPS. The computer was a failure as a production system, but it paved the way for research in the area (Wikipedia, 2003).

The Cray-1

The Cray-1 was the first machine manufactured by the Cray company, in 1976. It was revolutionary for its time and paved the way for many more Cray machines.

The Beowulf Cluster

The Beowulf cluster project was started in late 1993 by two men, Donald Becker and Thomas Sterling. Their theory was that the cost of HPCs could be reduced by using Commodity-Off-The-Shelf (COTS) components and coupling them together with a standard networking interconnect. The original Beowulf cluster was sixteen 486DX4 processors connected by 10 Mbit/s channel bonded Ethernet. A single 10 Mbit/s card could not provide enough network bandwidth, so Becker rewrote his Ethernet drivers for Linux and built a "channel bonded" Ethernet where the network traffic was striped across two or more Ethernet cards. The two keys to the success of this system were the availability of cheap COTS hardware and the maturity and robustness of Linux. By decoupling the hardware and software, the Beowulf cluster remains vendor independent, and by using open source software, programmers have a guarantee that the programs they write will run on future Beowulf clusters. From the progress made by the Beowulf community, researchers came to recognise Beowulf clusters as their own genre within the HPC community. The Beowulf architecture is not quite an MPP and not quite a COW; it falls somewhere between the two (Merkey, 2003).

3.3 Batch Processing System

A batch processing system provides users with a mechanism for submitting, launching and tracking jobs on a shared resource (Supercluster Research and Development Group, 2002). Batch systems attempt to share a HPC's resources in a fair and efficient manner within three main areas:

• Traffic Control
• Site Policies
• Optimisations

Traffic Control

A batch processing system is responsible for controlling jobs. If jobs contend for a system's resources, system slowdown results. The traffic control system defines and allocates particular resources to particular jobs, ensuring that jobs do not interfere with each other.

Site Policies

When a HPC is installed, it is usually installed for a particular purpose. The site policy defines the rules which govern the system: how it should be used, how much it is used and by whom.

Optimisations

When the demand on a HPC becomes greater than supply, intelligent decisions about scheduling jobs can achieve greater optimisation. It is the role of the job scheduler to make intelligent decisions to increase optimisation.

3.3.1 Resource Managers

A resource manager is a server which implements a batch processing system. Two are examined here:

• Network Queue System (NQS)
• Portable Batch System (PBS)

Network Queue System

Batch queuing started with the Network Queue System (NQS), which was developed at NASA's Ames Research Center. It was designed to run on their Cray-2 and Cray Y-MP supercomputers and to select a good job mix to achieve high machine utilisation. It was the first of its kind, and soon became the de facto standard.
Jobs submitted to NQS must state their memory and CPU requirements, and are placed in a suitable queue. The scheduler then selects jobs based on these properties to create an efficient mix. The limitations of NQS soon showed: it was not configurable enough for tuning purposes, and it did not support parallel processing architectures. These limitations prompted work on a new package.

The Portable Batch System

The Portable Batch System (PBS) is a batch processing system, also designed at NASA's Ames Research Center, which fills the need for a resource management system that can handle parallel jobs. Jobs are placed in the queue, and the job scheduler component of PBS decides which resources should be allocated to each job. The term 'resources' is generally used to describe HPC resources, meaning CPUs, memory and hard disk. The PBS server sends the required information to the nodes running a PBS daemon called the 'mom'. The mom processes the job and sends the required job information back to the server. PBS includes several of its own schedulers, the primary one being the First-In-First-Out (FIFO) scheduler. The focus of PBS development has shifted toward resource management for clusters, as they have become a much more viable option for many organisations. PBS is now owned by Altair. The three versions of PBS currently available are:

• OpenPBS
• PBSPro
• ScalablePBS

OpenPBS

OpenPBS is a free version of PBS. It makes the source code readily available, but the latest release is not licensed under the GNU General Public Licence, so users are not permitted to modify and redistribute it. Many sites using OpenPBS have created patches tailoring it to their needs, and have made their patches available over the Internet. More information about OpenPBS can be found at http://www.openpbs.com.

PBSPro

PBSPro is a commercial version of PBS which users must purchase for their machines. It provides many enhancements over the free version, including product support, which may make it a viable option for many sites. More information about PBSPro can be found at http://www.pbspro.com.

ScalablePBS

ScalablePBS differs from both OpenPBS and PBSPro. The other two versions of PBS were owned and distributed by Veridian Systems, whereas ScalablePBS is the child of the supercluster.org group. Many bugs were found in OpenPBS which could cause the batch system to fail, so a stable, robust version of PBS was needed. The supercluster.org group therefore took the OpenPBS source at a point where the licence allowed redistribution and applied many of the OpenPBS patches which had been created by various cluster sites. This resulted in a more stable and scalable PBS. More information about ScalablePBS can be found at http://supercluster.org/projects/pbs.

3.3.2 VPAC's Choice of a Batch Processing System

ScalablePBS was chosen by VPAC for the following reasons:

• Free.
• Open solution (source code available).
• Linux and cluster support.
• Stable and robust.

3.4 Job scheduling

Job scheduling on a parallel system is the activity of assigning a limited number of system resources to a process so that it can be executed. The term 'resources' in this context refers to memory, hard disk space and CPU time.

3.4.1 The Scheduling Issue

The two main objectives of the job scheduler, quantified in the sketch following this list, are as follows:

• Increase system utilisation.
• Minimise job turnaround time.
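Both objectives can be computed from a job trace. The following Python sketch shows one way to measure them; the JobRecord fields and the numbers in the example are illustrative assumptions, not VPAC's actual accounting format.

    from dataclasses import dataclass

    @dataclass
    class JobRecord:
        """Hypothetical accounting record; field names are illustrative."""
        cpus: int       # processors the job occupied
        submit: float   # submission time (hours)
        start: float    # start time (hours)
        finish: float   # completion time (hours)

    def utilisation(jobs, total_cpus, makespan):
        """Fraction of available CPU-hours actually dedicated to jobs."""
        busy = sum(j.cpus * (j.finish - j.start) for j in jobs)
        return busy / (total_cpus * makespan)

    def mean_turnaround(jobs):
        """Average time from submission to completion."""
        return sum(j.finish - j.submit for j in jobs) / len(jobs)

    # Example: two jobs on a 16-CPU machine over a 10-hour window.
    jobs = [JobRecord(cpus=6, submit=0.0, start=0.0, finish=6.0),
            JobRecord(cpus=12, submit=1.0, start=6.0, finish=9.0)]
    print(utilisation(jobs, total_cpus=16, makespan=10.0))  # (36 + 36) / 160 = 0.45
    print(mean_turnaround(jobs))                            # (6 + 8) / 2 = 7.0 hours

The tension between the two objectives is visible even in this toy trace: the 12-CPU job needed more processors than the 10 left free, so it waited five hours, inflating turnaround while the machine sat partly idle.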
High performance computers cost a great deal of money, and to gain a reasonable return on investment on a system of this capacity, the system must be utilised as highly as possible. The efficiency of the job scheduler is therefore a major concern for the machine owner. On the other hand, users want a short turnaround time for their jobs. A balance between system utilisation and fairness therefore needs to be established. Several job scheduling strategies exist, and are examined below.

3.5 Job scheduling strategies

Depending on the policies of the HPC site, many systems use space sharing, being the concurrent execution of jobs (Schwiegelshohn and Yahyapour, 2000). This means that at any one time, a HPC can be concurrently executing many different jobs from many different users. Each job can request any number of available HPC resources and run for any length of time. Due to the combination of users and jobs, a varied and unpredictable workload may exist; a HPC system therefore requires some way to schedule these jobs. Extensive research has been undertaken in this area, and many scheduling algorithms have been conceived, although very few of these have actually been implemented in real scheduling applications (Schwiegelshohn and Yahyapour, 1998).

Shortest Job First

The shortest job first (SJF) strategy attempts to execute the shortest jobs (jobs that will use a shorter amount of CPU time) before longer jobs. The rationale behind this scheduling algorithm is that if a shorter job waits for a longer job, both jobs will have a long response time; if the shorter job runs first, it will have a shorter response time and therefore reduce the overall average response time. The major disadvantage of this method is that long jobs are penalised, and job starvation may occur if the system continues to receive short jobs.

Time Slicing/Processor Sharing

Time slicing (or processor sharing) is a simple approach to job scheduling, whereby the resources of a system are treated as one unit and each job is given a slice of the resources (Majumdar et al., 1988). This job scheduling strategy differs from many others in that jobs do not have exclusive access to their assigned resources. In a time sliced system, as more jobs are started, fewer resources are available to the existing jobs.

First-Come-First-Serve

The first-come-first-serve (FCFS) scheduling algorithm is a simple way to schedule jobs. The FCFS algorithm processes each job as it is submitted to the batch scheduler. An advantage of this algorithm is that it has low overhead and is not biased on the basis of job type, but it suffers from low efficiency because no strategic decisions are made about space filling. For example, suppose a cluster of 16 CPUs is running at high utilisation, with 14 CPUs in use, and the next job in the queue requests 12 CPUs. For the scheduler to execute that job, 10 more CPUs must become free. As the currently running jobs complete, more of the cluster's resources are left idle, waiting for the required number of CPUs to become available; up to 11 processors could be left idle before the large job at the head of the queue has a chance to execute (see Figure 3.1). This unused processing power is known as fragmentation.

Figure 3.1: First-Come-First-Serve

Priority

To extend the FCFS scheduling algorithm, many schedulers use a priority scheme. When a job is submitted to a queue, it is assigned a priority depending on its characteristics.
Priorities are a useful way of allowing an organisation to adhere to its goals or political agendas, by giving higher priority to certain users, groups, types of jobs, user quotas and so on.

Backfilling

Backfilling is a scheduling technique which allows jobs to be run out of their priority order to make better use of a machine's resources. It aims to provide high system utilisation by searching for free slots and fitting suitable jobs into these slots without delaying other jobs (Streit, 2001). In the FCFS example (see Figure 3.1), many of the system's resources were unused between t=1 and t=2 because of the large job's reservation. While this reservation allowed the large job to be processed as soon as the processors became available, much of the system was left idle. Backfilling allows small jobs to run by allocating the CPUs which would normally have been left idle. For the scheduler to successfully backfill jobs, users must submit with their job an estimate of its execution time, known as a walltime estimate. The walltime estimate is used by the scheduler to guarantee a latest possible completion time for jobs. The backfilling method uses walltime estimates to schedule the required resources for the highest priority job in the queue as soon as those resources become available. While that job waits for its resources, smaller jobs with a walltime estimate shorter than the unused time between the current time and the time when the high priority job is scheduled to begin are given the opportunity to run, and hence use resources that would otherwise be wasted by the FCFS algorithm. Since additional jobs are processed without any delay to the start of the larger, priority job, overall system utilisation is increased. Figure 3.2 illustrates how the backfill method works: a large job is scheduled at t=2, and three jobs, all with walltime estimates which allow them to run in the available free resources between t=1 and t=2, are executed. The wallclock estimate is used both to schedule large jobs and to backfill the small jobs, hence the accuracy of the wallclock estimate can affect the success of backfilling. If a wallclock estimate is too short, the resource manager will kill the job once the wallclock time is reached. On the other hand, if the wallclock estimate is much larger than the actual duration of the job, the job may be overlooked when the scheduler is looking for backfilling candidates.

Figure 3.2: Backfilling

Variable Partitioning

Many high performance computers require users to submit the number of processors required for their job. When the processors become free, they are allocated to the job, which runs until completion or until it is killed. This is known as variable partitioning (Feitelson, 1994).

Preemption

Preemption is the ability to pause a running job, then relocate it to another node set; once relocated, the job is resumed as normal. Although this method seems quite simple in nature, it can cause issues because some specialty networking hardware cannot support it. For instance, hardware such as Myrinet bypasses much of the operating system kernel by interfacing directly with the hardware, so there is no simple way to pause and relocate traffic while data is in transit. Preemption is common in vector machines, such as the Cray or NEC systems.
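The backfilling strategy described earlier in this section can be sketched in a few lines. The following is a minimal first-fit backfill pass written for illustration only; it is deliberately conservative (a backfilled job must fit both the free processors and the time window before the reservation) and is not Maui's actual implementation, and the dictionary-based job representation is an assumption.

    def backfill_pass(queue, free_cpus, shadow_time, now):
        """Select jobs to backfill ahead of the blocked head-of-queue job.

        queue       -- waiting jobs, highest priority first; each job is a
                       dict with 'cpus' and 'walltime' (the user's estimate)
        free_cpus   -- processors idle right now
        shadow_time -- guaranteed start time reserved for the head job
        now         -- current time
        """
        started = []
        for job in queue[1:]:  # queue[0] is the blocked high-priority job
            fits_space = job['cpus'] <= free_cpus
            # The walltime estimate guarantees the job finishes before the
            # head job's reservation, so the head job is never delayed.
            fits_time = now + job['walltime'] <= shadow_time
            if fits_space and fits_time:
                started.append(job)
                free_cpus -= job['cpus']
        return started

A best-fit variant would instead gather all candidates that satisfy both tests and rank them by some criterion, such as how many of the idle processors they consume, before starting any.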
Priority and Quality of Service

Priority and Quality of Service is a method used to provide 'fairness'. The goal of this strategy is to increase the availability of the machine for all users. By implementing a quota system, where every user is assigned a quota of processing time, heavy users over their quota are penalised in terms of priority. A job is then 'weighted' by a number relating to its priority; depending on the job scheduler, a higher weight should equate to a faster turnaround.

3.6 Specific Job Schedulers

Although many job scheduling applications exist, we are looking at those which are usable within the Portable Batch System.

3.6.1 FIFO Scheduler

The FIFO scheduler is a simple FCFS scheduler supplied with PBS. It is designed to be a starting point for the development of a further scheduler, but many sites use it as their primary scheduler. The FIFO scheduler was the primary scheduler at VPAC until August 2003, when Maui was implemented.

3.6.2 Maui Scheduler

The Maui scheduler is a highly configurable and effective batch scheduler, currently in use at many leading HPC facilities, including the University of Utah's 'Icebox' cluster. The Maui scheduler was designed as a HPC scheduler with advanced features such as backfilling, fairshare scheduling, multiple fairness policies, dynamic prioritisation, dedicated consumable resource tracking and enforcement, and a very extensive advance reservation system (Jackson, 1999). Maui uses a calculated priority to make its job scheduling decisions. The priorities of all jobs in the queue are dynamically recalculated on each scheduling cycle, and Maui gives the system administrator full access to define the priorities as he or she sees fit. The priority calculation is described below.

    Priority = QueueTimeWeight x QueueTimeFactor
             + FSWeight x FSFactor
             + XFactorWeight x XFactor
             + UrgencyWeight x UrgFactor
             + QOSWeight x J->QOS
             + BypassWeight x J->Bypass
             + ResourceWeight x J->MinProcs

Each of the weight values is defined in the Maui configuration file to tailor Maui to the exact needs of the workload or organisation.

QueueTimeFactor

The QueueTimeFactor is a value determined by the amount of time a job has been waiting in a queue to run. The longer a job has been waiting, the higher its priority becomes.

FSFactor

The FSFactor is a fair share value, calculated from the historical CPU usage of the user, group and account associated with a job. Maui allows for a CPU usage target, so the usage of the machine can be fairly divided among users, groups and accounts.

XFactor

XFactor means expansion factor. Increasing this value has the effect of pushing short-running jobs toward the top of the queue.

UrgFactor

The UrgFactor is an urgency value used to push jobs to completion within certain XFactor limits. When a job needs to be processed as soon as possible, without regard for other jobs in the queue, this value ensures that it gets immediate attention.

QOS

The Quality of Service (QOS) facility allows a site to give special privileges to certain users, groups or accounts. By increasing this weight, the scheduler gives a higher priority to jobs belonging to that user's, group's or account's QOS.

Backfilling

Every time backfilling occurs, the jobs in the queue with a higher priority have their Bypass value incremented by one. This ensures that job starvation is minimised.

MinProcs

MinProcs is another value to prevent job starvation: without it, jobs requiring many CPUs would be left starving.
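As a worked illustration of the priority formula above, the sketch below evaluates it for a single job. The weight and factor values are invented for the example; in practice the weights come from the site's Maui configuration file, and this representation is not Maui's internal one.

    def maui_priority(job, w):
        """Weighted sum of the priority components listed above."""
        return (w['queuetime'] * job['queuetime_factor']
                + w['fs']       * job['fs_factor']
                + w['xfactor']  * job['xfactor']
                + w['urgency']  * job['urg_factor']
                + w['qos']      * job['qos']
                + w['bypass']   * job['bypass']
                + w['resource'] * job['min_procs'])

    # Hypothetical weights and job state.
    weights = {'queuetime': 1, 'fs': 100, 'xfactor': 10, 'urgency': 0,
               'qos': 1000, 'bypass': 1, 'resource': 5}
    job = {'queuetime_factor': 240, 'fs_factor': 0.3, 'xfactor': 2.5,
           'urg_factor': 0, 'qos': 0, 'bypass': 3, 'min_procs': 8}
    print(maui_priority(job, weights))  # 240 + 30 + 25 + 0 + 0 + 3 + 40 = 338

Because the factors are recomputed on every scheduling cycle, a job's priority grows as it waits (through QueueTimeFactor and Bypass) even when its other components stay fixed.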
The research by Bode et al. (2000) shows that an increase in system utilisation can be achieved by using the Maui scheduler in place of the default PBS scheduler, FIFO. On the basis of research such as this, and of its advanced fairsharing capabilities, Maui is the scheduler selected by VPAC. Maui will ensure that the organisational goals for the cluster system are met, and each organisation, group and account will be monitored to ensure that they receive their entitlement of computing power.

3.7 Conclusion

This chapter has given an overview of the literature available for this topic area. It has introduced parallel high performance computing, some early parallel machines, batch processing systems, job schedulers and some of the algorithms associated with them.

Chapter 4

Methodology

4.1 Introduction

This chapter examines the research methodology selected for this project, including the constraints of the project, the available research methodology options and the selected methodology.

4.2 Selected research methodology

The aim of this project is to set up an effective job queueing and scheduling mechanism for the VPAC high performance Linux cluster, using a test and measure approach. Using the Maui scheduler plugged into ScalablePBS, a simple scheduling system shall be created using some estimated values. This system will allow users to log in and submit their jobs to PBS; Maui will then decide which jobs are run, and in which order, based on its configuration. The methodology used in this project consists of four phases:

Initial construction: build a simple PBS and Maui configuration.

Data analysis: examine the workload data and assess its suitability for the optimisation simulations.

Develop simulations: using Maui, construct a simulation environment for simulating optimisation techniques.

Results analysis: analyse the data and draw conclusions about the effectiveness of the optimisation.

4.2.1 Initial construction

This phase covers the initial software set up and installation. The steps involved in this process are:

Resource manager

The first step in setting up a batch processing system is to set up a resource manager. The original resource manager used at VPAC was OpenPBS 2.3.16, but after several months of testing it was found to be unstable. ScalablePBS is the current resource manager and has proved considerably more stable and scalable than OpenPBS. During this phase, the default PBS scheduler, FIFO, will be used until the Maui scheduler has been tested and is ready for use.

Job Scheduler

The Maui scheduler was set up, installed into the system and tested. Starting with a default configuration, Maui interfaces with PBS to obtain and record job and node information. While in test mode, Maui does not disrupt the scheduling of jobs while PBS's FIFO scheduler is running. This allows any configuration or software problems to be detected before running 'live'.

4.2.2 Data Analysis

This phase of the project examines the data and assesses its suitability. The workloads of Grendel, Brecca and Icebox are compared, and a conclusion is drawn about the suitability of simulating the Icebox data.

4.2.3 Develop Simulations

This phase examines the job scheduling optimisation techniques, and develops a simulation framework for running the optimisation experiment simulations.
The following experiments were conducted on the Icebox data:

• Backfill comparison
• Wallclock accuracy test
• Standing reservation for short jobs

Backfill Comparison

This experiment is designed to evaluate the effectiveness of the Maui backfilling methods on the Icebox job data. Papers such as Talby and Feitelson (1999) and Schwiegelshohn and Yahyapour (1998) have investigated backfilling methods, but neither used the Maui scheduler. Maui uses two main variations of backfilling: Firstfit and Bestfit. The Firstfit method selects the first job in the queue that will fit in the available resource space, while the Bestfit method selects the job that maximises the selected backfill criterion, based on processor usage, processor-second usage or seconds. The data will be simulated by running many copies of Maui, each stopping at a point in time and using no backfilling, Firstfit backfilling or Bestfit backfilling. A graph of the results will be generated for easy comparison between the backfilling methods. The metric used will be the number of completed jobs in that time.

Wallclock Accuracy Test

The wallclock accuracy test is designed to evaluate the effect that the predicted wallclock time has on the backfilling scheduling algorithm. When scheduling jobs, Maui uses the wallclock time which users submit with their jobs as an estimate of how long each job will run. From this value, Maui can make a basic prediction of when a job will finish, so that it can attempt to schedule more jobs after the current jobs have finished, and can give a guarantee of the latest time that a job will start and finish. For Maui to guarantee that jobs will finish by a certain time, it must enforce the wallclock limit on running jobs: if a job runs longer than its specified wallclock time, it will be killed to ensure that it does not delay queued jobs. Users are therefore advised to slightly increase this wallclock value to ensure that their jobs are not killed accidentally. The simulations for this experiment test from 10 percent to 100 percent accuracy, in steps of 10 percent. In each round of simulations, the simulator tests Firstfit backfilling, Bestfit backfilling and no backfilling. The results will then be graphed to give an indication of how important wallclock accuracy is to backfilling. The default backfilling metric for Bestfit is to best fit the available resources by processors, and this value was used throughout the simulations.

Standing Reservation for Short Jobs

The aim of this experiment is to reduce the expansion factor for short running jobs. The expansion factor is defined as:

    Expansion Factor = 1 + (Queue Wait Time / Job Execution Duration)

For example, a one-hour job that waits two hours in the queue has an expansion factor of 3. Researchers want a quick turnaround time for their testing jobs while they refine their processes and data, and a short turnaround time for these short jobs is essential for the researchers to progress with their research. By defining a block of nodes dedicated exclusively to these short jobs, short jobs have the ability to run not only in this defined block, but on the whole system. Jobs were split into three categories: small, medium and large. Jobs running for up to two hours are defined as small, medium jobs run between 2 and 8 hours, and any job running longer than 8 hours is defined as large.

4.2.4 Optimisation

Using the information found from the data analysis, the values governing the current scheduling system will be updated.
The data collected in the earlier phase will then be run through the Maui scheduler to produce figures that indicate whether the optimisation would have been successful on that particular data set.

4.3 Conclusion

This chapter has outlined the methodology to be used in this project, and has explored the experiments to be performed. The results of these experiments are presented in the next chapter.

Chapter 5

Analysis of Results

5.1 Introduction

This chapter focuses on the research questions by presenting, analysing and discussing the results from simulations of the described scheduling techniques. An analysis of the quality of the gathered data is presented first, then the data itself is examined. Finally, conclusions are drawn about the scheduling techniques used.

5.2 Aims

The research question proposed is "What is an efficient technique to schedule jobs on VPAC's high performance computer?" To answer this question, and the questions it raises, a series of simulations was run to test various scheduling techniques. Data from the University of Utah's cluster, Icebox, was used to perform simulations of three job scheduling techniques. The methodology behind these experiments is found in Chapter 4.

5.2.1 Workload Profile Comparison

At the time the simulations were undertaken, Brecca, VPAC's Linux cluster, was still in development and the system was therefore underutilised. This underutilisation is due to factors such as Brecca still being tested for stability, and many libraries and applications used on Grendel not yet having been ported to Brecca. The current workload profile of Brecca is expected to change as users of VPAC's old system, Grendel, are migrated to Brecca, and hence the workload profile of Brecca is expected to slowly come to resemble the current workload profile of Grendel. The usefulness and validity of the simulations undertaken as part of this project depend on using job scheduling data similar in profile to the anticipated workload of a fully loaded Brecca, as it is expected to be some time in the future. Brecca is the successor to Grendel and will inherit Grendel's workload once Grendel is decommissioned; hence the most accurate prediction is the recent workload history of Grendel. The workload profile of Grendel is shown in Figure 5.1 and Table 5.1. This data was gathered through scripts run on Grendel to extract job data on a daily basis, and although it is an accurate reflection of Grendel's workload, it is not suitable for Maui simulation because it does not comply with the stringent format the Maui simulator requires.

Processors   Total
1              695
2              212
4              303
8              103
16              14
32               7
64               0
128              0
Total         1334

Table 5.1: Job Data on Grendel

Figure 5.1: Grendel Workload Profile

Job profile data was gathered from Brecca between October 2003 and November 2003, with a total of 7524 jobs being processed in that time. This data is shown in Table 5.2 and Figure 5.2.

Processors  0:02:00  0:04:00  0:08:00  0:16:00  0:32:00  1:04:00  2:08:00  4:16:00  8:32:00  17:04:00  34:08:00  Total
1                 0        0        3     2552       21       69       18        0      219        13      3193   6088
2                58        9        0      102        0        0        9        0        0         0       372    550
4                25       18       14       66        0        0        3        0        0         8       322    456
8                 5        0        0       28        0        0        9        0        0         7       144    193
16                0        0        0       23        0        0        0        3        0         0        33     59
32               21        0        0        4        0        0        0        0        3        11       104    143
64                0        0        0        0        0        0        0        0        0         7        28     35
128               0        0        0        0        0        0        0        0        0         0         0      0
Total           109       27       17     2775       21       69       39        3      222        46      4196   7524

Table 5.2: Job Data on Brecca
This data is also unsuitable for simulation because it was gathered during a period of commissioning and shows a definite predominance of single processor jobs.

Figure 5.2: Brecca Workload Profile

Data gathered from the University of Utah's Center for High Performance Computing cluster, Icebox, was obtained from the supercluster.org tracefile repository (located at http://supercluster.org/research/traces). An analysis of this data is shown in Figure 5.3 and Table 5.3.

Processors  0:02:00  0:04:00  0:08:00  0:16:00  0:32:00  1:04:00  2:08:00  4:16:00  8:32:00  17:04:00  34:08:00  Total
1                71       18       77      373      244      674      642      378      469       445      3851   7242
2               128        1      166      100       78       74       54       56       65        74       665   1461
4                97        0       60      211      116      204      104      109      165       270      1838   3174
8                21        0       20       28       23      145      257      755      950       687      2128   5014
16               25        0        5       86        9       77       63       58      457       145       538   1463
32                0        0        0        5        6       34       15       85       89        82       885   1201
64                0        0        0        2        0       10        3        1        3         2        27     48
128               0        0        0        0        0        0        2        0        0         0         5      7
Total           342       19      328      805      476     1218     1140     1442     2198      1705      9937  19610

Table 5.3: Job Data on Icebox

Figure 5.3: Icebox Workload Profile

The workload profiles of Grendel and Icebox show strong similarities: both profiles have a predominant number of single processor jobs and a large number of 2, 4 and 8 processor jobs. It is the mix of single processor and multiprocessor jobs that causes resource fragmentation and therefore provides an opportunity for optimisation using backfilling and a standing reservation for short jobs. The similarity between the Icebox and Grendel workload profiles justifies using the Icebox data for the experiments of this project.

5.3 Scheduling Experiments

5.3.1 Experiment 1: Backfilling

The aim of this experiment is to evaluate the effectiveness of the two backfilling methods supported by Maui. The methodology behind this experiment is found in section 4.2.3. Three rounds of simulations were run, each round modifying the backfilling method used: the first round disabled backfilling, the second round tested the Firstfit algorithm, and the third round tested the Bestfit algorithm. Within each simulation round, the number of days for each simulation was increased from 10 to 90 days, in steps of 10 days. In total, 27 simulations were executed using the Icebox data. The results of the simulations on the Icebox data can be seen in Figure 5.4. This graph shows the number of completed jobs over the simulated periods, comparing the two backfilling methods with backfilling disabled. The results show almost twice as many jobs being processed in the same time period when either of the two backfilling methods was used, compared with no backfilling. It is interesting to note that the two backfilling methods showed very similar results, although Firstfit performed marginally better.

Figure 5.4: Number of Completed Jobs: Icebox

The results shown here demonstrate the effect that backfilling has on this workload. When backfilling is disabled, the scheduler uses a FCFS algorithm. The results in Figure 5.4 highlight the inefficiency of this algorithm.
To understand how this situation arises, consider the following example. The next job in the queue requests more resources than are currently available, and therefore must wait until those resources are freed. If that job requests a substantial number of CPUs, a considerable amount of resources will remain idle until the requested job can start. In the two backfilling tests we see short jobs processed in the resource gaps during the period in which the large job waits for the total amount of requested resources to become available; Figure 3.2 in Chapter 3 shows an example of this. Although Figure 5.4 shows that backfilling has doubled the throughput of jobs, this does not necessarily translate into a doubling of system utilisation. The increase in job throughput is the direct result of the backfilling algorithm assigning idle resources to short jobs to reduce fragmentation.

5.3.2 Experiment 2: Wallclock Accuracy

This experiment aims to test the effect that wallclock accuracy has on the backfilling technique. The methodology behind this experiment can be found in section 4.2.3.

Figure 5.5: Wallclock effect on Backfilling measuring Dedicated Processor Hours: Icebox

In this experiment, the wallclock accuracy was tested against dedicated processor hours, which serves as a good metric for system utilisation. The graph (Figure 5.5) shows that the predicted wallclock accuracy does have an impact on the utilisation of the cluster: at 10% accuracy, the two backfilling methods provide a system utilisation of around 75%, while at 100% accuracy the scheduler can operate at 98% system utilisation. This is to be expected, as the scheduler uses the wallclock estimate when scheduling jobs using backfilling. The scheduler makes the best decisions it can using the wallclock values predicted by the users, but these values cannot be controlled; the accuracy of a predicted wallclock value can only be improved by the experience and knowledge of the user. As stated in the Maui Scheduler Administrator's Guide, reasonable wallclock accuracy is around 40%. The average wallclock accuracy of the Icebox data was 30%.

5.3.3 Experiment 3: Standing Reservation for Short Jobs

A technique used in HPC is to give special privileges to short running jobs, so that they may be processed quickly. Often short jobs are queued by users as a test before launching a much longer job: researchers want a quick turnaround time for their testing jobs while they refine their processes and data, and a short turnaround time for these short jobs is essential for the researchers to progress with their research. By defining a block of resources for dedicated use by short jobs, their turnaround time can be reduced significantly. This technique is quantified by the expansion factor. The expansion factor is relative to the size of the job and the duration it waits in the queue, so to achieve an overall smaller expansion factor, short jobs need to be given priority to start over larger jobs. The aim of this experiment is to reduce the expansion factor for short jobs, which should in turn reduce the overall expansion factor for all jobs.
The methodology behind this experiment is found in section 4.2.3. From the graph of results (Figure 5.6) it can be seen that increasing the number of processors dedicated to the short job standing reservation decreases the expansion factor.

Figure 5.6: Effect of Standing Reservation for Short Jobs: Icebox

From the simulation, the best combination for reducing the expansion factor for short jobs appears to be around the 32 or 64 node mark. At these points the expansion factor for the short jobs is small, without affecting the expansion factor for larger jobs.

5.4 Conclusion

This chapter has presented the results of the experiments discussed in the last chapter. Conclusions from the results are drawn in the next chapter.

Chapter 6

Conclusion

6.1 Introduction

This chapter discusses the conclusions drawn from the results in the last chapter, discusses the problems encountered throughout the project and suggests ideas for further research in this area.

6.2 Results

The research question proposed is "What is an efficient technique to schedule jobs on VPAC's high performance computer?" By testing the backfilling algorithm, the effect of wallclock accuracy on backfilling and a standing reservation for short jobs, we see that these optimisation techniques have proved significant. The backfilling tests showed nearly twice as many jobs being processed in the same time period using either Firstfit or Bestfit backfilling, compared with no backfilling. Neither Firstfit nor Bestfit showed any great advantage over the other, demonstrating that the choice of backfilling method was not an important factor. An important factor in the success of backfilling is the wallclock accuracy, which was also tested. The accuracy of the wallclock estimate had a direct relationship with system utilisation: with a wallclock accuracy of 100%, the system ran at an average utilisation of 98%. Although this shows a great increase in performance, it would be unrealistic for users to predict their job duration to this degree. Regardless of the wallclock accuracy, backfilling has proved itself a significant optimisation in job scheduling. The short job reservation proved successful by reducing the turnaround time for short jobs. On the Icebox data, the average turnaround time for these short jobs dropped from over 30 hours down to 2 hours at the 32 processor mark, without penalty to medium or large jobs. This ensures that short jobs, designed for testing, can be processed with a quick turnaround, allowing researchers to reduce wasted computing power and make better use of their time.

6.3 Problems Encountered

Although the Maui Scheduler team claim that Maui is 'the most advanced scheduler in the world', it still has a long way to go. From a simulation perspective, it supports many nice features, but they are not always successful. This is often due to a lack of documentation, and to documentation that is incorrect. For example, fixing the wallclock accuracy to measure the effect that it has on backfilling was a difficult task. Initially, this value was stumbled upon in a paper about simulations with Maui
(Jackson et al., 2001). The parameter quoted there was 'SIMWCA', but the actual parameter name was 'SIMWCACCURACY', which was discovered from a Maui log with the log detail set to 9. Many issues like this were encountered, and many emails were exchanged with the supercluster.org group. Their responses were exceptional, but would not have been needed if the relevant documentation had existed.

6.4 Further Research

In this project, several techniques were used not only to increase system utilisation but also to reduce turnaround time for short jobs. One aspect that was not investigated as part of this project was user feedback. Throughout the simulations, a constant queue depth of 32 jobs was used, ensuring that at all times there were 32 jobs waiting in the queue. Although this was sufficient for testing backfilling, it did not consider the fact that users generally do not submit more jobs until they have the results from their previous jobs; the queue would therefore not necessarily always hold 32 waiting jobs. Realistically, the submission of jobs could be much more varied, and possibly affected by factors such as day of week, time of day, or even an event such as a research conference. Some papers (Feitelson and Nitzberg, 1995; Hotovy, 1996) discuss workload evaluations, and Jackson et al. (2001) discusses this issue further. This issue of varied job submission could be investigated in further research. In the case of VPAC, once the machine reaches a job/work saturation point, data could be collected and further optimisation could be performed. This could give a better understanding of the workload, and provide a more robust optimisation.

Bibliography

Aspen Systems (2003). The era of supercomputing. Retrieved 21 November 2003 from http://www.aspsys.com/clusters/beowulf/history.

Bode, B., Halstead, D. M., Kendall, R., and Lei, Z. (2000). The Portable Batch Scheduler and the Maui Scheduler on Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta.

Buyya, R. (1999). High Performance Cluster Computing: Architectures and Systems. Prentice Hall PTR, NJ, USA.

Feitelson, D. (1994). Job scheduling in multiprogrammed parallel systems. Technical report, IBM Research Report RC.

Feitelson, D. and Nitzberg, B. (1995). Job characteristics of a production parallel scientific workload on the NASA Ames iPSC/860. In Feitelson, D. G. and Rudolph, L., editors, Proceedings of the IPPS '95 Workshop on Job Scheduling Strategies for Parallel Processing, volume 949, pages 337–360. Springer.

Hotovy, S. (1996). Workload evolution on the Cornell Theory Center IBM SP2. In Feitelson, D. G. and Rudolph, L., editors, Proceedings of the IPPS '96 Workshop on Job Scheduling Strategies for Parallel Processing, pages 27–40. Springer-Verlag.

IBM Corporation (2003). IBM eServer Cluster 1350 Description. IBM Corporation.

Jackson, D. B. (1999). Advanced scheduling of Linux clusters using Maui. Retrieved 25 August 2003 from http://supercluster.org/research/papers/xlinux99.html.

Jackson, D. B., Jackson, H., and Snell, Q. O. (2001). Simulation based HPC workload analysis. In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS-01), San Francisco, CA, April 23–27, 2001. IEEE Computer Society.

Majumdar, S., Eager, D. L., and Bunt, R. B. (1988). Scheduling in multiprogrammed parallel systems. In Proceedings of the 1988 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 104–113.
ACM Press.

Merkey, P. (2003). Beowulf history. Retrieved 21 November 2003 from http://www.beowulf.org/beowulf/history.html.

Pfister, G. F. (1998). In Search of Clusters: The Ongoing Battle in Lowly Parallel Computing. Prentice Hall, 2nd edition.

Schwiegelshohn, U. and Yahyapour, R. (1998). Analysis of first-come-first-serve parallel job scheduling. In Proceedings of the Symposium on Discrete Algorithms (A Conference on Theoretical and Experimental Analysis of Discrete Algorithms).

Schwiegelshohn, U. and Yahyapour, R. (2000). Fairness in parallel job scheduling. Journal of Scheduling.

Streit, A. (2001). On job scheduling for HPC clusters and the dynP scheduler. Lecture Notes in Computer Science, 2228.

Supercluster Research and Development Group (2002). Maui Scheduler Administrator's Guide. Supercluster Research and Development Group.

Talby, D. and Feitelson, D. (1999). Supporting priorities and improving utilization of the IBM SP scheduler using slack-based backfilling. In Proceedings of the 13th International Parallel Processing Symposium.

Wikipedia (2003). ILLIAC IV. Retrieved 4 June 2003 from http://www.wikipedia.org/wiki/ILLIAC_IV.

Zenios, S. A. (1999). High-performance computing in finance: The last 10 years and the next. Parallel Computing, 25:2149–2175.