Performance Evaluation of a Dynamic Single Round Scheduling Algorithm for Divisible Load Applications

Leila Ismail, Liren Zhang, Khaled Shuaib, and Sameer Bataineh
Faculty of Information Technology, UAE University, 17551 Al-Maqam, Al-Ain, United Arab Emirates

Abstract— Divisible loads occur in many scientific and engineering applications and can easily be mapped to a distributed environment, such as a computational grid or cloud, using a master-worker pattern. However, dividing an application and deploying it on a set of heterogeneous computing resources makes obtaining optimal performance challenging, because of the varying processing and networking capacities of the underlying system. We provide a dynamic scheduling algorithm for allocating divisible loads to a set of heterogeneous resources so as to decrease the overall application execution time. The algorithm uses a single-round strategy, a well-known approach adopted by the majority of developers because it is simple to design and implement. Our algorithm computes the chunk size that should be distributed to each worker. We analyze the performance of the algorithm under different scenarios of heterogeneous computing and networking capacities.

Keywords: Distributed Systems, Divisible Load Application, Dynamic Scheduling, Performance

1. INTRODUCTION

The capacity of today's infrastructures, the ubiquity of network resources, and the low cost of storage have led to the emergence of heterogeneous distributed-memory systems such as Grid and Cloud computing. These platforms are promising technologies for running parallel applications at low cost [1] [2] [3] [4] [5]. In this work, we propose a dynamic scheduling algorithm to obtain optimal performance when distributing divisible load applications [6] to a set of heterogeneous resources in a computational platform, such as a Grid or a Cloud.
The divisible load model represents a class of applications in which an application can be divided into a number of tasks that can be processed independently in parallel ([6], [7]) with no or negligible inter-task communication. Many scientific and engineering applications fall into this category, such as pattern search, compression, join, graph coloring and generic search applications [8], multimedia and video processing [9], [10], convolution [11], and image processing [12] [13]. The tasks of a divisible load application are amenable to a master-worker model [14] that can be easily implemented and deployed on computing platforms such as clusters, computational Grids, or computational Clouds. The master is a processor which divides an application load into tasks and assigns each task to a separate worker.

In this work, we compute the chunk sizes of an application that should be distributed to a set of computing workers to obtain optimal performance. The algorithm takes into consideration the communication and computation capacities of the underlying platform. It also accounts for computing delays and communication latencies. In the heterogeneous environments we have set up, and with no specific selection policy, our algorithm shows a maximum relative makespan of 1.29 with respect to the ideal algorithm. The ideal algorithm determines a chunk size in an ideal situation in which all computing workers are homogeneous and networking and computing overheads are negligible.

Several works have sought an optimal algorithm for scheduling a divisible load application [15] [16] [17] [18] [19]. The multi-installment algorithm studied in [15] proposed a model for scheduling tasks in a single round, but communication time and computation time are assumed to be proportional. A multi-round algorithm [16] was built on [15] by adding communication and computation latencies to the model.
The algorithm aims at optimizing the number of rounds to obtain an optimal makespan for the application. Consequently, within a single round, some computing workers will be waiting while others are working. [20] studies the complexity of multi-round divisible load scheduling and concludes that the selection of computing workers, their ordering, and the distribution of adequate chunk sizes remain an open problem. [17] builds a scheduling model under the assumption that computations are suspended by communications. Other works target specialized infrastructures, such as [21] for distributed bus networks.

The rest of the paper is structured as follows. Section 2 describes the system model we consider. Our dynamic scheduling algorithm is described in Section 3. In Section 4, we evaluate the proposed model and discuss the obtained results. Section 5 concludes the work.

2. SYSTEM MODEL

Fig. 1: Computing Platform Model

The computing platforms that we target are Grids and Clouds of computing resources. As shown in Figure 1, a computing platform consists of nodes which are connected via a network link whose speed dictates the speed of communication when data is transmitted among the nodes. A node consists of multiple cores. A core is the smallest processing unit available in the system; it can be a context, a processor, or a physical core.

Upon receiving an application or a user's request, the master divides the whole application, of total load W_total, into different tasks and schedules these tasks to the selected computing workers based on their individual capabilities in terms of: the conditions of the communication link between the master and each computing worker, and the existing computing power of each computing worker. Each computing worker in the system has a computing capacity µ_i, where i = 1, ..., N, and N is the number of computing workers in the system.
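As an illustration of this system model, the per-worker parameters can be captured in a small record. This is our own sketch; the field names are not from the paper:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    """Parameters of one computing worker in the star-shaped platform."""
    mu: float      # computing capacity (load units processed per second)
    C: float       # link transfer rate from the master (load units per second)
    theta: float   # fixed computation start-up latency (seconds)
    lam: float     # fixed communication initiation latency (seconds)

# Example platform: N = 3 heterogeneous workers served by one master.
platform = [Worker(mu=1.0, C=10.0, theta=1.0, lam=1.0),
            Worker(mu=0.5, C=20.0, theta=1.0, lam=1.0),
            Worker(mu=0.8, C=30.0, theta=1.0, lam=1.0)]
```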
In the rest of the paper, we use the terms task and chunk interchangeably. Consider a portion of the total load, chunk_i ≤ W_total, which is to be processed on worker i. We model the time required for computing worker i to perform the computation, TP_i, as:

TP_i = θ_i + chunk_i / µ_i    (1)

where θ_i is a fixed latency, in seconds, for starting the computation at computing worker i. We model the time for sending chunk_i units of load to computing worker i, TC_i, as:

TC_i = λ_i + chunk_i / C_i    (2)

where the component λ_i in TC_i is a latency, in seconds, incurred by the master to initiate data transfer to worker i, and C_i is the data transfer rate, in units of computing load per second, provided by the communication link from the master to worker i. The λ_i component may be caused by the delay of initiating communication from the master to worker i using SSH or whichever Grid or Cloud software is used to access worker i.

3. SCHEDULING ALGORITHM

To obtain optimal performance within a single round, our main objective is to determine the chunk size to be assigned to each computing worker so that all computing workers complete their computations at the same time. The total load of W_total computation units is divided among N computing workers. We assume that the master sends chunks to the N available computing workers in a sequential fashion; consequently, every computing worker i gets a chunk of size chunk_i to process. Figure 2 shows an example in which four computing workers are used.

Fig. 2: Distribution of Chunk Sizes to Computing Workers in a Single Round

Let us define α_i = chunk_i / C_i as the time required to transfer chunk_i from the master to worker i over the communication link of rate C_i between the master and computing worker i, and β_i = chunk_i / µ_i as the time required to compute chunk_i on computing worker i with processing capacity µ_i.
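The two cost models above translate directly into code. A minimal sketch (function and variable names are ours, not the paper's):

```python
def computation_time(chunk, mu, theta):
    """Eq. (1): TP_i = theta_i + chunk_i / mu_i."""
    return theta + chunk / mu

def communication_time(chunk, C, lam):
    """Eq. (2): TC_i = lambda_i + chunk_i / C_i."""
    return lam + chunk / C

# A worker with mu = 2 load units/s and a 10-unit/s link, 1 s latencies:
tp = computation_time(chunk=100, mu=2.0, theta=1.0)   # 1 + 100/2  = 51.0
tc = communication_time(chunk=100, C=10.0, lam=1.0)   # 1 + 100/10 = 11.0
```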
The time required by the first computing worker to complete the assigned task, chunk_1, is given by:

T_1 = TC_1 + TP_1
    = λ_1 + chunk_1/C_1 + θ_1 + chunk_1/µ_1
    = λ_1 + θ_1 + chunk_1 · (1/C_1 + 1/µ_1)    (3)

The time required by the second computing worker to complete its task, chunk_2, is given by:

T_2 = TC_1 + TC_2 + TP_2
    = λ_1 + chunk_1/C_1 + λ_2 + chunk_2/C_2 + θ_2 + chunk_2/µ_2
    = λ_1 + chunk_1/C_1 + λ_2 + θ_2 + chunk_2 · (1/C_2 + 1/µ_2)    (4)

In general, the time required by computing worker i to complete its task, chunk_i, is given by:

T_i = Σ_{j=1}^{i−1} TC_j + TC_i + TP_i
    = Σ_{j=1}^{i−1} (λ_j + chunk_j/C_j) + λ_i + θ_i + chunk_i · (1/C_i + 1/µ_i),   i = 1, ..., N    (5)

The time required by the last computing worker N to complete its task, chunk_N, is given by:

T_N = Σ_{j=1}^{N−1} TC_j + TC_N + TP_N
    = Σ_{j=1}^{N−1} (λ_j + chunk_j/C_j) + λ_N + θ_N + chunk_N · (1/C_N + 1/µ_N)    (6)

To achieve the goal that all workers complete the computation of their assigned chunks at the same time, it is required that:

T_1 = T_2 = ... = T_N = T    (7)

To simplify the formulas, we assume that:

λ_1 = λ_2 = ... = λ_N = λ    (8)

and

θ_1 = θ_2 = ... = θ_N = θ    (9)

Then, based on Equation 7 and Equations 3, 4, 5, and 6, we obtain:

chunk_1/µ_1 = λ + chunk_2 · (1/C_2 + 1/µ_2)    (10)
chunk_2/µ_2 = λ + chunk_3 · (1/C_3 + 1/µ_3)    (11)
...
chunk_{i−1}/µ_{i−1} = λ + chunk_i · (1/C_i + 1/µ_i)    (12)
...
chunk_{N−1}/µ_{N−1} = λ + chunk_N · (1/C_N + 1/µ_N)    (13)

Based on Equation 10, we obtain the following equation:

chunk_2 = chunk_1 · C_2µ_2 / (µ_1(C_2+µ_2)) − λ · C_2µ_2 / (C_2+µ_2)    (14)

Based on Equations 11 and 14, we obtain the following formulas:

chunk_3 = chunk_2 · C_3µ_3 / (µ_2(C_3+µ_3)) − λ · C_3µ_3 / (C_3+µ_3)    (15)

chunk_3 = chunk_1 · C_2C_3µ_3 / (µ_1(C_2+µ_2)(C_3+µ_3)) − λ · (1 + (C_2+µ_2)/C_2) · C_2C_3µ_3 / ((C_2+µ_2)(C_3+µ_3))    (16)

Based on Equation 12, and by successive substitutions, for computing worker i we obtain:

chunk_i = chunk_1 · (µ_i/µ_1) · ∏_{j=2}^{i} C_j/(C_j+µ_j) − λ · µ_i · [∏_{j=2}^{i} C_j/(C_j+µ_j)] · (1 + Σ_{j=2}^{i−1} ∏_{k=2}^{j} (C_k+µ_k)/C_k)    (17)

As W_total = Σ_{i=1}^{N} chunk_i, and based on Equation 17, we obtain the value of chunk_1 as follows:

chunk_1 = [ W_total + λ · Σ_{i=2}^{N} µ_i · (∏_{j=2}^{i} C_j/(C_j+µ_j)) · (1 + Σ_{j=2}^{i−1} ∏_{k=2}^{j} (C_k+µ_k)/C_k) ] / [ 1 + (1/µ_1) · Σ_{i=2}^{N} µ_i · ∏_{j=2}^{i} C_j/(C_j+µ_j) ]    (18)

Based on Equation 17, the remaining chunks chunk_i, i = 2, ..., N, are obtained by substituting the value of chunk_1 given by Equation 18.

4. PERFORMANCE EVALUATION

In this section, we evaluate the efficiency of our scheduling algorithm in different scenarios. The efficiency of the algorithm is measured against the ideal makespan, calculated by dividing the total workload by the sum of the computing powers of the individual workers (W_total / Σ_{i=1}^{N} µ_i, i = 1, ..., N). In particular, we analyze the impact of the system parameters (µ_i, C_i) on the performance of our scheduling algorithm.

4.1 Experimental Runs

The experiments use a set of 10 computing workers. We vary the degree of heterogeneity of the workers and study the performance of the scheduling algorithm. In a first step, we study the performance of the algorithm running on homogeneous clusters with increasing network capacity. Table 1 shows the experimental values, in which all the computing workers of the cluster have similar computing powers and are connected to the master via links of similar capacity. In a second step, we run the experiments with heterogeneous processing powers and a homogeneous network connecting the master to the computing workers, again studying the performance of the algorithm with increasing network capacity. Table 2 shows the computing processing powers, which have been randomly generated.
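The derivation of Section 3 can be checked numerically. The sketch below is our own code, not the paper's: it unrolls the recurrence of Equations 10-13 by writing chunk_i = A_i·chunk_1 − λ·B_i, then picks chunk_1 so that the chunks sum to W_total (Equation 18), after which all workers finish at the same time:

```python
def schedule(W_total, mu, C, lam, theta):
    """Single-round chunk sizes under which all workers finish together.

    Unrolls chunk_{i-1}/mu_{i-1} = lam + chunk_i*(1/C_i + 1/mu_i)
    (Eqs. 10-13) as chunk_i = A_i*chunk_1 - lam*B_i, then normalises
    so that sum(chunks) == W_total (Eq. 18)."""
    A, B = [1.0], [0.0]
    for i in range(1, len(mu)):
        k = C[i] * mu[i] / (C[i] + mu[i])      # C_i*mu_i / (C_i + mu_i)
        A.append(A[-1] * k / mu[i - 1])
        B.append(B[-1] * k / mu[i - 1] + k)
    chunk1 = (W_total + lam * sum(B)) / sum(A)
    return [a * chunk1 - lam * b for a, b in zip(A, B)]

def finish_times(chunks, mu, C, lam, theta):
    """T_i = sum_{j<=i} TC_j + TP_i (Eq. 5) under sequential distribution."""
    times, sent = [], 0.0
    for ch, m, c in zip(chunks, mu, C):
        sent += lam + ch / c            # master finishes sending chunk_i
        times.append(sent + theta + ch / m)
    return times

# Four workers, W_total = 2000, lam = theta = 1 (illustrative values):
mus = [0.45, 0.65, 0.84, 0.50]
chunks = schedule(2000.0, mus, [10.0] * 4, 1.0, 1.0)
t = finish_times(chunks, mus, [10.0] * 4, 1.0, 1.0)
# sum(chunks) == 2000 and max(t) - min(t) == 0, up to rounding
```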
In these experiments, the network and processing capacities have been chosen so that the computation-to-communication ratio represents realistic scenarios. For that purpose, we ran a parallel MPEG video compressor on our heterogeneous lab cluster; an input video is composed of frames, and each frame can be processed independently [22]. Table 4 shows the communication and execution times for 600 MBytes of raw video, and the ratio between them, in our experimental environment.

To analyze the impact of the degree of resource heterogeneity on the scheduling algorithm, experiments are conducted using clusters of different heterogeneity. We randomly generate computing and communication resources by varying the standard deviation while keeping the same means of 0.50 and 30 for computation and communication, respectively. Table 3 shows the standard deviations used to generate the different sets of communication and computation capacities. For every run, we compute a makespan normalized to the ideal makespan. Every run is executed 100 times and the average over the runs is taken.
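The ideal-makespan baseline and the random generation of heterogeneous capacities can be sketched as follows. The paper does not state which distribution is used, so a normal distribution clipped to positive values is our assumption:

```python
import random

def ideal_makespan(W_total, mu):
    """Ideal baseline: W_total / sum(mu_i), i.e. perfectly parallel
    processing with no communication or start-up overhead."""
    return W_total / sum(mu)

def generate_capacities(n, mean, std, rng):
    """Draw n positive capacities from N(mean, std) (assumed distribution)."""
    return [max(1e-3, rng.gauss(mean, std)) for _ in range(n)]

rng = random.Random(7)
mu = generate_capacities(10, mean=0.50, std=0.25, rng=rng)  # Table 3 ranges
C = generate_capacities(10, mean=30.0, std=2.5, rng=rng)
baseline = ideal_makespan(2000.0, [1.0] * 10)   # homogeneous case: 2000/10
```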
Table 1: Experimental Runs using Homogeneous Resources (N = 10, W_total = 2000, µ_i = 1, λ = θ = 1)

Run                          1    2    3    4    5
Value of C_i (i = 1, ..., N) 10   20   30   40   50

Table 2: Experimental Runs using Heterogeneous Computing Powers (N = 10, W_total = 2000, µ_i = 0.45, 0.65, 0.84, 0.5, 0.4, 0.44, 0.37, 0.55, 0.64, 0.4, λ = θ = 1)

Run                          1    2    3    4    5
Value of C_i (i = 1, ..., N) 10   20   30   40   50

Table 3: Heterogeneity Change for Communication and Computation in Experimental Scenarios

Standard Deviation for µ: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5
Standard Deviation for C: 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5

Table 4: Video Compressor Communication and Computation Times

Transfer Time (Sec)   Computation Time (Sec)   Ratio of Computation to Communication
26.375                578.9375                 21.95023697

4.2 Performance Analysis

In devising a scheduling algorithm for single-round-based applications, several requirements have to be considered; these requirements have an impact on the experimental results:

a) Minimize the waiting time of processes. This waiting time can arise in two cases:
• Processes wait for other processes to complete: after being assigned their chunks, processes which complete their executions first wait for the other processes to complete theirs. Our scheduling algorithm imposes that all computing workers complete their executions at the same time.
• Processes wait for other processes to start: as our algorithm distributes the chunks to the processes in a sequential fashion, processes wait to receive their chunks. A resource selection policy can decrease the impact of this problem on the overall performance of the scheduler. Our algorithm does not impose a selection policy; any selection policy can be used.

Figure 3 shows the relative performance of our scheduling algorithm on a homogeneous environment. The makespan decreases as the network capacities increase, becoming close to the ideal makespan for large network capacities (e.g., C = 60).
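The second waiting-time effect, processes waiting for the sequential distribution to reach them, can be illustrated with the completion-time model of Equation 5. The sketch below is our own code, and uses a naive equal split purely for illustration (the experiments in the paper use the optimal chunks of Section 3); it shows how the serving order changes the makespan:

```python
def makespan(order, chunks, mu, C, lam=1.0, theta=1.0):
    """Makespan when workers are served sequentially in the given order."""
    finish, sent = 0.0, 0.0
    for idx, ch in zip(order, chunks):
        sent += lam + ch / C[idx]               # chunk arrives at worker idx
        finish = max(finish, sent + theta + ch / mu[idx])
    return finish

mu = [0.45, 0.65, 0.84, 0.50]
C = [10.0, 40.0, 20.0, 30.0]
chunks = [500.0] * 4                            # equal split of W_total = 2000

no_policy = makespan([0, 1, 2, 3], chunks, mu, C)
fast_links_first = makespan(sorted(range(4), key=lambda i: -C[i]), chunks, mu, C)
# The two orders give different makespans; with a naive equal split the best
# ordering depends on the joint distribution of mu and C.
```

In the paper's experiments, which combine ordering with the optimal chunk sizes, network-based selection gave the best relative makespan (Figure 9).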
Figure 4 shows the chunk sizes distributed to the computing workers by the scheduling algorithm. As the network capacity between the master and the computing workers increases, the chunk size decreases. This accommodates a quicker transfer time for the current process while the previous processes are being executed. The scheduling algorithm also shows good performance with heterogeneous computing powers, as shown in Figure 5; Figure 6 shows the corresponding distribution of chunk sizes.

To assess the performance of the scheduler in an environment that is heterogeneous in both computing and networking resources, we calculate the relative makespan for different heterogeneous sets of computing and networking scenarios, as shown in Figure 7. The scheduling algorithm achieves a relative makespan of 1.21 for low processing heterogeneity (standard deviation of 0.1) and medium networking heterogeneity (standard deviation of 6). The performance of the algorithm decreases with high computing heterogeneity (standard deviation of 0.35) and high network heterogeneity (standard deviation of 10).

Although it seems intuitive that a selection policy ordering workers by increasing network capacity would outperform no selection, since the remaining processes would wait less for chunks to be distributed to the previous processes, our first experiments scheduled loads without any selection policy. By sorting the resources by computing capacity in increasing order, the scheduling algorithm achieves a minimum relative makespan of 1.20 and a maximum of 1.31, as shown in Figure 8. When resources are sorted by network capacity, the algorithm achieves a minimum relative makespan of 1.19 and a maximum of 1.22, as shown in Figure 9: an improvement over the performance of the algorithm with no selection policy or with computing-based selection.

Fig. 3: Relative Performance of the Scheduling Algorithm on Homogeneous Clusters with Increasing Network Capacities.
Fig. 4: Chunks Allocated to Computing Workers on Homogeneous Clusters with Increasing Network Capacity.
Fig. 5: Relative Performance of the Scheduling Algorithm with Heterogeneous Computing Powers and No Selection Policy versus Increasing Network Capacity.
Fig. 6: Chunk Sizes Assigned to Heterogeneous Computing Powers with No Selection Policy versus Increasing Network Capacity.
Fig. 7: Performance of the Scheduling Algorithm with Increasing Platform Heterogeneity and No Selection Policy.

5. CONCLUSION

In this paper, we presented a dynamic scheduling algorithm for distributing divisible load applications to a set of heterogeneous computing workers to obtain optimal performance. The algorithm uses a single round, as this is the design most used by developers because of its simplicity to program. By taking into consideration the processing and networking capacities, our algorithm shows good performance compared to the ideal, in particular when a selection policy is applied. The performance results show that our algorithm achieves a relative makespan of 1.19, with increasing network and processing heterogeneity of 9 and 0.3 respectively, by using a network-based selection with increasing order of network capacities.

References
[1] Ian Foster, "What is the Grid? A Three Point Checklist", Argonne National Laboratory and University of Chicago, 20 July 2002
[2] Ian Foster and Carl Kesselman, "The Grid: Blueprint for a Future Computing Infrastructure", Morgan Kaufmann, 1998
[3] R. Buyya, C.S. Yeo, and S. Venugopal, "Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities", Keynote Paper, in Proc. 10th IEEE International Conference on High Performance Computing and Communications (HPCC 2008), IEEE CS Press, Sept. 25-27, 2008, Dalian, China
[4] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona Brandic, "Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility", Future Generation Computer Systems, Volume 25, Issue 6, June 2009
[5] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing", Technical Report No. UCB/EECS-2009-28, University of California at Berkeley, USA, February 10, 2009
[6] V. Bharadwaj, D. Ghose, and T. Robertazzi, "Divisible Load Theory: A New Paradigm for Load Scheduling in Distributed Systems", Cluster Computing, 6(1):7-17, 2003
[7] D. Ghose and T. Robertazzi, editors, "Special Issue on Divisible Load Scheduling", Cluster Computing, 6, 1, 2003
[8] Maciej Drozdowski and Pawel Wolniewicz, "Experiments with Scheduling Divisible Tasks in Clusters of Workstations", Euro-Par 2000, pp. 311-319, 2000
[9] D. Altilar and Y. Paker, "An Optimal Scheduling Algorithm for Parallel Video Processing", in IEEE Int. Conference on Multimedia Computing and Systems, IEEE Computer Society Press, 1998
[10] D. Altilar and Y. Paker, "Optimal Scheduling Algorithms for Communication Constrained Parallel Processing", in Euro-Par 2002, LNCS 2400, pp. 197-206, Springer Verlag, 2002
[11] Leila Ismail and Driss Guerchi, "Performance Evaluation of Convolution on the IBM Cell Processor", IEEE Transactions on Parallel and Distributed Systems, 01 Apr. 2010, IEEE Computer Society Digital Library, http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.70
[12] C. Lee and M. Hamdi, "Parallel Image Processing Applications on a Network of Workstations", Parallel Computing, Vol. 21, pp. 137-160, 1995
[13] V. Bharadwaj and S.
Ranganath, "Theoretical and Experimental Study on Large Size Image Processing Applications Using Divisible Load Paradigms on Distributed Bus Networks", Image and Vision Computing, Vol. 20, nos. 13-14, pp. 917-1034, 2002
[14] Ian Foster, "Designing and Building Parallel Programs", Addison-Wesley (ISBN 9780201575941), 1995
[15] V. Bharadwaj, D. Ghose, and V. Mani, "Multi-Installment Load Distribution in Tree Networks with Delays", IEEE Transactions on Aerospace and Electronic Systems, vol. 31, no. 2, pp. 555-567, 1995
[16] Yang Yang, Krijn van der Raadt, and Henri Casanova, "Multiround Algorithms for Scheduling Divisible Loads", IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 11, November 2005
[17] Maciej Drozdowski and Marcin Lawenda, "Multi-installment Divisible Load Processing in Heterogeneous Systems with Limited Memory", PPAM 2005, LNCS 3911, pp. 847-854, 2006
[18] Olivier Beaumont, Henri Casanova, Arnaud Legrand, Yves Robert, and Yang Yang, "Scheduling Divisible Loads on Star and Tree Networks: Results and Open Problems", IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 3, pp. 207-218, March 2005
[19] Leila Ismail, Bruce Mills, and Alain Hennebelle, "A Formal Model of Dynamic Resource Allocation in Grid Computing Environment", Proceedings of the IEEE Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2008), Phuket, Thailand, pp. 685-693, 2008
[20] Yang Yang, Henri Casanova, Maciej Drozdowski, Marcin Lawenda, and Arnaud Legrand, "On the Complexity of Multi-Round Divisible Load Scheduling", Report No. 6096, ISSN 0249-6399, INRIA, January 2007
[21] S. Bataineh, T.-Y. Hsiung, and T.G. Robertazzi, "Closed Form Solutions for Bus and Tree Networks of Processors Load Sharing a Divisible Job", IEEE Trans. Computers, vol. 43, no. 10, Oct. 1994
[22] K. Shen, L.A. Rowe, and E.J. Delp, "A Parallel Implementation of an MPEG1 Encoder: Faster than Real-Time!", Proc.
SPIE Conf. Digital Video Compression: Algorithms and Technologies, pp. 407-418, Feb. 1995. Fig. 8: Performance of the Scheduling Algorithm with Increasing Platform Heterogeneity and Computing-based Selection Policy. Fig. 9: Performance of the Scheduling Algorithm with Increasing Platform Heterogeneity and Network-based Selection Policy.