Performance Evaluation of a Dynamic Single Round Scheduling
Algorithm for Divisible Load Applications
Leila Ismail, Liren Zhang, Khaled Shuaib, and Sameer Bataineh
Faculty of Information Technology, UAE University, 17551 Al-Maqam, Al-Ain, United Arab Emirates
Abstract— Divisible load applications occur in many scientific and engineering domains and can easily be mapped to a distributed environment, such as a computational Grid or Cloud, using a master-worker pattern. However, dividing an application and deploying it on a set of heterogeneous computing resources poses challenges for obtaining optimal performance due to the underlying processing and networking capacities. We provide a dynamic scheduling algorithm for allocating divisible loads on a set of heterogeneous resources to decrease the overall application execution time. The algorithm uses a single-round strategy, a well-known approach adopted by the majority of developers because it is simple to design and implement.
Our algorithm computes the chunk size that should be
distributed to each worker. We analyze the performance
of the algorithm in different scenarios of heterogeneous
computing and networking capacities.
Keywords: Distributed Systems, Divisible Load Application,
Dynamic Scheduling, Performance
1. INTRODUCTION
The capacity of today's infrastructures, the ubiquity of network resources, and the low cost of storage have led to the emergence of heterogeneous distributed-memory systems such as Grid and Cloud computing. These platforms are promising technologies for running parallel applications at low cost [1] [2] [3] [4] [5].
In this work, we propose a dynamic scheduling algorithm
to obtain an optimal performance when distributing divisible
load applications [6] to a set of heterogeneous resources in a
computational platform, such as a Grid or a Cloud. The divisible load model represents a class of applications in which an application can be divided into a number of tasks that can be processed independently in parallel ([6], [7]) with no or negligible inter-task communication. Many scientific
and engineering applications fall into this category, such as
search for a pattern, compression, join and graph coloring
and generic search applications [8], multimedia and video
processing [9], [10], convolution [11], and image processing
[12] [13].
The tasks of a divisible load application are amenable to
a master-worker model [14] that can be easily implemented
and deployed on computing platforms, such as clusters,
computational Grids, or computational Clouds. The master
is a processor which divides an application load into tasks
and assigns each task to a separate worker. In this work,
we compute the chunk sizes of an application that should
be distributed to a set of computing workers to obtain an
optimal performance. The algorithm takes into consideration the communication and computation capacities of the underlying platform, and it accounts for computing delays and communication latencies. In our scenarios, the algorithm shows a maximum relative makespan of 1.29 with respect to the ideal algorithm in the heterogeneous environments we have set up, and with no specific selection policy. The ideal algorithm determines the chunk sizes in an ideal situation in which all computing workers are homogeneous and the networking and computing overheads are negligible.
Several works have sought an optimal scheduling algorithm for divisible load applications [15] [16] [17] [18] [19]. The multi-installment algorithm studied in [15] proposed a model for scheduling tasks in a single round, but communication time and computation time are assumed to be proportional. A multi-round algorithm [16] was built on [15] by adding communication and computation latencies to the model. That algorithm aims at optimizing the number of rounds to obtain an optimal makespan of the application; consequently, within a single round, some computing workers will be waiting while others are working. [20] studies the complexity of multi-round divisible load scheduling, and concludes that the selection of computing workers, their ordering, and the distribution of adequate chunk sizes remain an open problem. [17] builds a scheduling model under the assumption that computations are suspended by communications. Other works target specialized infrastructures, such as [21] for distributed bus networks.
The rest of the paper is structured as follows. Section
2 describes the system model we consider. Our dynamic
scheduling algorithm is described in section 3. In section 4,
we evaluate the proposed model and discuss the obtained
results. Section 5 concludes the work.
2. SYSTEM MODEL
The computing platforms that we are targeting are Grids
and Clouds of computing resources. As shown in Figure 1, a
computing platform consists of nodes which are connected
via a network link whose speed dictates the speed of the communication processing when data is transmitted among the nodes. A node consists of multiple cores. A core is the smallest processing unit available in the system. A core can be a context, a processor, or a physical core.

Fig. 1: Computing Platform Model
Upon receiving an application or a user's request, the
master divides the whole application, of a total load of
Wtotal , into different tasks and schedules these tasks to
the selected computing workers based on their individual
capability in terms of: the conditions of the communication
link between the master and each computing worker, and the
existing computing power of each computing worker. Each
computing worker in the system has a computing capacity,
µi , where i = 1, ..., N , and N is the number of computing
workers in the system. In the rest of the paper, we use the
terms task and chunk interchangeably.
Consider a portion of the total load, chunk_i ≤ W_total, which is to be processed on worker i. We model the time required for computing worker i to perform the computation, TP_i, as:

TP_i = \theta_i + \frac{chunk_i}{\mu_i}    (1)

where θ_i is a fixed latency, in seconds, for starting the computation at computing worker i.
We model the time for sending chunk_i units of load to computing worker i, TC_i, as:

TC_i = \lambda_i + \frac{chunk_i}{C_i}    (2)

where the component λ_i in TC_i is a latency, in seconds, incurred by the master to initiate the data transfer to worker i. C_i is the data transfer rate, in units of computing load per second, that can be provided by the communication link from the master to worker i. The λ_i component may be caused by the delay incurred when initiating communication from the master to worker i using SSH or any Grid or Cloud software that is used to access worker i.
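As a small illustration (not part of the original paper), the cost model of Equations 1 and 2 can be written directly in code. The function and parameter names below (computation_time, communication_time, mu, cap, theta, lam) are our own choices.

def computation_time(chunk, mu, theta):
    """Equation 1: time, in seconds, for a worker to process `chunk` load units."""
    return theta + chunk / mu

def communication_time(chunk, cap, lam):
    """Equation 2: time, in seconds, for the master to send `chunk` load units."""
    return lam + chunk / cap

# For example, with the Table 1 settings (mu = 1, C = 10, lam = theta = 1),
# sending and processing a chunk of 200 load units takes
# communication_time(200, 10, 1) + computation_time(200, 1, 1) = 21 + 201 = 222 seconds.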
3. SCHEDULING ALGORITHM
To obtain an optimal performance for a single round, our main objective is to determine the chunk size that is to be assigned to each computing worker so that all computing workers complete their computations at the same time. The total load of W_total computation units is divided among the N computing workers. We assume that the master sends the chunks to the N available computing workers in a sequential fashion. Consequently, every computing worker i gets its chunk size chunk_i to process. Figure 2 shows an example in which four computing workers are used.

Fig. 2: Distribution of Chunk Sizes to Computing Workers in a Single Round
Let us define α_i = chunk_i/C_i as the time required to transfer chunk_i from the master to worker i over the communication link C_i between the master and computing worker i, and β_i = chunk_i/µ_i as the time required to compute the chunk chunk_i on computing worker i with computing processing capacity µ_i. The time required by the first computing worker to complete the assigned task, chunk_1, is given by the following formula:

T_1 = TC_1 + TP_1 = \lambda_1 + \frac{chunk_1}{C_1} + \theta_1 + \frac{chunk_1}{\mu_1} = \lambda_1 + \theta_1 + chunk_1\left(\frac{1}{C_1} + \frac{1}{\mu_1}\right)    (3)
The time required by the second computing worker to complete its task, chunk_2, is given by the following formula:

T_2 = TC_1 + TC_2 + TP_2 = \lambda_1 + \frac{chunk_1}{C_1} + \lambda_2 + \frac{chunk_2}{C_2} + \theta_2 + \frac{chunk_2}{\mu_2} = \lambda_1 + \frac{chunk_1}{C_1} + \lambda_2 + \theta_2 + chunk_2\left(\frac{1}{C_2} + \frac{1}{\mu_2}\right)    (4)

In general, the time required by computing worker i to complete its task, chunk_i, is given by:

T_i = \sum_{j=1}^{i-1} TC_j + TC_i + TP_i = \sum_{j=1}^{i-1}\left(\lambda_j + \frac{chunk_j}{C_j}\right) + \theta_i + chunk_i\left(\frac{1}{C_i} + \frac{1}{\mu_i}\right), \quad i = 1, ..., N    (5)

The time required by the last computing worker N to complete its task, chunk_N, is given by:

T_N = \sum_{j=1}^{N-1} TC_j + TC_N + TP_N = \sum_{j=1}^{N-1}\left(\lambda_j + \frac{chunk_j}{C_j}\right) + \theta_N + chunk_N\left(\frac{1}{C_N} + \frac{1}{\mu_N}\right)    (6)

To achieve the goal that all workers complete the computation of their assigned chunks at the same time, it is required that:

T_1 = T_2 = ... = T_N = T    (7)

To simplify the formulas, we assume that:

\lambda_1 = \lambda_2 = ... = \lambda_N = \lambda    (8)

and

\theta_1 = \theta_2 = ... = \theta_N = \theta    (9)

Then, based on Equation 7 and Equations 3, 4, 5, and 6, we obtain:

\frac{chunk_1}{\mu_1} = \lambda + chunk_2\left(\frac{1}{C_2} + \frac{1}{\mu_2}\right)    (10)

\frac{chunk_2}{\mu_2} = \lambda + chunk_3\left(\frac{1}{C_3} + \frac{1}{\mu_3}\right)    (11)

...

\frac{chunk_{i-1}}{\mu_{i-1}} = \lambda + chunk_i\left(\frac{1}{C_i} + \frac{1}{\mu_i}\right)    (12)

...

\frac{chunk_{N-1}}{\mu_{N-1}} = \lambda + chunk_N\left(\frac{1}{C_N} + \frac{1}{\mu_N}\right)    (13)

Based on Equation 10, we obtain the following equation:

chunk_2 = \frac{C_2\mu_2\,chunk_1}{\mu_1(C_2+\mu_2)} - \frac{C_2\mu_2\lambda}{C_2+\mu_2}    (14)

Based on Equations 11 and 14, we obtain the following formula:

chunk_3 = \frac{C_3\mu_3\,chunk_2}{\mu_2(C_3+\mu_3)} - \frac{C_3\mu_3\lambda}{C_3+\mu_3}    (15)

chunk_3 = \frac{C_2 C_3\mu_3\,chunk_1}{\mu_1(C_2+\mu_2)(C_3+\mu_3)} - \frac{C_2 C_3\mu_3\lambda\left(1+\frac{C_2+\mu_2}{C_2}\right)}{(C_2+\mu_2)(C_3+\mu_3)}    (16)

Based on Equation 12, and by substitutions, for a computing worker i, we obtain the following formula:

chunk_i = \frac{\left(\prod_{j=2}^{i} C_j\right)\mu_i\,chunk_1}{\mu_1\prod_{j=2}^{i}(C_j+\mu_j)} - \lambda\,\frac{\left(\prod_{j=2}^{i} C_j\right)\mu_i\left(1+\sum_{j=2}^{i-1}\frac{\prod_{k=2}^{j}(C_k+\mu_k)}{\prod_{k=2}^{j} C_k}\right)}{\prod_{j=2}^{i}(C_j+\mu_j)}    (17)

As W_total = \sum_{i=1}^{N} chunk_i, and based on Equation 17, we obtain the value of chunk_1 as follows:

chunk_1 = \frac{W_{total} + \lambda\sum_{i=2}^{N}\frac{\left(\prod_{j=2}^{i} C_j\right)\mu_i\left(1+\sum_{j=2}^{i-1}\frac{\prod_{k=2}^{j}(C_k+\mu_k)}{\prod_{k=2}^{j} C_k}\right)}{\prod_{j=2}^{i}(C_j+\mu_j)}}{1 + \sum_{i=2}^{N}\frac{\left(\prod_{j=2}^{i} C_j\right)\mu_i}{\mu_1\prod_{j=2}^{i}(C_j+\mu_j)}}    (18)

Based on Equation 17, the remaining chunks, chunk_i, i = 2, ..., N, can be obtained by substituting the value of chunk_1 obtained from Equation 18.
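A minimal sketch of the resulting chunk-size computation is given below (our own illustration, not code from the paper). It assumes the simplifications of Equations 8 and 9 (a common latency lam), computes chunk_1 through Equation 18 by accumulating the coefficients of Equation 17, and derives the remaining chunks from the recurrence of Equations 10-13; all identifiers are hypothetical.

def chunk_sizes(W_total, mu, C, lam):
    """Single-round chunk sizes so that all workers finish at the same time.

    mu[i] : computing capacity of worker i+1 (load units per second)
    C[i]  : capacity of the link from the master to worker i+1
    lam   : common communication latency (Equation 8)
    """
    N = len(mu)
    # Write chunk_{i+1} = A[i] * chunk_1 - lam * B[i] (the form of Equation 17).
    A, B = [1.0], [0.0]
    for i in range(1, N):
        f = C[i] * mu[i] / (C[i] + mu[i])      # factor from Equations 10-13
        A.append(f * A[i - 1] / mu[i - 1])
        B.append(f * (B[i - 1] / mu[i - 1] + 1.0))
    # W_total = chunk_1 * sum(A) - lam * sum(B), which rearranges to Equation 18.
    chunk1 = (W_total + lam * sum(B)) / sum(A)
    return [A[i] * chunk1 - lam * B[i] for i in range(N)]

# For instance, Run 1 of Table 1 (N = 10, W_total = 2000, mu_i = 1, C_i = 10, lam = 1):
# sizes = chunk_sizes(2000, [1.0] * 10, [10.0] * 10, 1.0)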
4. PERFORMANCE EVALUATION
In this section, we evaluate the efficiency of our scheduling algorithm in different scenarios. The efficiency of the algorithm is measured and compared to the ideal makespan. The ideal makespan is calculated by dividing the total workload by the sum of the computing powers of the individual workers, W_{total} / \sum_{i=1}^{N} \mu_i. In particular, we analyze the impact of the system parameters (µ_i, C_i) on the performance of our scheduling algorithm.
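Building on the chunk_sizes sketch above (again our own hypothetical code, not the paper's implementation), the relative makespan used in the following experiments can be computed by simulating the sequential distribution of Equations 1-6 and dividing by the ideal makespan:

def relative_makespan(W_total, mu, C, lam, theta):
    """Ratio of the achieved single-round makespan to the ideal makespan."""
    sizes = chunk_sizes(W_total, mu, C, lam)
    sent, finish = 0.0, 0.0
    for chunk, m, c in zip(sizes, mu, C):
        sent += lam + chunk / c                           # master finishes sending this chunk
        finish = max(finish, sent + theta + chunk / m)    # Equation 5 for this worker
    ideal = W_total / sum(mu)                             # homogeneous, overhead-free bound
    return finish / ideal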
4.1 Experimental Runs
The experiments use a set of 10 computing workers. We
vary the heterogeneity degree of the workers and study
the performance of the scheduling algorithm. In a first
step, we study the performance of the algorithm running
on homogeneous clusters with increasing network capacity.
Table 1 shows the experimental values in which all the
computing workers of the cluster have similar computing
powers and they are connected to the master via similar
network capacity. In a second step, we run the experiments
with heterogeneous processing powers and a homogeneous network connecting the master to the computing workers. We study the performance of the algorithm with increasing network capacity. Table 2 shows the computing processing powers, which have been randomly generated. In these experiments,
the network and the processing powers have been chosen
in a way that the computation to communication ratio can
represent realistic scenarios. To that purpose, we have run
a parallel MPEG video compressor in our heterogeneous
Lab cluster. An input video is composed of frames and each
frame can be processed independently [22]. Table 4 shows the communication and the execution times for 600 MBytes of raw video and the ratio of computation time to communication time in our experimental environment.
In order to analyze the impact of the degree of resource heterogeneity on the scheduling algorithm, experiments are conducted using clusters of different heterogeneity. We randomly generate computing and communication resources by varying the standard deviation while using the same means of 0.50 and 30 for computation and communication respectively. Table 3 shows the standard deviations used for generating the different sets of computation and communication values. For every run, we compute a normalized makespan relative to the ideal makespan. Every run is executed 100 times and the average over the runs is taken.
Table 1: Experimental Runs using Homogeneous Resources (N = 10, W_total = 2000, µ_i = 1, λ = θ = 1)

Run    Value of C_i (i = 1, ..., N)
1      10
2      20
3      30
4      40
5      50
Table 2: Experimental Runs using Heterogeneous Processing Powers (N = 10, W_total = 2000, µ_i = 0.45, 0.65, 0.84, 0.5, 0.4, 0.44, 0.37, 0.55, 0.64, 0.4, λ = θ = 1)

Run    Value of C_i (i = 1, ..., N)
1      10
2      20
3      30
4      40
5      50
Table 3: Heterogeneity Change for Communication and Computation in Experimental Scenarios

Standard Deviation for µ: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5
Standard Deviation for C: 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5
Table 4: Video Compressor Communication and Computation Times

Transfer Time (Sec)    Computation Time (Sec)    Ratio of Computation to Communication
26.375                 578.9375                  21.95023697
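For the heterogeneity experiments summarized in Table 3, µ_i and C_i are drawn at random around the fixed means of 0.50 and 30 with a varying standard deviation. The paper does not state the distribution used; the sketch below is our own assumption of a normal distribution clipped to small positive values.

import random

def generate_resources(n, std_mu, std_c, mean_mu=0.50, mean_c=30.0, seed=None):
    """Draw n computing capacities and n link capacities for one scenario."""
    rng = random.Random(seed)
    mu = [max(0.05, rng.gauss(mean_mu, std_mu)) for _ in range(n)]   # clipping floors are assumptions
    C = [max(0.5, rng.gauss(mean_c, std_c)) for _ in range(n)]
    return mu, C

# One scenario of medium heterogeneity, e.g. std_mu = 0.25 and std_c = 2.5 from Table 3:
# mu, C = generate_resources(10, 0.25, 2.5)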
4.2 Performance Analysis
In devising a scheduling algorithm for single-round-based applications, several requirements have to be considered. These requirements have an impact on the experimental results:
a) Minimize the waiting time of processes. This waiting time can be generated in two cases:
• Processes wait for other processes to complete:
after being assigned their chunks, processes
which complete their executions first will wait
for other processes to complete their execution. Our scheduling algorithm imposes that all
computing workers complete their executions
at the same time.
• Processes wait for other processes to start: as
our algorithm distributes the chunks to the
processes in a sequential fashion, processes
wait to receive their chunks. A selection policy
of resources can decrease the impact of this
problem on the overall performance of the
scheduler. Our algorithm does not impose a
selection policy and any selection policy can
be used.
Figure 3 shows the relative performance of our scheduling
algorithm in a homogeneous environment. The makespan decreases when the network capacities increase, becoming close to the ideal makespan for large network capacities, i.e., C = 60. Figure 4 shows the chunk sizes that are distributed to the computing workers according to the scheduling algorithm. It shows that as the network capacity between the master and the computing worker increases, the chunk size decreases. This is to accommodate the quicker transfer time for a current process, during which the previous processes are being executed.
good performance with heterogeneous computing powers
as shown in Figure 5. Figure 6 shows the corresponding
chunk size distribution. To assess the performance of the scheduler in an environment that is heterogeneous in both computing and networking resources, we calculate the relative makespan in different heterogeneous sets of computing and networking scenarios, as shown in Figure 7. The scheduling algorithm achieves a relative makespan of 1.21 for low processing heterogeneity (standard deviation of 0.1) and medium networking heterogeneity (standard deviation of 6). The performance of the algorithm decreases with high computing heterogeneity (standard deviation of 0.35) and high network heterogeneity (standard deviation of 10).
At the performance level, it is intuitive that a selection policy ordering workers by increasing networking capacity should achieve better performance than no selection, since the remaining processes spend less time waiting for the chunks of the previously served processes to be distributed; nevertheless, our first experiments scheduled loads without any selection policy. By sorting the resources in increasing order of their computing capacities, the scheduling algorithm achieves a minimum relative makespan of 1.20 and a maximum of 1.31, as shown in Figure 8. When resources are sorted by network capacities, the algorithm achieves a relative makespan between a minimum of 1.19 and a maximum of 1.22, as shown in Figure 9; an improvement over the performance of the algorithm with no selection policy or with computing-based selection.
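A selection policy can be layered on top of the chunk-size computation simply by reordering the workers before Equation 18 is applied. The helper below is a hypothetical sketch (names are ours) of the network-based and computing-based orderings used in Figures 8 and 9, both in increasing order of capacity as described above; it reuses the chunk_sizes sketch from Section 3.

def schedule_with_selection(W_total, mu, C, lam, policy="network"):
    """Reorder workers by the chosen policy, then compute their chunk sizes."""
    order = list(range(len(mu)))
    if policy == "network":
        order.sort(key=lambda i: C[i])       # increasing network capacity
    elif policy == "computing":
        order.sort(key=lambda i: mu[i])      # increasing computing capacity
    mu_ord = [mu[i] for i in order]
    C_ord = [C[i] for i in order]
    sizes = chunk_sizes(W_total, mu_ord, C_ord, lam)
    return {order[k]: sizes[k] for k in range(len(order))}   # chunk per original worker index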
Fig. 4: Chunks Allocated to Computing Workers on Homogeneous Clusters with Increasing Network Capacity.
Fig. 5: Relative Performance of the Scheduling Algorithm
with Heterogeneous Computing Powers and No Selection
Policy versus Increasing Network Capacity.
Fig. 3: Relative Performance of the Scheduling Algorithm
on Homogeneous Clusters with Increasing Network Capacities.
5. CONCLUSION
In this paper, we presented a dynamic scheduling algorithm for distributing divisible load applications to a set of heterogeneous computing workers to obtain an optimal performance. The algorithm uses a single round, as this is the design most commonly used by developers owing to its simplicity to program. Taking into consideration the processing and the networking capacities, our algorithm shows a good performance compared to the ideal performance, in particular whenever a selection policy is applied. The performance results show that our algorithm reaches a relative makespan of 1.19 with network and processing heterogeneity of 9 and 0.3, respectively, by using a network-based selection with increasing order of network capacities.
References
[1] Ian Foster, "What is the Grid? A Three Point Checklist", Argonne National Laboratory and University of Chicago, 20 July 2002
[2] Ian Foster and Carl Kesselman, "The Grid: Blueprint for a Future Computing Infrastructure", Morgan Kaufmann, 1998
[3] R. Buyya, C.S. Yeo, and S. Venugopal, "Market-Oriented Cloud Computing: Vision, Hype, and Reality for Delivering IT Services as Computing Utilities", Keynote Paper, in Proc. 10th IEEE International Conference on High Performance Computing and Communications (HPCC 2008), IEEE CS Press, Sept. 25-27, 2008, Dalian, China
Fig. 6: Chunk Sizes Assigned to Heterogeneous Computing
Powers with No Selection Policy versus Increasing Network
Capacity.
Fig. 7: Performance of the Scheduling Algorithm with
Increasing Platform Heterogeneity and No Selection Policy.
[4] Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg,
Ivona Brandic. Cloud computing and emerging IT platforms: Vision,
hype, and reality for delivering computing as the 5th utility. Future
Generation Computer Systems. Volume 25, Issue 6, June 2009
[5] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski,
G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia. Above the
Clouds: A Berkeley View of Cloud computing. Technical Report
No. UCB/EECS- 2009-28, University of California at Berkley, USA,
February 10, 2009
[6] V. Bharadwaj, D. Ghose, and T. Robertazzi. Divisible load theory:
a new paradigm for load scheduling in distributed systems. Cluster
Computing, 6(1):7-17, 2003
[7] D. Ghose and T. Robertazzi, editors, "Special issue on Divisible Load
Scheduling", Cluster Computing, 6, 1, 2003
[8] Maciej Drozdowski and Pawel Wolniewicz, "Experiments with
Scheduling Divisible Tasks in Cluster of Workstations", Euro-Par 2000,
pp.311-319, 2000
[9] D. Altilar and Y. Paker, "An optimal Scheduling Algorithm for Parallel
Video Processing", In IEEE Int. Conference on Multimedia Computing
and Systems, IEEE Computer Society Press, 1998
[10] D. Altilar and Y. Paker, "Optimal Scheduling Algorithms for Communication Constrained Parallel Processing", In Euro-Par 2002, LNCS
2400, pp. 197-206, Springer Verlag, 2002
[11] Leila Ismail and Driss Guerchi, "Performance Evaluation of
Convolution on the IBM Cell Processor", IEEE Transactions
on Parallel and Distributed Systems, 01 Apr. 2010, IEEE
computer Society Digital Library. IEEE Computer Society,
http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.70
[12] C. Lee and M. Hamdi, "Parallel Image Processing Applications on a
Network of Workstations", Parallel Computing, Vol. 21, pp. 137-160,
1995
[13] V. Bharadwaj and S. Ranganath, "Theoretical and Experimental Study
on Large Size Image Processing Applications Using Divisible Load
Paradigms on Distributed Bus Networks", Image and Vision Computing, Vol.20, nos.13-14,pp.917-1034, 2002
[14] Ian Foster, "Designing and Building Parallel Programs", Addison-Wesley (ISBN 9780201575941), 1995
[15] V. Bharadwaj, D. Ghose, and V. Mani, "Multi-Installment Load Distribution in Tree Networks with Delays", IEEE Transactions Aerospace
and Electronics Systems, vol.31, no.2, pp.555-567, 1995
[16] Yang Yang, Krijn van der Raadt, and Henri Casanova, "Multiround
Algorithms for Scheduling Divisible Loads", IEEE Transactions on
Parallel and Distributed Systems, vol. 16, no.11, November 2005
[17] Maciej Drozdowski and Marcin Lawenda, "Multi-installment Divisi-
ble Load Processing in Heterogeneous Systems with Limited Memory",
PPAM 2005, LNCS 3911, pp.847-854, 2006
[18] Olivier Beaumont, Henri Casanova, Arnaud Legrand, Yves Robert,
Yang Yang, "Scheduling Divisible Loads on Star and Tree Networks:
Results and Open Problems", IEEE Transactions on Parallel and
Distributed Systems, vol. 16, no. 3, pp.207-218, March 2005
[19] Leila Ismail, Bruce Mills, and Alain Hennebelle, "A Formal Model
of Dynamic Resource Allocation in Grid Computing Environment",
Proceedings of The IEEE Ninth ACIS International Conference on
Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD 2008), Phuket, Thailand, pp. 685-693, 2008
[20] Yang Yang, Henri Casanova, Maciej Drozdowski, Marcin Lawenda,
Arnaud Legrand, "On the Complexity of the Multi-Round Divisible Load Scheduling", Report No. 6096, ISSN 0249-6399, INRIA, January 2007
[21] S. Bataineh, T.-Y. Hsiung, and T.G. Robertazzi, "Closed Form Solutions for Bus and Tree Networks of Processors Load Sharing a Divisible Job," IEEE Trans. Computers, vol. 43, no. 10, Oct. 1994.
[22] K. Shen, L.A. Rowe, and E.J. Delp, "A Parallel Implementation of an MPEG1 Encoder: Faster than Real-Time!," Proc. SPIE Conf. Digital
Video Compression: Algorithms and Technologies, pp. 407-418, Feb.
1995.
Fig. 8: Performance of the Scheduling Algorithm with
Increasing Platform Heterogeneity and Computing-based
Selection Policy.
Fig. 9: Performance of the Scheduling Algorithm with
Increasing Platform Heterogeneity and Network-based Selection Policy.