
Dynamic parallel machine scheduling with random breakdowns
using the learning agent
Biao Yuan, Zhibin Jiang* and Lei Wang
Department of Industrial Engineering & Management,
Research Institute for Service Science & Enterprise Innovation,
Shanghai Jiao Tong University,
800 Dong Chuan Road, Shanghai 200240, P.R.China
Email: [email protected]
Email: [email protected]
Email: [email protected]
* Corresponding author
Abstract: Agent technology has been widely applied in manufacturing processes due to its flexibility, autonomy, and scalability. In this paper, a learning agent is proposed to solve a dynamic parallel machine scheduling problem that considers random breakdowns. The duty of the agent, which is based on the Q-Learning algorithm, is to dynamically assign arriving jobs to
idle machines according to the current state of its environment. A state-action table involving
machine breakdowns is constructed to define the state of the agent’s environment. Three rules,
including SPT (Shortest Processing Time), EDD (Earliest Due Date) and FCFS (First Come First Served), are used as the actions of the agent, and the ε-greedy policy is adopted by the agent to select
an action. In the simulation experiment, two different objectives, including minimizing the
maximum lateness and minimizing the percentage of tardy jobs, are utilized to validate the ability
of the learning agent. The results demonstrate that the proposed agent is suitable for the complex
parallel machine environment.
Keywords: parallel machine, dynamic scheduling, reinforcement learning, Q-Learning, learning
agent, machine breakdowns
1 Introduction
Scheduling is one of the key problems in manufacturing systems, and machine scheduling problems have been widely addressed in the literature. Most previous studies on scheduling problems assume that all required information about the set of jobs is available at the initial time; thus, most approaches schedule jobs in a static manner. However, real-time events often occur on the shop floor. They can be classified into two categories: resource-related and job-related. The former includes machine breakdowns, operator illness, tool failures, loading limits and so on. The latter includes rush jobs, job cancellation, due date changes, early or late arrival of jobs, change in job processing time and so on (Ouelhadj and Petrovic, 2009). Due to the emergence of real-time events, the relationship between jobs and the shop floor is no longer static, so purely static scheduling systems are not suitable in real life (Aydin and Öztemel 2000). A dynamic scheduling system is therefore more appropriate than a static one in real-world applications, and this is particularly relevant to the application of Internet of Things (IoT) technologies in manufacturing enterprises (Huang et al. 2007; Fang et al. 2012).
In this paper, we study a dynamic parallel machine scheduling problem with random breakdowns. As one of the elementary machine scheduling problems, parallel machine scheduling is important from both a theoretical and a practical point of view. From a theoretical viewpoint, it is a generalization of single machine scheduling and a special case of flexible flow shop scheduling. From a practical viewpoint, it is important because resources arranged in parallel are common in the real world (Pinedo 2002). The dynamic parallel machine scheduling problem has been addressed by several researchers. However, most parallel machine scheduling methods are proposed for a certain objective or a class of problems (Tang et al. 2010, Lee et al. 2010, Ying and Cheng 2010), and may not fit different complex dynamic manufacturing situations.
Recently, reinforcement learning algorithms have been applied to machine scheduling, especially the Q-Learning algorithm due to its model-free characteristic. Aydin and Öztemel (2000) addressed a dynamic job shop scheduling problem with the objective of minimizing mean tardiness using the Q-III learning algorithm. Experimental results demonstrated that the proposed algorithm outperformed each of three rules: SPT (Shortest Processing Time), COVERT (C over T), and CR (Critical Ratio). Wang and Usher (2004, 2005) applied the Q-Learning algorithm to dynamically schedule a single machine with different objectives, discussed the main factors that affect the performance of the algorithm, and suggested how to set its parameters. Zhang et al. (2007) proposed the Q-Learning algorithm to solve a dynamic unrelated parallel machine scheduling problem with minimization of the mean weighted tardiness. Five rules, including WSPT (Weighted Shortest Processing Time), WMDD (Weighted Modified Due Date), WCOVERT (Weighted COVERT), RATCS (Ranking Apparent Tardiness Cost with Setups) and LFJ-WCOVERT (Least Flexible Job – Weighted COVERT), were utilized as actions, and the Q-value function was approximated by a linear function with the gradient-descent method. Wang et al. (2007) used the Q-Learning algorithm to dynamically schedule a single machine with three different objectives: minimizing the maximum tardiness, the number of tardy jobs, and the mean flow time. Besides the Q-Learning algorithm, other learning algorithms, such as the R-Learning algorithm (Zhang et al. 2012) and the Sarsa algorithm (Zhang et al. 2011), have also been proposed to solve machine scheduling problems. All the above research demonstrates the effectiveness of learning algorithms. Machine breakdown is one of the real-time events that often occur during the manufacturing process. However, to the best of our knowledge, little attention has been paid to machine scheduling with breakdowns using learning algorithms.
This paper focuses on solving the dynamic parallel machine scheduling problem with random breakdowns using a learning agent based on the Q-Learning algorithm. The remainder of the paper is organized as follows. Section 2 models the problem with the learning agent. Section 3 presents the details of the Q-Learning algorithm. Section 4 gives the parameters and process of the simulation and discusses the experimental results. Finally, the conclusions are presented in Section 5.
2 Learning agent for parallel machine scheduling
The relationship between the agent and its environment is illustrated in Figure 1. The learning agent, whose duty is to assign arriving jobs to machines, contains three components: a perception module, which perceives the current state of its environment; a cognition module, which evaluates the current state and produces an appropriate action; and an action module, which obtains information about the selected action and sends it back to the environment. The cognition module consists of a decision maker, a set of objectives, a set of behavioral rules (action list), and a learning mechanism which generates rewards (i.e., a way of accumulating experience or knowledge) according to the feedback of the environment after the completion of an action.
Figure 1 Relationship between the learning agent and the system environment (the cognition module, containing the decision maker, objectives, action list and Q-value learning mechanism, receives states from the perception module and issues actions through the action module; the environment returns rewards and new states)
The environment of the agent is a parallel machine system which includes a single buffer for
storing jobs required to be processed. The jobs arrive randomly, and are scheduled on m identical
machines. The arrival time of job j is aj, its processing time is tj, and its due date is dj. If a machine is idle when a job arrives, the job begins processing immediately; otherwise, it is stored in the buffer.
To facilitate the discussion, the assumptions considered in this research are as follows:
a) Each job consists of one operation and needs to be processed on only one machine;
b) Each machine can process only one job at a time;
c) A job can be interrupted when a machine breakdown occurs;
d) At most one machine is broken down at any given time;
e) The remaining processing of an interrupted job can be completed by another machine;
f) The capacity of the buffer is infinite.
Our goal is to use the Q-Learning algorithm as the learning mechanism of the agent in this study.
To apply the algorithm, the problem should be converted into a reinforcement learning problem.
The conversion process, including modeling the agent’s environment, actions, action-selection
policy and reward function, is depicted in the next section.
3 Q-Learning algorithm
3.1 Steps of Q-Learning algorithm
The Q-Learning algorithm is one of the most widely applied reinforcement learning algorithms based on value iteration. It was first proposed by Watkins (1989) to approximately solve large-scale Markov decision processes (MDPs) or semi-Markov decision processes (SMDPs). The Q-Learning algorithm is rooted in the Bellman optimality principle. Let Q(s, a), called the Q-value or the state-action pair value, denote the long-term expected reward for each pair of state and action (denoted by s and a). The Q-values have been proven to converge to the optimal state-action values (Watkins and Dayan 1992). The formal steps of the algorithm are
described as follows (Sutton and Barto 1998):
Step 1: Initialize the Q(s, a) values arbitrarily.
Step 2: Perceive the current state s0.
Step 3: Select an action (a) for the given state s0 according to a certain policy.
Step 4: Execute the selected action (a), receive the immediate reward (r), and perceive the next
state s1.
Step 5: Update the Q-value as follows:
Q(s0, a) = Q(s0, a) + α [r + γ max_a' Q(s1, a') − Q(s0, a)]    (1)
Step 6: Set s0 equal to s1.
Step 7: Go to Step 3 until the terminal state is reached.
Step 8: Repeat Steps 2-7 for a number of episodes.
The iteration of Steps 2-7 represents a learning cycle, called an "episode". The step-size parameter (α) controls the learning rate; its value can be constant or vary from step to step. The discount-rate parameter (γ) determines the present value of future rewards and is set between zero and one. The Q(s, a) values can be initialized arbitrarily; generally speaking, if no action is preferred for a state, all the Q(s, a) values are set to the same value at the beginning of the Q-Learning algorithm. The goal of Step 3 is to keep a balance between exploration and exploitation, and many action-selection approaches, including greedy, ε-greedy and softmax action selection, can be applied in this step (Whiteson et al. 2007).
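To make the above steps concrete, the following Python sketch implements one episode of tabular Q-Learning under stated assumptions: the environment object `env` and its `reset`/`step` methods are illustrative placeholders, not part of the original paper, and the ε-greedy choice in Step 3 is one of the policies mentioned above.

```python
import numpy as np

def run_episode(env, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One learning episode of tabular Q-Learning (Steps 2-7 of Section 3.1).

    `env` is a hypothetical environment with reset() -> state and
    step(action) -> (next_state, reward, done); Q is an (n_states, n_actions) array.
    """
    rng = np.random.default_rng()
    s = env.reset()                                  # Step 2: perceive the current state
    done = False
    while not done:                                  # Step 7: repeat until the terminal state
        if rng.random() < epsilon:                   # Step 3: epsilon-greedy action selection
            a = int(rng.integers(Q.shape[1]))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)                # Step 4: execute action, observe reward
        # Step 5: Q-value update according to Eq. (1)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next                                   # Step 6
    return Q
```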
3.2 State determination criteria
The decision on which rule is employed for selecting a job from the buffer is based on the current
state of the system. There are several ways for describing the system states, such as the number of
jobs in the buffer, the lateness or tardiness of jobs in the buffer, and the mean slack time of jobs in
the buffer. In our study, the number of jobs (N), the mean lateness of jobs (ML), and the mean
processing time of jobs (MPT) in the buffer are selected as the state determination criteria.
Meanwhile, machine breakdowns lessen the availability of machines and influence the capacity of
the production system. Some situations should be treated differently: (1) when no job or only one job exists in the buffer (i.e., N = 0 or 1); (2) when a machine breakdown occurs (i.e., a job is interrupted and should be finished by another machine). These situations correspond to three dummy states, in which no decision needs to be made to select a rule. As can be seen, the state space of the system is continuous. A discretization method based on ML and MPT can be adopted to discretize the state space (Wang and Usher 2004, 2005). Thus, the state-action table can be constructed; Table 1 presents an example with n+3 states and m actions.
Table 1 State-action table

State      Determination criteria                    Action 1    Action 2    ...    Action m
Dummy 1    N = 0 (no job in buffer)                  0           0           ...    0
Dummy 2    N = 1 (one job in buffer)                 0           0           ...    0
Dummy 3    Machine breakdown occurs                  0           0           ...    0
1          N > 1 & ML < 0                            Q(1, 1)     Q(1, 2)     ...    Q(1, m)
2          N > 1 & 0 ≤ ML < MPT                      Q(2, 1)     Q(2, 2)     ...    Q(2, m)
...        ...                                       ...         ...         ...    ...
n-1        N > 1 & (n - 3) * MPT ≤ ML < (n - 2) * MPT   Q(n-1, 1)   Q(n-1, 2)   ...    Q(n-1, m)
n          N > 1 & (n - 2) * MPT ≤ ML                Q(n, 1)     Q(n, 2)     ...    Q(n, m)
Due to the existence of the three dummy states, the Q-value is updated differently in different situations. If both the current state and the next state are dummy states, the Q-value is not updated; if the current state is not a dummy state but the next state is, the Q-value is updated according to Eq. (2); otherwise, it is updated by Eq. (1).
Q(s0, a) = Q(s0, a) + α r    (2)
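As an illustration of the state determination criteria in Table 1 and of the dummy-state update rules above, the sketch below maps the buffer statistics (N, ML, MPT) and a breakdown flag to a state index and then applies Eq. (1) or Eq. (2). The split into 15 regular and 3 dummy states (matching the 18 states reported in Section 4.1) and all function names are assumptions made for illustration.

```python
import numpy as np

N_REGULAR = 15    # assumed number of non-dummy states; with the 3 dummy states this
                  # matches the 18 states of Section 4.1 (the exact split is an inference)
DUMMY_N0, DUMMY_N1, DUMMY_BREAKDOWN = N_REGULAR, N_REGULAR + 1, N_REGULAR + 2

def determine_state(n_jobs, mean_lateness, mean_proc_time, breakdown):
    """Map buffer statistics (N, ML, MPT) and a breakdown flag to a state index (Table 1)."""
    if breakdown:
        return DUMMY_BREAKDOWN
    if n_jobs == 0:
        return DUMMY_N0
    if n_jobs == 1:
        return DUMMY_N1
    if mean_lateness < 0:
        return 0                                   # state 1 in Table 1: N > 1 and ML < 0
    # remaining states partition ML into intervals of width MPT, with the last one open-ended
    k = int(mean_lateness // mean_proc_time) + 1
    return min(k, N_REGULAR - 1)

def is_dummy(state):
    return state >= N_REGULAR

def update_q(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply Eq. (1) or Eq. (2) depending on whether s and s_next are dummy states."""
    if is_dummy(s) and is_dummy(s_next):
        return                                                        # no update
    if not is_dummy(s) and is_dummy(s_next):
        Q[s, a] += alpha * r                                          # Eq. (2)
    else:
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])  # Eq. (1)
```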
3.3 Action
To make full use of the agent's learning ability, simple dispatching rules or heuristics are often adopted as actions. Decision-making (i.e., action selection) occurs when there is at least one idle machine and more than one job in the buffer. In this study, we adopt SPT, EDD (Earliest Due Date), and FCFS (First Come First Served) as the three actions.
Action 1 (SPT): Select the job whose processing time is the shortest in the buffer.
Action 2 (EDD): Select the job whose due date is the earliest in the buffer.
Action 3 (FCFS): Select the job which arrives earliest in the buffer.
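A minimal sketch of the three rules as job-selection functions is given below; it assumes each job is represented by a simple record with arrival time, processing time, and due date (the field names are illustrative, not taken from the paper).

```python
from dataclasses import dataclass

@dataclass
class Job:
    arrival: float       # a_j
    proc_time: float     # t_j
    due_date: float      # d_j

def spt(buffer):   # Action 1: shortest processing time
    return min(buffer, key=lambda j: j.proc_time)

def edd(buffer):   # Action 2: earliest due date
    return min(buffer, key=lambda j: j.due_date)

def fcfs(buffer):  # Action 3: earliest arrival
    return min(buffer, key=lambda j: j.arrival)

ACTIONS = [spt, edd, fcfs]   # the action list used by the agent
```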
3.4 Action-selection policy
In the Q-Learning algorithm, the agent selects an action in terms of exploration and exploitation. Exploration means that the agent tries actions that have not been selected before, whereas exploitation means that the agent favors actions that have been taken and rewarded before. The former gives the agent more chances to maximize the total reward in the long run, while the latter guarantees a better immediate reward. The ε-greedy method is used as the exploration-exploitation policy in this study. With this method, the agent selects the action with the largest Q-value with probability (1-ε); otherwise, it randomly selects an action with probability ε.
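The ε-greedy choice described above can be written compactly as in the following sketch; ties among equal Q-values are broken by argmax's first-index convention.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon=0.1, rng=None):
    """Return an action index: explore with probability epsilon, otherwise exploit."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # exploration: a random action
    return int(np.argmax(q_row))               # exploitation: the action with the largest Q-value
```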
3.5 Reward function
All actions selected by the agent during the learning process are partially determined by the reward function. A reward function is defined based on the goal of the learning agent and links an action with an immediate reward. Since the agent always tries to maximize the total reward, a reasonable reward function should be adopted as a guide for the agent to achieve the goal.
In this study, we consider two different objectives to test the performance of the proposed
algorithm. One is to minimize the maximum lateness. The reward function is defined as: if the
lateness of a finished job is greater than the maximum lateness of previously completed jobs, the
agent receives a reward of -1; otherwise, it receives a reward of +1. The other is to minimize the
percentage of tardy jobs. The reward function is defined as: if the finished job is tardy, the agent
receives a reward of -1; otherwise, it receives a reward of +1.
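The two reward functions can be expressed directly from these definitions. The sketch below assumes the completion time and due date of the finished job, and a running record of the maximum lateness so far, are available to the agent.

```python
def reward_max_lateness(completion, due_date, max_lateness_so_far):
    """Objective 1: minimize maximum lateness.

    Returns (reward, updated maximum lateness so far)."""
    lateness = completion - due_date
    if lateness > max_lateness_so_far:
        return -1, lateness              # a new worst lateness is penalized
    return +1, max_lateness_so_far

def reward_tardy(completion, due_date):
    """Objective 2: minimize the percentage of tardy jobs."""
    return -1 if completion > due_date else +1
```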
4 Experiments
The simulation experiment is carried out to examine the effect of utilizing the Q-Learning algorithm to train a learning agent to select proper rules. The simulation is developed in MATLAB and performed on a computer with an Intel Core 2 Duo 2.93 GHz CPU and 2 GB of RAM. The following subsections give the details of the simulation.
4.1 Parameters
The time between job arrivals follows an exponential distribution with a mean of 5.0. The
processing times of jobs are uniformly distributed between 7.5 and 8.5. The due date of each job is calculated by Eq. (3), where the tightness factor is uniformly distributed
between 1.5 and 2.5. The time between failures and the time to repair of the machines follow exponential distributions with means (MTBF and MTTR) of 1000.0 and 50.0, respectively.
Due date (dj) = Arrival time (aj) + Tightness factor * Processing time (tj)    (3)
In the Q-Learning algorithm, all state-action values Q(s, a) are initialized to zero, assuming no prior knowledge of which rule is best in any situation. The total number of states is 18. Following other example systems (Sutton and Barto 1998), the step-size parameter (α), the discount-rate parameter (γ), and the exploration probability of the ε-greedy method (ε) are set to 0.1, 0.9 and 0.1, respectively.
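Under the distributions listed above, job attributes and breakdown events might be sampled as in the following sketch; the random seed and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_job(clock):
    """Sample one job according to the parameters of Section 4.1."""
    arrival = clock + rng.exponential(5.0)          # interarrival time ~ Exp(mean 5.0)
    proc_time = rng.uniform(7.5, 8.5)               # processing time ~ U(7.5, 8.5)
    tightness = rng.uniform(1.5, 2.5)               # tightness factor ~ U(1.5, 2.5)
    due_date = arrival + tightness * proc_time      # Eq. (3)
    return arrival, proc_time, due_date

def next_breakdown(clock):
    """Sample the next failure instant and its repair duration."""
    fail_time = clock + rng.exponential(1000.0)     # MTBF = 1000.0
    repair_duration = rng.exponential(50.0)         # MTTR = 50.0
    return fail_time, repair_duration
```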
4.2 Simulation process
Two machines are considered in the parallel machine scheduling system. Machine breakdowns and repairs occur according to the parameters set in Section 4.1. Fifty thousand jobs are processed as a system warm-up; after the warm-up, another 50,000 jobs are processed by the system, and the values of the objectives are recorded as observations for the experiment. The simulation is based on an event-driven mechanism.
The simulation time (simulation clock) updates as follows: 1) when both machines are busy,
increase the simulation time to the smallest one among finish times of two machines and the time
of the nearest machine breakdown; 2) when both machines are idle, increase the simulation time to
the smaller one between the arrival time of next job and the time of the nearest machine
breakdown; 3) when one machine is busy and the other is idle, increase the simulation time to the
smallest one among the finish time, the arrival time of next job, and the time of the nearest
machine breakdown; 4) when one machine is broken and the other is idle, increase the simulation
time to the smaller one between the arrival time of next job and the recovery time of the broken
machine; 5) when one machine is broken and the other is busy, increase the simulation time to the
smaller one between the finish time and the recovery time of the broken machine.
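The five clock-update cases above can be summarized as taking the minimum over the event times relevant to the current machine statuses; the sketch below is one way to express this for two machines, assuming each machine is tracked as a small dictionary (all names are illustrative, and the generalization to a single minimum is our interpretation of the listed cases).

```python
def advance_clock(machines, next_arrival, next_breakdown):
    """Advance the simulation clock to the next relevant event.

    `machines` is a list of dicts with keys 'status' ('busy', 'idle' or 'broken'),
    'finish' (completion time if busy) and 'recovery' (repair time if broken).
    """
    candidates = []
    statuses = [m['status'] for m in machines]
    if 'idle' in statuses:
        candidates.append(next_arrival)          # an idle machine waits for the next arrival
    for m in machines:
        if m['status'] == 'busy':
            candidates.append(m['finish'])       # job completion event
        elif m['status'] == 'broken':
            candidates.append(m['recovery'])     # machine repair event
    if 'broken' not in statuses:
        candidates.append(next_breakdown)        # next failure (at most one breakdown at a time)
    return min(candidates)
```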
Once a machine finishes a job or is repaired (i.e., becomes idle), the assignment process begins. Whenever an assignment event happens, the learning agent perceives the current state of its
environment, evaluates it, and selects an action (i.e., rule) from the action list based on a certain
policy (e.g. ε-greedy method). The rule is applied to select one of the jobs waiting in the buffer.
Then the job is assigned to the idle machine, and is removed from the buffer. The stopping
condition is that all jobs are finished.
4.3 Results and discussions
In order to demonstrate the effectiveness of the Q-Learning algorithm, SPT, EDD, and FCFS are adopted individually in the experiments using the same parameters. Table 2 lists the selection percentage of each of the three rules in the Q-Learning algorithm after the warm-up of the simulation. Table 3 lists the results of SPT, EDD, FCFS, and Q-Learning for the different objectives. Figures 2 and 3 depict the trend of the rule selection percentages during the warm-up and the trend of the objective values during the scheduling process, respectively.
Table 2 Selection percentage of rules in the Q-Learning algorithm (%)

Rules    Max Lateness    Percentage of Tardy Jobs
SPT      30.49           80.85
EDD      36.98           9.97
FCFS     32.53           9.18
Table 3 Results of the four algorithms

Rules         Max Lateness    Percentage of Tardy Jobs (%)
SPT           1061.24         22.49
EDD           152.59          39.59
FCFS          153.91          39.38
Q-Learning    258.92          26.57
Figure 2 Simulation results for rule selection percentage during the warm-up: (a) rule selection percentage for Max Lateness; (b) rule selection percentage for Percentage of Tardy Jobs (x-axis: number of jobs * 1000; y-axis: percentage of rules, %; curves: SPT, EDD, FCFS)
Figure 3 Simulation results for two objectives during the scheduling process: (a) max lateness changes of the four algorithms; (b) percentage of tardy jobs changes of the four algorithms (x-axis: number of jobs * 1000; curves: Q-Learning, SPT, EDD, FCFS)
The following conclusions can be drawn from the observations in the above tables and figures. Firstly, although the relative percentages of the selected rules are not known at the beginning, they become stable as the number of arriving jobs increases. This phenomenon can be observed in Fig. 2, especially in Fig. 2(a). Secondly, the best rule for a given objective is not known at first; after the warm-up, it can be identified by comparing the selection percentages, since the best rule has the largest selection percentage. As can be seen in Fig. 2 and Table 2, among the three rules, EDD and SPT are best for minimizing the maximum lateness and the percentage of tardy jobs, respectively. For minimization of max lateness, the selection percentage of the best rule EDD is about 36.98%; for minimization of the percentage of tardy jobs, the selection percentage of the best rule SPT is about 80.85%. Thirdly, none of the three rules (SPT, EDD, and FCFS) is best for both objectives. For example, SPT is best for minimizing the percentage of tardy jobs but worst for minimizing the maximum lateness. The results achieved by the Q-Learning algorithm for the different objectives are stable and close to those obtained by the corresponding best rules. In summary, the learning agent based on the Q-Learning algorithm is able to select the best rule for different objectives and is suitable for the parallel machine scheduling system.
5 Conclusions
This paper proposed a learning agent to solve the dynamic parallel machine scheduling problem with random breakdowns. The state-action table, actions, reward function, and action-selection policy of the Q-Learning algorithm, which served as the learning mechanism of the agent, were described at length. The details of the simulation process were presented, and the comparison among the results of the four approaches for two different objectives showed that the proposed agent is well suited to a complex dynamic machine scheduling environment.
This study provides a basis for further research into applying the Q-Learning algorithm or other reinforcement learning algorithms to more complex agent-based scheduling systems, such as flexible flow shop scheduling and job shop scheduling.
Acknowledgment
The authors are grateful to the editors and the anonymous referees for their helpful comments and
suggestions. This research work is supported by the National Natural Science Foundation of China
(Grant No. 61203322).
References
Aydin, M.E. & Oztemel, E., 2000. Dynamic job-shop scheduling using reinforcement learning agents.
Robotics and Autonomous Systems, 33 (2–3), 169-178.
Fang, J., Huang, G.Q. & Li, Z., 2012. Event-driven multi-agent ubiquitous manufacturing execution
platform for shop floor work-in-progress management. International Journal of Production
Research, 51 (4), 1168-1185.
Huang, G.Q., Zhang, Y.F. & Jiang, P.Y., 2007. RFID-based wireless manufacturing for walking-worker assembly islands with fixed-position layouts. Robotics and Computer-Integrated
Lee, Z.J., Lin, S.W. & Ying, K.C., 2010. Scheduling jobs on dynamic parallel machines with
sequence-dependent setup times. International Journal of Advanced Manufacturing
Technology, 47 (5-8), 773-781.
Leitao, P., 2009. Agent-based distributed manufacturing control: A state-of-the-art survey. Engineering
Applications of Artificial Intelligence, 22 (7), 979-991.
Ouelhadj, D. & Petrovic, S., 2009. A survey of dynamic scheduling in manufacturing systems. Journal of Scheduling, 12 (4), 417-431.
Pinedo, M., 2002. Scheduling: Theory, algorithms, and systems, 2nd edn. Prentice-Hall, Englewood
Cliffs, New Jersey.
Sutton, R.S. & Barto, A.G., 1998. Reinforcement learning: An introduction. The MIT Press, Cambridge,
Massachusetts, London, England.
Tang, L.X., Jiang, S.J. & Liu, J.Y., 2010. Rolling horizon approach for dynamic parallel machine
scheduling problem with release times. Industrial & Engineering Chemistry Research, 49 (1),
381-389.
Ying, K.C. & Cheng, H.M., 2010. Dynamic parallel machine scheduling with sequence-dependent
setup times using an iterated greedy heuristic. Expert Systems with Applications, 37 (4),
2848-2852.
Wang, S.J., Sun, S., Zhou, B.H. & Xi, L.F., 2007. Q-learning based dynamic single machine scheduling.
Journal of Shanghai Jiaotong University, 41 (8), 1227-1232+1243. (in Chinese).
Wang, Y.C. & Usher, J.M., 2004. Learning policies for single machine job dispatching. Robotics and
Computer-Integrated Manufacturing, 20 (6), 553-562.
Wang, Y.C. & Usher, J.M., 2005. Application of reinforcement learning for agent-based production
scheduling. Engineering Applications of Artificial Intelligence, 18 (1), 73-82.
Watkins, C.J. & Dayan, P., 1992. Q-learning. Machine learning, 8 (3-4), 279-292.
Watkins, C.J.C.H., 1989. Learning from delayed rewards. PhD thesis, Cambridge University, Cambridge, England.
Whiteson, S., Taylor, M.E. & Stone, P., 2007. Empirical studies in action selection with reinforcement
learning. Adaptive Behavior, 15 (1), 33-50.
Zhang, Z., Zheng, L., Hou, F. & Li, N., 2011. Semiconductor final test scheduling with Sarsa(λ, k) algorithm. European Journal of Operational Research, 215 (2), 446-458.
Zhang, Z., Zheng, L., Li, N., Wang, W., Zhong, S. & Hu, K., 2012. Minimizing mean weighted
tardiness in unrelated parallel machine scheduling with reinforcement learning.
Computers & Operations Research, 39 (7), 1315-1324.
Zhang, Z., Zheng, L. & Weng, M., 2007. Dynamic parallel machine scheduling with mean weighted
tardiness objective by Q-learning. International Journal of Advanced Manufacturing
Technology, 34 (9-10), 968-980.