Reinforcement Learning and Iterative Learning Control:
Similarity and Difference
Hyo-Sung Ahn†

† Dept. of Mechatronics, Gwangju Institute of Science and Technology (GIST), 1 Oryong-dong, Buk-gu, Gwangju 500-712, Korea. [email protected]
This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute for Information Technology Advancement) (IITA-2009-C1090-0902-0008).
Abstract— This paper addresses the similarities and differences between reinforcement learning and iterative learning control. A brief summary of reinforcement learning and of iterative learning control is given; then the similarities and differences between the two algorithms are discussed.
Index Terms— Reinforcement learning, Iterative learning
control, Similarities and differences
I. INTRODUCTION
Reinforcement learning (RL) is a method for deciding a sequence of commands so as to optimize a performance index. The outcomes of actions are stored in a table, namely the Q-table. Reinforcement learning can be used both for deterministic systems and for stochastic systems governed by a Markov decision process. It updates its current Q-table based on immediate rewards and on the learned knowledge or experience obtained in the past. Iterative learning control (ILC) is a method for generating a sequence of control inputs such that a desired trajectory is achieved in a repetitive dynamic system [1]. Though ILC is typically used for deterministic systems, it can also be employed to generate a control sequence for various stochastic systems. The basic assumptions of ILC are a repetitive desired trajectory, periodic disturbances, or iteration-invariant dynamics. In terms of the iterative learning update, RL and ILC have similarities, while they differ in their learning structure. In this paper, the similarities and differences between RL and ILC are addressed in detail.
II. REINFORCEMENT LEARNING: REVIEW

This section summarizes reinforcement learning from [2]. A reinforcement learning algorithm determines an optimal path, or a set of optimal actions, given a finite set of states. At each state, reinforcement learning chooses the best action such that an optimality measure increases (or a cost decreases). It is usually assumed that an action transforms one state into another with a fixed probability. After conducting an action, the learner receives a reward from the environment, either immediately or with a delay. It stores the received reward, the current state, and the action. Then, at the next iteration, when it returns to the same state, it decides its new action based on the stored reward and action.
By this iteration, it eventually finds the best set of actions. Since the reinforcement learning algorithm seeks a new action, which might be a better choice than the previous selections, it explores a new environment. However, it also needs to exploit past information to compute a new action based on the knowledge learned from previous experience. Thus, reinforcement learning combines exploration and exploitation, which involves a trade-off. The algorithm may well find a better optimum by searching more of the environment and its possible states, but in that case the computational time can become huge, and the algorithm may not converge at all. On the contrary, if the algorithm depends only on previously learned information, it will be hard to find a better solution. For the formulation of reinforcement learning, we need to define an "agent" and an "environment". The agent applies an action to the environment; the environment then reacts to the action and gives feedback in the form of penalties or rewards. By the action, the agent also changes its state. At the new state reached by the action, the agent chooses the next action in the direction of optimization. This process repeats until the desired optimality is achieved, which is to maximize the accumulated rewards or to minimize the accumulated penalties. Basically, reinforcement learning tries to select the best action at a given state by repetition. Thus, when the agent visits a state, it should search the knowledge learned at that same state in the past. On the other hand, if the states change continuously without any repetition, then learning cannot be performed. The model of reinforcement learning consists of:
• Environment (E) and a finite set of states (S);
• Agent (AG) and a finite set of actions (A);
• Rewards or penalties (r) according to the state–action pair (s, a), where a ∈ A and s ∈ S.
The action of the agent transfers the state of the agent as follows:

    s1 −a→ s2,   s1, s2 ∈ S, a ∈ A.    (1)
The central goal of reinforcement learning is to select an action a ∈ A at every sampling time. The resulting set of actions is called the policy π. Note that in (1), the same action at the same state may result in different next states, but with fixed probabilities; typically, in reinforcement learning, this is determined by a Markov decision process. There are various algorithms for the choice of actions; one of the most important is Q-learning. Let us denote by Q*(s, a) the expected discounted reward of taking action a in state s and taking optimal actions onwards. Then Q*(s, a)
can be represented as

    Q*(s, a) = r(s, a) + γ max_{a′} Q*(s′, a′),    (2)

where s′ is the state reached from the current state s by action a, γ is the discount rate, and a′ is the action that will be taken at state s′. If the state transition is governed by a Markov decision process with probability P(s, a, s′) (Σ_{s′∈S} P(s, a, s′) = 1), which is the probability of moving from s to s′ under action a, then (2) becomes

    Q*(s, a) = r(s, a) + γ Σ_{s′∈S} P(s, a, s′) max_{a′} Q*(s′, a′).    (3)

Equations (2) and (3), however, only emphasize exploration; they do not consider exploitation. To take exploitation into account, (2) and (3) can be changed to

    Q(s, a)_{k+1} = (1 − α) Q(s, a)_k + α [ r(s, a)_{k+1} + γ max_{a′} Q(s′, a′)_k ],    (4)

    Q(s, a)_{k+1} = (1 − α) Q(s, a)_k + α [ r(s, a)_{k+1} + γ Σ_{s′∈S} P(s, a, s′) max_{a′} Q(s′, a′)_k ],    (5)

where α is the parameter weighting exploration against exploitation. If α = 1, the algorithm relies only on the newly observed information (exploration), while if α = 0, it keeps only the previously learned Q-values (exploitation). In (4) and (5), the subscript k denotes the episode number (i.e., the searching step; it corresponds to the iteration number in ILC).
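To make the update in (4) concrete, the following minimal sketch runs tabular Q-learning on a small deterministic chain environment; the environment, its size, the ε-greedy exploration, and all parameter values are illustrative assumptions rather than anything specified in the paper.

import numpy as np

# Illustrative tabular Q-learning on a deterministic chain of states 0..N-1.
# Actions: 0 = move left, 1 = move right; reward 1 only when the goal state is reached.
N_STATES, N_ACTIONS = 5, 2
GOAL = N_STATES - 1
alpha, gamma, episodes = 0.5, 0.9, 200   # weighting in (4), discount rate, number of episodes

def step(s, a):
    """Deterministic transition s --a--> s' with an immediate reward, as in (1)."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == GOAL else 0.0
    return s_next, r

Q = np.zeros((N_STATES, N_ACTIONS))      # Q-table, updated episode by episode
rng = np.random.default_rng(0)

for k in range(episodes):                # k plays the role of the episode index in (4)
    s = 0
    while s != GOAL:
        # epsilon-greedy choice: mostly exploit the stored Q-table, sometimes explore
        a = int(rng.integers(N_ACTIONS)) if rng.random() < 0.1 else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # update rule (4): blend the old Q-value with the newly observed reward
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
        s = s_next

print(np.argmax(Q, axis=1))              # greedy actions; non-goal states should favor action 1 (right)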
III. ITERATIVE LEARNING CONTROL
Iterative learning control (ILC) is a control strategy for achieving perfect tracking of a desired trajectory in repetitive systems. The repetitive system should have an initial reset at every iteration; after resetting, the whole process is repeated again and again until the desired convergence is achieved. The repetition can be defined over time, spatial, or trajectory domains, or over events. Iterative learning control is an online learning algorithm that generates a sequence of control signals on a fixed repetition domain. Through repetition, the learning controller produces the desired control signal for a fixed desired trajectory along the iteration domain. Iterative learning control follows a typical learning procedure: conduct a task, store the learned information including control efforts and measured outputs, update the control signals, and repeat the task again from the same initial point. Iterative learning control is essentially an attempt to obtain an input sequence that makes the system produce a desired output; thus, it is a mapping between input and output, and this input/output mapping is learned by iteration. Let us consider a simple linear system:
    ẋ(t) = Ax(t) + Bu(t),    (6)
    y(t) = Cx(t).    (7)
The main task of ILC is to make the output follow the desired trajectory, i.e., y(t) = y_d(t), over a given finite horizon 0 ≤ t ≤ T. It is shown that the control update u_{k+1}(t) = u_k(t) + Γ ė_k(t), where ė_k(t) = ẏ_d(t) − ẏ_k(t), ensures convergence if ‖I − CBΓ‖ < 1. Thus, ILC consists of the system and the controller. Given an input–output mapping y = f(u), ILC is a contractive-mapping algorithm: the input signals u are updated such that ‖y_d − y_{k+1}‖ = A ‖y_d − y_k‖, where the magnitude of A is less than 1, so the tracking error contracts along the iteration axis. For the current control-signal update, ILC makes use of the knowledge learned in the past; it uses the error signals and control inputs of the previous iteration.
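As a concrete illustration of this update law (not part of the original paper), the following minimal sketch applies the D-type rule u_{k+1}(t) = u_k(t) + Γ ė_k(t) to a discretized first-order plant of the form (6)–(7); the plant parameters, desired trajectory, learning gain, and number of iterations are illustrative assumptions.

import numpy as np

# Illustrative D-type ILC on a discretized first-order plant (all values are assumptions).
dt, T = 0.01, 1.0
t = np.arange(0.0, T, dt)
a, b, c = -1.0, 1.0, 1.0                 # xdot = a*x + b*u, y = c*x, as in (6)-(7)
yd = np.sin(2 * np.pi * t)               # repetitive desired trajectory with yd(0) = y(0) = 0
gamma_gain = 0.5                         # learning gain Γ; here |1 - c*b*Γ| = 0.5 < 1

def simulate(u):
    """Run one iteration of the plant from the same initial condition (initial reset)."""
    x = 0.0
    y = np.zeros_like(u)
    for i in range(len(u)):
        y[i] = c * x
        x = x + dt * (a * x + b * u[i])  # forward-Euler integration of (6)
    return y

u = np.zeros_like(t)                     # control input of the first iteration
for k in range(20):                      # iteration (trial) domain
    y = simulate(u)
    e = yd - y                           # e_k(t) = yd(t) - y_k(t)
    edot = np.gradient(e, dt)            # derivative of the error, ė_k(t)
    u = u + gamma_gain * edot            # D-type update u_{k+1}(t) = u_k(t) + Γ ė_k(t)
    print(k, float(np.max(np.abs(e))))   # tracking error shrinks along the iteration axis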
IV. DIFFERENCES

ILC is a control algorithm that generates a set of control signals such that the output signals follow a sequence of desired outputs. It requires a repetitive, periodic desired trajectory; alternatively, it can be used to achieve a desired output trajectory under periodic or repetitive disturbances. Thus, ILC is a kind of supervised learning, since it needs a repetitive reference trajectory. Also, in ILC, the system should be repetitive or stationary; otherwise, the error signals measured in previous iterations cannot be used in the current learning process. In ILC, the set of states and the set of actions are not fixed; they are sought by the learning algorithm. In reinforcement learning, however, the set of states and the set of actions are fixed, and given these sets, the reinforcement learning algorithm searches for the best solution.
Reinforcement learning does not require any reference signal or reference trajectory; instead, it simply maximizes the optimality measure (the Q-values) by measuring immediate rewards. Reinforcement learning is also more flexible than ILC in terms of repetition. In reinforcement learning, an initial reset and a periodic iteration are not necessary; it is only required that a state be exploited whenever the same state as the current one has been visited before. By exploiting the same state, it updates the Q-value in an optimal direction. Thus, although reinforcement learning essentially depends on a kind of repetition, its mathematical formulation does not strictly require exact repetition on a repetitive time domain. The basic formulation of ILC, however, is established under the assumptions of a finite repetitive time domain and an initial reset condition (i.e., every initial starting point should be the same). Though some recent ILC algorithms do not require a finite repetitive time domain, the basic ILC formulation is usually based on this assumption. Thus, in terms of repetition, reinforcement learning is more flexible and general than ILC.
V. SIMILARITIES
In reinforcement learning, the agent and the environment interact with each other. In ILC, the system can be considered the environment and the ILC controller the agent. The RL agent determines the action command, while the ILC controller decides the control signals. In RL, the action that maximizes Q(s′, a′) is selected, while in ILC, the control signal is computed by u_{k+1}(t) = u_k(t) + Γ ė_k(t). In both cases, the action and the control signal are computed from information learned in the past. That is, in RL, the Q-table is continuously updated and the stored Q-table is used for the selection of actions.
[Fig. 1. Learning process of the iterative learning controller: in the current iteration, the input u_k drives the plant, the measured output y_k is compared with y_d to form the error e_k, and the learning update in the ILC controller produces u_{k+1} for the next iteration.]

[Fig. 2. Learning process of reinforcement learning (Q-learning): the action a acts on the environment, the state s is measured, and the learning update maps the Q-table Q_k of the past episode to Q_{k+1} for the current episode.]
In this Q-table update, the stored state s′ has to be related to the current state s; that is, in RL, the current state s must be matched with a state stored in the past. Otherwise, if there is no stored information matching the current state s, the algorithm cannot decide the action. In other words, to generate an action at the current state s, there must be learned information from the past at exactly the same state s; if there is no learned knowledge at state s, RL cannot compute an optimal action. The same matching problem exists in ILC. For the control signal at the time point t, there should be learned information at exactly the same time point; that is, we need the learned knowledge at time t, namely u_k(t) and ė_k(t). Otherwise, ILC cannot update its control signal at the time point t.
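As a rough sketch of this shared matching step (names and data structures are illustrative assumptions, not from the paper), both methods can be viewed as a lookup into stored data: RL keyed by the current state, ILC keyed by the current time index.

# Minimal sketch of the "matching" step shared by RL and ILC (illustrative names only).
# RL looks up stored knowledge keyed by the visited state; ILC looks up stored
# signals keyed by the time index within the iteration.

Q = {}                        # RL: Q-values keyed by (state, action)
u_prev, edot_prev = {}, {}    # ILC: previous-iteration input and error derivative keyed by time index

def rl_select_action(s, actions):
    """Greedy action at state s, available only if s was visited before."""
    known = [(Q[(s, a)], a) for a in actions if (s, a) in Q]
    if not known:
        raise KeyError("no learned knowledge at this state; action cannot be selected")
    return max(known, key=lambda qa: qa[0])[1]

def ilc_update_input(t_idx, gain):
    """Updated input at time index t_idx, available only if previous-iteration signals exist."""
    if t_idx not in u_prev or t_idx not in edot_prev:
        raise KeyError("no stored u_k / edot_k at this time point; input cannot be updated")
    return u_prev[t_idx] + gain * edot_prev[t_idx]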
RL has to store the measured rewards and update the Q-table, while ILC has to store the error signals; both algorithms thus have a storing process. Then, at a given state, RL searches the Q-values stored from previous steps, while ILC searches the error and control signals from previous iterations corresponding to the current time point. RL then selects the action with the maximum Q-value at the current state, whereas ILC updates the control signals by a contractive-mapping process, which is basically an optimization technique. Table I summarizes the comparison between RL and ILC. Fig. 1 depicts the learning process of the iterative learning controller and Fig. 2 depicts the learning process of reinforcement learning. As shown in these figures, ILC and RL have similar update structures and storage mechanisms.

TABLE I
COMPARISON BETWEEN RL AND ILC

  RL                              ILC
  ------------------------------  --------------------------------
  Agent                           ILC controller
  Environment                     System equations
  Q-table update                  Data storage
  State–action pair (s, a)        Sampling time point t
  a = arg max_{a′} Q*(s′, a′)     u_{k+1}(t) = u_k(t) + Γ ė_k(t)
  Rewards r                       Error e_k
VI. CONCLUSIONS
This paper has been devoted to a discussion of the similarities and differences between iterative learning control (ILC) and reinforcement learning (RL). ILC has been widely studied
in the field of engineering, while RL has been studied in the
area of computer science. ILC is essentially a quantitative learning algorithm, while RL is a qualitative learning algorithm. Both methods require some form of repetition and experience. The learning structures of the two algorithms are similar, while the controller update mechanisms are different. In a future publication, a more detailed discussion of these topics will be reported.
REFERENCES

[1] H.-S. Ahn, K. L. Moore, and Y. Chen, Iterative Learning Control: Robustness and Monotonic Convergence for Interval Systems, Communications and Control Engineering. Springer, 2007.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.