Reinforcement Learning and Iterative Learning Control: Similarity and Difference

Hyo-Sung Ahn†

† Dept. of Mechatronics, Gwangju Institute of Science and Technology (GIST), 1 Oryong-dong, Buk-gu, Gwangju 500-712, Korea. [email protected] This research was supported by the MKE (The Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute for Information Technology Advancement) (IITA-2009-C1090-0902-0008).

Abstract— This paper addresses similarities and differences between reinforcement learning and iterative learning control. A brief summary of reinforcement learning and of iterative learning control is given; then the similarities and differences between the two algorithms are discussed.

Index Terms— Reinforcement learning, Iterative learning control, Similarities and differences

I. INTRODUCTION

Reinforcement learning (RL) is a method for deciding a sequence of commands in the direction of optimizing a performance index. Outcomes of actions are stored in a table, namely the Q-table. Reinforcement learning can be used both for deterministic systems and for stochastic systems governed by a Markov decision process. Reinforcement learning updates its current Q-table based on immediate rewards and on learned knowledge or experience obtained in the past.

Iterative learning control (ILC) is a method for generating a sequence of control inputs such that a desired trajectory can be obtained in a repetitive dynamic system [1]. Though ILC is typically used for deterministic systems, it can also be employed to generate a control sequence for various stochastic systems. The basic assumptions of ILC systems are a repetitive desired trajectory, periodic disturbances, or iteration-invariant dynamics. In terms of iterative learning update, RL and ILC have similarities, while they differ in their learning structure. In this paper, detailed issues on the similarities and differences between RL and ILC are addressed.

II. REINFORCEMENT LEARNING: REVIEW

This section summarizes reinforcement learning from [2]. The reinforcement learning algorithm determines an optimal path, or a set of optimal actions, given a finite set of states. At each state, reinforcement learning decides the best action such that an optimality measure increases (or a cost decreases). It is usually assumed that an action transforms one state into another state with a fixed probability. After conducting an action, the agent receives a reward from the environment, immediately or with a delay. It stores the received reward, the current state, and the action. Then, at the next iteration, when it returns to the same state, it decides its new action based on the stored reward and action. By this iteration, it eventually finds the best action set.

Since the reinforcement learning algorithm seeks a new action, which might be a better choice than the previous selections, it explores a new environment. However, it also needs to exploit the past information to compute a new action based on knowledge learned from previous experience. Thus, reinforcement learning performs both exploration and exploitation, which involves a trade-off. The algorithm may certainly find a better optimum by searching more of the environment and the possible states; but in that case the computational time could be huge, and the algorithm may not converge at all. On the contrary, if the algorithm depends only on the past learned information, it will be hard to find a better solution.

For the formulation of reinforcement learning, we need to define an "agent" and an "environment". The agent applies an action to the environment; the environment then reacts to the action and gives feedback in the form of penalties or rewards. Also, by the action, the agent changes its state. Then, at the new state reached by the action, the agent chooses an action in the direction of optimization. This process repeats until the desired optimality is achieved. The desired optimality is to maximize the accumulated rewards, or to minimize the accumulated penalties. Basically, reinforcement learning tries to select the best action at a given state by repetition. Thus, when the agent visits a state, it should search the knowledge learned at the same state in the past. On the other hand, if the states are continuously changing without any repetition, then learning cannot be performed.

The model of reinforcement learning consists of:

• an environment (E) and a finite set of states (S);
• an agent (AG) and a finite set of actions (A);
• rewards or penalties (r) according to the state-action pair (s, a), where a ∈ A and s ∈ S.

The action of the agent transfers the state of the agent such that

s1 → s2 under action a,   s1, s2 ∈ S, a ∈ A.   (1)

The central goal of reinforcement learning is to select the best action a ∈ A at every sampling time; the resulting set of actions is called the policy π. Note that in (1), the same action taken at the same state may result in different next states, but with a fixed probability; typically, in reinforcement learning, this transition is governed by a Markov decision process.
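For illustration only, the agent-environment interaction described above can be summarized in a minimal Python sketch. The toy Environment class, its state set, the two actions, and the reward values below are assumptions made for this example and are not part of the original formulation.

import random

# Minimal sketch of the agent-environment interaction described above.
# States S = {0, 1, 2, 3}, actions A = {-1, +1}, and the reward values
# are illustrative assumptions.
class Environment:
    def __init__(self):
        self.state = 0

    def reset(self):
        # Return to the initial state before a new episode.
        self.state = 0
        return self.state

    def step(self, action):
        # The action transfers the state as in (1); here the transition is
        # deterministic, but a fixed transition probability could be used.
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0  # reward r(s, a)
        return self.state, reward

env = Environment()
state = env.reset()
for t in range(10):                   # one episode of interaction
    action = random.choice([-1, +1])  # a policy would normally choose this
    state, reward = env.step(action)  # environment returns new state and reward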
There are various algorithms for the choice of actions. One of the most important algorithms is Q-learning. Let us denote by Q*(s, a) the expected discounted reward of taking action a in state s and taking optimal actions onwards. Then Q*(s, a) can be represented as

Q*(s, a) = r(s, a) + γ max_{a′} Q*(s′, a′),   (2)

where s′ is the state transferred from the current state s by action a, γ is the discount rate, and a′ is the action that will be taken at the state s′. If the state transition is governed by a Markov decision process with probability P(s, a, s′) (with Σ_{s′∈S} P(s, a, s′) = 1), which is the probability of the change from s to s′ by action a, then (2) can be changed to

Q*(s, a) = r(s, a) + γ Σ_{s′∈S} P(s, a, s′) max_{a′} Q*(s′, a′).   (3)

The above equations (2) and (3), however, only emphasize exploration; they do not consider exploitation. To take exploitation into account, we can change (2) and (3) to

Q(s, a)_{k+1} = (1 − α) Q(s, a)_k + α [ r(s, a)_{k+1} + γ max_{a′} Q(s′, a′)_k ],   (4)

Q(s, a)_{k+1} = (1 − α) Q(s, a)_k + α [ r(s, a)_{k+1} + γ Σ_{s′∈S} P(s, a, s′) max_{a′} Q(s′, a′)_k ],   (5)

where α is the parameter weighting between exploration and exploitation. If α = 1, then the algorithm only explores new states, while if it is zero, then it only considers the learned knowledge (exploitation). In (4) and (5), the subscript k represents the episode number (i.e., the searching step; it corresponds to the iteration number in ILC).
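As a concrete illustration of update (4), the following Python sketch stores Q-values in a dictionary keyed by state-action pairs. The finite state and action sets, the parameter values, and the function name q_update are assumptions made for the example, not notation from the paper.

from collections import defaultdict

# Q-table: Q[(s, a)] stores the learned value of taking action a in state s.
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # One application of update (4): blend the previous Q-value with the
    # immediate reward plus the discounted best Q-value at the next state.
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)

# Example: after observing state s = 0, action a = +1, reward r = 0.0,
# and next state s' = 1, with the action set A = {-1, +1}:
q_update(0, +1, 0.0, 1, actions=[-1, +1])

With alpha = 0 the stored Q-value is kept unchanged, and with alpha = 1 only the newly observed information is used, matching the weighting role of α described above.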
III. ITERATIVE LEARNING CONTROL

Iterative learning control (ILC) is a control strategy for achieving perfect tracking of a desired trajectory in repetitive systems. The repetitive system should have an initial reset at every iteration. After resetting, the whole process repeats again and again until the desired convergence is achieved. The repetition could be considered on the time, spatial, or trajectory domains, or on events. Iterative learning control is an online learning algorithm that generates a sequence of control signals on a fixed repetition domain. By repetition, the learning controller produces the desired control signal for a fixed desired trajectory along the iteration domain. Iterative learning control follows a typical learning procedure: conduct a task, store learned information including control efforts and measured outputs, update the control signals, and repeat the task again from the same initial point. Iterative learning control is essentially an attempt to obtain the input sequence that makes the system produce a desired output; thus, it is a mapping between input and output. This input/output mapping is learned by iteration. Let us consider a simple linear system:

ẋ(t) = A x(t) + B u(t),   (6)
y(t) = C x(t).   (7)

The main task of ILC is to ensure that the output follows the desired trajectory, i.e., y(t) = yd(t), over a given finite horizon 0 ≤ t ≤ T. It is shown that the control update

u_{k+1}(t) = u_k(t) + Γ ė_k(t),   where ė_k(t) = ẏ_d(t) − ẏ_k(t),

ensures convergence if ‖I − CBΓ‖ < 1. Thus, the ILC scheme consists of the system and the controller. Given an input-output mapping y = f(u), ILC is a contractive mapping algorithm such that the input signal u assures y_d − y_{k+1} = A (y_d − y_k), where the magnitude (norm) of A is less than 1. In ILC, the current control signal update makes use of the knowledge learned in the past: the error signals and control inputs used in the previous iteration.
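To illustrate the iteration-domain structure of this update, the following Python sketch applies the derivative-type law to a discretized first-order plant. The plant parameters, sampling time, learning gain, and desired trajectory are assumptions made for the sketch; convergence in an actual design would still rely on a contraction condition such as ‖I − CBΓ‖ < 1.

import numpy as np

# Illustrative discretized plant y[t+1] = a*y[t] + b*u[t]; the parameters,
# gain, and desired trajectory are assumptions for this sketch.
a, b = 0.9, 0.5
T, dt = 50, 0.1
yd = np.sin(np.linspace(0.0, 2.0 * np.pi, T))  # repetitive desired trajectory
gamma = 1.0                                    # learning gain Gamma

u = np.zeros(T)                                # control signal of iteration k
for k in range(30):                            # iteration (trial) domain
    # Run one trial from the same initial condition (initial reset).
    y = np.zeros(T)
    for t in range(T - 1):
        y[t + 1] = a * y[t] + b * u[t]
    e = yd - y                                 # stored error signal e_k(t)
    # Derivative-type update u_{k+1}(t) = u_k(t) + Gamma * de_k/dt, with the
    # time derivative approximated by a forward difference of the stored error.
    e_dot = np.diff(e, append=e[-1]) / dt
    u = u + gamma * e_dot

The stored signals e and u from the previous trial play exactly the role of the learned knowledge referred to in this section.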
IV. DIFFERENCES

ILC is a control algorithm that generates a set of control signals such that the output signals follow a sequence of desired outputs. It requires a repetitive, periodic desired trajectory; or it can be used to achieve a desired output trajectory under a periodic or repetitive disturbance. Thus, ILC is a kind of supervised learning, since it needs a repetitive reference trajectory. Also, in ILC, the system should be repetitive or stationary; otherwise, the error signals measured in the previous iterations cannot be used in the current learning process. In ILC, the set of states and the set of actions are not fixed; they are sought by the learning algorithm. In reinforcement learning, however, the set of states and the set of actions are fixed: given these sets, the reinforcement learning algorithm searches for the best solution. Reinforcement learning does not require any reference signal or reference trajectory; instead, it simply maximizes optimality (the Q-values) by measuring immediate rewards.

Reinforcement learning is more flexible than ILC in terms of repetition. In reinforcement learning, the initial reset and a periodic iteration are not necessary; it is only required to exploit the same state whenever a state identical to the current one has been visited before. By exploiting the same state, it updates the Q-value in an optimal direction. Thus, although reinforcement learning essentially depends on a kind of repetition, its mathematical formulation does not strictly demand an exact repetition on a repetitive time domain. The basic formulation of ILC, however, is established under the assumptions of a finite repetitive time domain and an initial reset condition (i.e., every initial starting point should be the same). Though some recent ILC algorithms do not require a finite repetitive time domain, the basic ILC formulation is usually based on this assumption. Thus, in terms of repetition, reinforcement learning is more flexible and general than ILC.

V. SIMILARITIES

In reinforcement learning, the agent and the environment interact with each other. In ILC, the system can be considered the environment and the ILC controller can be considered the agent. The agent determines the action command, while the ILC controller decides the control signals. In RL, the action that maximizes Q(s′, a′) is selected, while in ILC the control signal is computed by u_{k+1}(t) = u_k(t) + Γ ė_k(t). In both cases, the action and the control signals are computed from information learned in the past. That is, in RL, the Q-table is continuously updated and the stored Q-table is used for the selection of the action. In this update, the stored state s′ should be related to the current state s; that is, the current state must be matched with a stored state. Otherwise, if there is no information matched to the current state s, the action cannot be decided. In other words, for the generation of an action at the current state s, there should be learned information from the past at the exact same state s. If there is no learned knowledge at the state s, RL cannot compute an optimal action signal. There is the same matching problem in ILC. For the control signal at the time point t, there should be learned information at the exact same time point; that is, we need the learned knowledge at time t, namely u_k(t) and ė_k(t). Otherwise, ILC cannot update its control signal at the time point t.

RL has to store the measured rewards and update the Q-table, while ILC has to store the error signals. So, both algorithms have a storing process. Then, at a given state, RL searches the Q-values stored from previous steps, and ILC searches the error and control signals from the previous iterations corresponding to the current time point. Then, RL selects an action by seeking the maximizing Q-value at the current state, while ILC updates the control signals by a contractive mapping process, which is basically an optimization technique. Table I summarizes the similarities between RL and ILC. Fig. 1 depicts the learning process of the iterative learning controller and Fig. 2 depicts the learning process of reinforcement learning. As shown in these figures, ILC and RL have similar update structures and storage mechanisms.

Fig. 1. Learning process of iterative learning controller (block diagram: plant, ILC controller, learning update u_{k+1} from the current-iteration signals u_k, y_k, and e_k = y_d − y_k).

Fig. 2. Learning process of reinforcement learning (block diagram: environment, Q-learning, learning update Q_{k+1} from the past-episode Q_k, action a, and measured state s).

TABLE I
COMPARISON BETWEEN RL AND ILC

RL                                  ILC
Agent                               ILC controller
Environment                         System equations
Q-table update                      Data storage
State and action pair (s, a)        Sampling time point t
a = arg max_{a′} Q*(s′, a′)         u_{k+1}(t) = u_k(t) + Γ ė_k(t)
Rewards r                           Error e_k

VI. CONCLUSIONS

This paper has been devoted to a discussion of the similarities and differences between iterative learning control (ILC) and reinforcement learning (RL). ILC has been widely studied in the field of engineering, while RL has been studied in the area of computer science. ILC is eventually a quantitative learning algorithm, while RL is a qualitative learning algorithm. Both methods require some sort of repetition and experience. The learning structures of the two algorithms are similar, while the controller update mechanisms are different. In a future publication, a more detailed discussion of these topics will be reported.

REFERENCES
[1] Hyo-Sung Ahn, Kevin L. Moore, and YangQuan Chen, Iterative Learning Control: Robustness and Monotonic Convergence for Interval Systems, Communications and Control Engineering. Springer, 2007.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.