SYS3060 Stochastic Decision Models (Reinforcement Learning)
Lecture 1
Quanquan Gu
Dept. of Systems and Information Engineering, University of Virginia

What is Reinforcement Learning?
- Learning from interaction
- Goal-oriented learning
- Learning about, from, and while interacting with an external environment
- Learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal

Reinforcement Learning
- Closed-loop learning
- Continual learning and planning
- The environment is stochastic and uncertain

[Figure: the agent-environment loop. The controller selects an action; the system returns the next state and a reward.]

Sequential Decision Making
- Goal: select actions to maximize total future reward
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward, e.g., a financial investment (which may take months to mature)

Terminology of Reinforcement Learning
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what

Supervised Learning
[Figure-only slide.]

Reinforcement Learning
[Figure-only slide.]

Contrast with Supervised Learning
- No explicit labeled training data, only a sequence of reward signals
- The data are not independent and identically distributed (i.i.d.); this is online, sequential machine learning

Key Features of RL
- The agent/learner is not told which actions to take
- The agent's actions affect the subsequent data it receives
- Trial-and-error search
- Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
- The need to explore and exploit

Exploration vs. Exploitation
- The agent wants to find a good policy from its experience of the environment, without losing too much reward along the way
- Exploration finds more information about the environment
- Exploitation exploits known information to maximize reward
- It is often important to explore as well as exploit (a minimal epsilon-greedy sketch follows the Applications slide)

Examples of Exploration and Exploitation
- Restaurant selection
  - Exploitation: go to your favourite restaurant
  - Exploration: try a new restaurant suggested by Yelp
- Oil drilling
  - Exploitation: drill at the best known location
  - Exploration: drill at a new location
- Game playing
  - Exploitation: play the move you believe is best
  - Exploration: play an experimental move

Uncertainty
"Uncertainty is the only certainty there is, and knowing how to live with insecurity is the only security." (John Allen Paulos, b. 1945)
- The next state might be uncertain
- A transition from state S after taking action A:
    S' = f(S, A, D),   R = g(S, A, D)
  where D is a random variable (the "disturbance"), f is the transition function, and g is the reward function

An Oversimplified Model
Note: the transitions can be represented as S' = f(S, A, D), R = g(S, A, D). (A simulation sketch of this model also follows the Applications slide.)

Applications
- Robot control
- Board games, e.g., grid world, chess, Go
- Elevator scheduling
- Inventory management
- Personalized medicine and clinical trials
- ...
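As a minimal illustration of the exploration-exploitation trade-off above, the following sketch runs an epsilon-greedy rule on a toy multi-armed bandit. This example is not from the slides: the bandit, its reward means, and all names (true_means, epsilon, estimates) are assumptions made for illustration.

```python
import random

# Epsilon-greedy sketch on a toy 3-armed bandit.
# The true reward means below are made up for illustration.
true_means = [0.2, 0.5, 0.8]          # hypothetical expected reward per action
counts = [0] * len(true_means)        # how often each action was tried
estimates = [0.0] * len(true_means)   # running estimate of each action's value
epsilon = 0.1                         # probability of exploring

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(len(true_means))                        # explore
    else:
        a = max(range(len(true_means)), key=lambda i: estimates[i])  # exploit
    r = random.gauss(true_means[a], 1.0)               # noisy reward signal
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]     # incremental mean update

print(estimates)  # should land roughly near true_means
```

With epsilon = 0.1, about 10% of steps are spent exploring, which is typically enough here for the estimates to converge near the true means while most steps exploit the best-known arm.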
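To make the transition model S' = f(S, A, D), R = g(S, A, D) concrete, here is a minimal simulation sketch. The particular f and g (a toy inventory problem where S is the stock level, A is the order size, and the disturbance D is random demand) are invented for illustration and are not from the lecture.

```python
import random

# Sketch of the stochastic transition model from the Uncertainty slide:
#   S' = f(S, A, D),  R = g(S, A, D)
# Here S = stock level, A = units ordered, D = random demand (all assumed).

def f(s, a, d):
    # Next stock level: current stock plus order, minus demand (not below 0).
    return max(s + a - d, 0)

def g(s, a, d):
    # Reward: revenue from units sold, minus ordering and holding costs.
    sold = min(s + a, d)
    return 2.0 * sold - 1.0 * a - 0.1 * s

s = 5                                  # initial state
for t in range(10):
    a = random.randint(0, 3)           # placeholder policy: order 0-3 units
    d = random.randint(0, 4)           # disturbance D: random demand
    s, r = f(s, a, d), g(s, a, d)      # S' = f(S, A, D), R = g(S, A, D)
    print(f"t={t} action={a} demand={d} next_state={s} reward={r:.1f}")
```

Note that f and g are both evaluated at the current state before the assignment updates it, matching the equations above; all randomness enters only through the disturbance D.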
The Structure of the Course
- Markov decision processes
  - Stochasticity, state, action, reward, value functions, policies
- Planning
  - Bellman (optimality) equations, operators, fixed points
  - Value iteration, policy iteration
- Value prediction
  - Temporal-difference learning unifies Monte Carlo and bootstrapping
  - Function approximation to deal with large spaces
  - Least-squares methods
  - New gradient-based methods
- Control
  - Closed-loop interactive learning: exploration vs. exploitation
  - Q-learning
  - SARSA
  - Policy gradient, natural actor-critic

Acknowledgement
- Many thanks to Dr. Csaba Szepesvari @ University of Alberta for allowing me to reuse his tutorial slides
- Many thanks to Dr. Richard Sutton @ University of Alberta for providing the slides of his wonderful textbook