Reinforcement Learning

SYS3060 Stochastic Decision Models (Reinforcement Learning)
Lecture 1
Quanquan Gu
Dept. of Systems and Information Engineering
University of Virginia
What is Reinforcement Learning?
- Learning from interaction
- Goal-oriented learning
- Learning about, from, and while interacting with an external environment
- Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal
Reinforcement Learning
- Closed-loop learning
- Continual learning and planning
- The environment is stochastic and uncertain
Reinforcement Learning
[Figure: the reinforcement learning loop. The controller observes the system's state and selects an action; the system returns a reward and transitions to a new state.]
Sequential Decision Making
- Goal: select actions to maximize total future reward
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward, e.g., a financial investment that may take months to mature (see the sketch below)
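To make "total future reward" concrete, here is a minimal sketch (not from the slides) using the common discounted-return formulation; the discount factor 0.9 and the reward sequences are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.9):
    """Total future reward: each reward discounted by how far away it is."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# An "investment" pattern: give up reward now for a larger delayed payoff.
invest = [-1.0, 0.0, 0.0, 5.0]    # pay now, profit at step 3
consume = [1.0, 0.0, 0.0, 0.0]    # take the immediate reward instead

print(discounted_return(invest))    # ~2.645: the delayed reward wins here
print(discounted_return(consume))   # 1.0
```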
Terminology of Reinforcement Learning
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model: what follows what
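As a minimal sketch (every name and number below is made up, not from the slides), the four objects for a tiny two-state problem might look like:

```python
# Hypothetical 2-state, 2-action problem; all values are illustrative.
policy = {"s0": "a1", "s1": "a0"}                  # what to do in each state
reward = {("s0", "a1"): 1.0, ("s1", "a0"): 0.0}    # what is good (immediate)
value = {"s0": 1.8, "s1": 0.9}                     # what is good long-term,
                                                   # because it predicts reward
model = {("s0", "a1"): "s1", ("s1", "a0"): "s0"}   # what follows what
```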
Supervised Learning
[Figure-only slide: the supervised learning setup.]
Reinforcement Learning
[Figure-only slide: the reinforcement learning setup.]
Contrast with supervised learning
- No explicit labeled training data, only a sequence of reward signals
- The data are not independent and identically distributed (i.i.d.); this is online, sequential machine learning
Key Features of RL
- Agent/Learner is not told which actions to take
- Agent’s actions affect the subsequent data it receives
- Trial-and-error search
- Possibility of delayed reward
  - Sacrifice short-term gains for greater long-term gains
- The need to explore and exploit
Exploration vs. Exploitation
- The agent wants to find a good policy from its experiences of the environment, without losing too much reward along the way
- Exploration finds more information about the environment
- Exploitation exploits known information to maximize reward
- It is often important to explore as well as exploit (one standard recipe is sketched below)
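One standard recipe for this balance (a sketch, not prescribed by the slides) is epsilon-greedy action selection: exploit the action with the highest estimated value most of the time, and explore a uniformly random action with small probability epsilon. The bandit-style value estimates and epsilon = 0.1 below are illustrative assumptions.

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Explore a random action with probability epsilon; otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_estimates))                      # explore
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])  # exploit

# Made-up estimated quality of three restaurants:
q = [0.2, 0.8, 0.5]
picks = [epsilon_greedy(q) for _ in range(1000)]
print(picks.count(1) / 1000)  # roughly 0.93: mostly the favourite, sometimes others
```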
Examples of Exploration and Exploitation
- Restaurant Selection
  - Exploitation: Go to your favourite restaurant
  - Exploration: Try a new restaurant suggested by Yelp
- Oil Drilling
  - Exploitation: Drill at the best known location
  - Exploration: Drill at a new location
- Game Playing
  - Exploitation: Play the move you believe is best
  - Exploration: Play an experimental move
Uncertainty
“Uncertainty is the only certainty there is, and knowing how to live with insecurity is the only security.” (John Allen Paulos, 1945–)

- The next state might be uncertain
- A transition from state S after taking action A:

  S' = f(S, A, D),    R = g(S, A, D)

  where D is a random variable (the “disturbance”), f is the transition function, and g is the reward function
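A minimal simulation of this view (the inventory reading below, with state = stock level, action = order size, and disturbance = random demand, is an assumption chosen only to make f, g, and D concrete):

```python
import random

def f(s, a, d):
    """Transition function: next stock = stock + order - demand, floored at 0."""
    return max(s + a - d, 0)

def g(s, a, d):
    """Reward function: revenue from units sold minus the cost of the order."""
    return 2.0 * min(s + a, d) - 1.0 * a

s, a = 5, 3                  # state S: 5 in stock; action A: order 3 more
d = random.randint(0, 10)    # disturbance D: random demand
print(f(s, a, d), g(s, a, d))  # the same (S, A) can yield different (S', R)
```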
An oversimplified model
Note: the transitions can be represented as

  S' = f(s, a, D),    R = g(s, a, D).
Applications
- Robot control
- Board games, e.g., grid world, chess, Go
- Elevator scheduling
- Inventory management
- Personalized medicine and clinical trials
- ...
The structure of the course
- Markov decision processes
  - Stochasticity, state, action, reward, value functions, policies
- Planning
  - Bellman (optimality) equations, operators, fixed points
  - Value iteration, policy iteration
- Value prediction
  - Temporal-difference learning, which unifies Monte Carlo and bootstrapping
  - Function approximation to deal with large spaces
  - Least-squares methods
  - New gradient-based methods
- Control
  - Closed-loop interactive learning: exploration vs. exploitation
  - Q-learning
  - SARSA
  - Policy gradient, natural actor-critic
Acknowledgement
- Many thanks to Dr. Csaba Szepesvari @ University of Alberta for allowing me to reuse his tutorial slides
- Many thanks to Dr. Richard Sutton @ University of Alberta for providing the slides of his wonderful textbook