Q-Learning
Christopher Watkins and Peter Dayan
Presented by Noga Zaslavsky
The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (67679)
November 1, 2015
Outline
1. Introduction
   What is reinforcement learning?
   Modeling the problem
   Bellman optimality
2. Q-Learning
   Algorithm
   Convergence analysis
3. Assumptions and Limitations
4. Summary
Introduction
Reinforcement Learning: Agent-Environment Interaction
Reinforcement learning is learning how to behave in order to maximize value or achieve some goal
It is not supervised learning, and it is not unsupervised learning
There is no expert to tell the learning agent right from wrong; it is forced to learn from its own experience
The learning agent receives feedback from its environment; it can learn by trying different actions in different situations
There is a tradeoff between exploration and exploitation
[Figure: the agent-environment interaction loop]
Toy Problem
[Figure: a toy problem with an initial state s0, intermediate states s, and a final state sf]
Markov Decision Process (MDP)
S ∈ 𝒮 - the state of the system
A ∈ 𝒜 - the agent's action
p(s′|s, a) - the dynamics of the system
r : 𝒮 × 𝒜 → ℝ - the reward function (possibly stochastic)
Definition: an MDP is the tuple ⟨𝒮, 𝒜, p, r⟩
π(a|s) - the agent's policy
[Diagram: the MDP as a graphical model: states S_{t−1}, S_t, S_{t+1}, ... evolve according to p, actions A_{t−1}, A_t, A_{t+1} are drawn from π, and rewards R_{t−1}, R_t, R_{t+1} are emitted at each step]
(a minimal code sketch of an MDP like this follows below)
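To make the notation concrete, here is a minimal Python sketch, not taken from the slides, of a finite MDP stored as plain dictionaries. The states, actions, transition probabilities, and rewards are invented for illustration and only loosely echo the toy problem above.

```python
import random

# A hypothetical finite MDP: P[(s, a)] maps next states to probabilities,
# R[(s, a)] gives the expected reward for taking action a in state s.
states = ["s0", "s", "sf"]
actions = ["left", "right"]

P = {
    ("s0", "right"): {"s": 1.0},
    ("s0", "left"):  {"s0": 1.0},
    ("s", "right"):  {"sf": 0.9, "s0": 0.1},
    ("s", "left"):   {"s0": 1.0},
    ("sf", "right"): {"sf": 1.0},
    ("sf", "left"):  {"sf": 1.0},
}
R = {(s, a): 0.0 for s in states for a in actions}
R[("s", "right")] = 1.0  # reward for moving toward the final state

def step(s, a):
    """Sample s' ~ p(.|s, a) and return (s', r(s, a))."""
    nxt = P[(s, a)]
    s_next = random.choices(list(nxt.keys()), weights=list(nxt.values()))[0]
    return s_next, R[(s, a)]
```

The later sketches reuse these names (states, actions, P, R, step) as their assumed environment interface.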
The Value Function
The undiscounted value function for a given policy π:
    V^π(s) = E[ ∑_{t=0}^{T} r(s_t, a_t) | s_0 = s ]
T is the horizon; it can be finite, episodic, or infinite
The discounted value function for a given policy π:
    V^π(s) = E[ ∑_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ]
0 < γ < 1 is the discount factor (a small numerical example follows this slide)
The goal: find a policy π* that maximizes the value for every s ∈ 𝒮:
    V*(s) = V^{π*}(s) = max_π V^π(s)
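As a small numerical illustration of the discounted objective (not from the slides), the sketch below computes the discounted return of a single trajectory; the reward sequence and γ are arbitrary choices.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one trajectory of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A trajectory earning reward 1 on every step: with gamma = 0.9 the discounted
# return approaches 1 / (1 - gamma) = 10 as the trajectory gets longer.
print(discounted_return([1.0] * 100, gamma=0.9))  # ~9.9997
```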
The Bellman Equation
The value function can be written recursively:
    V^π(s) = E[ ∑_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ]
           = E_{a∼π(·|s), s′∼p(·|s,a)} [ r(s, a) + γ V^π(s′) ]
The optimal value satisfies the Bellman equation:
    V*(s) = max_π E_{a∼π(·|s), s′∼p(·|s,a)} [ r(s, a) + γ V*(s′) ]
π* is not necessarily unique, but V* is unique
(an iterative policy-evaluation sketch based on the recursion follows this slide)
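The recursive form of V^π suggests a simple algorithm when the model is known. Below is a minimal sketch, not from the slides, of iterative policy evaluation: repeatedly applying the recursion to a tabular V until it stops changing. It assumes hypothetical tabular P and R as in the earlier sketch, and a policy pi given as per-state action probabilities.

```python
def policy_evaluation(states, actions, P, R, pi, gamma=0.9, tol=1e-8):
    """Compute V^pi by repeatedly applying the recursive equation above.

    P[(s, a)] maps next states to probabilities, R[(s, a)] is the expected reward,
    and pi[s] maps actions to probabilities (all hypothetical tabular inputs).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s][a] * (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

# Example with the toy MDP above and a uniformly random policy:
# V = policy_evaluation(states, actions, P, R, {s: {"left": 0.5, "right": 0.5} for s in states})
```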
The Q-Function
The value of taking action a in state s:
    Q(s, a) = r(s, a) + γ E_{s′∼p(·|s,a)} [ V(s′) ]
If we know V*, then an optimal policy is to act deterministically:
    a*(s) = argmax_a Q*(s, a)
This means that V*(s) = max_a Q*(s, a)
Conclusion: learning Q* makes it easy to obtain an optimal policy (a greedy-policy sketch follows this slide)
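A minimal sketch, not from the slides, of acting greedily with respect to a tabular Q stored as a dict keyed by (state, action); the states, actions, and Q-values are made up.

```python
def greedy_action(Q, s, actions):
    """Return argmax_a Q[(s, a)] for a tabular Q."""
    return max(actions, key=lambda a: Q[(s, a)])

# Made-up values: the greedy action in state "s" is "right".
Q = {("s", "left"): 0.2, ("s", "right"): 1.3}
print(greedy_action(Q, "s", ["left", "right"]))  # right
```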
Q-Learning
Learning Q*
If the agent knows the dynamics p and the reward function r, it can find Q* by dynamic programming (e.g. value iteration; sketched after this slide)
Otherwise, it needs to estimate Q* from its experience
The experience of an agent is a sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, ...
The n-th episode (here, a single transition) is (s_n, a_n, r_n, s_{n+1})
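A minimal sketch, not from the slides, of the known-model case: computing Q* by dynamic programming (tabular Q-value iteration), assuming hypothetical tabular P and R as in the earlier sketches.

```python
def q_value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """Compute Q* when p and r are known, by repeated Bellman backups.

    P[(s, a)] maps next states to probabilities and R[(s, a)] is the expected
    reward (hypothetical tabular inputs).
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                q_new = R[(s, a)] + gamma * sum(
                    p * max(Q[(s2, a2)] for a2 in actions)
                    for s2, p in P[(s, a)].items()
                )
                delta = max(delta, abs(q_new - Q[(s, a)]))
                Q[(s, a)] = q_new
        if delta < tol:
            return Q

# Example with the toy MDP above: Q_star = q_value_iteration(states, actions, P, R)
```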
The Q-Learning Algorithm
Initialize Q_0(s, a) for all a ∈ 𝒜 and s ∈ 𝒮
For each episode n:
    observe the current state s_n
    select and execute an action a_n
    observe the next state s_{n+1} and the reward r_n
    update Q_n(s_n, a_n) ← (1 − α_n) Q_{n−1}(s_n, a_n) + α_n [ r_n + γ max_a Q_{n−1}(s_{n+1}, a) ]
α_n ∈ (0, 1) is the learning rate
(a minimal Python sketch of this loop follows this slide)
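Below is a minimal Python sketch of the loop above, not from the slides. It assumes the hypothetical step(s, a) interface from the toy-MDP sketch, explores with a purely random policy, and uses a constant learning rate for simplicity.

```python
import random
from collections import defaultdict

def q_learning(actions, step, s0, terminal=None, steps=20000, gamma=0.9, alpha=0.1):
    """Tabular Q-learning against a sampled environment.

    step(s, a) is assumed to return (s_next, r); exploration is purely random
    and the learning rate is held constant, both for simplicity.
    """
    Q = defaultdict(float)              # Q[(s, a)], implicitly initialized to 0
    s = s0
    for _ in range(steps):              # each step is one "episode" (s_n, a_n, r_n, s_{n+1})
        a = random.choice(actions)      # explore by acting at random
        s_next, r = step(s, a)
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        # Same rule as on the slide, written incrementally:
        # Q <- (1 - alpha) Q + alpha * target  ==  Q + alpha * (target - Q)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s0 if s_next == terminal else s_next   # restart after reaching a final state
    return Q

# Example with the toy MDP above: Q = q_learning(actions, step, "s0", terminal="sf")
```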
Exploration vs. Exploitation
How should the agent gain experience in order to learn?
If it explores too much, it may suffer a low value while learning
If it exploits the learned Q function too early, it may get stuck in bad states without even knowing about better possibilities (overfitting to limited experience)
One common approach is an ε-greedy policy: act at random with probability ε, otherwise act greedily with respect to the current Q (sketched after this slide)
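A minimal sketch of ε-greedy action selection, not from the slides; it assumes a tabular Q keyed by (state, action) as in the earlier sketches, and ε = 0.1 is an arbitrary default.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps pick a random action (explore),
    otherwise pick argmax_a Q[(s, a)] (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

In the Q-learning sketch above, replacing random.choice(actions) with epsilon_greedy(Q, s, actions) lets the behavior policy gradually exploit what it has learned while still exploring.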
Convergence of the Q function
Theorem
Given bounded rewards |r_n| ≤ R and learning rates 0 ≤ α_n ≤ 1 such that
    ∑_{n=1}^{∞} α_n = ∞   and   ∑_{n=1}^{∞} α_n² < ∞
then with probability 1
    Q_n(s, a) → Q*(s, a)  as n → ∞
The exploration policy should be such that each state-action pair is encountered infinitely many times
(a concrete example of a learning-rate schedule satisfying these conditions follows this slide)
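As a standard concrete example, not from the slides: the schedule α_n = 1/n satisfies both conditions, while a constant rate α_n = c ∈ (0, 1) satisfies only the first.

```latex
% alpha_n = 1/n satisfies both conditions:
\sum_{n=1}^{\infty} \frac{1}{n} = \infty,
\qquad
\sum_{n=1}^{\infty} \frac{1}{n^{2}} = \frac{\pi^{2}}{6} < \infty.
% A constant rate alpha_n = c in (0, 1) satisfies only the first:
\sum_{n=1}^{\infty} c = \infty,
\qquad
\sum_{n=1}^{\infty} c^{2} = \infty.
```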
Convergence of the Q function
Proof - main idea
Define the operator
    B[Q](s, a) = r(s, a) + γ ∑_{s′} p(s′|s, a) max_{a′} Q(s′, a′)
B is a contraction under the ‖·‖∞ norm:
    ‖B[Q1] − B[Q2]‖∞ ≤ γ ‖Q1 − Q2‖∞
It is easy to see that B[Q*] = Q*
We are not done yet: Q-learning performs noisy, sample-based versions of this backup, so the updates come from a stochastic process and a stochastic-approximation argument is still needed (a numerical check of the contraction inequality follows this slide)
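To see the contraction numerically, here is a minimal sketch, not from the slides, of the operator B for a tabular Q, plus a check of the inequality on two random Q tables. It reuses the hypothetical states, actions, P, and R from the toy-MDP sketch above.

```python
import random

def bellman_backup(Q, states, actions, P, R, gamma=0.9):
    """Apply B[Q](s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) * max_{a'} Q(s', a')."""
    return {
        (s, a): R[(s, a)] + gamma * sum(
            p * max(Q[(s2, a2)] for a2 in actions) for s2, p in P[(s, a)].items()
        )
        for s in states for a in actions
    }

def sup_norm_diff(Q1, Q2):
    """The infinity-norm distance between two tabular Q functions."""
    return max(abs(Q1[k] - Q2[k]) for k in Q1)

# Numerical check of ||B[Q1] - B[Q2]||_inf <= gamma * ||Q1 - Q2||_inf on random tables.
Q1 = {(s, a): random.uniform(-1, 1) for s in states for a in actions}
Q2 = {(s, a): random.uniform(-1, 1) for s in states for a in actions}
lhs = sup_norm_diff(bellman_backup(Q1, states, actions, P, R),
                    bellman_backup(Q2, states, actions, P, R))
print(lhs <= 0.9 * sup_norm_diff(Q1, Q2))  # True
```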
Assumptions and Limitations
What assumptions did we make?
Q-learning is model-free: it learns Q* without learning p or r
The state and reward are assumed to be fully observable (a POMDP is a much harder problem)
The Q function is assumed to fit in a look-up table; otherwise we need something else, such as function approximation, and this is where deep learning comes in! (a minimal linear-approximation sketch follows this slide)
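For contrast with the look-up-table assumption, here is a minimal sketch, not from the slides, of the simplest form of function approximation: a linear Q function over a hand-crafted feature vector, trained with the same temporal-difference target. The feature map and numbers are placeholders.

```python
import numpy as np

def q_linear(w, features):
    """Approximate Q(s, a) ~ w . phi(s, a) instead of storing one table entry per (s, a)."""
    return float(np.dot(w, features))

def td_update(w, features, r, max_q_next, alpha=0.1, gamma=0.9):
    """Move w toward the TD target r + gamma * max_a' Q(s', a'), the analogue of the tabular update."""
    td_error = (r + gamma * max_q_next) - q_linear(w, features)
    return w + alpha * td_error * features

# Usage with a made-up 3-dimensional feature vector phi(s, a):
w = np.zeros(3)
phi = np.array([1.0, 0.5, -0.2])
w = td_update(w, phi, r=1.0, max_q_next=0.0)
```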
Summary
The basic reinforcement learning problem can be modeled as an MDP
If the model is known, we can solve it exactly by dynamic programming
Q-learning allows learning an optimal policy when the model is not known, and without trying to learn the model
It converges asymptotically to the optimal Q if the learning rate decays neither too quickly nor too slowly, and if the exploration policy visits every state-action pair infinitely often
Further Reading
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.