Artificial Intelligence
11. Reinforcement Learning
prof. dr. sc. Bojana Dalbelo Bašić
doc. dr. sc. Jan Šnajder
University of Zagreb
Faculty of Electrical Engineering and Computing (FER)
Academic Year 2015/2016
Markov Decision Processes and Non-Deterministic Search
The Model of the World
GridWorld
A maze-like problem
Noisy movement
  - Actions don't always end up as planned
Immediate ("living") reward
  - At each time step the agent is alive, it receives a reward
The goal of the game is to maximize the sum of rewards (the utility)
Deterministic vs Stochastic Movement
(Two grid figures: deterministic movement on the left, stochastic movement on the right)
Markov Decision Processes
Andrey Markov (1856-1922)
"Markov" in MDPs means that an action outcome depends only on the current state:
$P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)$
The successor function cares only about the current state, not about how you got there
Therefore, we want to model the optimal action choice (the policy) for all states:
$\pi^* : S \to A$
  - A policy $\pi$ gives an action for each state
  - An optimal policy maximizes expected utility
Formalized Markov Decision Processes
A Markov Decision Process (MDP) is a non-deterministic search problem
MDP formalization
  - A set of states $s \in S$
  - A set of actions $a \in A$
  - A transition function $T(s, a, s')$
      - The probability that taking action $a$ in state $s$ takes you to state $s'$
      - Also known as the model of the world the agent lives in
  - A reward function $R(s, a, s')$
  - A start state
  - Possibly one or more terminal states
In the scope of this course, only the transition function $T$ is a probability distribution; in general, the reward function can be a probability distribution as well
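To make the formalization concrete, here is a minimal sketch of how a small GridWorld MDP could be specified. The grid layout, the noise level, the living reward, and the terminal cells are illustrative assumptions, not the course's actual gridworld.py code.

```python
# A minimal sketch of an MDP specification for a small grid world.
# All names and values (NOISE, LIVING_REWARD, the terminal cells) are illustrative.
NOISE = 0.2            # probability mass split between the two perpendicular moves
LIVING_REWARD = -0.03

STATES = [(x, y) for x in range(4) for y in range(3) if (x, y) != (1, 1)]
ACTIONS = ["north", "south", "east", "west"]

def transition(s, a):
    """Return a list of (s_prime, probability) pairs, i.e. T(s, a, .)."""
    deltas = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
    perpendicular = {"north": ["east", "west"], "south": ["east", "west"],
                     "east": ["north", "south"], "west": ["north", "south"]}

    def move(state, action):
        dx, dy = deltas[action]
        nxt = (state[0] + dx, state[1] + dy)
        return nxt if nxt in STATES else state   # bumping into a wall: stay put

    outcomes = {}
    outcomes[move(s, a)] = outcomes.get(move(s, a), 0.0) + (1.0 - NOISE)
    for side in perpendicular[a]:
        nxt = move(s, side)
        outcomes[nxt] = outcomes.get(nxt, 0.0) + NOISE / 2
    return list(outcomes.items())

def reward(s, a, s_prime):
    """R(s, a, s'); the +1 / -1 cells below are assumed terminal rewards."""
    if s_prime == (3, 2):
        return 1.0
    if s_prime == (3, 1):
        return -1.0
    return LIVING_REWARD
```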
Gridworld demo
Optimal Policies and Immediate Rewards (Punishments?)
(Four figures: optimal policies for living rewards $r(s, \cdot, s') = -0.01$, $-0.03$, $-0.4$, and $-2.0$)
What would the policy look like for positive immediate rewards?
MDP Search Trees
Imagine that instead of playing a game against an opponent that
wants to minimize your reward, your opponent takes random actions
Sequence Preference
Two main, seemingly straightforward questions
More or less?
  - [1, 2, 2] vs. [1, 2, 3]
Now or later?
  - [0, 0, 1] vs. [1, 0, 0]
It is natural to prefer the larger sum of rewards
How do we make the algorithm prefer equal amounts sooner rather than later?
Discounting
Exponential decay of rewards
With each step we take, rewards are multiplied by a factor of $\gamma$
Discounted utility $= \sum_{t=0}^{\infty} \gamma^t r_t$
The utility of taking action $a$ in state $s$ and ending up in state $s'$ is the sum of the immediate reward and the discounted future reward:
$V(s) = R(s, a, s') + \gamma V(s')$
Reaching the same utility faster is usually better (otherwise, the reward function is not defined properly)
$0.99^{10} = 0.904$ (while $0.8^{10} = 0.107$)
Gamma values are usually in $[0.8, 1]$
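A tiny sketch of the discounted-utility sum above, applied to the "now or later" sequences from the previous slide; the helper name is illustrative.

```python
# A minimal sketch: discounted utility of a reward sequence for a given gamma.
def discounted_utility(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_utility([0, 0, 1], gamma=0.9))  # 0.81 -- the reward arrives late
print(discounted_utility([1, 0, 0], gamma=0.9))  # 1.0  -- same total, but sooner is worth more
```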
Infinite utilities
Some MDPs can go on forever, accumulating infinite rewards
Solutions
  1. Discounting: discounted utility $= \sum_{t=0}^{\infty} \gamma^t r_t \le \frac{R_{\max}}{1 - \gamma}$
      - Smaller gamma = shorter focus
  2. Limited number of steps (terminate learning after T steps)
  3. Absorbing state: a terminal state $s_A$ that will always eventually be reached (imagine that every state has a small probability of transitioning into $s_A$)
Markov Decision Processes Revisited
$\mathrm{MDP} = (S, s_0 \in S, A, T(s, a, s'), R(s, a, s'), \gamma)$
A set of states $s \in S$
An initial state $s_0$
Possibly one or more terminal states
A set of actions $a \in A$
A transition function $T(s, a, s')$
  - The probability that taking action $a$ in state $s$ takes you to state $s'$
  - Also known as the model of the world the agent lives in
A reward function $R(s, a, s')$
A discounting factor $\gamma$
Terms
  - Policy = which action to take in each state
  - Utility = sum of discounted rewards
Solving Markov Decision Processes
Optimal Quantities
Optimal quantities are the values we want at the end of our process
  - Optimal policy: $\pi^*(s)$ = the optimal action to take from state $s$
  - Value (utility) of a state: $V^*(s)$ = the expected utility when starting in $s$ and acting optimally
  - Value (utility) of a Q-state: $Q^*(s, a)$ = the expected utility when having committed to action $a$ in state $s$ and acting optimally afterwards
Values of states
Computing the expected (maximum) value of a state
  - Expectation: $E[X] = \sum_{x \in X} P(x) \cdot x$
  - The utility is the probability-weighted sum of rewards over the possible transitions when committed to an action $a$
  - In simpler terms, the utility is the expected future reward
  - The optimal action provides the maximum future utility
Two recursive relations ("Bellman equations")
  - The value of a state is the maximum Q-value over all possible actions:
    $V^*(s) = \max_a Q^*(s, a)$
  - The Q-value of a state is the expectation of the immediate reward for a transition $(s, a, s')$ plus the discounted future reward:
    $Q^*(s, a) = \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V^*(s')]$
  - Bellman form, substituting $Q(s, a)$:
    $V^*(s) = \max_a \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V^*(s')]$
Value Iteration
Initialize the vector of values to zero: $V_0(s) = 0$
  - Instead of 0, we can use any constant; the system will still converge to stationary values in the limit
Iterate over the Bellman form (see the sketch below):
  - $V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V_k(s')]$
Repeat until convergence
Complexity of each iteration: $O(S^2 A)$
The policy (optimal action) will converge before the values do (demo)
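A minimal sketch of the value iteration loop, assuming the same interface as the GridWorld sketch above (a `T(s, a)` that returns (s', probability) pairs and an `R(s, a, s')` function); names are illustrative.

```python
# A minimal sketch of value iteration over a finite MDP.
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                          # V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            q_values = [
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a))
                for a in actions
            ]
            V_new[s] = max(q_values)                      # Bellman update
        if max(abs(V_new[s] - V[s]) for s in states) < tol:   # convergence check
            return V_new
        V = V_new
```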
Demo: value iteration
No randomized movement, no living penalty, gamma = 0.9
python gridworld.py -n 0 -a value -v -i 20
Randomized movement p = 0.2, living penalty = -0.01, gamma = 0.9
python gridworld.py -r -0.01 -a value -v -i 20
Randomized movement p = 0.2, living penalty = -0.01, gamma = 1
python gridworld.py -r -0.01 -a value -v -i 20 -d 1
Policy Iteration
Value iteration is slow: it considers every possible action in every state at every step (complexity $O(S^2 A)$ per iteration)
The maximizing action (the policy) rarely changes!
Policy iteration
  1. Fix a (non-optimal) policy $\pi$
  2. Policy evaluation: calculate the utilities (values) for the fixed (non-optimal) policy until convergence
  3. Policy improvement: update the policy using the resulting (non-optimal) utilities
  4. Repeat until convergence
Policy Iteration
Policy iteration
  1. Fix $\pi_0$
  2. Policy evaluation: $V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi_i(s), s')[R(s, \pi_i(s), s') + \gamma V^{\pi_i}_k(s')]$
  3. Policy improvement: $\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V^{\pi_i}(s')]$
  4. Repeat until convergence
In value iteration, we calculate the optimal values without an explicit
policy and then compute the policy by finding the max value in the
successor states.
In policy iteration, we use a fixed non-optimal policy which we then
improve until convergence.
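A minimal sketch of policy iteration under the same assumed `T(s, a)` / `R(s, a, s')` interface as the earlier sketches; the arbitrary initial policy and the convergence tests are illustrative choices.

```python
# A minimal sketch of policy iteration (evaluation + improvement until the policy is stable).
def policy_iteration(states, actions, T, R, gamma=0.9, eval_tol=1e-6):
    policy = {s: actions[0] for s in states}          # fix an arbitrary pi_0
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman update for the *fixed* policy
        while True:
            V_new = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                            for s2, p in T(s, policy[s])) for s in states}
            if max(abs(V_new[s] - V[s]) for s in states) < eval_tol:
                break
            V = V_new
        # Policy improvement: greedily pick the best action under the evaluated values
        new_policy = {
            s: max(actions, key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                              for s2, p in T(s, a)))
            for s in states
        }
        if new_policy == policy:                      # converged: the policy is stable
            return policy, V
        policy = new_policy
```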
Double Bandit Problem
Double Bandit Problem
First bandit (blue): with probability $p = 1$, win one dollar
Second bandit (red): with probability $p_w = 0.75$, win two dollars; with probability $p_l = 0.25$, win zero dollars
Even without explicit calculation, we have a good idea which strategy
we should use
Game on!
$2 $2 $0 $2 $2
$0 $2 $2 $2 $2
Offline Planning
When all the values are known, solving MDPs is easy
$V(\text{blue}) = \sum_{x \in X} P(x) \cdot x = 1 \cdot 1 = 1$
$V(\text{red}) = \sum_{x \in X} P(x) \cdot x = 0.75 \cdot 2 + 0.25 \cdot 0 = 1.5$
What happens when we don't know the values?
Same problem, but the transition function $T$ is now unknown for both bandits
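A small sketch of what sampling looks like when $T$ is unknown: pull each bandit many times and average the observed payouts. The payout probabilities are the ones from the slide; the simulation itself is illustrative.

```python
# A minimal sketch: estimating the two bandits' values by sampling.
import random

def pull_blue():
    return 1.0                                        # always wins one dollar

def pull_red():
    return 2.0 if random.random() < 0.75 else 0.0     # wins two dollars with p = 0.75

n = 10_000
print(sum(pull_blue() for _ in range(n)) / n)         # close to 1.0
print(sum(pull_red() for _ in range(n)) / n)          # close to 1.5
```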
Game on again!!
$2 $2 $0 $0 $0
$0 $0 $0 $0 $2
Reinforcement Learning Intro
What happened? We changed our policy after sampling long enough
The underlying problem is still the same (an MDP), but the transition function is unknown, forcing us to estimate the transition probabilities
Keywords
  - Exploration: trying unknown actions to gather information
  - Exploitation: using what you already know to act optimally
  - Sampling: trying things out with varying degrees of success and estimating the underlying probability distribution of the MDP
Reinforcement Learning
Ivan Petrovich Pavlov
Conditioned reflexes: through repeated trials, the dogs learned a response to hearing a bell
Reinforcement Learning
Take actions based on current knowledge of the environment
Receive feedback in the form of rewards
Learn based on received rewards and update behavior (policy) in order
to maximize the expected reward (utility)
We start with zero knowledge and sample our way to the optimal
policy
Example videos: learning to walk, game playing, crawler
Reinforcement Learning
Assume an underlying Markov decision process
  - A set of states $s \in S$
  - A set of actions $a \in A$
  - A transition function $T(s, a, s')$
  - A reward function $R(s, a, s')$
Goal: find the optimal policy $\pi^*(s)$
Difficulty: we don't know $T$ or $R$
Possible approaches
  1. Estimate the transition and reward functions, then solve for the optimal policy (model-based)
  2. Directly estimate the optimal values and policy without explicitly estimating the transition and reward functions (model-free)
Offline Planning vs. Online Learning
Model-Based Learning
Learn an approximate model based on experience
Assume the learned model is correct and solve for its values
  1. Learn an empirical MDP model (see the sketch below)
      - Count outcomes $s'$ for each $(s, a)$
      - Normalize the counts to estimate $T(s, a, s')$
      - Discover each $R(s, a, s')$ while sampling
  2. Solve the learned MDP
      - Use value iteration
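A minimal sketch of the counting-and-normalizing step, assuming episodes are given as lists of (s, a, s', r) transitions; names are illustrative.

```python
# A minimal sketch of model-based estimation from observed episodes.
from collections import defaultdict

def estimate_model(episodes):
    counts = defaultdict(lambda: defaultdict(int))    # counts[(s, a)][s'] = N(s, a, s')
    rewards = {}                                      # last observed R(s, a, s')
    for episode in episodes:
        for s, a, s2, r in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)] = r
    # Normalize counts to obtain the estimated transition probabilities
    T_hat = {
        (s, a): {s2: n / sum(outcomes.values()) for s2, n in outcomes.items()}
        for (s, a), outcomes in counts.items()
    }
    return T_hat, rewards
```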
Example: expected age
Estimation (when $P(a)$ is known): $E[A] = \sum_a P(a) \cdot a$
Learn the estimate as you go, from samples $a_1, \ldots, a_N$
Model-based learning
  - Estimate the prior: $\hat{P}(a) = \frac{\mathrm{count}(a)}{N}$
  - Calculate the expectation: $E[A] \approx \sum_a \hat{P}(a) \cdot a$
Model-free learning
  - Average the samples directly: $E[A] \approx \frac{1}{N} \sum_{i=1}^{N} a_i$
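A tiny sketch contrasting the two estimates on made-up sample data; the ages are illustrative.

```python
# Model-based vs. model-free estimation of an expectation from samples.
from collections import Counter

samples = [20, 21, 21, 22, 20, 23, 21, 22]            # made-up ages
N = len(samples)

# Model-free: average the samples directly
model_free = sum(samples) / N

# Model-based: first estimate P_hat(a), then take the expectation under it
p_hat = {a: c / N for a, c in Counter(samples).items()}
model_based = sum(p * a for a, p in p_hat.items())

print(model_free, model_based)   # identical here; they differ in how they generalize
```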
Passive Reinforcement Learning
Passive Reinforcement Learning
Evaluate a given, non-optimal policy $\pi(s)$
Both the transition and the reward function are unknown
Goal: learn the state values
The learner follows instructions given by an oracle and has no choice regarding which action to take
This is not offline planning: we cannot deduce the state values without explicitly trying out sequences of moves (playing)
Direct evaluation
Goal: compute the value of each state under a fixed, non-optimal policy $\pi$
Solution: average the sampled values (a sketch follows after this list)
  - Act according to $\pi$
  - On each visit to a state, write down what the sum of discounted rewards starting from that state turned out to be in the end
  - Average the samples
  - Similar to model-free learning
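A minimal sketch of direct evaluation, assuming each episode is recorded as a list of (state, reward) pairs under the fixed policy; names are illustrative.

```python
# Direct evaluation: average the observed returns per state (every-visit averaging).
from collections import defaultdict

def direct_evaluation(episodes, gamma=1.0):
    totals, visits = defaultdict(float), defaultdict(int)
    for episode in episodes:
        ret = 0.0
        # Walk the episode backwards to accumulate the discounted return from each state
        for state, reward in reversed(episode):
            ret = reward + gamma * ret
            totals[state] += ret
            visits[state] += 1
    return {s: totals[s] / visits[s] for s in totals}
```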
Example of direct evaluation
$\gamma = 1$
Observed transitions (episodes under the fixed policy)
  Sample 1: B, east, E, -1; E, east, D, -1; D, exit, x, +10
  Sample 2: C, north, E, -1; E, east, D, -1; D, exit, x, +10
  Sample 3: B, east, E, -1; E, east, D, -1; D, exit, x, +10
  Sample 4: C, north, E, -1; E, east, A, -1; A, exit, x, -10
Estimated transitions and observed rewards
  T(B, east, E) = 1.0      R(B, east, E) = -1
  T(E, east, D) = 0.75     R(D, exit, x) = +10
  T(E, east, A) = 0.25
  ...
The Good and the Bad
Pros of direct evaluation
Easy to understand
Requires no knowledge of T and R
Eventually computes the correct values just by sampling
Cons of direct evaluation
Information about how states are connected (the transition structure T) goes unused
Each state is learned independently
Learning takes a long time
Policy Evaluation
Recall value iteration: $V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V_k(s')]$
For a fixed policy $\pi$ there is no max; initialize $V^\pi_0(s) = 0$ and update
$V^\pi_{k+1}(s) \leftarrow \sum_{s'} T(s, \pi(s), s')[R(s, \pi(s), s') + \gamma V^\pi_k(s')]$
We don't have $T$ and $R$!
How do we average without knowing the weights?
  - Sampling!
$V^\pi_{k+1}(s) \leftarrow \frac{1}{n} \sum_{i=1}^{n} \left[ R(s, \pi(s), s'_i) + \gamma V^\pi_k(s'_i) \right]$
We would need samples of each outcome from every state $s$, which is hard to do in practice (some states are hard to reach)
Temporal Difference Learning
Keep track of the past and adapt to the present
Still using a fixed, non-optimal policy!
  - Update $V(s)$ every time we get a new experience $(s, a, s', r)$
  - The outcomes $s'$ that are more likely under the unknown distribution $T$ will contribute to the updates more often
Temporal difference learning
  - Take a sample: an experience $(s, a, s', r)$
  - Update $V(s)$: $V^\pi(s) \leftarrow V^\pi(s) + \alpha \left[ R(s, \pi(s), s') + \gamma V^\pi(s') - V^\pi(s) \right]$
  - The update term is the difference between the observed value and the current estimate, scaled by the learning rate $\alpha$:
    $\mathit{difference} = [R(s, \pi(s), s') + \gamma V^\pi(s')] - V^\pi(s)$
  - If the observed value is higher than our estimate, we increase the estimate; otherwise we decrease it (or keep it the same)
  - This type of update is a running average: we move the values toward the value of whichever successor occurs (a sketch follows below)
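A minimal sketch of TD value learning for a fixed policy; the `env_step` and `policy` interfaces are assumptions, not the course's code.

```python
# Temporal difference value learning under a fixed policy.
from collections import defaultdict

def td_value_learning(env_step, policy, start_state, episodes=1000,
                      alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env_step(s, a)              # sample an experience (s, a, s', r)
            # Running-average update toward the observed target r + gamma * V(s')
            V[s] += alpha * (r + gamma * V[s2] - V[s])
            s = s2
    return dict(V)
```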
Caveats of Temporal Difference Value Learning
Temporal difference value learning allows us to approximate the value of each state
However, we wanted to learn the policy, and all we do is estimate the values under a given, non-optimal policy
  - $\pi(s) = \arg\max_a Q(s, a)$
  - $Q(s, a) = \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V(s')]$
We learned all the values, so in theory we could derive the policy, but we did not keep track of the transition function $T$, and therefore cannot evaluate the expression inside the max
Active Reinforcement Learning
Active Reinforcement Learning
The differences
  - Now, you select which action to take based on prior experience
  - The transitions $T(s, a, s')$ and rewards $R(s, a, s')$ are still unknown
  - The goal: learn the optimal policy and values
Q-value iteration
  - Remember the recursive definition of the Bellman updates:
      - $V^*(s) = \max_a Q^*(s, a)$
      - $Q^*(s, a) = \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma V^*(s')]$
  - Instead of learning the value function, learn its arguments: if we have all the Q-values (the expected utilities when committed to an action in a state), computing the policy is a simple max over the actions in a state
      - $Q_0(s, a) = 0$
      - $Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma \max_{a'} Q_k(s', a')]$
      - The policy $\pi$ picks, for a given state, the action with the maximal Q-value
Q-Learning
Sample-based Q-value iteration
  - $Q_{k+1}(s, a) \leftarrow \sum_{s'} T(s, a, s')[R(s, a, s') + \gamma \max_{a'} Q_k(s', a')]$
Learn the $Q(s, a)$ values on the fly
  - Sample an experience $(s, a, s', r)$
  - Incorporate it into the running average (see the sketch below):
    $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
Q-learning converges to an optimal policy even if we make suboptimal decisions along the way
  - This is called off-policy learning
Caveats!
  - It takes a long time to converge
  - Eventually you have to reduce the learning rate for the values to converge, but you have to do it at the right time
  - However, as the number of iterations approaches infinity, as long as you explore enough, the algorithm will learn the optimal policy!
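A minimal sketch of Q-learning with an epsilon-greedy behavior policy (discussed on the next slides); the `env_step` interface and all names are illustrative assumptions.

```python
# Q-learning: sample-based Q-value updates while acting in the environment.
import random
from collections import defaultdict

def q_learning(env_step, actions, start_state, episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                            # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s = start_state
        done = False
        while not done:
            # Behavior policy: epsilon-greedy over the current Q estimates
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * max(Q[(s2, a2)] for a2 in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # running-average update
            s = s2
    return Q
```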
Exploration vs Exploitation
Epsilon-Greedy
Forcing exploration through random actions
When choosing an action
  - With probability $\epsilon$, choose a random action
  - With probability $1 - \epsilon$, act according to the current (greedy) policy
We still need to converge to a fixed policy
  1. Decay $\epsilon$ over time (similar in spirit to the discounting factor); see the sketch below
  2. Smarter exploration functions
      - Explore each state in proportion to its estimated (Q-)value
      - Sample enough times to be sure that a state is bad in itself, and not just because of movement noise
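A small sketch of epsilon-greedy action selection with a decaying epsilon; the decay schedule and names are illustrative.

```python
# Epsilon-greedy action selection plus a simple decay schedule for epsilon.
import random

def epsilon_greedy(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: random action
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: current best action

def decayed_epsilon(episode, start=1.0, decay=0.999, floor=0.05):
    # Decay epsilon over time so the policy eventually becomes (almost) fixed
    return max(floor, start * decay ** episode)
```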
Approximate Q-Learning
Generalizing
So far, we have kept track of every single state separately
We have no notion of similarity between states
What happens when we scale up the problem?
Ideally, we want to generalize from a small number of observed experiences
For this, we need a notion of similarity between states
Similar states
Feature space embeddings
Describe a state using a vector of features
  - Features are mappings from the full problem space to a lower-dimensional space (boolean values, enumerations, categories) that capture the important properties of the problem
  - Example features (Pacman):
      - Distance to the closest ghost
      - Distance to the closest food dot
      - Number of ghosts
      - Minimal distance to a ghost
      - Distance to a ghost when moving in direction d
      - Is Pacman in a tunnel?
  - We can also describe a Q-state (a state-action pair) with features, such as "this action moves us further away from the ghosts" (a sketch follows below)
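A minimal sketch of a feature extractor for a Pacman-like Q-state; the state representation and the use of Manhattan distance are illustrative assumptions, not the course's Pacman code.

```python
# Extract a small feature vector for a (state, action) pair in a Pacman-like grid.
from collections import namedtuple

State = namedtuple("State", ["pacman", "ghosts", "food"])   # positions are (x, y) tuples
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def features(state, action):
    dx, dy = MOVES[action]
    nxt = (state.pacman[0] + dx, state.pacman[1] + dy)       # position after the action
    return {
        "bias": 1.0,
        "dist-closest-ghost": min(manhattan(nxt, g) for g in state.ghosts),
        "dist-closest-food": min(manhattan(nxt, f) for f in state.food),
        "num-ghosts": float(len(state.ghosts)),
    }
```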
Linear Value Functions
Using a feature-based representation, we can describe the value of each state (or Q-state) as a weighted linear combination of its features
  - Features are lower-dimensional mappings of the current state: $f(s) = [f_1(s), \ldots, f_n(s)]$
  - We assign a weight to each feature to measure its importance: $W = [w_1, \ldots, w_n]$
  - The value of a (Q-)state is the dot product of the weights and the features of that state:
    $V(s) = w_1 f_1(s) + \cdots + w_n f_n(s) = W \cdot f(s)$
    $Q(s, a) = w_1 f_1(s, a) + \cdots + w_n f_n(s, a) = W \cdot f(s, a)$
We get:
  - Reduced dimensionality
  - Faster learning
  - Loss of information!
  - States may share features but have different values
Approximate Q-Learning
Q-state decomposition
$Q(s, a) = w_1 f_1(s, a) + \cdots + w_n f_n(s, a)$
Q-learning with linear Q-functions (a sketch follows below):
  - $\mathit{difference} = [R + \gamma \max_{a'} Q(s', a')] - Q(s, a)$
  - $w_i \leftarrow w_i + \alpha \cdot \mathit{difference} \cdot f_i(s, a)$
Adjust the weights instead of the Q-values
When something unexpected happens (our current policy leads us to a bad outcome), find the responsible features (the ones that contributed most to the decision) and update their weights
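A minimal sketch of the approximate Q-learning weight update, reusing a feature extractor like the one sketched earlier; all names are illustrative.

```python
# Approximate Q-learning: represent Q(s, a) as a dot product of weights and features,
# and adjust the weights from a single observed transition.
def q_value(weights, feats):
    return sum(weights.get(name, 0.0) * value for name, value in feats.items())

def approx_q_update(weights, features_fn, actions, transition, alpha=0.05, gamma=0.9):
    """One update from a single observed transition (s, a, s', r)."""
    s, a, s2, r = transition
    feats = features_fn(s, a)
    # difference = [R + gamma * max_a' Q(s', a')] - Q(s, a)
    best_next = max(q_value(weights, features_fn(s2, a2)) for a2 in actions)
    difference = (r + gamma * best_next) - q_value(weights, feats)
    # w_i <- w_i + alpha * difference * f_i(s, a): adjust weights, not Q-values
    for name, value in feats.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return weights
```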