
Reinforcement learning
02/05/2017
Copyrights:
Szepesvári Csaba: Megerősítéses tanulás [Reinforcement Learning] (2004)
Szita István, Lőrincz András: Megerősítéses tanulás [Reinforcement Learning] (2005)
Richard S. Sutton and Andrew G. Barto: Reinforcement Learning: An Introduction (1998)
Reinforcement learning
http://www.youtube.com/watch?v=mRpX9DFCdwI
http://www.youtube.com/watch?v=VCdxqn0fcnE
Reinforcement learning
• Pavlov: Nomad 200 robot
• Nomad 200 simulator
(Sridhar Mahadevan, UMass)
Reinforcement learning
• Control tasks
• Planning of multiple actions
• Learning from interaction
• Objective: maximising reward (i.e. task-specific)
[Figure: agent trajectory with states s1 … s9, actions a1 … a9, and rewards r1 … r9 (e.g. +3, +50, −1, −1)]
Supervised vs. Reinforcement learning
Both are machine learning.

Supervised:
• prompt (immediate) supervision
• passive learning (a training dataset is given)

Reinforcement:
• late, indirect reinforcement
• active learning (the system takes actions, which are then reinforced)
Reinforcement learning
• time: t = 0, 1, 2, …
• states: s_t ∈ S
• actions: a_t ∈ A
• reward: r_t ∈ ℝ
• policy (strategy):
  – deterministic: π: S → A
  – stochastic: π: S × A → [0, 1], where π(s, a) is the probability that we choose action a
    being in state s
• discount factor: γ ∈ [0, 1) (infinite horizon)
• process: s_0, a_0, r_1, s_1, a_1, r_2, s_2, …
• model of the environment: transition probabilities P(s' | s, a) and rewards R(s, a)
  (see the representation sketch below)
• objective: find a policy which maximises the expected value of the total reward

Markov assumption → the dynamics of the system can be given by:
P(s_{t+1} = s' | s_t = s, a_t = a) and R(s, a)
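As an illustration only, such a tabular MDP could be written down in code roughly as below; the structure and names (P, R, gamma, pi) are my own assumptions, not notation from the slides.

```python
# A minimal, hypothetical tabular MDP representation (names are illustrative).
# P[s][a] is a list of (next_state, probability) pairs, R[s][a] is the reward.
mdp = {
    "states": ["s1", "s2"],
    "actions": ["a1", "a2"],
    "P": {
        "s1": {"a1": [("s1", 1.0)], "a2": [("s2", 1.0)]},
        "s2": {"a1": [("s1", 0.5), ("s2", 0.5)], "a2": [("s2", 1.0)]},
    },
    "R": {
        "s1": {"a1": 0.0, "a2": 2.0},
        "s2": {"a1": 0.0, "a2": 0.0},
    },
    "gamma": 0.9,
}

# A stochastic policy pi(s, a): probability of choosing action a in state s.
pi = {
    "s1": {"a1": 0.5, "a2": 0.5},
    "s2": {"a1": 1.0, "a2": 0.0},
}
```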
Markov Decision Processes (MDPs)
• Stochastic transitions
[Figure: a two-state MDP (states 1 and 2) with actions a1, a2 and rewards r = 0 and r = 2]
The exploration – exploitation dilemma
The k-armed bandit
[Figure: k-armed bandit: the agent chooses among arms with observed reward sequences such as 0, 0, 5, 10, 35 (avg. 10), 5, 10, −15, −15, −10 (avg. −5), and −20, 0, 50]
To maximise the reward in the long term we have to explore the world's dynamics; then we can exploit this knowledge and collect reward.
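One common way to balance exploration and exploitation (not necessarily the one used in the lecture demo) is an ε-greedy rule. A minimal sketch for a k-armed bandit; the reward distributions are made up:

```python
import random

# Hypothetical k-armed bandit: pulling arm i returns a noisy reward around true_means[i].
true_means = [10.0, -5.0, 1.0]

def pull(arm):
    return random.gauss(true_means[arm], 5.0)

k = len(true_means)
counts = [0] * k          # how many times each arm was pulled
estimates = [0.0] * k     # running average reward per arm
epsilon = 0.1             # exploration probability

total = 0.0
for step in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(k)                        # explore
    else:
        arm = max(range(k), key=lambda i: estimates[i])  # exploit current knowledge
    r = pull(arm)
    counts[arm] += 1
    estimates[arm] += (r - estimates[arm]) / counts[arm]  # incremental mean
    total += r

print("estimated arm values:", [round(v, 2) for v in estimates])
print("total reward:", round(total, 1))
```

With ε = 0 the agent may lock onto a poor arm forever; a small ε keeps exploring so the estimates converge.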
Discounting
• infinate horizon
– rt can be infinate!
– solution: discounting. Instead of rt we
use t rt , <1
– always finate
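A tiny numeric check of the discounting idea (the reward values are made up):

```python
# Discounted return: R = sum_t gamma^t * r_t; gamma < 1 keeps the sum finite
# even for very long reward sequences (here: a constant reward of 1, truncated
# at a large horizon for the demonstration).
gamma = 0.9
rewards = [1.0] * 1000
R = sum(gamma**t * r for t, r in enumerate(rewards))
print(R)   # close to 1 / (1 - gamma) = 10
```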
Markov Decision Process
• the environment changes according to P and R: s_{t+1} ~ P(· | s_t, a_t), reward r_{t+1}
• the agent takes an action: a_t ~ π(s_t, ·)
• we are looking for the optimal policy π* which maximises the expected discounted return
Long-term reward
• The policy π of the agent is fixed
• R_t is the total discounted reward (return) after step t:
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k≥0} γ^k r_{t+k+1}
[Figure: rewards r1 … r9 (e.g. +3, +50, −1, −1) collected along the trajectory]
Value = expected total reward
• The expected value of R_t depends on π
• V^π(s) = E_π[ R_t | s_t = s ] is the value function
• Task: find the optimal policy π* which maximises the expected return in each state
  – we optimise (search for π*) for the long-term reward instead of the immediate (greedy) reward
[Figure: agent–environment interaction: s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, …]
Bellman equation
Based on the Markov assumption, a recursive formula can be derived for the expected return:
V^π(s) = Σ_a π(s, a) Σ_{s'} P(s' | s, a) [ R(s, a) + γ V^π(s') ]
[Figure: backup diagram for V^π: state s, action π(s), successor states]
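For a known finite MDP the Bellman equation of a fixed policy is a linear system, so V^π can be obtained by a direct solve. A minimal sketch; the model arrays here are illustrative placeholders, not the example MDP from these slides:

```python
import numpy as np

# Tabular MDP with nS states and nA actions (illustrative random model).
nS, nA, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities
R = rng.uniform(-1, 1, size=(nS, nA))           # R[s, a] expected immediate reward
pi = np.full((nS, nA), 1.0 / nA)                # uniform random policy pi(s, a)

# Bellman equation for V^pi:  V = R_pi + gamma * P_pi V  =>  (I - gamma P_pi) V = R_pi
P_pi = np.einsum("sa,sat->st", pi, P)           # state-to-state transitions under pi
R_pi = np.einsum("sa,sa->s", pi, R)             # expected one-step reward under pi
V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print(V)
```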
Preference relation among policies
• π_1 ≥ π_2 iff V^{π_1}(s) ≥ V^{π_2}(s) for every state s
• this is a partial ordering
• π* is optimal if π* ≥ π for every policy π
• an optimal policy exists for every problem
Example MDP
[Figure: four-state MDP with states A, B, C, D and actions 1, 2; transitions carry reward −10, reaching the objective gives +100]
• 4 states, 2 actions
• 10% chance to take the non-selected action
Two example policies
π_1(A,1) = 1    π_1(A,2) = 0
π_1(B,1) = 1    π_1(B,2) = 0
π_1(C,1) = 1    π_1(C,2) = 0
π_1(D,1) = 1    π_1(D,2) = 0
[Figure: the example MDP diagram (states A–D, actions 1, 2, rewards −10 and +100)]
• solution for π_1: V^{π_1}(A) = 75.61, V^{π_1}(B) = 87.56, V^{π_1}(C) = 68.05, V^{π_1}(D) = 100
• solution for π_2: V^{π_2}(A) = 75.61, V^{π_2}(B) = 68.05, V^{π_2}(C) = 87.56, V^{π_2}(D) = 100
A third policy
π_3(A,1) = 0.4    π_3(A,2) = 0.6
π_3(B,1) = 1      π_3(B,2) = 0
π_3(C,1) = 0      π_3(C,2) = 1
π_3(D,1) = 1      π_3(D,2) = 0
[Figure: the example MDP diagram (states A–D, actions 1, 2, rewards −10 and +100)]
• solution for π_3: V^{π_3}(A) = 77.78, V^{π_3}(B) = 87.78, V^{π_3}(C) = 87.78, V^{π_3}(D) = 100
Comparison of the 3 policies

State    π_1      π_2      π_3
A        75.61    75.61    77.78
B        87.56    68.05    87.78
C        68.05    87.56    87.78
D        100      100      100

• π_1 ≤ π_3 and π_2 ≤ π_3
• π_3 is optimal
• there can be many optimal policies!
• the optimal value function (V*) is unique
Optimal policies and the Bellman equation
• Q^π(s, a) is the action-value function: the expected return of taking action a in state s and following π afterwards
• The optimal policies share the same value function:
  V*(s) = max_π V^π(s),   Q*(s, a) = max_π Q^π(s, a)
• Greedy policy: π(s) = argmax_a Q*(s, a) (sketched below)
• The greedy policy (with respect to Q*) is optimal!
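For a tabular Q*, the greedy policy is just an argmax over the actions. A minimal sketch with a made-up Q table:

```python
import numpy as np

# Hypothetical optimal action-value table Q*[s, a] for 3 states and 2 actions.
Q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 4.0]])

greedy_policy = Q_star.argmax(axis=1)   # pi(s) = argmax_a Q*(s, a)
print(greedy_policy)                    # [1 0 0]; ties are broken by taking the first maximiser
```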
Optimal policies and the Bellman equation
• Bellman optimality equation:
  V*(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a) + γ V*(s') ]
• non-linear (because of the max)!
• has a unique solution
• solving it solves the long-term planning problem
Dynamic programming for MDPs (DP)
• assume P and R are known
• searching for the optimal policy π*
• Policy iteration
• Value iteration (a sketch follows below)
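A minimal value-iteration sketch for a known tabular MDP; the model arrays are illustrative placeholders:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s, a, s'] transition probabilities, R[s, a] expected rewards."""
    nS, nA, _ = P.shape
    V = np.zeros(nS)
    while True:
        # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = R + gamma * P @ V              # shape (nS, nA)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy (optimal) policy
        V = V_new

# Illustrative random model, just to make the sketch runnable.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.uniform(-1, 1, size=(3, 2))
V_opt, pi_opt = value_iteration(P, R)
print(V_opt, pi_opt)
```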
Policy iteration
Jack's Car Rental Problem: Jack manages two locations for a nationwide car rental company. Each day, some
number of customers arrive at each location to rent cars. If Jack has a car available, he rents it out and is credited
$10 by the national company. If he is out of cars at that location, then the business is lost. Cars become available
for renting the day after they are returned. To help ensure that cars are available where they are needed, Jack can
move them between the two locations overnight, at a cost of $2 per car moved. We assume that the number of
cars requested and returned at each location are Poisson random variables with parameter λ. Suppose λ is 3 and
4 for rental requests at the first and second locations and 3 and 2 for returns. To simplify the problem slightly, we
assume that there can be no more than 20 cars at each location (any additional cars are returned to the
nationwide company, and thus disappear from the problem) and a maximum of five cars can be moved from one
location to the other in one night. We take the discount rate to be 0.9 and formulate this as an MDP, where the time
steps are days, the state is the number of cars at each location at the end of the day, and the actions are the net
numbers of cars moved between the two locations overnight.
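Modelling Jack's problem needs the Poisson demand distributions described above; the sketch below only shows the generic policy-iteration loop (evaluate the current policy exactly, then improve it greedily) on an illustrative tabular MDP, which is my own stand-in rather than the lecture's code:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Generic policy iteration on a tabular MDP: P[s, a, s'], R[s, a]."""
    nS, nA, _ = P.shape
    policy = np.zeros(nS, dtype=int)           # start with an arbitrary deterministic policy
    while True:
        # Policy evaluation: solve the linear Bellman system for the current policy.
        P_pi = P[np.arange(nS), policy]        # (nS, nS)
        R_pi = R[np.arange(nS), policy]        # (nS,)
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * P @ V                  # (nS, nA)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V                   # stable policy => optimal
        policy = new_policy

# Illustrative random model, just to make the sketch runnable.
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(4), size=(4, 2))
R = rng.uniform(-1, 1, size=(4, 2))
print(policy_iteration(P, R))
```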
If P and R are NOT known
• we are searching for V^π
• R(s): the return starting from s (a random variable)
• V^π(s) = E_π[ R(s) ]
• estimation of V^π(s) by averaging sampled returns
Monte Carlo methods (MC)
• estimating R(s) by simulation (remember: we don't know P and R)
• take N episodes starting from s, following π
• V^π(s) ≈ (1/N) Σ_{i=1}^{N} R_i(s)
Monte Carlo policy evaluation
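A minimal first-visit Monte Carlo policy-evaluation sketch; the toy environment (env_reset, env_step) and the policy are hypothetical stand-ins, since the slides do not specify an implementation:

```python
import random
from collections import defaultdict

# Hypothetical toy environment: states 0..4, the episode ends when state 4 is reached.
def env_reset():
    return 0

def env_step(state, action):
    next_state = min(state + (1 if action == 1 else random.choice([0, 1])), 4)
    reward = 10.0 if next_state == 4 else -1.0
    return next_state, reward, next_state == 4

def policy(state):                 # a fixed (here: uniformly random) policy pi to evaluate
    return random.choice([0, 1])

def mc_policy_evaluation(n_episodes=5000, gamma=0.9):
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        # Generate one episode following pi.
        episode, state, done = [], env_reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env_step(state, action)
            episode.append((state, reward))
            state = next_state
        # Compute returns backwards; average them per first-visited state.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:
                returns_sum[s] += G
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

print(mc_policy_evaluation())
```

The estimate for each state is simply the average of the returns observed after its first visit in each episode, which converges to V^π(s) as the number of episodes grows.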