Reinforcement Learning
Yishay Mansour
Tel-Aviv University
Outline
• Goal of Reinforcement Learning
• Mathematical Model (MDP)
• Planning
2
Goal of Reinforcement Learning
Goal-oriented learning through interaction.
Control of large-scale stochastic environments with partial knowledge.
Supervised / Unsupervised Learning:
Learn from labeled / unlabeled examples.
3
Reinforcement Learning - origins
Artificial Intelligence
Control Theory
Operations Research
Cognitive Science & Psychology
Solid foundations; well-established research.
4
Typical Applications
• Robotics
– Elevator control [CB].
– Robo-soccer [SV].
• Board games
– Backgammon [T].
– Checkers [S].
– Chess [B].
• Scheduling
– Dynamic channel allocation [SB].
– Inventory problems.
5
Contrast with Supervised Learning
The system has a “state”.
The algorithm influences the state distribution.
Inherent Tradeoff: Exploration versus Exploitation.
6
Mathematical Model - Motivation
Model of uncertainty:
Environment, actions, our knowledge.
Focus on decision making.
Maximize long-term reward.
Markov Decision Process (MDP)
7
Mathematical Model - MDP
Markov decision processes
S - set of states
A - set of actions
δ - transition probability
R - reward function
Similar to DFA!
8
MDP model - states and actions
Environment = states
Actions = transitions: δ(s, a, s')
[Figure: taking action a at a state leads to two possible next states, with probabilities 0.7 and 0.3.]
9
MDP model - rewards
R(s,a) = reward at state s for doing action a (a random variable).
Example:
R(s,a) = -1 with probability 0.5
+10 with probability 0.35
+20 with probability 0.15
10
MDP model - trajectories
trajectory: s0, a0, r0, s1, a1, r1, s2, a2, r2, …
11
MDP - Return function.
Combining all the immediate rewards into a single value.
Modeling Issues:
Are early rewards more valuable than later rewards?
Is the system “terminating” or continuous?
Usually the return is linear in the immediate rewards.
12
MDP model - return functions
Finite Horizon - parameter H:
return = Σ_{i=1}^{H} R(si, ai)
Infinite Horizon:
Discounted - parameter γ < 1:
return = Σ_{i=0}^{∞} γ^i R(si, ai)
Undiscounted (average reward):
return = (1/N) Σ_{i=0}^{N-1} R(si, ai)
Terminating MDP (N = time of termination):
return = Σ_{i=0}^{N} R(si, ai)
13
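As a quick illustration of the discounted return, a minimal Python sketch (the function name and the toy reward sequence are mine, not from the slides):

# Discounted return of a (truncated) reward sequence: sum_i gamma^i * r_i.
def discounted_return(rewards, gamma=0.5):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: constant reward 1 for 10 steps with gamma = 1/2; the value
# approaches 1 / (1 - gamma) = 2 as the horizon grows.
print(discounted_return([1.0] * 10))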
MDP model - action selection
AIM: Maximize the expected return.
Fully Observable - can “see” the “entire” state.
Policy π - a mapping from states to actions.
Optimal policy π*: optimal from any start state.
THEOREM: There exists a deterministic optimal policy.
14
Contrast with Supervised Learning
Supervised Learning:
Fixed distribution on examples.
Reinforcement Learning:
The state distribution is policy dependent!!!
A small local change in the policy can make a huge
global change in the return.
15
MDP model - summary
sS
a A
- set of states, |S|=n.
- set of k actions, |A|=k.
d ( s1 , a , s 2 ) - transition function.
R(s,a)
 :S  A
- immediate reward function.
- policy.

g
i0
i
ri
- discounted cumulative return.
16
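One common way to hold such a finite MDP in code is a transition tensor plus a reward matrix; a minimal Python sketch (the class and field names are illustrative, not from the slides):

import numpy as np

class FiniteMDP:
    """P[a, s, s2] = delta(s, a, s2); R[s, a] = expected immediate reward."""
    def __init__(self, P, R, gamma):
        self.P = np.asarray(P, dtype=float)   # shape (k, n, n); each row sums to 1
        self.R = np.asarray(R, dtype=float)   # shape (n, k)
        self.gamma = gamma                    # discount factor, gamma < 1
        self.n_states, self.n_actions = self.R.shape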
Simple example: N-armed bandit
Single state s.
Goal: Maximize the sum of immediate rewards.
[Figure: a single state s with actions a1, a2, a3.]
Given the model:
choose the greedy action.
Difficulty:
the model is unknown.
17
N-Armed Bandit: Highlights
• Algorithms (near greedy):
– Exponential weights
• G_i = sum of rewards of action a_i
• w_i = e^{G_i}
– Follow the (perturbed) leader
• Results:
– For any sequence of T rewards:
– E[online] ≥ max_i {G_i} − sqrt(T log N)
18
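A minimal sketch of the exponential-weights idea, assuming the full reward vector is observed each round; the learning rate eta and the random stand-in rewards are illustrative choices, not from the slides:

import numpy as np

rng = np.random.default_rng(0)
N, T = 5, 10_000
eta = np.sqrt(np.log(N) / T)            # illustrative learning rate

G = np.zeros(N)                          # cumulative reward G_i of each action a_i
online = 0.0
for t in range(T):
    w = np.exp(eta * (G - G.max()))      # weights w_i ~ e^{eta * G_i}, shifted for stability
    a = rng.choice(N, p=w / w.sum())     # play a_i with probability proportional to w_i
    r = rng.random(N)                    # stand-in reward vector in [0, 1]
    online += r[a]
    G += r                               # full-information update of all G_i
print(G.max() - online)                  # gap should be on the order of sqrt(T log N)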
Planning - Basic Problems.
Given a complete MDP model.
Policy evaluation - given a policy π, estimate its return.
Optimal control - find an optimal policy π* (maximizes the return from any start state).
19
Planning - Value Functions
V(s) The expected return starting at state s following .
Q(s,a) The expected return starting at state s with
action a and then following .
V*(s) and Q*(s,a) are define using an optimal policy *.
V*(s) = max V(s)
20
Planning - Policy Evaluation
Discounted infinite horizon (Bellman Eq.):
Vπ(s) = E_{s'~δ(s,π(s))} [ R(s, π(s)) + γ Vπ(s') ]
Rewriting the expectation:
Vπ(s) = E[R(s, π(s))] + γ Σ_{s'} δ(s, π(s), s') Vπ(s')
Linear system of equations (one equation per state).
21
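Since the Bellman equation is linear in the values, policy evaluation can be done with a direct linear solve; a small numpy sketch (the 2-state transition matrix and rewards below are toy values of mine):

import numpy as np

# Solve (I - gamma * P_pi) V = R_pi, where P_pi[s, s'] and R_pi[s] are the
# transition matrix and expected reward under the fixed policy pi.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
R_pi = np.array([1.0, 0.0])
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V)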
Algorithms - Policy Evaluation
Example
A = {+1, -1}
γ = 1/2
δ(si, a) = si+a (cyclically, mod 4)
π random
∀a: R(si, a) = i
[Figure: states s0, s1, s2, s3 arranged on a cycle; the reward at si is i.]
Vπ(s0) = 0 + γ [ π(s0,+1) Vπ(s1) + π(s0,-1) Vπ(s3) ]
22
Algorithms - Policy Evaluation
Example (continued)
A = {+1, -1}
γ = 1/2
δ(si, a) = si+a (cyclically, mod 4)
π random
∀a: R(si, a) = i
[Figure: states s0, s1, s2, s3 arranged on a cycle; the reward at si is i.]
Vπ(s0) = 0 + (Vπ(s1) + Vπ(s3))/4
Solving the linear system:
Vπ(s0) = 5/3
Vπ(s1) = 7/3
Vπ(s2) = 11/3
Vπ(s3) = 13/3
23
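These values can be checked numerically; a short numpy check of the slide's example (the array layout is mine):

import numpy as np

# Four states on a cycle, random policy, R(s_i, a) = i, gamma = 1/2.
gamma, n = 0.5, 4
P_pi = np.zeros((n, n))
for i in range(n):
    P_pi[i, (i + 1) % n] = 0.5       # move to s_{i+1} with probability 1/2
    P_pi[i, (i - 1) % n] = 0.5       # move to s_{i-1} with probability 1/2
R_pi = np.arange(n, dtype=float)
V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)
print(V)                             # approximately [5/3, 7/3, 11/3, 13/3]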
Algorithms - optimal control
State-Action Value function:
Qπ(s,a) = E[R(s,a)] + γ E_{s'~δ(s,a)} [ Vπ(s') ]
Note:
Vπ(s) = Qπ(s, π(s))
for a deterministic policy π.
24
Algorithms - Optimal control
Example (continued)
A = {+1, -1}
γ = 1/2
δ(si, a) = si+a (cyclically, mod 4)
π random
R(si, a) = i
[Figure: states s0, s1, s2, s3 arranged on a cycle; the reward at si is i.]
Qπ(s0,+1) = 0 + γ Vπ(s1) = 7/6
Qπ(s0,-1) = 0 + γ Vπ(s3) = 13/6
25
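A quick check of these state-action values from the Vπ values of the previous slide, using Qπ(s,a) = R(s,a) + γ Vπ(δ(s,a)) since the transitions in this example are deterministic:

gamma = 0.5
V = [5/3, 7/3, 11/3, 13/3]           # V_pi from the policy-evaluation example
Q_plus  = 0 + gamma * V[1]           # Q_pi(s0, +1) = 0 + (1/2) * 7/3  = 7/6
Q_minus = 0 + gamma * V[3]           # Q_pi(s0, -1) = 0 + (1/2) * 13/3 = 13/6
print(Q_plus, Q_minus)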
Algorithms - optimal control
CLAIM: A policy π is optimal if and only if at each state s:
Vπ(s) ≥ max_a {Qπ(s,a)}
(Bellman Eq.)
PROOF: Assume there is a state s and an action a s.t.
Vπ(s) < Qπ(s,a).
Then the strategy of performing a at state s (the first time)
is better than π.
This is true each time we visit s, so the policy that always
performs action a at state s is better than π.
26
Algorithms - optimal control
Example (continued)
A = {+1, -1}
γ = 1/2
δ(si, a) = si+a (cyclically, mod 4)
π random
R(si, a) = i
[Figure: states s0, s1, s2, s3 arranged on a cycle; the reward at si is i.]
Changing the policy using the state-action value function.
27
Algorithms - optimal control
The greedy policy with respect to Qπ(s,a) is
π(s) = argmax_a {Qπ(s,a)}
The ε-greedy policy with respect to Qπ(s,a) is
π(s) = argmax_a {Qπ(s,a)} with probability 1-ε, and
π(s) = a random action with probability ε.
28
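A minimal sketch of greedy and ε-greedy selection, assuming Q is stored as an array of shape (number of states, number of actions):

import numpy as np

def greedy(Q, s):
    return int(np.argmax(Q[s]))                     # argmax_a Q(s, a)

def epsilon_greedy(Q, s, eps, rng=np.random.default_rng()):
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))        # random action with probability eps
    return greedy(Q, s)                             # greedy action with probability 1 - eps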
MDP - computing optimal policy
1. Linear Programming
2. Value Iteration method:
V_{i+1}(s) = max_a { R(s,a) + γ Σ_{s'} δ(s,a,s') V_i(s') }
3. Policy Iteration method:
π_i(s) = argmax_a { Q^{π_{i-1}}(s,a) }
29
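A sketch of the Value Iteration update, assuming the P[a, s, s'] and R[s, a] arrays from the MDP summary above (array layout and names are illustrative):

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P[a, s, s'] * V[s']
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)                       # V_{i+1}(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)                  # values and a greedy policy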
Convergence
• Value Iteration
– The distance from optimal, max_s {V*(s) - V_t(s)}, drops each iteration.
• Policy Iteration
– The policy can only improve: ∀s: V_{t+1}(s) ≥ V_t(s)
• Fewer iterations than Value Iteration, but
• more expensive iterations.
30
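For comparison, a sketch of Policy Iteration in the same array layout: each step evaluates the current policy exactly (a linear solve), then switches to the greedy policy, and stops when the policy is stable:

import numpy as np

def policy_iteration(P, R, gamma):
    n = R.shape[0]
    pi = np.zeros(n, dtype=int)                        # start from an arbitrary policy
    while True:
        P_pi = P[pi, np.arange(n), :]                  # P_pi[s, s'] = delta(s, pi(s), s')
        R_pi = R[np.arange(n), pi]
        V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)   # exact policy evaluation
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        pi_new = Q.argmax(axis=1)                      # greedy improvement
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new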
Relations to Board Games
• state = current board
• action = what we can play
• opponent action = part of the environment
• value function = probability of winning
• Q-function = modified policy
• Hidden assumption: the game is Markovian
31
Planning versus Learning
Tightly coupled in Reinforcement Learning
Goal: maximize return while learning.
32
Example - Elevator Control
Learning (alone):
Model the arrival process well.
Planning (alone):
Given an arrival model, build a schedule.
Real objective: construct a schedule while updating the model.
33
Partially Observable MDP
Rather than observing the state, we observe some function of the state.
Ob - observation function: a random variable for each state.
Example: (1) Ob(s) = s + noise. (2) Ob(s) = first bit of s.
Problem: different states may “look” similar.
The optimal strategy is history-dependent!
34
POMDP - Belief State Algorithm
Given a history of actions and observations, we compute a posterior distribution over the state we are in (the belief state).
The belief-state MDP:
States: distributions over S (the states of the POMDP).
Actions: as in the POMDP.
Transitions: the posterior distribution (given the observation).
We can perform planning and learning on the belief-state MDP.
35
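A minimal sketch of the belief update, assuming P[a, s, s'] is the transition function and O[s', o] = Pr[observation o in state s'] (the observation-model name O is mine):

import numpy as np

def update_belief(b, a, o, P, O):
    predicted = b @ P[a]                   # Pr[s' | belief b, action a]
    posterior = predicted * O[:, o]        # weight by the likelihood of observing o
    return posterior / posterior.sum()     # normalize back to a distribution over S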
POMDP - Hard Computational Problems
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a deterministic POMDP is PSPACE-hard (NP-complete) [PT,L].
Computing an infinite (polynomial) horizon undiscounted optimal strategy for a stochastic POMDP is EXPTIME-hard (PSPACE-complete) [PT,L].
Computing an infinite (polynomial) horizon undiscounted optimal policy for an MDP is P-complete [PT].
36
Resources
• Reinforcement Learning (an introduction)
[Sutton & Barto]
• Markov Decision Processes [Puterman]
• Dynamic Programming and Optimal Control [Bertsekas]
• Neuro-Dynamic Programming [Bertsekas & Tsitsiklis]
• Ph.D. thesis - Michael Littman
37