Making Complex Decisions – Markov Decision Processes

Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University
[email protected]
www.cs.iastate.edu/~honavar/
www.cild.iastate.edu/
www.bcb.iastate.edu/
www.igert.iastate.edu
Making Complex Decisions: Markov Decision Problem

How to use knowledge about the world to make decisions when
• there is uncertainty about the consequences of actions
• rewards are delayed
The Solution

Sequential decision problems in uncertain environments can be solved by calculating a policy that associates an optimal decision with every environmental state. This formulation is known as a Markov Decision Process (MDP).
Example

The world: [Figure: a 4×3 grid with a start square, a terminal square with reward +1, and a terminal square with reward −1.]

Actions have uncertain consequences: each action moves in the intended direction with probability 0.8, and in each of the two perpendicular directions with probability 0.1.
Cumulative Discounted Reward

Suppose rewards are bounded by M. The cumulative discounted reward over n + 1 steps is then bounded by
  M + Mγ + Mγ² + … + Mγ^n = M (1 − γ^(n+1)) / (1 − γ) ≤ M / (1 − γ)
Note: for the geometric series to converge, 0 ≤ γ < 1.
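As a quick numerical check of this bound, here is a minimal Python sketch; the reward sequence and the values of M and γ below are made up purely for illustration.

# Check: if |r_t| <= M for all t, then sum_t gamma^t * r_t <= M / (1 - gamma)
gamma = 0.9                                   # discount factor, 0 <= gamma < 1
M = 5.0                                       # bound on the per-step reward
rewards = [5.0, 3.0, -2.0, 5.0, 4.0] * 200    # an arbitrary bounded reward sequence

discounted_return = sum((gamma ** t) * r for t, r in enumerate(rewards))
bound = M / (1 - gamma)
print(discounted_return <= bound)             # True: the geometric-series bound holds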
Utility of a State Sequence

• Additive rewards:
  U_h([s_0, s_1, s_2, …]) = R(s_0) + R(s_1) + R(s_2) + …

• Discounted rewards:
  U_h([s_0, s_1, s_2, …]) = R(s_0) + γ R(s_1) + γ² R(s_2) + …
Utility of a State

• The utility of a state is the expected sum of discounted rewards if the agent executes the policy π:
  U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]

• The true utility of a state corresponds to the optimal policy π*.
Calculating the Optimal Policy

• Value iteration
• Policy iteration
Value Iteration

• Calculate the utility of each state
• Then use the state utilities to select an optimal action in each state:
  π*(s) = argmax_a Σ_{s'} T(s, a, s') U(s')
Value Iteration Algorithm

function value-iteration(MDP) returns a utility function
  local variables: U, U' initially identical to R
  repeat
    U ← U'
    for each state s do
      U'(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')      ← the Bellman update
    end
  until close-enough(U, U')
  return U
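Below is a minimal Python sketch of this algorithm. The MDP representation is assumed for illustration, not given in the slides: T[s][a] is a list of (probability, next_state) pairs, R[s] is the reward in state s, and actions(s) returns the actions available in s (an empty list for terminal states).

def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """Repeatedly apply the Bellman update until the utilities stop changing."""
    U = dict(R)                            # initialize utilities to the rewards
    while True:
        U_new, delta = {}, 0.0
        for s in states:
            acts = actions(s)
            # Bellman update: U'(s) = R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
            best = max((sum(p * U[s2] for p, s2 in T[s][a]) for a in acts), default=0.0)
            U_new[s] = R[s] + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                    # "close enough"
            return U

def best_policy(states, actions, T, U):
    """pi*(s) = argmax_a sum_s' T(s,a,s') U(s')"""
    return {s: max(actions(s), key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
            for s in states if actions(s)}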
Value Iteration Algorithm: Example

[Figure: the utilities of the states of the 4×3 world obtained after value iteration – top row: 0.812, 0.868, 0.912, +1; middle row: 0.762, 0.660, −1; bottom row: 0.705, 0.655, 0.611, 0.388.]
Policy Iteration
• Pick a policy, then calculate the utility of each state
given that policy (value determination step)
• Update the policy at each state using the utilities of
the successor states
• Repeat until the policy stabilizes
Policy Iteration Algorithm

function policy-iteration(MDP) returns a policy
  local variables: U, a utility function; π, a policy
  repeat
    U ← value-determination(π, U, MDP, R)
    unchanged? ← true
    for each state s do
      if max_a Σ_{s'} T(s, a, s') U(s') > Σ_{s'} T(s, π(s), s') U(s') then
        π(s) ← argmax_a Σ_{s'} T(s, a, s') U(s')
        unchanged? ← false
    end
  until unchanged?
  return π
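A minimal Python sketch of policy iteration, using the same assumed MDP representation as the value-iteration sketch above; value_determination, which evaluates a fixed policy exactly, is sketched after the next slide.

import random

def policy_iteration(states, actions, T, R, gamma=0.9):
    """Alternate exact policy evaluation with greedy policy improvement."""
    pi = {s: random.choice(actions(s)) for s in states if actions(s)}  # arbitrary initial policy
    while True:
        U = value_determination(pi, states, T, R, gamma)
        unchanged = True
        for s in states:
            if not actions(s):                     # skip terminal states
                continue
            q = lambda a: sum(p * U[s2] for p, s2 in T[s][a])   # expected utility of a in s
            best_a = max(actions(s), key=q)
            if q(best_a) > q(pi[s]) + 1e-12:       # strictly better action found
                pi[s] = best_a
                unchanged = False
        if unchanged:
            return pi, U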
Value Determination

• A simplification of the value iteration algorithm, because the policy is fixed
• The equations are linear, because the max() operator has been removed
• Solve exactly for the utilities using standard linear algebra
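With the policy fixed, the utilities satisfy the linear system U(s) = R(s) + γ Σ_{s'} T(s, π(s), s') U(s'). A minimal sketch using NumPy, with the same assumed MDP representation as in the sketches above:

import numpy as np

def value_determination(pi, states, T, R, gamma=0.9):
    """Solve (I - gamma * P_pi) U = R exactly, where P_pi is the transition
    matrix induced by the fixed policy pi."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))
    for s in states:
        if s in pi:                                # terminal states keep a zero row
            for p, s2 in T[s][pi[s]]:
                P[idx[s], idx[s2]] += p
    r = np.array([R[s] for s in states], dtype=float)
    u = np.linalg.solve(np.eye(n) - gamma * P, r)
    return {s: u[idx[s]] for s in states}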
Optimal Policy
(policy iteration with 11 linear equations)

[Figure: the 4×3 grid world with its +1 and −1 terminal squares.]

u(1,1) = 0.8 u(1,2) + 0.1 u(2,1) + 0.1 u(1,1)
u(1,2) = 0.8 u(1,3) + 0.2 u(1,2)
…
Partially observable MDP (POMDP)

• In an inaccessible environment, the percept does not provide enough information to determine the state or the transition probabilities
• POMDP
  – State transition function: P(st+1 | st, at)
  – Observation function: P(ot | st, at)
  – Reward function: E(rt | st, at)
• Approach
  – Calculate a probability distribution over the possible states given all previous percepts, and base decisions on this distribution
Learning from Interaction with the world

• An agent receives sensations or percepts from the environment through its sensors, acts on the environment through its effectors, and occasionally receives rewards or punishments from the environment
• The goal of the agent is to maximize its reward (pleasure) or minimize its punishment (pain) as it acts in an a priori unknown, uncertain environment
Supervised Learning

Experience = Labeled Examples

[Diagram: inputs → supervised learning system → outputs]

Objective – minimize the error between the desired and actual outputs
Reinforcement Learning

Experience = Action-induced State Transitions and Rewards

[Diagram: inputs → reinforcement learning system → outputs = actions]

Objective – maximize reward
Reinforcement learning

• The learner is not told which actions to take
• Rewards and punishments may be delayed
  – Sacrifice short-term gains for greater long-term gains
• The need to trade off exploration and exploitation
• The environment may be unobservable or only partially observable
• The environment may be deterministic or stochastic
Reinforcement learning

[Diagram: the agent sends actions to the environment; the environment returns a state and a reward to the agent.]
Key elements of an RL System

• Policy – what to do
• Reward – what is good
• Value – what is good because it predicts reward
• Model of the environment – what follows what
An Extended Example: Tic-Tac-Toe

[Figure: the game tree of tic-tac-toe positions, with levels alternating between X's moves and O's moves.]

Assume an imperfect opponent: he/she sometimes makes mistakes.
A Simple RL Approach to Tic-Tac-Toe

Make a table with one entry per state:
  V(s) – estimated probability of winning
[Table of example states and their initial values: states from which we have already won have V(s) = 1, states from which we have lost or drawn have V(s) = 0, and all other states start at V(s) = 0.5.]

Now play lots of games. To pick our moves, look ahead one step from the current state to the possible next states.
Pick the next state with the highest estimated prob. of winning — the largest V(s) – a greedy move; occasionally pick a move at random – an exploratory move.
RL Learning Rule for Tic-Tac-Toe

[Figure: the sequence of states visited in one game, starting from the starting position and alternating our moves with the opponent's moves; s denotes the state before our greedy move, s′ the state after it, and starred branches mark "exploratory" moves.]

We increment each V(s) toward V(s′) – a backup:
  V(s) ← V(s) + α [ V(s′) − V(s) ]
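A minimal Python sketch of this table-based learning rule. The state representation and the move generator are left abstract (candidate_states stands for whatever positions are reachable in one move); the constants are illustrative.

from collections import defaultdict
import random

V = defaultdict(lambda: 0.5)      # V(s): estimated probability of winning, default 0.5
ALPHA = 0.1                       # step size
EPSILON = 0.1                     # fraction of exploratory moves

def choose_next_state(candidate_states):
    """Greedy with respect to V, with occasional exploratory moves."""
    if random.random() < EPSILON:
        return random.choice(candidate_states)          # exploratory move
    return max(candidate_states, key=lambda s: V[s])    # greedy move: largest V(s)

def backup(s, s_next):
    """V(s) <- V(s) + alpha * (V(s') - V(s)), applied after each greedy move."""
    V[s] += ALPHA * (V[s_next] - V[s])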
Why is Tic-Tac-Toe Too Easy?
• Number of states is small and finite
• One-step look-ahead is always possible
• State completely observable
Some Notable RL Applications

• TD-Gammon – world's best backgammon program (Tesauro)
• Elevator Control – Crites & Barto
• Inventory Management – 10–15% improvement over industry standard methods – Van Roy, Bertsekas, Lee and Tsitsiklis
• Dynamic Channel Assignment – high-performance assignment of radio channels to mobile telephone calls – Singh and Bertsekas
The n-Armed Bandit Problem

• Choose repeatedly from one of n actions; each choice is called a play
• After each play a_t, you get a reward r_t, where
  E[ r_t | a_t ] = Q*(a_t)
  The distribution of r_t depends only on a_t
• The objective is to maximize the reward in the long term, e.g., over 1000 plays
The Exploration – Exploitation Dilemma

• Suppose you form action value estimates Q_t(a) ≈ Q*(a)
• The greedy action at time t is a_t* = argmax_a Q_t(a)
  a_t = a_t*  ⇒ exploitation
  a_t ≠ a_t*  ⇒ exploration
• You can't exploit all the time; you can't explore all the time
• You can never stop exploring; but you could reduce exploring
Action-Value Methods

• Stateless
• Adapt action-value estimates and nothing else
• Suppose that by the t-th play, action a has been chosen k_a times, producing rewards r_1, r_2, …, r_{k_a}; then
  Q_t(a) = (r_1 + r_2 + … + r_{k_a}) / k_a
  lim_{k_a → ∞} Q_t(a) = Q*(a)
ε-Greedy Action Selection

• Greedy:
  a_t = a_t* = argmax_a Q_t(a)

• ε-Greedy:
  a_t = a_t* with probability 1 − ε, and a random action with probability ε

• Boltzmann:
  Pr(choosing action a at time t) = e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}
  where τ is the computational temperature
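A minimal Python sketch of these three selection rules, operating on a list Q of current action-value estimates (the names and default parameter values are illustrative):

import math
import random

def greedy(Q):
    return max(range(len(Q)), key=lambda a: Q[a])        # argmax_a Q_t(a)

def epsilon_greedy(Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(Q))                  # explore
    return greedy(Q)                                     # exploit

def boltzmann(Q, tau=1.0):
    """Sample an action with probability proportional to exp(Q_t(a) / tau)."""
    prefs = [math.exp(q / tau) for q in Q]
    r, acc = random.random() * sum(prefs), 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return a
    return len(Q) - 1                                    # guard against rounding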
Incremental Implementation

Recall the sample-average estimation method: the average of the first k rewards is
  Q_k = (r_1 + r_2 + … + r_k) / k

Incremental update rule – does not require storing past rewards:
  Q_{k+1} = Q_k + (1 / (k + 1)) [ r_{k+1} − Q_k ]
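A quick Python sketch showing that the incremental rule reproduces the sample average without storing past rewards (the reward list is illustrative):

def incremental_mean(Q_k, k, r_next):
    """Q_{k+1} = Q_k + (r_{k+1} - Q_k) / (k + 1), where Q_k averages the first k rewards."""
    return Q_k + (r_next - Q_k) / (k + 1)

rewards = [1.0, 0.0, 2.0, 1.0]
Q, k = 0.0, 0
for r in rewards:
    Q = incremental_mean(Q, k, r)
    k += 1
assert abs(Q - sum(rewards) / len(rewards)) < 1e-12     # matches the batch average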
Tracking a Nonstationary Environment

Choosing Q_k to be a sample average is appropriate in a stationary environment, in which the dependence of rewards on actions is time invariant – none of the Q*(a) change over time.

In a nonstationary environment it is better to use an exponential, recency-weighted average:
  Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]   for constant α, 0 < α ≤ 1
which gives
  Q_k = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i
Reinforcement learning when the agent can sense and respond to environmental states

[Diagram: at each step the agent observes state s_t and reward r_t and emits action a_t; the environment returns reward r_{t+1} and next state s_{t+1}, giving the trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, …]

Agent and environment interact at discrete time steps t = 0, 1, 2, …
  The agent observes the state at step t:  s_t ∈ S
  produces an action at step t:  a_t ∈ A(s_t)
  gets the resulting reward:  r_{t+1} ∈ ℜ
  and the resulting next state:  s_{t+1}
The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities
  π_t(s, a) = probability that a_t = a when s_t = s

• Reinforcement learning methods specify how the agent changes its policy as a result of experience.
• Roughly, the agent's goal is to get as much reward as it can over the long run.
Agent-Environment Interface – Goals and Rewards

• Is a scalar reward signal an adequate notion of a goal? – maybe not, but it is surprisingly flexible.
• A goal should specify what we want to achieve, not how we want to achieve it.
• A goal is typically outside the agent's direct control
• The agent must be able to measure success:
  • explicitly
  • frequently during its lifespan
Rewards

Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, …
What do we want to maximize?

In general, we want to maximize the expected return, E{R_t}, for each step t.

Episodic tasks – the interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze:
  R_t = r_{t+1} + r_{t+2} + … + r_T,
where T is a final time step at which a terminal state is reached, ending an episode.
Rewards for Continuing Tasks

Continuing tasks: the interaction does not have natural episodes.

Discounted return:
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1},
where γ, 0 ≤ γ ≤ 1, is the discount rate.
  γ near 0: shortsighted;  γ near 1: farsighted
Example – Pole Balancing Task

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
  reward = +1 for each step before failure
  ⇒ return = number of steps before failure

As a continuing task with discounted return:
  reward = −1 upon failure; 0 otherwise
  ⇒ return = −γ^k, for k steps before failure

In either case, the return is maximized by avoiding failure for as long as possible.
Example – Driving task

Get to the top of the hill as quickly as possible.
  reward = −1 for each step when not at the top of the hill
  ⇒ return = −(number of steps before reaching the top of the hill)

The return is maximized by minimizing the number of steps taken to reach the top of the hill.
The Markov Property

• By the state at step t, we mean whatever information is available to the agent at step t about its environment.
• The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
• Ideally, a state should summarize past sensations so as to retain all essential information – it should have the Markov Property:
  Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0 } = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }
  for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0.
Markov Decision Processes

• If a reinforcement learning task has the Markov Property, it is called a Markov Decision Process (MDP).
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to specify:
  • the state and action sets
  • the one-step dynamics, defined by the transition probabilities:
    P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }   for all s, s′ ∈ S, a ∈ A(s)
  • the expected rewards:
    R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }   for all s, s′ ∈ S, a ∈ A(s)
Finite MDP Example: Recycling Robot

• At each step, the robot has to decide whether to
  a) actively search for a can,
  b) wait for someone to bring it a can, or
  c) go to home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
• Decisions are made on the basis of the current energy level: high, low.
• Reward = number of cans collected
Value Functions

• The value of a state is the expected return starting from that state; it depends on the agent's policy.
  State-value function for policy π:
  V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.
  Action-value function for policy π:
  Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }
Bellman Equation for a Policy π

The basic idea:
  R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + …
      = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + … )
      = r_{t+1} + γ R_{t+1}

So:
  V^π(s) = E_π{ R_t | s_t = s } = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:
  V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
Optimal Value Functions

• For finite MDPs, policies can be partially ordered:
  π ≥ π′  if and only if  V^π(s) ≥ V^{π′}(s) for all s ∈ S
• There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.
• Optimal policies share the same optimal state-value function:
  V*(s) = max_π V^π(s)   for all s ∈ S
• Optimal policies also share the same optimal action-value function:
  Q*(s, a) = max_π Q^π(s, a)   for all s ∈ S and a ∈ A(s)
  This is the expected return for taking action a in state s and thereafter following an optimal policy.
Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:
  V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
        = max_{a ∈ A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
        = max_{a ∈ A(s)} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

[Backup diagram: from state s, take the max over actions a, then branch over rewards r and successor states s′.]

V* is the unique solution of this system of nonlinear equations.
Bellman Optimality Equation for Q*

  Q*(s, a) = E{ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a }
           = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]

[Backup diagram: from the pair (s, a), branch over rewards r and successor states s′, then take the max over successor actions a′.]

Q* is the unique solution of this system of nonlinear equations.
Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.
Therefore, given V*, one-step-ahead search produces the long-term optimal actions.
What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:
  π*(s) = argmax_{a ∈ A(s)} Q*(s, a)
Solving the Bellman Optimality Equation

• Finding an optimal policy by solving the Bellman Optimality Equation requires:
  – accurate knowledge of the environment dynamics;
  – enough space and time to do the computation;
  – the Markov Property.
• How much space and time do we need?
  – polynomial in the number of states (via dynamic programming methods),
  – BUT the number of states is often huge
• We usually have to settle for approximations.
• Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
Efficiency of DP

• Finding an optimal policy is polynomial in the number of states…
• BUT the number of states often grows exponentially with the number of state variables
• In practice, classical DP can be applied to problems with a few million states
• Asynchronous DP can be applied to larger problems and is appropriate for parallel computation
• It is surprisingly easy to come up with MDPs for which DP methods are not practical
Reinforcement learning

[Diagram: the agent sends actions to the environment; the environment returns a state and a reward to the agent.]
Markov Decision Processes

Assume
• a finite set of states S
• a set of actions A
• at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
• it then receives immediate reward r_t
• and the state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r are not necessarily known to the agent
Agent's learning task

• Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes
  E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ]
  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:
• the target function is π : S → A
• but we have no training examples of the form 〈s, a〉
• training examples are of the form 〈〈s, a〉, r〉
Reinforcement learning problem

• Goal: learn to choose actions that maximize
  r_0 + γ r_1 + γ² r_2 + … , where 0 ≤ γ < 1
Learning An Action-Value Function

Estimate Q^π for the current behavior policy π.

[Diagram: a trajectory of state–action pairs (s_t, a_t), r_{t+1}, (s_{t+1}, a_{t+1}), r_{t+2}, (s_{t+2}, a_{t+2}), …]

After every transition from a nonterminal state s_t, do:
  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
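A minimal Python sketch of this on-policy (Sarsa-style) update for a tabular action-value function; the state and action encodings are left abstract and the step-size and discount values are illustrative.

from collections import defaultdict

Q = defaultdict(float)            # tabular action-value estimates Q(s, a)

def on_policy_update(s, a, r, s_next, a_next, terminal, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)];
    if s' is terminal, Q(s',a') is taken to be 0."""
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])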
Value function

• To begin, consider deterministic worlds...
• For each possible policy π the agent might adopt, we can define an evaluation function over states
  V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + … ≡ Σ_{i=0}^{∞} γ^i r_{t+i}
  where r_t, r_{t+1}, … are generated by following policy π starting at state s
• Restated, the task is to learn the optimal policy π*:
  π* ≡ argmax_π V^π(s), (∀s)
What to learn

• We might try to have the agent learn the evaluation function V^{π*} (which we write as V*)
• It could then do a look-ahead search to choose the best action from any state s because
  π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
• A problem:
  • This works well if the agent knows δ : S × A → S and r : S × A → ℜ
  • But when it doesn't, it can't choose actions this way
Action-Value function – Q function

• Define a new function very similar to V*:
  Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
• If the agent learns Q, it can choose the optimal action even without knowing δ:
  π*(s) ≡ argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
  π*(s) ≡ argmax_a Q(s, a)
• Q is the evaluation function the agent will learn
Training rule to learn Q

• Note that Q and V* are closely related:
  V*(s) = max_{a′} Q(s, a′)
• This allows us to write Q recursively as
  Q(s, a) = r(s, a) + γ V*(δ(s, a)) = r(s, a) + γ max_{a′} Q(δ(s, a), a′)
• Let Q̂ denote the learner's current approximation to Q. Consider the training rule
  Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  where s′ is the state resulting from applying action a in state s.
Q-Learning

  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
Q Learning for Deterministic Worlds

• For each s, a initialize the table entry Q̂(s, a) ← 0
• Observe the current state s
• Do forever:
  • Select an action a and execute it
  • Receive immediate reward r
  • Observe the new state s′
  • Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
  • s ← s′
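A minimal Python sketch of this loop for a deterministic world. The environment interface (reset, actions, step) is an assumption made for illustration, not something given in the slides; exploration is added with a simple ε-greedy choice.

from collections import defaultdict
import random

def q_learning_deterministic(env, episodes=1000, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with the deterministic update Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q = defaultdict(float)                      # all entries start at 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:       # occasional exploratory action
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            r, s_next, done = env.step(s, a)    # assumed to return (reward, next state, done flag)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in env.actions(s_next))
            Q[(s, a)] = r + gamma * best_next   # deterministic update: no step size needed
            s = s_next
    return Q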
Updating Q

  Q̂(s_1, a_right) ← r + γ max_{a′} Q̂(s_2, a′)
                  ← 0 + 0.9 max{63, 81, 100}
                  ← 90

Notice that if rewards are non-negative, then
  (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
and
  (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)
Convergence theorem

• Theorem: Q̂ converges to Q. Consider the case of a deterministic world, with bounded immediate rewards, where each 〈s, a〉 is visited infinitely often.
• Proof: Define a full interval to be an interval during which each 〈s, a〉 is visited. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.
• Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is,
  Δ_n = max_{s,a} | Q̂_n(s, a) − Q(s, a) |
Convergence theorem

• For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is
  | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a′} Q̂_n(s′, a′)) − (r + γ max_{a′} Q(s′, a′)) |
                              = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
Q Learning Recipe

  | Q̂_{n+1}(s, a) − Q(s, a) | = | (r + γ max_{a′} Q̂_n(s′, a′)) − (r + γ max_{a′} Q(s′, a′)) |
                              = γ | max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′) |
                              ≤ γ max_{a′} | Q̂_n(s′, a′) − Q(s′, a′) |
                              ≤ γ max_{s′′, a′} | Q̂_n(s′′, a′) − Q(s′′, a′) |
  so | Q̂_{n+1}(s, a) − Q(s, a) | ≤ γ Δ_n

Note we used the general fact that
  | max_a f_1(a) − max_a f_2(a) | ≤ max_a | f_1(a) − f_2(a) |
Non-deterministic case

• What if the reward and the next state are non-deterministic?
• We redefine V and Q by taking expected values:
  V^π(s) ≡ E[ r_t + γ r_{t+1} + γ² r_{t+2} + … ] ≡ E[ Σ_{i=0}^{∞} γ^i r_{t+i} ]
  Q(s, a) ≡ E[ r(s, a) + γ V*(δ(s, a)) ]
Nondeterministic case

Q learning generalizes to nondeterministic worlds.
Alter the training rule to
  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a′} Q̂_{n−1}(s′, a′) ]
where
  α_n = 1 / (1 + visits_n(s, a))
Convergence of Q̂ to Q can be proved [Watkins and Dayan, 1992]
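A minimal Python sketch of this update with the visit-count step size; the table layout and names are illustrative.

from collections import defaultdict

Q = defaultdict(float)        # action-value table
visits = defaultdict(int)     # visits_n(s, a), used to decay the step size

def nondeterministic_q_update(s, a, r, s_next, next_actions, gamma=0.9):
    """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q_{n-1}(s',a')],
    with alpha_n = 1 / (1 + visits_n(s,a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)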
Temporal Difference Learning

Temporal Difference (TD) learning methods
• can be used when an accurate model of the environment is unavailable – neither the state transition function nor the reward function is known
• can be extended to work with implicit representations of action-value functions
• are among the most useful reinforcement learning methods
Example – TD-Gammon

• Learn to play backgammon (Tesauro, 1995)
• Immediate reward:
  +100 if win
  −100 if lose
  0 for all other states
• Trained by playing 1.5 million games against itself
• Now comparable to the best human player
Temporal difference learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:
  Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)
Why not two steps?
  Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)
Or n?
  Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^(n−1) r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:
  Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + … ]
Temporal difference learning

  Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + … ]

Equivalent expression:
  Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]

• The TD(λ) algorithm uses the above training rule
• Sometimes converges faster than Q learning
• Converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm
Handling Large State Spaces

• Replace the Q̂ table with a neural net or other function approximator
• Virtually any function approximator will work, provided it can be updated in an online fashion
Learning state-action values

• Training examples of the form:
  〈 description of (s_t, a_t), v_t 〉
• The general gradient-descent rule:
  θ_{t+1} = θ_t + α [ v_t − Q_t(s_t, a_t) ] ∇_θ Q(s_t, a_t)
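A minimal Python sketch of this rule for a linear function approximator Q(s, a) = θ · φ(s, a), whose gradient with respect to θ is simply the feature vector φ(s, a); the feature vector and target value below are made up for illustration.

import numpy as np

def linear_q(theta, phi):
    """Q_t(s,a) = theta . phi(s,a) for a linear function approximator."""
    return float(np.dot(theta, phi))

def gradient_update(theta, phi, v_target, alpha=0.1):
    """theta <- theta + alpha * [v_t - Q_t(s_t,a_t)] * grad_theta Q(s_t,a_t)."""
    return theta + alpha * (v_target - linear_q(theta, phi)) * phi

# Illustrative usage with made-up features and a made-up training target
theta = np.zeros(4)
phi_sa = np.array([1.0, 0.0, 0.5, 0.2])    # phi(s_t, a_t)
theta = gradient_update(theta, phi_sa, v_target=1.0)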
Linear Gradient Descent Watkins’ Q(λ)