
ECE-517: Reinforcement Learning
in Artificial Intelligence
Lecture 6: Optimality Criterion in MDPs
September 8, 2011
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2011
Outline
Optimal value functions (cont.)
Implementation considerations
Optimality and approximation
Recap on Value Functions
We define the state-value function for policy π as
    V^π(s) ≜ E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]
Similarly, we define the action-value function for policy π as
    Q^π(s,a) ≜ E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]
The Bellman equation for V^π is
    V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
The value function V^π(s) is the unique solution to its Bellman equation
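As a hedged illustration of the Bellman equation above, the following sketch computes V^π for a tabular MDP by iterative policy evaluation; the array layout (P[s, a, s'] for transition probabilities, R[s, a, s'] for expected rewards, pi[s, a] for the policy) is an assumption made for the example, not something specified in the lecture.

    import numpy as np

    def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
        # P[s, a, s2]: Pr(s2 | s, a); R[s, a, s2]: expected reward; pi[s, a]: action probabilities
        # (these tabular array shapes are an illustrative assumption)
        n_states, n_actions, _ = P.shape
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                # Bellman equation for V^pi:
                # V(s) = sum_a pi(s,a) sum_s' P^a_{ss'} [ R^a_{ss'} + gamma V(s') ]
                v = sum(pi[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * V))
                        for a in range(n_actions))
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V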
Optimal Value Functions
A policy π is defined to be better than or equal to a policy π′ if its
expected return is greater than or equal to that of π′ for all states, i.e.
    π ≥ π′  iff  V^π(s) ≥ V^π′(s)  for all s ∈ S
There is always at least one policy (a.k.a. an optimal policy) that is
better than or equal to all other policies; its state-value function is
    V*(s) = max_π V^π(s)  for all s ∈ S
Optimal policies also share the same optimal action-value function,
defined as
    Q*(s,a) = max_π Q^π(s,a)  for all s ∈ S, a ∈ A(s)
Optimal Value Functions (cont.)
The latter gives the expected return for taking action a in state s and
thereafter following an optimal policy
Thus, we can write Q* in terms of V*:
    Q*(s,a) ≜ E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
Since V*(s) is the value function for a policy, it must satisfy the Bellman
equation; because it is the optimal value function, it can be written
without reference to any specific policy:
    V*(s) = max_a E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
This is called the Bellman optimality equation
Intuitively, the Bellman optimality equation expresses the fact that the
value of a state under an optimal policy must equal the expected return
for the best action from that state
Optimal Value Functions (cont.)
Expanding the expectation over the one-step dynamics gives the Bellman
optimality equation for V* in terms of the environment model:
    V*(s) = max_{a ∈ A(s)} Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ]
Optimal Value Functions (cont.)
The Bellman optimality equation for Q* is
    Q*(s,a) = E[ r_{t+1} + γ max_{a'} Q*(s_{t+1},a') | s_t = s, a_t = a ]
            = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s',a') ]
Backup diagrams: arcs have been added at the agent's choice points to
represent that the maximum over that choice is taken, rather than the
expected value (given some policy)
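As a hedged sketch of how this backup can be applied in practice, the update below performs one synchronous sweep of the Q* backup over an assumed tabular model (the array shapes P[s, a, s'] and R[s, a, s'] are illustrative); iterating the sweep to a fixed point yields Q*.

    import numpy as np

    def q_optimality_backup(P, R, Q, gamma=0.9):
        # One synchronous sweep of the Bellman optimality backup for Q:
        #   Q(s,a) <- sum_s' P^a_{ss'} [ R^a_{ss'} + gamma * max_a' Q(s',a') ]
        V_greedy = Q.max(axis=1)                  # max_a' Q(s', a') for every s'
        return np.einsum('sat,sat->sa', P, R + gamma * V_greedy[None, None, :])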
Optimal Value Functions (cont.)
For finite MDPs, the Bellman optimality equation has a unique solution,
independent of the policy
 - The Bellman optimality equation is actually a system of equations,
   one for each state: N equations in N variables, V*(s)
 - This assumes you know the dynamics of the environment
Once one has V*(s), it is relatively easy to determine an optimal policy
... (see the sketch after this list)
 - For each state there will be one or more actions for which the
   maximum is obtained in the Bellman optimality equation
 - Any policy that assigns nonzero probability only to these actions is
   an optimal policy
 - This translates to a one-step search, i.e. greedy decisions will be
   optimal
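To make the one-step (greedy) search concrete, here is a minimal value-iteration sketch that solves the system of Bellman optimality equations by successive approximation and then reads off a greedy, hence optimal, policy; the tabular model arrays P[s, a, s'] and R[s, a, s'] are illustrative assumptions, and, as noted above, this requires knowing the dynamics of the environment.

    import numpy as np

    def value_iteration(P, R, gamma=0.9, theta=1e-8):
        # Successive approximation of V*: one equation (and one unknown) per state.
        n_states, n_actions, _ = P.shape
        V = np.zeros(n_states)
        while True:
            # Q(s,a) = sum_s' P^a_{ss'} [ R^a_{ss'} + gamma V(s') ]
            Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
            V_new = Q.max(axis=1)                 # V(s) <- max_a Q(s,a)
            if np.max(np.abs(V_new - V)) < theta:
                # Greedy one-step search: any action attaining the max is optimal
                return V_new, Q.argmax(axis=1)
            V = V_new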
Optimal Value Functions (cont.)
With Q*, the agent does not even have to do a one-step-ahead search
 - For any state s, the agent can simply find any action that maximizes
   Q*(s,a)
The action-value function effectively embeds the results of all
one-step-ahead searches
 - It provides the optimal expected long-term return as a value that is
   locally and immediately available for each state-action pair
 - The agent does not need to know anything about the dynamics of the
   environment (see the sketch below)
Q: What are the implementation tradeoffs here?
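A minimal sketch of the point above, assuming Q* is stored as a |S| x |A| table: action selection is a local argmax and consults no model of the dynamics. One part of the tradeoff is storage, since a Q table has |S| x |A| entries versus |S| for V*, while acting from V* alone requires a one-step lookahead through a model.

    import numpy as np

    def act_greedily(Q, s):
        # Q is assumed to be a |S| x |A| array of optimal action values.
        # No transition probabilities or rewards are needed at decision time.
        return int(np.argmax(Q[s]))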
Implementation Considerations
Computational Complexity
 - How complex is it to evaluate the state-value and action-value
   functions?
   - In software
   - In hardware
Data flow constraints
 - Which part of the data needs to be globally vs. locally available?
 - Impact of memory bandwidth limitations (a rough sizing sketch
   follows below)
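A rough, back-of-the-envelope sketch of the storage side of these considerations, assuming dense double-precision tables; the numbers are illustrative, not figures from the lecture.

    def table_bytes(n_states, n_actions, bytes_per_entry=8):
        # Dense tabular storage: V*(s) needs |S| entries, Q*(s,a) needs |S| x |A|,
        # and a full transition model P(s'|s,a) needs |S|^2 x |A|.
        return {
            'V table': n_states * bytes_per_entry,
            'Q table': n_states * n_actions * bytes_per_entry,
            'P model': n_states * n_states * n_actions * bytes_per_entry,
        }

    # Example: table_bytes(10**6, 10) gives roughly 8 MB for V, 80 MB for Q,
    # and 80 TB for a dense model -- which is why data flow and memory matter.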
Recycling Robot revisited
A transition graph is a useful way to summarize the dynamics of a finite
MDP
 - A state node for each possible state
 - An action node for each possible state-action pair
[Figure: transition graph for the recycling robot MDP]
Bellman Optimality Equations for the Recycling Robot
To make things more compact, we abbreviate the states high and low, and
the actions search, wait, and recharge respectively by h, l, s, w, and re.
With α and β denoting the probabilities that the battery stays high or
stays low while searching, and r_s and r_w the expected search and wait
rewards, the Bellman optimality equations become
    V*(h) = max{ r_s + γ[ α V*(h) + (1−α) V*(l) ],
                 r_w + γ V*(h) }
    V*(l) = max{ β r_s − 3(1−β) + γ[ (1−β) V*(h) + β V*(l) ],
                 r_w + γ V*(l),
                 γ V*(h) }
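A hedged numerical sketch of solving the two equations above by successive approximation; the parameter values (alpha, beta, r_s, r_w, gamma) below are illustrative assumptions, not values given in the lecture.

    # Illustrative parameters (assumptions, not from the lecture)
    alpha, beta = 0.8, 0.6    # Pr(battery stays high / stays low while searching)
    r_s, r_w = 2.0, 1.0       # expected search / wait rewards
    gamma = 0.9

    V = {'h': 0.0, 'l': 0.0}
    for _ in range(1000):     # successive approximation of the two equations
        V = {
            'h': max(r_s + gamma * (alpha * V['h'] + (1 - alpha) * V['l']),
                     r_w + gamma * V['h']),
            'l': max(beta * r_s - 3 * (1 - beta)
                         + gamma * ((1 - beta) * V['h'] + beta * V['l']),
                     r_w + gamma * V['l'],
                     gamma * V['h']),
        }
    print(V)                  # approximates V*(h) and V*(l)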
Optimality and Approximation
Clearly, an agent that learns an optimal policy has done very well, but in
practice this rarely happens
 - Doing so usually involves a heavy computational load
Typically, agents compute approximations to the optimal policy
A critical aspect of the problem facing the agent is always the
computational resources available to it
 - In particular, the amount of computation it can perform in a single
   time step
Practical considerations are thus:
 - Computational complexity
 - Memory available
   - Tabular methods apply only to small state sets
 - Communication overhead (for distributed implementations)
 - Hardware vs. software
Are approximations good or bad?
RL typically relies on approximation mechanisms (see later)
This could be an opportunity
 - Efficient “feature-extraction” types of approximation may actually
   reduce “noise”
 - They make it practical for us to address large-scale problems
In general, making “bad” decisions in RL results in learning
opportunities (online)
The online nature of RL encourages learning more effectively from events
that occur frequently
 - This is supported in nature
Capturing regularities is a key property of RL