ECE-517: Reinforcement Learning in
Artificial Intelligence
Lecture 15: Partially Observable Markov Decision
Processes (POMDPs)
October 27, 2011
Dr. Itamar Arel
College of Engineering
Electrical Engineering and Computer Science Department
The University of Tennessee
Fall 2011
Outline
Why use POMDPs?
Formal definition
Belief state
Value function
Partially Observable Markov Decision Problems (POMDPs)
To introduce POMDPs, let us consider an example in which an
agent learns to drive a car in New York City
The agent can look forward, backward, left or right
It can’t change speed but it can steer into the lane it is
looking at
The different types of observations are
the direction in which the agent's gaze is directed
the closest object in the agent's gaze
whether the object is looming or receding
the color of the object
whether a horn is sounding
To drive safely the agent must steer out of its lane to
avoid slow cars ahead and fast cars behind
POMDP Example
The agent is in control of the
middle car
The car behind is fast and will
not slow down
The car ahead is slower
To avoid a crash, the agent
must steer right
However, when the agent is
gazing to the right, there is no
immediate observation that
tells it about the impending
crash
POMDP Example (cont.)
This is not easy when the agent
has no explicit goals beyond
"performing well"
There are no explicit training
patterns such as "if there is a car
ahead and to the left, steer right"
However, a scalar reward is
provided to the agent as a
performance indicator (just like
MDPs)
The agent is penalized for
colliding with other cars or the
road shoulder
The only goal hard-wired into
the agent is that it must
maximize a long-term measure
of the reward
POMDP Example (cont.)
Two significant problems make it difficult to learn under
these conditions
Temporal credit assignment –
If our agent hits another car and is consequently penalized, how
does the agent reason about which sequence of actions should
not be repeated, and in what circumstances?
Generally the same as in MDPs
Partial observability –
If the agent is about to hit the car ahead of it, and there is a
car to the left, then circumstances dictate that the agent
should steer right
However, when it looks to the right it has no sensory
information regarding what goes on elsewhere
To solve the latter, the agent needs memory – it builds
knowledge of the state of the world around it
Forms of Partial Observability
Partial observability coarsely pertains to either
Lack of important state information in the observations – must be
compensated for using memory
Extraneous information in the observations – the agent needs to learn to ignore it
In our example:
The color of the car in its gaze is extraneous (unless red cars really
do drive faster)
It needs to build a memory-based model of the world in order to
accurately predict what will happen
Creates “belief state” information (we’ll see later)
If the agent has access to the complete state, such as a chess
playing machine that can view the entire board:
It can choose optimal actions without memory
The Markov property holds – i.e., the future state of the world is simply a
function of the current state and action
Modeling the world as a POMDP
Our setting is that of an agent taking actions in a world
according to its policy
The agent still receives feedback about its performance
through a scalar reward received at each time step
Formally stated, a POMDP consists of:
|S| states S = {1,2,…,|S|} of the world
|U| actions (or controls) U = {1,2,…, |U|} available
to the policy
|Y| observations Y = {1,2,…,|Y|}
a (possibly stochastic) reward r(i) for each state i
in S
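As a concrete sketch (not the lecture's notation), such a model can be held in a few arrays; the names T, O, R and the discount factor gamma below are my own convention, and the transition and observation probabilities complete the model even though this slide only lists states, actions, observations and rewards:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class POMDPModel:
        # T[a, s, s2]: probability of moving from state s to s2 under action a
        T: np.ndarray
        # O[a, s2, y]: probability of observing y after action a lands in state s2
        O: np.ndarray
        # R[a, s]: expected immediate reward for taking action a in state s
        R: np.ndarray
        # Discount factor (assumed here; the slide does not specify one)
        gamma: float = 0.95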
MDPs vs. POMDPs
In an MDP: one observation for each state
The concepts of observation and state are effectively interchangeable
Memoryless policy that does not make use of internal state
In POMDPs different states may have similar probability
distributions over observations
Different states may look the same to the agent
For this reason, POMDPs are said to have hidden state
Two hallways may look the same for a robot’s sensors
Optimal action for the first: take left
Optimal action for the second: take right
A memoryless policy can’t distinguish between the two
MDPs vs. POMDPs (cont.)
Noise can create ambiguity in state inference
Agent’s sensors are always limited in the amount of
information they can pick up
One way of overcoming this is to add sensors
Specific sensors that help it to “disambiguate” hallways
Only when possible, affordable or desirable
In general, we’re now considering agents that need to be
proactive (also called “anticipatory”)
Not only react to environmental stimuli
Self-create context using memory
POMDP problems are harder to solve, but represent
realistic scenarios
POMDP solution techniques – model based methods
If an exact model of the environment is available, POMDPs
can (in theory) be solved
i.e. an optimal policy can be found
As with model-based MDPs, it's not so much a learning
problem
No real "learning" or trial and error takes place
No exploration/exploitation dilemma
Rather, it is a probabilistic planning problem: find the optimal
policy
In POMDPs the above is broken into two elements
Belief state computation, and
Value function computation based on belief states
The belief state
Instead of maintaining the complete action/observation
history, we maintain a belief state b.
The belief state is a probability distribution over the
states, updated with every action and observation
Dim(b) = |S| - 1, since the probabilities must sum to 1
The belief space is the entire probability simplex over the states
We'll use a two-state POMDP as a running example
Probability of being in state one = p, probability of being in
state two = 1 - p
Therefore, the entire space of belief states can be
represented as a line segment
The belief space
Here is a representation of the belief space
when we have two states (s0, s1): the line segment from b = [1, 0]
(certainly s0) to b = [0, 1] (certainly s1)
The belief space (cont.)
The belief space is continuous, but we only visit a
countable number of belief points
Assumption:
Finite action set
Finite observation set
Next belief state: b' = f(b, a, o), where b is the current
belief state, a the action, and o the observation
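A minimal sketch of the update f(b, a, o) as a Bayes filter, assuming the array layout (T[a, s, s'], O[a, s', o]) from the earlier model sketch:

    import numpy as np

    def belief_update(b, a, o, T, O):
        """Return b' = f(b, a, o), the updated belief state."""
        # Predict: distribution over next states given action a and belief b.
        predicted = b @ T[a]
        # Correct: weight each next state by the likelihood of observation o.
        unnormalized = O[a, :, o] * predicted
        # Normalize so the belief remains a probability distribution.
        return unnormalized / unnormalized.sum()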
The Tiger Problem
• The agent stands in front of two closed doors
• The world is in one of two states: the tiger is behind the left door or behind the right door
• Three actions: open left door, open right door, listen
• Listening is not free, and not accurate (it may give wrong information)
• Reward: open the door hiding the tiger and get eaten (large -r);
  open the other door and get a prize (small +r)
Tiger Problem: POMDP Formulation
Two states: SL and SR (the tiger really is behind the left or the right door)
Three actions: LEFT, RIGHT, LISTEN
Listening does not change the tiger's position
Each episode is a "Reset" – opening a door resets the problem
Transition probabilities (rows: current state, columns: next state):

  Listen      SL    SR
    SL        1.0   0.0
    SR        0.0   1.0

  Left        SL    SR
    SL        0.5   0.5
    SR        0.5   0.5

  Right       SL    SR
    SL        0.5   0.5
    SR        0.5   0.5
Tiger Problem: POMDP Formulation (cont.)
Observations: TL (tiger heard on the left) or TR (tiger heard on the right)
Observation probabilities (rows: next state, columns: observation):

  Listen      TL     TR
    SL        0.85   0.15
    SR        0.15   0.85

  Left        TL     TR
    SL        0.5    0.5
    SR        0.5    0.5

  Right       TL     TR
    SL        0.5    0.5
    SR        0.5    0.5

Rewards:
  R(SL, Listen) = R(SR, Listen) = -1
  R(SL, Left) = R(SR, Right) = -100
  R(SL, Right) = R(SR, Left) = +10
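The tables above translate directly into arrays; a sketch using the layout assumed earlier, with my own (arbitrary) index convention LEFT = 0, RIGHT = 1, LISTEN = 2, SL = 0, SR = 1, TL = 0, TR = 1:

    import numpy as np

    # Transition probabilities T[a, s, s']: opening a door resets the problem,
    # listening leaves the tiger where it is.
    T = np.array([
        [[0.5, 0.5], [0.5, 0.5]],        # LEFT
        [[0.5, 0.5], [0.5, 0.5]],        # RIGHT
        [[1.0, 0.0], [0.0, 1.0]],        # LISTEN
    ])

    # Observation probabilities O[a, s', y]: only listening is informative.
    O = np.array([
        [[0.5, 0.5], [0.5, 0.5]],        # LEFT
        [[0.5, 0.5], [0.5, 0.5]],        # RIGHT
        [[0.85, 0.15], [0.15, 0.85]],    # LISTEN
    ])

    # Rewards R[a, s].
    R = np.array([
        [-100.0,   10.0],                # LEFT: eaten if the tiger is left, prize otherwise
        [  10.0, -100.0],                # RIGHT
        [  -1.0,   -1.0],                # LISTEN
    ])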
POMDP Policy Tree (Fake Policy)
(Diagram: an illustrative policy tree. The starting belief state has tiger-left probability 0.3 and the agent Listens. Hearing "tiger roar left" leads to a new belief state (0.6) and another Listen; a further "roar left" leads to a belief state of 0.9. Hearing "tiger roar right" leads to a belief state of 0.15, where the agent opens the left door.)
POMDP Policy Tree (cont.)
(Diagram: a generic policy tree. A root action A1 branches on observations o1, o2, o3 to actions A2, A3, A4; each of those branches on further observations o4, o5, o6 to actions A5 through A8, and so on.)
How many POMDP policies are possible?
(Diagram: the same policy tree with its levels counted: 1 node at the root, |O| nodes at the next level, |O|^2 at the level after that, and so on.)
How many policy trees are there, with |A| actions, |O| observations and horizon T?
• Nodes in one tree:  N = Σ_{i=0}^{T-1} |O|^i = (|O|^T - 1) / (|O| - 1)
• Number of trees:  |A|^N
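For intuition on how quickly this grows, a small sketch (using the Tiger problem with a horizon of T = 3 as my own example):

    def num_policy_trees(num_actions, num_obs, horizon):
        """Nodes per policy tree and the number of distinct trees."""
        nodes = (num_obs**horizon - 1) // (num_obs - 1)   # sum of |O|^i for i = 0..T-1
        return nodes, num_actions**nodes

    # Tiger problem (|A| = 3, |O| = 2) with horizon T = 3:
    # 7 nodes per tree and 3^7 = 2187 distinct policy trees.
    print(num_policy_trees(num_actions=3, num_obs=2, horizon=3))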
Belief State
Overall formula:  b'(s') = Pr(o | s', a) Σ_s Pr(s' | s, a) b(s) / Pr(o | a, b)
The belief state is updated proportionally to:
The prob. of seeing the current observation given state s’,
and to the prob. of arriving at state s’ given the action and
our previous belief state (b)
The above are all given by the model
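For example, applying the belief_update sketch from earlier to the Tiger model arrays defined above (illustrative only):

    import numpy as np

    b = np.array([0.5, 0.5])                    # no idea where the tiger is
    b = belief_update(b, a=2, o=0, T=T, O=O)    # LISTEN, hear the tiger on the left
    print(b)                                    # -> [0.85, 0.15]
    b = belief_update(b, a=2, o=0, T=T, O=O)    # LISTEN again, same observation
    print(b)                                    # -> approximately [0.97, 0.03]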
Belief State (cont.)
Let's look at an example:
Consider a robot that is initially completely uncertain
about its location
Seeing a door may, as specified by the model's P(o | s'),
occur in three different locations
Suppose that the robot takes an action and observes a
T-junction
It may be that, given the action, only one of the three
states could have led to an observation of a T-junction
The agent now knows with
certainty which state it is in
The uncertainty does not always
disappear this neatly
Finding an optimal policy
The policy component of a POMDP agent must map the
current belief state into action
It turns out that the process of maintaining belief
states is a sufficient statistic (i.e. Markovian)
We can’t do better even if we remembered the
entire history of observations and actions
We have now transformed the POMDP into an MDP
Good news: we have ways of solving those (GPI
algorithms)
Bad news: the belief state space is continuous!
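One standard way to write the resulting belief-MDP Bellman equation, with gamma a discount factor and f(b, a, o) the belief update from before (notation assumed here, not taken from the slide):

    V(b) = max_a [ Σ_s b(s) R(s, a)  +  gamma Σ_o Pr(o | b, a) V( f(b, a, o) ) ]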
Value function
The belief state is the input to the second
component of the method: the value function
computation
The belief state is a point in a continuous space
of |S| - 1 dimensions!
The value function must be defined over this
infinite space
A naive application of dynamic programming techniques
is therefore infeasible
Value function (cont.)
• Let’s assume only two states: S1 and S2
• Belief state [0.25 0.75] indicates b(s1) = 0.25, b(s2) = 0.75
• With two states, b(s1) is sufficient to indicate belief
state: b(s2) = 1 – b(s1)
(Plot: the value function V(b) over the belief segment from b = [1, 0] (state S1) through [0.5, 0.5] to b = [0, 1] (state S2); the horizontal axis is the belief state b.)
Piecewise linear and Convex (PWLC)
It turns out that the value function is, or can be accurately
approximated by, a piecewise linear and convex function
Intuition on convexity: being certain of a state yields high
value, whereas uncertainty lowers the value
(Plot: a piecewise linear and convex V(b) over the belief segment from b = [1, 0] (S1) through [0.5, 0.5] to b = [0, 1] (S2).)
Why does PWLC help?
(Plot: three linear value vectors Vp1, Vp2, Vp3 over the belief segment from b = [1, 0] (S1) to b = [0, 1] (S2). Their upper surface forms V(b), and each vector dominates in its own region of belief space: region 1, region 2, region 3.)
• We can directly work with regions (intervals) of belief space!
• The vectors are policies, and indicate the right action to take in
each region of the space
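A sketch of how this representation is used in code: the value at any belief is the maximum of a set of linear "alpha vectors", and the maximizing vector also names the action for that region. The vectors below are only the one-step Tiger rewards, chosen for illustration, not the true optimal vectors:

    import numpy as np

    # Illustrative alpha vectors over the two Tiger states (SL, SR), tagged with actions.
    alpha_vectors = [
        (np.array([-100.0,   10.0]), "open left door"),   # best when b(SR) is high
        (np.array([  10.0, -100.0]), "open right door"),  # best when b(SL) is high
        (np.array([  -1.0,   -1.0]), "listen"),           # dominates in the uncertain middle
    ]

    def value_and_action(b):
        """V(b) = max over alpha vectors of (alpha . b); also return that vector's action."""
        return max(((alpha @ b, action) for alpha, action in alpha_vectors),
                   key=lambda pair: pair[0])

    print(value_and_action(np.array([0.5, 0.5])))    # uncertain -> "listen"
    print(value_and_action(np.array([0.97, 0.03])))  # tiger surely left -> "open right door"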
Summary
POMDPs model realistic scenarios more
accurately
Rely on belief states that are derived from
observations and actions
Can be transformed into an MDP with PWLC for
value function approximation
What if we don't have a model?
Next class: (recurrent) neural networks come to
the rescue …