CS-E4800 Artificial Intelligence
Jussi Rintanen
Department of Computer Science
Aalto University
February 23, 2017

What difference does observability make?
[Figure: a grid environment observed by Camera A and Camera B, with a goal location.]
Partial Observability
Observability
Deterministic observations partition the state space
Not all features of the state can be observed
⇒ Cannot determine current state unambiguously
⇒ Many possible current states
Sensors incomplete, imprecise, not enough sensors
Games:
Mastermind (1970s code-breaking game)
Card games: Poker, Bridge
Memory (= remembering the observation history) becomes necessary (think about Chess vs. Poker)
The belief space
Belief State
without probabilities: set of possible current states
with probabilities: probability distribution over states (discussed later)

Example
belief state of size 1: state known unambiguously
small belief state → state known more accurately
large belief state → state known very incompletely
observations can reduce the size of the belief state
deterministic actions can reduce it
nondeterministic actions can increase or reduce it
Example: belief state after WWW, WWWWWW, WWWWWWNNN, WWWWWWNNNNNNNE
Robot without sensors, in a 7 × 8 room, trying to exit through the door.
Position unknown, orientation known.
Actions: North, South, East, West
Next slides: one possible location of the robot (•) and the change in the
belief state at every execution step.
[Figures: the belief state after each of the four action sequences above; the
door is marked in each figure.]
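Below is a minimal Python sketch of this kind of belief tracking, assuming a bare 7 × 8 grid with coordinates (x, y) and walls on all sides; the exact room layout and the door are not modelled, so the details are placeholders rather than the slides' example.

# Minimal sketch: tracking a sensorless robot's belief state under
# deterministic moves in a 7 x 8 room (layout and door are placeholders).
WIDTH, HEIGHT = 7, 8
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def step(cell, action):
    """Move one cell in the given direction; bump into the wall and stay put."""
    dx, dy = MOVES[action]
    x, y = cell[0] + dx, cell[1] + dy
    return (x, y) if 0 <= x < WIDTH and 0 <= y < HEIGHT else cell

def update_belief(belief, action):
    """Image of a belief state under a deterministic action."""
    return {step(cell, action) for cell in belief}

# Initially the position is completely unknown: all 56 cells are possible.
belief = {(x, y) for x in range(WIDTH) for y in range(HEIGHT)}
for a in "WWWWWW":          # six moves West squeeze the belief against the wall
    belief = update_belief(belief, a)
print(len(belief))          # only the leftmost column remains possible: 8 cells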
The belief space

Example
Belief space over state space {00, 01, 10, 11}

state variables    states     belief states
n                  2^n        2^(2^n)
n = 2              2^2 = 4    2^4 = 16

red action: complement the 1st bit
blue action: assign a random value to the 2nd bit
[Figure: the belief space over {00, 01, 10, 11} — the nonempty subsets such as
{00}, {01, 10}, {00, 01, 10}, and {00, 01, 10, 11} — with the transitions
induced by the red and blue actions.]
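A small sketch of the two actions as maps on belief states (sets of 2-bit states); the tuple encoding of states is an assumption made here for illustration.

# Sketch of the two actions on the 2-bit state space {00, 01, 10, 11}:
# the deterministic "red" action maps a belief state to one of the same size,
# the nondeterministic "blue" action can enlarge it.
def red(belief):
    """Complement the 1st bit (deterministic)."""
    return {(1 - b1, b2) for (b1, b2) in belief}

def blue(belief):
    """Assign a random value to the 2nd bit (nondeterministic):
    both outcomes are possible, so both appear in the successor belief."""
    return {(b1, v) for (b1, _) in belief for v in (0, 1)}

print(red({(0, 0), (0, 1)}))   # {(1, 0), (1, 1)} -- same size
print(blue({(0, 0)}))          # {(0, 0), (0, 1)} -- belief state grows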
Algorithms
Conditional Plans and AND-OR Trees
Without observations, plans are sequences of actions
from the initial belief state to a belief state included in
the goals
Belief states represented as formulas, BDDs, ...
Standard search algorithms (A∗, Greedy BFS, ...)
Belief space very large; limits applicability of these
algorithms with belief states
(Special case: deterministic actions, one initial state.
This is the case covered in the first lectures!)
AND-OR Trees and Conditional Plans
Possible behaviors under nondeterminism and sensing/observations
AND-node: uncontrollable events (nondeterminism, observation)
OR-node: controllable events (actions)
A conditional plan represents one possible action at each (reachable) OR-node
AND-OR tree ∼ game tree (next lecture)
Example plan: action1 ; IF obs1 THEN (action2 ; ...) ELSE (action1 ; ...)
[Figure: an AND-OR tree with alternating OR-nodes (∨, choice between action1
and action2) and AND-nodes (∧, branching on obs1 and obs2), leading to further
actions act1 and act2.]

Mastermind
Code: sequence of 4 colors
The code is hidden from the player
Player tries to guess the code
Response to each guess: 0 to 4 of
• correct color, correct position
◦ correct color, wrong position
Solution by AND-OR search in discrete belief space (a sketch follows below)
Critical: heuristics for choosing the guesses
Can always be solved with at most 5 moves (but finding such a strategy is difficult)
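The following is a minimal sketch of AND-OR search over discrete belief states, not the exact algorithm from the lecture: belief states are assumed to be hashable (e.g. frozensets of states), and succ, split, and goal_test are hypothetical problem-specific functions (for Mastermind, succ would apply a guess and split would partition the resulting belief by the possible •/◦ answers).

# A minimal AND-OR search sketch over discrete belief states.
# succ(belief, action): belief reached by applying the action (before sensing)
# split(belief): pairs (observation, sub-belief) partitioning the belief
# goal_test(belief): True iff every state still possible is a goal state
def and_or_search(belief, actions, succ, split, goal_test, visited=frozenset()):
    """Return a conditional plan (nested dict) or None if none exists."""
    if goal_test(belief):
        return {}                                  # empty plan: already done
    if belief in visited:
        return None                                # avoid cycles
    for a in actions:                              # OR node: choose an action
        subplans = {}
        for obs, b in split(succ(belief, a)):      # AND node: every observation
            plan = and_or_search(b, actions, succ, split, goal_test,
                                 visited | {belief})
            if plan is None:
                break                              # some observation unsolved
            subplans[obs] = plan
        else:
            return {"action": a, "branches": subplans}
    return None

This depth-first version explores the belief space exhaustively; as noted above, good heuristics for choosing the guesses are what make it practical.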
Expectimax Trees
decision/action node: maximum, $\max_{c \in \text{children}} v(c)$
chance/observation node: expected value, $\sum_{c \in \text{children}} P(c)\,v(c)$
solution tree = prune all but the best child of each action node
(an evaluation sketch follows after the example)

Example
survey action: cost 1
block: cost 5
value of oil is 20
probability of oil 0.1 (no oil: 0.9)
[Figure: expectimax tree for this example, with decision nodes Survey? and
Buy?, chance nodes Oil?, leaf values such as +14, −1, −6, −5, +15 and 0, and a
computed value of +1.31 at the root.]
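A minimal sketch of expectimax evaluation on a hand-built tree; the tuple encoding of nodes is an assumption made for illustration, and the tiny example at the end only uses the buy-without-survey part of the slide's parameters.

# Expectimax evaluation of a tree given as nested tuples (a sketch):
#   ("max", [child, ...])           decision/action node
#   ("exp", [(prob, child), ...])   chance/observation node
#   value                           leaf (a number)
def expectimax(node):
    if isinstance(node, (int, float)):
        return node
    kind, children = node
    if kind == "max":
        return max(expectimax(c) for c in children)
    return sum(p * expectimax(c) for p, c in children)

# Tiny illustration: buy the block (cost 5, oil worth 20, oil probability 0.1)
# without surveying, or do nothing.
tree = ("max", [("exp", [(0.1, 20 - 5), (0.9, -5)]), 0])
print(expectimax(tree))   # buying has expected value -3, so the maximum is 0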
Partial Observability and Sensing
Observations + Discrete Belief States
Deterministic observations correspond to sets of states
(those states in which the observation can be made).
Most applications include sensing / observations
Sensing (observations) allows reducing size of
belief state
Belief State update with an Observation
An observation O ⊆ S updates a belief state B ⊆ S to
B′ = B ∩ O
sensing → alternative belief states
In many applications, probabilities of states are critical
⇒ extensions of probabilistic methods like MDPs
Example
For belief B = { Sunday, Monday, Tuesday } the
observation “weekend” O = { Saturday, Sunday }
yields
B′ = B ∩ O = {Sunday}
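In code, this update is literally a set intersection; a one-line sketch with the example above:

# Belief state update with a deterministic observation is set intersection.
B = {"Sunday", "Monday", "Tuesday"}
O = {"Saturday", "Sunday"}          # observation "weekend"
print(B & O)                        # {'Sunday'}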
Observations + Probabilistic Belief States
Instead of a set of possible states, a belief is a probability distribution
B(S) over the states.

Belief State update with an Observation
An observation O updates B(S) to B′(S) according to Bayes' rule:
$$B'(s) = P(s \mid O) = \frac{P(O \mid s)\,B(s)}{\sum_{s_2 \in S} P(O \mid s_2)\,B(s_2)} \qquad (1)$$
The denominator represents, for belief state B and before any observations are
made, the probability that observation O will be made.

Example
B(s0) = 0.01, B(s1) = 0.99, P(O | s0) = 0.1, P(O | s1) = 0.0
$$P(s_0 \mid O) = \frac{0.1 \cdot 0.01}{0.1 \cdot 0.01 + 0.0 \cdot 0.99} = 1.0$$
$$P(s_1 \mid O) = \frac{0.0 \cdot 0.99}{0.1 \cdot 0.01 + 0.0 \cdot 0.99} = 0.0$$
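A small sketch of this update with the belief and the likelihoods stored in dictionaries (the representation is a choice made here, not from the slides); it reproduces the example above.

# Belief update by Bayes' rule (equation (1)).
def bayes_update(belief, likelihood):
    """belief: dict state -> B(s); likelihood: dict state -> P(O|s)."""
    denom = sum(likelihood[s] * belief[s] for s in belief)   # P(O) under B
    return {s: likelihood[s] * belief[s] / denom for s in belief}

B = {"s0": 0.01, "s1": 0.99}
P_O = {"s0": 0.1, "s1": 0.0}
print(bayes_update(B, P_O))   # {'s0': 1.0, 's1': 0.0}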
Partially Observable MDPs (POMDP)

Definition (POMDP ⟨S, A, P, R, E, O⟩)
S is a (finite) set of states
A is a (finite) set of actions
P : S × A × S → ℝ gives transition probabilities
R : S × A × S → ℝ is a reward function
E is a (finite) set of observations (evidence)
O : A × S × E → ℝ gives observation probabilities

Belief Update for POMDPs
Belief update: action a ∈ A
$$B'(s') = \sum_{s \in S} P(s, a, s')\,B(s)$$
Belief update: action a ∈ A, observation e ∈ E
$$B'(s') = \frac{O(a, s', e) \sum_{s \in S} P(s, a, s')\,B(s)}{\sum_{s_2 \in S} O(a, s_2, e) \sum_{s_1 \in S} P(s_1, a, s_2)\,B(s_1)}$$
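A sketch of these two updates with P and O stored as nested dictionaries indexed as P[s][a][s'] and O[a][s'][e]; this representation is an assumption made here for illustration, and B is assumed to have an entry for every state.

# POMDP belief update for action a and observation e, following the two
# formulas above.
def predict(B, P, a):
    """B'(s') = sum_s P(s, a, s') B(s): belief after action a, before sensing."""
    return {s2: sum(P[s][a][s2] * B[s] for s in B) for s2 in B}

def update(B, P, O, a, e):
    """Condition the predicted belief on observation e and renormalize."""
    pred = predict(B, P, a)
    unnormalized = {s2: O[a][s2][e] * pred[s2] for s2 in pred}
    total = sum(unnormalized.values())
    return {s2: v / total for s2, v in unnormalized.items()}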
Value Iteration for POMDPs
Problem: the belief space is infinite ⇒ the size of the value function
representation is unbounded
Solution:
Iteration i: generate conditional plans π of depth i
Evaluate the value of each plan π for every state: (v_π(s1), ..., v_π(sn))
The collection of all these vectors (v_π(s1), ..., v_π(sn)) represents the
value function for all possible belief states: the maximum value obtainable
for a given belief state by one of the conditional plans

Example
Two states s0 and s1
Two actions Stay and Go
Two observations s0 and s1
R(s0) = R(s0, ·, ·) = 0 and R(s1) = R(s1, ·, ·) = 1
P(si, Stay, si) = 0.9 and P(si, Stay, s1−i) = 0.1
P(si, Go, s1−i) = 0.9 and P(si, Go, si) = 0.1
O(·, si, si) = 0.6 and O(·, si, s1−i) = 0.4

v_Go(s0)   = R(s0) + γ(0.9 R(s1) + 0.1 R(s0)) = 0.9γ
v_Go(s1)   = R(s1) + γ(0.9 R(s0) + 0.1 R(s1)) = 1.0 + 0.1γ
v_Stay(s0) = R(s0) + γ(0.9 R(s0) + 0.1 R(s1)) = 0.1γ
v_Stay(s1) = R(s1) + γ(0.9 R(s1) + 0.1 R(s0)) = 1.0 + 0.9γ
(Reward from the starting state + reward from the successor state!)
With γ = 0.9 these values under Stay and Go are:
v_Go(s0)   = 0.81
v_Go(s1)   = 1.09
v_Stay(s0) = 0.09
v_Stay(s1) = 1.81

Values of states under 1-step plans
[Figure: the value vectors of the 1-step plans Stay and Go plotted as lines
over the probability of s1 (from 0.0 to 1.0).]

Values under optimal 1-step plans
For every belief state (probability distribution), use the plan that has the
highest value (see the sketch below).
[Figure: the value of each belief state under the best 1-step plan, as a
function of the probability of s1 — the upper envelope of the Stay and Go lines.]
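The value vectors and the resulting upper envelope can be computed directly from the example's parameters; the dictionary encoding below is a choice made for illustration, and the printed numbers match the slide values up to floating-point rounding.

# Values of the 1-step plans Stay and Go in the two-state example, and the
# resulting piecewise-linear value function over belief states (gamma = 0.9).
gamma = 0.9
R = {"s0": 0.0, "s1": 1.0}
P = {"Stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.1, "s1": 0.9}},
     "Go":   {"s0": {"s0": 0.1, "s1": 0.9}, "s1": {"s0": 0.9, "s1": 0.1}}}

def plan_value(a, s):
    """Reward of the starting state plus discounted expected successor reward."""
    return R[s] + gamma * sum(p * R[s2] for s2, p in P[a][s].items())

vectors = {a: (plan_value(a, "s0"), plan_value(a, "s1")) for a in ("Stay", "Go")}
print(vectors)   # approx. {'Stay': (0.09, 1.81), 'Go': (0.81, 1.09)}

def value(p1):
    """Value of the belief state (1 - p1, p1): best 1-step plan for it."""
    return max((1 - p1) * v0 + p1 * v1 for v0, v1 in vectors.values())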
Values under optimal plans
Next iterations generate 2-, 3-, 4-, ... step plans.
Dominated plans (not best anywhere) are ignored.
The optimal plan has a piecewise-linear value function consisting of the
non-dominated segments (the top three in the diagram).
[Figure: plan values plotted over the probability of s1 (from 0.0 to 1.0);
the upper envelope of the non-dominated segments is the optimal value function.]

POMDP Value Iteration Algorithm
1. i := 1, U′ := ∅
2. U := U′
3. U_0 := all policies of depth i
4. U′ := policies in U_0 that are not dominated in U_0
5. i := i + 1
6. If the improvement from U to U′ is “small”, stop; otherwise go to 2.
Dominated policy
π is dominated in U_0 if for every belief state B there is π′ ∈ U_0 such that
v_π′(B) > v_π(B), where
$$v_\pi(B) = \sum_{s \in S} B(s) \cdot v_\pi(s)$$
Domination test
The following system has a solution iff (w_1, ..., w_n) is not dominated in
{(w_1^1, ..., w_n^1), ..., (w_1^N, ..., w_n^N)}.
(The vectors are the values of π and of the members of U_0 in all states;
p_1, ..., p_n represent a belief state.)
$$p_1 + \cdots + p_n = 1, \qquad p_1 \ge 0, \;\ldots,\; p_n \ge 0$$
$$w_1 p_1 + \cdots + w_n p_n \ge w_1^1 p_1 + \cdots + w_n^1 p_n$$
$$\vdots$$
$$w_1 p_1 + \cdots + w_n p_n \ge w_1^N p_1 + \cdots + w_n^N p_n$$
A feasibility check for this system is sketched below.
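The sketch below checks feasibility with scipy.optimize.linprog; the use of scipy and the array encoding are choices made here, not part of the lecture, and it tests the non-strict inequalities exactly as written above.

# Feasibility test for the linear program on this slide.
# w: value vector of pi; W: value vectors of the other policies, one per row.
import numpy as np
from scipy.optimize import linprog

def not_dominated(w, W):
    """True iff some belief state p satisfies w . p >= w_k . p for every k."""
    w, W = np.asarray(w, float), np.asarray(W, float)
    n = len(w)
    res = linprog(c=np.zeros(n),              # pure feasibility problem
                  A_ub=W - w,                 # (w_k - w) . p <= 0 for all k
                  b_ub=np.zeros(len(W)),
                  A_eq=np.ones((1, n)),       # p_1 + ... + p_n = 1
                  b_eq=[1.0],
                  bounds=[(0, None)] * n)     # p_i >= 0
    return res.success

print(not_dominated([0.09, 1.81], [[0.81, 1.09]]))   # True: Stay is best somewhere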
Execution of a Policy
1. The belief state is initially some B = (p_1, ..., p_n)
2. Find the best plan π for belief state B from the value function
3. Execute the first action a of π (the rest of π is ignored!)
4. Get observation e
5. Update B to B′ according to a and e
6. Go to 2. (This loop is sketched in code below.)
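A compact sketch of this execution loop; best_action, execute, and update are hypothetical placeholders for the value-function lookup, the environment, and the POMDP belief update defined earlier.

# Receding-horizon execution: only the first action of the best plan is used,
# then the belief is updated and a new best plan is chosen.
def run_policy(B, best_action, execute, update, steps=100):
    for _ in range(steps):           # a real agent would stop at a goal/horizon
        a = best_action(B)           # steps 2-3: first action of the best plan
        e = execute(a)               # step 4: act in the world, observe e
        B = update(B, a, e)          # step 5: belief update as defined earlier
    return B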
Bounded-Memory Policies for POMDPs
Bounded-memory policies use memory/belief states M = {m_1, ..., m_N} instead
of the infinite and continuous space of probability distributions over S.
Algorithms: search in the (finite) space of bounded-memory policies.

Auxiliary concept: Markov Chains
Definition (Markov chain ⟨S, P, R⟩)
S is a (finite) set of states
P : S × S → ℝ gives transition probabilities
R : S × S → ℝ is a reward function
The value function of a Markov chain can be evaluated by solving a set of
linear equations.
(This is what we did when evaluating a policy in Policy Iteration: the Markov
chain ⟨S, P, R⟩ with P(s, s′) = P_0(s, π(s), s′) and R(s, s′) = R_0(s, π(s), s′)
was obtained from the MDP ⟨S, A, P_0, R_0⟩ and a policy π.)
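Using a discount factor γ as in the earlier example, evaluating a Markov chain amounts to solving the linear system v = r + γPv; a minimal sketch with numpy (the matrix encoding and the discounted criterion are choices made here):

# Evaluating a Markov chain <S, P, R>: r(s) = sum_s' P(s, s') R(s, s') is the
# expected one-step reward, and v solves (I - gamma P) v = r.
import numpy as np

def markov_chain_value(P, R, gamma=0.9):
    """P, R: n x n arrays of transition probabilities and rewards."""
    P, R = np.asarray(P, float), np.asarray(R, float)
    r = (P * R).sum(axis=1)                    # expected one-step reward
    n = len(r)
    return np.linalg.solve(np.eye(n) - gamma * P, r)

# The chain induced by the policy "always Stay" in the two-state example
# (reward 1 whenever the starting state is s1).
P = [[0.9, 0.1], [0.1, 0.9]]
R = [[0.0, 0.0], [1.0, 1.0]]
print(markov_chain_value(P, R))   # long-run discounted values of s0 and s1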
Bounded-Memory Policies for POMDPs

Definition
A bounded-memory policy with memory states M, for a POMDP ⟨S, A, P, R, E, O⟩,
is a mapping π : M × E → A × M.
The execution of π is a Markov chain ⟨S_x, P_x, R_x⟩ with
S_x = S × M × E (state, memory, last observation)
P_x((s, m, e), (s′, m′, e′)) = P(s, a, s′) O(a, s′, e′) where π(m, e) = (a, m′)
R_x((s, m, e), (s′, m′, e′)) = R(s, a, s′) where π(m, e) = (a, m′)
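A sketch of this construction with the same nested-dictionary encoding as in the earlier belief-update sketch; π is assumed to be a dictionary mapping (m, e) to (a, m′), an encoding chosen here for illustration.

# Building the execution Markov chain of a bounded-memory policy pi,
# following the definition above; transitions to a memory state other than
# the one prescribed by pi get probability 0.
from itertools import product

def execution_chain(S, M, E, P, O, R, pi):
    Sx = list(product(S, M, E))                      # (state, memory, last obs)
    Px, Rx = {}, {}
    for (s, m, e), (s2, m2, e2) in product(Sx, Sx):
        a, m_next = pi[(m, e)]
        Px[(s, m, e), (s2, m2, e2)] = (
            P[s][a][s2] * O[a][s2][e2] if m2 == m_next else 0.0)
        Rx[(s, m, e), (s2, m2, e2)] = R[s][a][s2]
    return Sx, Px, Rx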
Bounded-Memory Policies for POMDPs
General scheme for finding bounded-memory policies:
1. Generate policies π
2. Evaluate the Markov chain corresponding to each π
3. Choose the best policy
For example, choose the policy π and initial memory state m that maximize
$$\sum_{s \in S} v_\pi(s, m, \mathit{obs}(s))\,B(s)$$
where B represents the initial belief state (unrelated to the initial memory
state m) and obs(s) is the observation made in the initial state s.
Execution of a Bounded-Memory Policy
1. The memory state is initially some m ∈ M; the initial observation is
   e = obs(s) for the initial state s
2. (a, m′) := π(m, e)
3. Execute a, obtaining a new observation e′
4. Update the memory state m := m′ and e := e′
5. Go to 2.
Special Case: Memory-Free Policies
Memory-free policy = Bounded-memory policy
with 1 memory state
Policies are mappings from observations to actions