Utilities and MDP: A Lesson in Multiagent System

Henry Hexmoor
SIUC
Utility
• Preferences are recorded as a utility function
  $u_i : S \to \mathbb{R}$
  where $S$ is the set of observable states of the world, $u_i$ is the utility function of agent $i$, and $\mathbb{R}$ is the set of real numbers.
• States of the world become ordered.
Properties of Utilities
• Reflexive: $u_i(s) \ge u_i(s)$
• Transitive: if $u_i(a) \ge u_i(b)$ and $u_i(b) \ge u_i(c)$, then $u_i(a) \ge u_i(c)$.
• Comparable: for all $a, b$, either $u_i(a) \ge u_i(b)$ or $u_i(b) \ge u_i(a)$.
Selfish agents:
• A rational agent is one that wants to maximize its utilities, but intends no harm.
Figure: diagram of agent classes (agents, selfish agents, rational non-selfish agents, rational agents).
Utility is not money:
• While utility represents an agent's preferences, it is not necessarily equated with money. In fact, the utility of money has been found to be roughly logarithmic.
Marginal Utility
• Marginal utility is the utility gained from the next event.
• Example: getting an A for an A student versus an A for a B student.
Transition function
• The transition function is represented as
  $T(s, a, s')$
• The transition function is defined as the probability of reaching $s'$ from $s$ with action $a$.
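As a minimal sketch (the state and action names below are illustrative assumptions, not taken from the slides), a transition function can be stored as a dictionary mapping a (state, action) pair to a distribution over successor states:

```python
# Sketch of a transition function T(s, a, s') as nested dictionaries.
# States "s1", "s2" and actions "a1", "a2" are made-up examples.
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},  # from s1, action a1
    ("s1", "a2"): {"s1": 1.0},             # deterministic: stays in s1
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}

def transition(s, a, s_next):
    """Return T(s, a, s'): probability of reaching s_next from s with action a."""
    return T.get((s, a), {}).get(s_next, 0.0)

# Each distribution should sum to 1 over the successor states.
assert abs(sum(T[("s1", "a1")].values()) - 1.0) < 1e-9
print(transition("s1", "a1", "s2"))  # 0.8
```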
Expected Utility
• Expected utility is defined as the sum, over successor states, of the product of the probability of reaching $s'$ from $s$ with action $a$ and the utility of that final state:
  $E[u_i, s, a] = \sum_{s' \in S} T(s, a, s')\, u_i(s')$
  where $S$ is the set of all possible states.
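A minimal sketch of this sum, assuming the same dictionary-based transition function as above and a hypothetical utility table:

```python
# Expected utility E[u_i, s, a] = sum over s' of T(s, a, s') * u_i(s').
# The states, actions, and utility values are illustrative assumptions.
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
}
u = {"s1": 0.0, "s2": 10.0}  # hypothetical utility of each state

def expected_utility(s, a, T, u):
    """Sum of transition probability times utility of the resulting state."""
    return sum(p * u[s_next] for s_next, p in T.get((s, a), {}).items())

print(expected_utility("s1", "a1", T, u))  # 0.2*0 + 0.8*10 = 8.0
print(expected_utility("s1", "a2", T, u))  # 1.0*0 = 0.0
```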
Value of Information
• The value of the information that the current state is $t$ and not $s$:
  $E = E[u_i, t, \pi_i(t)] - E[u_i, t, \pi_i(s)]$
  where $E[u_i, t, \pi_i(t)]$ represents the updated value given the new information and $E[u_i, t, \pi_i(s)]$ represents the old value.
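A sketch of this difference, reusing the expected-utility helper and assuming a simple policy table (all state, action, and policy names are hypothetical):

```python
# Value of learning that the true state is t (not s): compare the expected
# utility in t of the action the policy prescribes for t versus for s.
T = {
    ("t", "go"): {"good": 0.9, "bad": 0.1},
    ("t", "stay"): {"good": 0.1, "bad": 0.9},
}
u = {"good": 10.0, "bad": 0.0}
policy = {"t": "go", "s": "stay"}  # hypothetical policy pi_i

def expected_utility(s, a, T, u):
    return sum(p * u[s2] for s2, p in T.get((s, a), {}).items())

value_of_info = (expected_utility("t", policy["t"], T, u)
                 - expected_utility("t", policy["s"], T, u))
print(value_of_info)  # 9.0 - 1.0 = 8.0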
Markov Decision Processes: MDP
Figure: graphical representation of a sample Markov decision process, along with values for the transition and reward functions; the start state is s1.
Reward Function: r(s)
• The reward function is represented as
  $r : S \to \mathbb{R}$
Deterministic vs. Non-Deterministic
• Deterministic world: predictable effects.
  Example: for a given state and action, one successor state has $T = 1$ and all others have $T = 0$.
• Nondeterministic world: the same action can lead to different outcomes with different probabilities.
Policy:
• A policy is the behavior of an agent: a mapping from states to actions.
• A policy is represented by $\pi$.
Optimal Policy
• An optimal policy is a policy that maximizes expected utility.
• The optimal policy is represented as $\pi^*$:
  $\pi_i^*(s) = \arg\max_{a \in A} E[u_i, s, a]$
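A sketch of this argmax over actions, reusing the expected-utility sum (states, actions, and utilities are illustrative assumptions):

```python
# pi*(s) = argmax over actions a of E[u, s, a].
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
}
u = {"s1": 0.0, "s2": 10.0}
actions = ["a1", "a2"]

def expected_utility(s, a, T, u):
    return sum(p * u[s2] for s2, p in T.get((s, a), {}).items())

def optimal_action(s, actions, T, u):
    """Pick the action with the highest expected utility in state s."""
    return max(actions, key=lambda a: expected_utility(s, a, T, u))

print(optimal_action("s1", actions, T, u))  # "a1" (8.0 beats 0.0)
```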
Discounted Rewards: $\gamma$ $(0 \le \gamma \le 1)$
• Discounted rewards smoothly reduce the impact of rewards that are farther off in the future:
  $\gamma^0 r(s_1) + \gamma^1 r(s_2) + \gamma^2 r(s_3) + \ldots$
  where $\gamma$ $(0 \le \gamma \le 1)$ represents the discount factor.
 * ( s )  arg max  T ( s , a , s )u ( s )
a
s
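A minimal sketch of the discounted sum above, with a made-up reward sequence and discount factor:

```python
# Discounted return: gamma^0*r(s1) + gamma^1*r(s2) + gamma^2*r(s3) + ...
# The reward sequence and gamma below are illustrative assumptions.
gamma = 0.5
rewards = [1.0, 2.0, 4.0, 8.0]  # r(s1), r(s2), r(s3), r(s4) along one trajectory

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 1.0 + 1.0 + 1.0 + 1.0 = 4.0
```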
Bellman Equation
  $u(s) = r(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, u(s')$
• $r(s)$ represents the immediate reward; $\gamma \max_a \sum_{s'} T(s, a, s')\, u(s')$ represents the future, discounted rewards.
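A sketch of a single Bellman backup for one state, under the same dictionary conventions as the earlier snippets (all concrete values are assumptions):

```python
# One Bellman backup: u(s) = r(s) + gamma * max_a sum_s' T(s,a,s') * u(s').
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
}
r = {"s1": 0.0, "s2": 1.0}
u = {"s1": 0.0, "s2": 1.0}   # current utility estimates
gamma = 0.5
actions = ["a1", "a2"]

def bellman_backup(s, u, T, r, actions, gamma):
    """Return the Bellman right-hand side for state s given current utilities u."""
    best = max(sum(p * u[s2] for s2, p in T.get((s, a), {}).items())
               for a in actions)
    return r[s] + gamma * best

print(bellman_backup("s1", u, T, r, actions, gamma))  # 0.0 + 0.5 * 0.8 = 0.4
```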
Brute Force Solution
• Write $n$ Bellman equations, one for each of the $n$ states, and solve them.
• These equations are non-linear because of the $\max_a$ operator.
Value Iteration Solution
• Set the values of $u(s)$ to random numbers.
• Use the Bellman update equation:
  $u_{t+1}(s) \leftarrow r(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, u_t(s')$
• Stop updating once the values converge, i.e. when
  $\delta < \epsilon (1 - \gamma) / \gamma$
  where $\delta$ is the maximum utility change in the last iteration.
Value Iteration Algorithm
VALUE-ITERATION(T, r, γ, ε)
  repeat
    u ← u′
    δ ← 0
    for each s ∈ S do
      u′(s) ← r(s) + γ max_a Σ_{s′} T(s, a, s′) u(s′)
      if |u′(s) − u(s)| > δ then δ ← |u′(s) − u(s)|
  until δ < ε(1 − γ)/γ
  return u
Example: with γ = 0.5 and ε = 0.15, the algorithm stops after t = 4.
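A runnable sketch of this loop in Python, using a small made-up MDP (the states, actions, rewards, and parameters are assumptions, not the example from the slide's figure):

```python
# Value iteration: repeatedly apply the Bellman update until the largest
# change falls below epsilon * (1 - gamma) / gamma.
T = {
    ("s1", "a1"): {"s1": 0.2, "s2": 0.8},
    ("s1", "a2"): {"s1": 1.0},
    ("s2", "a1"): {"s2": 1.0},
    ("s2", "a2"): {"s1": 0.5, "s2": 0.5},
}
states = ["s1", "s2"]
actions = ["a1", "a2"]
r = {"s1": 0.0, "s2": 1.0}

def value_iteration(states, actions, T, r, gamma=0.5, epsilon=0.15):
    u = {s: 0.0 for s in states}          # initial utility estimates
    while True:
        u_new, delta = {}, 0.0
        for s in states:
            best = max(sum(p * u[s2] for s2, p in T.get((s, a), {}).items())
                       for a in actions)
            u_new[s] = r[s] + gamma * best
            delta = max(delta, abs(u_new[s] - u[s]))
        u = u_new
        if delta < epsilon * (1 - gamma) / gamma:
            return u

print(value_iteration(states, actions, T, r))
```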
MDP for one agent
• Multiagent setting: one agent changes its policy while the others are stationary.
• A better approach: $T(s, \vec{a}, s')$
  where $\vec{a}$ is a vector of size $n$ giving each agent's action, and $n$ is the number of agents.
• Rewards:
  – Dole out equally among the agents
  – Reward each agent proportionally to its contribution
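One way to sketch a joint-action transition function is to key it on a tuple of all agents' actions; the reward-splitting rules can then be applied to the scalar reward (all names and numbers below are illustrative assumptions):

```python
# Joint-action transition function T(s, a_vector, s') for n = 2 agents.
# The joint action is a tuple: (action of agent 1, action of agent 2).
T_joint = {
    ("s1", ("push", "push")): {"s2": 1.0},             # cooperation succeeds
    ("s1", ("push", "wait")): {"s1": 0.8, "s2": 0.2},  # partial progress
    ("s1", ("wait", "wait")): {"s1": 1.0},
}

def transition(s, joint_action, s_next):
    return T_joint.get((s, joint_action), {}).get(s_next, 0.0)

# Splitting a reward of 10 between the two agents:
reward = 10.0
equal_split = [reward / 2, reward / 2]               # dole out equally
contribution = [0.7, 0.3]                            # assumed contributions
proportional = [reward * c for c in contribution]    # proportional to contribution
print(equal_split, proportional)
```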
Observation model
• Noise + the agent cannot fully observe the world …
• Belief state: $b = \langle P_1, P_2, P_3, \ldots, P_n \rangle$
• Observation model: $O(s, o)$ = probability of observing $o$ while in state $s$.
• Belief update:
  $b'(s') = \alpha\, O(s', o) \sum_{s} T(s, a, s')\, b(s)$
  where $\alpha$ is a normalization constant.
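A sketch of this belief update with a two-state example (the transition, observation, and belief values are assumptions):

```python
# Belief update: b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s).
states = ["s1", "s2"]
T = {
    ("s1", "a"): {"s1": 0.7, "s2": 0.3},
    ("s2", "a"): {"s1": 0.1, "s2": 0.9},
}
O = {("s1", "o"): 0.2, ("s2", "o"): 0.9}  # probability of observing o in each state
b = {"s1": 0.5, "s2": 0.5}                # current belief

def update_belief(b, action, obs, states, T, O):
    unnormalized = {}
    for s_next in states:
        pred = sum(T.get((s, action), {}).get(s_next, 0.0) * b[s] for s in states)
        unnormalized[s_next] = O[(s_next, obs)] * pred
    alpha = 1.0 / sum(unnormalized.values())   # normalization constant
    return {s: alpha * p for s, p in unnormalized.items()}

print(update_belief(b, "a", "o", states, T, O))  # mass shifts toward s2
```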
Partially observable MDP
• Belief-state transition function:
  $T(b, a, b') = \sum_{s'} O(s', o) \sum_{s} T(s, a, s')\, b(s)$ if $b'(s') = \alpha\, O(s', o) \sum_{s} T(s, a, s')\, b(s)$ holds for $b, a, b'$ (*), and $T(b, a, b') = 0$ otherwise.
• New reward function:
  $\rho(b) = \sum_{s} b(s)\, r(s)$
• If (*) holds, this turns the POMDP into an MDP over belief states. Solving a POMDP is hard.
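A sketch of the belief-state reward $\rho(b)$, reusing the two-state belief example (the reward values are assumptions):

```python
# Belief-state reward: rho(b) = sum_s b(s) * r(s), the expected immediate
# reward under the current belief.
b = {"s1": 0.129, "s2": 0.871}   # a belief over two states
r = {"s1": 0.0, "s2": 1.0}       # assumed reward function

def belief_reward(b, r):
    return sum(prob * r[s] for s, prob in b.items())

print(belief_reward(b, r))  # 0.871
```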