Utilities and MDP: A Lesson in Multiagent Systems
Henry Hexmoor, SIUC

Utility
• Preferences are recorded as a utility function u_i : S → R
  – S is the set of observable states in the world
  – u_i is agent i's utility function
  – R is the real numbers
• States of the world become ordered.

Properties of Utilities
• Reflexive: u_i(s) ≥ u_i(s)
• Transitive: if u_i(a) ≥ u_i(b) and u_i(b) ≥ u_i(c), then u_i(a) ≥ u_i(c)
• Comparable: for all a, b, either u_i(a) ≥ u_i(b) or u_i(b) ≥ u_i(a)

Selfish agents
• A rational agent is one that wants to maximize its utilities, but intends no harm.
• (Diagram: classification of agents into selfish agents, rational agents, and rational, non-selfish agents.)

Utility is not money
• While utility represents an agent's preferences, it is not necessarily equated with money. In fact, the utility of money has been found to be roughly logarithmic.

Marginal Utility
• Marginal utility is the utility gained from the next event.
• Example: getting an A for an A student versus an A for a B student.

Transition Function
• The transition function is represented as T(s, a, s').
• It is defined as the probability of reaching s' from s with action a.

Expected Utility
• Expected utility is the sum, over successor states s', of the probability of reaching s' from s with action a times the utility of that final state:
  E[u_i, s, a] = Σ_{s' ∈ S} T(s, a, s') u_i(s')
  where S is the set of all possible states.

Value of Information
• The value of the information that the current state is t and not s:
  E[u_i, t, π_i(t)] − E[u_i, t, π_i(s)]
  where E[u_i, t, π_i(t)] is the updated value under the new information and E[u_i, t, π_i(s)] is the old value.

Markov Decision Processes: MDP
• (Figure: graphical representation of a sample Markov decision process, with values for the transition and reward functions; the start state is s1.)

Reward Function: r(s)
• The reward function is represented as r : S → R.

Deterministic vs. Non-Deterministic
• Deterministic world: predictable effects. Example: for a given state and action, exactly one successor state has T = 1 and all others have T = 0.
• Nondeterministic world: outcomes vary; the same action can lead to different successor states.

Policy
• A policy is a behavior of an agent that maps states to actions.
• A policy is represented by π_i.

Optimal Policy
• An optimal policy is a policy that maximizes expected utility.
• The optimal policy is represented as π_i*:
  π_i*(s) = argmax_{a ∈ A} E[u_i, s, a]
  (A small Python sketch of these definitions follows the Discounted Rewards slide below.)

Discounted Rewards
• Discounted rewards smoothly reduce the impact of rewards that are farther off in the future:
  γ^0 r(s_1) + γ^1 r(s_2) + γ^2 r(s_3) + ...
  where γ (0 ≤ γ ≤ 1) represents the discount factor.
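The expected-utility and optimal-policy definitions above lend themselves to a short worked example. Below is a minimal Python sketch; the state names s1/s2, the actions a1/a2, and the numbers in T and u are illustrative assumptions, not values from the lecture.

```python
# Minimal sketch: expected utility E[u, s, a] and a greedy optimal action.
# The states, actions, and probabilities below are made-up illustrations.

# T[s][a][s2] = probability of reaching s2 from s with action a
T = {
    "s1": {"a1": {"s1": 0.2, "s2": 0.8},
           "a2": {"s1": 0.9, "s2": 0.1}},
    "s2": {"a1": {"s1": 1.0},
           "a2": {"s2": 1.0}},
}
u = {"s1": 0.0, "s2": 10.0}          # utility of each state

def expected_utility(s, a):
    """E[u, s, a] = sum over s' of T(s, a, s') * u(s')."""
    return sum(p * u[s2] for s2, p in T[s][a].items())

def optimal_action(s):
    """pi*(s) = argmax over a of E[u, s, a]."""
    return max(T[s], key=lambda a: expected_utility(s, a))

print(expected_utility("s1", "a1"))   # 0.2*0 + 0.8*10 = 8.0
print(optimal_action("s1"))           # "a1"
```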
Bellman Equation
• Given the utilities of states, choose the action that maximizes expected utility:
  π*(s) = argmax_a Σ_{s'} T(s, a, s') u(s')
• The utility of a state satisfies
  u(s) = r(s) + γ max_a Σ_{s'} T(s, a, s') u(s')
  where r(s) represents the immediate reward and γ Σ_{s'} T(s, a, s') u(s') represents the future, discounted rewards.

Brute Force Solution
• Write n Bellman equations, one for each of the n states, and solve.
• This is a system of non-linear equations due to the max_a operator.

Value Iteration Solution
• Set the values of u(s) to random numbers.
• Use the Bellman update equation:
  u_{t+1}(s) = r(s) + γ max_a Σ_{s'} T(s, a, s') u_t(s')
• Converge and stop using this equation when ||u_{t+1} − u_t|| < ε(1 − γ)/γ, where ||u_{t+1} − u_t|| is the maximum utility change over all states.

Value Iteration Algorithm
  VALUE-ITERATION(T, r, γ, ε)
    repeat
      u ← u'; δ ← 0
      for each s ∈ S do
        u'(s) ← r(s) + γ max_a Σ_{s'} T(s, a, s') u(s')
        if |u'(s) − u(s)| > δ then δ ← |u'(s) − u(s)|
    until δ < ε(1 − γ)/γ
    return u
• Example run: with γ = 0.5 and ε = 0.15, the algorithm stops after t = 4. (A runnable sketch of this loop appears at the end of the lesson.)

MDP for One Agent
• Multiagent setting: one agent changes while the others are stationary.
• A better approach: use T(s, ā, s'), where ā is a vector of size n giving each agent's action and n is the number of agents.
• Rewards:
  – Dole them out equally among the agents, or
  – Reward each agent in proportion to its contribution.

Observation Model
• Observations are noisy and the agent cannot observe the world directly.
• Belief state: b = ⟨P_1, P_2, P_3, ..., P_n⟩, a probability distribution over states.
• Observation model: O(s, o) = probability of observing o while being in state s.
• Belief update: b'(s') = α O(s', o) Σ_s T(s, a, s') b(s), where α is a normalization constant.

Partially Observable MDP
• Transition function over belief states:
  T(b, a, b') = Σ_{s'} O(s', o) Σ_s T(s, a, s') b(s)   if the belief update (*) b'(s') = α O(s', o) Σ_s T(s, a, s') b(s) holds for b, a, b'
  T(b, a, b') = 0   otherwise.
• New reward function: ρ(b) = Σ_s b(s) r(s).
• Solving POMDPs is hard.
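Returning to the value-iteration algorithm above: here is a minimal Python sketch of the same loop. The three-state MDP below (states, rewards, and transition probabilities) is an illustrative assumption, not the example behind the "t = 4" run on the slides; only the update rule and the stopping test follow the algorithm.

```python
# Value iteration sketch: u'(s) = r(s) + gamma * max_a sum_s' T(s,a,s') u(s').
# The tiny MDP below is a made-up illustration, not the slides' example.

T = {  # T[s][a][s2] = probability of reaching s2 from s via a
    "s1": {"a1": {"s1": 0.5, "s2": 0.5}, "a2": {"s1": 1.0}},
    "s2": {"a1": {"s3": 1.0},            "a2": {"s1": 1.0}},
    "s3": {"a1": {"s3": 1.0},            "a2": {"s3": 1.0}},
}
r = {"s1": 0.0, "s2": 1.0, "s3": 10.0}   # immediate rewards

def value_iteration(T, r, gamma=0.5, eps=0.15):
    u = {s: 0.0 for s in T}   # initial utilities (the slides suggest random values)
    while True:
        u_new, delta = {}, 0.0
        for s in T:
            # Bellman update: immediate reward + discounted best future value
            best = max(sum(p * u[s2] for s2, p in T[s][a].items())
                       for a in T[s])
            u_new[s] = r[s] + gamma * best
            delta = max(delta, abs(u_new[s] - u[s]))
        u = u_new
        if delta < eps * (1 - gamma) / gamma:   # stopping test from the slides
            return u

print(value_iteration(T, r))
```

The synchronous copy into u_new mirrors the u ← u' step of the pseudocode, so every state is updated from the previous iteration's utilities.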
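Similarly, the belief-state update b'(s') = α O(s', o) Σ_s T(s, a, s') b(s) from the observation-model slide can be sketched in a few lines. The two-state world, the "go" action, and the "bright"/"dark" observations below are hypothetical illustrations, not part of the lecture.

```python
# Belief update sketch for a partially observable world:
#   b'(s') = alpha * O(s', o) * sum_s T(s, a, s') * b(s)
# The two-state model below is an illustrative assumption.

T = {  # T[s][a][s2] = probability of reaching s2 from s via a
    "s1": {"go": {"s1": 0.3, "s2": 0.7}},
    "s2": {"go": {"s1": 0.1, "s2": 0.9}},
}
O = {  # O[s][o] = probability of observing o while in state s
    "s1": {"bright": 0.9, "dark": 0.1},
    "s2": {"bright": 0.2, "dark": 0.8},
}

def belief_update(b, a, o):
    """Return the new belief b' after taking action a and observing o."""
    b_new = {}
    for s2 in T:
        pred = sum(T[s][a].get(s2, 0.0) * b[s] for s in b)  # prediction step
        b_new[s2] = O[s2][o] * pred                          # observation step
    alpha = 1.0 / sum(b_new.values())                        # normalization constant
    return {s: alpha * p for s, p in b_new.items()}

b = {"s1": 0.5, "s2": 0.5}                 # initial belief state
print(belief_update(b, "go", "dark"))      # belief shifts toward s2
```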