From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Learning Monitoring Strategies to Compensate for Model Uncertainty

Eric A. Hansen, Paul R. Cohen
Experimental Knowledge Systems Laboratory
Department of Computer Science
University of Massachusetts, Amherst, MA 01003
hansen@cs.umass.edu
Abstract
This paper addresses the need for monitoring the environment given an action model that is uncertain or stochastic. Its contribution is to describe how monitoring costs can be included in the framework of Markov decision problems, making it possible to acquire cost-effective monitoring strategies using dynamic programming or related reinforcement learning algorithms.
Statement of the Problem
In realistic domains, an agent must generate plans from an action model that is imperfect and uncertain. Even if the model can be improved by learning, there is usually a limit to how accurate it can be made; to some degree, the effects of actions are stochastic. When an agent cannot predict the effects of its actions with certainty, it must monitor the state of its environment. (In the rest of this paper, the terms "monitoring" and "sensing" are used interchangeably.) However, if a cost is incurred for monitoring, it may be prohibitively expensive for an agent to monitor every feature of its environment continuously. Hence there is a need for cost-effective monitoring strategies.

Recent work that considers sensing costs in learning strategies for robotic sensing attempts to make sensing more efficient by selecting a subset of the available features of the environment to sense (Chrisman & Simmons, 1991; Tan, 1991). An interesting aspect of this work is that it deals with the issue of incomplete state descriptions (Whitehead & Ballard, 1991). An assumption it makes, however, is that the agent senses its environment (or selected features of it) at a fixed periodic interval, typically at the beginning of each time step or decision cycle. In this paper, we start by questioning this assumption. We consider an agent that can decide for itself when to sense the world, and how long to wait before sensing again. So our focus is less on the problem of what features of the environment to sense than when and how often to sense them.
For both generality and rigor, we have attempted to treat this problem as a Markov decision problem using methods based on dynamic programming. This framework has been used in recent work to make useful connections between planning and learning (Sutton, 1990). In this framework, an action model takes the form of a state transition function, $P_{xy}(a)$, that gives the conditional probability that action $a$ taken in state $x$ produces state $y$. In addition, a payoff function, $R(x,a)$, gives the expected single-step payoff for taking action $a$ in state $x$. The problem is to find a policy that optimizes payoff in the long term. This policy can be found with dynamic programming, or various reinforcement learning algorithms that have been shown to have a theoretical basis in dynamic programming (Barto, Sutton, & Watkins, 1990).

In a conventional Markov decision problem, the state of the environment is automatically monitored at each time step without considering the costs this might incur. Because this is exactly the assumption we wish to question, the key step in our work is to show how to express monitoring costs and formulate monitoring strategies in the framework of a Markov decision problem. Given a conventional Markov decision problem with single-step state-transition probabilities and a single-step reward function, we show how to transform it into another Markov decision problem in which the agent does not automatically monitor each time step, but considers sensing costs in deciding when and how often to monitor. It is not possible to describe all the details of this approach in this short paper, but a simple example is followed by a brief commentary.

A Simple Example

Consider a Markov chain with five states labelled from 0 to 4, in which the highest numbered state, 4, is an absorbing state. The action set for the controller is $A = \{\mathrm{null}, \mathrm{restart}\}$. If the controller does nothing (i.e., performs the null action) at a given time step, the Markov chain has the state transition probabilities shown in figure 1.

[Figure 1. State transition probabilities for the null action.]

The controller can restart the Markov chain once it enters the absorbing state, 4. Restarting it restores it to state 0 and has the state transition probabilities shown in figure 2.
[Figure 2. State transition probabilities for the restart action.]

For example, a restart in state 4 effects a transition to state 0, but a restart in state 2 does not change the state. (We assume the effect of a restart is instantaneous, while the effect of the null action takes one time step.)

There is a cost, -2, assessed for each time step the process is in the absorbing state, and no cost assessed when it is in the other states. A cost, -3, is incurred for restarting the process, and there is no cost for the null action. So the payoff function for this problem is:

$$R(x, \mathrm{null}) = \begin{cases} -2 & \text{if } x = 4 \\ 0 & \text{otherwise} \end{cases}
\qquad
R(x, \mathrm{restart}) = \begin{cases} -3 & \text{if } x = 4 \\ 0 & \text{otherwise} \end{cases}$$
The optimal control policy is self-evident; the process should be restarted whenever it enters its absorbing state. This policy can also be found by computing the optimal value function, below, using some method of infinite-horizon dynamic programming such as policy iteration:

$$V(x) = \max_{a \in A} \Big[ R(x,a) + \lambda \sum_{y \in S} P_{xy}(a)\, V(y) \Big]$$

Using a discount factor of $\lambda = 0.95$, the optimal policy for this problem and its expected cost are displayed in table 1.

state                                            0       1       2       3       4
control                                          null    null    null    null    restart
monitoring interval                              1       1       1       1       1
expected cost (not including monitoring cost)   -1.06   -1.28   -1.77   -2.64   -4.06
expected cost (including monitoring cost)      -21.06  -21.28  -21.77  -22.64  -24.06

Table 1. Optimal policy for the original Markov decision problem.
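For concreteness, the kind of computation behind table 1 can be sketched in a few lines of value iteration. The transition matrices below are illustrative placeholders (the exact probabilities of figures 1 and 2 are not reproduced in this text, and restart is modeled here as an ordinary discounted step rather than an instantaneous one), so the resulting numbers will not match table 1 exactly; only the cost structure (-2 per step in the absorbing state, -3 per restart) and the discount factor of 0.95 follow the example.

```python
import numpy as np

# Five states 0..4; state 4 is absorbing under the null action.
# NOTE: figure 1's exact probabilities are not reproduced here, so this
# null-action matrix is an assumed placeholder with the same structure
# (a stochastic drift toward the absorbing state 4).
P_NULL = np.array([
    [0.50, 0.50, 0.00, 0.00, 0.00],
    [0.00, 0.50, 0.50, 0.00, 0.00],
    [0.00, 0.00, 0.50, 0.50, 0.00],
    [0.00, 0.00, 0.00, 0.25, 0.75],
    [0.00, 0.00, 0.00, 0.00, 1.00],
])

# Restart returns the absorbing state to state 0; in states 0-3 it is assumed
# to have no effect, so the step proceeds as under the null action.
P_RESTART = P_NULL.copy()
P_RESTART[4] = [1.0, 0.0, 0.0, 0.0, 0.0]

P = {"null": P_NULL, "restart": P_RESTART}
R = {"null":    np.array([0.0, 0.0, 0.0, 0.0, -2.0]),   # -2 per step in state 4
     "restart": np.array([0.0, 0.0, 0.0, 0.0, -3.0])}   # -3 for an effective restart
LAMBDA = 0.95                                            # discount factor

def value_iteration(P, R, discount, eps=1e-8):
    """Iterate V(x) = max_a [ R(x,a) + discount * sum_y P_xy(a) V(y) ]."""
    V = np.zeros(len(R["null"]))
    while True:
        Q = {a: R[a] + discount * P[a] @ V for a in P}
        V_new = np.maximum.reduce(list(Q.values()))
        if np.max(np.abs(V_new - V)) < eps:
            policy = [max(Q, key=lambda a: Q[a][x]) for x in range(len(V))]
            return V_new, policy
        V = V_new

V, policy = value_iteration(P, R, LAMBDA)
print(policy)   # null in states 0-3, restart in state 4, as in table 1
print(V)        # expected costs (values differ from table 1 under these placeholders)
```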
The last row of the table shows how the expected cost would increase if a monitoring cost, $M = -1$, is assessed each time step. In a conventional Markov decision problem, the controller has no control over costs incurred by monitoring because the state of the process is monitored automatically at each time step. We show that a more cost-effective policy can be found by defining a new Markov decision problem in which monitoring costs are considered and the controller does not automatically monitor each time step, but decides for itself how many steps to wait before monitoring again. This new Markov decision problem is constructed from the original one by defining a new action set,

$$A' = \{\mathrm{null}, \mathrm{restart}\} \times \{1, 2, 3, \ldots\},$$

where a tuple, $(a, m)$, represents a decision to take action $a$ and monitor again $m$ time steps later.
A multi-step state transition function is defined by matrix multiplication,

$$P'_{xy}((a,m)) = \big(P(a)\, P^{\,m-1}(\mathrm{null})\big)_{xy},$$

where $a$ is either null or restart, and $P(\mathrm{null})$ and $P(\mathrm{restart})$ are the single-step state transition probability matrices displayed graphically in figures 1 and 2.

A multi-step expected payoff function is defined in terms of the original payoff function as follows:

$$R'(x,(a,m)) = M + R(x,a) + \sum_{j=1}^{m-1} \lambda^{j} \sum_{y} P'_{xy}((a,j))\, R(y, \mathrm{null}),$$

where $M = -1$, again, is the cost incurred for monitoring.

The optimal value function for this new decision problem is

$$V(x) = \max_{(a,m) \in A'} \Big[ R'(x,(a,m)) + \lambda^{m} \sum_{y} P'_{xy}((a,m))\, V(y) \Big].$$

It can be solved only if the number of action and monitoring interval pairs, $(a,m)$, that must be evaluated in each state is finite. One way to ensure this is to put a bound on the maximum monitoring interval $m$. A less arbitrary solution is to reason that the expected value in state $x$ for action $a$ is a unimodal function of the monitoring interval, $m$. This is a perfectly reasonable assumption since it amounts to saying that the closer the monitoring interval is to the optimum, the better it is. This allows an optimal monitoring interval for each state-action pair to be found by bounded search or simple gradient ascent.

The optimal policy and its expected cost have been computed by dynamic programming and are displayed in table 2.
state                                        0      1      2      3      4
control                                      null   null   null   null   restart
monitoring interval                          11     8      5      2      11
expected cost (including monitoring cost)   -4.98  -5.70  -6.90  -8.44  -7.98

Table 2. Optimal policy for the redefined Markov decision problem.
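To make the construction above concrete, the sketch below builds the multi-step transition and payoff functions $P'$ and $R'$ and solves the enlarged problem by value iteration over $(a,m)$ pairs, using a simple bound on the maximum monitoring interval. As in the earlier sketch, the transition matrices are assumed placeholders (the exact values of figures 1 and 2 are not reproduced in this text), so the numbers will differ from table 2, though the qualitative pattern of shrinking intervals near the absorbing state should appear.

```python
import numpy as np
from numpy.linalg import matrix_power

# Assumed placeholder matrices, as in the earlier sketch.
P_NULL = np.array([
    [0.50, 0.50, 0.00, 0.00, 0.00],
    [0.00, 0.50, 0.50, 0.00, 0.00],
    [0.00, 0.00, 0.50, 0.50, 0.00],
    [0.00, 0.00, 0.00, 0.25, 0.75],
    [0.00, 0.00, 0.00, 0.00, 1.00],
])
P_RESTART = P_NULL.copy()
P_RESTART[4] = [1.0, 0.0, 0.0, 0.0, 0.0]
P = {"null": P_NULL, "restart": P_RESTART}
R = {"null":    np.array([0.0, 0.0, 0.0, 0.0, -2.0]),
     "restart": np.array([0.0, 0.0, 0.0, 0.0, -3.0])}

LAMBDA, M_COST, MAX_M = 0.95, -1.0, 20    # discount, monitoring cost M, bound on m

def multi_step(a, m):
    """P'((a,m)) and R'(.,(a,m)): act once, wait m-1 null steps, then monitor."""
    P_am = P[a] @ matrix_power(P_NULL, m - 1)
    R_am = M_COST + R[a]
    for j in range(1, m):   # expected (discounted) cost accrued while unmonitored
        R_am = R_am + LAMBDA**j * (P[a] @ matrix_power(P_NULL, j - 1)) @ R["null"]
    return P_am, R_am

def solve(eps=1e-8):
    """V(x) = max_{(a,m)} [ R'(x,(a,m)) + lambda^m * sum_y P'_xy((a,m)) V(y) ]."""
    pairs = [(a, m) for a in P for m in range(1, MAX_M + 1)]
    tables = {am: multi_step(*am) for am in pairs}
    V = np.zeros(5)
    while True:
        Q = {am: R_am + LAMBDA ** am[1] * P_am @ V for am, (P_am, R_am) in tables.items()}
        V_new = np.maximum.reduce(list(Q.values()))
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, [max(Q, key=lambda am: Q[am][x]) for x in range(5)]
        V = V_new

V, policy = solve()
print(policy)   # monitoring intervals should shrink toward the absorbing state,
                # reproducing the qualitative pattern of table 2
```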
Two observations can be made about this policy. First, taking monitoring costs into consideration makes it possible to compute a more efficient policy in which the state of the Markov process does not have to be monitored at each time step. Second, the controller no longer monitors at a fixed periodic rate. In this example, its rate of monitoring increases as it gets closer to the costly absorbing state. This reflects the general idea that there are regions of a problem space or environment in which a controller needs to "be more careful", so to speak, by monitoring more frequently. We have found that a pattern much like this emerges from a number of different problem-solving situations. Atkin and Cohen (1993) describe an example in which it is worthwhile for an agent to monitor more frequently as it approaches a goal.
Dynamic programming is not the only way to determine a policy once monitoring costs and strategies have been expressed in a Markov decision problem. Various reinforcement learning algorithms have been developed that have a theoretical basis in dynamic programming and can be used either for direct learning, when an explicit cost and probability model are not available, or for real-time planning and learning when a model is available but the problem space is too large to perform full passes of dynamic programming in real time (Barto, Bradtke, & Singh, 1993). Although not described here, we have designed a simple extension of the Q-learning algorithm that learns to monitor as part of its control policy. It works by using a separate stochastic Gaussian unit for each state-action pair to find the optimal monitoring interval by gradient ascent.
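The details of this Q-learning extension are not given here; the following is a speculative sketch of one way such an agent might be organized, with all update rules and parameter names being illustrative choices rather than the authors' algorithm. It keeps a tabular Q-value and a Gaussian unit (a mean monitoring interval with a fixed spread) for each state-action pair, draws the interval from the Gaussian, and nudges the mean by a REINFORCE-style gradient step toward intervals that produced better-than-expected returns.

```python
import numpy as np

class MonitoringQLearner:
    """Hypothetical sketch of Q-learning that also learns when to monitor."""

    def __init__(self, n_states, actions, sigma=1.0, alpha=0.1, beta=0.05, gamma=0.95):
        self.actions = list(actions)
        self.Q = {(s, a): 0.0 for s in range(n_states) for a in self.actions}
        self.mu = {(s, a): 3.0 for s in range(n_states) for a in self.actions}  # interval means
        self.sigma, self.alpha, self.beta, self.gamma = sigma, alpha, beta, gamma

    def choose(self, s, epsilon=0.1):
        """Epsilon-greedy action; monitoring interval sampled from the Gaussian unit."""
        if np.random.rand() < epsilon:
            a = self.actions[np.random.randint(len(self.actions))]
        else:
            a = max(self.actions, key=lambda b: self.Q[(s, b)])
        m = max(1, int(round(np.random.normal(self.mu[(s, a)], self.sigma))))
        return a, m

    def update(self, s, a, m, reward, s_next):
        """reward: discounted return accumulated over the m unmonitored steps,
        including the monitoring cost; s_next: state seen at the next observation."""
        target = reward + self.gamma ** m * max(self.Q[(s_next, b)] for b in self.actions)
        td_error = target - self.Q[(s, a)]
        self.Q[(s, a)] += self.alpha * td_error
        # REINFORCE-style nudge of the Gaussian mean: move toward intervals that
        # did better than expected, away from those that did worse.
        self.mu[(s, a)] += self.beta * td_error * (m - self.mu[(s, a)]) / self.sigma ** 2
        self.mu[(s, a)] = max(1.0, self.mu[(s, a)])

# e.g., agent = MonitoringQLearner(5, ["null", "restart"]) for the example above
```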
Work in Progress
Two interesting classes of policies can be found using our approach. The first is illustrated by the previous example: an agent observes the current state, performs a single action (which could be the null action), and then waits some number of steps before monitoring again. This is typical of problems in which, for example, a process is monitored over time to detect when a corrective action should be taken. It also applies to problems in which a single action is continued for a period of time and periodically monitored to decide whether to continue it or not.
However, this class of policies is only a special case of a more general class in which, in each state, a policy specifies an open-loop sequence of actions to perform before monitoring again some number of time steps later. The difference between the first class and the more general class is that, in the latter, a sequence of different actions is taken before monitoring again, rather than a single action. This broader class of policies is more closely related to planning problems in which a sequence of actions is taken to get from a start state to a goal state. However, the first class of policies, described here, corresponds to an important set of practical monitoring problems; moreover, finding an optimal policy in the first case is computationally simpler. While the number of possible action and monitoring interval pairs, $(a,m)$, in the first class is a linear function of the monitoring interval, the number of possible open-loop sequences of actions in the general class grows exponentially as a function of the monitoring interval.
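To make this comparison concrete, suppose there are $|A|$ primitive actions and the monitoring interval is bounded by some $m_{\max}$ (a bound introduced here only for illustration). The first class then requires evaluating $|A| \cdot m_{\max}$ action and interval pairs per state, while the general class requires evaluating

$$\sum_{m=1}^{m_{\max}} |A|^{m} = \frac{|A|^{\,m_{\max}+1} - |A|}{|A| - 1}$$

open-loop sequences; with $|A| = 2$ and $m_{\max} = 10$, for example, that is 20 pairs versus 2046 sequences.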
In both these classes of problems, the purpose of monitoring is simply to determine the current state of the environment. Another purpose of monitoring, one we have not yet considered, might be to check whether the model used by the agent to generate its behavior is correct or needs to be revised. A very simple example of this is described in (Grefenstette & Ramsey, 1992), where a monitor detects discrepancies between the parameters of an agent's model and the environment to decide whether to update the model parameters and trigger a learning algorithm to revise the agent's behavior. It might be interesting to explore, within a framework like the one described here, how uncertainty about whether an action model is correct might influence the rate of monitoring.
Acknowledgments
Thanks to Andrew Barto and Satinder Singh for helping to clarify some of these ideas in discussion, and to Scott Anderson for comments on a draft.
This research is supported by the Defense Advanced Research Projects Agency under contract #F49620-89-C00113; by the Air Force Office of Scientific Research under the Intelligent Real-time Problem Solving Initiative, contract #AFOSR-91-0067; and by an Augmentation Award for Science and Engineering Research Training, PR No. C2-2675. The US Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon.
References
Atkin, M. and Cohen, P. (1993). Genetic programming to learn an agent's monitoring strategy. In this proceedings.

Barto, A.G., Sutton, R.S., and Watkins, C.J.C.H. (1990). Learning and sequential decision making. In Learning and Computational Neuroscience: Foundations of Adaptive Networks, M. Gabriel and J.W. Moore (Eds.), MIT Press, pp. 539-602.

Barto, A., Bradtke, S., and Singh, S. (1993). Learning to act using real-time dynamic programming. Computer Science Technical Report 93-02, University of Massachusetts, Amherst.

Chrisman, L. and Simmons, R. (1991). Sensible planning: Focusing perceptual attention. Proceedings of AAAI-91, pp. 756-761.

Grefenstette, J.J. and Ramsey, C.L. (1992). An approach to anytime learning. Proceedings of the Ninth International Conference on Machine Learning, pp. 189-195.

Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning, pp. 216-224.

Tan, M. (1991). Learning a cost-sensitive internal representation for reinforcement learning. Proceedings of the Eighth International Workshop on Machine Learning, pp. 358-362.

Whitehead, S.D. and Ballard, D.H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7:45-83.