
Artificial Intelligence
Roman Barták
Department of Theoretical Computer Science and Mathematical Logic
Motivation
Consider the problem of learning to play chess.
A supervised learning agent needs to be told the correct move for each position it encounters.
– but such feedback is seldom available
In the absence of feedback, an agent can learn a transition model for its own moves and can perhaps learn to predict the opponent's moves.
– without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make
Feedback
A typical kind of feedback is called a reward, or reinforcement.
– in games like chess, the reinforcement is received only at the end of the game
– in other problems, the rewards come more frequently (in ping-pong, each point scored can be considered a reward)
The reward is part of the input percept, but the agent must be "hardwired" to recognize that part as a reward.
– pain and hunger are negative rewards
– pleasure and food intake are positive rewards
Reinforcement learning
Reinforcement learning might be considered to encompass all of artificial intelligence:
– an agent is placed in an environment and must learn to behave successfully therein
– in many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels
We will consider three of the agent designs:
– a utility-based agent learns a utility function and uses it to select actions that maximize the expected outcome utility
• it must also have a model of the environment, because it must know the states to which its actions will lead
– a Q-learning agent learns an action-utility function (Q-function) giving the expected utility of taking a given action in a given state
– a reflex agent learns a policy that maps directly from states to actions
The task of passive learning is to learn the utilities of the states, where the agent's policy is fixed.
In active learning the agent must also learn what to do.
– it involves some form of exploration: an agent must experience as much as possible of its environment in order to learn how to behave in it
Passive reinforcement learning
The agent's policy π is fixed (in state s, it always executes the action π(s)).
The goal is to learn how good the policy is, that is, to learn the utility function
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) ]
The agent does not know the transition model P(s'|s,a), nor does it know the reward function R(s).
A core approach:
– the agent executes a set of trials in the environment using its policy π
– its percepts supply both the current state and the reward received at that state
(1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (4,3)+1
(1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (3,2)-0.04→ (3,3)-0.04 → (4,3)+1
(1,1)-0.04 → (2,1)-0.04 → (3,1)-0.04 → (3,2)-0.04 → (4,2)-1
Direct utility estimation
The idea is that the utility of a state is the expected total reward from that state onward (expected reward-to-go).
– for state (1,1) we get a sample total reward of 0.72 in the first trial
– for state (1,2) we have two samples, 0.76 and 0.84, in the first trial
The same state may appear in more trials (or even more than once in the same trial), so we keep a running average for each state.
Direct utility estimation is just an instance of supervised learning (input = state, output = reward-to-go).
Major inefficiency:
– the utilities of states are not independent!
– the utility values obey the Bellman equations for a fixed policy
U^π(s) = R(s) + γ Σ_{s'} P(s'|s,π(s)) U^π(s')
– we search for U in a hypothesis space that is much larger than it needs to be (it includes many functions that violate the Bellman equations); for this reason, the algorithm often converges very slowly.
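To make the running-average idea concrete, here is a minimal Python sketch of direct utility estimation applied to the three trials above (with γ = 1). The data structures and function names are mine, not from the slides.

```python
from collections import defaultdict

# The three trials from the slides: sequences of (state, reward) pairs.
trials = [
    [((1,1),-0.04), ((1,2),-0.04), ((1,3),-0.04), ((1,2),-0.04), ((1,3),-0.04),
     ((2,3),-0.04), ((3,3),-0.04), ((4,3),+1.0)],
    [((1,1),-0.04), ((1,2),-0.04), ((1,3),-0.04), ((2,3),-0.04), ((3,3),-0.04),
     ((3,2),-0.04), ((3,3),-0.04), ((4,3),+1.0)],
    [((1,1),-0.04), ((2,1),-0.04), ((3,1),-0.04), ((3,2),-0.04), ((4,2),-1.0)],
]

def direct_utility_estimation(trials, gamma=1.0):
    """Average the observed reward-to-go over every occurrence of each state."""
    totals = defaultdict(float)   # sum of sampled rewards-to-go per state
    counts = defaultdict(int)     # number of samples per state
    for trial in trials:
        for i, (state, _) in enumerate(trial):
            # reward-to-go from position i: discounted sum of rewards from i onward
            reward_to_go = sum(gamma**k * r for k, (_, r) in enumerate(trial[i:]))
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

U = direct_utility_estimation(trials)
print(U[(1, 2)])   # samples 0.76 and 0.84 from trial 1, 0.76 from trial 2: average ≈ 0.79
print(U[(1, 1)])   # samples 0.72, 0.72, -1.16: average ≈ 0.093
```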
Adaptive dynamic programming
An adaptive dynamic programming (ADP) agent takes advantage of the Bellman equations.
The agent learns:
– the transition model P(s'|s,π(s))
• using the frequency with which s' is reached when executing a in s; for example P((2,3)|(1,3),Right) = 2/3
– the rewards R(s)
• directly observed
The utility of states is calculated from the Bellman equations, for example using modified policy iteration.
ADP algorithm
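The pseudocode on this slide is not reproduced here; the Python sketch below illustrates the idea under simplifying assumptions: the transition model is estimated from frequency counts, and the fixed policy is evaluated with plain iterative sweeps of the Bellman equations rather than modified policy iteration. All names and interfaces are mine.

```python
from collections import defaultdict

class PassiveADPAgent:
    """Sketch of a passive ADP learner: estimate P(s'|s,a) and R(s) from the
    trials, then evaluate the fixed policy with the Bellman equations."""

    def __init__(self, policy, gamma=1.0):
        self.policy = policy              # fixed policy: state -> action
        self.gamma = gamma
        self.R = {}                       # observed rewards R(s)
        self.N_sa = defaultdict(int)      # how often a was tried in s
        self.N_s_sa = defaultdict(int)    # how often s' followed (s, a)
        self.U = defaultdict(float)       # utility estimates

    def process_trial(self, trial):
        """trial: list of (state, reward) pairs produced by following the policy."""
        for (s, r), (s_next, _) in zip(trial, trial[1:]):
            self.R[s] = r
            a = self.policy.get(s)
            self.N_sa[(s, a)] += 1
            self.N_s_sa[(s, a, s_next)] += 1
        last_state, last_reward = trial[-1]      # terminal state of the trial
        self.R[last_state] = last_reward
        self._evaluate_policy()

    def _evaluate_policy(self, sweeps=100):
        """Iterate U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s') over the
        visited states (plain iterative policy evaluation)."""
        for _ in range(sweeps):
            for s in self.R:
                a = self.policy.get(s)
                if a is None or self.N_sa[(s, a)] == 0:   # terminal or unseen
                    self.U[s] = self.R[s]
                    continue
                self.U[s] = self.R[s] + self.gamma * sum(
                    self.N_s_sa[(s, a, s2)] / self.N_sa[(s, a)] * self.U[s2]
                    for s2 in self.R)

# Usage with the `trials` from the earlier sketch (the policy dict is hypothetical):
#   agent = PassiveADPAgent(policy={(1, 3): 'Right', (2, 3): 'Right', ...})
#   for t in trials:
#       agent.process_trial(t)
#   print(dict(agent.U))
```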
Temporal-difference learning
We can use the observed transitions to adjust the utilities of the states so that they agree with the constraint equations.
Example:
– consider the transition from (1,3) to (2,3)
– suppose that, as a result of the first trial, the utility estimates are U^π(1,3) = 0.84 and U^π(2,3) = 0.92
– if this transition occurred all the time, we would expect the utility to obey the equation (if γ = 1)
U^π(1,3) = −0.04 + U^π(2,3)
– so the utility would be U^π(1,3) = 0.88
– hence the current estimate U^π(1,3) might be a little low and should be increased
In general, we apply the following update (α is the learning-rate parameter):
U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') − U^π(s))
The above formula is often called the temporal-difference (TD) equation.
TD algorithm
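The TD pseudocode is likewise not reproduced here; the fragment below is a minimal sketch of the TD update applied to the worked example above (the learning rate α = 0.1 is an illustrative choice, not from the slides).

```python
def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
    """One temporal-difference update:
    U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)),
    where `reward` is R(s), the reward received in the state being updated."""
    U[s] = U[s] + alpha * (reward + gamma * U[s_next] - U[s])
    return U

# The example from the slides: U(1,3)=0.84, U(2,3)=0.92, R((1,3))=-0.04, gamma=1.
# The TD target is -0.04 + 0.92 = 0.88, so U(1,3) is nudged upward.
U = {(1, 3): 0.84, (2, 3): 0.92}
td_update(U, (1, 3), (2, 3), reward=-0.04, alpha=0.1, gamma=1.0)
print(U[(1, 3)])   # 0.84 + 0.1 * (0.88 - 0.84) = 0.844
```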
Comparison of ADP and TD
Both the ADP and TD approaches try to make local adjustments to the utility estimates in order to make each state "agree" with its successors.
• Temporal difference
– does not need a transition model to perform updates
– adjusts a state to agree with its observed successor
– a single adjustment per observed transition
• Adaptive dynamic programming
– adjusts a state to agree with all of the successors
– makes as many adjustments as it needs to restore consistency between the utility estimates
(Figure annotation: first time the agent reaches a state with reward −1.)
Active reinforcement learning
A passive learning agent has a fixed policy that determines its behavior.
An active agent must decide what actions to take.
Let us begin with the adaptive dynamic programming agent:
– the utilities it needs to learn are defined by the optimal policy; they obey the Bellman equations
U(s) = R(s) + γ max_a Σ_{s'} P(s'|s,a) U(s')
– these equations can be solved to obtain the utility function (see the sketch below)
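A minimal sketch of solving these equations by repeated Bellman updates (value iteration) over the learned model; the interfaces for the model (`actions`, `P`, `R`) are assumptions made for illustration.

```python
def value_iteration(states, actions, P, R, gamma=1.0, sweeps=100):
    """Solve U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s') by repeated
    Bellman updates over the learned model.  Assumed interfaces:
    `actions(s)` returns the actions available in s (empty for terminals),
    `P[(s, a)]` is a dict {s': estimated probability}, `R[s]` is the
    observed reward."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            acts = actions(s)
            if not acts:                      # terminal state
                U[s] = R[s]
                continue
            U[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in P[(s, a)].items())
                for a in acts)
    return U
```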
What to do at each step?
– the agent can extract an optimal action to maximize the expected utility
– then it should simply execute the action the optimal policy recommends
– or should it?
Greedy agent
An example of a policy found by the active ADP agent.
This is not an optimal policy!
What happened?
The agent found a route (2,1), (3,1), (3,2), (3,3) to the goal with reward +1.
After experimenting with minor variations, it sticks to that policy.
As it does not learn the utilities of the other states, it never finds the optimal route via (1,2), (1,3), (2,3), (3,3).
We call this agent the greedy agent.
Properties of greedy agents
How can it be that choosing the optimal action leads to suboptimal results?
– the learned model is not the same as the true environment; what is optimal in the learned model can therefore be suboptimal in the true environment
– actions do more than provide rewards; they also contribute to learning the true model by affecting the percepts that are received
– by improving the model, the agent will receive greater rewards in the future
An agent must therefore make a tradeoff between exploitation to maximize its reward and exploration to maximize its long-term well-being.
Exploration
What is the right trade-off between exploration and exploitation?
– pure exploration is of no use if one never puts that knowledge into practice
– pure exploitation risks getting stuck in a rut
Basic idea:
– at the beginning, striking out into the unknown in the hope of discovering a new and better life
– with greater understanding, less exploration is necessary
An n-armed bandit
– a slot machine with n levers (or n one-armed slot machines)
Which lever to play?
• the one that has paid off best, or maybe one that has not been tried?
Exploration policies
The agent chooses a random action a fraction 1/t of the time and follows the greedy policy otherwise.
– it does eventually converge to an optimal policy, but it can be extremely slow
A more sensible approach would give some weight to actions that the agent has not tried very often, while tending to avoid actions that are believed to be of low utility.
– assign a higher utility estimate to relatively unexplored state-action pairs
– value iteration may use the following update rule (see the sketch after this list)
U⁺(s) ← R(s) + γ max_a f( Σ_{s'} P(s'|s,a) U⁺(s'), N(s,a) )
• N(s,a) is the number of times action a has been tried in state s
• U⁺(s) denotes the optimistic estimate of the utility
• f(u,n) is called the exploration function; it determines how greed is traded off against curiosity (it should be increasing in u and decreasing in n)
– for example f(u,n) = R⁺ if n < N_e, otherwise u
(R⁺ is an optimistic estimate of the best possible reward obtainable in any state)
The fact that U⁺ rather than U appears on the right-hand side is very important.
• as exploration proceeds, the states and actions near the start might well be tried a large number of times
• if we used U, the more pessimistic utility estimate, the agent would soon become disinclined to explore further afield
• the benefits of exploration are propagated back from the edges of unexplored regions, so that actions that lead toward unexplored regions are weighted more highly
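A minimal sketch of the exploration function and the optimistic update above; the concrete values R⁺ = 1.0 and N_e = 5, and the model interfaces, are illustrative assumptions rather than values from the slides.

```python
def exploration_function(u, n, R_plus=1.0, N_e=5):
    """f(u, n): return the optimistic reward estimate R+ while the state-action
    pair has been tried fewer than N_e times, otherwise the utility estimate u."""
    return R_plus if n < N_e else u

def optimistic_update(s, R, P, actions, U_plus, N, gamma=1.0):
    """One update  U+(s) <- R(s) + gamma * max_a f(sum_s' P(s'|s,a) U+(s'), N(s,a)),
    using the same assumed model interfaces as the value-iteration sketch above;
    N[(s, a)] counts how often action a has been tried in state s."""
    return R[s] + gamma * max(
        exploration_function(
            sum(p * U_plus[s2] for s2, p in P[(s, a)].items()),
            N[(s, a)])
        for a in actions(s))
```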
Q-learning
Let us now consider how to construct an active temporal-difference learning agent.
– the update rule remains unchanged:
U(s) ← U(s) + α (R(s) + γ U(s') − U(s))
– the model-acquisition problem for the TD agent is identical to that for the ADP agent
There is an alternative TD method, called Q-learning:
– Q(s,a) denotes the value of doing action a in state s
– the Q-values are directly related to utility values as follows:
• U(s) = max_a Q(s,a)
– a TD agent that learns a Q-function does not need a model of the form P(s'|s,a)
• Q-learning is called a model-free method
– we can write a constraint equation that must hold at equilibrium:
• Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
• this does require that a model P(s'|s,a) also be learned!
– the TD approach requires no model of state transitions; all it needs are the Q-values:
• Q(s,a) ← Q(s,a) + α (R(s) + γ max_{a'} Q(s',a') − Q(s,a))
• the update is calculated whenever action a is executed in state s leading to state s'
Q-learning algorithm
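The Q-learning pseudocode on this slide is not reproduced here; the fragment below is a minimal sketch of the update rule above, with α = 0.1 as an illustrative learning rate and a hypothetical 4×3-world transition as usage.

```python
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, next_actions, alpha=0.1, gamma=1.0):
    """One Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a)).
    `next_actions` are the actions available in s'; pass an empty collection
    for a terminal successor so that the max term is taken as 0."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

# Example: one observed transition in the 4x3 world, (1,3) --Right--> (2,3),
# where R((1,3)) = -0.04 and all four moves are available in (2,3).
Q = defaultdict(float)
q_update(Q, (1, 3), 'Right', -0.04, (2, 3), ['Up', 'Down', 'Left', 'Right'])
```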
SARSA
State-Action-Reward-State-Action
• a close relative of Q-learning, with the following update rule:
Q(s,a) ← Q(s,a) + α (R(s) + γ Q(s',a') − Q(s,a))
• the rule is applied at the end of each s, a, r, s', a' quintuplet, i.e. after applying action a'
Comparison of SARSA and Q-learning:
– for a greedy agent the two algorithms are identical (the action a' maximizing Q(s',a') is always selected)
– when exploration is used, there is a subtle difference:
• Q-learning pays no attention to the actual policy being followed; it is an off-policy learning algorithm (it can learn how to behave well even when guided by a random or adversarial exploration policy)
• SARSA is more realistic: it is better to learn a Q-function for what actually happens rather than what the agent would like to happen
– this matters, for example, when the overall policy is even partly controlled by other agents
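For comparison with the Q-learning sketch above, a minimal SARSA update (assuming the same defaultdict-style Q-table); the only difference is that the actually chosen next action a' replaces the max over actions.

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=1.0):
    """One SARSA update, applied once the whole (s, a, r, s', a') quintuplet is
    known, i.e. after the next action a' has actually been chosen:
    Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * Q(s',a') - Q(s,a))."""
    Q[(s, a)] += alpha * (reward + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```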
Final notes
Both Q-learning and SARSA learn the optimal policy, but they do so at a much slower rate than the ADP agent.
– the local updates do not enforce consistency among all the Q-values via the model
Is it better to learn a model and a utility function (ADP) or to learn an action-utility function with no model (Q-learning, SARSA)?
– one of the key historical characteristics of much of AI research is its adherence to the knowledge-based approach: the assumption that the best way to represent the agent function is to build a representation of some aspects of the environment in which the agent is situated
– the availability of model-free methods such as Q-learning might suggest that the knowledge-based approach is unnecessary
– the intuition, however, is that as the environment becomes more complex, the advantages of the knowledge-based approach become more apparent
© 2016 Roman Barták
Department of Theoretical Computer Science and Mathematical Logic
[email protected]