Artificial Intelligence
Roman Barták
Department of Theoretical Computer Science and Mathematical Logic

Motivation
Consider the problem of learning to play chess. A supervised learning agent needs to be told the correct move for each position it encounters.
– but such feedback is seldom available
In the absence of feedback, an agent can learn a transition model for its own moves and can perhaps learn to predict the opponent's moves.
– without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make

Feedback
A typical kind of feedback is called a reward, or reinforcement.
– in games like chess, the reinforcement is received only at the end of the game
– in other problems, the rewards come more frequently (in ping-pong, each point scored can be considered a reward)
The reward is part of the input percept, but the agent must be "hardwired" to recognize that part as a reward.
– pain and hunger are negative rewards
– pleasure and food intake are positive rewards

Reinforcement learning
Reinforcement learning might be considered to encompass all of artificial intelligence:
– an agent is placed in an environment and must learn to behave successfully therein
– in many complex domains, reinforcement learning is the only feasible way to train a program to perform at high levels
We will consider three of the agent designs:
– a utility-based agent learns a utility function and uses it to select actions that maximize the expected outcome utility
• it must also have a model of the environment, because it must know the states to which its actions will lead
– a Q-learning agent learns an action-utility function (Q-function) giving the expected utility of taking a given action in a given state
– a reflex agent learns a policy that maps directly from states to actions
The task of passive learning is to learn the utilities of the states, where the agent's policy is fixed.
In active learning the agent must also learn what to do.
– It involves some form of exploration: an agent must experience as much as possible of its environment in order to learn how to behave in it.

Passive reinforcement learning
The agent's policy is fixed (in state s, it always executes the action π(s)).
The goal is to learn how good the policy is, that is, to learn the utility function
U^π(s) = E[ Σ_{t=0,…,∞} γ^t R(s_t) ]
The agent does not know the transition model P(s'|s,a), nor does it know the reward function R(s).
A core approach:
– the agent executes a set of trials in the environment using its policy π
– its percepts supply both the current state and the reward received at that state
Typical trials look like this:
(1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (4,3)+1
(1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (3,2)-0.04 → (3,3)-0.04 → (4,3)+1
(1,1)-0.04 → (2,1)-0.04 → (3,1)-0.04 → (3,2)-0.04 → (4,2)-1

Direct utility estimation
The idea is that the utility of a state is the expected total reward from that state onward (the expected reward-to-go).
– for state (1,1) we get a sample total reward of 0.72 in the first trial
– for state (1,2) we get two samples, 0.76 and 0.84, in the first trial
The same state may appear in several trials (or even several times in the same trial), so we keep a running average for each state.
Direct utility estimation is just an instance of supervised learning (input = state, output = reward-to-go).
Major inefficiency:
– The utilities of states are not independent!
– The utility values obey the Bellman equations for a fixed policy:
U^π(s) = R(s) + γ Σ_{s'} P(s'|s,π(s)) U^π(s')
– We search for U in a hypothesis space that is much larger than it needs to be (it includes many functions that violate the Bellman equations); for this reason, the algorithm often converges very slowly.

Adaptive dynamic programming
An adaptive dynamic programming (ADP) agent takes advantage of the Bellman equations.
The agent learns:
– the transition model P(s'|s,π(s))
• using the frequency with which s' is reached when executing a in s; for example, P((2,3)|(1,3),Right) = 2/3
– the rewards R(s)
• directly observed
The utility of state s is then calculated from the Bellman equations, for example using modified policy iteration.

ADP algorithm
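The "ADP algorithm" slide corresponds to pseudocode for a passive ADP agent. As a rough illustration (not the lecture's own code), here is a minimal Python sketch under assumed interfaces: the fixed policy is a dict pi mapping states to actions, rewards are attached to states as in the slides, and utilities are recomputed by simple iterative policy evaluation. All names are illustrative.

```python
from collections import defaultdict

class PassiveADP:
    """Sketch of a passive ADP learner: estimate P(s'|s,a) and R(s) from
    observed transitions, then evaluate the fixed policy pi."""

    def __init__(self, pi, gamma=1.0):
        self.pi = pi                      # fixed policy: state -> action (terminals absent)
        self.gamma = gamma                # discount factor
        self.R = {}                       # observed rewards R(s)
        self.N_sa = defaultdict(int)      # visit counts N(s, a)
        self.N_sas = defaultdict(int)     # outcome counts N(s, a, s')
        self.U = defaultdict(float)       # utility estimates U^pi(s)

    def P(self, s2, s, a):
        """Maximum-likelihood transition model P(s2 | s, a)."""
        n = self.N_sa[(s, a)]
        return self.N_sas[(s, a, s2)] / n if n else 0.0

    def observe(self, s, r, a, s2, r2):
        """Record one transition s --a--> s2, with reward r at s and r2 at s2."""
        self.R[s], self.R[s2] = r, r2
        self.N_sa[(s, a)] += 1
        self.N_sas[(s, a, s2)] += 1
        self.policy_evaluation()

    def policy_evaluation(self, sweeps=20):
        """Iterate U(s) <- R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s')."""
        states = list(self.R)
        for _ in range(sweeps):
            for s in states:
                a = self.pi.get(s)
                if a is None:             # terminal state: utility is its reward
                    self.U[s] = self.R[s]
                    continue
                self.U[s] = self.R[s] + self.gamma * sum(
                    self.P(s2, s, a) * self.U[s2] for s2 in states)
```

Rerunning a full policy evaluation after every observation is the simplest choice; the modified policy iteration mentioned above corresponds to doing only a few such sweeps, reusing the previous utility estimates as a starting point.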
Temporal-difference learning
We can use the observed transitions to adjust the utilities of the states so that they agree with the constraint equations.
Example:
– consider the transitions from (1,3) to (2,3)
– suppose that, as a result of the first trial, the utility estimates are U^π(1,3) = 0.84 and U^π(2,3) = 0.92
– if this transition occurred all the time, we would expect the utility to obey the equation (if γ = 1)
U^π(1,3) = −0.04 + U^π(2,3)
– so the utility would be U^π(1,3) = 0.88
– hence the current estimate U^π(1,3) might be a little low and should be increased
In general, we apply the following update (α is the learning rate parameter):
U^π(s) ← U^π(s) + α (R(s) + γ U^π(s') − U^π(s))
The above formula is often called the temporal-difference (TD) equation.

TD algorithm

Comparison of ADP and TD
Both the ADP and TD approaches try to make local adjustments to the utility estimates in order to make each state "agree" with its successors.
• Temporal difference
– does not need a transition model to perform updates
– adjusts a state to agree with its observed successor
– a single adjustment per observed transition
• Adaptive dynamic programming
– adjusts a state to agree with all of its successors
– makes as many adjustments as it needs to restore consistency between the utility estimates
(Figure: learning curves of the utility estimates; the annotation marks the first time a trial reaches the state with reward −1.)

Active reinforcement learning
A passive learning agent has a fixed policy that determines its behavior. An active agent must decide what actions to take.
Let us begin with the adaptive dynamic programming agent:
– the utilities it needs to learn are defined by the optimal policy; they obey the Bellman equations
U(s) = R(s) + γ max_a Σ_{s'} P(s'|s,a) U(s')
– these equations can be solved to obtain the utility function
What to do at each step?
– the agent can extract an optimal action to maximize the expected utility
– then it should simply execute the action the optimal policy recommends
– or should it?

Greedy agent
An example of a policy found by the active ADP agent: this is not an optimal policy!
What happened?
The agent found a route (2,1), (3,1), (3,2), (3,3) to the goal with reward +1.
After experimenting with minor variations, it sticks to that policy.
As it does not learn the utilities of the other states, it never finds the optimal route via (1,2), (1,3), (2,3), (3,3).
We call this agent the greedy agent.

Properties of greedy agents
How can it be that choosing the optimal action leads to suboptimal results?
– the learned model is not the same as the true environment; what is optimal in the learned model can therefore be suboptimal in the true environment
– actions do more than provide rewards; they also contribute to learning the true model by affecting the percepts that are received
– by improving the model, the agent will receive greater rewards in the future
An agent therefore must make a tradeoff between exploitation to maximize its reward and exploration to maximize its long-term well-being.

Exploration
What is the right trade-off between exploration and exploitation?
– pure exploration is of no use if one never puts that knowledge into practice
– pure exploitation risks getting stuck in a rut
Basic idea:
– at the beginning, strike out into the unknown in the hope of discovering a new and better life
– with greater understanding, less exploration is necessary
An n-armed bandit:
– a slot machine with n levers (or n one-armed slot machines)
– which lever to play? the one that has paid off best, or maybe one that has not been tried yet?

Exploration policies
Suppose the agent chooses a random action a fraction 1/t of the time and follows the greedy policy otherwise.
– it does eventually converge to an optimal policy, but it can be extremely slow
A more sensible approach would give some weight to actions that the agent has not tried very often, while tending to avoid actions that are believed to be of low utility:
– assign a higher utility estimate to relatively unexplored state-action pairs
– value iteration may use the following update rule (a sketch follows at the end of this section):
U⁺(s) ← R(s) + γ max_a f( Σ_{s'} P(s'|s,a) U⁺(s'), N(s,a) )
• N(s,a) is the number of times action a has been tried in state s
• U⁺(s) denotes the optimistic estimate of the utility
• f(u,n) is called the exploration function; it determines how greed is traded off against curiosity (it should be increasing in u and decreasing in n)
– for example, f(u,n) = R⁺ if n < N_e, otherwise u (R⁺ is an optimistic estimate of the best possible reward obtainable in any state)
The fact that U⁺, rather than U, appears on the right-hand side is very important.
• As exploration proceeds, the states and actions near the start might well be tried a large number of times.
• If we used U, the more pessimistic utility estimate, then the agent would soon become disinclined to explore further afield.
• The benefits of exploration are propagated back from the edges of unexplored regions, so that actions that lead toward unexplored regions are weighted more highly.
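The exploration-function update above can be sketched in a few lines of Python. This is only an illustration under assumed inputs: P is a callable transition model, R a reward table, N a table of visit counts, and U_plus the current optimistic utilities; the constants R_PLUS and N_E are assumed values, not values from the lecture.

```python
R_PLUS = 2.0   # optimistic estimate R+ of the best reward obtainable in any state (assumed)
N_E = 5        # try each state-action pair at least N_e times (assumed)

def f(u, n):
    """Exploration function: optimistic value R+ until (s, a) has been tried
    N_e times, then the ordinary utility estimate u."""
    return R_PLUS if n < N_E else u

def optimistic_sweep(states, actions, P, R, N, U_plus, gamma=0.9):
    """One sweep of U+(s) <- R(s) + gamma * max_a f(sum_s' P(s'|s,a) U+(s'), N(s,a))."""
    new_U = {}
    for s in states:
        acts = list(actions(s))
        if not acts:                      # terminal state: utility is its reward
            new_U[s] = R[s]
            continue
        new_U[s] = R[s] + gamma * max(
            f(sum(P(s2, s, a) * U_plus[s2] for s2 in states), N[(s, a)])
            for a in acts)
    return new_U
```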
Q-learning
Let us now consider how to construct an active temporal-difference learning agent.
– The update rule remains unchanged:
U(s) ← U(s) + α (R(s) + γ U(s') − U(s))
– The model acquisition problem for the TD agent is identical to that for the ADP agent.
There is an alternative TD method, called Q-learning:
– Q(s,a) denotes the value of doing action a in state s
– the Q-values are directly related to utility values as follows:
• U(s) = max_a Q(s,a)
– a TD agent that learns a Q-function does not need a model of the form P(s'|s,a)
• Q-learning is therefore called a model-free method
– we can write a constraint equation that must hold at equilibrium:
• Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')
• this equation does require that a model P(s'|s,a) also be learned!
– the TD approach, in contrast, requires no model of state transitions – all it needs are the Q-values:
• Q(s,a) ← Q(s,a) + α (R(s) + γ max_{a'} Q(s',a') − Q(s,a))
• the update is calculated whenever action a is executed in state s leading to state s'

Q-learning algorithm

SARSA (State-Action-Reward-State-Action)
• a close relative of Q-learning with the following update rule:
Q(s,a) ← Q(s,a) + α (R(s) + γ Q(s',a') − Q(s,a))
• the rule is applied at the end of each s, a, r, s', a' quintuplet, i.e. after applying action a'
Comparison of SARSA and Q-learning:
– for a greedy agent the two algorithms are identical (the action a' maximizing Q(s',a') is always selected)
– when exploration is involved, there is a subtle difference:
• Q-learning pays no attention to the actual policy being followed – it is an off-policy learning algorithm (it can learn how to behave well even when guided by a random or adversarial exploration policy)
• SARSA is more realistic: it is better to learn a Q-function for what will actually happen rather than what the agent would like to happen
– this matters if the overall policy is even partly controlled by other agents
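To make the difference concrete, here is a minimal tabular sketch of the two updates in Python. It is an illustration, not the lecture's code: Q is a dict keyed by (state, action), the parameter r stands for the reward R(s) observed in the current state, and ALPHA and GAMMA are assumed values for the learning rate α and discount γ.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9                 # assumed learning rate and discount
Q = defaultdict(float)                  # Q[(s, a)]: estimated action utility

def q_learning_update(s, a, r, s2, actions):
    """Off-policy: back up toward the best action available in s2."""
    target = r + GAMMA * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def sarsa_update(s, a, r, s2, a2):
    """On-policy: back up toward the action a2 that was actually chosen in s2."""
    target = r + GAMMA * Q[(s2, a2)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

The only difference is the backup target: Q-learning maximizes over the actions available in s', whereas SARSA uses the action a' the agent actually took, so the exploration policy shows up in SARSA's estimates but not in Q-learning's.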
Final notes
Both Q-learning and SARSA learn the optimal policy, but they do so at a much slower rate than the ADP agent.
– the local updates do not enforce consistency among all the Q-values via the model
Is it better to learn a model and a utility function (ADP), or to learn an action-utility function with no model (Q-learning, SARSA)?
– One of the key historical characteristics of much of AI research is its adherence to the knowledge-based approach; this amounts to the assumption that the best way to represent the agent function is to build a representation of some aspects of the environment in which the agent is situated.
– The availability of model-free methods such as Q-learning suggests that the knowledge-based approach is unnecessary.
– The intuition, however, is that as the environment becomes more complex, the advantages of a knowledge-based approach become more apparent.

© 2016 Roman Barták
Department of Theoretical Computer Science and Mathematical Logic
[email protected]