Learning
Chapter 8: Adaptive Behavior
Thomas Hellström, Umeå University, Sweden

One of many definitions: "Learning produces changes within an agent that over time enable it to perform more efficiently within its environment."

Examples:
• Introducing new knowledge (facts, rules etc.)
• Generalizing
• Specializing
• Reorganizing information
• Creating new concepts
• Creating explanations of how things function
• Reusing past experiences

"Learning is not compulsory. Neither is survival."

Adaptation
Adjustments that make the agent more attuned to its environment. Examples:
• Adjusting behaviors
• Sensor adaptation
• Evolutionary adaptation (genotypic adaptation)
Adaptation may cause:
• Habituation
• Sensitization

Adaptive Control System
[Figure: a conventional feedback loop: stimuli enter the controller (behavior: stimuli ⇒ response), the controller acts on the controlled system, observations are fed back, and a learning element performs parametric adjustment of the controller.]
• Adaptation: adjustments of the behavior, e.g. of the parameters in a neural network.

"Problems" in Learning
• Credit assignment problem: how is credit or blame assigned to a particular piece of knowledge?
• Saliency problem: which features/stimuli are relevant to a specific task?
• New term problem: when should a new (abstract) concept be created?
• Indexing problem: how can memory be organized to support learning?
• Utility problem: when is it acceptable to forget things?

What can a Robot learn?
A behavior maps a stimulus to a response: β(s) = r
• B is a vector of all behaviors β1, β2, ...
• G is a vector of gain factors g1, g2, ...
• C is a coordination function
The total response ρ is given by ρ = C(G · B(S)), and a behavioral assemblage q is likewise q = C(G · B(S)).
The robot can learn:
• Which set of stimuli should be included for a response
• The mapping B, either pointwise or as a whole function
• The gain factors G
• The coordination function C
• Which set of behaviors should be included in an assemblage q
• B, G and C for q

Classifying learning
• Assimilation/Accommodation: Assimilation is the modification of existing behaviors (adaptation); accommodation is the acquisition of new behaviors.
• Numeric/Symbolic: Numeric learning uses functional mappings, e.g. neural networks; symbolic learning uses production rules, semantic networks.
• Inductive/Deductive: Inductive learning interpolates/extrapolates from examples; deductive learning extracts knowledge from a fixed data bank.
• Continuous/Batch: Continuous learning takes place during the interaction with the world; batch learning acquires all data before learning.

Reinforcement Learning
• Motivated by the psychological "Law of Effect": "Applying a reward immediately after the occurrence of a response increases its probability of recurring, while providing punishment after the response will decrease the probability" (Thorndike 1911).
• Unsupervised learning (?), i.e. there are no examples with pairs (stimulus, response).
• The reward is often discrete, e.g. -2, -1, 0, +1, +2, or even binary: pass/fail.

Reinforcement Learning
[Figure: conventional feedback with state information (sensor data) entering the controller (policy function: state ⇒ action), which acts on the controlled system; a critic (utility function: state ⇒ utility) converts the reward into an internal reinforcement signal.]
• Problem: how should the reward cause a change in the policy (credit assignment)?
• Method: the utility function maintained by the Critic.

Reinforcement learning comes in two main types:
• Adaptive Heuristic Critic (AHC) learning: the learning of the decision policy (state ⇒ action) in the controller is separated from learning the utility function (state ⇒ utility) in the critic.
• Q-learning: a single Q-function is used to model BOTH actions and states.

Q-learning
• A single utility Q-function is learned to evaluate BOTH actions and states.
• Q(x,a) is the utility of performing action a in state x.
• E(x) = max(Q(x,a1), Q(x,a2), Q(x,a3), Q(x,a4), Q(x,a5)) is the utility of state x.
• The decision policy (state ⇒ action) is often represented as a lookup table (which is why discrete states are preferred): one row per state x1, ..., x6, each described by sensor variables v1, ..., v5, holding the Q-values Q(x,a1), ..., Q(x,a5) for the five possible actions and the state utility E(x).

Q-learning algorithm:

  Initialize all Q(x,a) to 0
  Do forever
      Determine the current world state x via sensing
      90% of the time choose the action a that maximizes Q(x,a), otherwise pick a random action
      Execute a
      Determine the new world state y via sensing
      Determine the reward r
      Q(x,a) ⇐ Q(x,a) + β(r + λE(y) - Q(x,a))
      Update Q(x',a) for all states x' similar to x
  End

β: learning rate parameter
λ: "discount factor" between 0 and 1
r: the reward
E(y): utility of state y, i.e. max Q(y,a) over all actions a
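Below is a minimal Python sketch of the tabular Q-learning loop above, assuming a small discrete state and action space. The environment hooks sense_state(), execute() and get_reward() and the parameter values are illustrative placeholders, not from the slides, and the final generalization step (updating similar states x') is omitted.

  import random

  ACTIONS = range(5)   # the five actions a1..a5 from the lookup-table example
  BETA = 0.1           # learning rate (beta)
  LAMBDA = 0.9         # discount factor (lambda), between 0 and 1
  EPSILON = 0.1        # pick a random action 10% of the time

  Q = {}               # lookup table: Q[(state, action)], default 0

  def E(state):
      # Utility of a state: the maximum Q-value over all actions.
      return max(Q.get((state, a), 0.0) for a in ACTIONS)

  def q_learning_step(sense_state, execute, get_reward):
      # One pass through the loop in the algorithm above.
      x = sense_state()                                        # current state via sensing
      if random.random() < EPSILON:
          a = random.choice(list(ACTIONS))                     # exploration
      else:
          a = max(ACTIONS, key=lambda a: Q.get((x, a), 0.0))   # greedy choice
      execute(a)
      y = sense_state()                                        # new state via sensing
      r = get_reward()
      old = Q.get((x, a), 0.0)
      # Q(x,a) <= Q(x,a) + beta * (r + lambda*E(y) - Q(x,a))
      Q[(x, a)] = old + BETA * (r + LAMBDA * E(y) - old)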
Adaptive Heuristic Critic (AHC) learning
• The learning of the decision policy (state ⇒ action) in the controller is separated from learning the utility function (state ⇒ utility) in the critic.
• Example on p. 326-327.

Genetic Algorithms
"Biologically the species is the accumulation of the experiments of all successful individuals since the beginning" - H.G. Wells
• Inspired by nature: natural selection and the survival of the fittest.
• A method for optimizing a function f(x) over x ∈ Ω without derivatives (Holland 1989).
• The "individuals" are points x in the search space Ω.
• The "fitness" of a point is the function value f(x).
• Each iteration corresponds to a generation.
• The "fittest" individuals are combined and survive.
• The average fitness of the population increases.

The basic genetic algorithm:
1. Generate a population (points x in Ω).
2. Compute the fitness function for each x.
3. Select points with a probability proportional to their fitness.
4. Apply the genetic operations Reproduction, Crossover and Mutation to produce the next generation of individuals (points).
5. Repeat from 2 until convergence.

Genetic Operations
• Individuals are selected with a probability based on their fitness.
[Figure: example chromosomes of real-valued genes (e.g. 1.3 6.1 2.8 12.9 ...) illustrating Reproduction (copying an individual), Crossover (swapping gene segments between two parents) and Mutation (randomly altering a single gene).]

Genetic Algorithms for Learning Behavioral Control
• The behavior is parameterized, e.g. by the gains used to coordinate goal attraction, obstacle avoidance, noise, ...
• The fitness function (objective function) fz(x) is computed by letting the individuals act in either a simulated or a real world, e.g.
  f = z0·number_of_collisions + z1·number_of_steps + z2·distance_travelled
• The weights z control what kind of behavior we will achieve: SAFE, FAST or DIRECT (p. 335-336).
• The fitness function corresponds to the environmental demands in real evolution.

Genetic Algorithms in a Braitenberg Vehicle
• Stimuli: S1, ..., S8 (eight infrared proximity detectors).
• Response: (L, R) (left and right motor control).
• The behavior (the mapping stimuli ⇒ response) is represented by two simple neural networks:
  L = Σ vi·Si + v0
  R = Σ wi·Si + w0
• Algorithm:
  1) Generate 100 robots (i.e. 100 weight vectors w and v).
  2) Let them "live", one by one, for 20 seconds and compute the fitness values f([w v]).
  3) Select some of the best for reproduction (reproduction, recombination, mutation).
  4) Repeat from 2) until the end of time.

The fitness function
• Should guide the genetic operations towards the solution, i.e. it mustn't be too "sharp".
[Figure: fitness plotted against a behavioral parameter.]
• The fitness function f([w v]) is computed by letting the robot live for 20 seconds:

  V=0; fit=0; n=0; max_s=10;
  while alive
      sample S1,..,S8
      compute (L,R) and execute the action (run the motors)
      V = V + (L+R)/(2*max_s);        % reward high speed
      dv = abs(L-R)/(2*max_s);        % amount of turning (small dv means a straight trajectory)
      i = max(S1,..,S8)/1023;         % the highest sensor activity, normalized to [0,1]
      fit = fit + (1-sqrt(dv))*(1-i); % accumulate: large when driving straight and far from obstacles
      n = n+1;
  end
  f = abs(V)/n * fit/n;               % total fitness of this individual
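To make the evolutionary loop concrete, here is a minimal Python sketch of evolving the weight vectors [w v] of the two linear networks with fitness-proportional selection, crossover and mutation. The population size matches the slides (100 robots); the mutation parameters and the evaluate() hook, which stands in for the 20-second trial above, are illustrative assumptions.

  import random

  POP_SIZE = 100        # 100 robots, i.e. 100 weight vectors [w v]
  N_WEIGHTS = 18        # w0..w8 and v0..v8
  MUTATION_RATE = 0.05  # probability of perturbing an individual weight

  def random_individual():
      return [random.uniform(-1.0, 1.0) for _ in range(N_WEIGHTS)]

  def crossover(parent_a, parent_b):
      # Single-point crossover of two weight vectors.
      cut = random.randint(1, N_WEIGHTS - 1)
      return parent_a[:cut] + parent_b[cut:]

  def mutate(individual):
      # Perturb each weight with a small probability.
      return [w + random.gauss(0.0, 0.1) if random.random() < MUTATION_RATE else w
              for w in individual]

  def evolve(evaluate, generations=50):
      # evaluate(individual) should run the 20-second trial and return f([w v]).
      population = [random_individual() for _ in range(POP_SIZE)]
      for _ in range(generations):
          fitness = [evaluate(ind) for ind in population]   # let each robot "live"
          weights = [f + 1e-9 for f in fitness]             # guard against an all-zero generation
          def select():
              return random.choices(population, weights=weights, k=1)[0]
          population = [mutate(crossover(select(), select()))
                        for _ in range(POP_SIZE)]            # next generation
      return max(population, key=evaluate)                   # best individual found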
Learning and Natural Selection: The Baldwin Effect
• There are indications that individuals who have learned a successful behaviour pass it on to their children!
• Baldwin explained this with "ordinary" natural selection:
  - If we were born empty we would have to learn everything.
  - If we could learn everything we wouldn't need any natural selection.
• Learned behavior does not change the genes. Instead, learning has the effect of smoothing the fitness function, so individuals who can learn are favoured by evolution. It therefore looks as if the acquired skills are inherited, because the ability to learn is inherited. (A small numerical sketch is given at the end of this section.)
[Figure: fitness versus a behavioral parameter, with one curve for individuals who can learn and a sharper one for those who cannot; learning moves an individual's parameter from its inherited value w0 towards a better value w1.]

Genetic Algorithms for Learning Behavioral Control
• The behavior is parameterized, e.g. as a neural network.
• On-Line Evolution (Steels 1994): the robot carries a population of concurrent behaviours (individuals) that compete for actuator control. Fitness is computed for the robot, and the fitness together with each behaviour's impact on the robot controls the reproduction of the behaviours (p. 338).
• Evolving Form Concurrently with Control (Sims 1994): joints and sensors are combined by the genetic operators to create new "life forms" (p. 340-341).

Genetic Algorithms in Robotics
Drawbacks:
• Very time consuming, and usually we don't have all the time in the world. Consequences:
  - The robots are usually simulated, at least to start with.
  - The evolved behaviors are very simple.
• Fit individuals will tend to "take over" the population, so we may only find a LOCAL maximum. Solutions:
  - Geographical barriers, such as islands with limited inter-breeding.
Advantages:
• Model-free optimization of behaviors.
• Mimics the real world, and can be used to study evolution.
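As a closing illustration of the Baldwin effect described above, here is a minimal numerical sketch in Python of how lifetime learning smooths a fitness landscape. The landscape f(), the learning rule and all parameter values are invented purely for illustration; they do not come from the slides.

  def f(w):
      # A deliberately "sharp" fitness landscape: reward only very close to w = 3.
      return 1.0 if abs(w - 3.0) < 0.1 else 0.0

  def fitness_without_learning(w):
      # An individual that cannot learn is stuck with its inherited parameter.
      return f(w)

  def fitness_with_learning(w, reach=1.5, trials=31):
      # An individual born at w may adjust its behavioral parameter within
      # +/- reach during its lifetime and keeps the best setting it finds.
      # Seen by selection, the sharp peak becomes a broad plateau around w = 3.
      candidates = [w + reach * (2.0 * k / (trials - 1) - 1.0) for k in range(trials)]
      return max(f(c) for c in candidates)

  # Born at w = 2.0: fitness 0.0 without learning but 1.0 with learning, so
  # genotypes merely near the peak are already favoured, and selection can
  # then move the inherited parameter itself towards the peak.
  print(fitness_without_learning(2.0), fitness_with_learning(2.0))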