Chapter 8: Adaptive Behavior

Thomas Hellström
Umeå University
Sweden

Learning

One of many definitions:
"Learning produces changes within an agent that over time enable it to perform more efficiently within its environment"

Examples:
• Introducing new knowledge (facts, rules etc.)
• Generalizing
• Specializing
• Reorganising information
• Creating new concepts
• Creating explanations of how things function
• Reusing past experiences

"Learning is not compulsory. Neither is survival"

© Thomas Hellström 2001
Adaptation

Adjustments to make the agent more attuned to its environment.

Examples:
• Adjusting behaviors
• Sensor adaptation
• Evolutionary adaptation (genotypic adaptation)

Adaptation: adjustments of the behavior, e.g. parameters in a neural network.

Adaptation may cause:
• Habituation
• Sensitization

Adaptive Control System

[Figure: conventional feedback loop. Stimuli enter the Controller (behavior: stimuli ⇒ response), which sends an action to the Controlled system; observations of the outcome feed a Learning element (parametric adjustment) that tunes the controller.]

"Problems" in Learning

• Credit assignment problem:
  How is credit or blame assigned to a particular piece of knowledge?
• Saliency problem:
  What features/stimuli are relevant to a specific task?
• New term problem:
  When should a new (abstract) concept be created?
• Indexing problem:
  How can memory be organized to support learning?
• Utility problem:
  When is it acceptable to forget things?

What can a Robot learn?

Each behavior β_i maps a stimulus to a response: β_i(s_i) = r_i
B is a vector of all behaviors β_1, β_2, β_3, …
G is a vector of gain factors g_1, g_2, …
C is a coordination function.

The total response ρ is given by:
ρ = C(G • B(S))

Behavioral assemblages q:
q = C(G • B(S))

The robot can learn:
• What set of stimuli should be included for a response
• The mapping B, either pointwise or the whole function
• The gain factors G
• The coordination function C
• What set of behaviors should be included in an assemblage q
• B, G and C for q
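A minimal Python sketch of ρ = C(G • B(S)). The two behaviors, the gain values and the vector-sum coordination function are assumptions chosen for illustration; the formula itself does not prescribe them.

import numpy as np

# Each behavior beta_i maps its stimulus s_i to a response vector r_i,
# here a 2D motion vector. Both behaviors are hypothetical examples.
def move_to_goal(goal_direction):
    return np.array(goal_direction)        # head straight for the goal

def avoid_obstacle(obstacle_direction):
    return -np.array(obstacle_direction)   # move away from the obstacle

def coordinate(gains, responses):
    # C: one common (assumed) choice is a gain-weighted vector sum
    return sum(g * r for g, r in zip(gains, responses))

# S: the stimuli, one per behavior
S = [(1.0, 0.0),   # goal seen straight ahead
     (0.6, 0.8)]   # obstacle ahead-left

B = [move_to_goal, avoid_obstacle]   # B: vector of behaviors
G = [1.0, 2.0]                       # G: gain factors (learnable)

# rho = C(G * B(S)): total response of the assemblage
responses = [beta(s) for beta, s in zip(B, S)]
rho = coordinate(G, responses)
print(rho)   # [-0.2, -1.6]

Learning any of the items listed above then amounts to adjusting S, B, G or C in this sketch.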
Reinforcement Learning

• Motivated by the psychological concept "the Law of Effect":
  "Applying a reward immediately after the occurrence of a response increases its probability of recurring, while providing punishment after the response will decrease the probability" (Thorndike 1911)
• Unsupervised learning (?)
  I.e. there are no examples with pairs (stimulus, response)
• The reward is often discrete: -2, -1, 0, +1, +2, or even binary: pass/fail

Classifying learning

• Assimilation/Accommodation:
  Assimilation: modifying existing behaviors (adaptation)
  Accommodation: acquisition of new behaviors
• Numeric/Symbolic:
  Numeric: functional mappings, e.g. neural networks
  Symbolic: production rules, semantic networks
• Inductive/Deductive:
  Inductive: interpolate/extrapolate from examples
  Deductive: extract knowledge from a fixed data bank
• Continuous/Batch:
  Continuous: during the interaction with the world
  Batch: acquisition of data before learning
Reinforcement Learning

[Figure: the conventional feedback loop extended for reinforcement learning. State information (sensor data) enters the Controller (policy function: state ⇒ action), which sends an Action to the Controlled system. A Critic (utility function: state ⇒ utility) receives the external Reward and produces an Internal Reinforcement signal that adjusts the controller.]

• Problems: How should the reward cause a change in the policy (credit assignment)?
  Method: the utility function used by the Critic.

Reinforcement Learning

Two types:
• Adaptive Heuristic Critic (AHC) learning:
  The learning of the decision policy (state ⇒ action) in the controller is separated from learning the utility function (state ⇒ utility) in the critic.
• Q-learning:
  A single Q-function is used to model BOTH actions and states.
Q-learning

• A single utility Q-function is learned to evaluate BOTH actions and states
• The decision policy (state ⇒ action) is often represented as a lookup table (that's why discrete states are preferred…)

[Table: one row per state x_1 … x_6. The first columns hold the state description v_1 … v_5, the next five columns the Q-values Q(x,a_1) … Q(x,a_5) for the 5 possible actions, and the last column the utility E(x) of the state.]

• Q(x,a) is the utility of doing action a in state x
• E(x) = max( Q(x,a_1), Q(x,a_2), Q(x,a_3), Q(x,a_4), Q(x,a_5) )

Q-learning Algorithm:

Initialize all Q(x,a) to 0
Do Forever
  Determine current world state x via sensing
  90% of the time choose the action a that maximizes Q(x,a),
  else pick a random action
  Execute a
  Determine new current world state y via sensing
  Determine reward r
  Q(x,a) ⇐ Q(x,a) + β(r + λE(y) - Q(x,a))
  Update Q(x',a) for all x' similar to x
End Do

β : learning rate parameter
r : the reward
λ : "discount factor" between 0 and 1
E(y) : utility of state y; max Q(y,a) over all actions a
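A minimal Python sketch of the tabular update rule above. The environment interface (the sense, execute and reward callbacks) is hypothetical; only the Q-table, the 90/10 action choice and the update Q(x,a) ⇐ Q(x,a) + β(r + λE(y) - Q(x,a)) follow the slide, and the "update similar states" step is omitted.

import random
from collections import defaultdict

beta, lam = 0.1, 0.9          # learning rate and discount factor
actions = range(5)            # a_1..a_5 from the table above
Q = defaultdict(float)        # Q[(x, a)], all values start at 0

def E(x):                     # E(x) = max over a of Q(x, a)
    return max(Q[(x, a)] for a in actions)

def q_learning_step(sense, execute, reward):
    # sense/execute/reward are placeholders for the robot's actual I/O
    x = sense()
    if random.random() < 0.9:                      # 90% of the time: greedy ...
        a = max(actions, key=lambda a: Q[(x, a)])
    else:                                          # ... else pick a random action
        a = random.choice(list(actions))
    execute(a)
    y = sense()                                    # new world state
    r = reward(x, a, y)
    Q[(x, a)] += beta * (r + lam * E(y) - Q[(x, a)])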
Adaptive Heuristic Critic (AHC) learning

• The learning of the decision policy (state ⇒ action) in the controller is separated from learning the utility function (state ⇒ utility) in the critic.
• Example: p. 326-327

Genetic Algorithms

"Biologically the species is the accumulation of the experiments of all successful individuals since the beginning" - H.G. Wells

• Inspired by nature:
  natural selection and the survival of the fittest
• A method for function optimization, max f(x) over x ∈ Ω, without derivatives. (Holland 1989)
• The "individuals" are points x in the search space Ω
• The "fitness" of a point is the function value f(x)
• Each iteration corresponds to a generation
• The "fittest" individuals are combined and survive
• The average fitness of the population increases
Genetic Operations

Individuals are selected with a probability based on their fitness.

[Figure: example chromosomes of real-valued genes, showing the Reproduce, Crossover and Mutate operations applied to selected individuals.]

Genetic Algorithms

1. Generate a population (points x in Ω)
2. Compute the fitness function for each x
3. Select points with a probability proportional to their fitness
4. Execute the genetic operations Reproduce, Crossover and Mutation to produce the next generation of individuals (points)
5. Repeat from 2 until convergence.
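The five steps above, written out as a small Python sketch for maximizing a function f over Ω. The chromosome encoding (a list of real-valued genes, as in the figure), the selection scheme and the operator rates are assumptions chosen for illustration.

import random

def evolve(f, dim, pop_size=50, generations=100, sigma=0.5, mut_rate=0.1):
    # 1. Generate a population of points x in Omega (random real vectors; pop_size assumed even)
    pop = [[random.uniform(-10, 10) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Compute the fitness function for each x
        fitness = [f(x) for x in pop]
        # 3. Select parents with probability proportional to fitness
        #    (shifted so that all selection weights are positive)
        lo = min(fitness)
        weights = [fit - lo + 1e-6 for fit in fitness]
        parents = random.choices(pop, weights=weights, k=pop_size)
        # 4. Genetic operations: crossover and mutation produce the next generation
        next_pop = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            cut = random.randrange(1, dim) if dim > 1 else 0
            child1 = a[:cut] + b[cut:]            # one-point crossover
            child2 = b[:cut] + a[cut:]
            for child in (child1, child2):
                if random.random() < mut_rate:    # mutate one gene slightly
                    j = random.randrange(dim)
                    child[j] += random.gauss(0, sigma)
                next_pop.append(child)
        pop = next_pop
    # 5. Here we simply stop after a fixed number of generations instead of testing convergence
    return max(pop, key=f)

# Example: maximize f(x) = 20 - (x1-3)^2 - (x2+1)^2 over R^2
best = evolve(lambda x: 20 - (x[0] - 3)**2 - (x[1] + 1)**2, dim=2)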
Genetic Algorithms for Learning Behavioral Control

• The behavior is parameterized.
  E.g. gains for coordination of goal attraction, obstacle avoidance, noise...
• The fitness function (objective function) f_z(x):
  Computed by letting the individuals act in either a simulated or real world. E.g:
  f = z_0·number_of_collisions + z_1·number_of_steps + z_2·distance_travelled
• Controls what kind of behavior we will achieve:
  SAFE, FAST, DIRECT (p. 335-336)
• Corresponds to the environmental demands in real evolution.

Genetic Algorithms in a Braitenberg Vehicle

Stimuli: S_1..S_8, 8 infrared proximity detectors
Response: (L, R) (left and right motor control)

The behavior (mapping stimuli ⇒ response) is represented by two neural networks:
  R = Σ w_i·S_i + w_0
  L = Σ v_i·S_i + v_0

Algorithm:
1) Generate 100 robots (i.e. 100 weight vectors w and v)
2) Let them "live" one by one for 20 seconds and compute fitness values f([w v])
3) Select some of the best for reproduction (reproduction, recombination, mutation)
4) Repeat from 1) until the end of time
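A small Python sketch of this controller: two linear units map the 8 IR readings to the left and right motor commands, and the chromosome is simply the 18 weights [w v]. The weight range and the example sensor values are assumptions.

import random

N_SENSORS = 8   # S_1..S_8, infrared proximity readings

def motor_commands(S, w, v):
    # R = sum(w_i * S_i) + w_0,  L = sum(v_i * S_i) + v_0
    R = sum(wi * si for wi, si in zip(w[1:], S)) + w[0]
    L = sum(vi * si for vi, si in zip(v[1:], S)) + v[0]
    return L, R

def random_individual():
    # chromosome [w v]: 9 + 9 weights, here drawn uniformly in [-1, 1]
    return [random.uniform(-1, 1) for _ in range(2 * (N_SENSORS + 1))]

def split(chromosome):
    return chromosome[:N_SENSORS + 1], chromosome[N_SENSORS + 1:]

population = [random_individual() for _ in range(100)]   # "generate 100 robots"
w, v = split(population[0])
L, R = motor_commands([0.1] * N_SENSORS, w, v)            # one control step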
The fitness function

Should guide the genetic operations towards the solution, i.e. it mustn't be too "sharp".

[Figure: fitness plotted against a behavioral parameter.]

Genetic Algorithms in a Braitenberg Vehicle

The fitness function f([w v]) is computed by letting the robot live for 20 seconds:

V=0; fit=0; n=0; max_s=10;
while alive
  sample S_1,..,S_8
  compute (L,R) and execute the action (run the motors)
  V = V + (L+R)/(2*max_s);         % reward high speed
  dv = abs(L-R)/(2*max_s);         % turning; 0 means a straight trajectory
  i = max(S_1,..,S_8)/1023;        % the highest sensor activity, normalized to [0,1]
  fit = fit + (1-sqrt(dv))*(1-i);  % accumulate: straight and far from obstacles
  n = n+1;
end
f = abs(V)/n * fit/n;              % the total fitness of this individual
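The same fitness computation rendered as a Python sketch. The robot interface (read_sensors, step_motors) and the fixed step count standing in for the 20 seconds are placeholders; the three factors, speed, straightness and distance to obstacles, follow the pseudocode above.

import math

def evaluate(read_sensors, step_motors, n_steps=200, max_s=10):
    # read_sensors() -> S_1..S_8 in [0, 1023]; step_motors(S) -> (L, R) executed
    # (both are placeholders for the robot's real I/O)
    V, fit = 0.0, 0.0
    for _ in range(n_steps):                      # "while alive", about 20 seconds
        S = read_sensors()
        L, R = step_motors(S)                     # compute (L,R) and run the motors
        V += (L + R) / (2 * max_s)                # reward high speed
        dv = abs(L - R) / (2 * max_s)             # turning: 0 means straight
        i = max(S) / 1023                         # highest IR activity in [0, 1]
        fit += (1 - math.sqrt(dv)) * (1 - i)      # straight and far from obstacles
    return abs(V) / n_steps * fit / n_steps       # total fitness of this individual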
Learning and Natural Selection

• If we were born empty we would have to learn everything.
• If we could learn everything we wouldn't need any natural selection.
• Learned behavior does not change the genes.
• Still, there are indications that individuals who have learned a successful behaviour pass it on to their children!

The Baldwin Effect

• Baldwin explained this with "ordinary" natural selection:
  Learning has the effect of smoothing the fitness function. In this way individuals who can learn are favoured by evolution, i.e. it looks as if the acquired skills are inherited (because the ability to learn is inherited).

[Figure: fitness plotted against a behavioral parameter; the fitness curve for individuals who can learn (who can move from w_0 to w_1 by learning) is smoother than the curve for those who can't learn.]
Genetic Algorithms for Learning Behavioral Control

• The behavior is parameterized. E.g. a neural network.

+
• Model-free optimization of behaviors
• Mimics the real world; is used to study evolution

-
• Very time consuming
  Usually we don't have all the time in the world.
  Consequences:
  - The robots are usually simulated, at least to start with.
  - The evolved behaviors are very simple.
• Fit individuals will tend to "take over" the population
  (we will just find a LOCAL maximum)
  Solutions:
  - Geographical barriers such as islands with limited inter-breeding

Genetic Algorithms in Robotics

• On-Line Evolution (Steels 1994):
  The robot has a population of concurrent behaviours (individuals) that compete for actuator control. Fitness is computed for the robot, and the fitness together with each behaviour's impact on the robot controls reproduction of the behaviours. p. 338.
• Evolving Form Concurrently with Control (Sims 1994):
  Joints and sensors are combined by the genetic operators to create new "life forms" (p. 340-341).