XCS Tutorial

Structure and Function of the XCS Classifier System
Stewart W. Wilson
Prediction Dynamics
Concord, MA
[email protected]
1
What is it?
• A learning machine (a program).
• Minimal a priori knowledge built in.
• Learns "on-line".
• Captures regularities in its environment.
2
What does it learn?
To get reinforcements (“rewards”, “payoffs”)
[Diagram: the environment sends inputs and payoffs to XCS; XCS sends actions back to the environment.]
(Not “supervised” learning—no prescriptive teacher.)
3
What inputs and outputs?
Inputs:
    Now: binary, e.g., 100101110 (like thresholded sensor values).
    Later: continuous, e.g., <43.0 92.1 7.4 ... 0.32>

Outputs:
    Now: discrete decisions or actions, e.g., 1 or 0 ("yes" or "no"),
    "forward", "back", "left", "right".
    Later: continuous, e.g., "head 34 degrees left".
4
What’s going on inside?
XCS contains rules (called classifiers), some of which
will match the current input. An action is chosen
based on the predicted payoffs of the matching rules.
<condition>:<action> => <prediction>.
Example: 01#1## : 1 => 943.2
Note this rule matches more than one input string:
010100
010101
010110
010111
011100
011101
011110
011111
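
A minimal Python sketch of this ternary matching (illustrative only; the function name is mine):

    def matches(condition: str, state: str) -> bool:
        """True if every non-# position of the condition equals the input bit."""
        return all(c == '#' or c == s for c, s in zip(condition, state))

    assert matches("01#1##", "010100")
    assert not matches("01#1##", "000100")   # the condition's second position requires a 1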
This adaptive “rule-based” system contrasts with
“PDP” systems such as NNs in which knowledge is
distributed.
5
How does the performance cycle work?
[Diagram: the performance cycle.]

The environment's current input (here 0011) enters through the detectors and is matched against the population [P], whose classifiers each carry a prediction p, error ε, and fitness F:

    #011 : 01    p = 43   ε = .01   F = 99
    11## : 00    p = 32   ε = .13   F =  9
    #0## : 11    p = 14   ε = .05   F = 52
    001# : 01    p = 27   ε = .24   F =  3
    #0#1 : 11    p = 18   ε = .02   F = 92
    1#01 : 10    p = 24   ε = .17   F = 15
    ...etc.

The classifiers that match (here the 1st, 3rd, 4th, and 5th) form the match set [M]. For each action represented in [M], a fitness-weighted prediction is computed, giving the prediction array (here nil, 42.5, nil, 16.6 for the four possible actions). One action (here 01) is selected; the members of [M] advocating it form the action set [A]; and the action goes out through the effectors (here "left"). The environment then returns a reward.
• For each action in [M], classifier predictions p are weighted by fitnesses F to get the system's net prediction in the prediction array.
• Based on the system predictions, an action is chosen
and sent to the environment.
• Some reward value is returned.
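
A sketch of the fitness-weighted prediction array, using the numbers from the figure above (the data layout is an illustrative assumption):

    def prediction_array(match_set):
        """match_set: (action, prediction, fitness) tuples from [M].
        Returns {action: fitness-weighted mean prediction}."""
        sums, weights = {}, {}
        for action, p, fitness in match_set:
            sums[action] = sums.get(action, 0.0) + p * fitness
            weights[action] = weights.get(action, 0.0) + fitness
        return {a: sums[a] / weights[a] for a in sums}

    # From the figure: action 01 -> (43*99 + 27*3)/(99+3) = 42.5,
    # action 11 -> (14*52 + 18*92)/(52+92) = 16.6; actions 00 and 10 stay nil.
    pa = prediction_array([("01", 43, 99), ("11", 14, 52), ("01", 27, 3), ("11", 18, 92)])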
6
How do rules acquire their predictions?

1. By "updating" the current estimate.

For each classifier C_j in the current [A],

    p_j ← p_j + α(R - p_j),

where R is the current reward and α is the learning rate.
This results in p_j being a "recency-weighted" average of previous reward values:

    p_j(t) = αR(t) + α(1-α)R(t-1) + α(1-α)^2 R(t-2) + ... + (1-α)^t p_j(0).
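
A minimal Python sketch of this update; repeated application produces the recency-weighted average above (the values are illustrative):

    def updated_prediction(p, reward, alpha=0.2):
        """One Widrow-Hoff step: p moves a fraction alpha of the way toward R."""
        return p + alpha * (reward - p)

    p = 10.0                        # a low initial prediction
    for r in [1000, 1000, 0, 1000]:
        p = updated_prediction(p, r)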
2. And by trying different actions, according to an explore/exploit regime.
A typical regime chooses a random action with
probability 0.5.
Exploration (e.g., random choice) is necessary in order to learn anything. But exploitation (picking the highest-prediction action) is necessary in order to make best use of what is learned.
There are many possible explore/exploit regimes,
including gradual changeover from mostly explore to
mostly exploit.
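
A sketch of the simple 50/50 regime mentioned above, assuming a prediction-array dict like the one built in the performance cycle:

    import random

    def choose_action(prediction_array, p_explore=0.5):
        """With probability p_explore pick a random action, otherwise the best one."""
        actions = list(prediction_array)
        if random.random() < p_explore:
            return random.choice(actions)                  # explore
        return max(actions, key=prediction_array.get)      # exploit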
7
Where do the rules come from?
• Usually, the “population” [P] is initially empty.
(It can also have random rules, or be seeded.)
• The first few rules come from "covering": if no existing rule matches the input, a rule is created to match, something like imprinting (see the sketch after this list).

    Input:         11000101
    Created rule:  1##0010# : 3 => 10
    (Random #'s and action, low initial prediction.)
• But primarily, new rules are derived from existing
rules.
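
A minimal sketch of the covering step referred to above (the #-probability, the number of actions, and the initial prediction are assumed values):

    import random

    def cover(input_string, n_actions=4, p_hash=0.33, initial_prediction=10.0):
        """Copy the input, turning each bit into '#' with probability p_hash,
        and attach a random action and a low initial prediction."""
        condition = "".join('#' if random.random() < p_hash else bit
                            for bit in input_string)
        return condition, random.randrange(n_actions), initial_prediction

    # e.g. cover("11000101") might return ("1##0010#", 3, 10.0)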
8
How are new rules derived?
• Besides its prediction p_j, each classifier's error and fitness are regularly updated.

    Error:              ε_j ← ε_j + α(|R - p_j| - ε_j)

    Accuracy:           κ_j ≡ ε_j^-n if ε_j > ε_0, otherwise ε_0^-n

    Relative accuracy:  κ_j′ ≡ κ_j / Σ_i κ_i, the sum taken over [A]

    Fitness:            F_j ← F_j + α(κ_j′ - F_j)
• Periodically, a genetic algorithm (GA) takes
place in [A].
Two classifiers C_i and C_j are selected with probability proportional to fitness. They are copied to form C_i′ and C_j′.

With probability χ, C_i′ and C_j′ are crossed to form C_i″ and C_j″, e.g.,

    10##11 : 1        10##1# : 1
    #0001# : 1   ⇒    #00011 : 1

C_i″ and C_j″ (or C_i′ and C_j′ if no crossover occurred), possibly mutated, are added to [P].
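
A minimal Python sketch of the error/accuracy/fitness updates and of one-point crossover (the dict-based classifier layout and the parameter values are illustrative assumptions):

    def update_fitnesses(action_set, reward, alpha=0.2, eps0=10.0, n=5):
        """Error, accuracy, relative accuracy, and fitness, as defined above.
        Each classifier is a dict with keys 'p', 'error', 'fitness'."""
        for cl in action_set:
            cl['error'] += alpha * (abs(reward - cl['p']) - cl['error'])
        kappas = [max(cl['error'], eps0) ** (-n) for cl in action_set]
        total = sum(kappas)
        for cl, kappa in zip(action_set, kappas):
            cl['fitness'] += alpha * (kappa / total - cl['fitness'])

    def crossover(cond1, cond2, point):
        """One-point crossover: crossing 10##11 and #0001# at point 5
        gives 10##1# and #00011."""
        return cond1[:point] + cond2[point:], cond2[:point] + cond1[point:]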
9
Can I see the overall process?
[Diagram: the performance cycle of the earlier figure, with the learning components added. After the reward arrives, the predictions, errors, and fitnesses of the classifiers in the action set [A] are updated; the GA runs periodically in [A]; and covering creates a matching classifier when no existing classifier matches the input.]

10
What happens to the “parents”?
They remain in [P], in competition with their
offspring.
But two classifiers are deleted from [P] in order to
maintain a constant population size.
Deletion is probabilistic, with probability
proportional to, e.g.:
• A classifier's average action set size a_j, estimated and updated like the other classifier statistics.
• a_j/F_j, if the classifier has been updated enough times; otherwise a_j/F_ave, where F_ave is the mean fitness in [P].

And other arrangements are used, all with the aim of balancing the resources (classifiers) devoted to each niche ([A]) while also eliminating low-fitness classifiers rapidly.
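
A sketch of roulette-wheel deletion using the a_j/F_j vote just described (the experience threshold and dict layout are assumptions):

    import random

    THETA_DEL = 20   # assumed "updated enough times" threshold

    def deletion_weights(population):
        """Vote a_j/F_j for experienced classifiers, otherwise a_j/F_ave."""
        f_ave = sum(cl['fitness'] for cl in population) / len(population)
        return [cl['asize'] / cl['fitness'] if cl['experience'] > THETA_DEL
                else cl['asize'] / f_ave
                for cl in population]

    def select_for_deletion(population):
        return random.choices(population, weights=deletion_weights(population), k=1)[0]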
11
What are the results like? — 1
Basic example for illustration: Boolean 6-multiplexer.
    101001 → F6 → 0

    F6 = x0'x1'x2 + x0'x1x3 + x0x1'x4 + x0x1x5

In general, the multiplexer string length is l = k + 2^k, k > 0.

    F20 = x0'x1'x2'x3'x4 + x0'x1'x2'x3x5 + x0'x1'x2x3'x6 + x0'x1'x2x3x7 +
          x0'x1x2'x3'x8 + x0'x1x2'x3x9 + x0'x1x2x3'x10 + x0'x1x2x3x11 +
          x0x1'x2'x3'x12 + x0x1'x2'x3x13 + x0x1'x2x3'x14 + x0x1'x2x3x15 +
          x0x1x2'x3'x16 + x0x1x2'x3x17 + x0x1x2x3'x18 + x0x1x2x3x19
01100010100100001000 → 0
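
A small Python reference implementation of the k-address-bit multiplexer, handy for checking the two examples above (the function is mine, not part of XCS):

    def multiplexer(bits: str, k: int) -> int:
        """The first k bits form an address selecting one of the 2**k data bits."""
        address = int(bits[:k], 2)
        return int(bits[k + address])

    assert multiplexer("101001", 2) == 0                 # the F6 example
    assert multiplexer("01100010100100001000", 4) == 0   # the F20 example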
12
What are the results like?— 2
13
What are the results like?— 3
Population at 5,000 problems in descending order of numerosity (first 40 of 77 shown).

No.  Condition : Act    PRED   ERR   FITN  NUM  GEN  ASIZ  EXPER   TST
 0.  11###0 : 1           0.   .00   884.   30  .50  31.2    287  4999
 1.  001### : 0           0.   .00   819.   24  .50  25.9    286  4991
 2.  01#1## : 1        1000.   .00   856.   22  .50  24.1    348  4984
 3.  01#1## : 0           0.   .00   840.   20  .50  21.8    263  4988
 4.  11###1 : 0           0.   .00   719.   20  .50  22.6    238  4972
 5.  001### : 1        1000.   .00   698.   19  .50  20.9    222  4985
 6.  01#0## : 0        1000.   .00   664.   18  .50  23.9    254  4997
 7.  10##1# : 1        1000.   .00   712.   18  .50  22.4    236  4980
 8.  000### : 0        1000.   .00   674.   17  .50  21.2    155  4992
 9.  10##0# : 0        1000.   .00   706.   17  .50  19.9    227  4990
10.  11###0 : 0        1000.   .00   539.   17  .50  24.5    243  4978
11.  10##1# : 0           0.   .00   638.   16  .50  20.0    240  4994
12.  01#0## : 1           0.   .00   522.   15  .50  23.5    283  4967
13.  000### : 1           0.   .00   545.   14  .50  20.9    110  4979
14.  10##0# : 1           0.   .00   425.   12  .50  23.0    141  4968
15.  11###1 : 1        1000.   .00   458.   11  .50  21.1     76  4983
16.  11##11 : 1        1000.   .00   233.    6  .33  22.1    130  4942
17.  0#00## : 1           0.   .00   210.    6  .50  23.1    221  4979
18.  11##01 : 1        1000.   .00   187.    5  .33  21.1     86  4983
19.  0110## : 1           0.   .00   168.    4  .33  19.1    123  4939
20.  11#1#0 : 0        1000.   .00   114.    4  .33  26.2    113  4978
21.  10##11 : 0           0.   .00   152.    4  .33  23.9     34  4946
22.  101#0# : 1           0.   .00   131.    3  .33  21.7    111  4968
23.  000#0# : 0        1000.   .00   117.    3  .33  22.8     57  4992
24.  111##0 : 0        1000.   .00    68.    3  .33  28.7     38  4978
25.  10#10# : 0        1000.   .00    46.    3  .33  20.6      4  4990
26.  10##11 : 1        1000.   .00    81.    3  .33  23.9    113  4950
27.  #1#0#0 : 0        1000.   .00    86.    3  .50  23.6    228  4981
28.  0110## : 0        1000.   .00    61.    2  .33  22.5     16  4997
29.  0100## : 0        1000.   .00    58.    2  .33  22.2     46  4981
30.  100#0# : 1           0.   .00    63.    2  .33  22.8     22  4866
31.  110##1 : 1        1000.   .00    63.    2  .33  23.2     35  4953
32.  001##0 : 1        1000.   .00    77.    2  .33  20.7      7  4985
33.  10#10# : 1           0.   .00    93.    2  .33  24.5     28  4968
34.  11#1#1 : 1        1000.   .00    59.    2  .33  21.8     12  4983
35.  01#1#0 : 1        1000.   .00    75.    2  .33  23.1     21  4944
36.  01#0#1 : 0        1000.   .00    36.    2  .33  21.7      3  4997
37.  11##01 : 0           0.   .00    92.    2  .33  19.7     41  4948
38.  10#### : 1         703.   .31     8.    2  .67  22.3     10  4980
39.  #11##0 : 0         856.   .22    11.    2  .50  27.4     22  4978

14
Can you show the evolution of a rule?
Action sets [A] for input 101001 and action 0
at several epochs.
Epoch 247:
No.  Condition : Act    PRED    ERR   FITN  NUM   GEN  ASIZ  EXPER   TST
 0.  ###### : 0         431.   .440     8.    2  1.00  17.2     76   244
 1.  ##10## : 0         245.   .362   109.    2   .67  10.6     14   236
 2.  ##100# : 0         893.   .146   504.    5   .50  11.2      8   200

Epoch 1135:
 0.  ###0#1 : 0         519.   .419     1.    1   .67  16.5     11  1134
 1.  ###00# : 0         510.   .390    27.    2   .67  16.8     15  1119
 2.  ##1### : 0         125.   .261     0.    1   .83  21.7     18  1132
 3.  #0##0# : 0        1000.   .021     4.    1   .67  17.7      0  1117
 4.  #010## : 0         454.   .433     2.    1   .50  14.8     53  1106
 5.  #0100# : 0         735.   .343    27.    2   .33  14.4     13  1106
 6.  1####1 : 0         169.   .282     2.    1   .67  24.4     12  1119
 7.  1###0# : 0         445.   .418    13.    5   .67  18.6     27  1119
 8.  10#### : 0        1000.   .000   135.    2   .67  24.2      3  1117
 9.  10##0# : 0        1000.   .000   451.    3   .50  23.4     17  1117

Epoch 1333:
 0.  #01#0# : 0         761.   .336     1.    1   .50  10.6     10  1325
 1.  1###0# : 0         652.   .387     5.    1   .67  10.9     11  1325
 2.  1##0#1 : 0         107.   .197     6.    1   .50  22.0      8  1308
 3.  1#100# : 0         829.   .228    26.    2   .33  14.3      9  1325
 4.  10##0# : 0        1000.   .000   490.    4   .50  11.6     26  1325

Epoch 2410:
 0.  1###0# : 0         360.   .394     0.    1   .67  18.1     14  2404
 1.  10##0# : 0        1000.   .000   478.   10   .50  20.1     95  2392

Epoch 2725:
 0.  #0##0# : 0         863.   .237     0.    3   .67  21.1     18  2714
 1.  10##0# : 0        1000.   .000   630.   13   .50  22.6    117  2714
 2.  10#00# : 0        1000.   .000    49.    1   .33  22.4      9  2638
 3.  101#0# : 0        1000.   .000    58.    1   .33  18.4      8  2693

15
Why accurate, maximally general rules?
Consider two classifiers C1 and C2 having the same action,
and let C2 be a generalization of C1. That is, C2 can be
obtained from C1 by changing some non-# alleles in the
condition to #’s. Suppose that C1 and C2 are equally
accurate. They will therefore have the same fitness.
However, note that, since it is more general, C2 will occur
in more action sets than C1. What does this mean? Since
the GA acts in the action sets, C2 will have more
reproductive opportunities than C1. This edge in
reproductive opportunities will cause C2 to gradually drive
C1 out of the population.
Example:

                               p       ε      F
    C1:  1 0 # 0 0 1 : 0  ⇒  1000    .001   920
    C2:  1 0 # # 0 # : 0  ⇒  1000    .001   920

C2 has equal fitness but more reproductive opportunities than C1.

C2 will "drive out" C1.
16
Does XCS scale up?
17
What about complexity?
The 20-multiplexer is about 5 times harder than the 11-multiplexer, which in turn is about 5 times harder than the 6-multiplexer.

    ⇒ D = c·g^p,

where D = "difficulty", here learning time,
      g = number of maximal generalizations,
      p = a power, about 2.3,
      c = a constant, about 3.2.

Thus "D is polynomial in g".

What is D with respect to l, the string length?

For the multiplexers, l = k + 2^k, or l → 2^k for large k.
But g = 4·2^k, thus l ~ g,
so "D is polynomial in l" (not exponential).
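
As a rough check of the "~5x" figures under this model: g = 4·2^k gives g = 16, 32, and 64 for the 6-, 11-, and 20-multiplexer, so each step doubles g and

    D(11m)/D(6m) = D(20m)/D(11m) = 2^2.3 ≈ 4.9,

i.e., roughly 5 times the learning time per step.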
18
What about deferred reward?
Apply ideas from multi-step reinforcement learning.
Need the action-value of each action in each state.
What is the action-value of a state more than one
step from reward?
Intuitive sketch:

[Diagram: a grid containing food F and a rock O, with cells labeled 1, γ, γ², γ³ according to their distance from F: a state's value is the eventual reward discounted by γ for every step remaining to the reward.]
    p_j ← p_j + α[(r_imm + γ max_{a′∈A} P(x′,a′)) - p_j]

where p_j is the prediction of a classifier in the current action set [A],
x′ and a′ are the next state and possible actions,
P(x′,a′) is a system prediction at the next state,
and r_imm is the current external reward.
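
A minimal sketch of this update applied to the saved action set (γ and α are assumed values; the prediction-array dict comes from the performance cycle):

    GAMMA, ALPHA = 0.71, 0.2   # assumed discount factor and learning rate

    def multistep_update(prev_action_set, r_imm, next_prediction_array):
        """Target = immediate reward plus the discounted best system
        prediction at the next state; classifiers are dicts with key 'p'."""
        target = r_imm + GAMMA * max(next_prediction_array.values())
        for cl in prev_action_set:
            cl['p'] += ALPHA * (target - cl['p'])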
19
Can I see the overall process?
[Diagram: the overall cycle of the earlier figures, extended for deferred reward. The previous action set [A]-1 is saved for one time step (delay = 1); its classifiers are updated toward the external reward plus γ times the maximum of the current prediction array (the system prediction P for the next state). The GA and covering operate as before.]
• The previous action set [A]-1 is saved and updates are done there, using the current prediction array for the "next state" system predictions.
• On the last step of a problem, updates occur in [A].
20
What are the results like?— 1
• The animat senses the 8 adjacent cells:

      F b b
      O * b
      Q b b

  (* marks the animat's position.)

• Coding of each object:
      F = 110  "food1"
      G = 111  "food2"
      O = 010  "rock1"
      Q = 011  "rock2"
      b = 000  "blank"

• "Sense vector" for the above situation: 000000000000000011010110

• A matching classifier: ####0#00####00001##101## : 7
21
What are the results like?— 2
Two generalizations discovered by XCS in Woods1.
22
What about continuous inputs, actions, time?
Inputs:
    <<x1 ± ∆x1> … <xn ± ∆xn>> : <action> ⇒ p

Actions:
    <<x1 ± ∆x1> … <xn ± ∆xn>> : <a ± ∆a> ⇒ p
    —and combine matching rules à la fuzzy logic, perhaps.

Time:
    <<x1 ± ∆x1> … <xn ± ∆xn>> : <a ± ∆a> ⇒ dp/dt
    —action selection based on steepest ascent of p.
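
A minimal sketch of matching with interval predicates of the form <x_i ± ∆x_i> (a center-spread encoding along the lines of XCSR; the representation here is an assumption):

    def matches_real(condition, state):
        """condition: list of (center, spread) pairs; state: list of reals.
        Each input must lie inside [center - spread, center + spread]."""
        return all(c - d <= x <= c + d for (c, d), x in zip(condition, state))

    # e.g. a two-input condition <43.0 ± 5.0> <0.30 ± 0.05> against input (41.2, 0.32)
    assert matches_real([(43.0, 5.0), (0.30, 0.05)], [41.2, 0.32])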
23
What about Non-Markov environments?
Example (McCallum’s Maze):
[Maze diagram omitted; two of its cells are marked *.]

* Aliased states. Optimal action not determinable from current sensory input.
Approaches:
• "History window" — remember previous inputs
• Search for correlation with past input events
✔ • Adaptive internal state:

      ## ## #1 ## 0# ## ## ##  #  :  1 0  ⇒  504
      (environmental condition, internal condition : external action, internal action ⇒ prediction)

24
What if generalizations are not conjunctive?
Example: “if x > y for any x and y, and action a
is taken, payoff is predicted to be p.”
Cannot be represented using a single classifier with
traditional conjunctive condition, since it’s a relation.
However, it can be represented using an "s-classifier":

    (> x y) : <action a> ⇒ p

i.e., a classifier whose condition is a Lisp s-expression.
With appropriate elementary functions, s-classifiers
can encode an almost unlimited variety of conditions.
They can be evolved using techniques of
genetic programming.
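
A minimal sketch of evaluating such an s-expression condition (the tiny evaluator, written here in Python rather than Lisp, and its function table are illustrative assumptions):

    import operator

    FUNCS = {'>': operator.gt, '<': operator.lt, 'and': lambda *a: all(a)}

    def eval_sexpr(expr, env):
        """Evaluate a nested-tuple s-expression such as ('>', 'x', 'y')
        against a dict of variable bindings."""
        if isinstance(expr, tuple):
            fn, *args = expr
            return FUNCS[fn](*(eval_sexpr(a, env) for a in args))
        return env.get(expr, expr)   # a variable name or a literal

    # the condition (> x y) matches the input x = 5.2, y = 3.1:
    assert eval_sexpr(('>', 'x', 'y'), {'x': 5.2, 'y': 3.1})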
25
How is XCS different from other RL systems?
Rule-based, not PDP (“parallel distributed processing”)
• Structure is created as needed
• Learning may often be faster because classifiers are
inherently non-linear
• Learning complexity may be less than for most PDPs
• Classifiers can keep and use statistics; difficult in
a network
• Hierarchy and reasoning may be easier, since
knowledge is in subroutine-like packages
26