
ILP & ASM
DRAFT
Participants:
Stephen Muggleton (University of York)
James Cussens (University of York)
Hendrik Blockeel (University of Leuven)
Alex Bradley (Syllogic Ireland)
Daniel van der Wallen (Syllogic Holland)
This report describes the research done at the University of York. The research was done to give a proof of concept of ILP for use in Adaptive System Management. For the application we used both Progol (1) and Tilde (2).
Draft version:
Not all Tilde results and statistics are reported yet (waiting for input from Leuven).
Not all relevant tests have been done yet; some tests should be done in Houten and Dublin.
Table of contents

1 Introduction
2 Data
3 Model
4 Experiments
  4.1 Experiment 1
    4.1.1 Progol
    4.1.2 Tilde
  4.2 Experiment 2
    4.2.1 Progol
    4.2.2 Tilde
  4.3 Experiment 3
    4.3.1 Progol
    4.3.2 Tilde
  4.4 Experiment 4
    4.4.1 Progol
  4.5 Experiment 5
    4.5.1 Progol
5 Visualization of the Data
6 Interpretation of the experiments
7 Conclusions
8 Future work
1 Introduction
The current learning techniques in ASM are propositional: the target is related to individual monitors. This means we can find a set of relations of the type: if monitor X > 12 then Bad_performance. To interpret those rules we need system management knowledge. For example, when a large number of monitors in such a set of relations are disk monitors of a certain server, a system manager could interpret the set as a whole as an indication that the disk on that server is performing badly. So, from this set of rules we need to derive a more useful indication, like: the disks on server Y are causing low performance. With the given knowledge we can perhaps deduce that the disks are busy swapping memory, at which point a real cause of low performance is found. To automate this we need two things: first, model information, so that we know that a disk runs on a certain server, and secondly, background knowledge with which real causes can be deduced from a set of correlated events. This means we want the model and other background knowledge to be included in the learning process. We therefore did a feasibility study into the possibility of using ILP systems for this type of learning. Besides theoretical issues, the study should take into account issues like scalability and performance.
2 Data
The data we used is system data from a real computer network. The size of the data set is small compared to the data on which the technique is intended to be used.
The data set comes from a computer system that was monitored for two months. The system was monitored every 15 minutes on approx. 200 different components. This gives a data set of 5000 records (1 million fields), each containing values for the different monitors. The intended size of the data, by contrast, could be something like 12000 components monitored every minute for 1 month, which gives 500 million fields.
Because we want to compare different monitors at the same time, we have made sure that for every time slice, every monitor has a value. This is done by first calculating all the time slices and afterwards filling in the values for all monitors; if a monitor doesn't have a value for a time slice, we linearly interpolate between the two surrounding time slices for which the monitor did have a value. This probably introduces some noise in the data.
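The report does not include the preprocessing code; as a minimal sketch, assuming a hypothetical predicate measured(Monitor, Time, Value) holding the raw samples, the fill-in step could look like this (using SWI-Prolog's aggregate_all/3):

% value_at(M, T, V): V is the recorded value of monitor M at time T,
% or the linear interpolation between the nearest surrounding samples.
value_at(M, T, V) :-
    measured(M, T, V), !.
value_at(M, T, V) :-
    aggregate_all(max(T0), (measured(M, T0, _), T0 < T), Tb),   % nearest earlier sample
    aggregate_all(min(T1), (measured(M, T1, _), T1 > T), Ta),   % nearest later sample
    measured(M, Tb, Vb),
    measured(M, Ta, Va),
    V is Vb + (Va - Vb) * (T - Tb) / (Ta - Tb).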
Because of the highly numeric character of these monitors and the symbolic character of ILP, we only looked at interesting values: those values that are 2 standard deviations above or below the average, indicated by high and low. This also reduces the size of the data set to be handled by the ILP algorithm enormously. But of course it requires a preprocessing step: calculating the average and standard deviation for every monitor over the time period and checking each value to be high, low or normal.
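A minimal sketch of this step, assuming a hypothetical predicate stats(Monitor, Mean, Std) that holds the precomputed statistics per monitor:

% Hypothetical sketch of the discretization step.
discretize(M, Value, hi) :- stats(M, Mean, Std), Value > Mean + 2*Std, !.
discretize(M, Value, lo) :- stats(M, Mean, Std), Value < Mean - 2*Std, !.
discretize(_, _, normal).   % everything else counts as normal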
Next to the monitor data, we use model data which is discussed in the next section.
3 Model
The model underlying the data is a simplified version of the ASM data model. The model consists of abstract components (cc_component) and instantiations of those components (ci_component). This defines an "isa"-relation.
Next to this, abstract components are part of higher-level components. This defines the "part-of" relation. A typical example would be: isa(AIXServer, Beta) and partof(Beta, HardDisk).

This gives the following Prolog facts:

class( classid, classname, parentid ).
isa( classid, classid2 ) :- class( classid, _, classid2 ), class( classid2, _, _ ).

This defines the hierarchical isa-relations.

instance( instance_id, instance_name, class_id ).
part_of( instance_id, instance_id ).

This defines the part-of relations.
So, for every instance we have an "isa" relation, and for every component we have a "partof" relation (all components are derived from the top-level component).

Monitors are described by three parameters: MonitorId (unique), Name and Component_id. Monitors are linked to the model by the component they run on. So, in Prolog:

monitor( monitor_id, monitor_name, component_id ).
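As an illustration (the identifiers below are made up, since the actual ids are not given in this report), the AIXServer example above could be encoded as:

% Hypothetical instantiation of the model; all ids are invented.
class( 1, top_level, 1 ).               % the top-level component class
class( 10, aixserver, 1 ).              % isa(10,1): an AIX server is a top-level component
class( 11, harddisk, 1 ).
instance( 100, beta, 10 ).              % beta is an instance of class aixserver
instance( 101, disk0, 11 ).
part_of( 101, 100 ).                    % the disk is part of server beta
monitor( 23, m6_cpu_busy_s11, 100 ).    % this monitor runs on component beta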
The actual monitor values can be recorded in two ways, 'vertically' or 'horizontally'. Say we have t time slices and n monitors. We can construct, for every time slice, one record of arity n, which makes the total number of records t. Alternatively, we can construct, for every time slice, n records of arity 3, which makes the total number of records t*n.
For our current data set we have n = 250 and t = 5000, which comes down to either 5000 records or 1 million records (see Data).
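For concreteness, a sketch of the two representations for one time slice (n = 3; the monitor names are illustrative):

% 'Horizontal': one record per time slice, arity n.
slice( 42, hi, normal, lo ).

% 'Vertical': n records of arity 3 per time slice.
monitor( 42, m1_cpu_busy, hi ).
monitor( 42, m2_nr_processes, normal ).
monitor( 42, m3_free_diskspace, lo ).

The experiments below use the vertical form, which keeps the arity fixed and allows rules to quantify over monitors.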
4 Experiments
A number of experiments were done with both Progol and Tilde. The first experiment was the propositional case, where the "unhappiness" of the system (defined as a threshold on a monitor) is related to individual monitors. The reason for this experiment is to compare the results with previous results found with decision trees and association rules. The second experiment was done with time as a variable: the systems could use the predicate prev(Time, Time-1), so we allow rules to be found that relate the target at time t to the monitors at time t-1. The third experiment was like the second, but here the systems were allowed to look back two periods in time. The fourth experiment was done to see if existing rules could be further refined: the rules found in experiments 2 and 3 are added to the background knowledge and the system is run again. This allows longer rules to come into the theory.
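The prev/2 predicate links each time slice to the one immediately before it; with integer time-slice identifiers, a minimal sketch of such a definition would be:

% prev(T, Tprev): Tprev is the time slice immediately preceding T
% (a sketch; the report does not show the actual definition).
prev(T, Tprev) :- Tprev is T - 1.

A rule like unhappy(A) :- prev(A,B), monitor(B, ..., hi) then relates unhappiness at time slice A to a monitor value in the preceding time slice B.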
4.1 Experiment 1
The first experiment was the propositional case, where the "unhappiness" of the system (defined as a threshold on a monitor) is related to individual monitors. The data was discretized by:

Value = high   if   Value > average + 2 std. dev.
Value = low    if   Value < average - 2 std. dev.
So, we expect rules like:
unhappy(Time) :- monitor('free diskspace', Time, low).
4.1.1 Progol
:- set(i,2)?
:- set(noise,5)?
:- set(nodes,200)?
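(Our reading of these standard Progol settings: i bounds the variable depth used in constructing the most specific clause, noise is the maximum number of negative examples a clause may cover, and nodes bounds the number of nodes explored during the search.)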
The theory found by Progol was:

unhappy(A) :-
    monitor(A, m7_cpu_system_s11 (24), hi),
    monitor(A, m14_nfs_client_calls_s11 (30), hi).
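In words: the system is unhappy at a time slice in which both the CPU system monitor and the NFS client calls monitor on s11 are high.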
Runtime: 20 minutes.
The accuracy of this theory:

Contingency table

              A          ~A
P           157          53        210
         (  9.7)     ( 200.3)
~P           75        4714       4789
         ( 222.3)    ( 4566.7)
            232        4767       4999

Overall accuracy = 97.44% +/- 0.22%
Chi-square = 2418.98
Without Yates correction = 2435.49
Chi-square probability = 0.0000
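(Our reading of these Progol tables: P/~P are the predictions of the theory, A/~A the actual classes, and the numbers in parentheses are the expected cell counts under independence, from which the chi-square statistic is computed.)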
4.1.2 Tilde
Induction time: 1534.59 seconds.
Theory found by Tilde:
modelid(A)
monitor(A,157,hi) ?
+--yes: unhappy [136 / 136]
+--no:  monitor(A,58,hi) ?
        +--yes: monitor(A,135,lo) ?
        |       +--yes: happy [12 / 20]
        |       +--no:  unhappy [33 / 34]
        +--no:  monitor(A,82,hi) ?
                +--yes: monitor(A,49,hi) ?
                |       +--yes: happy [3 / 4]
                |       +--no:  unhappy [4 / 4]
                +--no:  happy [284 / 334]
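(Our reading of the Tilde output: a leaf such as unhappy [33 / 34] covers 34 examples, of which 33 are indeed unhappy.)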
Contingency table: not yet reported (see the draft note above).
4.2 Experiment 2
4.2.1 Progol
Runtime: 35 minutes.
Found theory:

Rule1:
unhappy(A) :-
    prev(A,B),
    monitor(B, m6_cpu_busy_s11 (23), hi),
    monitor(B, m75_nr_rec_breq_s11 (58), hi).

Rule2:
unhappy(A) :-
    prev(A,B),
    monitor(B, m2_number_of_processes_s11 (19), hi),
    monitor(B, m201_number_of_users_s2 (134), hi),
    monitor(B, m203_paging_space_used_s2 (136), lo).

Rule3:
unhappy(A) :-
    prev(A,B),
    monitor(B, m3_paging_space_used_s11 (20), hi),
    monitor(B, m121_545h502_s11 (104), lo),
    monitor(B, daytime (227), lo).
Contingency table

              A          ~A
P           169          54        223
         ( 10.3)     ( 212.7)
~P           63        4713       4776
         ( 221.7)    ( 4554.3)
            232        4767       4999

Overall accuracy = 97.66% +/- 0.21%
Chi-square = 2652.71
Without Yates correction = 2669.51
Chi-square probability = 0.0000

Distribution of positive coverage:
Rule1: 157
Rule2: 5
Rule3: 100

Distribution of errors of commission (negative examples):
Rule1: 16
Rule2: 12
Rule3: 26
4.2.2 Tilde
Induction time: 601.3 seconds.
Theory found by Tilde:
modelid(A) , prev(A,B)
monitor(B,58,hi) ?
+--yes: monitor(B,86,hi) ?
|       +--yes: monitor(B,49,hi) ?
|       |       +--yes: unhappy [3 / 3]
|       |       +--no:  happy [11 / 17]
|       +--no:  unhappy [168 / 169]
+--no:  monitor(B,82,hi) ?
        +--yes: monitor(B,98,hi) ?
        |       +--yes: happy [2 / 2]
        |       +--no:  monitor(B,134,hi) ?
        |               +--yes: unhappy [3 / 3]
        |               +--no:  monitor(B,20,hi) ?
        |                       +--yes: happy [2 / 2]
        |                       +--no:  unhappy [5 / 7]
        +--no:  monitor(B,84,hi) ?
                +--yes: monitor(B,227,lo) ?
                |       +--yes: unhappy [3 / 3]
                |       +--no:  monitor(B,52,hi) ?
                |               +--yes: happy [5 / 5]
                |               +--no:  monitor(B,128,hi) ?
                |                       +--yes: unhappy [4 / 5]
                |                       +--no:  monitor(B,48,lo) ?
                |                               +--yes: unhappy [2 / 3]
                |                               +--no:  happy [16 / 24]
                +--no:  happy [259 / 289]
Contingency table

              A          ~A
P           188         222        410
         ( 19.0)     ( 391.0)
~P           44        4544       4588
         ( 213.0)    ( 4375.0)
            232        4767       4999

Overall accuracy = 94.68% +/- 0.32%
Chi-square = 1703.63
Without Yates correction = 1713.76
Chi-square probability = 0.0000
4.3 Experiment 3
4.3.1 Progol
Theory given by Progol:

Rule1:
unhappy(A) :-
    prev(A,B),
    monitor(B, m75_nr_rec_breq_s11 (58), hi),
    monitor(B, m121_545h502_s11 (104), lo).

Rule2:
unhappy(A) :-
    prev(A,B), prev(B,C),
    monitor(B, m101_vs651fde_filetime_s11 (84), hi),
    monitor(C, m99_scar_rem_accestime_s11 (82), hi).

Rule3:
unhappy(A) :-
    prev(A,B), prev(B,C),
    monitor(C, m43_ora_sga_free_memory_s11 (49), lo),
    monitor(C, m304_system_load_s4 (172), hi).

Rule4:
unhappy(A) :-
    prev(A,B),
    monitor(B, m2_number_of_processes_s11 (19), hi),
    monitor(B, m201_number_of_users_s2 (134), hi),
    monitor(B, m203_paging_space_used_s2 (136), lo).
Contingency table

              A          ~A
P           178         167        345
         ( 16.0)     ( 329.0)
~P           54        4600       4654
         ( 216.0)    ( 4438.0)
            232        4767       4999

Overall accuracy = 95.58% +/- 0.29%
Chi-square = 1834.66
Without Yates correction = 1846.04
Chi-square probability = 0.0000

Distribution of positive coverage:
Rule1: 168
Rule2: 5
Rule3: 144
Rule4: 5

Distribution of errors of commission (negative examples):
Rule1: 50
Rule2: 8
Rule3: 100
Rule4: 12
4.3.2 Tilde
Induction time: 4576.69 seconds.
Theory found by Tilde:
modelid(A) , prev(A,B) , prev(B,C)
monitor(C,58,hi) ?
+--yes: monitor(C,104,lo) ?
|       +--yes: unhappy [168 / 169]
|       +--no:  monitor(B,49,hi) ?
|               +--yes: unhappy [3 / 3]
|               +--no:  monitor(C,111,hi) ?
|                       +--yes: unhappy [2 / 2]
|                       +--no:  happy [11 / 15]
+--no:  monitor(C,82,hi) ?
        +--yes: monitor(B,49,hi) ?
        |       +--yes: happy [4 / 5]
        |       +--no:  unhappy [8 / 8]
        +--no:  monitor(C,28,hi) ?
                +--yes: unhappy [3 / 4]
                +--no:  monitor(B,28,hi) ?
                        +--yes: unhappy [3 / 4]
                        +--no:  monitor(C,45,hi) ?
                                +--yes: unhappy [2 / 3]
                                +--no:  monitor(C,84,hi) ?
                                        +--yes: monitor(C,177,hi) ?
                                        |       +--yes: unhappy [3 / 3]
                                        |       +--no:  monitor(C,26,hi) ?
                                        |               +--yes: unhappy [2 / 2]
                                        |               +--no:  monitor(B,179,hi) ?
                                        |                       +--yes: unhappy [2 / 2]
                                        |                       +--no:  happy [24 / 34]
                                        +--no:  monitor(B,82,hi) ?
                                                +--yes: monitor(C,207,hi) ?
                                                |       +--yes: unhappy [2 / 2]
                                                |       +--no:  happy [2 / 3]
                                                +--no:  happy [255 / 273]
Contingency table

              A          ~A
P           198         367        565
         ( 26.2)     ( 538.8)
~P           34        4399       4433
         ( 205.8)    ( 4227.2)
            232        4767       4999

Overall accuracy = 91.98% +/- 0.38%
Chi-square = 1322.46
Without Yates correction = 1330.19
Chi-square probability = 0.0000
4.4 Experiment 4
This experiment includes the significant rules from experiments 2 and 3 above as background knowledge. The goal of this experiment is to see whether or not longer and better rules are found. The search is restricted by pruning all nodes that do not include a rule/2 term.
The rules that were used as background knowledge are the following:
rule(2, A) :-
    prev(A,B),
    monitor(B, m75_nr_rec_breq_s11 (58), hi),
    monitor(B, m121_545h502_s11 (104), lo).

rule(4, A) :-
    prev(A,B), prev(B,C),
    monitor(C, m43_ora_sga_free_memory_s11 (49), lo),
    monitor(C, m304_system_load_s4 (172), hi).

rule(6, A) :-
    prev(A,B),
    monitor(B, m6_cpu_busy_s11 (23), hi),
    monitor(B, m75_nr_rec_breq_s11 (58), hi).

rule(8, A) :-
    prev(A,B),
    monitor(B, m3_paging_space_used_s11 (20), hi),
    monitor(B, m121_545h502_s11 (104), lo),
    monitor(B, daytime (227), lo).
4.4.1 Progol
Theory found by Progol:

Rule1:
unhappy(A) :- rule(6,A).
Rule2:
unhappy(A) :- rule(8,A).
Rule3:
unhappy(A) :- rule(2,A), prev(A,B), monitor(B,21,hi).
Rule4:
unhappy(A) :- rule(2,A), prev(A,B), monitor(B,154,hi).
Contingency table

              A          ~A
P           168          63        231
         ( 10.7)     ( 220.3)
~P           64        4704       4768
         ( 221.3)    ( 4546.7)
            232        4767       4999

Overall accuracy = 97.46% +/- 0.22%
Chi-square = 2520.85
Without Yates correction = 2536.95

Distribution of positive coverage:
Rule1: 157
Rule2: 100
Rule3: 7
Rule4: 79

Distribution of errors of commission (negative examples):
Rule1: 16
Rule2: 26
Rule3: 15
Rule4: 24
4.5 Experiment 5
The fifth experiment was done with the inclusion of a part of the model information from the system (see Model). This, of course, is a non-propositional experiment. The data consists of the same data as in the previous experiments plus the relational information from the model. We allowed only relational attributes in the rules.
4.5.1 Progol
Progol settings were:

hi => avg + 4 std. dev. (!)
set(i,3)
set(nodes,1000)

Progol generated the following two rules:

unhappy(A) :-
    prev(A,B), monitor(B,C,hi),
    monitorclass_id(C, Breq Table (1045)).

unhappy(A) :-
    prev(A,B), monitor(B,C,hi),
    monitorclass_id(C, NFS server (1026)).
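In words: the system is unhappy at a time slice if, in the previous time slice, some monitor running on a component of class Breq Table (respectively NFS server) was high.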
Contingency table

              A          ~A
P           186         335        521
         ( 24.2)     ( 496.8)
~P           46        4432       4478
         ( 207.8)    ( 4270.2)
            232        4767       4999

Overall accuracy = 92.38% +/- 0.38%
Chi-square = 1260.01
Without Yates correction = 1267.84
Chi-square probability = 0.0000
5 Visualization of the Data
To get more insight into the data we plotted some of the monitors that occur in the rules against time and/or the target.

Figure 1: CPU system % on s11 over time
Figure 2: CPU system % on s11 against the target
6 Interpretation of the experiments
The experiments done with Tilde and Progol all give a predictive accuracy between 92% and 97% on the complete data. The default theory of always predicting happy gives an accuracy of 95%, but we have to take into account that the theories were learned on all positive examples and a small subset of the negatives, which together form about 10% of the complete data.
The general problem with data sets such as this one is the skewness of the distribution between positives and negatives: a system is performing well most of the time and badly only sometimes. The training set is equalized by sampling a small subset of the negatives, giving the default theory an accuracy of about 50%. The theories have a predictive accuracy of at least 88% on the training set.
We have seen that Progol could find a relatively simple theory, only 1-4 rules, in a short time of approx. 20 minutes. Tilde took about equally long, giving more complex theories of 10-20 rules. Although not exactly the same theories were found, both systems came up with many of the same monitors.
Time was included to increase the 'predictiveness' of the rules. Many of the propositional clauses explain approximately the same data, so there is a large overlap between the theories.
Tilde had a lower overall predictive accuracy than Progol but made fewer errors on the positives. We should be careful, though, in comparing these results, because for both systems the positive part of the test set was also used for training.
The theories leave us with approx. 20% of the unhappies (positives) unexplained. This could mean that in real systems, too, 80% of performance problems can be predicted and 20% cannot. It could, however, also mean that the data is not very clean. As can be seen from the visualization of some of the attributes, the data is corrupted for some time periods. Partly due to the mechanism that was used to retrieve the data (see Data), the data has certain null-areas in which no values were recorded. These introduce noise in the raw data as well as in the discretized data, because they affect the mean and standard deviation.
When including the model information and restricting Progol so that it uses no specific monitor information, we get a model of only two rules. These two rules predict worse overall than the propositional theories (92.3% versus 97.8%), but score better on the positives (80.1% versus 72%). In addition, the theory is more sensible: it says more about the domain because it has a richer language. The theory found contained a rule predicting unhappy when there is a high monitor of class NFS server. NFS server is a component class on which several NFS monitors are defined. This gives exactly the kind of generalisation we want; 'an NFS problem' is the level of abstraction system managers use.
The time the systems took for the induction was approx. 20-60 minutes for the propositional case and approx. 180 minutes for the first-order case. For background processing this is reasonable; for online analysis it is not. The implementations of the systems are academic, meaning that efficiency and performance were not the first concern of the developers. Because the search space can be restricted considerably and the implementations can be improved, this time can probably be brought down considerably.
7 Conclusions
8 Future work