Inference Processes on Clustered Partial Decision Rules

Recent Advances in Intelligent Information Systems
ISBN 978-83-60434-59-8, pages 579–588
Agnieszka Nowak and Beata Zielosko
Institute of Computer Science, University of Silesia
39, Będzińska St., 41-200 Sosnowiec, Poland, http://zsi.tech.us.edu.pl
agnieszka.nowak,[email protected]
Abstract
The aim of the paper is to study the efficiency of the inference process using
clustered partial decision rules. Partial decision rules are constructed by a
greedy algorithm. They are clustered with the Agglomerative Hierarchical
Clustering (AHC) algorithm. We study how exact and partial decision rules
clustered by the AHC algorithm influence the inference process in a knowledge
base. Clusters of rules are a way of modularizing knowledge bases in Decision
Support Systems. Results of experiments show how different factors (e.g. rule
length, number of facts given as input knowledge) can influence the efficiency
of the inference process.
Keywords: rule clusters, partial decision rules, greedy algorithm, composited
knowledge bases, knowledge representation
1 Introduction
Knowledge based systems are created for specialized domains of competence, in
which effective problem solving normally requires human expertise. In recent
years, knowledge based systems technology has proven itself to be a valuable
tool for solving hitherto intractable problems in domains such as
telecommunications, aerospace, medicine and the computer industry itself. The
goals of knowledge based systems are often more ambitious than those of
conventional programs. They frequently serve not only as problem solvers but
also as intelligent assistants and training aids; they help their creators and
users to better understand their own knowledge (Luger and Stubblefield, 2002).
Proper knowledge acquisition is one of the main goals of knowledge based
systems. Usually they are implemented using knowledge acquired from human
experts or discovered in databases (for example using rough set theory
(Moshkov et al., 2008a; Pawlak, 1991)). Rules are the most popular method of
knowledge representation. Unfortunately, if we use (possibly different) tools
for automatic rule acquisition and/or extraction, the number of rules can grow
rapidly. For modern problems, knowledge bases can contain hundreds or
thousands of rules. For such knowledge bases, the number of possible inference
paths is very high. In such cases the knowledge engineer cannot be totally
sure that all possible rule interactions are legal and lead to expected
results. This brings problems with inference efficiency and the interpretation
of inference results. For example, in forward reasoning many fired rules
produce many new facts that are sometimes difficult to interpret properly and
may be useless for the user (Nowak et al., 2006a, 2008). The problem of
efficiency may be important in technical applications, especially in real-time
systems. Inference methods are well known; there are well described algorithms
for forward and backward reasoning and some of their modifications (Luger and
Stubblefield, 2002). We believe that the increase of efficiency should rely
not on the modification of these algorithms but on the modification of the
data structures used by them. We build the knowledge base from partial
decision rules constructed by a greedy algorithm. Partial decision rules
consist of a smaller number of attributes than exact rules (Zielosko and
Piliszczuk, 2008). We can say that such rules are less complex and easier to
understand. This is not the only improvement of efficiency we can make. We
propose to reorganize the knowledge base from a set of unrelated rules into
groups of similar rules (using a cluster analysis method). Thanks to
clustering of the conditional parts of similar rules, possibly only a small
subset of rules needs to be checked for given facts, which improves the
performance of the inference process (Nowak et al., 2006a; Nowak and
Wakulicz-Deja, 2008b). The paper consists of five sections. In Section 2 the
main notions of partial decision rules and the greedy algorithm are
considered. Section 3 describes how inference is performed on clusters of
rules. In Section 4 results of experiments with real-life decision tables are
considered. Section 5 contains a summary.
2 Partial decision rules
Decision rules can be considered as a way of knowledge representation. In
applications we often deal with decision tables which contain noisy data. In
this case, exact decision rules can be "over-fitted", i.e., depend essentially
on the noise. So, instead of exact decision rules with many attributes, it is
more appropriate to work with partial decision rules with a small number of
attributes, which separate almost all different rows with different decisions
(Zielosko and Piliszczuk, 2008).
The problems of constructing decision rules with a minimal number of
attributes are NP-hard. Therefore, we should consider approximate polynomial
algorithms. In (Moshkov et al., 2008a) we adapted the well known greedy
algorithm for the set cover problem to the construction of partial decision
rules. From the obtained bounds on greedy algorithm accuracy and the results
proved in (Moshkov et al., 2008b) it follows that, under some natural
assumptions on the class NP, the greedy algorithm is close to the best
polynomial approximate algorithms for minimization of partial decision rule
length.
Now, the main notions for partial decision rules are presented.
Let T be a table with n rows labeled with decisions and m columns labeled
with attributes (names of attributes) a1, ..., am. This table is filled with
values of attributes. The table T is called a decision table (Pawlak, 1991).
Two rows are called different if they have different values at the
intersection with at least one column ai.
Let r = (b1, ..., bm) be a row from T labeled with a decision d. We will
denote by U(T, r) the set of rows from T which are different (in at least one
column ai)
from r and are labeled with decisions different from d. We will use the
parameter α to denote a real number such that 0 ≤ α < 1. A decision rule

(ai1 = bi1) ∧ ... ∧ (ait = bit) → d

is an α-decision rule for the row r of decision table T if the attributes
ai1, ..., ait separate from r at least ⌈(1 − α)|U(T, r)|⌉ rows from U(T, r).
We will say that an attribute ai separates a row r′ ∈ U(T, r) from the row r
if the rows r and r′ have different values at the intersection with the
column ai.
For example, a 0.01-decision rule means that the attributes contained in this
rule should separate from r at least 99% of the rows from U(T, r). If α is
equal to 0 we have an exact decision rule. Algorithm 1 describes the greedy
algorithm with threshold α which constructs an α-decision rule for the row r
of decision table T.
Algorithm 1: Greedy algorithm for partial decision rule construction
Input : Decision table T with conditional attributes a1, ..., am, row
        r = (b1, ..., bm) of T labeled by the decision d, and a real number
        α, 0 ≤ α < 1.
Output: α-decision rule for the row r of decision table T
Q ← ∅;
while attributes from Q separate from r less than (1 − α)|U(T, r)| rows
      from U(T, r) do
    select ai ∈ {a1, ..., am} with minimal index i such that ai separates
    from r the maximal number of rows from U(T, r) unseparated by
    attributes from Q;
    Q ← Q ∪ {ai};
end
return ⋀_{ai ∈ Q} (ai = bi) → d;
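To make the procedure concrete, below is a minimal Python sketch of
Algorithm 1. It assumes the decision table is given as a list of value tuples
with a parallel list of decisions; the function and variable names are our
illustrative choices, not the authors' implementation.

import math

def greedy_alpha_rule(T, decisions, r_idx, alpha):
    """Construct an alpha-decision rule for row r_idx of decision table T."""
    r, d = T[r_idx], decisions[r_idx]
    m = len(r)
    # U(T, r): rows that differ from r in at least one column and carry a
    # decision different from d
    U = [i for i, row in enumerate(T)
         if decisions[i] != d and any(row[j] != r[j] for j in range(m))]
    needed = math.ceil((1 - alpha) * len(U))   # rows that must be separated
    separated = set()
    Q = []                                     # indices of selected attributes
    while len(separated) < needed:
        # attribute (minimal index on ties) separating the most rows of U
        # that are still unseparated
        best, best_gain = None, -1
        for j in range(m):
            gain = sum(1 for i in U if i not in separated and T[i][j] != r[j])
            if gain > best_gain:
                best, best_gain = j, gain
        Q.append(best)
        separated.update(i for i in U if T[i][best] != r[best])
    # the rule is the conjunction of (a_j = b_j) for j in Q, with decision d
    return [(j, r[j]) for j in Q], d

For example, greedy_alpha_rule(T, decisions, 0, 0.1) would return the
attribute-value pairs of a 0.1-decision rule for the first row of T, together
with its decision.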
3 The hierarchical structure of knowledge base
It is known that cluster analysis provides a useful technique for reorganizing
the knowledge base into a hierarchical structure of rules. The hierarchy is a
very simple and natural form of presenting the real structure and
relationships between data in large data sets. Instead of one long list of all
rules in the knowledge base, we achieve better results if we build a
composited knowledge base as a set of groups of similar rules. That is why we
used the agglomerative hierarchical clustering algorithm to build a tree of
rule clusters (called a dendrogram). Such a tree has all the features of a
binary tree, so we note that the time efficiency of searching such trees is
O(log2 n). Such an optimization of the inference process is possible because
only a small part of the whole knowledge base is analyzed, i.e. the part most
relevant to the given set of facts (input knowledge) (Nowak et al., 2008).
3.1 The knowledge base structure
Having both X as the set of rules in a given knowledge base and Fsim as a
similarity function defined on the rules from X, we may build a hierarchically
organized model of the knowledge base. Such a model for n rules
X = {x1, ..., xn}, where each rule uses the set of attributes A and the values
of these attributes V (V = ∪a∈A Va is the set of values of attribute a), is
represented as a labeled binary tree Tree = {w1, ..., w2n−1} = ⋃_{k=1}^{2n−1} {wk},
created by clustering rules using the similarity function. Labels of the nodes
in this tree are five-tuples {d(xk), c(xk), f, i, j}, where
c(xk) ∈ V1 × V2 × ... × Vm is the vector of the left-hand side (conditional
part) of the rule xk, d(xk) is the vector of the right-hand side (decision
part) of this rule, and f = Fsim : X × X → [0, 1] is the value of similarity
between the two clustered rules (or clusters) xi and xj. Elements i and j are
the numbers of the children of the currently analyzed k-th node.
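As a possible in-memory encoding of these labels (a sketch under our own
naming, not taken from the paper), each of the 2n − 1 nodes can be stored as a
small record:

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class RuleNode:
    d: Tuple            # right-hand side (decision part) of the rule or cluster
    c: Tuple            # left-hand side (conditional part), a vector over V1 × ... × Vm
    f: float            # similarity Fsim in [0, 1] at which the children were merged
    i: Optional[int]    # index of the first child; None for a leaf (a single rule)
    j: Optional[int]    # index of the second child; None for a leaf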
3.2 Agglomerative hierarchical clustering of rules
The agglomerative algorithm starts with each object being a separate cluster,
and successively merges groups according to a distance measure. The clustering
may stop when all objects are in a single group (classical AHC). We could also
use a modified AHC, so called mAHC, which is widely presented in (Nowak et
al., 2006a,b).
Algorithm 2: Agglomerative hierarchical clustering algorithm for rules in
knowledge bases
Input : A set O of n objects and a matrix of similarities between the
        objects.
Output: Clusterings C0, C1, ..., Cn−1 of the input set O;
C0 ← the trivial clustering of the n objects in the input set O;
k ← 1;
while |Ck−1| > 1 do
    find ci, cj ∈ Ck−1 where the similarity s(ci, cj) is maximal;
    Ck ← (Ck−1 \ {ci, cj}) ∪ {ci ∪ cj};
    calculate the similarity s(ci, cj) for all ci, cj ∈ Ck;
    k ← k + 1;
end
return C0, C1, ..., Cn−1;
The algorithm gets as input a finite set O of n objects and a matrix of
pairwise distances between these objects. This means that the execution of the
clustering algorithm is completely independent of how the distances between
the objects were computed. The algorithm starts with the trivial clustering C0
of n singleton clusters. At each iteration the two clusters ci, cj with the
highest similarity in Ck−1 are found and merged. A new clustering Ck is formed
by removing these two clusters and adding the newly merged cluster, i.e. Ck is
Ck−1 with clusters ci and cj merged. This is continued until there is only one
cluster left. The output of the algorithm is the sequence of clusterings
C0, C1, ..., Cn−1. A centroid, as the representative of a created
group-cluster, is calculated as the average of all objects in the given
cluster.
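A compact Python rendering of Algorithm 2 could look as follows; it is a
sketch under the assumption that sim is a symmetric function on clusters of
rule indices (for instance, the average pairwise similarity of the rules'
conditional parts), and the names are illustrative:

from itertools import combinations

def ahc(n_objects, sim):
    """Return the sequence of clusterings C_0, ..., C_{n-1} built by Algorithm 2."""
    clustering = [frozenset([i]) for i in range(n_objects)]   # trivial C_0
    history = [list(clustering)]
    while len(clustering) > 1:
        # find the pair of clusters ci, cj with maximal similarity in C_{k-1}
        ci, cj = max(combinations(clustering, 2), key=lambda p: sim(p[0], p[1]))
        clustering = [c for c in clustering if c not in (ci, cj)]
        clustering.append(ci | cj)   # C_k = (C_{k-1} \ {ci, cj}) ∪ {ci ∪ cj}
        history.append(list(clustering))
    return history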
Various similarity and distance metrics were analyzed for clustering, because
it is very important to use a proper metric (Ćwik and Koronacki, 2008). An
analysis of
clustering efficiency is presented in (Kaufman and Rousseeuw, 1990). The
results show that only Gower's measure is effective for the different types of
data that occur in composited knowledge bases.
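As an illustration of why Gower's measure fits mixed-type data, a Python
sketch of it might look as follows (attribute handling, missing-value
treatment and all names are our assumptions, not the cited implementation):
numeric attributes are range-scaled, nominal attributes contribute 1 on a
match and 0 otherwise, and the per-attribute scores are averaged.

def gower_similarity(x, y, numeric_ranges):
    """Average per-attribute similarity for two mixed-type vectors x and y."""
    scores = []
    for k, (xk, yk) in enumerate(zip(x, y)):
        if xk is None or yk is None:        # skip attributes a rule does not use
            continue
        if k in numeric_ranges and numeric_ranges[k] > 0:
            scores.append(1.0 - abs(xk - yk) / numeric_ranges[k])  # numeric: range-scaled
        else:
            scores.append(1.0 if xk == yk else 0.0)                # nominal: exact match
    return sum(scores) / len(scores) if scores else 0.0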
3.3 Inference process on hierarchical knowledge base
In decision support systems without rule clusters, the rule interpreter has to
check each rule, one by one, firing those which exactly match the given
observations. This takes O(n) time, for n the number of rules, whereas in
decision support systems with a hierarchical structure of the knowledge base
the time is reduced to O(log2(2n − 1)), since the dendrogram built over n
rules has 2n − 1 nodes. In this situation the rule interpreter does not have
to check each rule one by one. The inference process is based on searching the
binary tree of rule clusters and choosing the nodes with the highest
similarity value. There are various tree searching techniques. We can choose
one of two given methods of searching the tree: the so called "best node in
all tree" method or the "minimal value of coefficient" method. Both are widely
presented in (Nowak et al., 2006a,b).
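Both strategies are detailed in the cited papers; purely to illustrate the
descent idea (and not as the exact "best node in all tree" procedure), a
greedy walk over the RuleNode encoding sketched in Section 3.1 could look
like this, with sim an assumed similarity between a conditional vector and
the set of facts:

def find_relevant_rule(tree, root_idx, facts, sim):
    """Descend from the root into the child more similar to the facts."""
    k = root_idx
    while tree[k].i is not None:       # internal node: compare the two children
        left, right = tree[k].i, tree[k].j
        k = left if sim(tree[left].c, facts) >= sim(tree[right].c, facts) else right
    return k                           # leaf: index of the selected rule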
4 Results of experiments
We performed experiments with 4 datasets ("lymphography", "spect_all",
"lenses" and "flags") from the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/ ). For each data set we use the greedy
algorithm for partial decision rule construction with parameter
α ∈ {0.0, 0.001, 0.01, 0.1, 0.2, 0.5}. Next, for every obtained set of rules
we build a knowledge base and use the AHC algorithm to construct the clusters
of rules which are searched during the inference process. Tables 1–4 present
the results of the inference process for the data sets "lymphography",
"spect_all", "lenses" and "flags". For each experimental set we note the
number of rules (NR), the number of nodes searched in the dendrogram using the
"best node in all tree" method (Nn) and different numbers of input knowledge
items (facts) (Ndata). For these items we checked the number of relevant rules
in the whole knowledge base (Nrr), the number of relevant rules successfully
searched (Nrs), the number of searched rules that are not relevant (Nnrs), the
Precision, and the percent of the whole knowledge base which is really
searched (%KB). Precision is a measure of the probability of searching only
relevant rules during the inference process. It was considered further in
(Nowak and Wakulicz-Deja, 2008a). For us, as in Information Retrieval, a
perfect precision equal to 1.0 means that every result retrieved by a search
was relevant (but says nothing about whether all relevant rules were
retrieved, simply because there is no need to search each relevant rule).
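Under the column definitions above, this notion of precision can be read as
the standard Information Retrieval ratio

Precision = Nrs / (Nrs + Nnrs),

i.e. the fraction of rules examined during the search that turn out to be
relevant; the values reported in Tables 1–4 are consistent with this reading.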
Based on the results presented in Tables 1–4 we can see that the more rules
the knowledge base consists of, the smaller the percent of the knowledge base
that is really searched. Figure 1 presents how the average percent of the
searched knowledge base depends on the number of rules.
Tables 5–8 present the minimal, average and maximal length of rules for the
data sets "lymphography", "spect_all", "lenses" and "flags".
Based on the results presented in Tables 5–8 we can see that as the parameter
α grows, the length of the rules decreases.
Table 1: Results of inference process for lymphography

 NR   Nn   Ndata   Nrr   Nrs   Nnrs   Precision   %KB
148   12     1       7     1     0        1       4.07%
148   12     2       1     1     0        1       4.07%
148   16     3       3     1     0        1       5.42%
148   18     2       2     1     0        1       6.10%
148   20     1      11     1     0        1       6.78%
148   14     2       2     1     0        1       4.75%
148   26     1      11     1     0        1       8.81%
Table 2: Results of inference process for spect_all

 NR   Nn   Ndata   Nrr   Nrs   Nnrs   Precision   %KB
267   16     1      38     1     0        1       3%
267   12     9       6     1     0        1       2.25%
267    4     9       6     0     1        0       0.75%
267   12     9       6     1     0        1       2.25%
267   24     9       1     1     0        1       4.5%
267    8     9       1     1     0        1       1.5%
267   26     1      38     1     0        1       4.87%
267   16     5      19     1     0        1       3%
267   16     2      19     1     0        1       3%
267   16     1       6     1     0        1       3%
267   14     1      34     1     0        1       2.63%
267   20     1      11     1     0        1       3.75%
267    8     2       6     1     0        1       1.5%
Table 3: Results of inference process for lenses

NR   Nn   Ndata   Nrr   Nrs   Nnrs   Precision   %KB
24    8     1      12     1     0        1       17.02%
24   10     3       2     1     0        1       21.27%
24    8     4       1     1     0        1       17.02%
24    8     1      12     1     0        1       17.02%
24    8     3       2     1     0        1       17.02%
24    8     4       1     0     1        0       17.02%
24    8     1      12     1     0        1       17.02%
24    8     3       2     1     0        1       17.02%
24    8     4       1     0     1        0       17.02%
24    8     1      12     1     0        1       17.02%
24    8     3       2     1     0        1       17.02%
24    8     1      12     1     0        1       17.02%
24    8     2       2     1     0        1       17.02%
24    8     1      12     1     0        1       17.02%
Table 4: Results of inference process for flags

 NR   Nn   Ndata   Nrr   Nrs   Nnrs   Precision   %KB
194   22     3       2     1     0        1       5.68%
194   20     1       2     1     0        1       5.16%
194   20     2       1     1     0        1       5.16%
194   22     3       2     1     0        1       5.68%
194   20     1       2     1     0        1       5.16%
194   20     2       1     1     0        1       5.16%
194   14     2       2     1     0        1       3.61%
194   28     1       5     1     0        1       7.23%
194   14     2       3     1     0        1       3.61%
194   12     1       5     1     0        1       3.1%
194   18     1       2     1     0        1       4.65%
194   12     1       5     1     0        1       3.1%
194   18     1       2     1     0        1       4.65%
194   12     1       5     1     0        1       3.1%
194   18     1       2     1     0        1       4.65%
[Figure: plot of min, avg and max % of KB really searched versus the number of
rules (24, 148, 194, 267).]
Figure 1: Dependence between the number of rules and the % of KB really searched
Table 5: Length of partial decision rules for lymphography: 148 rows, 18
conditional attributes

α        min   avg    max
0.0      1.0   2.13   4.0
0.001    1.0   2.13   4.0
0.01     1.0   2.13   4.0
0.1      1.0   1.49   2.0
0.2      1.0   1.06   2.0
Table 6: Length of partial decision rules for spect_all: 267 rows, 23
conditional attributes

α        min   avg    max
0.0      1.0   3.19   10.0
0.001    1.0   3.19   10.0
0.01     1.0   2.92   10.0
0.1      1.0   1.56    7.0
0.2      1.0   1.30    5.0
Table 7: Length of partial decision rules for lenses: 24 rows, 4 conditional
attributes

α        min   avg    max
0.0      1.0   2.13   4.0
0.001    1.0   2.13   4.0
0.01     1.0   2.13   4.0
0.1      1.0   1.92   3.0
0.2      1.0   1.50   2.0
Table 8: Length of partial decision rules for flags: 194 rows, 30 conditional
attributes

α        min   avg    max
0.0      1.0   1.33   3.0
0.001    1.0   1.33   3.0
0.01     1.0   1.15   3.0
0.1      1.0   1.03   2.0
0.2      1.0   1.0    1.0
[Figure: average rule length versus α (0–0.5) for flags, lenses, lymphography
and spect_all.]
Figure 2: Average length of rules
Figure 2 presents the average length of rules for the data sets
"lymphography", "spect_all", "lenses" and "flags". Figure 3 presents the
minimal, average and maximal length of rules for the data set "spect_all".
If we cluster the rules of a given knowledge base before the inference
process, then instead of the whole set of rules only a small percent of the
rules is searched. In large knowledge bases we can significantly reduce the
number of rules relevant to the initial data of the inference. The reduction
in the case of backward reasoning is less significant than in the case of
forward reasoning; it depends on the selected goal and on the need to search
for a solution in a small subset of rules.
[Figure: min, avg and max rule length versus α (0–0.5) for spect_all.]
Figure 3: Length of rules for spect_all
The results of the experiments show that the inference process is successful
if, for a given set of facts, the relevant rules were found and fired without
searching the whole knowledge base. As can be seen in Figure 1, the more rules
the knowledge base consists of, the smaller the part of it that is really
searched. For the dataset with 267 rules it is as little as 1.5% of the whole
knowledge base (Table 2). Different datasets were analyzed, with different
types of data and various numbers of attributes and attribute values.
Different values of the parameter α were also checked. We can see that with
the growth of the parameter α the length of the rules falls off; the smaller
the value of α, the longer the rules. For a high value of α (e.g. α = 0.5) the
rules are shortest, which makes the inference process faster. This is because
a relevant rule has a small number of conditions (for α = 0.5 each rule
consists of only one condition) that must be checked against the set of facts
in the given knowledge base. For longer rules the time of rule checking is
longer. For α = 0 in a dataset like "spect_all" (Figure 3) the maximal length
of a rule is as high as 10. It is obvious that this makes the inference
process slower and less efficient. The best results can be achieved for large
knowledge bases with partial decision rules rather than exact ones, because in
this case only a small percent of the whole knowledge base is searched and the
inference process performed on the searched rules is really fast.
5 Summary
We proposed to optimize inference processes by using a modular representation
of the knowledge base. Thanks to clustering the rules we do not have to search
all rules in the knowledge base but only the most relevant cluster. This
process is based on an efficient method of searching binary trees, called
"the best node in all tree", with time efficiency equal to O(log2 n). Exact
rules can be over-fitted, i.e. depend essentially on the noise. So, instead of
exact rules with many attributes we use partial rules with a smaller number of
attributes. Thus the length of rules influences the efficiency of the
inference process on the knowledge base. Based on the results from (Moshkov et
al., 2008a) we used the greedy algorithm for partial decision rule
construction, because it was proved that, under some natural assumptions on
the class NP, the greedy algorithm is close to the best polynomial approximate
algorithms for minimization of partial decision rule length. It is worth doing
some experiments on larger datasets with more than 20 attributes, to study
whether the number of attributes influences the efficiency of the inference
process. Experiments on different ways of creating clusters of rules (single
linkage, average linkage and complete linkage) would also be interesting. We
also plan to do some experiments on a modification of the AHC algorithm, the
so called mAHC presented in (Nowak et al., 2006a,b). Comparing both methods by
measuring efficiency would be very interesting.
References
Jan Ćwik and Jacek Koronacki (2008), Statistical learning systems [in Polish],
chapter 9, pp. 295–301, EXIT.
Leonard Kaufman and Peter J. Rousseeuw (1990), Finding Groups in Data: An
Introduction to Cluster Analysis, chapter 5, pp. 199–253, John Wiley & Sons.
George F. Luger and William A. Stubblefield (2002), Artificial Intelligence:
Structures and Strategies for Complex Problem Solving, chapter 3, pp. 93–106,
Addison Wesley.
Mikhail Ju. Moshkov, Marcin Piliszczuk, and Beata Zielosko (2008a), On Partial
Covers, Reducts and Decision Rules, Transactions on Rough Sets, 8:251–288.
Mikhail Ju. Moshkov, Marcin Piliszczuk, and Beata Zielosko (2008b), Partial
covers, reducts and decision rules in rough sets: theory and applications,
volume 145 of Studies in Computational Intelligence, chapter 1,
Springer-Verlag Berlin Heidelberg.
Agnieszka Nowak, Roman Simiński, and Alicja Wakulicz-Deja (2006a), Towards
modular representation of knowledge base, Advances in Soft Computing,
pp. 421–428, Springer Verlag.
Agnieszka Nowak, Roman Simiński, and Alicja Wakulicz-Deja (2008), Knowledge
representation for composited knowledge bases, pp. 405–414, EXIT.
Agnieszka Nowak and Alicja Wakulicz-Deja (2008a), The analysis of inference
efficiency in composited knowledge bases [in Polish], pp. 101–108, Uniwersytet
Śląski.
Agnieszka Nowak and Alicja Wakulicz-Deja (2008b), The inference processes on
composited knowledge bases, pp. 415–422, EXIT.
Agnieszka Nowak, Alicja Wakulicz-Deja, and Sebastian Bachliński (2006b),
Optimization of Speech Recognition by Clustering of Phones, Fundamenta
Informaticae, 72:283–293.
Zdzisław Pawlak (1991), Rough Sets – Theoretical Aspects of Reasoning about
Data, chapter 6, Kluwer Academic Publishers.
Beata Zielosko and Marcin Piliszczuk (2008), Greedy algorithm for attribute
reduction, Fundamenta Informaticae, 85:549–561.